# Evals and benchmarks: measuring the edge of AI capability
> Benchmark scores determine which AI models win enterprise contracts and regulatory approval, making the tests as contested as the systems they measure.

**Meta:** type: reference · date: 2026-07-03 · 4 takes · 2 lenses · 2 regions

## What it is

The evals and benchmarks beat tracks how the AI industry, governments, and independent researchers measure what AI systems can actually do. Evaluations ("evals") are structured tests that probe specific capabilities: reasoning, coding, long-horizon task completion, factual knowledge, and resistance to misuse. "Benchmark" is the shorthand for any standardised test suite whose scores frontier labs publish and market.

The beat matters because scores drive decisions. Enterprises use benchmark rankings to pick vendors. Regulators use eval results to set deployment rules. Frontier labs use leaderboard wins to justify valuations. When a number is wrong, or gamed, the downstream decisions built on it are wrong too.

## History

Modern AI benchmarking emerged in 2020 when researchers launched MMLU (Massive Multitask Language Understanding), a 57-subject multiple-choice test covering law, medicine, mathematics, and history. MMLU became the first widely shared yardstick for comparing large language models. Stanford's Center for Research on Foundation Models published HELM in 2022, moving from single-number accuracy to a profile spanning calibration, fairness, toxicity, and efficiency.

The coding-specific track accelerated in 2023 with SWE-bench, which tasks models with resolving real GitHub issues from open-source Python projects. By 2024, SWE-bench Verified scores were the primary claim frontier labs made about their models' coding ability.

ARC Evals, spun out of the Alignment Research Center, was renamed METR in December 2023 and specialised in dangerous-capability evals, testing whether a model could assist with bioweapons synthesis or conduct autonomous cyberattacks.

A parallel preference track emerged through LMSYS Chatbot Arena, a crowdsourced pairwise comparison platform launched by UC Berkeley researchers in 2023. It collected millions of human preference votes and turned them into an Elo ranking; by 2026 its open-source successor, LMArena, was the most consumer-realistic leaderboard in the field.
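How pairwise votes become a ranking can be illustrated with the textbook online Elo update. This is a minimal sketch, not the platform's actual pipeline: the model names, the `k` factor of 32, and the starting rating of 1000 are all illustrative assumptions.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(votes, k: float = 32.0, base: float = 1000.0) -> dict:
    """Apply the online Elo update to (winner, loser) votes, in order.

    `k` and `base` are assumed values chosen for illustration.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = k * (1.0 - e_w)  # a surprising win moves ratings further
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: each tuple records which model a human preferred.
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
]
ratings = elo_ratings(votes)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because each update adds to the winner exactly what it takes from the loser, the total rating pool stays constant. Online Elo is also order-dependent, so the same votes processed in a different sequence can give slightly different ratings.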

The benchmark-saturation crisis broke into the open in February 2026, when OpenAI's Frontier Evals team disclosed that it had stopped reporting SWE-bench Verified scores. An internal audit found that 59.4% of the hardest problems had fundamentally broken test cases (tests that pass even when the underlying bug is unfixed), implying 5 to 15 points of score inflation across post-2023 models. Headline capability claims that Anthropic, Google DeepMind, and OpenAI had built on those scores were undermined overnight.
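A toy example shows what a test that passes with the bug unfixed looks like. Everything here is hypothetical and not drawn from SWE-bench: `parse_port_buggy`, `parse_port_fixed`, and both tests are invented to illustrate the failure mode.

```python
def parse_port_buggy(value: str) -> int:
    """Hypothetical buggy code: accepts out-of-range port numbers."""
    return int(value)


def parse_port_fixed(value: str) -> int:
    """Hypothetical fix: rejects ports outside 1..65535."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def weak_test(fn) -> bool:
    """Broken test: only checks the happy path, so it cannot see the bug."""
    return fn("8080") == 8080


def strong_test(fn) -> bool:
    """Discriminating test: also requires invalid input to be rejected."""
    if fn("8080") != 8080:
        return False
    try:
        fn("70000")
    except ValueError:
        return True
    return False


for name, fn in [("buggy", parse_port_buggy), ("fixed", parse_port_fixed)]:
    print(name, "weak:", weak_test(fn), "strong:", strong_test(fn))
```

A benchmark graded by `weak_test` credits a submission that leaves the bug in place, so every such problem can only push reported scores upward.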

## Current state

As of July 2026, the field is mid-transition. MMLU is functionally saturated above 88% for frontier models, making differences at the top statistically meaningless. SWE-bench Verified is discredited. Terminal-Bench, which tests AI agents on real command-line tasks rather than static patches, emerged in May 2025 and is now the de facto standard for coding-agent evaluation: as noted in the [Coding agents emerge as the main proving ground at the AI frontier](/ko/n/coding-agents-terminal-bench-2026) story, Codex with [GPT-5.5](/ko/n/gpt-55-codex-computer-use-2026) leads at approximately 83.4%, with Claude Code close behind.
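One way to check whether a gap near the top of a saturated benchmark is statistically meaningful is a two-proportion z-test. This is a rough sketch under stated assumptions: the 1,000-question test set and the example accuracies are hypothetical, and the two runs are treated as independent samples.

```python
import math


def accuracy_ci(accuracy: float, n_questions: int, z: float = 1.96):
    """Normal-approximation interval (95% at z=1.96) for an accuracy."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - z * se, accuracy + z * se


def difference_is_significant(
    acc_a: float, acc_b: float, n_questions: int, z: float = 1.96
) -> bool:
    """Two-proportion z-test, assuming the two runs are independent."""
    se_diff = math.sqrt(
        acc_a * (1.0 - acc_a) / n_questions
        + acc_b * (1.0 - acc_b) / n_questions
    )
    return abs(acc_a - acc_b) > z * se_diff


N = 1000  # hypothetical test-set size, not a real benchmark's
print(accuracy_ci(0.88, N))
print(difference_is_significant(0.88, 0.89, N))  # one-point gap
print(difference_is_significant(0.88, 0.93, N))  # five-point gap
```

Under these assumptions a one-point gap falls inside the noise while a five-point gap does not. Models are usually scored on the same questions, so a paired test such as McNemar's would be more sensitive than this independent-sample approximation.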

METR conducts pre-deployment evals for OpenAI, Anthropic, Google DeepMind, and Meta, publishing capability summaries before each major model release. The UK AI Safety Institute runs parallel pre-deployment testing under a voluntary agreement with the same labs. In February 2026, METR launched a pilot to assess misalignment risk from AI agents deployed inside the frontier labs themselves, with participation from Anthropic, Google, Meta, and OpenAI.

## Relationships

The beat intersects the [coding-agents](/ko/n/coding-agents-terminal-bench-2026) and [benchmark-rot](/ko/n/swe-bench-audit-benchmark-rot-2026) stories directly: the unit of evaluation has shifted from bare model scores to tool-plus-model agent performance. When the benchmark changes, which lab appears to lead changes with it.

Labs have a structural incentive to score well on whatever test the industry watches, creating pressure toward benchmark-specific tuning, an instance of Goodhart's Law. Safety organisations such as METR test capabilities that labs have no incentive to advertise, where a high score is a risk signal rather than a marketing claim. Governments are the third actor. The EU AI Act's high-risk provisions require documented capability testing. The UK AI Safety Institute formalises voluntary pre-deployment agreements. The US AI Safety Institute conducts equivalent testing under a memorandum of understanding with frontier labs.

## What to watch

- Whether Terminal-Bench or a successor solidifies as the durable post-SWE-bench coding standard, and whether frontier labs converge on a single shared leaderboard.
- Publication of METR's full results from the February 2026 misalignment pilot, which remained unpublished as of July 2026.
- Whether the EU AI Act's high-risk provisions shift voluntary pre-deployment evals into legal requirements, and whether US legislation follows.
- Whether any open-weight model achieves frontier-level performance across both capability and safety axes, pressuring closed-model labs to make their eval methods public.

## Regional takes (batched by bias / lens)

### official record
- **METR** (United States, en) — METR's published research on autonomous-capability and dangerous-capability evaluations of frontier models, including RE-Bench and pre-deployment engagements with OpenAI, Anthropic, Google DeepMind, and Meta.
  Source: https://metr.org/research/
- **Stanford CRFM HELM** (United States, en) — Stanford's Holistic Evaluation of Language Models framework, covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across dozens of scenarios as a multi-dimensional alternative to single-number scores.
  Source: https://crfm.stanford.edu/helm/
- **UK AI Safety Institute** (United Kingdom, en) — UK AISI's methodology for pre-deployment frontier-model evaluations, covering dangerous-capability uplift testing and the rationale for independent safety testing outside the frontier labs themselves.
  Source: https://www.gov.uk/government/publications/ai-safety-institute-approach-to-evaluations/ai-safety-institute-approach-to-evaluations

### industry analysis
- **MIT Technology Review** (United States, en) — January 2026 analysis of the frontier landscape, including the benchmark saturation crisis and the shift toward agentic evaluation as static scores lose credibility across the industry.
  Source: https://www.technologyreview.com/2026/01/05/1130662/whats-next-for-ai-in-2026/

## Across the graph
- Related: [[coding-agents-terminal-bench-2026]], [[gpt-55-codex-computer-use-2026]], [[swe-bench-audit-benchmark-rot-2026]]

---
Canonical: https://rbtfl.xyz/ko/n/compute-frontier-evals-benchmarks-backgrounder