OpenAI drops SWE-bench, finding most hard tasks are broken

Summary

On 23 February 2026, OpenAI's Frontier Evals team explained why it had stopped reporting SWE-bench Verified: an audit found that 59.4% of the hardest problems had fundamentally flawed or unsolvable test cases (tests that pass even when the underlying bug is unfixed). That implies 5-15 points of score inflation on post-2023 models. The finding undercuts the headline coding numbers labs cite (Anthropic's Opus 4.8 at 88.6%, the suspended Fable 5 at 95.0%, Google DeepMind's Gemini near 80%). It also accelerates a shift away from static suites toward task-completion, cost-per-task and agent benchmarks (such as Terminal-Bench), and it feeds concern about the "evaluation gap" as models grow situationally aware during tests.

By the numbers

59.4%: hardest SWE-bench tasks with flawed/unsolvable tests.
5-15 points: estimated inflation on post-2023 models.
23 Feb 2026: OpenAI's disclosure.
88.6%: Opus 4.8 SWE-bench Verified (now suspect).
95.0%: Fable 5 score (model suspended).

Why it matters

If the field's flagship coding benchmark is broken, the public capability rankings that labs market, and that investors price in, are unreliable. The finding pushes evaluation toward harder-to-game agentic tasks and strengthens the case that models are learning to exploit eval artefacts rather than solve problems.

What to watch

Whether Anthropic and Google restate or defend their SWE-bench numbers.
Adoption of Terminal-Bench / cost-per-task as the new standard.
A repaired SWE-bench or a successor benchmark.

Regional perspectives · 2

▸ ML research

MarkTechPost · United States · en · 15 May 2026

Benchmark-driven ranking of coding agents that surfaces OpenAI's Frontier Evals finding: 59.4% of the hardest SWE-bench tasks had tests passing even with the bug unfixed, implying 5-15 point inflation on post-2023 models, prompting OpenAI to stop reporting the score.

“OpenAI found 59.4% of the hardest SWE-bench tasks had tests that pass even when the underlying bug is unfixed.”

Source ↗

▸ developer leaderboard

Morph LLM (coding leaderboard) · United States · en

Maintains a SWE-bench Pro / cost-per-task leaderboard; documents the limits of static coding benchmarks and the shift toward task- and cost-based scoring as raw SWE-bench loses credibility.

“Claude Opus 4.8 scores 88.6% SWE-bench Verified and is the practical pick; benchmark inflation makes raw scores hard to trust.”

Source ↗