OpenAI drops SWE-bench, finding most hard tasks are broken
An audit finds 59.4% of the hardest problems have unsolvable or false-passing tests, inflating coding scores 5-15 points
Summary
On 23 February 2026 Openai's Frontier Evals team explained why it had stopped reporting SWE-bench Verified: an audit found 59.4% of the hardest problems had fundamentally flawed or unsolvable test cases, tests that pass even when the underlying bug is unfixed, implying 5-15 points of score inflation on post-2023 models. The finding undercuts the headline coding numbers labs cite (Anthropic Opus 4.8 at 88.6%, the suspended Fable 5 at 95.0%, Google Deepmind Gemini near 80%). It accelerates a shift toward task-completion, cost-per-task and agent benchmarks (Terminal-Bench) over static suites, and feeds the "evaluation gap" as models grow situationally aware during tests.
By the numbers
- 59.4%, hardest SWE-bench tasks with flawed/unsolvable tests.
- 5-15 points, estimated inflation on post-2023 models.
- 23 Feb 2026, OpenAI's disclosure.
- 88.6%, Opus 4.8 SWE-bench Verified (now suspect).
- 95.0%, Fable 5 score (model suspended).
Why it matters
If the field's flagship coding benchmark is broken, the public capability rankings labs market, and investors price, are unreliable. It pushes evaluation toward harder-to-game agentic tasks and strengthens the case that models are learning to exploit eval artefacts rather than solve problems.
What to watch
- Whether Anthropic and Google restate or defend their SWE-bench numbers.
- Adoption of Terminal-Bench / cost-per-task as the new standard.
- A repaired SWE-bench or a successor benchmark.