# OpenAI drops SWE-bench, finding most hard tasks are broken
> An audit finds 59.4% of the hardest problems have unsolvable or false-passing tests, inflating coding scores by 5-15 points

**Meta:** type: story · date: 2026-02-23 · heads: What they don't say, The silent shift · 4 takes · 3 lenses · 2 regions

## Summary

On 23 February 2026, [OpenAI](/fr/entity/openai)'s Frontier Evals team explained why it had stopped reporting SWE-bench Verified. Its audit found that 59.4% of the hardest problems had fundamentally flawed or unsolvable test cases, including tests that pass even when the underlying bug is unfixed, implying 5-15 points of score inflation on post-2023 models. The finding undercuts the headline coding numbers that labs cite: [Anthropic](/fr/entity/anthropic) Opus 4.8 at 88.6%, the suspended Fable 5 at 95.0%, and [Google DeepMind](/fr/entity/google-deepmind) Gemini near 80%. It also accelerates a shift away from static suites toward task-completion, cost-per-task and agent benchmarks such as Terminal-Bench, and feeds the ["evaluation gap"](/fr/n/ai-safety-report-2026) as models grow situationally aware during tests.
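To make "tests that pass even when the underlying bug is unfixed" concrete, here is a toy sketch in Python. It is not OpenAI's audit method, and every name in it (`buggy_median`, `weak_test`, `is_false_passing`, and so on) is hypothetical. It only illustrates why such a test cannot tell a real fix from no fix.

```python
def buggy_median(values):
    # Hypothetical unfixed code: mishandles even-length input.
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def fixed_median(values):
    # Hypothetical correct fix: averages the two middle values when needed.
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def weak_test(median):
    # Only exercises odd-length input, so it never touches the bug.
    return median([3, 1, 2]) == 2


def strict_test(median):
    # Exercises the even-length case where the bug lives.
    return median([4, 1, 3, 2]) == 2.5


def is_false_passing(test, unfixed_impl):
    # A test is false-passing if it succeeds against the unfixed code.
    return test(unfixed_impl)


print(is_false_passing(weak_test, buggy_median))    # weak test accepts the bug
print(is_false_passing(strict_test, buggy_median))  # strict test rejects it
print(strict_test(fixed_median))                    # strict test accepts the fix
```

A benchmark task graded only by something like `weak_test` would credit a model whether or not it fixed anything, which is how flawed tests can inflate scores.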

## By the numbers

- 59.4%, hardest SWE-bench tasks with flawed/unsolvable tests.
- 5-15 points, estimated inflation on post-2023 models.
- 23 Feb 2026, OpenAI's disclosure.
- 88.6%, Opus 4.8 SWE-bench Verified (now suspect).
- 95.0%, Fable 5 score (model suspended).
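As a rough back-of-the-envelope reading of these figures, the sketch below subtracts the 5-15 point inflation estimate from the two exact scores listed above. Applying the estimate uniformly to each individual model is an assumption made here for illustration, not something the audit states, and the helper name `adjusted_range` is hypothetical.

```python
def adjusted_range(reported, low=5.0, high=15.0):
    # Assumption: the 5-15 point estimate applies to this model's score as-is.
    return (round(reported - high, 1), round(reported - low, 1))


# Exact reported scores from the list above; Gemini's "near 80%" is omitted
# because no precise figure is given.
reported_scores = {"Opus 4.8": 88.6, "Fable 5": 95.0}

for model, score in reported_scores.items():
    floor, ceiling = adjusted_range(score)
    print(f"{model}: reported {score}%, adjusted {floor}%-{ceiling}%")
```

Under that assumption, Opus 4.8's 88.6% would fall to between 73.6% and 83.6%, and Fable 5's 95.0% to between 80.0% and 90.0%.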

## Why it matters

If the field's flagship coding benchmark is broken, the public capability rankings that labs market and investors price in are unreliable. The audit pushes evaluation toward harder-to-game agentic tasks and strengthens the case that models are learning to exploit eval artefacts rather than solve problems.

## What to watch

- Whether Anthropic and Google restate or defend their SWE-bench numbers.
- Adoption of Terminal-Bench / cost-per-task as the new standard.
- A repaired SWE-bench or a successor benchmark.

## Regional takes (batched by bias / lens)

### unlabelled
- **SWE-bench Verified explainer (DemandSphere)** (United States, en) — Reference page on what SWE-bench Verified measures and how frontier models are scored against it, the benchmark whose validity OpenAI's audit calls into question.
  Source: https://www.demandsphere.com/research/demandsphere-radar/ai-frontier-model-tracker/benchmarks/swe-bench/
- **decodethefuture** (Global, en)
  Source: https://decodethefuture.org/en/ai-agent-benchmarks-2026/

### ML research
- **MarkTechPost** (United States, en) — Benchmark-driven ranking of coding agents that surfaces OpenAI's Frontier Evals finding: 59.4% of the hardest SWE-bench tasks had tests passing even with the bug unfixed, implying 5-15 point inflation on post-2023 models, prompting OpenAI to stop reporting the score.
  > "OpenAI found 59.4% of the hardest SWE-bench tasks had tests that pass even when the underlying bug is unfixed."
  Source: https://www.marktechpost.com/2026/05/15/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field/

### developer leaderboard
- **Morph LLM (coding leaderboard)** (United States, en) — Maintains a SWE-bench Pro / cost-per-task leaderboard; documents the limits of static coding benchmarks and the shift toward task- and cost-based scoring as raw SWE-bench loses credibility.
  > "Claude Opus 4.8 scores 88.6% SWE-bench Verified and is the practical pick; benchmark inflation makes raw scores hard to trust."
  Source: https://www.morphllm.com/best-ai-model-for-coding

## Across the graph
- Related: [[gpt-55-codex-computer-use-2026]], [[ai-safety-report-2026]], [[gemini-35-flash-agents-2026]]
- Entities: OpenAI, Anthropic, Google DeepMind

---
Canonical: https://rbtfl.xyz/fr/n/swe-bench-audit-benchmark-rot-2026