# Coding agents become the frontier's main proving ground
> Codex+GPT-5.5 and Claude Code+Opus 4.8 top agent leaderboards as static benchmarks lose credibility

**Meta:** type: story · date: 2026-05-15 · heads: How life changes, The long game · 4 takes · 3 lenses · 2 regions

## Summary

By mid-2026, coding agents (tool-plus-model combinations) had become the frontier's main proving ground. Codex with [GPT-5.5](/es/entity/openai) leads Terminal-Bench (~83.4%); [Anthropic](/es/entity/anthropic)'s Claude Code with Opus 4.8 (and the suspended Fable 5) sits close behind (~83.1%); and [Google's](/es/entity/google-deepmind) [Antigravity](/es/n/gemini-35-flash-agents-2026) agents are entering the field. The shift follows [OpenAI's SWE-bench audit](/es/n/swe-bench-audit-benchmark-rot-2026): with static benchmarks discredited, evaluation is moving to long-horizon, tool-use and cost-per-task metrics, where performance is a property of the whole agent, not the bare model. Codex alone now serves 2m+ weekly users.

## By the numbers

- ~83.4%: Codex+GPT-5.5 on Terminal-Bench.
- ~83.1%: Claude Code+Opus 4.8/Fable 5 on Terminal-Bench.
- 2m+: weekly Codex users.
- Long-horizon, tool-use, cost: the new evaluation axes.

## Why it matters

When the unit of capability is an agent that operates real tools and software, the moat shifts from model weights to the harness, sandbox and integration around them. That favours labs with full agent stacks ([OpenAI](/es/entity/openai) Codex, [Anthropic](/es/entity/anthropic) Claude Code, [Google DeepMind](/es/entity/google-deepmind) Antigravity). It also changes how enterprises buy: by task completion and cost, not by headline benchmark scores.
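To make "task completion and cost" concrete, here is a minimal Python sketch of how a buyer might compare agents on those two axes. It is an illustration under stated assumptions, not any leaderboard's actual methodology: the `TaskRun` record, the `summarize` helper, the agent names and all figures are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRun:
    """One attempt by an agent (tool + model) at one task. Hypothetical schema."""
    agent: str
    task_id: str
    completed: bool
    cost_usd: float


def summarize(runs: list[TaskRun]) -> dict[str, dict[str, float]]:
    """Aggregate per-agent completion rate and cost per completed task."""
    totals: dict[str, dict[str, float]] = {}
    for run in runs:
        t = totals.setdefault(run.agent, {"runs": 0, "completed": 0, "cost": 0.0})
        t["runs"] += 1
        t["completed"] += int(run.completed)
        t["cost"] += run.cost_usd  # failed attempts still count toward spend

    summary: dict[str, dict[str, float]] = {}
    for agent, t in totals.items():
        summary[agent] = {
            "completion_rate": t["completed"] / t["runs"],
            # An agent that never finishes a task has unbounded cost per success.
            "cost_per_completed_task": (
                t["cost"] / t["completed"] if t["completed"] else float("inf")
            ),
        }
    return summary


# Made-up placeholder records, not measurements of any real product.
runs = [
    TaskRun("agent_a", "t1", True, 0.50),
    TaskRun("agent_a", "t2", True, 0.50),
    TaskRun("agent_a", "t3", False, 1.00),
    TaskRun("agent_b", "t1", True, 0.25),
    TaskRun("agent_b", "t2", False, 0.25),
    TaskRun("agent_b", "t3", False, 0.25),
]

summary = summarize(runs)
for agent, stats in sorted(summary.items()):
    print(agent, stats)
```

In this toy data, `agent_b` is cheaper per completed task but finishes fewer tasks, which is the kind of trade-off a single headline score hides.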

## What to watch

- Agent benchmarks resistant to gaming as the new standard.
- Safety/error regimes for agents acting across live systems.
- Whether open-weight agents (DeepSeek, Qwen) close the agent gap.

## Regional takes (batched by bias / lens)

### unlabelled
- **Morph LLM (coding-agents leaderboard)** (United States, en) — Scored leaderboard of AI coding agents (tool + model combinations) for June 2026, with Terminal-Bench and cost-per-task figures, the kind of agent-level evaluation displacing static model benchmarks.
  Source: https://www.morphllm.com/best-ai-coding-agents-2026
- **Morph LLM (coding models)** (United States, en)
  Source: https://www.morphllm.com/best-ai-model-for-coding

### ML research
- **MarkTechPost** (United States, en) — Ranks software-development agents by benchmark, putting Codex+GPT-5.5 atop Terminal-Bench (~83.4%) with Claude Code+Fable 5 close behind (~83.1%), and stresses that agent performance is a tool-plus-model property, not a raw model score.
  > "Codex + GPT-5.5 leads Terminal-Bench at 83.4%; Claude Code + Fable 5 is 83.1%."
  Source: https://www.marktechpost.com/2026/05/15/best-ai-agents-for-software-development-ranked-a-benchmark-driven-look-at-the-current-field/

### developer
- **decodethefuture** (Global, en) — Surveys the six AI-agent benchmarks that matter in 2026, arguing long-horizon, tool-use and cost metrics now define capability more than single-shot accuracy.
  > "Six tests now matter for agents, long-horizon, tool-use and cost, not single-shot accuracy."
  Source: https://decodethefuture.org/en/ai-agent-benchmarks-2026/

## Across the graph
- Related: [[gpt-55-codex-computer-use-2026]], [[gemini-35-flash-agents-2026]], [[swe-bench-audit-benchmark-rot-2026]]
- Entities: OpenAI, Anthropic, Google DeepMind

---
Canonical: https://rbtfl.xyz/es/n/coding-agents-terminal-bench-2026