Coding agents become the frontier's main proving ground

Codex+GPT-5.5 and Claude Code+Opus 4.8 top agent leaderboards as static benchmarks lose credibility

AI · agents · active · How Life Changes · The Long Game · 4 takes · rbtfl · updated Jun 25, 2026

Summary

By mid-2026, coding agents (tool-plus-model combinations) had become the frontier's main proving ground. Codex with GPT-5.5 leads Terminal-Bench (~83.4%); Anthropic's Claude Code with Opus 4.8 (and the suspended Fable 5) sits close behind (~83.1%); and Google's Antigravity agents are entering the field. The shift follows OpenAI's SWE-bench audit: with static benchmarks discredited, evaluation is moving to long-horizon, tool-use and cost-per-task metrics, where performance is a property of the whole agent rather than the bare model. Codex alone now serves 2m+ weekly users.

By the numbers

  • ~83.4%, Codex+GPT-5.5 on Terminal-Bench.
  • ~83.1%, Claude Code+Opus 4.8/Fable 5 on Terminal-Bench.
  • 2m+, weekly Codex users.
  • Long-horizon, tool-use and cost-per-task: the new evaluation axes.

Why it matters

When the unit of capability is an agent that operates real tools and software, the moat shifts from model weights to the harness, sandbox and integration around them. That favours labs with full agent stacks (OpenAI Codex, Anthropic Claude Code, Google DeepMind Antigravity). It also changes how enterprises buy: by task completion and cost, not by headline benchmark scores.
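
To make "task completion and cost" concrete, here is a minimal sketch of how a buyer might score agents on completion rate and cost per completed task. The `TaskRun` record, the `summarize` helper, the agent names and all the numbers are hypothetical placeholders for illustration, not measurements of any real product.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt by an agent at one task (hypothetical record shape)."""
    agent: str
    completed: bool
    cost_usd: float


def summarize(runs):
    """Return {agent: (completion_rate, cost_per_completed_task)}."""
    totals = {}
    for run in runs:
        attempts, successes, spend = totals.get(run.agent, (0, 0, 0.0))
        totals[run.agent] = (
            attempts + 1,
            successes + int(run.completed),
            spend + run.cost_usd,  # failed attempts still cost money
        )

    summary = {}
    for agent, (attempts, successes, spend) in totals.items():
        rate = successes / attempts
        # All spend is charged against the successes; no successes means
        # an unbounded cost per completed task.
        cost_per_success = spend / successes if successes else float("inf")
        summary[agent] = (rate, cost_per_success)
    return summary


# Placeholder data only: illustrative values, not real results.
runs = [
    TaskRun("agent_a", True, 0.50),
    TaskRun("agent_a", False, 0.50),
    TaskRun("agent_a", True, 1.00),
    TaskRun("agent_a", True, 1.00),
    TaskRun("agent_b", True, 0.25),
    TaskRun("agent_b", False, 0.25),
]

result = summarize(runs)
for agent, (rate, cost) in result.items():
    print(f"{agent}: completion {rate:.0%}, ${cost:.2f} per completed task")
```

In this toy data the agent with the higher completion rate is also the more expensive per completed task, which is the kind of trade-off a single benchmark headline hides.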

What to watch

  • Whether gaming-resistant agent benchmarks become the new standard.
  • Safety/error regimes for agents acting across live systems.
  • Whether open-weight agents (DeepSeek, Qwen) close the agent gap.