Claude Opus 4.7 hit 87.6% on SWE-bench Verified on the day it launched (April 16, 2026), 6.8 points above Opus 4.6 and 2.6 points above the next non-Anthropic model, GPT-5.3-Codex at 85.0%. Anthropic published the number; the Vellum independent breakdown and the TNW launch coverage walk through it. Cursor’s first-party CursorBench moved from 58% on Opus 4.6 to 70% on 4.7, the biggest one-release jump that benchmark has shown. The first 48 hours of community reaction (r/ClaudeAI, r/cursor) tracked the same shape: agent loops feel different, tool errors are visibly down, and the JSON output drift on deep arrays is still there. This is the review that takes the public numbers, the community sentiment, and our 14-task editorial scoring and tells you when Opus 4.7 is the right pick.
Quick answer: if you refactor multi-package TypeScript or run long-horizon agent loops, Opus 4.7 is the model to beat right now. If you ship strict-JSON pipelines at 100k runs a day, GPT-5.3-Codex is still tighter and noticeably cheaper. The rest is the why.
Public benchmarks (April 2026)
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | n/p | 80.6% |
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | n/p | n/p |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% |
| Finance Agent | 64.4% | 60.7% | n/p | 59.7% |
| CursorBench | 70% | 58% | n/p | n/p |
Sources: Anthropic's launch post for the headline numbers, Vellum's third-party breakdown for cross-checks, and the llm-stats Opus 4.7 launch summary for the migration notes; n/p means the vendor had not published a comparable score at the time of writing. Anthropic's memorization screen on SWE-bench is documented; Opus 4.7's lead holds when flagged items are excluded.
The single biggest move is SWE-bench Pro at +10.9 points. Pro is the harder multi-language variant, and the Opus 4.7 lead there is the strongest signal that the SWE-bench Verified gain generalizes outside the benchmark’s training distribution. If you only look at one number, look at the Pro score, not the Verified one.
Where it wins, in our 14-task editorial scoring
The 14-task scorecard is our own; it grades models against the same six domains every release. We re-ran the suite against the shipped claude-opus-4-7 endpoint on the Max plan, with the default coding preset, thinking = {"type":"adaptive"} and output_config.effort = "high" (a minimal sketch of that request shape is below). Every score below is the median of 5 runs; the scoring rubric is on the methodology page.
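For readers who want to reproduce the setup, this is roughly the request shape behind those runs. It is a minimal sketch assuming the Anthropic Python SDK; the adaptive-thinking type and the output_config.effort field follow the launch notes quoted above rather than long-standing SDK documentation, and the prompt is a placeholder, not one of the 14 tasks.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Minimal sketch of the scorecard request shape. The "adaptive" thinking type
# and the output_config.effort field follow the Opus 4.7 launch notes; treat
# the exact field names as assumptions and check the current SDK reference.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    thinking={"type": "adaptive"},
    extra_body={"output_config": {"effort": "high"}},  # default preset for every domain
    messages=[{
        "role": "user",
        "content": "Placeholder prompt: rename parseConfig across packages/* and fix all imports.",
    }],
)

# Print only the final text blocks; thinking blocks are skipped.
for block in response.content:
    if block.type == "text":
        print(block.text)
```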
| Domain | Opus 4.7 (median) | Delta vs Opus 4.6 | Delta vs GPT-5.3-Codex |
|---|---|---|---|
| Refactor (cross-package rename) | 9.0 | +0.4 | +0.6 |
| Test-gen (property-based) | 8.4 | +0.2 | -0.3 |
| Debug (deep stack) | 8.8 | +0.6 | +0.4 |
| Agent & tool use (bounded) | 9.1 | +0.9 | +0.5 |
| Structured output (40-prop schema) | 8.2 | flat | -0.8 |
| RAG (retrieval over 1M context) | 8.6 | +0.3 | +0.1 |
The +0.9 jump on agent and tool use mirrors what Anthropic claims (a 14% gain on multi-step workflows with one third the tool errors). On the bounded planner task it triggered the give-up cleanly on 5 of 5 runs, which is the change that will show up in your CI bill if you run long agent loops. Opus 4.6 hit the give-up on 3 of 5.
Where it loses: strict JSON at scale
Our structured output task is 100 adversarial inputs against a 40-property schema. GPT-5.3-Codex with response_format={"type":"json_schema","strict":true} hits 99-100 of 100. Opus 4.7 with the tool-use interface and tool_choice forced hits 94-96. The failures are the same three modes every run: an unescaped quote inside a nested string, a stray trailing comma on a deep array, and the case where the model returns the schema instead of an instance of the schema. The first two can be patched by a constrained-decoding path; the third is a prompt-side fix and the wording is in the strict-JSON prompt.
If your production pipeline runs 100k+ LLM calls a day and parse errors are a real cost line, keep GPT-5.3-Codex in the strict-JSON slot. Opus 4.7 will do the job; it just needs the guard rails.
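For teams that do want Opus 4.7 in that slot anyway, the guard rails are mostly boilerplate. Below is a minimal sketch assuming the Anthropic Python SDK and the jsonschema package; the three-property schema and the emit_record tool name are illustrative stand-ins for the real 40-property setup, not anything from our suite.

```python
import anthropic
from jsonschema import ValidationError, validate

client = anthropic.Anthropic()

# Illustrative stand-in for the 40-property production schema.
RECORD_SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "score": {"type": "number"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["id", "score", "tags"],
    "additionalProperties": False,
}

def extract_record(document: str, retries: int = 2) -> dict:
    """Force the tool call, validate the result, retry on any of the failure modes above."""
    for _ in range(retries + 1):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            tools=[{
                "name": "emit_record",
                "description": "Return the extracted record as structured data.",
                "input_schema": RECORD_SCHEMA,
            }],
            tool_choice={"type": "tool", "name": "emit_record"},  # forced, as in our task
            messages=[{"role": "user", "content": f"Extract the record from:\n\n{document}"}],
        )
        tool_use = next(b for b in response.content if b.type == "tool_use")
        try:
            # Rejects the schema-echo failure mode (unexpected keys) and shape drift.
            validate(instance=tool_use.input, schema=RECORD_SCHEMA)
            return tool_use.input
        except ValidationError:
            continue  # in production, log the payload before falling through to the retry
    raise RuntimeError("emit_record failed schema validation after retries")
```

The validation step is what catches the schema-instead-of-instance case; the quote and trailing-comma cases are better removed by a constrained-decoding path where one is available, as noted above.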
What changed under the hood
Two migration items from Anthropic’s own migration guide matter for cost forecasting:
- Tokenizer change. The same input now maps to roughly 1.0-1.35x the tokens it did on Opus 4.6. Plan for a token-bill increase of ~15% on average even before higher reasoning effort kicks in.
- More thinking on later turns. Opus 4.7 thinks more at higher effort levels in agentic settings. The reliability gain is real (the Pro score and the tool-use error rate both confirm it), and so is the output-token cost increase. Tune output_config.effort and max_tokens on the second pass; a rough cost sketch follows this list.
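To make the forecasting concrete, here is a back-of-the-envelope sketch combining the two items. Every traffic number and multiplier in it is a placeholder to replace with your own measurements; the tokenizer inflation is set to roughly the midpoint of the range above, and the effort multiplier is an assumption, not a measured figure.

```python
# Hypothetical monthly traffic -- replace with your own numbers.
monthly_input_tokens = 4_000_000_000
monthly_output_tokens = 800_000_000

TOKENIZER_INFLATION = 1.15  # midpoint of the 1.0-1.35x range quoted above
EFFORT_OUTPUT_MULT = 1.25   # assumed extra thinking on later turns; measure, don't trust this

INPUT_PRICE_PER_M = 5.00    # Opus 4.7 list pricing, $/M input tokens
OUTPUT_PRICE_PER_M = 25.00  # $/M output tokens

old_bill = (monthly_input_tokens * INPUT_PRICE_PER_M
            + monthly_output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

new_bill = (monthly_input_tokens * TOKENIZER_INFLATION * INPUT_PRICE_PER_M
            + monthly_output_tokens * EFFORT_OUTPUT_MULT * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"forecast change: {new_bill / old_bill - 1:+.1%}")  # +20.0% with these placeholders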
The recurring r/ClaudeAI threads the week of the launch flag both items in the same shape: people are happy with the agent reliability but are watching the bill. Several engineers on Hacker News checked whether the Max plan was still $200 (it is; the change is the tokenizer, not the seat price).
Opus 4.7 effort levels: when xhigh is worth it
At effort: "xhigh", our refactor task moves from 9.0 to 9.5 (+0.5 points, roughly 6%) at a 4.8x latency increase and a 3.1x cost increase per task. On the deep-debug task, xhigh is flat. On the bounded-planner task, xhigh actually hurts: the model overthinks and blows the step budget.
Rule of thumb: turn xhigh on for one-shot refactors that cross barrel files and for long-context retrieval. Leave it off for agent loops with tight budgets and for structured-output work.
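If you drive effort programmatically, the rule of thumb collapses to a few lines. This is a hypothetical helper of ours, not anything in the SDK, and the task labels are illustrative.

```python
def pick_effort(task_kind: str) -> str:
    """Encode the rule of thumb above: xhigh only where the quality delta pays for itself."""
    if task_kind in {"one-shot-refactor", "long-context-retrieval"}:
        return "xhigh"  # worth the ~4.8x latency and ~3.1x cost on these
    return "high"       # default preset; xhigh is flat on debug and hurts bounded agent loops

# Example: a bounded agent loop stays at the default.
effort = pick_effort("agent-loop")  # -> "high"
```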
Pricing and economics
| Model | Input $/M | Output $/M | Context | Median cost on our 14-task suite |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | $0.41/task |
| GPT-5.5 | $5.00 | $30.00 | 1M | $0.46/task |
| GPT-5.4 | $2.50 | $15.00 | 400K | $0.27/task |
| GPT-5.3-Codex | $1.25 | $10.00 | 400K | $0.21/task |
| Gemini 3.1 Pro | varies | varies | varies | $0.21/task |
Pricing per Anthropic's rate card and OpenAI's launch posts. Opus 4.7 is the most expensive frontier model right now, but it ships a 1M context window where most competitors top out at 400K. If your workload uses that context, the $/successful-task math changes.
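A quick illustration of that math. The per-task costs are the suite medians from the table; the pass rates are invented for the example, not measured, so treat the output as a shape, not a result.

```python
def cost_per_success(cost_per_task: float, pass_rate: float) -> float:
    # Failed runs still get billed, so effective cost scales with 1 / pass_rate.
    return cost_per_task / pass_rate

# Hypothetical long-context workload: the 1M-window model sees the whole repo,
# the 400K-window model works from chunked context and (we assume) fails more often.
print(f"{cost_per_success(0.41, 0.90):.2f}")  # ~0.46  Opus 4.7, assumed 90% pass rate
print(f"{cost_per_success(0.21, 0.70):.2f}")  # ~0.30  cheaper 400K model, assumed 70% pass rate
```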
What the threads are saying
Two patterns dominate the first-week community discussion. The r/ClaudeAI top threads land on agent loops feeling qualitatively different inside an hour of use, especially on Claude Code with Auto mode enabled. The recurring complaint, on both Reddit and the active Anthropic Python SDK issue tracker, is the JSON drift on deep nested structures, which our scorecard measures at -0.8 vs GPT-5.3-Codex.
If you read one thing outside this review, read the Anthropic migration guide on tokenizer changes. The bill goes up before you change anything else.
How it compares
| TCC editorial score | Claude Opus 4.7 | GPT-5.5 | GPT-5.3-Codex | GPT-5.4 | Gemini 3.1 Pro | Cursor Composer 2 |
|---|---|---|---|---|---|---|
| Refactor | 9.0 | 8.7 | 8.4 | 8.5 | 7.8 | 8.1 |
| Test-gen | 8.4 | 8.6 | 8.7 | 8.6 | 7.6 | 8.0 |
| Debug | 8.8 | 8.7 | 8.4 | 8.5 | 7.9 | 8.2 |
| Agent & tool use | 9.1 | 9.3 | 8.6 | 8.7 | 7.4 | 8.3 |
| Strict JSON | 8.2 | 8.9 | 9.0 | 8.8 | 7.9 | 8.0 |
| Cost per successful task | $0.41 | $0.46 | $0.21 | $0.27 | $0.31 | included in plan |
What about GPT-5.5?
OpenAI shipped GPT-5.5 on April 23, 2026: the first fully retrained base model since GPT-4.5, with a 1M-token context window at $5/$30 per M tokens. The headline reframe: GPT-5.5 leads Opus 4.7 by 13.3 points on Terminal-Bench 2.0 (82.7% vs 69.4%) and by 3 points on the Artificial Analysis Intelligence Index (60 vs 57). For agentic terminal automation and long tool-use chains, the slot Opus 4.7 held for barely a week now belongs to GPT-5.5.
Where Opus 4.7 still wins outright: SWE-bench Pro at 64.3% vs 58.6% (the 5.7-point gap is the largest non-saturated coding gap on the leaderboard), cross-package TypeScript refactor on our scorecard at 9.0 vs 8.7, and bounded agent loops where Anthropic’s give-up training keeps step budgets clean. Pricing is identical on input ($5/M); GPT-5.5 charges $30/M output vs Opus’s $25/M, so blended cost-per-task runs ~12% higher. The new mental model: Opus 4.7 for hard refactor and bounded agents, GPT-5.5 for terminal automation and unbounded tool chains, GPT-5.3-Codex for strict JSON at scale.
Verdict
Opus 4.7 is the best generally available coding model on the public leaderboards in April 2026, and the +10.9-point jump on SWE-bench Pro is the number that says the gain is real and not a Verified-only artifact. Put strict-JSON pipelines on GPT-5.3-Codex, keep Opus 4.7 on the hard refactor and the bounded agent loop, and rerun your own evals against the new tokenizer before you sign a quarter-sized commit.
If this is the first review you read on TCC, the 14-task methodology page explains why we do not show a single averaged score on the leaderboard.