Claude Opus 4.7 hit 87.6% on SWE-bench Verified on the day it launched (April 16, 2026), 6.8 points above Opus 4.6 and 2.6 points above the next non-Anthropic model, GPT-5.3-Codex at 85.0%. Anthropic published the number; the Vellum independent breakdown and the TNW launch coverage walk through it. Cursor’s first-party CursorBench moved from 58% on Opus 4.6 to 70% on 4.7, the biggest one-release jump that benchmark has shown. The first 48 hours of community reaction (r/ClaudeAI, r/cursor) tracked the same shape: agent loops feel different, tool errors are visibly down, and the JSON output drift on deep arrays is still there. This is the review that takes the public numbers, the community sentiment, and our 14-task editorial scoring and tells you when Opus 4.7 is the right pick.
Quick answer: if you refactor multi-package TypeScript, run long-horizon agent loops, or process image-heavy workflows, Opus 4.7 is the model to beat right now. If you ship strict-JSON pipelines at 100k runs a day, GPT-5.3-Codex is still tighter and noticeably cheaper. If your workload is web research and source synthesis, Opus 4.6 is actually stronger on that specific task. The rest of this review explains why.
What changed from Opus 4.6
Anthropic optimized hard for three things: agentic coding reliability, vision resolution, and instruction precision. Several other capabilities got worse as a result. This is not a universal upgrade.
Three structural changes matter most for anyone running production workloads:
New tokenizer. Opus 4.7 processes the same input text using 1.0-1.35x more tokens depending on content type. The rate card stays at $5/$25 per million tokens, but your actual spend rises 10-35% for identical prompts. Code-heavy inputs tend toward 1.0x; natural language with unusual formatting goes higher. For teams spending $10,000/month on Opus 4.6 API calls, migrating without prompt optimization could mean $11,000-$13,500 for the same work. Non-Latin scripts (Mandarin, Japanese, Korean, Arabic, Hindi) are 20-35% more token-efficient under the new tokenizer, so multilingual workloads benefit.
Adaptive thinking only. Fixed budget_tokens is gone. Opus 4.7 uses adaptive thinking exclusively, where it judges reasoning depth based on perceived task complexity. You can influence this through effort levels but can no longer specify exact token budgets for reasoning. The xhigh effort level is new and sits between high and max. Claude Code defaults to xhigh for all plans.
More literal instruction-following. The model follows prompts more exactly than 4.6. Anthropic added an explicit warning in the release notes: prompts written for 4.6 may behave unexpectedly with 4.7 because the new model takes everything literally. Where 4.6 interpreted ambiguous instructions generously, 4.7 picks one interpretation and holds it. If your system prompt has vague wording that older versions smoothed over, 4.7 will surface it as a problem. Test production prompts before switching.
Public benchmarks (April 2026)
| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | 84.1% | 80.6% |
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% |
| MCP-Atlas | 77.3% | n/p | 68.1% | 73.9% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1% | 68.5% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% |
| GPQA Diamond | 94.2% | 91.3% | 92.8% | 94.3% |
| Finance Agent | 64.4% | 60.7% | 56.0% | 59.7% |
| CursorBench | 70% | 58% | n/p | n/p |
Sources: Anthropic launch post, Vellum’s third-party breakdown, and the llm-stats Opus 4.7 launch summary.
Two numbers deserve extra attention. SWE-bench Pro at +10.9 points over Opus 4.6 is the strongest signal that the gains are real and not benchmark-specific artifacts. Pro is the harder multi-language variant; its lead generalizes outside the benchmark’s training distribution. The MCP-Atlas score at 77.3% puts Opus 4.7 more than 9 points ahead of GPT-5.4 on multi-tool orchestration. If your agents use a rich tool set and need to coordinate across them, that gap is meaningful. BrowseComp at 79.3% is a genuine regression from 4.6’s 83.7%. For agents doing web research and source synthesis, 4.6 is still the stronger pick.
The agentic persistence fix
The most significant improvement in Opus 4.7 is behavioral, not a benchmark number. Opus 4.6 had a documented issue in long agentic sequences: during repeated tool calls, filesystem operations, or multi-step pipelines, the model would sometimes abandon subtasks partway through. It would declare completion before actually finishing, or drop a branch of reasoning without flagging the failure. Anthropic described this internally as a “persistence deficit” in long-horizon tasks.
Opus 4.7 fixes this substantially. Task abandonment rates in long-horizon agentic workflows dropped roughly 60% compared to 4.6. The model now maintains goal state across tool-call sequences, re-attempts failed steps, and distinguishes between “I cannot do this” and “this specific step failed, try a different approach.” When it does give up, the failure report is more useful, not just “I couldn’t complete the task.”
The improvement appears to come from two sources: additional training on long-horizon agentic trajectories, and a revised system prompt architecture that gives the model better access to its own task state. That tracking no longer eats into useful context. If you have been working around 4.6 persistence issues with custom prompting or manual checkpointing, those workarounds are mostly unnecessary with 4.7.
Vision at 3.75 megapixels
The vision upgrade is larger than the headline number suggests. Opus 4.7 accepts images up to 2,576 pixels on the long edge, producing roughly 3.75 megapixels of visual capacity versus 1.15 megapixels in 4.6. On the visual navigation benchmark without tools, the score jumped from 57.7% to 79.5%, a 22-point improvement that correlates directly with the resolution increase.
Real-world partner data is more striking. XBOW builds autonomous penetration testing tools that rely on reading dense screenshots. Their visual-acuity benchmark went from 54.5% to 98.5% with this change alone. Solve Intelligence, which uses Claude for patent workflows in life sciences, can now read chemical structures directly from images rather than describing them in words. For teams processing technical screenshots, dense PDFs, or scanned documents, the difference between 1.15MP and 3.75MP is the difference between a workflow that works and one that doesn’t.
One cost note: higher-resolution images cost more tokens. If you’re sending images via API and don’t need the extra detail, downsample before sending. There is no API toggle for this; you handle it yourself.
Where it wins: our 14-task editorial scoring
The 14-task scorecard grades models against the same six domains every release. We re-ran the suite at the shipped claude-opus-4-7 endpoint on the Max plan, with the default coding preset, thinking = {"type":"adaptive"} and output_config.effort = "high". Every score below is the median of 5 runs.
| Domain | Opus 4.7 (median) | Week delta vs Opus 4.6 | vs GPT-5.3-Codex |
|---|---|---|---|
| Refactor (cross-package rename) | 9.0 | +0.4 | +0.6 |
| Test-gen (property-based) | 8.4 | +0.2 | -0.3 |
| Debug (deep stack) | 8.8 | +0.6 | +0.4 |
| Agent & tool use (bounded) | 9.1 | +0.9 | +0.5 |
| Structured output (40-prop schema) | 8.2 | flat | -0.8 |
| RAG (retrieval over 1M context) | 8.6 | +0.3 | +0.1 |
The +0.9 jump on agent and tool use mirrors what Anthropic claims: a 14% gain on multi-step workflows with one third the tool errors. On the bounded planner task it triggered the give-up cleanly on 5 of 5 runs, versus 3 of 5 for Opus 4.6. That improvement shows up in CI bills for long agent loops. Vercel engineers noted that 4.7 now performs proof-checks on systems code before writing, catching off-by-one errors and race conditions that previous versions missed. Warp (the AI terminal app) specifically cited a tricky concurrency bug that Opus 4.6 could not crack.
Where it loses: strict JSON at scale
Our structured output task runs 100 adversarial inputs against a 40-property schema. GPT-5.3-Codex with response_format={"type":"json_schema","strict":true} hits 99-100 of 100. Opus 4.7 with the tool-use interface and tool_choice forced hits 94-96. The failures repeat the same three modes every run: an unescaped quote inside a nested string, a stray trailing comma on a deep array, and the case where the model returns the schema instead of an instance of the schema. The first two can be patched by a constrained-decoding path; the third is a prompt-side fix.
The BrowseComp regression matters too. Opus 4.7 scores 79.3% on BrowseComp versus 83.7% for Opus 4.6 and 89.3% for GPT-5.4. Source attribution accuracy dropped, contradiction detection is weaker, and citation specificity is lower. If your agents do major web browsing and synthesis, 4.7 is a step backward from 4.6 for that workload.
If your production pipeline runs 100k+ LLM calls a day and parse errors are a real cost line, keep GPT-5.3-Codex in the strict-JSON slot. For web research workflows, consider staying on Opus 4.6 or routing to GPT-5.4.
New API features
Task budgets (public beta). You give Claude a rough token estimate for a full agentic loop, covering thinking, tool calls, tool results, and final output. The model sees a running countdown and adjusts its work accordingly, prioritizing tasks and finishing gracefully as the budget runs low. This is a soft suggestion, not a hard cap. If you have had Claude spend 80% of its budget on the first 20% of a task and run out before finishing, this addresses that. Combined with auto mode for Max users, the risk of a single session consuming your daily allocation drops substantially.
xhigh effort level. The full effort ladder is now: low > medium > high > xhigh > max. Claude Code defaults to xhigh for all plans. If you use Claude Code and sessions are suddenly slower or token counts are higher after migration, that is why. You can dial back to high if needed.
File-system memory across sessions. Opus 4.7 is measurably better at reading, writing, and reusing notes stored in persistent files across multiple agent sessions. Agents that maintain scratchpads or structured memory stores see improvement without prompt changes. Notion’s AI team said 4.7 is “the first model to pass our implicit-need tests,” meaning it picks up on what you need without you explicitly stating it. For multi-session engineering work, this removes most of the context reconstruction overhead at the start of each session.
/ultrareview in Claude Code. A new command that runs a dedicated review session on your code changes, reading through everything and flagging bugs and design issues. Anthropic is giving Pro and Max Claude Code users three free ultrareviews to try it. The intent is catching deep-state issues, not just style problems.
What changed under the hood
Two migration items from Anthropic’s migration guide matter for cost forecasting:
- Tokenizer change. The same input maps to roughly 1.0-1.35x more tokens. Plan for a token-bill increase of ~15% on average before higher reasoning effort kicks in. Run your actual prompts through the 4.7 tokenizer before migrating production workloads; Anthropic’s 1.0-1.35x range is content-dependent and your real number will vary.
- More thinking on later turns. Opus 4.7 thinks more at higher effort levels in agentic settings. The reliability gain is real (the Pro score and tool-use error rate both confirm it). The output-token cost increase is also real. Tune
output_config.effortandmax_tokenson the second pass, not the first. - Visible reasoning is gone. Developers who built workflows around watching Claude’s thinking process will find it missing. Opus 4.7 changed default handling of reasoning summaries: the model pauses, then produces an answer, with no visible chain of thought. This breaks pipelines that expected 4.6-style reasoning output and reduces debugging signal for agent failures.
Opus 4.7 effort levels: when xhigh is worth it
At effort: "xhigh", our refactor task moves from 9.0 to 9.5 (+5%) at a 4.8x latency increase and a 3.1x cost increase per task. On the deep-debug task, xhigh is flat. On the bounded-planner task, xhigh actually hurts because the model overthinks and blows the step budget.
Rule of thumb: turn xhigh on for one-shot refactors that cross barrel files and for long-context retrieval. Leave it off for agent loops with tight budgets and for structured-output work.
Pricing and economics
| Model | Input $/M | Output $/M | Context | Median cost on our 14-task suite |
|---|---|---|---|---|
| Claude Opus 4.7 | $5.00 | $25.00 | 1M | $0.41/task |
| GPT-5.5 | $5.00 | $30.00 | 1M | $0.46/task |
| GPT-5.4 | $2.50 | $15.00 | 1.05M | $0.27/task |
| GPT-5.3-Codex | $1.75 | $14.00 | 400K | $0.31/task |
| Gemini 3.1 Pro | $2.00 (≤200K) / $4.00 (>200K) | $12.00 (≤200K) / $18.00 (>200K) | 2M | $0.31/task |
Pricing per Anthropic and OpenAI’s GPT-5.4 launch. Opus 4.7 ships a 1M context window, matching GPT-5.5 and trailing Gemini 3.1 Pro’s 2M. The tokenizer change means your effective cost-per-task can run 10-35% higher than the rate card suggests. Prompt caching and batch pricing remain unchanged: up to 90% savings still apply for eligible workloads.
Partner validation data
Third-party production numbers from the launch period are the strongest signal we have outside formal benchmarks:
- Cursor: CursorBench moved from 58% on Opus 4.6 to 70% on 4.7 on 93 real software engineering tasks. Opus 4.7 solved four tasks that neither 4.6 nor Sonnet 4.6 could handle at all.
- Rakuten: 3x more production task resolutions on their own SWE-bench variant. Multiple independent testers reported double-digit gains in the same period.
- XBOW: Their visual-acuity benchmark for autonomous penetration testing went from 54.5% to 98.5% after the vision resolution change.
- Warp: The AI terminal app reported that 4.7 fixed a tricky concurrency bug that Opus 4.6 could not solve.
- Notion: First model to pass their implicit-need tests on context continuity across sessions.
How it compares
| TCC editorial score | Claude Opus 4.7 | GPT-5.5 | GPT-5.3-Codex | GPT-5.4 | Gemini 3.1 Pro | Cursor Composer 2 |
|---|---|---|---|---|---|---|
| Refactor | 9.0 | 8.7 | 8.4 | 8.5 | 7.8 | 8.1 |
| Test-gen | 8.4 | 8.6 | 8.7 | 8.6 | 7.6 | 8.0 |
| Debug | 8.8 | 8.7 | 8.4 | 8.5 | 7.9 | 8.2 |
| Agent & tool use | 9.1 | 9.3 | 8.6 | 8.7 | 7.4 | 8.3 |
| Strict JSON | 8.2 | 8.9 | 9.0 | 8.8 | 7.9 | 8.0 |
| Cost per successful task | $0.41 | $0.46 | $0.21 | $0.27 | $0.31 | included in plan |
What about GPT-5.5?
OpenAI shipped GPT-5.5 on April 23, 2026: the first fully retrained base model since GPT-5, 1M-token context, $5/$30 per M tokens. GPT-5.5 leads Opus 4.7 by 13 points on Terminal-Bench 2.0 (82.7% vs 69.4%) and by 3 points on the Artificial Analysis Intelligence Index (60 vs 57). For agentic terminal automation and long tool-use chains, the slot Opus 4.7 held for two months now belongs to GPT-5.5.
Where Opus 4.7 still wins outright: SWE-Bench Pro at 64.3% vs 58.6% (the largest non-saturated coding gap on the current leaderboard), cross-package TypeScript refactor at 9.0 vs 8.7 on our scorecard, and bounded agent loops where Anthropic’s give-up training keeps step budgets clean. Pricing is identical on input ($5/M); GPT-5.5 charges $30/M output vs Opus’s $25/M, so blended cost-per-task runs ~12% higher. The new mental model: Opus 4.7 for hard refactor and bounded agents, GPT-5.5 for terminal automation and unbounded tool chains, GPT-5.3-Codex for strict JSON at scale.
What about DeepSeek V4 and Claude Mythos Preview?
DeepSeek V4-Pro and V4-Flash shipped April 24, 2026. V4-Pro (1.6T-param MoE, 49B active, Apache 2.0, 1M context) posts ~80.6% on SWE-bench Verified at $1.74/$3.48 per M tokens: the first open-weight model in the same accuracy band as Opus 4.7 for roughly a third of the per-token cost. V4-Flash (284B/13B MoE) at $0.14/$0.28 hits ~78% Verified, which covers the cheap-and-good slot when frontier quality is overkill. Either swaps in via the OpenAI or Anthropic-format APIs with a one-line config change. Opus 4.7 still wins on agent reliability and on the hardest cross-package work; V4-Pro is the open-weight pressure that should cap closed-source price increases for the rest of 2026. Pricing per the DeepSeek API docs.
Claude Mythos Preview (Anthropic, April 7, 2026) is a separate gated model under Project Glasswing that leads SWE-bench Verified at 93.9%. Access is restricted to vetted partners for defensive cybersecurity work. It is not a competitor to Opus 4.7 for general-purpose coding; it is the signal that the next frontier capability is already in lab hands. Opus 4.7 remains Anthropic’s recommended pick for day-to-day engineering.
Should you upgrade from 4.6?
Upgrade now if: you run agentic coding workflows and have hit the 4.6 persistence issues; your workload is primarily English-language coding; you work in non-English languages where the new tokenizer is 20-35% more efficient; or you build multi-step tool-use pipelines that require reliable task completion. Cursor, CodeRabbit, Replit, Warp, Notion, and Rakuten all saw double-digit improvements.
Don’t upgrade yet if: web research, source synthesis, or document analysis is your primary use case (4.6 is stronger on BrowseComp); you’re cost-sensitive on English-language API usage and don’t have routing in place; or visible reasoning output is part of your debugging pipeline.
Check the migration guide first if: you have custom system prompts tuned for 4.6 behavior. The literal instruction-following change can interact unexpectedly with prompts written around 4.6’s interpretive looseness. Prompts with wording like “be concise unless the question requires depth” will behave differently. Test before switching production traffic.
Pros and cons
| Strengths | Weaknesses |
|---|---|
| Leads SWE-bench Pro at 64.3% (+10.9pts over 4.6) | BrowseComp regressed 4.4pts; worse than 4.6 for web research |
| MCP-Atlas 77.3%, 9pts ahead of GPT-5.4 on multi-tool orchestration | Tokenizer inflates real costs 10-35% despite unchanged rate card |
| 3.75MP vision; visual-nav benchmark jumped 22pts | Visible reasoning removed; breaks pipelines that relied on 4.6 reasoning output |
| Task abandonment in long agents down ~60% from 4.6 | JSON output drift on deep nested arrays (-0.8 vs GPT-5.3-Codex) |
| Task budgets and xhigh effort give cost control on long runs | Prose quality regressed; more mechanical, bullet-heavy output |
| File-system memory improvements require no prompt changes | Adaptive-thinking misreads some tasks that look simple but aren’t |
Frequently asked questions
Is Claude Opus 4.7 better than Opus 4.6? For agentic coding and multi-step tool-use, yes, meaningfully so. For web research, source synthesis, and document analysis, 4.6 is actually stronger. The answer depends entirely on your workload.
Why does Opus 4.7 cost more if the per-token price is the same? The new tokenizer processes English text less efficiently than the 4.6 tokenizer. The same prompt will use 12-35% more tokens on average. Since you pay per token, the effective cost per task goes up even though the listed price per token has not changed. Non-English workloads benefit from the opposite effect.
Does Opus 4.7 fix the task abandonment issue from 4.6? Yes, substantially. Task abandonment in long agentic sequences dropped roughly 60%. It is not a complete elimination; edge cases still exist. But the improvement is large enough that most users running long-horizon workflows will notice a real difference without prompt changes.
How does Opus 4.7 compare to GPT-5.3-Codex for JSON output? GPT-5.3-Codex with strict JSON mode hits 99-100/100 on our 40-property schema test. Opus 4.7 hits 94-96/100. For pipelines running 100k+ calls a day where parse errors are a cost line, GPT-5.3-Codex is the right choice. See the full GPT-5.3-Codex review for that workload.
What happened to visible reasoning in Opus 4.7? It’s gone by default. The model pauses, then produces an answer, with no visible chain of thought. If your pipeline read reasoning summaries for debugging or to guide downstream steps, you need to rework that before migrating.
Is the vision upgrade worth it for document workflows? Yes, if you process images above 750px on the long edge. The jump from 1.15MP to 3.75MP is not marginal; XBOW’s real-world visual-acuity benchmark went from 54.5% to 98.5%. If your images are smaller or you don’t parse visual content, the vision upgrade has no impact.
Verdict
Opus 4.7 is the best generally available coding model on the public leaderboards in April 2026, and the +10.9-point jump on SWE-bench Pro is the number that says the gain is real and not a Verified-only artifact. The 60% drop in agentic task abandonment and the 22-point vision jump make this a clear upgrade for engineering teams running production code agents. The regressions on BrowseComp and JSON reliability are real trade-offs, not footnotes.
Put strict-JSON pipelines on GPT-5.3-Codex, keep web research on 4.6 or route to GPT-5.4, and put Opus 4.7 on the hard refactor and the bounded agent loop. Run your actual prompts through the 4.7 tokenizer before you commit a quarter-sized migration, because the 10-35% cost increase is invisible on the rate card and very visible on the bill.
If this is the first review you read on TCC, the 14-task methodology page explains why we do not show a single averaged score on the leaderboard.