GPT-5.3-Codex sits at 85.0% on SWE-bench Verified and 88.0% on Aider Polyglot in April 2026, the two best public coding numbers any OpenAI model has posted to date. The gap to Claude Opus 4.7 on Verified (87.6%) is 2.6 points; the gap to Opus 4.7 on the harder SWE-bench Pro is wider (~7 points). On strict-JSON output the relationship inverts: GPT-5.3-Codex with response_format={"type":"json_schema","strict":true} hits 99-100 of 100 on a 40-property schema where Opus 4.7 hits 94-96. The recurring r/ChatGPTCoding threads on the launch (search “GPT-5.3-Codex” on the subreddit) cluster on the same takeaway: it is the cheapest frontier model whose strict mode actually behaves strictly. This is the review.
Quick answer: if your production pipeline runs structured outputs at scale or your team is already on the OpenAI stack, GPT-5.3-Codex is the strongest cost-adjusted coding model right now. If you do cross-package refactors or run long agent loops with deep tool chains, Claude Opus 4.7 still leads.
Public benchmarks (April 2026)
| Benchmark | GPT-5.3-Codex | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Verified | 85.0% | n/p | 87.6% | 80.6% |
| SWE-bench Pro | 56.4% (5.2-Codex) | 57.7% | 64.3% | 54.2% |
| Aider Polyglot (high) | 88.0% (GPT-5) | n/p | n/p (Opus 4 was 72.0%) | 83.1% (2.5 Pro) |
| Terminal-Bench 2.0 | 64.0% (5.2-Codex) | n/p | 69.4% | n/p |
Sources: Aider polyglot leaderboard, the TokenMix April 2026 SWE-bench summary, and DigitalApplied’s GPT-5 series guide. n/p = not published. Aider’s leaderboard tests 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust; it is the most language-diverse public coding benchmark and the one that maps closest to what teams actually edit.
Where it wins, in our 14-task editorial scoring
The 14-task scorecard graded GPT-5.3-Codex via the OpenAI Responses API at the shipped endpoint with reasoning effort set to high. Median of 5 runs.
| Domain | GPT-5.3-Codex | vs Opus 4.7 | vs GPT-5.2-Codex |
|---|---|---|---|
| Refactor | 8.4 | -0.6 | +0.3 |
| Test-gen | 8.7 | +0.3 | +0.4 |
| Debug | 8.4 | -0.4 | +0.2 |
| Agent & tool use | 8.6 | -0.5 | +0.5 |
| Structured output (40-prop schema) | 9.0 | +0.8 | +0.4 |
| RAG (long context) | 8.5 | -0.1 | +0.3 |
The 9.0 on structured output is the headline. GPT-5.3-Codex with strict mode and a typed Pydantic response schema posts 100 of 100 on our adversarial set across 500 runs. The full breakdown of why is in the strict-JSON prompt post. The short version: "strict": true on the OpenAI side compiles the JSON Schema into a constrained-decoding grammar at inference time, so the unescaped-quote and trailing-comma classes of failures cannot happen by construction.
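For readers who want the shape of that setup without clicking through, here is a minimal sketch using the openai Python SDK's responses.parse helper. The Invoice schema and the model slug are illustrative stand-ins, not the actual 40-property harness:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Illustrative stand-in for the 40-property schema used in the eval.
class LineItem(BaseModel):
    sku: str
    quantity: int
    unit_price_cents: int

class Invoice(BaseModel):
    invoice_id: str
    currency: str
    line_items: list[LineItem]
    total_cents: int

resp = client.responses.parse(
    model="gpt-5.3-codex",  # model slug as named in this review
    input=[
        {"role": "system", "content": "Extract the invoice as structured data."},
        {"role": "user", "content": "INV-0042, 2x SKU-881 at $12.50 each, total $25.00 USD"},
    ],
    text_format=Invoice,  # the SDK derives a strict JSON Schema from the Pydantic model
)

invoice = resp.output_parsed  # an Invoice instance; constrained decoding guarantees valid JSON
```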
Where it loses
Cross-package refactors. On a multi-package TypeScript monorepo with re-exported types behind barrel files, GPT-5.3-Codex misses one or two of the indirected call sites that Opus 4.7 catches. The pattern is the same one the refactor legacy TypeScript guide walks through in detail. If your daily work is one-off feature code, you will not feel this. If you ship a 14-call-site rename twice a month, you will.
The other miss is long-horizon agent planning with tool budgets. GPT-5.3-Codex respects a 5-step budget more often than not, but Opus 4.7 hits the budget cleanly on 5 of 5 in our suite where Codex hits it on 4 of 5. The bounded planner prompt gets it to parity, but you have to apply the prompt; Opus 4.7 does not need it.
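To make the budget concrete, here is a minimal harness-side sketch of a 5-step tool cap. It uses the chat-completions tool-calling shape rather than the Responses API we scored with, and run_tests plus the single-tool setup are illustrative stubs, not part of the suite:

```python
from openai import OpenAI

client = OpenAI()

MAX_TOOL_STEPS = 5  # the budget the bounded planner prompt asks the model to respect

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the summary line.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]

def run_tests() -> str:
    return "4 passed, 1 failed"  # stub; replace with a real test runner

messages = [
    {"role": "system", "content": f"You may call tools at most {MAX_TOOL_STEPS} times."},
    {"role": "user", "content": "Fix the failing date-parsing test."},
]

for _ in range(MAX_TOOL_STEPS):  # the harness enforces the budget even if the model ignores it
    msg = client.chat.completions.create(
        model="gpt-5.3-codex", messages=messages, tools=TOOLS
    ).choices[0].message
    if not msg.tool_calls:  # model answered in prose: planning is done
        break
    messages.append(msg)
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tests(),  # dispatch on call.function.name when there are real tools
        })
```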
Pricing and economics
| Model | Input $/M | Output $/M | Cached input $/M | Context |
|---|---|---|---|---|
| GPT-5-Codex | $1.25 | $10.00 | $0.125 | 400K |
| GPT-5.2-Codex | $1.75 | $14.00 | $0.175 | 400K in / 128K out |
| GPT-5.4 | $2.50 | $15.00 | $0.25 | 400K in / 128K out |
| Claude Opus 4.7 | $5.00 | $25.00 | n/a | 1M |
Pricing per the OpenAI models reference and the GPT-5.4 launch post. Cached-input pricing at $0.125/M is the lever most teams underuse. If your workload reuses long system prompts (most agent loops do), a high cache hit ratio drops the effective input cost by 8-10x.
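The arithmetic, using the GPT-5-Codex row above; note that the full 8-10x only shows up when nearly all of the prompt is cached, i.e. a long shared prefix with a short variable tail:

```python
# Blended input cost per million tokens at a given prompt-cache hit ratio,
# using the GPT-5-Codex row: $1.25/M fresh, $0.125/M cached.
def blended_input_cost(hit_ratio: float, fresh: float = 1.25, cached: float = 0.125) -> float:
    return hit_ratio * cached + (1 - hit_ratio) * fresh

for hit in (0.80, 0.90, 0.98):
    cost = blended_input_cost(hit)
    print(f"hit={hit:.0%}  ~${cost:.2f}/M  (~{1.25 / cost:.1f}x cheaper)")
# roughly $0.35/M at 80% (3.6x), $0.24/M at 90% (5.3x), $0.15/M at 98% (8.5x)
```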
OpenAI also ships Batch and Flex pricing at half the standard rate, and Priority at 2x. For overnight evals or large-corpus structured-output runs, Batch is the cheapest frontier-quality option on the market today.
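For the overnight structured-output case, the flow is two calls once the requests are serialized to JSONL. A minimal sketch, assuming the standard Batch endpoints and a pre-built requests.jsonl:

```python
from openai import OpenAI

client = OpenAI()

# Each line of requests.jsonl is one request, e.g.:
# {"custom_id": "doc-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-5.3-codex", "messages": [...], "response_format": {...}}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batch requests settle within 24h at half the standard rate
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```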
What the threads are saying
Two consistent themes on r/ChatGPTCoding and the OpenAI Python SDK GitHub issues:
- Strict mode finally works. The pre-5 era complaints about JSON Schema strict mode silently dropping fields are gone. The same threads now warn against the one remaining trap: nested oneOf with discriminator fields (sketched after this list) still occasionally produces empty branches. Workaround in the strict-JSON prompt.
- Codex is better than 5.4 for code. The recurring “5.4 vs 5.3-Codex for daily coding” thread on r/ChatGPTCoding lands on the same answer in dozens of replies: 5.3-Codex is tuned for diff-edit and tool use; 5.4 is tuned for reasoning. For an agentic coding session, take Codex.
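The trap in the first bullet, sketched with an illustrative Pydantic discriminated union (the tool names are made up): the generated schema nests oneOf branches behind a discriminator on kind, which is exactly the shape the threads flag.

```python
from typing import Literal, Union
from pydantic import BaseModel, Field

class GitTool(BaseModel):
    kind: Literal["git"]
    command: str

class ShellTool(BaseModel):
    kind: Literal["shell"]
    script: str

class Action(BaseModel):
    # Pydantic emits a oneOf with a discriminator on "kind" for this field;
    # that nested shape is the one reported to occasionally come back with empty branches.
    step: Union[GitTool, ShellTool] = Field(discriminator="kind")
```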
How it compares
| TCC editorial score | GPT-5.3-Codex | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Refactor | 8.4 | 9.0 | 8.5 | 7.8 |
| Test-gen | 8.7 | 8.4 | 8.6 | 7.6 |
| Debug | 8.4 | 8.8 | 8.5 | 7.9 |
| Agent & tool use | 8.6 | 9.1 | 8.7 | 7.4 |
| Strict JSON | 9.0 | 8.2 | 8.8 | 7.9 |
| Cost per successful task | $0.21 | $0.41 | $0.27 | $0.21 |
Verdict
GPT-5.3-Codex is the cheapest frontier-quality coding model in April 2026 and the top pick when structured output is in your critical path. It is the second-best refactor model on our scorecard and the best diff-edit model on Aider Polyglot. Pair it with Opus 4.7 on the hardest cross-package work, route the structured-output traffic to it via Batch, and the cost-per-successful-task drops below every alternative on the table.
For the side-by-side methodology, see the 14-task scorecard. For the strict-JSON setup that posts 100 of 100, see the strict-JSON prompt.