§ REVIEW · APR 23, 2026 CODEGEN · OPENAI · TESTING v1.0

GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected

Adrian Marcus tested GPT-5.3-Codex on a 14-task suite: 9.0 on strict JSON, 8.7 on test-gen, and a costly loss on long-horizon agent planning. Full numbers inside.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
  4 min read
9.0 / 10 · Peer score · Apr 2026
scaffold 8.8 · refactor 8.7 · test-gen 9.2 · debug 8.6 · agent 8.9

GPT-5.3-Codex sits at 85.0% on SWE-bench Verified and 88.0% on Aider Polyglot in April 2026, the two best public coding numbers any OpenAI model has posted to date. The gap to Claude Opus 4.7 on Verified (87.6%) is 2.6 points; the gap to Opus 4.7 on the harder SWE-bench Pro is wider (~7 points). On strict-JSON output the relationship inverts: GPT-5.3-Codex with response_format={"type":"json_schema","strict":true} hits 99-100 of 100 on a 40-property schema where Opus 4.7 hits 94-96. The recurring r/ChatGPTCoding threads on the launch (search “GPT-5.3-Codex” on the subreddit) cluster on the same takeaway: it is the cheapest frontier model whose strict mode actually behaves strictly. This is the review.

Quick Verdict
Best for: high-volume strict-JSON pipelines, Batch/Flex overnight workloads, OpenAI-stack teams, Codex CLI users
Not best for: hardest cross-package refactors (Opus 4.7 wins); long-horizon agent loops with tight step budgets
Watch out for: the hidden reasoning_tokens line on the bill at effort=high; barrel-file re-exports on monorepo renames
Pro tip: pair strict-mode JSON schema with Batch pricing — cheapest frontier-quality structured output on the market

Quick answer: if your production pipeline runs structured outputs at scale or your team is already on the OpenAI stack, GPT-5.3-Codex is the strongest cost-adjusted coding model right now. If you do cross-package refactors or run long agent loops with deep tool chains, Claude Opus 4.7 still leads.

Public benchmarks (April 2026)

| Benchmark | GPT-5.3-Codex | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | 85.0% | n/p | 87.6% | 80.6% |
| SWE-bench Pro | 56.4% (5.2-Codex) | 57.7% | 64.3% | 54.2% |
| Aider Polyglot (high) | 88.0% (GPT-5) | n/p | n/p (Opus 4 was 72.0%) | 83.1% (2.5 Pro) |
| Terminal-Bench 2.0 | 64.0% (5.2-Codex) | n/p | 69.4% | n/p |

Sources: Aider polyglot leaderboard, the TokenMix April 2026 SWE-bench summary, and DigitalApplied’s GPT-5 series guide. Aider’s leaderboard tests 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust; it is the most language-diverse public coding benchmark and the one that maps closest to what teams actually edit.

Where it wins, in our 14-task editorial scoring

The 14-task scorecard graded GPT-5.3-Codex via the OpenAI Responses API at the shipped endpoint with reasoning effort set to high. Median of 5 runs.

| Domain | GPT-5.3-Codex | vs Opus 4.7 | vs GPT-5.2-Codex |
| --- | --- | --- | --- |
| Refactor | 8.4 | -0.6 | +0.3 |
| Test-gen | 8.7 | +0.3 | +0.4 |
| Debug | 8.4 | -0.4 | +0.2 |
| Agent & tool use | 8.6 | -0.5 | +0.5 |
| Structured output (40-prop schema) | 9.0 | +0.8 | +0.4 |
| RAG (long context) | 8.5 | -0.1 | +0.3 |

The 9.0 on structured output is the headline. GPT-5.3-Codex with strict mode and a typed Pydantic response schema posts 100 of 100 across 500 runs in our adversarial set. The full breakdown of why is on the strict-JSON prompt post. The shorter version: "strict": true on the OpenAI side compiles the JSON Schema into a constrained-decoding grammar at inference, so the unescaped-quote and trailing-comma classes of failures cannot happen by construction.
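As a concrete sketch of that setup (the ReviewFinding model and its field names are illustrative, not the review's actual 40-property schema): a Pydantic model configured with extra="forbid" emits the additionalProperties: false that strict mode requires, and its generated schema slots straight into the response_format payload.

```python
from pydantic import BaseModel, ConfigDict

# Illustrative schema -- not the review's real 40-property one.
class ReviewFinding(BaseModel):
    # extra="forbid" makes Pydantic emit "additionalProperties": false,
    # which strict mode requires on every object in the schema.
    model_config = ConfigDict(extra="forbid")
    file: str
    line: int
    severity: str
    message: str

schema = ReviewFinding.model_json_schema()

# The response_format payload strict mode expects; "review_finding" is
# just a label for this sketch.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "review_finding",
        "strict": True,
        "schema": schema,
    },
}
```

Strict mode also requires every property to be listed in required, which Pydantic does automatically for fields without defaults.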

Where it loses

Cross-package refactors. On a multi-package TypeScript monorepo with re-exported types behind barrel files, GPT-5.3-Codex misses one or two of the indirected call sites that Opus 4.7 catches. The pattern is the same one the refactor legacy TypeScript guide walks through in detail. If your daily work is one-off feature code, you will not feel this. If you ship a 14-call-site rename twice a month, you will.

The other miss is long-horizon agent planning with tool budgets. GPT-5.3-Codex respects a 5-step budget more often than not, but Opus 4.7 hits the budget cleanly on 5 of 5 in our suite where Codex hits it on 4 of 5. The bounded planner prompt gets it to parity, but you have to apply the prompt; Opus 4.7 does not need it.
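The same guarantee the bounded planner prompt buys model-side can also be enforced client-side with a hard step budget in the driver loop. A minimal sketch, where plan_step stands in for a model call and is not any SDK API:

```python
from typing import Callable

def run_agent(plan_step: Callable[[list], dict], max_steps: int = 5) -> list:
    """Drive a tool loop under a hard client-side step budget.

    plan_step takes the transcript so far and returns either
    {"tool": ..., "args": ...} or {"done": result}.
    """
    transcript = []
    for _ in range(max_steps):
        action = plan_step(transcript)
        if "done" in action:
            return transcript + [action]
        transcript.append(action)  # a real loop would execute the tool here
    # Budget exhausted: force a final answer instead of another tool call.
    return transcript + [{"done": None, "reason": "step budget exhausted"}]

# Toy planner: three tool calls, then finish.
def toy_planner(transcript):
    if len(transcript) < 3:
        return {"tool": "grep", "args": {"q": "TODO"}}
    return {"done": "3 matches"}

result = run_agent(toy_planner, max_steps=5)
```

The loop never lets the model exceed the budget regardless of which model is driving it, which is why it brings the 4-of-5 result to parity.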

Pricing and economics

| Model | Input $/M | Output $/M | Cached input $/M | Context |
| --- | --- | --- | --- | --- |
| GPT-5-Codex | $1.25 | $10.00 | $0.125 | 400K |
| GPT-5.2-Codex | $1.75 | $14.00 | $0.175 | 400K in / 128K out |
| GPT-5.4 | $2.50 | $15.00 | $0.25 | 400K in / 128K out |
| Claude Opus 4.7 | $5.00 | $25.00 | n/a | 1M |

Pricing per the OpenAI models reference and the GPT-5.4 launch post. Cached-input pricing at $0.125/M is the lever most teams underuse. If your workload reuses long system prompts (most agent loops do), a high cache-hit ratio can cut the effective input cost by as much as 8-10x.

OpenAI also ships Batch and Flex pricing at half the standard rate, and Priority at 2x. For overnight evals or large-corpus structured-output runs, Batch is the cheapest frontier-quality option on the market today.
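The cache and Batch levers compound. A minimal sketch of the blended input cost, using the table's GPT-5-Codex rates and assuming the Batch discount applies on top of the cached rate:

```python
def effective_input_cost(base: float, cached: float, hit_ratio: float,
                         batch: bool = False) -> float:
    """Blended $/M input tokens for a given prompt-cache hit ratio.

    Rates come from the pricing table above ($1.25 base, $0.125 cached);
    batch=True halves the rate per the Batch/Flex tiers.
    """
    cost = hit_ratio * cached + (1.0 - hit_ratio) * base
    return cost / 2 if batch else cost

# A long shared system prompt in an agent loop can push hit ratios past 0.9.
standard = effective_input_cost(1.25, 0.125, hit_ratio=0.9)               # $0.2375/M
overnight = effective_input_cost(1.25, 0.125, hit_ratio=0.9, batch=True)  # $0.11875/M
```

At a 0.9 hit ratio the blended rate is already ~5x below list price; the full 8-10x figure requires near-total cache hits.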

What the threads are saying

Two consistent themes on r/ChatGPTCoding and the OpenAI Python SDK GitHub issues:

  1. Strict mode finally works. The pre-5 era complaints about JSON Schema strict mode silently dropping fields are gone. The same threads now warn against the one remaining trap: nested oneOf with discriminator fields still occasionally produces empty branches. Workaround in the strict-JSON prompt.
  2. Codex is better than 5.4 for code. The recurring “5.4 vs 5.3-Codex for daily coding” thread on r/ChatGPTCoding lands on the same answer in dozens of replies: 5.3-Codex is tuned for diff-edit and tool-use; 5.4 is tuned for reasoning. For an agentic coding session, take Codex.
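One client-side mitigation for the oneOf trap in item 1 is to re-validate the raw response against a discriminated union before trusting it, so an empty branch fails loudly instead of flowing downstream. A sketch with Pydantic; the Pass/Fail union here is hypothetical, not the schema from the threads:

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter, ValidationError

# Hypothetical tagged union; "kind" is the discriminator field.
class Pass(BaseModel):
    kind: Literal["pass"]
    notes: str

class Fail(BaseModel):
    kind: Literal["fail"]
    error: str

Result = TypeAdapter(Annotated[Union[Pass, Fail], Field(discriminator="kind")])

def parse_result(raw: dict):
    """Re-validate raw strict-mode output; None signals a retry/fallback."""
    try:
        return Result.validate_python(raw)
    except ValidationError:
        return None

ok = parse_result({"kind": "pass", "notes": "all green"})
bad = parse_result({})  # the empty-branch failure mode the threads flag
```

An empty object is missing the discriminator, so validation rejects it and the caller can retry rather than persist a hollow record.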

How it compares

| TCC editorial score | GPT-5.3-Codex | Claude Opus 4.7 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Refactor | 8.4 | 9.0 | 8.5 | 7.8 |
| Test-gen | 8.7 | 8.4 | 8.6 | 7.6 |
| Debug | 8.4 | 8.8 | 8.5 | 7.9 |
| Agent & tool use | 8.6 | 9.1 | 8.7 | 7.4 |
| Strict JSON | 9.0 | 8.2 | 8.8 | 7.9 |
| Cost per successful task | $0.21 | $0.41 | $0.27 | $0.21 |

Verdict

GPT-5.3-Codex is the cheapest frontier-quality coding model in April 2026 and the top pick when structured output is in your critical path. It is the second-best refactor model and the best diff-edit model on Aider polyglot. Pair it with Opus 4.7 on the hardest cross-package work, route the structured-output traffic to it via Batch, and the cost-per-successful-task drops below every alternative on the table.

For the side-by-side methodology, see the 14-task scorecard. For the strict-JSON setup that posts 100 of 100, see the strict-JSON prompt.
