§ REVIEW · APR 23, 2026 CODEGEN · OPENAI · TESTING v1.0

GPT-5.3-Codex review: 9/10 on strict JSON, the test-gen surprise

Adrian Marcus tested GPT-5.3-Codex on a 14-task suite: 9.0 on strict JSON, 8.7 on test-gen, and a costly loss on long-horizon agent planning. Full numbers inside.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
9.0/10 · Peer score · May 2026
scaffold 8.8 · refactor 8.7 · test-gen 9.2 · debug 8.6 · agent 8.9

GPT-5.3-Codex sits at 85.0% on SWE-bench Verified in April 2026, one of the best public coding numbers any OpenAI model has posted to date. The gap to Claude Opus 4.7 on Verified (87.6%) is 2.6 points; the gap to Opus 4.7 on the harder SWE-bench Pro is wider (~8 points). On strict-JSON output the relationship inverts: GPT-5.3-Codex with response_format={"type":"json_schema","strict":true} hits 99-100 of 100 on a 40-property schema where Opus 4.7 hits 94-96. The recurring r/ChatGPTCoding threads about the launch converge on the same takeaway: it is the cheapest frontier model whose strict mode actually behaves strictly. This is the review.

Quick Verdict
Best for: high-volume strict-JSON pipelines, async task delegation, parallel workloads, Batch/Flex overnight runs, OpenAI-stack teams, GitHub Copilot users
Not best for: interactive pair programming (Claude Code is tighter); hardest cross-package TypeScript refactors (Opus 4.7 wins); long-horizon agent loops with strict step budgets
Watch out for: hidden reasoning_tokens line on the bill at effort=high; barrel-file re-exports on monorepo renames; cybersecurity classification adds procurement friction for some enterprise teams
Pro tip: pair strict-mode JSON schema with Batch pricing and cache your system prompt (90% input discount, $0.175/M); together they give the cheapest frontier-quality structured output on the market
Update – April 23, 2026
New entrant: OpenAI shipped GPT-5.5 with 82.7% on Terminal-Bench 2.0 and 58.6% on SWE-bench Pro. It is the new headline leader on agentic coding, but at $5/$30 per M tokens (2x GPT-5.4) and 1M context (same as Opus 4.7).
Cost shift: GPT-5.3-Codex stays the cheapest frontier OpenAI option ($1.75/$14) and the strict-JSON leader. GPT-5.5 is the upgrade only when terminal automation or agent-loop accuracy matters more than $/successful-task.
Open-weight option: DeepSeek V4-Pro (Apr 24, 2026) at $1.74/$3.48 per M tokens posts ~80.6% Verified, the first open-weight model in the same league for a fraction of the closed-source price.

Quick answer: if your production pipeline runs structured outputs at scale or your team is already on the OpenAI stack, GPT-5.3-Codex is the strongest cost-adjusted coding model right now. If you do cross-package refactors or run long agent loops with deep tool chains, Claude Opus 4.7 still leads. If your work is interactive pair programming inside an IDE, you want Claude Code, not Codex.

What GPT-5.3-Codex actually is (and isn’t)

This is not a pair programmer. GPT-5.3-Codex is an autonomous coding agent designed for task delegation. You hand it a task, it works in an isolated cloud sandbox pre-loaded with your repo, and you check back when it’s done. The workflow is: assign, steer if needed, review the output. You are managing it asynchronously, not typing alongside it.

The “steering” feature is what makes this practical: you can jump in while the model is working, ask questions, give feedback, and redirect it before it goes too far down the wrong path. One developer ran it for 25 hours uninterrupted, generating roughly 30,000 lines of code across 13 million tokens. That use case is what this model is built for. It is not built for the “I need help with this specific function right now” interaction, where Claude Code or Cursor Composer is a better fit.

Public benchmarks (April 2026)

Benchmark | GPT-5.3-Codex | GPT-5.4 | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro
SWE-bench Verified | 85.0% | n/p | n/p | 87.6% | 80.6%
SWE-bench Pro | 56.4% | 57.7% | 58.6% | 64.3% | 54.2%
Terminal-Bench 2.0 | 78.4% | 75.1% | 82.7% | 69.4% | 68.5%
OSWorld-Verified | 64.7% | n/p | n/p | n/p | n/p
τ²-bench (tool use) | 90.9% | n/p | n/p | n/p | 95.6%
GPQA Diamond | 91.5% | 92.8% | n/p | 94.2% | 94.3%

Sources: Aider polyglot leaderboard, the TokenMix April 2026 SWE-bench summary, the GPT-5.5 launch post, and Artificial Analysis independent measurements.

The OSWorld jump is the number that doesn’t get enough attention. The predecessor (GPT-5.2-Codex) scored 38.2% on OSWorld-Verified; 5.3-Codex hits 64.7%. That’s a 26.5-point gain in general computer productivity tasks (UI navigation, terminal operations, document workflows), which maps directly to the kinds of multi-step autonomous tasks this model was designed for. On Terminal-Bench 2.0, GPT-5.3-Codex leads all Opus variants at 78.4%.

Token efficiency: the underrated advantage

This is the section most reviews skip, and it’s often the one that determines whether Codex actually saves you money versus Claude Code. On equivalent TypeScript tasks, the numbers from Composio’s independent efficiency comparison are striking:

Task | GPT-5.3-Codex tokens | Claude Code tokens | Codex advantage
TypeScript feature implementation | 72,579 | 234,772 | ~3.2x fewer
Figma-to-code conversion | ~1.5M | ~6.2M | ~4.1x fewer

At $1.75/$14 input/output versus Claude Code’s Opus 4.7 at $5/$25, the per-token price gap is already about 2.9x on input and about 1.8x on output. When you layer in 3-4x fewer tokens consumed per task, the effective cost-per-task gap becomes large. The flip side: Claude Code does complex refactors with ~23% fewer runtime errors in large TypeScript codebases. The cheaper cost-per-task is real; it comes with a real capability trade-off on the hardest work.
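To make the arithmetic concrete, here is a rough sketch using the list prices and the Composio token counts quoted above. The 80/20 input/output split is a hypothetical assumption for illustration only; the token counts in the table are totals, so the real split is unknown.

```python
# Rough cost-per-task arithmetic from the prices and token counts quoted above.
# ASSUMPTION: the 80/20 input/output split is hypothetical; the source
# figures report total tokens only.

PRICES = {  # USD per 1M tokens: (input, output)
    "gpt-5.3-codex": (1.75, 14.00),
    "opus-4.7": (5.00, 25.00),
}


def task_cost(model: str, total_tokens: int, input_share: float = 0.8) -> float:
    """Estimate the USD cost of one task from total tokens and an assumed input share."""
    price_in, price_out = PRICES[model]
    tokens_in = total_tokens * input_share
    tokens_out = total_tokens * (1.0 - input_share)
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000


input_ratio = PRICES["opus-4.7"][0] / PRICES["gpt-5.3-codex"][0]
output_ratio = PRICES["opus-4.7"][1] / PRICES["gpt-5.3-codex"][1]

# Token totals from the TypeScript feature implementation row.
codex = task_cost("gpt-5.3-codex", 72_579)
opus = task_cost("opus-4.7", 234_772)

print(f"input price ratio:  {input_ratio:.2f}x")
print(f"output price ratio: {output_ratio:.2f}x")
print(f"per-task cost: Codex ${codex:.3f} vs Claude Code ${opus:.3f} ({opus / codex:.1f}x)")
```

Under that assumed split the per-task gap lands near 7x, but the figure moves with the split: a more output-heavy task narrows it, because the output price ratio is smaller than the input one.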

Where it wins: our 14-task editorial scoring

The 14-task scorecard graded GPT-5.3-Codex via the OpenAI Responses API at the shipped endpoint with reasoning effort set to high. Each score is the median of 5 runs.

Domain | GPT-5.3-Codex | vs Opus 4.7 | vs GPT-5.2-Codex
Refactor | 8.4 | -0.6 | +0.3
Test-gen | 8.7 | +0.3 | +0.4
Debug | 8.4 | -0.4 | +0.2
Agent & tool use | 8.6 | -0.5 | +0.5
Structured output (40-prop schema) | 9.0 | +0.8 | +0.4
RAG (long context) | 8.5 | -0.1 | +0.3

The 9.0 on structured output is the headline. GPT-5.3-Codex with strict mode and a typed Pydantic response schema posts 100 of 100 across 500 runs in our adversarial set. The full breakdown of why is on the strict-JSON prompt post. The short version: "strict": true on the OpenAI side compiles the JSON Schema into a constrained-decoding grammar at inference, so the unescaped-quote and trailing-comma failure classes cannot happen by construction. Opus 4.7 has no equivalent.
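For readers who have not set up strict mode before, the sketch below shows roughly what a strict request payload looks like. It is not the schema used in this review: `ticket_schema`, the schema name, and the `make_strict` helper are hypothetical. The payload follows the Chat Completions-style `response_format` shape; as far as I know the Responses API carries the same fields under `text.format`, so check the current API reference before relying on either. The helper applies the two object-level rules strict mode is documented to expect: every property listed under `required`, and `additionalProperties` set to false.

```python
import json
from typing import Any


def make_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a JSON Schema tightened for strict mode.

    Every object lists all of its properties under "required" and sets
    "additionalProperties" to False. The input schema is not mutated.
    """
    out = dict(schema)
    if out.get("type") == "object":
        props = {k: make_strict(v) for k, v in out.get("properties", {}).items()}
        out["properties"] = props
        out["required"] = list(props)
        out["additionalProperties"] = False
    elif out.get("type") == "array" and isinstance(out.get("items"), dict):
        out["items"] = make_strict(out["items"])
    return out


# Hypothetical small schema standing in for the 40-property one.
ticket_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "labels": {
            "type": "array",
            "items": {"type": "object", "properties": {"name": {"type": "string"}}},
        },
    },
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket",  # hypothetical schema name
        "strict": True,
        "schema": make_strict(ticket_schema),
    },
}

print(json.dumps(response_format, indent=2))
```

Building the payload in code rather than by hand keeps nested objects from silently missing `additionalProperties`, which strict mode rejects at request time.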

Where it loses

Cross-package refactors. On a multi-package TypeScript monorepo with re-exported types behind barrel files, GPT-5.3-Codex misses one or two of the indirected call sites that Opus 4.7 catches. The pattern is the same one the refactor legacy TypeScript guide walks through. If your daily work is one-off feature code, you will not feel this. If you ship a 14-call-site rename twice a month, you will see the gap.
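One way to reduce the risk before delegating a rename is to list the barrel files that re-export the symbol, so the indirected call sites can be named in the task description. The following is a minimal regex-based sketch, not a TypeScript parser; `find_reexports` and the sample barrel are hypothetical, and multi-line or unusual export forms may slip through.

```python
import re

# Matches `export { A, B as C } from './mod'` and `export type { A } from './mod'`.
NAMED_REEXPORT = re.compile(
    r"export\s+(?:type\s+)?\{([^}]*)\}\s*from\s*['\"]([^'\"]+)['\"]"
)
# Matches `export * from './mod'`; the symbol may or may not pass through it.
STAR_REEXPORT = re.compile(r"export\s+\*\s+from\s*['\"]([^'\"]+)['\"]")


def find_reexports(source: str, symbol: str) -> dict[str, list[str]]:
    """List modules in a barrel file that re-export `symbol` by name or via `*`."""
    named = []
    for names, module in NAMED_REEXPORT.findall(source):
        for entry in names.split(","):
            # Handle `type Foo` modifiers and `Foo as Bar` aliases.
            parts = entry.strip().removeprefix("type ").split(" as ")
            if symbol in (p.strip() for p in parts):
                named.append(module)
    return {"named": named, "star": STAR_REEXPORT.findall(source)}


# Hypothetical barrel file contents.
barrel = """
export { UserId, Account as Acct } from './account';
export type { Session } from './session';
export * from './legacy';
"""

print(find_reexports(barrel, "Account"))
```

Star re-exports are reported separately because the regex cannot tell whether the symbol travels through them; those modules need a manual look.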

Long-horizon bounded planning. GPT-5.3-Codex respects a 5-step budget more often than not, but Opus 4.7 hits the budget cleanly on 5 of 5 in our suite where Codex hits it on 4 of 5. The bounded planner prompt gets Codex to parity, but you have to apply the prompt; Opus 4.7 does not need it.
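The step budget can also be enforced outside the prompt. The sketch below is one possible harness, not the bounded planner prompt referenced above: `run_bounded`, `RunResult`, and `fake_step` are hypothetical names, and `fake_step` stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    answer: Optional[str]  # None means the loop gave up at the budget
    steps_used: int
    trace: list[str] = field(default_factory=list)


def run_bounded(step: Callable[[list[str]], Optional[str]], budget: int = 5) -> RunResult:
    """Call `step` until it returns an answer or the step budget is spent."""
    trace: list[str] = []
    for n in range(1, budget + 1):
        answer = step(trace)
        trace.append(f"step {n}")
        if answer is not None:
            return RunResult(answer, n, trace)
    return RunResult(None, budget, trace)  # clean give-up once the budget is used


# Hypothetical stand-in for a model call: it answers on its third step.
def fake_step(trace: list[str]) -> Optional[str]:
    return "done" if len(trace) == 2 else None


result = run_bounded(fake_step, budget=5)
print(result.answer, result.steps_used)  # prints "done 3"
```

A hard cap in the harness guarantees the loop stops, but it does not make the model plan within the budget; that part still depends on the prompt.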

Interactive pair programming. The async delegation model is a feature for long-horizon tasks and a limitation for interactive sessions. If you want a model that follows your cursor, gives line-level suggestions, and stays in a synchronous conversation, Codex is not the right tool. Use Claude Code in Cursor or your IDE of choice instead.

Key capabilities

Parallel task delegation. You can run 7+ simultaneous Codex instances on independent tasks, each in its own isolated sandbox with full repo context. OpenAI demonstrated this at DevDay 2025, shipping multiple game implementations in parallel. For teams with a backlog of independent tickets, this is a genuine throughput multiplier.
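On the client side, fanning out independent tickets with a concurrency cap can be sketched as below. This is an illustration of the pattern only: `run_task` is a hypothetical placeholder for whatever call actually delegates a ticket, and the cap of 7 simply mirrors the figure quoted above.

```python
import asyncio

MAX_PARALLEL = 7  # mirrors the "7+ simultaneous instances" figure above


async def run_task(ticket: str) -> str:
    """Hypothetical placeholder for delegating one ticket to a sandboxed agent."""
    await asyncio.sleep(0.01)  # stands in for the real, long-running call
    return f"{ticket}: ready for review"


async def delegate_all(tickets: list[str], limit: int = MAX_PARALLEL) -> list[str]:
    """Run independent tickets concurrently, at most `limit` at a time."""
    gate = asyncio.Semaphore(limit)

    async def guarded(ticket: str) -> str:
        async with gate:
            return await run_task(ticket)

    # gather preserves input order, so results line up with tickets
    return await asyncio.gather(*(guarded(t) for t in tickets))


results = asyncio.run(delegate_all([f"TICKET-{i}" for i in range(10)]))
print(results[0])
```

The semaphore matters once the backlog exceeds the number of instances you are allowed to run; without it, all ten tickets would start at once.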

Mid-task steering. Reprioritize, redirect, or ask questions while Codex is working. You are managing it asynchronously, not blocked waiting for output. The model maintains context across steering interventions without losing state.

GitHub Copilot integration. From February 9, 2026, GPT-5.3-Codex is available natively in GitHub.com, GitHub Mobile, Visual Studio, and VS Code through Copilot. If you’re already on a Copilot plan (~$10/month), you have Codex access today without changing your tooling.

Reasoning effort control. Low / medium / high / xhigh settings let you tune cost versus depth per task. Use low for routine maintenance tickets, xhigh for architectural decisions. The xhigh tier sits between high and max and is the default in most production setups for complex coding.
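Higher effort settings show up on the bill as reasoning tokens, so it is worth measuring their share per call. The sketch below assumes the usage object follows the Responses API shape, where reasoning tokens are reported under `output_tokens_details.reasoning_tokens`, counted inside `output_tokens`, and billed at the output rate; the sample numbers are hypothetical.

```python
OUTPUT_PRICE_PER_M = 14.00  # GPT-5.3-Codex output price quoted in this review


def reasoning_cost(usage: dict) -> dict:
    """Split output spend into visible and reasoning tokens.

    ASSUMPTION: `usage` follows the Responses API shape, with reasoning
    tokens included in `output_tokens` and billed at the output rate.
    """
    output_tokens = usage["output_tokens"]
    reasoning = usage.get("output_tokens_details", {}).get("reasoning_tokens", 0)

    def to_usd(tokens: int) -> float:
        return tokens * OUTPUT_PRICE_PER_M / 1_000_000

    return {
        "reasoning_usd": to_usd(reasoning),
        "visible_usd": to_usd(output_tokens - reasoning),
        "reasoning_share": reasoning / output_tokens if output_tokens else 0.0,
    }


# Hypothetical usage numbers for one high-effort call.
sample = {
    "input_tokens": 12_000,
    "output_tokens": 9_000,
    "output_tokens_details": {"reasoning_tokens": 6_000},
}
print(reasoning_cost(sample))
```

Logging `reasoning_share` per ticket type is a cheap way to decide where low or medium effort is enough.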

The cybersecurity classification

GPT-5.3-Codex is the first OpenAI model classified as “High capability” in cybersecurity under their Preparedness Framework. Apollo Research’s testing found a mean best-of-10 sabotage score of 0.88/1.00 (versus 0.75 for GPT-5.2-Codex). OpenAI does not claim it can fully automate cyberattacks but “cannot rule out the possibility.” This is why the API launch was delayed: OpenAI used the additional time to implement detection layers and the Trusted Access for Cyber program for vetted security professionals.

For most engineering teams, this classification is a footnote. For enterprise procurement teams with strict security requirements, it creates additional approval steps. Check with your security team before signing a large commitment.

Pricing and economics

Model | Input $/M | Output $/M | Cached input $/M | Context
GPT-5.3-Codex | $1.75 | $14.00 | $0.175 | 400K
GPT-5.4 | $2.50 | $15.00 | $0.25 | 1.05M in / 128K out
GPT-5.5 | $5.00 | $30.00 | $0.50 | 1M
Claude Opus 4.7 | $5.00 | $25.00 | n/a | 1M
DeepSeek V4-Pro | $1.74 | $3.48 | $0.435 | 1M (open-weight)

Pricing per the OpenAI models reference and the GPT-5.4 launch post. The cached-input price of $0.175/M is the lever most teams underuse. If your workload reuses long system prompts, which most agent loops do, every cached token is billed at one tenth of the standard input rate, so a high cache hit ratio cuts the effective input cost by up to 10x. OpenAI also ships Batch and Flex pricing at half the standard rate; for overnight eval runs or large-corpus structured-output jobs, Batch is the cheapest frontier-quality option on the market today.
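How much caching saves depends on the share of input tokens that actually hit the cache. A small sketch of the blended price, using only the two input prices quoted above (the function name is hypothetical):

```python
INPUT_PER_M = 1.75    # standard input price quoted above, USD per 1M tokens
CACHED_PER_M = 0.175  # cached input price quoted above, USD per 1M tokens


def effective_input_price(cache_hit_ratio: float) -> float:
    """Blended USD per 1M input tokens for a given share of cached tokens."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return cache_hit_ratio * CACHED_PER_M + (1.0 - cache_hit_ratio) * INPUT_PER_M


for ratio in (0.0, 0.5, 0.8, 0.95, 1.0):
    price = effective_input_price(ratio)
    print(f"hit ratio {ratio:.0%}: ${price:.4f}/M ({INPUT_PER_M / price:.1f}x cheaper)")
```

The full 10x only appears at a 100% hit ratio; at 80% the blended price is $0.49/M, roughly 3.6x below the standard rate.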

Access options

Access path | Price | What you get
ChatGPT Plus | $20/mo | Codex Web + CLI + VS Code; standard usage limits
ChatGPT Pro | $200/mo | Higher limits + Codex-Spark research preview (Cerebras, 1000+ t/s)
GitHub Copilot | ~$10/mo | Codex in GitHub.com, Mobile, VS, VS Code (from Feb 9, 2026)
ChatGPT Team | $30/user/mo | Shared workspace, admin controls
Enterprise / Edu | Custom | SOC 2, HIPAA, zero data retention
API | $1.75/$14 per 1M | Responses API; 400K context; Batch at half price

Codex model family (brief history)

The modern Codex product is unrelated to OpenAI’s deprecated GPT-3-based Codex of 2021-2023. They share a name only. The current lineage: codex-1 (May 2025, research preview based on o3), GPT-5-Codex (~Sep 2025), GPT-5.1-Codex (~Nov 2025), GPT-5.2-Codex (Jan 14, 2026, context compaction + Windows support), and GPT-5.3-Codex (Feb 5, 2026, current flagship, 25% faster, higher OSWorld and Terminal-Bench scores). The GPT-5.3-Codex-Spark variant (Feb 12, 2026) runs at 1000+ tokens/second on Cerebras hardware; it’s a research preview with 128K context, text-only, available to Pro users with separate rate limits.

What the threads are saying

Two consistent themes on r/ChatGPTCoding and the OpenAI Python SDK GitHub issues:

  1. Strict mode finally works. The pre-5 era complaints about JSON Schema strict mode silently dropping fields are gone. The same threads now warn against one remaining trap: nested oneOf with discriminator fields still occasionally produces empty branches. Workaround in the strict-JSON prompt.
  2. Codex is better than 5.4 for code. The recurring “5.4 vs 5.3-Codex for daily coding” thread on r/ChatGPTCoding lands on the same answer in dozens of replies: 5.3-Codex is tuned for diff-edit and tool-use; 5.4 is tuned for reasoning. For an agentic coding session, take Codex.

How it compares

TCC editorial score | GPT-5.3-Codex | Claude Opus 4.7 | GPT-5.4 | GPT-5.5 | Gemini 3.1 Pro
Refactor | 8.4 | 9.0 | 8.5 | 8.7 | 7.8
Test-gen | 8.7 | 8.4 | 8.6 | 8.6 | 7.6
Debug | 8.4 | 8.8 | 8.5 | 8.7 | 7.9
Agent & tool use | 8.6 | 9.1 | 8.7 | 9.2 | 7.4
Strict JSON | 9.0 | 8.2 | 8.8 | 8.9 | 7.9
Cost per successful task | $0.21 | $0.41 | $0.27 | $0.46 | $0.21

Pros and cons

Strengths | Weaknesses
Cheapest frontier OpenAI model at $1.75/$14 per M tokens | Cross-package refactors: Claude Code shows ~23% fewer runtime errors on large TypeScript codebases
Strict JSON mode with constrained decoding: 100/100 on 40-prop schema adversarial set | Context window 400K vs Opus 4.7’s 1M and Gemini 3.1 Pro’s 2M
3-4x more token-efficient than Claude Code on equivalent tasks | Async delegation model is awkward for interactive pair programming sessions
OSWorld-Verified jumped from 38.2% to 64.7% vs predecessor | Cybersecurity “High capability” classification adds enterprise procurement friction
Parallel task delegation: 7+ simultaneous instances | Bounded planning needs an explicit prompt; Opus 4.7 stays within the step budget without one
GitHub Copilot integration: available to millions of existing Copilot users | Codex-Spark (ultra-fast Cerebras variant) is research preview only

Frequently asked questions

Is GPT-5.3-Codex better than Claude Opus 4.7 for coding? Depends on the task. For strict-JSON pipelines, token efficiency, OSWorld tasks, and async delegation workloads, yes. For cross-package TypeScript refactors, bounded agent loops, and vision-heavy tasks, Opus 4.7 leads. They’re complementary: route structured-output traffic to Codex, hard refactors to Opus 4.7.

How does Codex compare to Claude Code for daily development? Different interaction model. Claude Code is synchronous pair programming inside your IDE. Codex is async task delegation to an isolated sandbox. For interactive development sessions, Claude Code in Cursor or your IDE is better. For ticket-level work you can fire off and review later, Codex is more efficient.

What does the “High capability” cybersecurity classification mean for API use? It caused a delayed API launch while OpenAI implemented additional safeguards. The API is now available. For most engineering teams, the classification is a footnote. For enterprise procurement with strict security requirements, it creates additional approval steps. The Apollo Research sabotage score (0.88/1.00) is the specific concern; OpenAI’s Trusted Access for Cyber program addresses it for vetted security teams.

Is GPT-5.3-Codex cheaper than Claude Code? On a per-token basis, yes: $1.75/$14 vs Opus 4.7’s $5/$25. Combined with 3-4x fewer tokens consumed per task on equivalent work, the effective cost-per-task advantage is substantial. The one caveat: you may need more passes on complex refactors that Claude Code handles in one, which narrows the gap.

Can I run multiple Codex tasks in parallel? Yes, 7 or more simultaneous instances, without context loss between them. Each runs in its own isolated sandbox. This is one of the primary reasons to choose Codex for teams with a backlog of independent tickets or nightly CI tasks.

How does caching work with GPT-5.3-Codex? Cached input pricing is $0.175/M, a 90% discount on standard input pricing. If your system prompt or large context is reused across many calls, which is true for most agent loops, this drops the effective input cost significantly. Enable caching by structuring prompts to have stable prefixes that reuse across requests.

Verdict

GPT-5.3-Codex is the cheapest frontier-quality OpenAI coding model in April 2026 and the top pick when structured output is in your critical path. The 3-4x token efficiency advantage over Claude Code is real, the constrained-decoding strict mode is the best JSON reliability on the market, and the parallel delegation model scales in ways synchronous pair-programming tools cannot. (DeepSeek V4-Flash at $0.14/$0.28 is cheaper still per token with ~78% Verified, the right call when your quality floor sits below frontier.)

Pair it with Opus 4.7 on the hardest cross-package work, route the structured-output traffic to Codex via Batch, and the cost-per-successful-task drops below every alternative on the table. For the side-by-side methodology, see the 14-task scorecard. For the strict-JSON setup that posts 100 of 100, see the strict-JSON prompt. For the comparison with Opus 4.7, see the Opus 4.7 review.
