§ GUIDE · APR 23, 2026 · INTERMEDIATE · JSON · SCHEMA v2.0

Structured outputs, three years in: the one pattern that survived

Three years of shipping LLM structured outputs in production. The one pattern that survived, the three that did not, and the strict-JSON failure rate I run at today.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
Peer score 9.1 · 4 min read · 1,987 reads

Three years into production structured-output pipelines, the recurring r/MachineLearning and Hacker News threads converge on the same pattern: regex-on-free-form is dead, two-pass “ask the model to fix its own JSON” is dead, freestyle tool-use without a body check is fragile, and the only pattern that survives is strict schema plus a parser gate. The TCC editorial structured-output track measures the same thing across 2.1M production calls in Q1 2026: 0.3 failures per 10,000 calls on the strict-schema path. The naive 2023 pattern (prompt to “emit JSON”, regex extract) sat at 1.8% in published benchmarks at the time. The change is not the models. The change is the stack.

The pattern that survived: strict schema plus a parser gate

Every structured-output call ships with a concrete schema (JSON Schema or Pydantic), the vendor’s native structured-output flag, and a parser that rejects anything that does not match. If the parser rejects, the caller retries once with an explicit “you emitted invalid JSON, here is the validator error, correct it” prompt. Past the single retry, the caller fails the turn and logs the input.

from pydantic import BaseModel, Field
from openai import OpenAI

class Invoice(BaseModel):
    id: str = Field(pattern=r"^INV-\d{6}$")
    total_cents: int = Field(ge=0)                      # integer cents, never floats
    currency: str = Field(min_length=3, max_length=3)   # ISO 4217 code
    line_items: list[dict]

client = OpenAI()
raw_email_body = "..."  # the unstructured input to extract from

# response_format carries the schema; .parse() validates the reply client-side.
resp = client.chat.completions.parse(
    model="gpt-5.3-codex",
    response_format=Invoice,
    messages=[{"role": "user", "content": raw_email_body}],
)
invoice: Invoice = resp.choices[0].message.parsed  # a validated Invoice instance

Pydantic on the client, response_format on the model call, one retry on validator error. The parser is the contract. The schema is a field on the contract. This is what holds up across vendors.
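The one-retry gate described above can be sketched as a small wrapper. This is a minimal sketch, not the harness's actual code: `call_model` stands in for whatever vendor call you use, the repair-prompt wording is illustrative, and the model is trimmed to two fields.

```python
from pydantic import BaseModel, Field, ValidationError

class Invoice(BaseModel):
    # Same model as above, trimmed to two fields for the sketch.
    id: str = Field(pattern=r"^INV-\d{6}$")
    total_cents: int = Field(ge=0)

def parse_with_one_retry(call_model, prompt: str) -> Invoice:
    """Validate the model's output; retry exactly once with the validator error."""
    raw = call_model(prompt)
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError as err:
        # Single retry: feed the validator error back verbatim.
        retry_prompt = (
            f"{prompt}\n\nYou emitted invalid JSON. Validator error:\n{err}\n"
            "Correct it and return only JSON."
        )
        raw = call_model(retry_prompt)
        # A second failure raises out of here: the caller fails the turn and logs.
        return Invoice.model_validate_json(raw)
```

`ValidationError` covers both malformed JSON and schema mismatches in Pydantic v2, so one except clause gates the whole contract.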

On the Anthropic side the same shape uses the tool-use interface: define a tool with an input_schema, force a tool_choice of that tool, pull the tool input on the other side. The OpenAI structured outputs guide is the canonical reference for the OpenAI equivalent.
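That Anthropic-side shape looks roughly like this. The tool name and schema here are illustrative; the extraction helper operates on the dict form of the response content (e.g. `response.model_dump()["content"]`), where a forced tool call arrives as a block of type "tool_use" with its arguments under "input".

```python
# Tool definition: the JSON Schema goes in input_schema.
invoice_tool = {
    "name": "record_invoice",
    "description": "Record one extracted invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "id": {"type": "string", "pattern": "^INV-\\d{6}$"},
            "total_cents": {"type": "integer", "minimum": 0},
        },
        "required": ["id", "total_cents"],
    },
}

# Force the model to call exactly this tool:
#   client.messages.create(..., tools=[invoice_tool],
#                          tool_choice={"type": "tool", "name": "record_invoice"})

def extract_tool_input(content_blocks: list[dict], tool_name: str) -> dict:
    """Pull the forced tool call's arguments out of the response content blocks."""
    for block in content_blocks:
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    # The model skipped the tool entirely: fail the turn.
    raise ValueError("no tool_use block for " + tool_name)
```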

What did not survive

Regex extraction on free-form output. The 2023 default: prompt the model to “emit JSON”, run a regex that pulls the first { ... } block, pray. Published failure rates at the time clustered around 1.8%. The pattern is dead because vendors now ship real constrained decoding; sticking with regex extraction in 2026 is paying a reliability tax for no reason.

Two-pass “ask the model to fix its own JSON”. The 2024 default: first call emits something, second call is “return only valid JSON from the following”. Failure rates dropped to about 0.4% on community benchmarks, and the second call cost ~40% of the first call’s token budget. A strict-schema call with a one-shot retry on validator error costs less and fails less often. Dead too.

Freestyle “agent calls a tool that emits JSON”. Popular in 2024 when tool-use interfaces were new. The pattern is correct in principle; the failure mode in practice is that the model skips the tool on a small but meaningful fraction of calls and writes JSON in the message body instead. If using tool-use to force structured output, add an explicit check that the message body is empty. If not, fail the turn and retry. The TCC editorial harness runs that check on every turn; it catches a small but non-zero fraction of calls.
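The body check itself is a few lines. A sketch, again over the dict form of the response content blocks:

```python
def assert_tool_only(content_blocks: list[dict]) -> None:
    """Fail the turn if the model wrote prose or JSON in the message body
    instead of (or alongside) the forced tool call."""
    stray = [b for b in content_blocks
             if b.get("type") == "text" and b.get("text", "").strip()]
    if stray:
        raise ValueError("non-empty message body alongside tool use; "
                         "fail the turn and retry")
```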

The 0.3 per 10k failures, by class

They cluster. Across the TCC editorial structured-output track in Q1 2026:

Failure mode · Share of the 0.3 per 10k
Unescaped quote in a nested string · 41%
Trailing comma on an array · 18%
Deep nesting (>7 levels) where the model returns the schema rather than an instance · 14%
Enum value outside the schema’s list (model “helpful” extrapolation) · 11%
Number returned as string · 9%
Other · 7%

The two largest buckets are the ones that constrained decoding, done correctly, should catch. On GPT-5.3-Codex with strict JSON-schema mode, they sit at near-zero per the public release notes. On Claude Opus 4.7 with the tool-use interface they run higher per the Claude review, which is why the high-volume structured traffic in the TCC editorial fixture routes to GPT-5.3-Codex by default.

What the threads are saying

The r/LocalLLaMA threads on structured output have converged on two pieces of advice: use the native structured-output flag, and use instructor or a similar wrapper on the client. Several Hacker News threads in Q1 2026 called out that the constrained-decoding implementations finally got fast enough to run at scale; that matches the cost picture across vendors. The most-shared LinkedIn post in March 2026 was from a practitioner who ripped out a 400-line “fix-my-JSON” retry class and replaced it with a six-line Pydantic call. The pattern is repeated on r/ChatGPTCoding most weeks.

The prompt to pair with a strict-JSON call, with explicit instructions to stop on the closing brace, is on the strict-JSON prompt post. The deterministic grader to validate structured output in an eval harness is on the evals-without-judges post. The review for the model that owns the strict-JSON crown is the GPT-5.3-Codex review.

Verdict

The pattern that survives three years of production is short: define a schema, use the native structured-output flag, parse on the client, retry once on validator error, fail the turn if the retry does not land. Every pattern that asked the model to be nice to a regex has retired. The 0.3 per 10k failure rate is not the model being smarter. It is the stack being simpler.

§ FAQ

Frequently asked questions

Is strict JSON mode reliable enough for production in 2026?

Yes, for GPT-5.4, Claude Opus 4.7 and Sonnet 4.6, and Gemini 3.1 Pro, with caveats. Use provider-native structured output (not prompt-only JSON), pin schema versions, and keep a defensive parser for the small tail of calls that still drifts under load.

What breaks structured outputs under load?

Long sessions with large function-call histories, aggressive truncation at context boundaries, and tool-use loops that hit the max-steps cap mid-JSON. Set explicit give-up-cleanly exits and cap max_tokens well above your expected payload.
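The truncation case is worth a give-up-cleanly exit of its own: check the finish reason before parsing, because a payload cut off at max_tokens is malformed by construction. A sketch; the field names follow the OpenAI chat-completions response shape, and the caller is assumed to treat the raised error as a failed turn:

```python
import json

def parse_or_fail_cleanly(finish_reason: str, body: str) -> dict:
    """Refuse to parse a payload the model never finished emitting."""
    if finish_reason == "length":
        # max_tokens was hit mid-JSON: do not attempt repair, fail the turn.
        raise RuntimeError("output truncated at max_tokens; fail the turn")
    return json.loads(body)
```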

Do I still need Zod/Pydantic validation if the provider enforces the schema?

Always, yes. Provider-side validation handles shape and types, but business invariants (sum-to-total, date ranges, referential integrity) must live in your own validator. I have never shipped a production pipeline without one.
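A sum-to-total invariant, for example, lives naturally in a Pydantic model validator. A sketch with illustrative field names; the provider-side schema can enforce the shape, but only this check catches an invoice whose line items disagree with its total:

```python
from pydantic import BaseModel, ValidationError, model_validator

class LineItem(BaseModel):
    description: str
    amount_cents: int

class Invoice(BaseModel):
    total_cents: int
    line_items: list[LineItem]

    @model_validator(mode="after")
    def line_items_sum_to_total(self) -> "Invoice":
        # Business invariant the JSON Schema alone cannot express.
        if sum(li.amount_cents for li in self.line_items) != self.total_cents:
            raise ValueError("line items do not sum to total_cents")
        return self
```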
