§ GUIDE · APR 23, 2026 · INTERMEDIATE · JSON · SCHEMA v2.0

Structured outputs, three years in: the one pattern that survived

Three years of shipping LLM structured outputs in production. The one pattern that survived, the three that did not, and the strict-JSON failure rate I run at today.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
Peer score 9.1 · 13 min read · 1,987 reads

Three years into production structured-output pipelines, the recurring r/MachineLearning and Hacker News threads converge on the same pattern: regex-on-free-form is dead, two-pass “ask the model to fix its own JSON” is dead, freestyle tool-use without a body check is fragile, and the only pattern that survives is strict schema plus a parser gate. The TCC editorial structured-output track measures the same thing across 2.1M production calls in Q1 2026: 0.3 failures per 10,000 calls on the strict-schema path. The naive 2023 pattern (prompt to “emit JSON”, regex extract) sat at 1.8% in published benchmarks at the time. The change is not the models. The change is the stack.

This post covers the pattern that survived, the patterns that did not, the failure taxonomy within the 0.3/10K number, vendor-specific implementation details, TypeScript and Python code examples, and a testing approach that catches structured output failures before production.

The pattern that survived: strict schema plus a parser gate

Every structured-output call ships with a concrete schema (JSON Schema or Pydantic/Zod), the vendor’s native structured-output flag in strict mode, and a parser that rejects anything that does not match. If the parser rejects, the caller retries once with an explicit “you emitted invalid JSON, here is the validator error, correct it” prompt. Past the single retry, the caller fails the turn and logs the input.

Python (OpenAI, Pydantic)

from pydantic import BaseModel, Field
from openai import OpenAI

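# Line items get their own model: strict mode needs every object fully specified,
# so a bare dict is not enough.
class LineItem(BaseModel):
    description: str
    amount_cents: int = Field(ge=0)

class Invoice(BaseModel):
    id: str = Field(pattern=r"^INV-\d{6}$")  # e.g. INV-000123
    total_cents: int = Field(ge=0)
    currency: str = Field(min_length=3, max_length=3)
    line_items: list[LineItem]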

client = OpenAI()
resp = client.chat.completions.parse(
    model="gpt-5.3-codex",
    response_format=Invoice,
    messages=[{"role": "user", "content": raw_email_body}],
)
invoice: Invoice = resp.choices[0].message.parsed

TypeScript (OpenAI, Zod)

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const Invoice = z.object({
  id: z.string().regex(/^INV-\d{6}$/),
  total_cents: z.number().int().nonnegative(),
  currency: z.string().length(3),
  line_items: z.array(z.object({
    description: z.string(),
    amount_cents: z.number().int().nonnegative(),
  })),
});

const client = new OpenAI();
const completion = await client.beta.chat.completions.parse({
  model: "gpt-5.3-codex",
  messages: [{ role: "user", content: rawEmailBody }],
  response_format: zodResponseFormat(Invoice, "invoice"),
});

const invoice = completion.choices[0].message.parsed;

Anthropic (tool-use interface)

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const response = await client.messages.create({
  model: "claude-opus-4-7-20260401",
  max_tokens: 1024,
  tools: [{
    name: "extract_invoice",
    description: "Extract structured invoice data",
    input_schema: {
      type: "object",
      properties: {
        id: { type: "string", pattern: "^INV-\\d{6}$" },
        total_cents: { type: "integer", minimum: 0 },
        currency: { type: "string", minLength: 3, maxLength: 3 },
      },
      required: ["id", "total_cents", "currency"],
    },
  }],
  tool_choice: { type: "tool", name: "extract_invoice" },
  messages: [{ role: "user", content: rawEmailBody }],
});

const toolUse = response.content.find(b => b.type === "tool_use");
if (!toolUse) throw new Error("Model did not call the tool");
const invoice = toolUse.input;

Pydantic or Zod on the client, native structured-output flag in strict mode, one retry on validator error. The parser is the contract. The schema is a field on the contract. This is what holds up across vendors at 0.3 failures per 10,000 calls.
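A minimal sketch of the parse, retry once, then fail flow described above, written against the standard library so it stays vendor-neutral. The names `call_with_parser_gate`, `TurnFailed`, `validate_invoice`, and the injected `call_model` callable are illustrative assumptions, not part of any SDK; in a real pipeline the `validate` step would be the Pydantic or Zod parser.

```python
import json
import logging
from typing import Callable

class TurnFailed(Exception):
    """Raised when the output is still invalid after the single retry."""

def call_with_parser_gate(
    call_model: Callable[[str], str],
    validate: Callable[[dict], None],
    prompt: str,
) -> dict:
    """Call the model, gate the result, and retry at most once with the error."""
    current_prompt = prompt
    last_error = ""
    for _attempt in range(2):  # first call plus exactly one retry
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            validate(data)  # expected to raise ValueError on a violation
            return data
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            last_error = str(err)
            current_prompt = (
                f"{prompt}\n\nYou emitted invalid JSON. "
                f"Validator error: {last_error}. Correct it."
            )
    logging.error("structured output failed for input: %r", prompt)
    raise TurnFailed(f"still invalid after one retry: {last_error}")

def validate_invoice(data: dict) -> None:
    """Stand-in for the real parser: checks a single field."""
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if type(data.get("total_cents")) is not int or data["total_cents"] < 0:
        raise ValueError("total_cents must be a non-negative integer")

# Stubbed model: the first reply has the wrong type, the retry is valid.
replies = iter(['{"total_cents": "12"}', '{"total_cents": 1200}'])
result = call_with_parser_gate(lambda _p: next(replies), validate_invoice, "Extract the invoice.")
```

Capping the loop at two attempts keeps the worst-case cost of a bad input bounded at two calls, which is the point of failing the turn instead of looping.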

The architectural reason this works: constrained decoding

The core difference between JSON mode and strict structured outputs is not prompting. It is token generation. Strict mode uses constrained decoding: the schema is compiled into a finite state machine that masks invalid tokens during generation, making schema violations mathematically impossible rather than probabilistic. The model cannot emit a trailing comma because the token for “,” is masked when a trailing comma would violate the schema. It cannot close an object with a missing required field because the closing brace token is masked until all required fields are present.
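The masking idea can be shown with a toy example. This is a sketch under heavy simplifying assumptions: a five-state machine for objects shaped like {"ok": true}, whole-word "tokens", and a fake scoring function. Production decoders compile full JSON Schema grammars and mask logits over the tokenizer vocabulary; the names `FSM`, `constrained_decode`, and `biased_scores` are hypothetical.

```python
import json

# Each state maps an allowed token to the next state.
FSM = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}

def constrained_decode(score_fn) -> str:
    """Greedy decoding where only tokens allowed by the FSM can be picked."""
    state, out = "start", []
    while state != "done":
        allowed = FSM[state]
        scores = score_fn(out)  # token -> score over the whole vocabulary
        # Masking: tokens outside `allowed` are never considered.
        token = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        out.append(token)
        state = allowed[token]
    return "".join(out)

def biased_scores(_prefix):
    """Fake model that always prefers a stray comma over any valid token."""
    return {",": 9.0, "}": 1.0, "{": 0.5, '"ok"': 0.4, ":": 0.3, "true": 0.2, "false": 0.1}

decoded = constrained_decode(biased_scores)
parsed = json.loads(decoded)
```

Even though the fake model scores "," highest at every step, the comma never appears in the output because it is never in the allowed set.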

JSON mode (the legacy option) only guarantees valid JSON syntax. It does not guarantee schema adherence. A model in JSON mode can still emit an object with wrong field types, missing required fields, or values outside allowed enums. These are schema failures that look like valid JSON. Constrained decoding at the schema level prevents all of them.
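A short illustration of "valid JSON, invalid schema", assuming the invoice fields used in the examples earlier. The helper `schema_violations` is a hand-rolled stand-in for a real validator and is not a library function.

```python
import json
import re

def schema_violations(data: dict) -> list:
    """Return the names of fields that break the invoice schema."""
    bad = []
    if not re.fullmatch(r"INV-\d{6}", str(data.get("id", ""))):
        bad.append("id")
    if type(data.get("total_cents")) is not int or data["total_cents"] < 0:
        bad.append("total_cents")
    currency = data.get("currency")
    if not (isinstance(currency, str) and len(currency) == 3):
        bad.append("currency")
    return bad

# Syntactically fine, so a JSON-mode guarantee is satisfied...
raw = '{"id": "INV-12", "total_cents": "1200", "currency": "USD"}'
parsed = json.loads(raw)

# ...but the id pattern and the integer type are both wrong.
violations = schema_violations(parsed)
```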

The practical implication: the only failure modes that survive constrained decoding are the ones that cannot be caught at the token level. The 0.3/10K failure rate represents those edge cases, not the full distribution of schema errors that naive approaches see.

Strict mode vs JSON mode: the 2026 distinction

| Mode | Guarantee | When to use | When not to use |
| --- | --- | --- | --- |
| Strict structured outputs (strict: true) | Schema-valid output, guaranteed by constrained decoding | Production data extraction, agent tool calls, any pipeline where schema correctness is required | When you need schema features not supported by strict mode (recursive schemas, some OpenAI limitations on additionalProperties) |
| JSON mode (json_object) | Syntactically valid JSON only; no schema guarantee | Legacy compatibility; prototype work before schema is defined | Production pipelines; anywhere you need schema correctness |
| Free-form with prompt instruction | None; model can emit anything | Never for structured data extraction | Always for structured data extraction |

OpenAI’s own documentation now calls JSON mode “legacy” and explicitly recommends using structured outputs instead wherever possible. Use strict mode. Retire JSON mode from production pipelines.

Handling refusals as first-class errors

Strict structured outputs add a failure mode that free-form prompting did not have: the model can return a refusal object instead of the structured output. This happens when the model determines that complying with the request would violate its safety guidelines, typically on inputs that request extraction of harmful, private, or illegal content.

# Python
resp = client.chat.completions.parse(
    model="gpt-5.3-codex",
    response_format=Invoice,
    messages=[{"role": "user", "content": raw_email_body}],
)
message = resp.choices[0].message
if message.refusal:
    raise ValueError(f"Model refused: {message.refusal}")
invoice: Invoice = message.parsed

// TypeScript
const message = completion.choices[0].message;
if (message.refusal) {
  throw new Error(`Model refused: ${message.refusal}`);
}
const invoice = message.parsed;

Always check the refusal field before accessing parsed output. A pipeline that skips the refusal check and tries to access message.parsed on a refusal response will get a null value and likely a null-pointer exception downstream.

Provider support matrix (May 2026)

| Provider | Strict structured outputs | Interface | TCC strict-JSON score (10/10 = 0 failures) |
| --- | --- | --- | --- |
| OpenAI (GPT-5.3-Codex, GPT-5.5) | Yes, since GPT-4o | response_format with json_schema + strict: true | 9.0 |
| Anthropic (Claude Opus 4.7) | Yes, GA early 2026 | Tool-use interface with input_schema + tool_choice | 8.2 |
| Google (Gemini 3.1 Pro) | Yes | response_mime_type: "application/json" + response_schema | 7.9 |
| DeepSeek V4-Pro | Yes (JSON output mode) | response_format: {type: "json_object"} | 7.4 |
| Cohere | Yes (JSON mode) | response_format | 7.1 |

GPT-5.3-Codex holds the top strict-JSON TCC score at 9.0/10, with near-zero unescaped-quote, trailing-comma, and enum-hallucination failures. This is why the high-volume structured traffic in the TCC editorial fixture routes to GPT-5.3-Codex by default. Opus 4.7 scores 8.2 via the tool-use interface, where the remaining failure modes occur more often (unescaped quotes in nested strings specifically). Gemini 3.1 Pro is at 7.9. All three are production-usable; GPT-5.3-Codex is the tightest.

The 0.3 per 10K failures, by class

The failures that survive constrained decoding cluster into predictable categories. Across the TCC editorial structured-output track in Q1 2026:

| Failure mode | Share of the 0.3 per 10K | Fix |
| --- | --- | --- |
| Unescaped quote in a nested string | 41% | Use GPT-5.3-Codex in strict mode; this is near-zero there; or add explicit field-level validation for strings that can contain quotes |
| Trailing comma on an array | 18% | Strict mode with proper JSON Schema catches this; if still occurring, check that strict mode is actually enabled (not just JSON mode) |
| Deep nesting >7 levels: model returns schema, not instance | 14% | Flatten the schema; schemas deeper than 5-6 levels see elevated failure rates across all providers |
| Enum value outside the schema’s list (model “helpful” extrapolation) | 11% | Use string literals with strict enum validation; add “do not extrapolate enum values” to the system prompt for borderline inputs |
| Number returned as string | 9% | Strict mode with typed schema catches this; if still occurring, check schema type definitions |
| Other | 7% | Log and review for patterns; usually edge cases in deeply nested or recursive schemas |
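To put the shares in absolute terms, the arithmetic below converts them into implied failure counts, assuming the 2.1M calls and the 0.3 per 10,000 rate quoted earlier apply to the same traffic. The per-class counts are derived figures, not separately reported measurements, and the dictionary keys are shorthand labels of my own.

```python
calls = 2_100_000
failures_per_100k = 3  # 0.3 per 10,000, kept as an integer to avoid float noise

# Shares of the failure total, in percent, copied from the table above.
shares = {
    "unescaped_quote": 41,
    "trailing_comma": 18,
    "deep_nesting": 14,
    "enum_outside_list": 11,
    "number_as_string": 9,
    "other": 7,
}

total_failures = calls * failures_per_100k / 100_000  # 63.0
implied = {name: round(total_failures * pct / 100, 1) for name, pct in shares.items()}
```

At this volume the whole taxonomy describes about 63 failed calls in a quarter, so each class is a few dozen events at most; small enough to review by hand from logs.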

The two largest buckets (unescaped quotes and trailing commas) are the ones that constrained decoding handles cleanly in strict mode. If these are still occurring in your pipeline, you are not using strict mode; you are using JSON mode and getting confused by the naming. Check that strict: true is set in the json_schema definition, not just response_format: {type: "json_object"}.
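A small guard for that check, assuming the OpenAI Chat Completions `response_format` shape (`type: "json_schema"` with a nested `json_schema` object carrying `strict`). The function name `is_strict_schema_request` is hypothetical; a check like this could be asserted in a test or at client construction time.

```python
def is_strict_schema_request(response_format: dict) -> bool:
    """True only for json_schema requests that explicitly set strict: true."""
    if response_format.get("type") != "json_schema":
        return False  # covers plain JSON mode: {"type": "json_object"}
    return response_format.get("json_schema", {}).get("strict") is True

json_mode = {"type": "json_object"}
strict_mode = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice",
        "strict": True,
        "schema": {"type": "object", "properties": {}, "additionalProperties": False},
    },
}
schema_without_strict = {"type": "json_schema", "json_schema": {"name": "invoice", "schema": {}}}
```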

Deep nesting is its own problem

Schema depth is an underreported failure driver. Across the TCC fixture, schemas deeper than 7 levels of nesting see elevated failure rates on all providers. At 7+ levels, the model sometimes returns the schema itself (the type definition) instead of an instance of the schema. The 14% share of the 0.3/10K number is driven by schemas that are unnecessarily deep.

The fix is structural: flatten the schema wherever possible. A deeply nested response object is usually a design smell in the schema itself, not a fundamental requirement. Representing the same information as a flatter structure with array fields reduces both failure rates and latency (the model generates fewer tokens to represent the same data).

If deep nesting is genuinely required (tree structures, recursive data), use a multi-turn approach: extract the outer object in one call, extract nested objects in subsequent calls with the parent context provided. This is more expensive but avoids the nesting failure mode entirely.
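One way to keep depth visible is to measure it in CI. The sketch below counts nesting in a JSON Schema dictionary, with the assumption that every object or array container adds one level; the source does not define how levels are counted, so treat both the counting rule and the `MAX_DEPTH` threshold as adjustable. It ignores `$ref`, `anyOf`, and other composition keywords.

```python
MAX_DEPTH = 7  # threshold taken from the text above; tune for your own data

def schema_depth(schema: dict) -> int:
    """Count nested object/array levels in a plain JSON Schema dict."""
    kind = schema.get("type")
    if kind == "object":
        children = schema.get("properties", {}).values()
        return 1 + max((schema_depth(child) for child in children), default=0)
    if kind == "array":
        return 1 + schema_depth(schema.get("items", {}))
    return 0  # scalars add no nesting

invoice_schema = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"description": {"type": "string"}},
            },
        },
    },
}

depth = schema_depth(invoice_schema)  # object -> array -> object
too_deep = depth > MAX_DEPTH
```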

What did not survive: the dead patterns

Regex extraction on free-form output

The 2023 default. Prompt the model to “emit JSON”, run a regex that pulls the first { ... } block, pray. Published failure rates at the time clustered around 1.8%. Dead because the vendors finally ship real constrained decoding. Using it in 2026 means paying a failure tax for no reason. If you have a legacy pipeline on this pattern, replacing it with strict structured outputs is a one-day engineering task, and going from 1.8% to 0.3 per 10,000 (0.003%) is roughly a 600x reduction in failure rate.
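To make the failure concrete, here is a small sketch of the regex approach breaking on an ordinary chatty reply. The regex and the sample reply are illustrative assumptions, not taken from any published benchmark.

```python
import json
import re

def regex_extract(text: str):
    """The 2023 pattern: grab the first '{' through the last '}' and parse it."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return json.loads(match.group(0)) if match else None

clean_reply = '{"id": "INV-000123"}'
chatty_reply = 'Sure! Here is the JSON: {"id": "INV-000123"} Let me know if {anything} else helps.'

clean_result = regex_extract(clean_reply)

# The greedy match swallows the trailing prose, so parsing fails.
try:
    regex_extract(chatty_reply)
    extraction_failed = False
except json.JSONDecodeError:
    extraction_failed = True
```

Switching to a non-greedy match fixes this input and breaks on nested objects instead, which is why the pattern never converged.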

Two-pass “ask the model to fix its own JSON”

The 2024 default: first call emits something, second call is “return only valid JSON from the following.” Failure rates dropped to about 0.4% on community benchmarks, and the second call cost approximately 40% of the first call’s token budget. A strict-schema call with a one-shot retry on validator error costs less and fails less often. Dead too. The most-shared LinkedIn post in March 2026 was from a practitioner who ripped out a 400-line “fix-my-JSON” retry class and replaced it with six lines of Pydantic. The pattern is repeated on r/ChatGPTCoding most weeks.

Freestyle “agent calls a tool that emits JSON”

Popular in 2024 when tool-use interfaces were new. Correct in principle; the failure mode in practice is that the model skips the tool on a small but meaningful fraction of calls and writes JSON in the message body instead. If using tool-use to force structured output, add an explicit check that the message body is empty and no text content exists alongside the tool use block. The TCC editorial harness runs that check on every turn; it catches a small but non-zero fraction of calls where the model returns both a text response and a tool call.
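The body check can be sketched as follows, assuming the response content has been converted to a list of plain dictionaries shaped like Anthropic content blocks (`type`, `text`, `name`, `input`). The function name `extract_forced_tool_input` is hypothetical, and this is not the TCC harness code.

```python
def extract_forced_tool_input(content: list, tool_name: str) -> dict:
    """Return the tool input only if the reply is exactly one clean tool call."""
    tool_calls = [b for b in content if b.get("type") == "tool_use"]
    stray_text = [
        b for b in content
        if b.get("type") == "text" and b.get("text", "").strip()
    ]
    if len(tool_calls) != 1:
        raise ValueError(f"expected exactly one tool_use block, got {len(tool_calls)}")
    if tool_calls[0].get("name") != tool_name:
        raise ValueError(f"unexpected tool: {tool_calls[0].get('name')}")
    if stray_text:
        raise ValueError("model returned text alongside the tool call")
    return tool_calls[0]["input"]

clean = [{"type": "tool_use", "name": "extract_invoice", "input": {"id": "INV-000123"}}]
mixed = [{"type": "text", "text": "Here you go:"}] + clean

invoice = extract_forced_tool_input(clean, "extract_invoice")
```

Whether text alongside a tool call should fail the turn or only be logged is a policy choice; the strict version above treats it as a failure.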

Streaming structured outputs

All three major providers support streaming with structured outputs, allowing partial schema-valid JSON to be processed incrementally. The streaming interface delivers tokens as they are generated, with the guarantee that the final assembled output will be schema-valid. This is useful for long-form extractions where you want to process results before the full response is complete.

// OpenAI streaming structured output (TypeScript)
const stream = await client.beta.chat.completions.stream({
  model: "gpt-5.3-codex",
  messages: [{ role: "user", content: rawEmailBody }],
  response_format: zodResponseFormat(Invoice, "invoice"),
});

// Process partial results as they arrive
stream.on("refusal.delta", ({ delta }) => {
  process.stderr.write(delta);
});

const completion = await stream.finalChatCompletion();
const invoice = completion.choices[0].message.parsed;

Handle the refusal event in the streaming path as well. If the model starts to refuse, the refusal delta arrives in the stream before the stream completes, allowing early termination and error handling.

Testing structured outputs in CI

The structured output pipeline should have its own test layer in CI, separate from the unit tests for the rest of the application. Three test categories that catch the most production failures:

  1. Schema validation tests. Feed a sample of historical production inputs through the pipeline and assert that every output is schema-valid. Run against the model’s test API key with reduced rate limits. Catch schema regressions before they reach production.
  2. Edge case inputs. Test inputs that are known to cause model confusion: empty strings, very long strings, special characters (quotes, backslashes, Unicode), null-like values, and inputs where the correct extraction requires choosing between enum values that are semantically close. These are the inputs that generate the 0.3/10K failures in production.
  3. Refusal tests. Include inputs that should trigger a refusal (requests to extract PII in contexts where it should not be extracted) and assert that the refusal is handled gracefully, not as a null-pointer exception.
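The three categories above can share one small runner. This is a sketch, assuming a pipeline entry point that takes raw text and either returns a dictionary or raises a refusal error; `Refusal`, `run_edge_cases`, `validate_invoice`, and `stub_extract` are hypothetical names, and the stub stands in for the real model call.

```python
import re

class Refusal(Exception):
    """Hypothetical error the pipeline raises when the model refuses."""

EDGE_CASE_INPUTS = [
    "",                                            # empty string
    "x" * 50_000,                                  # very long string
    'Invoice "INV-000123" for C:\\billing\\acme',  # quotes and backslashes
    "Facture n\u00b0 INV-000123 \u2603",           # non-ASCII characters
    "null",                                        # null-like value
]

def validate_invoice(data: dict) -> None:
    """Stand-in for the real schema parser."""
    if not re.fullmatch(r"INV-\d{6}", str(data.get("id", ""))):
        raise ValueError("bad id")
    if type(data.get("total_cents")) is not int:
        raise ValueError("total_cents must be an integer")

def run_edge_cases(extract, validate=validate_invoice) -> list:
    """Return (input preview, error) pairs for every unhandled edge case."""
    failures = []
    for text in EDGE_CASE_INPUTS:
        try:
            validate(extract(text))
        except Refusal:
            continue  # a refusal surfaced as a typed error counts as handled
        except Exception as err:
            failures.append((text[:40], repr(err)))
    return failures

def stub_extract(text: str) -> dict:
    """Fake pipeline: refuses on empty input, otherwise returns a valid record."""
    if not text:
        raise Refusal("nothing to extract")
    return {"id": "INV-000123", "total_cents": 0}

failures = run_edge_cases(stub_extract)
```

In CI the assertion is simply that `failures` is empty; the previews in each pair make a red build readable without dumping 50,000-character inputs.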

The deterministic grader to validate structured output in an eval harness, without using an LLM judge, is on the evals-without-judges post.

Context caching for high-volume structured extraction

For pipelines that run the same system prompt (schema definition, extraction instructions) across many user messages, context caching reduces costs significantly. On Gemini 3.1 Pro, caching the system prompt and schema definition cuts costs by up to 90% on repeated-context workloads ($0.20-0.40/M cached tokens vs $2.00/M for standard input). For OpenAI, prompt caching applies automatically to prompts over 1,024 tokens at 50% discount.

This is most impactful for data extraction pipelines that process many documents with the same schema. The schema definition, field descriptions, and examples in the system prompt are identical across every call; caching eliminates the cost of re-encoding those tokens each time.
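A worked example of the saving, using the Gemini prices quoted above and assumed token counts (a 4,000-token shared prefix and a 500-token document per call). The function `input_cost_usd` is illustrative, and the sketch ignores cache storage fees and output tokens.

```python
def input_cost_usd(shared_tokens, unique_tokens, price_per_m, cached_price_per_m=None):
    """Input cost of one call; the shared prefix is billed at the cached rate if given."""
    shared_rate = price_per_m if cached_price_per_m is None else cached_price_per_m
    return (shared_tokens * shared_rate + unique_tokens * price_per_m) / 1_000_000

# Assumed workload: schema and instructions are shared, the document is unique.
uncached = input_cost_usd(4_000, 500, 2.00)
cached = input_cost_usd(4_000, 500, 2.00, cached_price_per_m=0.20)

saving = 1 - cached / uncached          # saving on total input cost
cached_token_discount = 1 - 0.20 / 2.00  # discount on the cached tokens alone
```

With these numbers the cached tokens are 90% cheaper, while the total input bill drops by about 80%, because the unique part of each prompt is still billed at the full rate. The larger the shared prefix is relative to the document, the closer the total saving gets to 90%.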

What the threads are saying

The r/LocalLLaMA threads on structured output have converged on two pieces of advice: use the native structured-output flag, and use instructor or a similar wrapper on the client. Several Hacker News threads in Q1 2026 called out that constrained-decoding implementations are now fast enough to run at scale without the latency cost that made them impractical in 2024.

FAQ

What is the difference between JSON mode and strict structured outputs?

JSON mode (response_format: {type: "json_object"}) only guarantees syntactically valid JSON; the model can still emit wrong field types, missing required fields, or values outside defined enums. Strict structured outputs (json_schema with strict: true) use constrained decoding, compiling the schema into a finite state machine that masks invalid tokens during generation, making schema violations mathematically impossible. Use strict mode for production. JSON mode is legacy.

What is constrained decoding?

Constrained decoding compiles a JSON schema into a finite state machine that runs alongside token generation. Invalid tokens are masked at each generation step, making it impossible for the model to emit a token that would violate the schema. The result is guaranteed schema-valid output, not probabilistic compliance. This is the underlying mechanism that dropped failure rates from 1.8% to 0.003% (0.3 per 10,000 calls).

Which provider has the best strict JSON failure rate?

GPT-5.3-Codex with strict mode has the lowest failure rate in the TCC editorial fixture (9.0/10 score; near-zero unescaped-quote and trailing-comma failures). Claude Opus 4.7 via the tool-use interface scores 8.2. Gemini 3.1 Pro scores 7.9. All three are production-usable; GPT-5.3-Codex is the tightest for high-volume strict-JSON pipelines.

How should I handle refusals in structured outputs?

Always check message.refusal before accessing message.parsed. A refusal is a first-class error mode where the model determines the request violates safety guidelines and returns a refusal object instead of structured output. A pipeline that skips the refusal check and accesses message.parsed directly will get a null value and a null-pointer exception downstream. Handle refusals as exceptions in your error handling logic.

What is the instructor library?

Instructor is a Python library that wraps the OpenAI and Anthropic clients with automatic schema enforcement, retry handling on validator errors, and Pydantic model integration. It handles the retry-once-on-error pattern and the refusal check automatically, reducing the structured output boilerplate to a few lines. It is the community-recommended wrapper for high-volume production pipelines.

Should I use Pydantic or Zod for schema definition?

Pydantic for Python pipelines, Zod for TypeScript pipelines. Both produce JSON Schema definitions that the vendor strict-mode interfaces accept. Both provide field-level validation with descriptive error messages when the parser gate fires. The OpenAI Python client has native Pydantic integration via .parse(); the TypeScript client has native Zod integration via zodResponseFormat(). Use the library that matches your runtime.

What failure rate should I target for production structured outputs?

The TCC strict-schema path runs at 0.3 failures per 10,000 calls (0.003%). For most production use cases, this is the practical floor with current constrained-decoding implementations. If your pipeline is significantly above this rate, you are likely not using strict mode, have schemas deeper than 7 levels of nesting, or are processing inputs with edge cases (special characters, very long strings) that are not in your test suite.

The prompt to pair with a strict-JSON call, with explicit instructions to stop on the closing brace, is on the strict-JSON prompt post. The deterministic grader to validate structured output in an eval harness is on the evals-without-judges post. The model that owns the strict-JSON TCC crown is the GPT-5.3-Codex review.

Verdict

The pattern that survives three years of production is short: define a schema in Pydantic or Zod, use the native structured-output flag in strict mode, parse on the client, check the refusal field first, retry once on validator error, fail the turn if the retry does not land. Every pattern that asked the model to be nice to a regex has retired. The 0.3 per 10,000 failure rate is not the model being smarter. It is the stack being simpler. If your pipeline is above this rate, the likely fix is enabling strict mode rather than tuning prompts.

§ FAQ

Frequently asked questions

Is strict JSON mode reliable enough for production in 2026?

Yes, for GPT-5.4, Claude Opus 4.7 and Sonnet 4.6, and Gemini 3.1 Pro, with caveats. Use provider-native structured output (not prompt-only JSON), pin schema versions, and keep a defensive parser for the 0.1 to 0.3 percent tail that still drifts under load.

What breaks structured outputs under load?

Long sessions with large function-call histories, aggressive truncation at context boundaries, and tool-use loops that hit the max-steps cap mid-JSON. Set explicit give-up-cleanly exits and cap max_tokens well above your expected payload.
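A sketch of the give-up-cleanly exit for truncation, assuming the stop reason has been read off the response: OpenAI chat completions report `finish_reason == "length"` and Anthropic messages report `stop_reason == "max_tokens"` when the output hits the cap. `TruncatedOutput`, `ensure_complete`, `max_tokens_for`, and the 2x headroom factor are illustrative choices.

```python
from typing import Optional

class TruncatedOutput(Exception):
    """Raised instead of handing half a JSON document to the parser."""

# Stop reasons that mean the output was cut off at the token cap.
TRUNCATION_REASONS = {"length", "max_tokens"}

def max_tokens_for(expected_payload_tokens: int, headroom: float = 2.0) -> int:
    """Cap output well above the expected payload size."""
    return int(expected_payload_tokens * headroom)

def ensure_complete(stop_reason: Optional[str]) -> None:
    """Fail fast when the model stopped because it ran out of tokens."""
    if stop_reason in TRUNCATION_REASONS:
        raise TruncatedOutput(f"output truncated (stop reason: {stop_reason})")

budget = max_tokens_for(400)
ensure_complete("stop")  # a normal completion passes silently
```

Raising a dedicated error keeps truncation out of the retry path: repeating the same call with the same cap would fail the same way.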

Do I still need Zod/Pydantic validation if the provider enforces the schema?

Always, yes. Provider-side validation handles shape and types, but business invariants (sum-to-total, date ranges, referential integrity) must live in your own validator. I have never shipped a production pipeline without one.
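A minimal example of such a validator, run after the schema check has passed. The fields `issued_on` and `due_on` are hypothetical additions to the invoice used earlier, included only to show a date-range invariant next to the sum-to-total one.

```python
from datetime import date

def invariant_errors(invoice: dict) -> list:
    """Business rules the schema cannot express; returns human-readable errors."""
    errors = []
    line_sum = sum(item["amount_cents"] for item in invoice["line_items"])
    if line_sum != invoice["total_cents"]:
        errors.append(
            f"line items sum to {line_sum}, total_cents is {invoice['total_cents']}"
        )
    if date.fromisoformat(invoice["issued_on"]) > date.fromisoformat(invoice["due_on"]):
        errors.append("issued_on is after due_on")
    return errors

good = {
    "total_cents": 1500,
    "line_items": [{"amount_cents": 1000}, {"amount_cents": 500}],
    "issued_on": "2026-04-01",
    "due_on": "2026-04-30",
}
bad = dict(good, total_cents=1400, due_on="2026-03-01")

good_errors = invariant_errors(good)
bad_errors = invariant_errors(bad)
```

Both objects would pass a shape-and-types check; only the second check tells them apart.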
