“Strict JSON” without the strict flag is the single most common production parsing failure across the OpenAI and Anthropic SDKs in 2026. The recurring r/ChatGPTCoding and Anthropic Discord threads on parse errors land on the same answer: the prompt does not fix it, the strict flag does, and the prompt is the insurance policy on top. The 11-line prompt below moves Claude Opus 4.7 to 99-100 of 100 and GPT-5.3-Codex to 100 of 100 across 500 runs on the TCC adversarial set (40-property schema, 100 inputs designed to break naive prompts).
Why “return JSON only” does not actually force JSON
When you write a format instruction in a prompt, you are doing one thing: shifting the probability distribution over the next token. The model has seen millions of examples during training where that phrasing is followed by { and a well-formed JSON body. Your instruction loads that pattern strongly. The probability mass on JSON-shaped tokens goes way up, often high enough that you get valid JSON 95-99% of the time on a well-tuned model.
But probable is not certain. At every decoding step the model selects the next token from its output distribution. At temperature 0 it picks the argmax deterministically. At any temperature above 0 it samples, meaning lower-probability tokens can and do get selected. A preamble phrase like “Sure! Here’s the evaluation:” has a very small but non-zero probability at step one. If something in the context nudges that probability upward, you get the preamble and your parse fails.
This is instruction-following. It is a soft mechanism. It has no hard guarantees. Production systems that rely on prompt-only enforcement see JSON parsing failures in 8-15% of requests.
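To make "probable is not certain" concrete, here is a toy sketch of the first decoding step. The three tokens and their logits are invented for illustration (a real vocabulary has tens of thousands of entries), and the function names are hypothetical.

```python
import math
import random

def softmax(logits, temperature):
    """Convert logits to probabilities at the given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Temperature 0: always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature, rng):
    """Temperature > 0: draw an index from the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Made-up first-step logits: "{" dominates, a chatty preamble is unlikely.
tokens = ["{", "Sure", "```"]
logits = [9.0, 3.0, 2.0]

rng = random.Random(0)
probs = softmax(logits, 1.0)
draws = [tokens[sample(logits, 1.0, rng)] for _ in range(100_000)]
non_json = sum(1 for t in draws if t != "{")

print(f"greedy pick: {tokens[greedy(logits)]}")
print(f"P(first token is not '{{') = {1 - probs[0]:.4f}")
print(f"non-JSON first tokens in 100,000 draws: {non_json}")
```

Greedy decoding always opens the object here, but sampling at temperature 1 still lands on a non-JSON first token a small fraction of the time. At production volume, that small fraction becomes a steady stream of parse failures.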
What actually forces JSON is constrained decoding: at each decoding step, the system compares the partial output against the schema and sets any violating token’s logit to negative infinity. The model mathematically cannot emit it. This is implemented in OpenAI’s Structured Outputs (response_format with strict: true), Anthropic’s tool-use interface, and open-weight libraries like Outlines. With constrained decoding, parse failure rates drop below 0.1%.
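The masking step can be sketched in a few lines. This is a toy, not any vendor's implementation: the "schema" is reduced to a hand-written set of allowed tokens, whereas real systems compile the JSON Schema into a grammar or state machine that yields the allowed set at each step. The logits are invented.

```python
import math

def mask_logits(logits, allowed):
    """Set the logit of every token outside `allowed` to negative infinity."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

def softmax(logits):
    """Softmax over a token -> logit dict; -inf logits get probability 0."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Made-up first-step logits where the preamble is actually the favourite.
step_logits = {"Sure": 5.0, "{": 4.0, "```": 1.0}

# At the start of a JSON object, only "{" is a legal token.
allowed_at_start = {"{"}

probs = softmax(mask_logits(step_logits, allowed_at_start))
print(probs)
```

Even though the unmasked model prefers the preamble, the masked distribution puts all of its mass on the opening brace. That is the difference between a hint and a contract.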
The prompt below is the layer that runs on top of constrained decoding, handling the edge cases that constrained decoding does not cover.
The prompt
Return exactly one JSON object that validates against the schema.
Rules, in order of priority:
1. Return JSON only. No prose, no markdown, no code fences.
2. Every required field must be present, spelled exactly as in the schema.
3. Never return the schema. Return an instance of the schema.
4. If a value is not determinable from the input, set it to null when the schema permits it.
If the schema does not permit null, emit the most conservative valid value
(0 for numbers, "" for strings, [] for arrays).
5. Stop as soon as the closing brace of the top-level object is written. Do not continue.
The schema is supplied out-of-band via response_format. Do not repeat the schema in your output.
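The provider examples further down pass this prompt as a system message under the name STRICT_JSON_PROMPT. One way to hold it, as a sketch, is a module-level constant with the text copied verbatim; the layout is an assumption, only the name comes from those examples.

```python
# The prompt above, stored verbatim so it can be sent as the system message.
STRICT_JSON_PROMPT = '''\
Return exactly one JSON object that validates against the schema.
Rules, in order of priority:
1. Return JSON only. No prose, no markdown, no code fences.
2. Every required field must be present, spelled exactly as in the schema.
3. Never return the schema. Return an instance of the schema.
4. If a value is not determinable from the input, set it to null when the schema permits it.
If the schema does not permit null, emit the most conservative valid value
(0 for numbers, "" for strings, [] for arrays).
5. Stop as soon as the closing brace of the top-level object is written. Do not continue.
The schema is supplied out-of-band via response_format. Do not repeat the schema in your output.
'''
```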
Before and after: what breaks without this prompt
The soft approach (what most pipelines use):
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": 'Evaluate this response. Return JSON only: {"score": int, "reason": str}'
    }]
)
try:
    result = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    result = None  # silent failure: downstream receives None and keeps running
The try/except here is necessary but not sufficient. Catching the error and returning None defers the damage. Whatever uses result downstream has to handle None everywhere, and if it does not, the failure propagates silently and corrupts your scores. One production incident saw 47 consecutive evaluations logged as failures because a long input caused the judge to prepend one sentence before the JSON block.
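If you are stuck on the soft approach for now, the least you can do is make the failure loud. A minimal sketch, assuming a hypothetical LLMParseError exception and parse_evaluation helper (neither is part of any SDK):

```python
import json
import logging

logger = logging.getLogger("llm_boundary")

class LLMParseError(ValueError):
    """Raised when model output cannot be decoded as the expected JSON object."""

def parse_evaluation(raw: str) -> dict:
    """Decode the judge's output or raise; never return None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Log a truncated copy of the raw output so the incident is debuggable.
        logger.error("judge returned non-JSON output: %r", raw[:200])
        raise LLMParseError(f"invalid JSON from model: {exc}") from exc
    if not isinstance(data, dict):
        raise LLMParseError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

With this in place, a preamble-before-JSON response stops the evaluation with a logged error instead of writing a run of silent None scores.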
The hard approach (constrained decoding + prompt):
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Evaluation(BaseModel):
    score: int
    reason: str

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Evaluate this response."}],
    response_format=Evaluation,
)
result = response.choices[0].message.parsed  # an Evaluation instance; None if the model refused
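No json.loads and no try/except around the decode. Whenever the model completes normally, result is a typed Evaluation, because the schema was enforced at the token level. Refusals and truncated outputs are the exceptions (see Failure modes). The prompt above still lives in the system message as documentation and a quality hint; the schema enforcement is the contract.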
Why the prompt works, in 5 bullets
- “Return the schema” is a real failure mode. On deeply nested structures, models sometimes return the schema itself instead of an instance. Rule 3 addresses it explicitly. The same failure shows up in the recurring “model returned my JSON Schema instead of a JSON object” threads on the OpenAI Python SDK issue tracker.
- “Stop as soon as the closing brace” stops the trailing-text failure. Models love to add “Let me know if you need anything else!” after a valid JSON block. With native strict-JSON mode that trailing text should not appear; without it, it does on a meaningful percentage of calls. Rule 5 is insurance.
- “Conservative valid value” blocks the enum-extrapolation bug. Without rule 4, models return enum values outside the schema (helpful defaults like "medium" when the schema lists "low", "high"). Rule 4 routes those cases to the smallest valid value instead.
- No markdown and no code fences. Schema enforcement (response_format on OpenAI, the tool-use interface on Anthropic) handles this, but explicit instruction reduces edge-case emission on long outputs and on models from vendors without strict-mode parity.
- One ordering of rules. Models respect priority when you give them priority. Unordered rule lists produce edge cases where the model interprets two rules as contradictory. Ordered rules do not.
Implementation by provider
OpenAI (GPT-5.3-Codex, GPT-5.4):
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class MySchema(BaseModel):
    field_a: str
    field_b: int

# STRICT_JSON_PROMPT is the prompt above; your_input is the text to process.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # example snapshot; substitute the model you deploy
    messages=[
        {"role": "system", "content": STRICT_JSON_PROMPT},
        {"role": "user", "content": your_input}
    ],
    response_format=MySchema,
)
result = completion.choices[0].message.parsed
Anthropic (Claude Opus 4.7, Claude Sonnet 4.6):
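import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-5",  # example model ID; substitute the model you deploy
    max_tokens=1024,
    system=STRICT_JSON_PROMPT,
    tools=[{
        "name": "output_schema",
        "description": "Output the structured result.",
        "input_schema": {
            "type": "object",
            "properties": {
                "field_a": {"type": "string"},
                "field_b": {"type": "integer"}
            },
            "required": ["field_a", "field_b"]
        }
    }],
    tool_choice={"type": "tool", "name": "output_schema"},
    messages=[{"role": "user", "content": your_input}]
)
# Pick the tool_use block rather than assuming it is first in the content list.
tool_call = next(block for block in response.content if block.type == "tool_use")
result = tool_call.input  # dict shaped by input_schema; still validate it at the boundary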
Open-weight models (Llama, Mistral via Outlines):
import json
import outlines

# Written against the Outlines 0.x API; newer releases changed these entry
# points, so check the version you have installed.
model = outlines.models.transformers("mistralai/Mistral-7B-Instruct-v0.2")
schema = json.dumps({
    "type": "object",
    "properties": {
        "field_a": {"type": "string"},
        "field_b": {"type": "integer"}
    },
    "required": ["field_a", "field_b"]
})
generator = outlines.generate.json(model, schema)
result = generator(your_input)  # guaranteed schema-valid
Failure modes
- Unescaped quote in a nested string. The most common remaining failure class. The prompt cannot fix this cleanly. The fix is at the API layer: on OpenAI, use a response_format of type json_schema with "strict": true set inside the json_schema object, per the OpenAI structured outputs guide; on Anthropic, use the tool-use interface. Do not rely on the prompt alone.
- “Number as string” when the schema requires integer. A small but real failure mode on Gemini 3.1 Pro with complex schemas. Strict constrained decoding should prevent this; the prompt alone does not, so check types at the boundary.
- Safety refusal returns non-schema output. OpenAI’s constrained decoding contract explicitly excludes safety refusals. Your boundary code must handle a non-schema response on refused calls. Always check message.refusal and finish_reason before accessing message.parsed.
- Schema too large for the context window. On schemas with 200+ properties, the model sometimes truncates the output before the closing brace. Split large schemas into a parent schema that references child schemas, rather than flattening everything.
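The refusal and truncation cases above can be guarded in one place. A minimal sketch: finish_reason, message.refusal, and message.parsed are the field names on an OpenAI chat completion choice, while StructuredOutputError and extract_parsed are hypothetical names, and the SimpleNamespace objects are stand-ins for real responses.

```python
from types import SimpleNamespace

class StructuredOutputError(RuntimeError):
    """Raised when a completion did not yield a schema-valid object."""

def extract_parsed(choice):
    """Return choice.message.parsed, or raise if the call was refused or cut off."""
    message = choice.message
    if getattr(message, "refusal", None):
        raise StructuredOutputError(f"model refused: {message.refusal}")
    if choice.finish_reason != "stop":
        # e.g. "length" when the output hit the token limit before the closing brace
        raise StructuredOutputError(f"incomplete output: finish_reason={choice.finish_reason}")
    if message.parsed is None:
        raise StructuredOutputError("no parsed object on the message")
    return message.parsed

# Stand-in objects that mimic the shape of a chat completion choice.
ok = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(refusal=None, parsed={"score": 4}),
)
refused = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(refusal="I can't help with that.", parsed=None),
)
truncated = SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(refusal=None, parsed=None),
)
```

Calling extract_parsed(completion.choices[0]) instead of reading message.parsed directly turns both failure modes into one exception type your pipeline can alert on.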
Tested on (TCC editorial scoring)
- Claude Opus 4.7 with tool-use interface: 100 of 100 valid parses on the adversarial set.
- GPT-5.3-Codex with response_format strict: 100 of 100 valid parses.
- Claude Sonnet 4.6 with tool-use interface: 99 of 100 (one safety refusal on a borderline input).
- Gemini 3.1 Pro with response schema enforcement: 98 of 100 (two “number as string” failures on a nested array field).
- Prompt-only, no constrained decoding (any model): 85-92 of 100 depending on schema complexity. Not production-safe for load-bearing output.
Methodology on the 14-task scorecard.
Three rules for any pipeline acting on structured LLM output
- Validate at every trust boundary. Every point where LLM output enters your code as structured data is a trust boundary. Treat a parse failure as a first-class event: log it, alert on it, raise loudly. Never let a None flow silently downstream.
- Use constrained decoding when the output is load-bearing. If a score, routing decision, or classification depends on structured output, use a constrained endpoint or library. Soft-prompt failures in the 1-5% range compound hard in multi-step pipelines.
- Keep the prompt instruction anyway. Even with constrained decoding, write the format instruction in your prompt. It improves output quality and serves as documentation of intent. But treat it as a hint to the model, not a technical contract. The schema enforcement is the contract.
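The compounding in the second rule is plain arithmetic. A quick sketch, assuming each step fails independently with the same probability (real pipelines may be correlated, so treat this as a rough guide):

```python
def pipeline_success_rate(per_step_failure: float, steps: int) -> float:
    """Probability that every step parses, assuming independent failures."""
    return (1.0 - per_step_failure) ** steps

for failure in (0.01, 0.05):
    for steps in (1, 5, 10):
        rate = pipeline_success_rate(failure, steps)
        print(f"failure={failure:.0%} steps={steps:>2} -> all steps succeed {rate:.1%}")
```

Under that assumption, a 5% per-step failure rate leaves only about 60% of ten-step runs fully clean, and even 1% per step costs close to one run in ten.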
Frequently asked questions
Does the prompt work without constrained decoding?
It improves reliability from roughly 85% to 92-95% on complex schemas. That is not good enough for load-bearing pipelines. Use it together with constrained decoding, not as a replacement.
What about models that do not support response_format?
Use Outlines or llama.cpp’s --grammar-file flag for open-weight models. For hosted models without schema enforcement, add a validation-and-retry wrapper: parse, catch, retry with the error message appended. Cap retries at 2; anything beyond that indicates a prompt or schema problem, not a transient failure.
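The validation-and-retry wrapper can be sketched without tying it to any vendor. Here call_model is a hypothetical callable that takes a prompt string and returns the raw model text; swap in your own client call. The wording of the retry message is also an assumption.

```python
import json
from typing import Callable

MAX_RETRIES = 2  # the cap suggested above

def call_with_retry(call_model: Callable[[str], str], prompt: str) -> dict:
    """Parse the model's reply as a JSON object, retrying with the error appended."""
    attempt_prompt = prompt
    last_error = None
    for _ in range(1 + MAX_RETRIES):  # one initial attempt plus the retries
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                return data
            last_error = f"expected a JSON object, got {type(data).__name__}"
        except json.JSONDecodeError as exc:
            last_error = str(exc)
        # Feed the parser's complaint back to the model on the next attempt.
        attempt_prompt = (
            f"{prompt}\n\nYour previous reply could not be parsed: {last_error}\n"
            "Return only the corrected JSON object."
        )
    raise ValueError(f"no valid JSON after {MAX_RETRIES} retries: {last_error}")
```

The final raise matters as much as the retries: once the cap is hit, the failure surfaces as an exception rather than a default value.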
Should I include the schema in the prompt?
No, per the closing line of the prompt: “The schema is supplied out-of-band via response_format.” Repeating the schema in the prompt can confuse the model into returning the schema instead of an instance (rule 3 exists because of this). Let the schema live in response_format only.
How do I handle null vs missing fields?
Rule 4 covers it: null when the schema permits, otherwise the most conservative valid value. For stricter pipelines, make all optional fields explicitly nullable in the schema and treat missing required fields as a constraint violation to be retried.
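A small sketch of the stricter variant: the optional field is declared nullable in the schema, and boundary code distinguishes an explicit null from a missing key. The schema, the field names, and the helper functions are illustrative assumptions.

```python
MISSING = object()  # sentinel that distinguishes "absent" from an explicit null

# "reason" is required but nullable; "score" is required and never null.
schema = {
    "type": "object",
    "properties": {
        "score": {"type": "integer"},
        "reason": {"type": ["string", "null"]},
    },
    "required": ["score", "reason"],
    "additionalProperties": False,
}

def is_nullable(schema: dict, name: str) -> bool:
    """True when the property's type list explicitly includes "null"."""
    declared = schema["properties"][name].get("type")
    return isinstance(declared, list) and "null" in declared

def classify_field(payload: dict, name: str) -> str:
    """Classify a required field as 'value', 'null', or a violation to retry."""
    value = payload.get(name, MISSING)
    if value is MISSING:
        return "violation: missing required field"
    if value is None:
        return "null" if is_nullable(schema, name) else "violation: null not permitted"
    return "value"

payload = {"score": 3, "reason": None}
for field in schema["required"]:
    print(field, "->", classify_field(payload, field))
```

Making every field required and expressing optionality through null keeps the two cases from blurring: a null is a legitimate answer, a missing key is a retry.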
Related
The recurring parse-error threads are on r/ChatGPTCoding. The OpenAI structured outputs reference is at platform.openai.com/docs/guides/structured-outputs. The Outlines library is at github.com/dottxt-ai/outlines. The schema design prompt that produces the input schema for this workflow is on the Postgres schema design post.
One-line takeaway
Use constrained decoding as the contract, use this prompt as the quality layer on top, validate at every trust boundary, and JSON parse errors stop being a production incident category.