GPT-5.4 shipped March 5, 2026. The 10 parameters below cover every knob a production call site needs to set explicitly, and the 3 footguns are the settings that break teams fastest. Scroll past those for a full working Python example, the 272K-token surcharge breakdown, and the computer use section, which covers a feature that lives on a different API endpoint than you might expect.
By Ryan Calloway. Updated May 2026.
model="gpt-5.4" for model="gpt-5.5".
The 10 parameters to set explicitly
| Parameter | API default | Production value | Why |
|---|---|---|---|
| model | none | gpt-5.4 or a dated snapshot | Pin the version. Floating -latest breaks reproducibility across deploys. |
| reasoning_effort | medium | medium for most work, high for planning and debugging | Higher effort costs 2-4x in hidden reasoning tokens. Log usage.output_tokens_details.reasoning_tokens to see the real spend. |
| response_format | text | {"type":"json_schema","strict":true,...} | Strict mode compiles the schema into constrained decoding. JSON parse errors become structurally impossible. |
| tool_choice | auto | "required" or {"type":"function","function":{"name":"..."}} | Forces the model to call the tool you registered. auto returns prose on a non-trivial fraction of calls. |
| max_completion_tokens | model max | 2048-8192 depending on task | Caps runaway generations. The unbounded billing line is the most common surprise in post-mortems. |
| temperature | 1.0 | 0.0 for deterministic extraction, 0.4 for test generation | 0 is not fully deterministic on its own. Pair with seed for replay fidelity. |
| seed | none | fixed int per call class | Reproducibility on logged replays. Same seed plus same temperature gives statistically consistent outputs. |
| top_logprobs | none | 5 on eval runs | Margin-of-confidence at eval time without a second API call. |
| metadata | none | {"eval_id":..., "call_class":...} | Tags calls for post-hoc analysis in the OpenAI usage dashboard. |
| prediction | none | Previous output on re-generation tasks | Speculative decoding. Cuts latency materially on test-gen and template-fill workloads. Conflicts with stream on some SDK versions; test first. |
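One way to keep these values from drifting across call sites is to build request arguments from a single preset table. The sketch below is illustrative only: `CALL_CLASS_PRESETS`, `build_request`, and the call-class names are hypothetical, and the numbers are taken from the table above rather than from any official recommendation.

```python
from typing import Any

# Hypothetical presets per call class; values mirror the parameter table.
CALL_CLASS_PRESETS: dict[str, dict[str, Any]] = {
    "extraction": {
        "reasoning_effort": "medium",
        "temperature": 0.0,
        "max_completion_tokens": 2048,
    },
    "test_generation": {
        "reasoning_effort": "medium",
        "temperature": 0.4,
        "max_completion_tokens": 8192,
    },
    # No temperature here: the table gives none for planning work.
    "planning": {"reasoning_effort": "high", "max_completion_tokens": 8192},
}


def build_request(
    call_class: str,
    messages: list[dict[str, str]],
    seed: int = 42,
    model: str = "gpt-5.4",
) -> dict[str, Any]:
    """Return keyword arguments for client.chat.completions.create()."""
    preset = CALL_CLASS_PRESETS[call_class]  # unknown classes raise KeyError on purpose
    return {
        "model": model,
        "seed": seed,
        "metadata": {"call_class": call_class},
        "messages": messages,
        **preset,
    }


request = build_request(
    "extraction", [{"role": "user", "content": "Extract the invoice fields."}]
)
print(request["reasoning_effort"], request["max_completion_tokens"])
```

The resulting dict can be passed as `client.chat.completions.create(**request)`; failing loudly on an unknown call class is a design choice that keeps untagged traffic out of production.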
reasoning_effort reference
All five valid values, ordered cheapest to most expensive:
| Value | Hidden reasoning tokens | Best for |
|---|---|---|
| none | 0 | Lookups, formatting, classification. No chain-of-thought at all. |
| low | light | Structured extraction, summarization, straightforward code edits. |
| medium | moderate | Default. Balanced for most API workloads. |
| high | heavy, ~1.8x visible tokens | Code review, multi-step planning, complex debugging across files. |
| xhigh | very heavy, 3-5x medium | Genuinely hard reasoning tasks. The line item that doubles bills in week one if left on from development. |
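Tip: log usage.output_tokens_details.reasoning_tokens per request in production (on Chat Completions the same count is reported as usage.completion_tokens_details.reasoning_tokens). It is easy to set high during development and ship it without noticing. At 500 requests per day, a 2-4x increase in reasoning tokens between medium and high compounds into a visible line item by end of month.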
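To put a rough number on that tip, here is a back-of-the-envelope sketch. It assumes reasoning tokens are billed at the $15.00/M output rate from the pricing table, and the per-request token counts (1,000 at medium, 3,000 at high) are invented for illustration; substitute the values you actually log.

```python
def monthly_reasoning_cost(
    requests_per_day: int,
    reasoning_tokens_per_request: int,
    output_price_per_million: float = 15.00,  # GPT-5.4 output rate from the pricing table
    days: int = 30,
) -> float:
    """Estimate monthly spend on hidden reasoning tokens in USD."""
    total_tokens = requests_per_day * reasoning_tokens_per_request * days
    return total_tokens / 1_000_000 * output_price_per_million


# Hypothetical per-request reasoning token counts, not measured values.
medium = monthly_reasoning_cost(500, 1_000)
high = monthly_reasoning_cost(500, 3_000)  # assumed 3x medium, inside the 2-4x range
print(f"medium: ${medium:.2f}  high: ${high:.2f}  delta: ${high - medium:.2f}")
```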
The 3 settings that break you
- response_format without "strict": true. Without strict mode the model emits JSON-flavored text and your parser rejects a non-trivial percentage on high-volume traffic. Always set "strict": true and supply the full schema. The OpenAI structured outputs guide covers the format. Strict mode compiles the schema into constrained decoding at the server side; it is not a prompt hint.
- tool_choice left at auto. If you registered a tool because the workflow requires it, set tool_choice to "required" or pin the specific function name. With auto, the model answers in prose on a meaningful fraction of calls. That is a spec violation for any pipeline expecting structured tool output.
- max_completion_tokens unset. The model runs to the model maximum on error or ambiguous prompts. An unbounded call at reasoning_effort="high" on a vague prompt can generate thousands of tokens. This is the most common line item in reader-submitted billing post-mortems.
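The three checks above are mechanical enough to run before a request leaves your code. A minimal sketch, assuming requests are assembled as plain keyword-argument dicts; `lint_request` is a hypothetical helper, not part of the OpenAI SDK.

```python
from typing import Any


def lint_request(kwargs: dict[str, Any]) -> list[str]:
    """Return a list of footgun warnings for a Chat Completions request dict."""
    problems: list[str] = []

    # Footgun 1: a JSON schema response format without strict mode.
    fmt = kwargs.get("response_format")
    if isinstance(fmt, dict) and fmt.get("type") == "json_schema":
        if not fmt.get("json_schema", {}).get("strict"):
            problems.append("response_format is json_schema without strict=True")

    # Footgun 2: tools registered but tool_choice left at (or defaulting to) auto.
    if kwargs.get("tools") and kwargs.get("tool_choice", "auto") == "auto":
        problems.append("tools registered but tool_choice left at auto")

    # Footgun 3: no cap on generated tokens.
    if "max_completion_tokens" not in kwargs:
        problems.append("max_completion_tokens is unset")

    return problems


risky = {
    "model": "gpt-5.4",
    "tools": [{"type": "function", "function": {"name": "create_ticket"}}],
    "response_format": {"type": "json_schema", "json_schema": {"name": "x", "schema": {}}},
}
for problem in lint_request(risky):
    print("WARN:", problem)
```

Running a check like this in CI or in a request wrapper is cheaper than finding the same issues in a billing post-mortem.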
Minimal production call
```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

# Strict schema: every property is listed as required and extra keys are rejected.
INVOICE_SCHEMA = {
    "name": "invoice_data",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
            "vendor": {"type": "string"},
            "amount_usd": {"type": "number"},
            "due_date": {"type": "string", "format": "date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity": {"type": "integer"},
                        "unit_price": {"type": "number"},
                    },
                    "required": ["description", "quantity", "unit_price"],
                    "additionalProperties": False,
                },
            },
        },
        "required": ["invoice_id", "vendor", "amount_usd", "due_date", "line_items"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="medium",
    response_format={"type": "json_schema", "json_schema": INVOICE_SCHEMA},
    max_completion_tokens=2048,
    temperature=0.0,
    seed=42,
    metadata={"call_class": "invoice_extraction", "eval_id": "inv-001"},
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the text. Return only the JSON structure.",
        },
        {
            "role": "user",
            "content": "Invoice #INV-2026-0512 from Acme Corp. Due 2026-06-01. "
            "Line items: 10x API units @ $5.00, 2x Support hours @ $150.00. Total $350.00.",
        },
    ],
)

data = json.loads(response.choices[0].message.content)
# Chat Completions reports reasoning tokens under completion_tokens_details;
# output_tokens_details is the Responses API name for the same count.
reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens
print(f"Extracted: {data}")
print(f"Reasoning tokens used: {reasoning_tokens}")
```
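Strict mode guarantees the shape of complete output, but a response cut off by max_completion_tokens or replaced by a refusal still cannot be parsed. The helper below is a hedged sketch of a guard around the json.loads call above; `parse_structured` is a hypothetical name, and it takes plain values so it can be exercised without a network call.

```python
import json
from typing import Any, Optional


def parse_structured(
    content: Optional[str],
    finish_reason: str,
    refusal: Optional[str] = None,
) -> dict[str, Any]:
    """Parse a structured-output message, failing loudly on refusal or truncation."""
    if refusal:
        raise ValueError(f"model refused: {refusal}")
    if finish_reason == "length":
        # The token cap was hit, so the JSON document is incomplete.
        raise ValueError("output truncated by max_completion_tokens")
    if not content:
        raise ValueError("empty message content")
    return json.loads(content)


# With a real response this would be called as:
#   choice = response.choices[0]
#   parse_structured(choice.message.content, choice.finish_reason, choice.message.refusal)
invoice = parse_structured('{"invoice_id": "INV-2026-0512", "amount_usd": 350.0}', "stop")
print(invoice["invoice_id"])
```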
Structured output with a tool call
When you need the model to both call a function and return typed data, combine tool_choice forced to the specific function with a tight input schema:
```python
# Reuses `client` and `json` from the previous example.
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Create a support ticket. Call only when the user is reporting a bug or outage.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                    "title": {"type": "string", "maxLength": 120},
                    "description": {"type": "string", "maxLength": 1000},
                    "component": {"type": "string"},
                },
                "required": ["severity", "title", "description", "component"],
                "additionalProperties": False,
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="medium",
    tools=tools,
    # Pin the specific function so the model cannot answer in prose.
    tool_choice={"type": "function", "function": {"name": "create_ticket"}},
    max_completion_tokens=512,
    temperature=0.0,
    messages=[
        {"role": "system", "content": "You are a support triage assistant."},
        {"role": "user", "content": "The payments API has been returning 500 errors for 20 minutes."},
    ],
)

tool_call = response.choices[0].message.tool_calls[0]
ticket = json.loads(tool_call.function.arguments)
print(ticket)
# {"severity": "critical", "title": "Payments API 500 errors", ...}
```
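Once the arguments are parsed, something has to execute the call. One possible shape is a small dispatch table keyed by function name, sketched below; the `create_ticket` body and the returned ticket ID are placeholders for whatever your ticketing system exposes.

```python
import json
from typing import Any, Callable


def create_ticket(severity: str, title: str, description: str, component: str) -> dict[str, Any]:
    """Hypothetical handler; a real one would call your ticketing system."""
    return {"id": "TICKET-1", "severity": severity, "title": title, "component": component}


# Map tool names registered with the API to local handlers.
HANDLERS: dict[str, Callable[..., Any]] = {"create_ticket": create_ticket}


def dispatch(name: str, arguments_json: str) -> Any:
    """Look up the handler for a tool call and invoke it with the parsed arguments."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise KeyError(f"no handler registered for tool {name!r}")
    return handler(**json.loads(arguments_json))


# Stand-in for tool_call.function.name and tool_call.function.arguments.
result = dispatch(
    "create_ticket",
    '{"severity": "critical", "title": "Payments API 500 errors", '
    '"description": "500s for 20 minutes", "component": "payments"}',
)
print(result["id"], result["severity"])
```

Because the schema sets additionalProperties to false and requires every field, unpacking the parsed arguments directly into the handler is reasonable here; with a looser schema you would validate first.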
Computer use (Responses API only)
Computer use on GPT-5.4 scores 75.0% on OSWorld-Verified, above the 72.4% human expert baseline. The catch: it lives on the Responses API (/v1/responses), not the Chat Completions endpoint. If your stack targets /v1/chat/completions, you need a separate integration path.
```python
response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use"}],
    input=[{
        "role": "user",
        "content": "Open the browser, go to the admin dashboard, and export the monthly report as CSV.",
    }],
    truncation="auto",
)
```
Three things to know before shipping computer use:
- Latency per action step is 8-15 seconds. It is not suitable for CI pipelines or time-sensitive workflows. Use it for human-paced tasks like report generation and form filling.
- Run against a staging environment with at least 10 end-to-end logged executions before production. Benchmark cost-per-successful-task, not just pass rate.
- For any workflow that writes or deletes production data, add a human approval step. OSWorld performance does not transfer directly to custom internal tools.
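The cost-per-successful-task metric from the staging advice above divides total spend, including failed runs, by the number of successes. A minimal sketch with an invented log of 10 runs; the `Run` record and the $0.50 per-run cost are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    succeeded: bool
    cost_usd: float


def cost_per_successful_task(runs: list[Run]) -> float:
    """Total spend across all runs divided by the number of successful runs."""
    successes = sum(1 for run in runs if run.succeeded)
    if successes == 0:
        raise ValueError("no successful runs; cost per success is undefined")
    return sum(run.cost_usd for run in runs) / successes


# Hypothetical staging log: 8 of 10 runs succeeded at $0.50 each.
staging_runs = [Run(True, 0.50)] * 8 + [Run(False, 0.50)] * 2
pass_rate = sum(run.succeeded for run in staging_runs) / len(staging_runs)
print(f"pass rate: {pass_rate:.0%}, cost per success: ${cost_per_successful_task(staging_runs):.3f}")
```

Failed runs still cost money, which is why this number is always at least the raw per-run cost and diverges from it as the pass rate drops.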
The 272K surcharge
GPT-5.4 has a 1.05M token context window, but OpenAI applies a pricing multiplier once input exceeds 272K tokens: 2x on input, 1.5x on output. At list prices, a single 500K-token context run costs $2.50 in input alone, twice the $1.25 that two 250K-token runs cost. This is the number that surprises most teams mid-sprint.
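The arithmetic can be sketched as a small calculator. It uses the list prices from the pricing table in this article and assumes the multiplier applies to the whole request once input crosses 272K tokens, rather than only to the tokens above the threshold; check your invoice to confirm which applies. The 2,000-token output size is an arbitrary example.

```python
SURCHARGE_THRESHOLD = 272_000  # input tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one GPT-5.4 request at list price."""
    long_context = input_tokens > SURCHARGE_THRESHOLD
    input_rate = 5.00 if long_context else 2.50     # $/M input tokens
    output_rate = 22.50 if long_context else 15.00  # $/M output tokens
    return input_tokens / 1_000_000 * input_rate + output_tokens / 1_000_000 * output_rate


one_long = request_cost(500_000, 2_000)
two_short = 2 * request_cost(250_000, 2_000)
print(f"one 500K run: ${one_long:.3f}, two 250K runs: ${two_short:.3f}")
```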
Where the 1M window genuinely helps: codebase-wide analysis where critical logic sits inside the first 200K tokens and the rest is reference context; multi-document Q&A where you need semantic connections across a broad corpus; long-running agent sessions where you want to avoid truncation. Where it underperforms: verbatim recall of facts buried past 300K tokens. For precision extraction at depth, use chunked retrieval and pass only the relevant chunks.
Pricing (May 2026)
| Model | Input $/M | Cached input $/M | Output $/M | Context | Best for |
|---|---|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 | 1.05M | Default production, strict-JSON extraction |
| GPT-5.4 (272K+) | $5.00 | $0.50 | $22.50 | up to 1.05M | Same model, surcharge kicks in above 272K input tokens |
| GPT-5.4-mini | ~$0.40 | ~$0.04 | ~$1.60 | varies | High-volume structured extraction, classification, routine code gen |
| GPT-5.5 | $5.00 | $0.50 | $30.00 | 1M | Agentic loops, terminal automation, 1M context work |
| GPT-5.3-Codex | $1.75 | n/a | $14.00 | 400K | Coding tasks at lower cost when 1M context is not needed |
| Batch (any tier) | 50% of standard | 50% | 50% | same | Overnight evals, large structured runs |
| Priority (any tier) | 2x standard | 2x | 2x | same | SLA-critical low-latency |
Cached input at 1/10 of standard is the lever most teams underuse. Long, stable system prompts pay for themselves in 2-3 calls. If your system prompt is the same across 1,000 daily requests, switching from uncached to cached cuts your input cost on that prompt portion by up to 90%.
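A rough sketch of that saving, using the $2.50/M and $0.25/M rates from the table. The 4,000-token system prompt is a made-up figure, and the model optimistically assumes every call after the first hits the cache, which real traffic will not always achieve.

```python
STANDARD_RATE = 2.50  # $/M input tokens
CACHED_RATE = 0.25    # $/M cached input tokens


def system_prompt_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Daily USD cost of the system prompt portion of the input."""
    if not cached:
        return calls * prompt_tokens / 1_000_000 * STANDARD_RATE
    first_call = prompt_tokens / 1_000_000 * STANDARD_RATE  # cache miss
    later_calls = (calls - 1) * prompt_tokens / 1_000_000 * CACHED_RATE
    return first_call + later_calls


uncached = system_prompt_cost(4_000, 1_000, cached=False)
cached = system_prompt_cost(4_000, 1_000, cached=True)
print(f"uncached: ${uncached:.2f}/day, cached: ${cached:.2f}/day, "
      f"saving: {1 - cached / uncached:.1%}")
```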
GPT-5.4-mini costs roughly 6x less than the full model on input and about 9x less on output, and scores 54% on SWE-bench Pro, enough for the majority of production coding tasks. A tiered routing strategy (mini for extraction and classification, full model for planning and review) typically cuts monthly API spend by 30-50% compared to running the full model uniformly.
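Tiered routing can be as simple as a lookup on the call class. The sketch below is one possible shape: the model ID "gpt-5.4-mini" is assumed from the model name in the table and should be checked against the models list, and the call-class names are hypothetical.

```python
# Call classes named in the routing strategy above.
MINI_CLASSES = {"extraction", "classification"}
FULL_CLASSES = {"planning", "review"}


def pick_model(call_class: str) -> str:
    """Route cheap, well-bounded work to the mini tier and everything else to the full model."""
    if call_class in MINI_CLASSES:
        return "gpt-5.4-mini"  # assumed API identifier
    # Unknown classes fall through to the full model: safer, though more expensive.
    return "gpt-5.4"


for call_class in ("extraction", "planning", "something_new"):
    print(call_class, "->", pick_model(call_class))
```

Defaulting unknown classes to the full model trades cost for quality; flip the default if budget is the tighter constraint.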
Watch-outs
- reasoning_effort=high in production. Log usage.output_tokens_details.reasoning_tokens per request and set a billing alert at 3x your average completion token count. It is easy to ship high effort from a development branch and notice it only at month-end.
- logit_bias deprecation. GPT-5.4 deprecates the logit_bias path for steering JSON structure. Carrying 2024 bias tables forward causes interaction errors with strict JSON mode. Remove them before migrating.
- prediction conflicts with stream. The prediction parameter (speculative decoding) conflicts with stream=True on some SDK versions. Check your client version before enabling both.
- GPT-5.2 deprecation date. GPT-5.2 Thinking enters Legacy Models on June 5, 2026. API access is unchanged until then, but plan migration tests for any downstream consumer still pinning gpt-5.2.
- xhigh on GPT-5.5. reasoning_effort="xhigh" exists on GPT-5.5 as a tier above "high". Use it sparingly; it is the line item that doubles bills in week one of upgrades.
- Output verbosity at medium effort. Some prompts that returned 200 tokens on GPT-5.2 now return 350-500. If you parse structured output downstream, test that your extraction still works on the new verbosity before cutting over.
- Refusal threshold shift. The 5.4 refusal threshold changed. Prompts touching security, medical, or financial domains should include explicit professional context.
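The 3x alert from the first watch-out can be approximated in application code before any billing dashboard catches it. A minimal sketch, assuming you keep a rolling window of recent completion token counts; `reasoning_alert` and the sample numbers are illustrative.

```python
from statistics import mean


def reasoning_alert(
    reasoning_tokens: int,
    recent_completion_tokens: list[int],
    multiplier: float = 3.0,
) -> bool:
    """True when one request's reasoning tokens exceed multiplier x the recent average."""
    if not recent_completion_tokens:
        return False  # no baseline yet, so nothing to compare against
    return reasoning_tokens > multiplier * mean(recent_completion_tokens)


history = [400, 500, 600]  # hypothetical rolling window, average 500
print(reasoning_alert(1_200, history), reasoning_alert(1_800, history))
```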
Migration path from GPT-5.2
The lowest-incident migration pattern: swap the model alias in a feature branch, run your existing eval suite, measure token usage and latency delta, then promote. GPT-5.4 is backward-compatible with Chat Completions. You do not need to refactor your prompt structure unless you want to add reasoning_effort or structured outputs.
Three things to verify during migration:
- Increased output verbosity at medium effort – test JSON extraction if you parse responses downstream.
- Refusal behavior changes on edge-case professional queries – add explicit context to affected prompts.
- Higher latency at 200K+ tokens (median 25-45 seconds at medium effort) – review timeout settings and loading states before cutover.
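The token and latency deltas mentioned in the migration pattern can be computed from two logged eval runs. A sketch under the assumption that each logged request is a dict with `output_tokens` and `latency_s` keys; both the field names and the sample values are hypothetical.

```python
from statistics import median


def migration_delta(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Compare median output tokens and latency between two eval runs."""
    def med(rows: list[dict], key: str) -> float:
        return median(row[key] for row in rows)

    return {
        "output_tokens_ratio": med(candidate, "output_tokens") / med(baseline, "output_tokens"),
        "latency_delta_s": med(candidate, "latency_s") - med(baseline, "latency_s"),
    }


# Invented eval logs for the old and new model.
old_run = [
    {"output_tokens": 200, "latency_s": 2.0},
    {"output_tokens": 220, "latency_s": 2.4},
    {"output_tokens": 180, "latency_s": 1.8},
]
new_run = [
    {"output_tokens": 400, "latency_s": 3.0},
    {"output_tokens": 440, "latency_s": 3.5},
    {"output_tokens": 360, "latency_s": 2.9},
]
delta = migration_delta(old_run, new_run)
print(f"output tokens: {delta['output_tokens_ratio']:.1f}x, "
      f"latency: +{delta['latency_delta_s']:.1f}s")
```

Medians are used instead of means so a single runaway generation does not dominate the comparison.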
Related
For the side-by-side with GPT-5.3-Codex on structured output at scale, see the GPT-5.3-Codex review. For the prompt pattern that pairs with these parameters and posts 100 of 100 on a 40-property schema, see the strict-JSON prompt.
One-line takeaway
Set model with a date, reasoning_effort to medium, response_format to strict JSON schema, tool_choice to required, and max_completion_tokens to a real number. Log reasoning tokens in production. Everything else is a knob you can leave alone until you have a specific reason to move it.