§ CHEATSHEET · APR 23, 2026 ANTHROPIC · CHEATSHEET · CLAUDE v1.0

GPT-5.4 API cheatsheet: the 9 parameters that matter in 2026

GPT-5.4 API parameters, defaults, and the 3 that break your pipeline if you do not set them. Strict JSON, reasoning_effort, tool_choice, and the cost line to watch.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
  8 min read

GPT-5.4 shipped March 5, 2026. The 10 parameters below cover every knob a production call site needs to set explicitly. The 3 footguns are the ones that break teams the fastest. Scroll past those for a full working Python example, the 272K-token surcharge breakdown, and the computer use section that lives on a different API endpoint than you expect.

By Ryan Calloway. Updated May 2026.

Update – April 23, 2026
New flagship: OpenAI shipped GPT-5.5 – first fully retrained base since GPT-5, 1M context, $5/$30 per M tokens. Every parameter on this page still applies; swap model="gpt-5.4" for model="gpt-5.5".
Cost notice: GPT-5.5 doubles input ($2.50 to $5.00) and output ($15 to $30). Stay on GPT-5.4 unless you need the Terminal-Bench 2.0 lead or agent-loop accuracy.
When to upgrade: Terminal automation, multi-step tool chains, long-horizon agents. For strict-JSON extraction GPT-5.4 stays the right pick on $/successful-call.

The 10 parameters to set explicitly

| Parameter | API default | Production value | Why |
|---|---|---|---|
| model | none | gpt-5.4 or a dated snapshot | Pin the version. Floating -latest breaks reproducibility across deploys. |
| reasoning_effort | medium | medium for most work, high for planning and debugging | Higher effort costs 2-4x in hidden reasoning tokens. Log reasoning_tokens from the usage object (usage.completion_tokens_details on Chat Completions, usage.output_tokens_details on the Responses API) to see the real spend. |
| response_format | text | {"type":"json_schema","strict":true,...} | Strict mode compiles the schema into constrained decoding. JSON parse errors become structurally impossible on completed responses. |
| tool_choice | auto | "required" or {"type":"function","function":{"name":"..."}} | Forces the model to call the tool you registered. auto returns prose on a non-trivial fraction of calls. |
| max_completion_tokens | model max | 2048-8192 depending on task | Caps runaway generations. The unbounded billing line is the most common surprise in post-mortems. |
| temperature | 1.0 | 0.0 for deterministic extraction, 0.4 for test generation | 0 is not fully deterministic on its own. Pair with seed for replay fidelity. |
| seed | none | fixed int per call class | Reproducibility on logged replays. Same seed plus same temperature gives statistically consistent outputs. |
| top_logprobs | none | 5 on eval runs (with logprobs enabled) | Margin-of-confidence at eval time without a second API call. |
| metadata | none | {"eval_id":..., "call_class":...} | Tags calls for post-hoc analysis in the OpenAI usage dashboard. |
| prediction | none | Previous output on re-generation tasks | Speculative decoding. Cuts latency materially on test-gen and template-fill workloads. Conflicts with stream on some SDK versions; test first. |

reasoning_effort reference

All five valid values, ordered cheapest to most expensive:

| Value | Hidden reasoning tokens | Best for |
|---|---|---|
| none | 0 | Lookups, formatting, classification. No chain-of-thought at all. |
| low | light | Structured extraction, summarization, straightforward code edits. |
| medium | moderate | Default. Balanced for most API workloads. |
| high | heavy, ~1.8x visible tokens | Code review, multi-step planning, complex debugging across files. |
| xhigh | very heavy, 3-5x medium | Genuinely hard reasoning tasks. The line item that doubles bills in week one if left on from development. |

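Tip: log reasoning tokens per request in production (usage.completion_tokens_details.reasoning_tokens on Chat Completions, usage.output_tokens_details.reasoning_tokens on the Responses API). It is easy to set high during development and ship it without noticing. At 500 requests per day, the gap between medium and high compounds into a visible line item by the end of the month.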
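To put a rough number on that gap, here is a back-of-the-envelope sketch. It assumes reasoning tokens are billed at the GPT-5.4 output rate of $15 per M tokens from the pricing table, and it uses hypothetical per-request averages of 800 reasoning tokens at medium and 2,400 at high (a 3x ratio, inside the 2-4x range quoted above). Substitute the averages from your own logs.

```python
OUTPUT_PRICE_PER_M = 15.00  # GPT-5.4 output $/M tokens, from the pricing table


def monthly_reasoning_cost(reasoning_tokens_per_request, requests_per_day=500, days=30):
    """Dollar cost of hidden reasoning tokens over one month."""
    total_tokens = reasoning_tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * OUTPUT_PRICE_PER_M


# Hypothetical per-request averages; replace with your logged values.
medium_cost = monthly_reasoning_cost(800)
high_cost = monthly_reasoning_cost(2400)

print(f"medium: ${medium_cost:.2f}/month")
print(f"high:   ${high_cost:.2f}/month")
print(f"delta:  ${high_cost - medium_cost:.2f}/month")
```

Under these assumptions the reasoning line alone goes from $180 to $540 per month, before any visible output tokens are counted.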

The 3 settings that break you

  1. response_format without "strict": true. Without strict mode the model emits JSON-flavored text, and on high-volume traffic your parser rejects a non-trivial percentage of responses. Always set "strict": true and supply the full schema. The OpenAI structured outputs guide covers the format. Strict mode compiles the schema into constrained decoding on the server side; it is not a prompt hint.
  2. tool_choice left at auto. If you registered a tool because the workflow requires it, set tool_choice to "required" or pin the specific function name. With auto, the model answers in prose on a meaningful fraction of calls. That is a spec violation for any pipeline expecting structured tool output.
  3. max_completion_tokens unset. On malformed or ambiguous prompts, generation can run all the way to the model maximum. An unbounded call at reasoning_effort="high" on a vague prompt can generate thousands of tokens. This is the most common line item in reader-submitted billing post-mortems.
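One way to catch all three before a request leaves your service is a small pre-flight check over the keyword arguments. The sketch below is illustrative only: preflight is a hypothetical helper, not part of the OpenAI SDK, and it inspects only the three settings listed above.

```python
def preflight(kwargs: dict) -> list[str]:
    """Return a list of footgun warnings for a Chat Completions kwargs dict."""
    problems = []

    # Footgun 1: JSON output requested without a strict schema.
    rf = kwargs.get("response_format") or {}
    if rf.get("type") in ("json_object", "json_schema"):
        if rf.get("json_schema", {}).get("strict") is not True:
            problems.append("response_format requests JSON without strict=True")

    # Footgun 2: tools registered but the model is free to answer in prose.
    if kwargs.get("tools") and kwargs.get("tool_choice", "auto") == "auto":
        problems.append("tools are registered but tool_choice is auto")

    # Footgun 3: no cap on generated tokens.
    if "max_completion_tokens" not in kwargs:
        problems.append("max_completion_tokens is unset")

    return problems


if __name__ == "__main__":
    risky = {
        "model": "gpt-5.4",
        "response_format": {"type": "json_schema", "json_schema": {"name": "x"}},
        "tools": [{"type": "function"}],
    }
    for warning in preflight(risky):
        print("WARN:", warning)
```

Running a check like this in CI or at client construction time is cheaper than discovering the problem in a billing report.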

Minimal production call

from openai import OpenAI
import json

client = OpenAI()  # reads OPENAI_API_KEY from env

INVOICE_SCHEMA = {
    "name": "invoice_data",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "invoice_id":   {"type": "string"},
            "vendor":       {"type": "string"},
            "amount_usd":   {"type": "number"},
            "due_date":     {"type": "string", "format": "date"},
            "line_items":   {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "quantity":    {"type": "integer"},
                        "unit_price":  {"type": "number"}
                    },
                    "required": ["description", "quantity", "unit_price"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["invoice_id", "vendor", "amount_usd", "due_date", "line_items"],
        "additionalProperties": False
    }
}

response = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="medium",
    response_format={"type": "json_schema", "json_schema": INVOICE_SCHEMA},
    max_completion_tokens=2048,
    temperature=0.0,
    seed=42,
    metadata={"call_class": "invoice_extraction", "eval_id": "inv-001"},
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data from the text. Return only the JSON structure."
        },
        {
            "role": "user",
            "content": "Invoice #INV-2026-0512 from Acme Corp. Due 2026-06-01. "
                       "Line items: 10x API units @ $5.00, 2x Support hours @ $150.00. Total $350."
        }
    ],
)

data = json.loads(response.choices[0].message.content)
# Chat Completions reports reasoning tokens under completion_tokens_details
reasoning_tokens = response.usage.completion_tokens_details.reasoning_tokens
print(f"Extracted: {data}")
print(f"Reasoning tokens used: {reasoning_tokens}")
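Strict mode constrains what the model can emit, but a response cut off by max_completion_tokens can still end mid-object, and a refusal arrives in a separate field rather than as JSON. A minimal guard, assuming the usual Chat Completions response shape (finish_reason on the choice, refusal on the message), might look like the sketch below. parse_choice is a hypothetical helper name.

```python
import json
from types import SimpleNamespace


def parse_choice(choice):
    """Parse a strict-JSON choice, failing loudly on truncation or refusal."""
    if choice.finish_reason == "length":
        # Generation hit the token cap; the JSON is probably incomplete.
        raise ValueError("output truncated at max_completion_tokens")
    refusal = getattr(choice.message, "refusal", None)
    if refusal:
        raise ValueError(f"model refused: {refusal}")
    return json.loads(choice.message.content)


if __name__ == "__main__":
    # Stand-in object for local testing; in production pass response.choices[0].
    fake = SimpleNamespace(
        finish_reason="stop",
        message=SimpleNamespace(content='{"invoice_id": "INV-2026-0512"}', refusal=None),
    )
    print(parse_choice(fake))
```

In the call above you would replace json.loads(response.choices[0].message.content) with parse_choice(response.choices[0]) and decide whether a truncation should trigger a retry with a higher cap.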

Structured output with a tool call

When you need the model to both call a function and return typed data, combine tool_choice forced to the specific function with a tight input schema:

tools = [
    {
        "type": "function",
        "function": {
            "name": "create_ticket",
            "description": "Create a support ticket. Call only when the user is reporting a bug or outage.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {
                    "severity":    {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                    "title":       {"type": "string", "maxLength": 120},
                    "description": {"type": "string", "maxLength": 1000},
                    "component":   {"type": "string"}
                },
                "required": ["severity", "title", "description", "component"],
                "additionalProperties": False
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-5.4",
    reasoning_effort="medium",
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "create_ticket"}},
    max_completion_tokens=512,
    temperature=0.0,
    messages=[
        {"role": "system", "content": "You are a support triage assistant."},
        {"role": "user",   "content": "The payments API has been returning 500 errors for 20 minutes."}
    ],
)

tool_call = response.choices[0].message.tool_calls[0]
ticket = json.loads(tool_call.function.arguments)
print(ticket)
# {"severity": "critical", "title": "Payments API 500 errors", ...}
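Once the arguments are parsed, something has to run the function. A minimal dispatch sketch follows. The create_ticket handler and its return value are hypothetical stand-ins for your ticketing backend, and the lookup table assumes one handler per registered tool name.

```python
import json
from types import SimpleNamespace


def create_ticket(severity, title, description, component):
    # Hypothetical handler; a real one would call your ticketing system.
    return {"status": "created", "severity": severity, "component": component}


# Map each registered tool name to the local function that implements it.
HANDLERS = {"create_ticket": create_ticket}


def dispatch(tool_call):
    """Look up the handler for a tool call and invoke it with the parsed arguments."""
    name = tool_call.function.name
    if name not in HANDLERS:
        raise KeyError(f"no handler registered for tool {name!r}")
    args = json.loads(tool_call.function.arguments)
    return HANDLERS[name](**args)


if __name__ == "__main__":
    # Stand-in for response.choices[0].message.tool_calls[0]
    fake_call = SimpleNamespace(function=SimpleNamespace(
        name="create_ticket",
        arguments=json.dumps({
            "severity": "critical",
            "title": "Payments API 500 errors",
            "description": "500s for 20 minutes",
            "component": "payments",
        }),
    ))
    print(dispatch(fake_call))
```

Because the tool schema sets "strict": True with all four fields required, the keyword expansion should line up with the handler signature; a mismatch would surface as a TypeError rather than a silent default.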

Computer use (Responses API only)

Computer use on GPT-5.4 scores 75.0% on OSWorld-Verified, above the 72.4% human expert baseline. The catch: it lives on the Responses API (/v1/responses), not the Chat Completions endpoint. If your stack targets /v1/chat/completions, you need a separate integration path.

response = client.responses.create(
    model="gpt-5.4",
    tools=[{"type": "computer_use"}],
    input=[{
        "role": "user",
        "content": "Open the browser, go to the admin dashboard, and export the monthly report as CSV."
    }],
    truncation="auto"
)

Three things to know before shipping computer use:

The 272K surcharge

GPT-5.4 has a 1.05M token context window, but OpenAI applies a pricing multiplier once input exceeds 272K tokens: 2x on input, 1.5x on output. A 500K-token context run costs substantially more than two 250K-token runs. This is the number that surprises most teams mid-sprint.
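A quick calculation makes the gap concrete. This sketch assumes the multiplier applies to the whole request once input crosses 272K tokens, which is how the pricing table on this page presents it (a separate 272K+ row); check your invoice if OpenAI bills only the tokens above the threshold. The 2,000 output tokens per call is a hypothetical figure.

```python
BASE_INPUT_PER_M = 2.50    # GPT-5.4 input $/M tokens
BASE_OUTPUT_PER_M = 15.00  # GPT-5.4 output $/M tokens
SURCHARGE_THRESHOLD = 272_000


def call_cost(input_tokens, output_tokens):
    """Estimated dollar cost of one call, assuming a whole-request surcharge."""
    over = input_tokens > SURCHARGE_THRESHOLD
    input_rate = BASE_INPUT_PER_M * (2.0 if over else 1.0)
    output_rate = BASE_OUTPUT_PER_M * (1.5 if over else 1.0)
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


one_big = call_cost(500_000, 2_000)
two_small = 2 * call_cost(250_000, 2_000)

print(f"one 500K call:  ${one_big:.3f}")
print(f"two 250K calls: ${two_small:.3f}")
```

Under these assumptions the single 500K call comes to about $2.55 against about $1.31 for the two 250K calls, so splitting the work roughly halves the bill when the task allows it.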

Where the 1M window genuinely helps: codebase-wide analysis where critical logic sits inside the first 200K tokens and the rest is reference context; multi-document Q&A where you need semantic connections across a broad corpus; long-running agent sessions where you want to avoid truncation. Where it underperforms: verbatim recall of facts buried past 300K tokens. For precision extraction at depth, use chunked retrieval and pass only the relevant chunks.

Pricing (May 2026)

| Model | Input $/M | Cached input $/M | Output $/M | Context | Best for |
|---|---|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 | 1.05M | Default production, strict-JSON extraction |
| GPT-5.4 (272K+) | $5.00 | $0.50 | $22.50 | up to 1.05M | Same model, surcharge kicks in above 272K input tokens |
| GPT-5.4-mini | ~$0.40 | ~$0.04 | ~$1.60 | varies | High-volume structured extraction, classification, routine code gen |
| GPT-5.5 | $5.00 | $0.50 | $30.00 | 1M | Agentic loops, terminal automation, 1M context work |
| GPT-5.3-Codex | $1.75 | n/a | $14.00 | 400K | Coding tasks at lower cost when 1M context is not needed |
| Batch (any tier) | 50% of standard | 50% | 50% | same | Overnight evals, large structured runs |
| Priority (any tier) | 2x standard | 2x | 2x | same | SLA-critical low-latency |

Cached input at 1/10 of standard is the lever most teams underuse. Long, stable system prompts pay for themselves in 2-3 calls. If your system prompt is the same across 1,000 daily requests, switching from uncached to cached cuts your input cost on that prompt portion by about 90%.
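As a worked example of the cached-input lever, the sketch below prices a hypothetical 4,000-token system prompt across 1,000 daily requests at the GPT-5.4 rates from the table. It assumes every request hits the cache, which slightly overstates the saving because the first call after a cache miss is billed at the standard rate.

```python
STANDARD_INPUT_PER_M = 2.50  # GPT-5.4 input $/M tokens
CACHED_INPUT_PER_M = 0.25    # GPT-5.4 cached input $/M tokens


def daily_prompt_cost(prompt_tokens, requests, rate_per_m):
    """Daily dollar cost of the shared prompt portion at a given rate."""
    return prompt_tokens * requests / 1_000_000 * rate_per_m


# Hypothetical prompt size; 1,000 requests/day as in the paragraph above.
uncached = daily_prompt_cost(4_000, 1_000, STANDARD_INPUT_PER_M)
cached = daily_prompt_cost(4_000, 1_000, CACHED_INPUT_PER_M)
saving = 1 - cached / uncached

print(f"uncached: ${uncached:.2f}/day")
print(f"cached:   ${cached:.2f}/day")
print(f"saving:   {saving:.0%}")
```

With these numbers the prompt portion drops from $10.00 to $1.00 per day.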

GPT-5.4-mini costs roughly 6x less than the full model on input (closer to 9x on output, going by the table above) and scores 54% on SWE-bench Pro – enough for the majority of production coding tasks. A tiered routing strategy (mini for extraction and classification, full model for planning and review) typically cuts monthly API spend by 30-50% compared to running the full model uniformly.
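The routing itself can be a few lines. This is a sketch, not a recommendation of exact categories: the call-class names are hypothetical labels (the same kind of value you might put in metadata), and "gpt-5.4-mini" is assumed to be the API identifier for the mini model.

```python
# Call classes cheap enough for the mini model; everything else goes to the full model.
MINI_CALL_CLASSES = {"extraction", "classification"}

FULL_MODEL = "gpt-5.4"
MINI_MODEL = "gpt-5.4-mini"  # assumed identifier; confirm against the models list


def pick_model(call_class: str) -> str:
    """Route a request to the mini or full model based on its call class."""
    return MINI_MODEL if call_class in MINI_CALL_CLASSES else FULL_MODEL


if __name__ == "__main__":
    for call_class in ("extraction", "classification", "planning", "review"):
        print(f"{call_class:15s} -> {pick_model(call_class)}")
```

Defaulting unknown classes to the full model keeps new workloads on the safer tier until someone has measured them on mini.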

Watch-outs

Migration path from GPT-5.2

The lowest-incident migration pattern: swap the model alias in a feature branch, run your existing eval suite, measure token usage and latency delta, then promote. GPT-5.4 is backward-compatible with Chat Completions. You do not need to refactor your prompt structure unless you want to add reasoning_effort or structured outputs.

Three things to verify during migration:

  1. Increased output verbosity at medium effort – test JSON extraction if you parse responses downstream.
  2. Refusal behavior changes on edge-case professional queries – add explicit context to affected prompts.
  3. Higher latency at 200K+ tokens (median 25-45 seconds at medium effort) – review timeout settings and loading states before cutover.
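To measure the token and latency delta during the feature-branch run, a small comparison over per-call records is usually enough. The record shape below (latency_s, total_tokens) is a hypothetical logging format, and the sample values are made up for illustration.

```python
from statistics import median


def compare_runs(baseline, candidate):
    """Summarize latency and token deltas between two eval runs."""
    base_latency = median(r["latency_s"] for r in baseline)
    cand_latency = median(r["latency_s"] for r in candidate)
    base_tokens = sum(r["total_tokens"] for r in baseline)
    cand_tokens = sum(r["total_tokens"] for r in candidate)
    return {
        "median_latency_delta_s": cand_latency - base_latency,
        "token_ratio": cand_tokens / base_tokens,
    }


if __name__ == "__main__":
    # Illustrative records only; feed in your own eval logs.
    old_run = [{"latency_s": 2.0, "total_tokens": 1000}, {"latency_s": 4.0, "total_tokens": 1000}]
    new_run = [{"latency_s": 3.0, "total_tokens": 1200}, {"latency_s": 5.0, "total_tokens": 1200}]
    print(compare_runs(old_run, new_run))
```

A token ratio well above 1.0 at the same reasoning_effort points at the verbosity change in item 1, while a large latency delta is the cue to revisit timeouts as in item 3.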

For the side-by-side with GPT-5.3-Codex on structured output at scale, see the GPT-5.3-Codex review. For the prompt pattern that pairs with these parameters and posts 100 of 100 on a 40-property schema, see the strict-JSON prompt.

One-line takeaway

Set model with a date, reasoning_effort to medium, response_format to strict JSON schema, tool_choice to required, and max_completion_tokens to a real number. Log reasoning tokens in production. Everything else is a knob you can leave alone until you have a specific reason to move it.
