The question of how to write a retry policy for an LLM agent loop sounds simple until you watch a single flaky API call turn into $50 of wasted tokens in 30 seconds. That is the retry storm, and it is the failure mode most agent developers hit in their first production incident. This guide covers the full policy: error classification, jittered backoff, the circuit breaker pattern, loop detection, idempotency keys, and the never-retry list. On the TCC editorial fixture (180k turns/day across four vendors), this brings the unretryable-error rate from a naive baseline of 6.1% down to 0.2% at a 1.08x latency cost.
Why agent retries are fundamentally more expensive than microservice retries
In a traditional microservice, a retry costs one HTTP round-trip. In an agent system, every retry resubmits the full conversation context to the LLM. A retry loop that fires 10 requests does not just waste 10 HTTP calls. It consumes 10x the tokens at the cost of the full context window each time.
Consider a concrete example. An agent with 8,000 tokens of conversation context calls a tool that returns a 429. With a naive 3-retry policy: that is 4 total LLM calls (original plus 3 retries), each sending the full 8,000-token context. 32,000 input tokens consumed to achieve nothing. With a circuit breaker and a proper budget: 1 LLM call, then immediate failure. One team measured the difference during an API outage and found uncontrolled retries consumed roughly $2 in tokens over a 30-second window for a single conversation, while circuit-breaking reduced that to about $0.01. A 200x cost difference from one architectural choice.
This changes the calculus entirely. Every retry is a cost multiplier you cannot afford to ignore.
The three amplification patterns
Retry storms in agent systems follow three distinct patterns. Each requires a different defense.
Vertical amplification is the simple case: one tool call fails, the agent retries it. The agent-specific twist is that the LLM itself may decide to retry after seeing the failure response in its context. You now have a shadow retry loop operating independently of the retry logic you built into the tool layer. The context accumulates failure logs, increasing token cost on every subsequent call.
Horizontal amplification happens in sequential workflows where step N depends on step N-1. A failure at step 1 cascades. If each step has a 3-retry policy and you have 5 dependent steps, a step-1 failure can theoretically produce 3^5 = 243 total retry attempts. In practice timeouts prevent the full explosion, but 10-50x amplification is common.
Recursive amplification occurs in multi-agent architectures where agents delegate to sub-agents. Agent A calls Agent B, which calls Agent C. Agent C’s tool fails and Agent B retries its entire workflow. Agent A retries its workflow. The retry decision happens at a different layer of abstraction than the failure. This is the pattern traditional circuit breakers handle worst.
Step 1: classify before you retry
Every failure goes into one of four buckets. The bucket decides whether you retry at all.
| Bucket | Examples | Retryable? | Backoff |
|---|---|---|---|
| Transient rate limit | 429, overloaded_error, Vertex 429 |
Yes | Jittered exponential; respect Retry-After |
| Server error | 500, 502, 503, 504, internal_server_error |
Yes | Jittered exponential, max 3 attempts |
| Tool error, recoverable | Shell exit 1, HTTP 503 from internal tool | Yes, with replanning | 0s delay; let the model replan |
| Permanent | 401, 403, 400 schema error, content_policy, parse error |
No | Fail fast |
Most of the damage from over-retrying comes from retrying permanent errors. A 400 on the schema will be a 400 on the schema. Fail the turn, log the input, and move on. Always check the Retry-After header when the API provides one; Anthropic includes it on overloaded_error responses and ignoring it adds noise to an already degraded service.
Step 2: jittered exponential backoff
import random
import time
def retry_delay(attempt: int, base: float = 0.4, cap: float = 20.0) -> float:
"""Full-jitter exponential backoff. attempt is 1-indexed."""
ceiling = min(cap, base * (2 ** attempt))
return random.uniform(0, ceiling)
async def call_with_backoff(func, max_retries: int = 3):
for attempt in range(1, max_retries + 1):
try:
return await func()
except RateLimitError as e:
# Respect Retry-After if present
retry_after = getattr(e, "retry_after", None)
if retry_after:
await asyncio.sleep(retry_after)
elif attempt < max_retries:
await asyncio.sleep(retry_delay(attempt))
else:
raise
except ServerError:
if attempt < max_retries:
await asyncio.sleep(retry_delay(attempt))
else:
raise
Full-jitter beats fixed delay and exponential-without-jitter by a wide margin under rate-limit pressure. Fixed 1-second retries create thundering herds on the vendor's side. Jittered full-random spreads the retry traffic across the window and lets the vendor's limiter drain. AWS research shows this reduces retry storms by 60-80%. On the TCC fixture, the switch from fixed to jittered cut the 429 rate from 3.8% to 0.9%.
Step 3: the circuit breaker
A circuit breaker stops requests to a provider when it is clearly failing. This is the single highest-value addition beyond basic backoff, and the one most agent frameworks omit.
from datetime import datetime, timedelta
from dataclasses import dataclass, field
from typing import Literal
@dataclass
class CircuitBreaker:
failure_threshold: int = 5
reset_timeout: int = 60
failures: int = 0
last_failure: datetime | None = None
state: Literal["closed", "open", "half-open"] = "closed"
def can_execute(self) -> bool:
if self.state == "closed":
return True
if self.state == "open":
if self.last_failure and datetime.now() - self.last_failure > timedelta(seconds=self.reset_timeout):
self.state = "half-open"
return True
return False
return True # half-open: allow one test request
def record_success(self):
self.failures = 0
self.state = "closed"
def record_failure(self):
self.failures += 1
self.last_failure = datetime.now()
if self.failures >= self.failure_threshold:
self.state = "open"
# One circuit breaker per provider
breakers = {
"openai": CircuitBreaker(),
"anthropic": CircuitBreaker(),
"google": CircuitBreaker(),
}
When a provider fails 5 times in a row, the circuit opens and all requests skip that provider for 60 seconds. This prevents wasting tokens on a provider that is clearly down and lets you route to a fallback model. The Anthropic status page logged a 30-40 minute overloaded_error spike in February 2026; without a circuit breaker and wall-clock deadline, agent turns sit in the retry loop the entire time.
Step 4: model fallback chain
When the primary model is unavailable, fall back to alternatives rather than waiting for recovery:
MODEL_CHAIN = [
{"model": "claude-opus-4-7", "provider": "anthropic"},
{"model": "gpt-5.3-codex", "provider": "openai"},
{"model": "gemini-3.1-pro", "provider": "google"},
]
async def run_with_fallback(messages: list[dict]) -> str:
for model_config in MODEL_CHAIN:
provider = model_config["provider"]
breaker = breakers[provider]
if not breaker.can_execute():
continue
try:
result = await call_model(model_config["model"], messages)
breaker.record_success()
return result
except (RateLimitError, ServerError) as e:
breaker.record_failure()
continue
raise AllModelsFailed("No available models in fallback chain")
Step 5: the step-budget interlock and wall-clock deadline
An agent has a step budget (say, 8). A retry inside a step consumes that step's patience, not the outer budget. If a step has retried twice and still fails, the agent should not reach step 8 through luck. It should exit early with a clean give-up.
import time
MAX_ATTEMPTS_PER_STEP = 3
MAX_STEPS = 8
TOTAL_RETRY_WALL_SECONDS = 90
def run_agent(turn):
deadline = time.time() + TOTAL_RETRY_WALL_SECONDS
for step_i in range(MAX_STEPS):
for attempt in range(1, MAX_ATTEMPTS_PER_STEP + 1):
try:
return do_step(turn, step_i)
except TransientError as e:
if time.time() > deadline:
raise GiveUp("wall-clock deadline exceeded") from e
time.sleep(retry_delay(attempt))
except PermanentError:
raise
raise GiveUp(f"step {step_i} failed after {MAX_ATTEMPTS_PER_STEP} attempts")
The wall-clock deadline is the line that matters when a vendor outage lasts 30-40 minutes. Without it, agent turns sit in the retry loop for hours burning tokens and holding concurrency slots.
Step 6: loop detection
The most expensive failure mode is an agent that calls the same tool repeatedly without making progress. Standard step budgets do not catch this; the agent counts its steps but each step is a different call signature.
MAX_CONSECUTIVE_SAME_TOOL = 3
MAX_TOTAL_TOOL_CALLS = 15
class LoopDetector:
def __init__(self):
self.tool_history: list[str] = []
def check(self, tool_name: str, tool_args: dict) -> bool:
"""Returns False if a loop is detected."""
call_signature = f"{tool_name}:{sorted(tool_args.items())}"
self.tool_history.append(call_signature)
# Exact repetition check
if len(self.tool_history) >= MAX_CONSECUTIVE_SAME_TOOL:
recent = self.tool_history[-MAX_CONSECUTIVE_SAME_TOOL:]
if len(set(recent)) == 1:
return False
# Total call budget
return len(self.tool_history) <= MAX_TOTAL_TOOL_CALLS
When a loop is detected, interrupt the agent: "You have called the same tool 3 times with the same arguments. The approach is not working. Try a different strategy." This is better than letting the turn time out.
Step 7: idempotency keys on every state-changing tool call
Every tool call that mutates state gets an idempotency key. The key is f"{turn_id}:{step_i}:{attempt}". The database, the payment API, the external email sender all accept the same key twice and commit once. Without an idempotency story on the tools, do not retry state-changing calls at all. Fail the turn and let a human retry.
This is the policy most often skipped in early agent deployments and the one with the worst failure mode. The recurring r/SaaS thread on duplicate Stripe charges from agent loops cites duplicate rates in the 0.2-0.5% range of retried turns. The Stripe idempotency docs show how to implement this for one common surface; the same pattern applies to every internal mutator.
The never-retry list
- 401, 403. Auth issues are not transient. Fix the credentials before retrying.
- 400 with a schema error. The next call is the same call. Fail the turn and log the input.
- Content-policy refusals. The model did not like the input. It will not like it in 20 seconds.
- Parse errors on model output. Retrying with the same prompt is a gamble. Retrying with a corrective prompt ("you returned invalid JSON, fix it") is a different policy and should only happen once per turn.
- Context-length exceeded. Shorten the context before calling again, or fail the turn.
- Budget exceeded. An agent that hits its token budget should give up, not retry. Retrying just burns the overage at higher cost.
Monitoring: alert on leading indicators
Track errors to catch systemic issues before they become incidents. Alert on these four signals, not just raw error rate:
- Tool call error rate above 50% in a 5-minute window (a provider is down)
- Sessions with more than 15 tool calls per turn (normal is 1-3; above 15 is almost always a loop)
- P95 tool latency exceeding 5x baseline (degradation before hard failures start)
- Token consumption per session exceeding 3x the rolling average (retry storm in progress)
The goal is proportional failure cost. When something breaks, the system should spend less effort failing than it would have spent succeeding. If the retry logic can burn 200x the tokens of a successful request, the retry logic is the bug, not the flaky API it is retrying against.
Provider-specific notes
The OpenAI rate-limits guide recommends jittered exponential backoff explicitly and separates RPM, TPM, RPD, and TPD limits at the org and project level. The Anthropic docs document overloaded_error as retryable with exponential backoff, with separate input (ITPM) and output (OTPM) token limits. Vertex AI separates the quota-exhausted class from transient 429s; retry the latter, not the former unless the quota has been raised.
The full policy, summarized
| Layer | What to implement | Why it matters |
|---|---|---|
| Error classification | 4-bucket taxonomy | Stops retrying permanent errors |
| Backoff | Full-jitter exponential, respect Retry-After | Cuts 429 rate by ~75% |
| Circuit breaker | Per-provider, trip at 5 failures | 200x token cost reduction during outages |
| Model fallback | 3-provider chain | Availability during vendor downtime |
| Step-budget interlock | Wall-clock + per-step limits | Prevents hour-long retry loops |
| Loop detection | Same-call signature tracking | Catches stuck agents |
| Idempotency keys | turn_id:step:attempt on all mutators | Prevents duplicate charges and writes |
Frequently asked questions
Should I retry on network timeouts? Yes, but only once, with a fresh connection. If the first attempt timed out, the server may have received the request and is processing it. Use an idempotency key so a duplicate server-side execution is a no-op. If the timeout is consistent (every attempt times out), classify it as a persistent issue and fail the turn.
How many max retries is right? Three per step is the number the TCC fixture converged on. More than three retries on a single step almost never helps: if a 429 has not resolved after 3 attempts with jittered backoff, the vendor's limiter is backed up and more retries accelerate the problem. Fail the step, record the give-up marker, and let the orchestrator decide whether to requeue the turn later.
What about streaming responses? Streaming adds a failure mode: the stream starts but drops mid-response. Treat a mid-stream drop as a transient server error (same as 503). Retry with the full prompt, not a partial one. If the model was mid-sentence when the stream dropped, the next attempt starts fresh anyway.
Do I need all four layers? The circuit breaker and fallback chain are optional if you are running on a single provider in a low-volume context. Everything else (classification, jittered backoff, wall-clock deadline, idempotency keys) is table stakes for any production agent, regardless of scale. Start with those four and add the circuit breaker when you have multi-provider routing.
Related
The bounded-planner prompt is the partner to this policy; it tells the model to respect the step budget. The scores for each major vendor on the bounded-budget task are on the Claude Opus 4.7 review, the GPT-5.3-Codex review, and the Gemini 3.1 Pro review.
Verdict
Classify errors before retrying. Use full-jitter exponential backoff and respect Retry-After headers. Add a circuit breaker per provider. Wire a model fallback chain. Set a wall-clock deadline through the agent step budget. Detect loops by call signature. Put idempotency keys on every state-changing tool call. Retry nothing on the never-retry list. The number to beat on the TCC editorial fixture is 0.2% unretryable errors at 180k turns a day. Every point above that is a policy gap, not a vendor problem.