The recurring r/LocalLLaMA and Hacker News threads on production agent loops in 2026 land on the same answer: the retry policy is 80% of the reliability story, fixed-delay backoff is the worst default, and the never-retry list is what teams forget. The four-step policy below is the one TCC editorial uses for the bounded-agent track: classify before retry, full-jitter exponential backoff, a step-budget interlock with a wall-clock deadline, and idempotency keys on every state-changing tool call. On the TCC editorial fixture (a 180k-turns/day window across four vendors), this policy brings the unretryable-error rate down from a naive baseline of 6.1% to 0.2%, at a 1.08x latency cost.
Step 1: classify before you retry
Every failure goes into one of four buckets. The bucket decides whether you retry at all, and with what.
| Bucket | Examples | Retryable? | Backoff |
|---|---|---|---|
| Transient rate limit | 429, overloaded_error, Vertex 429 | Yes | Jittered exponential |
| Server error | 500, 502, 503, 504, internal_server_error | Yes | Jittered exponential, max 3 |
| Tool error, recoverable | Shell exit 1, HTTP 503 from internal tool | Yes, with a different prompt | 0s (let the model replan) |
| Permanent | 401, 403, 400 (schema), content_policy, parse error | No | Fail fast |
Most of the damage from “too many retries” comes from retrying permanent errors. A 400 on the schema will be a 400 on the schema. Fail the turn, log the input, move on. The same classification scheme is what the OpenAI rate-limits guide recommends and what the Anthropic docs document for overloaded_error.
Step 2: jittered exponential, not fixed backoff
```python
import random, time

def retry_delay(attempt: int, base: float = 0.4, cap: float = 20.0) -> float:
    """Full-jitter exponential backoff. attempt is 1-indexed."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```
Full-jitter beats fixed delay and “exponential without jitter” by a wide margin under rate-limit pressure. Fixed 1-second retries create thundering herds on the vendor’s side, which is how a 429 spike becomes a sustained 429. Jittered full-random spreads the second-attempt traffic across the window and lets the vendor’s limiter drain. The AWS architecture blog post on exponential backoff and jitter is the canonical reference for the math.
On the TCC fixture, the switch from fixed to jittered cut the 429 rate from 3.8% to 0.9% by itself. The other two points came from the step-budget interlock below.
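The herd effect is easy to see in a toy simulation. The client count and 100 ms bucket width below are arbitrary assumptions, and `retry_delay` is restated so the snippet is self-contained:

```python
import random
from collections import Counter

def retry_delay(attempt: int, base: float = 0.4, cap: float = 20.0) -> float:
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

random.seed(0)
CLIENTS = 1000

# Fixed 1s backoff: every client's second attempt lands in the same
# 100 ms bucket on the vendor's side -- the thundering herd.
fixed = Counter(int(1.0 * 10) for _ in range(CLIENTS))

# Full jitter, attempt 1: the same traffic spreads across [0, 0.8s).
jittered = Counter(int(retry_delay(1) * 10) for _ in range(CLIENTS))

assert max(fixed.values()) == CLIENTS        # one bucket absorbs the whole herd
assert max(jittered.values()) < CLIENTS / 2  # no single bucket dominates
```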
Step 3: the step-budget interlock
An agent has a step budget (say, 8). A retry inside a step consumes the step’s remaining patience, not the outer budget. If a step has retried twice and still fails, the agent should not get to step 8 by luck; it should exit early with a clean give-up. Pair this with the bounded-planner prompt so the model emits the give-up marker the harness can match.
```python
class Transient(Exception): ...  # 429s, 5xx, overloaded_error
class Permanent(Exception): ...  # 401/403, schema 400s, content policy
class GiveUp(Exception): ...     # clean early exit for the harness

MAX_ATTEMPTS_PER_STEP = 3
MAX_STEPS = 8
TOTAL_RETRY_WALL_SECONDS = 90

def run(turn):
    # uses time and retry_delay from the snippet above
    deadline = time.time() + TOTAL_RETRY_WALL_SECONDS
    result = None
    for step_i in range(MAX_STEPS):
        for attempt in range(1, MAX_ATTEMPTS_PER_STEP + 1):
            try:
                result = do_step(turn, step_i)
                break  # step succeeded; advance to the next one
            except Permanent:
                raise  # never retry these
            except Transient as e:
                if time.time() > deadline:
                    raise GiveUp("wall-clock exceeded") from e
                if attempt == MAX_ATTEMPTS_PER_STEP:
                    raise GiveUp(
                        f"step {step_i} failed after "
                        f"{MAX_ATTEMPTS_PER_STEP} attempts"
                    ) from e
                time.sleep(retry_delay(attempt))
    return result
```
The wall-clock deadline is the line that matters when a vendor’s overloaded_error rate spikes for 30-40 minutes (which Anthropic publicly logged on the status page in February 2026). Without it, agent turns sit in the retry loop for hours.
Step 4: idempotency keys on every tool call
Every tool call that mutates state gets an idempotency key. The key is f"{turn_id}:{step_i}", stable across attempts; including the attempt counter would mint a fresh key per retry and defeat the dedupe. The database, the payment API, and the external email sender all accept the same key twice and commit once. Without an idempotency story on the tools, do not retry state-changing tool calls; fail the turn and let a human retry.
This is the policy most often skipped in early agent deployments and the one with the worst failure mode. The recurring r/SaaS thread on duplicate Stripe charges from agent loops cites duplicate-rate numbers in the 0.2-0.5% range of retried turns; the Stripe idempotency docs show how to do it for one common surface, and the same pattern applies to internal mutators.
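One way to wire the key in, sketched with an in-memory dict standing in for whatever the real dedupe store is (a database unique index, Stripe's Idempotency-Key header, and so on); the `idempotent_call` helper and the key shape are assumptions for illustration:

```python
_committed: dict[str, object] = {}  # stands in for a durable store

def idempotent_call(key: str, mutate):
    """Commit a state-changing tool call at most once per key.

    The key must be stable across retries of the same step, so it
    excludes the attempt counter.
    """
    if key in _committed:
        return _committed[key]  # retried step: replay the recorded result
    result = mutate()
    _committed[key] = result
    return result

# Usage: a retried step presents the same key and commits once.
calls = []
key = "turn-42:3"  # f"{turn_id}:{step_i}"
idempotent_call(key, lambda: calls.append("charge") or "ok")
idempotent_call(key, lambda: calls.append("charge") or "ok")
assert calls == ["charge"]  # the mutation ran exactly once
```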
The list of errors never to retry
- 401, 403. Auth issues are not transient.
- 400 with a schema error. The next call is the same call.
- Content-policy refusals. The model did not like the input. It will not like it in 20 seconds.
- Parse errors on model output. Retrying with the same prompt is a gamble; retrying with a fixed “you returned invalid JSON, correct it” prompt is a different policy, and it gets used at most once per turn.
- Context-length exceeded. Shorten the context before calling again, or fail the turn.
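The one-shot JSON-repair policy from the parse-error bullet can be sketched as follows; `complete` is an assumed stand-in for the model call, and the repair prompt wording is illustrative:

```python
import json

def parse_with_one_repair(complete, prompt: str) -> dict:
    """Parse model output as JSON; on failure, retry exactly once
    with a repair prompt, then fail the turn."""
    raw = complete(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # One repair attempt per turn, with a *different* prompt -- this is
    # a distinct policy from blind retry, which is a gamble.
    repaired = complete(
        f"You returned invalid JSON:\n{raw}\nReturn only the corrected JSON."
    )
    return json.loads(repaired)  # still invalid? let it raise: fail the turn
```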
What the threads are saying
The OpenAI rate-limits guide recommends jittered exponential backoff explicitly. The Anthropic docs document overloaded_error as retryable with exponential backoff. The Vertex AI docs give similar retry guidance but separate out the quota-exhausted class; retry those only if the quota has been raised.
On r/LocalLLaMA, the most-quoted policy across self-hosters is three attempts with jitter and a hard wall-clock. Several Hacker News threads on LLM agents in production converged on “the retry policy is 80% of the reliability story”. On GitHub, the most thoughtful open-source reference for a bounded-retry agent loop is the instructor library; the from_tenacity path is the place to start reading.
Related
The bounded-planner prompt is the partner to this policy; it tells the model to respect the step budget. The scores for each of the four major vendors on the bounded-budget task are on the Claude Opus 4.7 review, the GPT-5.3-Codex review, and the Gemini 3.1 Pro review.
Verdict
Classify errors before retrying. Use full-jitter exponential backoff. Wire a wall-clock deadline through the agent step budget. Put idempotency keys on every state-changing tool call. Retry nothing on the never-retry list. The number to beat in the TCC editorial fixture is 0.2% unretryable errors at 180k turns a day; everything else is a sign the policy was paid for, not designed.