GUIDE · APR 23, 2026 · ADVANCED · AGENTS · TOOL-USE · v1.4 · reviewed by 3

Agent loops and retries: 4-step policy that cuts 429s 30x

The retry policy that cut my agent-loop 429 rate from 6% to 0.2% across 4 vendors. Jitter, step-budget interlock, and the one thing you should never retry.
Ryan Calloway, Staff contributor
Peer-reviewed by 3 reviewers
Reproduces in 4a2f1e
Peer score 9.4 · 6 min read · 3,214 reads

The question of how to write a retry policy for an LLM agent loop sounds simple until you watch a single flaky API call turn into $50 of wasted tokens in 30 seconds. That is the retry storm, and it is the failure mode most agent developers hit in their first production incident. This guide covers the full policy: error classification, jittered backoff, the circuit breaker pattern, loop detection, idempotency keys, and the never-retry list. On the TCC editorial fixture (180k turns/day across four vendors), this brings the unretryable-error rate from a naive baseline of 6.1% down to 0.2% at a 1.08x latency cost.

Why agent retries are fundamentally more expensive than microservice retries

In a traditional microservice, a retry costs one HTTP round-trip. In an agent system, every retry resubmits the full conversation context to the LLM. A retry loop that fires 10 requests does not just waste 10 HTTP calls; it pays for the full context window 10 times over.

Consider a concrete example. An agent with 8,000 tokens of conversation context calls a tool that returns a 429. With a naive 3-retry policy: that is 4 total LLM calls (original plus 3 retries), each sending the full 8,000-token context. 32,000 input tokens consumed to achieve nothing. With a circuit breaker and a proper budget: 1 LLM call, then immediate failure. One team measured the difference during an API outage and found uncontrolled retries consumed roughly $2 in tokens over a 30-second window for a single conversation, while circuit-breaking reduced that to about $0.01. A 200x cost difference from one architectural choice.
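The arithmetic in that example can be sketched in a few lines. The function names and the per-million-token price parameter are illustrative assumptions, not part of the TCC fixture, and the sketch assumes the context does not grow between attempts (in practice, accumulated failure logs make it grow).

```python
def wasted_input_tokens(context_tokens: int, retries: int) -> int:
    """Input tokens consumed when the original call and every retry all fail.

    Assumes each attempt resubmits the same full context.
    """
    attempts = 1 + retries  # original call plus retries
    return context_tokens * attempts


def wasted_cost_usd(tokens: int, usd_per_million_input_tokens: float) -> float:
    """Convert tokens to dollars at a hypothetical input price."""
    return tokens / 1_000_000 * usd_per_million_input_tokens


naive = wasted_input_tokens(context_tokens=8_000, retries=3)
fail_fast = wasted_input_tokens(context_tokens=8_000, retries=0)
print(naive, fail_fast)  # 32000 8000
```

Plug in your own vendor's input price to turn the token counts into a per-incident dollar figure.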

This changes the calculus entirely. Every retry is a cost multiplier you cannot afford to ignore.

The three amplification patterns

Retry storms in agent systems follow three distinct patterns. Each requires a different defense.

Vertical amplification is the simple case: one tool call fails, the agent retries it. The agent-specific twist is that the LLM itself may decide to retry after seeing the failure response in its context. You now have a shadow retry loop operating independently of the retry logic you built into the tool layer. The context accumulates failure logs, increasing token cost on every subsequent call.

Horizontal amplification happens in sequential workflows where step N depends on step N-1. A failure at step 1 cascades. If each step has a 3-retry policy and you have 5 dependent steps, a step-1 failure can theoretically produce 3^5 = 243 total retry attempts. In practice timeouts prevent the full explosion, but 10-50x amplification is common.

Recursive amplification occurs in multi-agent architectures where agents delegate to sub-agents. Agent A calls Agent B, which calls Agent C. Agent C’s tool fails and Agent B retries its entire workflow. Agent A retries its workflow. The retry decision happens at a different layer of abstraction than the failure. This is the pattern traditional circuit breakers handle worst.
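One way to see where a figure like 3^5 = 243 comes from is to model the worst case as nested layers, each retrying the layer below it while the bottom dependency stays down. This is a toy model under that assumption, not a trace from a real system; `count_leaf_calls` is a hypothetical helper.

```python
def count_leaf_calls(layers: int, attempts_per_layer: int) -> int:
    """Count calls that reach an always-failing leaf when every layer retries the one below."""
    calls = 0

    def run(layer: int) -> bool:
        nonlocal calls
        if layer == 0:
            calls += 1
            return False  # the leaf dependency is down: every call fails
        for _ in range(attempts_per_layer):
            if run(layer - 1):
                return True
        return False  # all attempts at this layer failed; the caller may retry us

    run(layers)
    return calls


print(count_leaf_calls(layers=5, attempts_per_layer=3))  # 243
```

Setting `attempts_per_layer=1` everywhere except one layer collapses the count to that single layer's attempts, which is the argument for retrying at exactly one level of the stack.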

Step 1: classify before you retry

Every failure goes into one of four buckets. The bucket decides whether you retry at all.

| Bucket | Examples | Retryable? | Backoff |
| --- | --- | --- | --- |
| Transient rate limit | 429, overloaded_error, Vertex 429 | Yes | Jittered exponential; respect Retry-After |
| Server error | 500, 502, 503, 504, internal_server_error | Yes | Jittered exponential, max 3 attempts |
| Tool error, recoverable | Shell exit 1, HTTP 503 from internal tool | Yes, with replanning | 0s delay; let the model replan |
| Permanent | 401, 403, 400 schema error, content_policy, parse error | No | Fail fast |

Most of the damage from over-retrying comes from retrying permanent errors. A 400 on the schema will still be a 400 on the next attempt. Fail the turn, log the input, and move on. Always check the Retry-After header when the API provides one; Anthropic documents a retry-after header on its 429 rate-limit responses, and ignoring it adds load to an already degraded service.
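A minimal sketch of the four-bucket classifier follows. The `Bucket` enum, the `classify` signature, and the choice to treat anything unrecognized as permanent are assumptions made for illustration; the status codes and error-type strings come from the table above, and real SDKs expose these fields under their own names.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    RATE_LIMIT = "transient_rate_limit"
    SERVER_ERROR = "server_error"
    TOOL_RECOVERABLE = "tool_error_recoverable"
    PERMANENT = "permanent"


# Error-type strings from the classification table; extend per vendor.
RATE_LIMIT_TYPES = {"overloaded_error"}
SERVER_ERROR_TYPES = {"internal_server_error"}
PERMANENT_TYPES = {"content_policy", "parse_error"}


def classify(status: Optional[int] = None,
             error_type: Optional[str] = None,
             from_tool: bool = False) -> Bucket:
    # Permanent errors are checked first so they are never retried, even from a tool.
    if status in (400, 401, 403) or error_type in PERMANENT_TYPES:
        return Bucket.PERMANENT
    if from_tool:
        # Shell failures and internal-tool HTTP errors go back to the model to replan.
        return Bucket.TOOL_RECOVERABLE
    if status == 429 or error_type in RATE_LIMIT_TYPES:
        return Bucket.RATE_LIMIT
    if status in (500, 502, 503, 504) or error_type in SERVER_ERROR_TYPES:
        return Bucket.SERVER_ERROR
    # Assumption: unknown failures fail closed rather than retry blindly.
    return Bucket.PERMANENT


def is_retryable(bucket: Bucket) -> bool:
    return bucket is not Bucket.PERMANENT
```

Failing closed on unknown errors trades a few missed recoveries for a guarantee that a new, unclassified error cannot start a retry storm.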

Step 2: jittered exponential backoff

import asyncio
import random

# RateLimitError and ServerError stand in for your SDK's exception classes.

def retry_delay(attempt: int, base: float = 0.4, cap: float = 20.0) -> float:
    """Full-jitter exponential backoff. attempt is 1-indexed."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

async def call_with_backoff(func, max_retries: int = 3):
    for attempt in range(1, max_retries + 1):
        try:
            return await func()
        except RateLimitError as e:
            if attempt == max_retries:
                raise  # out of attempts; never fall through and return None
            # Respect Retry-After if present, otherwise use jittered backoff
            retry_after = getattr(e, "retry_after", None)
            await asyncio.sleep(retry_after if retry_after else retry_delay(attempt))
        except ServerError:
            if attempt < max_retries:
                await asyncio.sleep(retry_delay(attempt))
            else:
                raise

Full jitter beats fixed delay and exponential-without-jitter by a wide margin under rate-limit pressure. Fixed 1-second retries create thundering herds on the vendor's side. Full jitter spreads the retry traffic across the whole backoff window and lets the vendor's limiter drain. AWS research shows this reduces retry storms by 60-80%. On the TCC fixture, the switch from fixed to jittered cut the 429 rate from 3.8% to 0.9%.
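The backoff code above reads a `retry_after` attribute from the exception. When only the raw header is available, it has to be parsed first: HTTP allows Retry-After to be either a number of seconds or an HTTP-date. The sketch below handles both forms using only the standard library; `parse_retry_after` is a hypothetical helper name, and how your SDK exposes response headers is left as an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return seconds to wait, or None if the header is absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    try:
        seconds = float(value)  # delay-seconds form, e.g. "7"
    except ValueError:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    return max(0.0, seconds)  # a date in the past means "retry now"
```

If the returned wait exceeds the remaining wall-clock budget for the turn, giving up is usually better than retrying earlier than the server asked.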

Step 3: the circuit breaker

A circuit breaker stops requests to a provider when it is clearly failing. This is the single highest-value addition beyond basic backoff, and the one most agent frameworks omit.

from datetime import datetime, timedelta
from dataclasses import dataclass
from typing import Literal

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: int = 60
    failures: int = 0
    last_failure: datetime | None = None
    state: Literal["closed", "open", "half-open"] = "closed"

    def can_execute(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            if self.last_failure and datetime.now() - self.last_failure > timedelta(seconds=self.reset_timeout):
                self.state = "half-open"
                return True
            return False
        return True  # half-open: requests pass until the next success or failure is recorded

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        self.last_failure = datetime.now()
        if self.failures >= self.failure_threshold:
            self.state = "open"

# One circuit breaker per provider
breakers = {
    "openai": CircuitBreaker(),
    "anthropic": CircuitBreaker(),
    "google": CircuitBreaker(),
}

When a provider fails 5 times in a row, the circuit opens and all requests skip that provider for 60 seconds. This prevents wasting tokens on a provider that is clearly down and lets you route to a fallback model. The Anthropic status page logged a 30-40 minute overloaded_error spike in February 2026; without a circuit breaker and wall-clock deadline, agent turns sit in the retry loop the entire time.
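The breaker shown earlier admits every caller once it reaches half-open, which under high concurrency can send a burst at a provider that is still recovering. A variant that admits exactly one probe is sketched below. `ProbingBreaker`, its fields, and the injectable `clock` are hypothetical names introduced here for illustration; the sketch is not thread-safe and would need a lock in a multi-threaded caller.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbingBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    clock: Callable[[], float] = time.monotonic  # injectable so tests can control time
    failures: int = 0
    opened_at: Optional[float] = None  # None means the circuit is closed
    probe_in_flight: bool = False

    def can_execute(self) -> bool:
        if self.opened_at is None:
            return True  # closed
        if self.clock() - self.opened_at < self.reset_timeout:
            return False  # open: still cooling down
        if self.probe_in_flight:
            return False  # half-open: a probe is already out
        self.probe_in_flight = True
        return True  # half-open: admit exactly one probe

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
        self.probe_in_flight = False

    def record_failure(self) -> None:
        self.failures += 1
        self.probe_in_flight = False
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart the cooldown
```

Using a monotonic clock instead of `datetime.now()` also keeps the cooldown correct across system clock adjustments.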

Step 4: model fallback chain

When the primary model is unavailable, fall back to alternatives rather than waiting for recovery:

MODEL_CHAIN = [
    {"model": "claude-opus-4-7", "provider": "anthropic"},
    {"model": "gpt-5.3-codex", "provider": "openai"},
    {"model": "gemini-3.1-pro", "provider": "google"},
]

async def run_with_fallback(messages: list[dict]) -> str:
    for model_config in MODEL_CHAIN:
        provider = model_config["provider"]
        breaker = breakers[provider]
        if not breaker.can_execute():
            continue
        try:
            result = await call_model(model_config["model"], messages)
            breaker.record_success()
            return result
        except (RateLimitError, ServerError) as e:
            breaker.record_failure()
            continue
    raise AllModelsFailed("No available models in fallback chain")

Step 5: the step-budget interlock and wall-clock deadline

An agent has a step budget (say, 8). A retry inside a step consumes that step's patience, not the outer budget. If a step has retried twice and still fails, the agent should not reach step 8 through luck. It should exit early with a clean give-up.

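import time

MAX_ATTEMPTS_PER_STEP = 3
MAX_STEPS = 8
TOTAL_RETRY_WALL_SECONDS = 90

def run_agent(turn):
    deadline = time.time() + TOTAL_RETRY_WALL_SECONDS
    result = None
    for step_i in range(MAX_STEPS):
        for attempt in range(1, MAX_ATTEMPTS_PER_STEP + 1):
            try:
                # Assumes do_step returns the step's result
                result = do_step(turn, step_i)
                break  # step succeeded; move on to the next step
            except TransientError as e:
                if time.time() > deadline:
                    raise GiveUp("wall-clock deadline exceeded") from e
                if attempt < MAX_ATTEMPTS_PER_STEP:
                    time.sleep(retry_delay(attempt))  # no sleep after the final attempt
            except PermanentError:
                raise
        else:
            # Inner loop finished without a break: every attempt failed
            raise GiveUp(f"step {step_i} failed after {MAX_ATTEMPTS_PER_STEP} attempts")
    return result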

The wall-clock deadline is the line that matters when a vendor outage lasts 30-40 minutes. Without it, agent turns sit in the retry loop for hours burning tokens and holding concurrency slots.

Step 6: loop detection

The most expensive failure mode is an agent that calls the same tool repeatedly without making progress. Standard step budgets do not catch this: the budget counts steps but never compares call signatures, so repeated identical calls are indistinguishable from real progress until the budget runs out.

MAX_CONSECUTIVE_SAME_TOOL = 3
MAX_TOTAL_TOOL_CALLS = 15

class LoopDetector:
    def __init__(self):
        self.tool_history: list[str] = []

    def check(self, tool_name: str, tool_args: dict) -> bool:
        """Returns False if a loop is detected."""
        call_signature = f"{tool_name}:{sorted(tool_args.items())}"
        self.tool_history.append(call_signature)

        # Exact repetition check
        if len(self.tool_history) >= MAX_CONSECUTIVE_SAME_TOOL:
            recent = self.tool_history[-MAX_CONSECUTIVE_SAME_TOOL:]
            if len(set(recent)) == 1:
                return False

        # Total call budget
        return len(self.tool_history) <= MAX_TOTAL_TOOL_CALLS

When a loop is detected, interrupt the agent: "You have called the same tool 3 times with the same arguments. The approach is not working. Try a different strategy." This is better than letting the turn time out.
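The signature in the detector above sorts only the top-level argument keys, so two calls whose nested arguments differ only in key order would produce different signatures and slip past the exact-repetition check. A canonical signature built from sorted-key JSON is one way to close that gap. `call_signature` is a hypothetical helper, and the sketch assumes tool arguments are JSON-like (anything else is stringified via `default=str`).

```python
import hashlib
import json


def call_signature(tool_name: str, tool_args: dict) -> str:
    """Stable signature for a tool call, independent of dict key order at any depth."""
    canonical = json.dumps(tool_args, sort_keys=True, separators=(",", ":"), default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{tool_name}:{digest}"


a = call_signature("search", {"q": "retry storm", "opts": {"limit": 5, "lang": "en"}})
b = call_signature("search", {"opts": {"lang": "en", "limit": 5}, "q": "retry storm"})
print(a == b)  # True
```

Hashing keeps the history compact when arguments are large, at the cost of not being able to read the arguments back out of the log.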

Step 7: idempotency keys on every state-changing tool call

Every tool call that mutates state gets an idempotency key. The key is f"{turn_id}:{step_i}", and it must stay the same across retries of that step: putting the attempt number in the key gives every retry a fresh key and defeats deduplication. The database, the payment API, and the external email sender all accept the same key twice and commit once. Without an idempotency story on the tools, do not retry state-changing calls at all. Fail the turn and let a human retry.

This is the policy most often skipped in early agent deployments and the one with the worst failure mode. The recurring r/SaaS thread on duplicate Stripe charges from agent loops cites duplicate rates in the 0.2-0.5% range of retried turns. The Stripe idempotency docs show how to implement this for one common surface; the same pattern applies to every internal mutator.
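For internal mutators that have no built-in idempotency support, the same contract can be approximated with a keyed result store. The sketch below is an in-memory stand-in under stated assumptions: `idempotency_key`, `IdempotentExecutor`, and the `call_index` parameter are hypothetical, and a production version would need a durable store and protection against concurrent execution of the same key. The sketch keeps the attempt number out of the key so that every retry of the same call presents the same key.

```python
from typing import Any, Callable, Dict


def idempotency_key(turn_id: str, step_i: int, call_index: int = 0) -> str:
    """Stable across retries: nothing attempt-specific goes into the key."""
    return f"{turn_id}:{step_i}:{call_index}"


class IdempotentExecutor:
    """In-memory stand-in for the dedupe a real mutator performs server-side."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, mutate: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # replay: return the stored result, do not re-run
        result = mutate()
        self._results[key] = result  # recorded only after the mutation succeeds
        return result
```

A retried step calls `execute` with the same key and gets the stored result back, so the side effect happens once no matter how many attempts the retry layer makes.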

The never-retry list

Monitoring: alert on leading indicators

Track errors to catch systemic issues before they become incidents. Alert on these four signals, not just raw error rate:

The goal is proportional failure cost. When something breaks, the system should spend less effort failing than it would have spent succeeding. If the retry logic can burn 200x the tokens of a successful request, the retry logic is the bug, not the flaky API it is retrying against.

Provider-specific notes

The OpenAI rate-limits guide recommends jittered exponential backoff explicitly and separates RPM, TPM, RPD, and TPD limits at the org and project level. The Anthropic docs document overloaded_error as retryable with exponential backoff, with separate input (ITPM) and output (OTPM) token limits. Vertex AI separates the quota-exhausted class from transient 429s; retry the latter, not the former unless the quota has been raised.

The full policy, summarized

| Layer | What to implement | Why it matters |
| --- | --- | --- |
| Error classification | 4-bucket taxonomy | Stops retrying permanent errors |
| Backoff | Full-jitter exponential, respect Retry-After | Cuts 429 rate by ~75% |
| Circuit breaker | Per-provider, trip at 5 failures | 200x token cost reduction during outages |
| Model fallback | 3-provider chain | Availability during vendor downtime |
| Step-budget interlock | Wall-clock + per-step limits | Prevents hour-long retry loops |
| Loop detection | Same-call signature tracking | Catches stuck agents |
| Idempotency keys | turn_id:step key, stable across attempts, on all mutators | Prevents duplicate charges and writes |

Frequently asked questions

Should I retry on network timeouts? Yes, but only once, with a fresh connection. If the first attempt timed out, the server may have received the request and still be processing it. Use an idempotency key so a duplicate server-side execution is a no-op. If the timeout is consistent (every attempt times out), classify it as a persistent issue and fail the turn.

How many max retries is right? Three per step is the number the TCC fixture converged on. More than three retries on a single step almost never helps: if a 429 has not resolved after 3 attempts with jittered backoff, the vendor's limiter is backed up and more retries accelerate the problem. Fail the step, record the give-up marker, and let the orchestrator decide whether to requeue the turn later.

What about streaming responses? Streaming adds a failure mode: the stream starts but drops mid-response. Treat a mid-stream drop as a transient server error (same as 503). Retry with the full prompt, not a partial one. If the model was mid-sentence when the stream dropped, the next attempt starts fresh anyway.

Do I need every layer? The circuit breaker and fallback chain are optional if you are running on a single provider in a low-volume context. Everything else (classification, jittered backoff, wall-clock deadline, idempotency keys) is table stakes for any production agent, regardless of scale. Start with those four and add the circuit breaker when you have multi-provider routing.

The bounded-planner prompt is the partner to this policy; it tells the model to respect the step budget. The scores for each major vendor on the bounded-budget task are on the Claude Opus 4.7 review, the GPT-5.3-Codex review, and the Gemini 3.1 Pro review.

Verdict

Classify errors before retrying. Use full-jitter exponential backoff and respect Retry-After headers. Add a circuit breaker per provider. Wire a model fallback chain. Set a wall-clock deadline through the agent step budget. Detect loops by call signature. Put idempotency keys on every state-changing tool call. Retry nothing on the never-retry list. The number to beat on the TCC editorial fixture is 0.2% unretryable errors at 180k turns a day. Every point above that is a policy gap, not a vendor problem.
