The bounded-budget task is the most-flagged agent failure mode on the recurring r/LocalLLaMA and r/ChatGPTCoding “agents that respect a step budget” threads. The pattern is always the same: the task needs 5 steps, the budget is 4, and the agent should exit at step 4 with a structured give-up message rather than keep trying to finish at step 5. On a bare default prompt, most frontier models fail this in 3 of 5 runs. The prompt below moves the clean-exit rate to 5 of 5 on Claude Opus 4.7 and 4 of 5 on GPT-5.3-Codex (at reasoning_effort=high) on the TCC editorial fixture (median of 5 runs).
Why agents ignore step budgets by default
LLMs are trained on data where trying again is rewarded and stopping early is penalized. That bias runs deep. When an agent hits its budget without completing the task, the model’s default behaviour is to merge two steps into one, skip a verification, or quietly exceed the limit by one turn. It is not misbehaving. It is doing exactly what the training signal rewarded. You have to explicitly give it a second reward path.
The other structural problem is what engineers call the “tail cost”. The most expensive tokens in any agent run are the last 3-4 turns, where the context is largest. An agent that keeps trying past its budget does not just waste a little compute: it wastes the most expensive compute. Budget enforcement at the plan stage is the cheapest intervention.
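To make the tail-cost point concrete, here is a rough back-of-the-envelope sketch. The numbers (a 2,000-token starting context that grows by 1,500 tokens per turn) are illustrative assumptions, not measurements from the TCC fixture, and both helper functions are hypothetical names introduced for this example.

```python
def input_tokens_per_turn(base: int, growth: int, turns: int) -> list[int]:
    """Input tokens sent on each turn if the context grows linearly."""
    return [base + growth * t for t in range(turns)]


def tail_share(tokens: list[int], tail: int) -> float:
    """Fraction of all input tokens spent on the last `tail` turns."""
    return sum(tokens[-tail:]) / sum(tokens)


# Assumed numbers, for illustration only.
per_turn = input_tokens_per_turn(base=2000, growth=1500, turns=8)
share = tail_share(per_turn, tail=3)
print(per_turn)
print(f"last 3 of 8 turns: {share:.0%} of input tokens")
```

Under these assumptions the last three turns account for more than half of all input tokens, which is why an exit at planning time is so much cheaper than an overrun at the end.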
The prompt
You are a planner. You will be given a task and a step budget. Follow this loop.
1. Read the task. Write a plan as a numbered list. Each item is one step.
2. If the plan needs more steps than the budget, do not shrink the plan. Instead:
- Emit exactly: `GIVE_UP: plan requires N steps, budget is M.`
- Replace N and M with integers. Do nothing else.
3. If the plan fits in the budget, execute step 1. Report the result.
4. Continue, one step per turn, until the task is done OR you hit the budget.
5. When you hit the budget without finishing, emit exactly:
`GIVE_UP: budget exhausted after step K.`
Replace K with the last completed step number. Do nothing else.
Rules:
- Never silently exceed the budget.
- Never merge two steps into one to fit the budget.
- The give-up message is a success case. It is not a failure. Emit it cleanly.
Example: prompt in action
Task: “Migrate the payments table to add a currency column, backfill with ‘USD’, update all downstream queries, and run the test suite.” Budget: 3.
Turn 1 (planning):
Plan:
1. Write the migration file (add currency column, default NULL).
2. Backfill existing rows to 'USD'.
3. Update downstream query files.
4. Add NOT NULL constraint after backfill.
5. Run test suite.
GIVE_UP: plan requires 5 steps, budget is 3.
The agent exits cleanly. No code was written, no steps were started. The harness receives GIVE_UP: plan requires 5 steps, budget is 3. and routes it to a human.
Compare to the default prompt: without the bounded planner, the model starts writing the migration file, gets to step 3, then writes “finishing the remaining steps in one pass” and merges steps 4 and 5 into a single turn that skips the test run. The bug ships.
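Routing the give-up message from the example is easier if the harness turns it into structured fields first. The sketch below parses the two exit messages defined in the prompt; the function name `parse_give_up` and the returned dictionary keys are assumptions made for illustration.

```python
import re
from typing import Optional

# The two exit messages defined in the prompt.
OVER_BUDGET_RE = re.compile(
    r"^GIVE_UP: plan requires (\d+) steps, budget is (\d+)\.", re.MULTILINE
)
EXHAUSTED_RE = re.compile(
    r"^GIVE_UP: budget exhausted after step (\d+)\.", re.MULTILINE
)


def parse_give_up(reply: str) -> Optional[dict]:
    """Return structured fields for a GIVE_UP message, or None if absent."""
    m = OVER_BUDGET_RE.search(reply)
    if m:
        return {
            "kind": "over_budget",
            "needed": int(m.group(1)),
            "budget": int(m.group(2)),
        }
    m = EXHAUSTED_RE.search(reply)
    if m:
        return {"kind": "exhausted", "last_step": int(m.group(1))}
    return None


reply = (
    "Plan:\n"
    "1. Write the migration file.\n"
    "GIVE_UP: plan requires 5 steps, budget is 3."
)
print(parse_give_up(reply))
```

A prose apology such as “I was unable to complete this” returns None here, so the harness can treat it as a different case from a clean exit.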
Why it works, in 5 bullets
- It reframes the exit as a success. Models default to “finish the task” because the training signal rewards completion. Telling the model “give-up is a success case” gives it a second reward path that matches the budget constraint.
- It forces an exact-token exit marker. `GIVE_UP:` is a stable prefix your harness can regex-match without an LLM-as-judge. That removes the ambiguous case where the model says “I was unable to complete this” in prose and the grader cannot tell whether it gave up or just got tired.
- It bans merging two steps into one. Without this rule, every model will fold steps to fit the budget. That is the exact failure mode: the model finishes the task in N-1 steps by cutting a corner, and the shortcut is where the bug lives.
- It separates planning from execution. The plan is the checkpoint. If the plan does not fit, exit before you start. That is the cheapest way to avoid the expensive alternative: discovering at step 3 that the task needs 8.
- Numbered-list output matches the model’s strength. Frontier models are better at emitting numbered lists than free-form plans. The harness can parse the plan out of the first turn and reject if it does not look like a numbered list.
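The last bullet says the harness can parse the plan out of the first turn and reject anything that is not a numbered list. A minimal sketch of that check follows, assuming one step per line in the form `1. text`; `parse_plan` and `fits_budget` are hypothetical helper names, and the parser would need tightening if the first turn also contains numbered lines that are not plan steps.

```python
import re

# One plan step per line: "<number>. <text>"
STEP_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)\s*$")


def parse_plan(first_turn: str) -> list[str]:
    """Extract numbered steps; raise ValueError unless they run 1, 2, 3, ..."""
    steps: list[str] = []
    for line in first_turn.splitlines():
        m = STEP_RE.match(line)
        if not m:
            continue  # skip headings, blank lines, and the GIVE_UP line
        number, text = int(m.group(1)), m.group(2)
        if number != len(steps) + 1:
            raise ValueError(f"expected step {len(steps) + 1}, got {number}")
        steps.append(text)
    if not steps:
        raise ValueError("no numbered list found in plan")
    return steps


def fits_budget(steps: list[str], budget: int) -> bool:
    """True if the plan can run without exceeding the step budget."""
    return len(steps) <= budget


plan = """Plan:
1. Write the migration file (add currency column, default NULL).
2. Backfill existing rows to 'USD'.
3. Update downstream query files.
4. Add NOT NULL constraint after backfill.
5. Run test suite.
GIVE_UP: plan requires 5 steps, budget is 3."""

steps = parse_plan(plan)
print(len(steps), fits_budget(steps, budget=3))
```

Counting the parsed steps in the harness also gives an independent check on the N the model reports in its give-up message.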
Wiring the GIVE_UP signal in Python
import re

# Matches the exit marker at the start of any line in the reply.
GIVE_UP_RE = re.compile(r"^GIVE_UP:", re.MULTILINE)

# Placeholder: paste the full planner prompt from the section above.
BOUNDED_PLANNER_PROMPT = "..."


class BoundedPlanner:
    def __init__(self, llm_client, max_turns: int = 5):
        self.client = llm_client
        self.max_turns = max_turns
        self.turns = 0

    def run(self, task: str) -> dict:
        self.turns = 0  # reset so the same instance can be reused
        content = ""    # stays empty only if max_turns is 0
        messages = [
            {"role": "system", "content": BOUNDED_PLANNER_PROMPT},
            {"role": "user", "content": f"Task: {task}\nBudget: {self.max_turns} steps."},
        ]
        while self.turns < self.max_turns:
            # Assumes a client whose chat() returns an OpenAI-style response object.
            response = self.client.chat(messages)
            content = response.choices[0].message.content
            self.turns += 1
            if GIVE_UP_RE.search(content):
                return {"status": "GIVE_UP", "turn": self.turns, "message": content}
            # The prompt above does not define TASK_COMPLETE; add a line asking
            # the model to emit it when the task is done.
            if "TASK_COMPLETE" in content:
                return {"status": "COMPLETE", "turn": self.turns, "message": content}
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": "Continue. Next step only."})
        # Hard cap: force exit if the model never emitted GIVE_UP
        return {"status": "BUDGET_EXCEEDED", "turn": self.turns, "message": content}
The hard cap at max_turns is the safety net for the case where the model never emits GIVE_UP. On Gemini 3.1 Pro (2 of 5 clean exits on the TCC fixture), you will hit this path. Never assume the prompt alone is sufficient enforcement.
Tier-based budget recommendations
The step budget should map to your user tier, not just the task complexity. A budget that is too generous on a free tier burns token cost for marginal output quality.
- Free tier: max_turns=3. Forces early planning exits and discourages complex tasks on the free plan.
- Pro tier: max_turns=8. Covers most engineering tasks end-to-end.
- Enterprise / API: max_turns=15-20. Reserved for multi-file refactors and migration chains.
At max_turns=3, the bounded planner rejects tasks that need more than 3 steps at planning time, which avoids the worst case: a task that runs 3 turns and exits mid-way, leaving the codebase in a partial state.
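One way to wire the tiers into a harness is a plain lookup table. This is only a sketch: the tier keys and the `max_turns_for` helper are hypothetical, and the enterprise value takes the lower end of the 15-20 range as an assumption.

```python
# Budgets from the tier list above; keys and helper name are illustrative.
TIER_MAX_TURNS = {
    "free": 3,
    "pro": 8,
    "enterprise": 15,  # assumed: lower end of the 15-20 range
}


def max_turns_for(tier: str) -> int:
    """Look up the step budget for a tier; unknown tiers get the strictest one."""
    return TIER_MAX_TURNS.get(tier, TIER_MAX_TURNS["free"])


print(max_turns_for("pro"))
```

Falling back to the free-tier budget for unknown tiers keeps a misconfigured caller from getting the most expensive limit by accident.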
Failure modes
- Budget exhausted on the plan itself. If the task is vague and the plan runs long, step 1 never finishes. Pair this prompt with a `max_output_tokens` cap so the plan cannot eat the entire budget.
- Model emits "GIVE_UP" inside a code block. Claude sometimes wraps the marker in backticks, which breaks naive regex. The harness should match `^GIVE_UP:` at line start with multiline mode, or strip code fences before matching.
- Model tries to finish at step N+1 anyway. Seen on GPT-5.3-Codex at `reasoning_effort=medium` roughly 1 in 5 runs in the TCC fixture. Fix: set `reasoning_effort=high` for the final exit step, or add an explicit "step K+1 is not allowed" line to the prompt.
- Plan is valid but the first execution step stalls. Some tasks have plans that fit the budget but steps that each require many tokens. Cap per-step output separately from the total turn count.
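For the code-block failure mode, a line-start match alone does not catch a marker wrapped in inline backticks. The sketch below does both: it drops fence lines and strips stray backticks before matching. The helper names `normalize` and `gave_up` are assumptions for illustration, not part of the harness shown earlier.

```python
import re

GIVE_UP_RE = re.compile(r"^GIVE_UP:", re.MULTILINE)
# A fence line: optional indentation, three backticks, optional language tag.
FENCE_LINE_RE = re.compile(r"^[ \t]*`{3}.*$", re.MULTILINE)


def normalize(reply: str) -> str:
    """Drop code-fence lines, then strip whitespace and backticks per line."""
    without_fences = FENCE_LINE_RE.sub("", reply)
    return "\n".join(
        line.strip().strip("`").strip() for line in without_fences.splitlines()
    )


def gave_up(reply: str) -> bool:
    """True if any line of the cleaned reply starts with the exit marker."""
    return GIVE_UP_RE.search(normalize(reply)) is not None


fence = "`" * 3
fenced = f"{fence}\nGIVE_UP: budget exhausted after step 2.\n{fence}"
inline = "`GIVE_UP: plan requires 5 steps, budget is 3.`"
print(gave_up(fenced), gave_up(inline))
```

The marker still has to start a line after cleaning, so a sentence that merely mentions GIVE_UP in the middle of prose does not trigger the exit.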
Tested on (TCC editorial scoring)
- Claude Opus 4.7 at `adaptive thinking, effort=high`: 5 of 5 clean exits on the bounded-budget task.
- Claude Sonnet 4.6: 4 of 5 clean exits.
- GPT-5.3-Codex at `reasoning_effort=medium`: 3 of 5; at `high`: 4 of 5.
- GPT-5.4 at `reasoning_effort=medium`: 4 of 5.
- Gemini 3.1 Pro at auto thinking budget: 2 of 5. Requires the hard-cap governor fallback.
Methodology and full per-task scoring are on the 14-task editorial scorecard. The pattern matches what the recurring "agents that respect a step budget" threads on r/LocalLLaMA report: Anthropic models lead, OpenAI comes second, and Gemini lags on tool-budget compliance.
Frequently asked questions
What if the task genuinely needs more steps than the budget allows?
That is the correct output. The agent should give up, and the caller should either raise the budget or split the task into smaller pieces. A GIVE_UP on a legitimate task is not a prompt failure; it is the system working as designed.
Can I use this with LangChain or LangGraph?
Yes. Wire it as a conditional edge. If the AgentGovernor.check_termination() returns "GIVE_UP" or "MAX_TURNS_REACHED", route to a terminal node. The GIVE_UP signal is the same regardless of framework.
Does this work for tool-use agents, not just text planners?
The same structure works. Replace "steps" with "tool calls" and "plan" with "tool sequence". The GIVE_UP marker still fires before the first tool call if the sequence is over budget.
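For tool-use agents, the same check can also run in the harness as a pre-flight guard before the first tool call. This is a sketch under assumptions: `check_tool_budget` is a hypothetical helper, the tool names are made up, and the message simply reuses the marker format from the prompt.

```python
from typing import Optional


def check_tool_budget(planned_calls: list[str], budget: int) -> Optional[str]:
    """Return a GIVE_UP message if the tool sequence is over budget, else None."""
    if len(planned_calls) > budget:
        return (
            f"GIVE_UP: plan requires {len(planned_calls)} steps, "
            f"budget is {budget}."
        )
    return None


# Hypothetical tool names, for illustration only.
sequence = ["read_file", "write_migration", "run_backfill", "run_tests"]
print(check_tool_budget(sequence, budget=3))
```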
What about the case where the model counts steps wrong?
Add a step-counter to your harness. Count turns at the application layer, not just from the model's self-report. The model's step count is a hint; the harness count is the authority.
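A minimal sketch of that reconciliation, assuming the harness tracks how many steps it has seen completed and the model reports K in the budget-exhausted message. The function name and the returned keys are hypothetical.

```python
import re

EXHAUSTED_RE = re.compile(
    r"^GIVE_UP: budget exhausted after step (\d+)\.", re.MULTILINE
)


def reconcile_steps(harness_completed_steps: int, reply: str) -> dict:
    """Trust the harness count; keep the model's self-report only as a hint."""
    m = EXHAUSTED_RE.search(reply)
    reported = int(m.group(1)) if m else None
    return {
        "steps": harness_completed_steps,  # authoritative count
        "model_reported": reported,        # hint; None if no report found
        "mismatch": reported is not None and reported != harness_completed_steps,
    }


print(reconcile_steps(3, "GIVE_UP: budget exhausted after step 4."))
```

Logging the mismatch flag, rather than acting on the model's number, keeps the harness count as the single source of truth.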
Related
The retry policy that wraps this prompt in production is on the agent loop retry policy post. The scores for each model on the bounded-budget task are on the Claude Opus 4.7 review and the GPT-5.3-Codex review. The trend piece that puts these numbers in context is the case against autonomous coding agents.
One-line takeaway
Give the model a second reward path called "give-up is a success", force an exact exit marker, ban step-merging, add a hard-cap governor in your harness, and the bounded-budget task stops being the flakiest thing in your agent loop.