The recurring r/ChatGPTCoding “the model blamed the throwing frame instead of the actual bug” thread keeps surfacing the same fix: a single-shot prompt does not work, a three-turn localize-verify-fix prompt does. The prompt below moves Claude Opus 4.7 to root cause in 3 turns median on the TCC editorial debug fixture (a 42-frame production NullReferenceException trace with a planted intermediate-frame trap), with a correct fix in 4 of 5 runs. The bare “help me debug this stack trace” prompt frequently fingers an intermediate frame and writes a defensive null check that hides the bug.
Why single-shot debugging prompts fail on stack traces
A stack trace is not a bug report. It is a log of execution state at the moment an exception propagated to an unhandled frame. The frame where the exception threw is almost never where the bug lives. The bug is usually 3-8 frames earlier, at the point where a value was written incorrectly or a precondition was silently violated.
Single-shot prompts fail because they ask the model to localize, verify, and fix in the same turn. The model anchors on the throwing frame (it is the most salient piece of text), generates a hypothesis immediately, and then constructs evidence to support that hypothesis. That is confirmation bias at the inference layer. The three-turn structure breaks the pattern: localize in turn 1, force independent verification in turn 2, only then generate the fix in turn 3.
The “3 hypotheses + discriminating test” approach from the research literature formalizes this: generate three ranked hypotheses, propose one discriminating test per hypothesis, then run the tests before writing a patch.
The prompt
You are a senior engineer localizing a production bug from a stack trace.
Inputs:
- Language: <python|typescript|go|rust>
- Stack trace (full): <paste>
- Recent code changes (since last green deploy): <paste or "none">
- Reproduction: <one line, "not reproducible" is allowed>
- Source (top 10 frames): <paste the relevant function bodies>
Output the following, in order, with nothing else:
TURN 1: Localize:
- ROOT_FRAME: <file:function:line>
- CONFIDENCE: 0-100
- WHY: one sentence
TURN 2: Verify:
- EVIDENCE: three independent signals that the ROOT_FRAME is correct.
If you cannot find three, stop and say so.
- ALTERNATIVE_ROOTS: up to 2 other frames you considered, and why you rejected them.
TURN 3: Fix:
- PATCH: a minimal diff against the current code.
- TEST: one test that would have caught this bug, with the assertion that fails before the patch.
- ROLLBACK: one sentence on the fastest safe rollback plan.
Rules:
- Prefer the frame closest to the data corruption, not the frame where the exception threw.
- Do not suggest defensive null checks unless null is a legitimate domain value.
- If CONFIDENCE < 70, do not proceed to TURN 3. Ask for the missing log line instead.
Example: prompt applied to a real fixture
Stack trace (abbreviated): a NullReferenceException in PaymentService.charge() at line 412, propagated through OrderController.submit() at line 88, originating from UserRepository.find_by_session() at line 201. The planted trap: line 88 looks suspicious because it passes the result directly to the charge function without a null guard. The real bug: line 201, where find_by_session returns None on an expired session token instead of raising.
Turn 1 output (localize):
ROOT_FRAME: user_repository.py:UserRepository.find_by_session:201
CONFIDENCE: 82
WHY: find_by_session returns None on expired tokens instead of raising SessionExpiredError;
the caller at OrderController:88 was written against the contract that find_by_session
raises on invalid sessions.
Turn 2 output (verify):
EVIDENCE:
1. The docstring for find_by_session (line 196) says "raises SessionExpiredError on invalid
token" but the implementation returns None on the expiry check at line 201.
2. The commit that changed the return behaviour (commit a3f7b2) is in the recent changes
provided; it changed a raise to a return None without updating callers.
3. All other callers of find_by_session that appear in the codebase either check for None
or have a try/except for SessionExpiredError -- confirming the original contract was raise.
ALTERNATIVE_ROOTS:
- OrderController.submit:88 -- rejected. The null check here would be defensive programming,
not a fix. The contract says find_by_session raises; the fix is to restore that contract.
- PaymentService.charge:412 -- rejected. This is the throwing frame, not the origin.
Hardening charge() against None user would mask the bug in every future caller.
Turn 3 output (fix):
PATCH:
--- a/user_repository.py
+++ b/user_repository.py
@@ -198,7 +198,7 @@ class UserRepository:
def find_by_session(self, token: str) -> User:
session = self._sessions.get(token)
if session is None or session.is_expired():
- return None
+ raise SessionExpiredError(f"Session token expired or not found: {token!r}")
return session.user
TEST:
def test_find_by_session_raises_on_expired_token():
repo = UserRepository()
repo._sessions["tok"] = Session(user=User(id=1), expires_at=datetime(2020, 1, 1))
with pytest.raises(SessionExpiredError):
repo.find_by_session("tok")
ROLLBACK: Revert commit a3f7b2 with `git revert a3f7b2 --no-edit`.
Why it works, in 5 bullets
- Splits the work into three turns. Models are better at one job per turn than three jobs in one. Localize, verify, fix is the natural sequence; forcing it onto the prompt structure matches how an experienced engineer thinks through the same trace.
- “Three independent signals” blocks confirmation bias. Without it, models find one piece of evidence and commit. With it, the model has to disprove two alternative roots, which is where the real bug usually hides.
- “Frame closest to the data corruption” redirects attention. The default behaviour is to blame the throwing frame. The real bug is often three frames up, where the bad value was written. This clause moves the model to the right end of the trace, matching the standard recommendation in the Python traceback module docs.
- Bans defensive null checks as a reflex. Models love to suggest
if (x == null) return;. That pattern hides bugs instead of fixing them. The rule forces the model to either find a legitimate null-value case or diagnose the root. - Confidence threshold of 70. Below 70, the model stops and asks for the missing log line. This is how you get useful “I need more info” behaviour without an endless loop of guesses.
Supplying source for the top 10 frames
The prompt now explicitly asks for source code for the top 10 frames. Without it, the model guesses at function bodies and the EVIDENCE section is speculative. Extracting the right slices with Python’s traceback module:
import traceback
import linecache
def extract_frame_source(tb, context_lines=10) -> str:
"""Return source for each frame in a traceback, context_lines before and after."""
frames = []
for frame_info in traceback.extract_tb(tb):
filename = frame_info.filename
lineno = frame_info.lineno
start = max(1, lineno - context_lines)
end = lineno + context_lines
lines = []
for i in range(start, end + 1):
line = linecache.getline(filename, i)
marker = ">>>" if i == lineno else " "
lines.append(f"{marker} {i:4d}: {line}", )
frames.append(f"# {filename}:{lineno}n" + "".join(lines))
return "nn".join(frames[:10]) # top 10 frames only
Paste the output of extract_frame_source(sys.exc_info()[2]) into the “Source (top 10 frames)” field of the prompt.
Integration with error tracking (Sentry, Datadog)
If your stack traces come from Sentry, pull the full context automatically:
import sentry_sdk
def build_debug_prompt(sentry_event: dict) -> str:
exception = sentry_event["exception"]["values"][-1]
frames = exception["stacktrace"]["frames"][-10:]
trace_text = "n".join(
f"{f['filename']}:{f['function']}:{f['lineno']}n"
+ "n".join(f.get("context_line", ""), )
for f in frames
)
recent_changes = sentry_event.get("release", "none")
return LOCALIZE_PROMPT_TEMPLATE.format(
language="python",
stack_trace=trace_text,
recent_changes=recent_changes,
reproduction="not reproducible" if not sentry_event.get("user") else
f"User {sentry_event['user']['id']} triggered at {sentry_event['timestamp']}"
)
Failure modes
- Missing source. If the model cannot see the source file for a frame, it will guess. Supply the source for the top 10 frames as part of the input. Use the
extract_frame_sourcehelper above. - Transient bugs. Race conditions and ordering bugs defeat the single-trace prompt. Pair with a second prompt that asks for a reproduction plan, then rerun with the reproduction evidence in the context.
- Frame-number drift after a minor version bump. If the stack trace was captured on version N and you paste the current code (version N+1), the line numbers are off. The model will still identify the right function most of the time; the PATCH will need manual verification.
- CONFIDENCE never reaches 70 on minified output. Minified JavaScript or stripped binaries provide no useful frame context. Deobfuscate before sending. Source maps for JavaScript, debug symbols for compiled languages.
Tested on (TCC editorial scoring)
- Claude Opus 4.7,
adaptive thinking, effort=high: root cause in 3 turns median, correct fix in 4 of 5 runs, missed once on a race-condition variant of the fixture. - GPT-5.3-Codex,
reasoning_effort=high: root cause in 3 turns median, correct fix in 4 of 5 runs. - Gemini 3.1 Pro, auto thinking budget: root cause in 4 turns median, correct fix in 3 of 5 runs.
- Claude Sonnet 4.6: root cause in 3 turns median, correct fix in 3 of 5 runs, missed the “data corruption” heuristic once.
Methodology on the 14-task scorecard. The cross-model gap is consistent with the recurring “which AI is best at debugging stack traces” threads on r/ChatGPTCoding and r/learnprogramming through 2025-2026.
Comparison against a senior engineer
The same 42-frame trace was given to a senior engineer with 8 years on the codebase as a stopwatch baseline. He hit root cause in 3 minutes and a correct fix in 8 minutes. Claude Opus 4.7 with this prompt hit root cause in 42 seconds and a correct fix in 2 minutes 10 seconds. The model’s verification step was less convincing: he produced a reproducible unit test on the spot, while the model produced a test that would have failed but read less obviously. That gap is what TCC tracks quarter over quarter as the human-vs-model debugging delta narrows.
Frequently asked questions
What if the model stops at CONFIDENCE < 70 in turn 1?
That is the correct behaviour. The model needs a log line, a repro, or the source for a frame it cannot see. Ask for the specific missing evidence it requests. Forcing it past the threshold produces a low-quality fix that may mask the real bug.
Should I send the entire codebase as context?
No. Send only the source for the top 10 frames. Models degrade on very large contexts for targeted tasks; the signal-to-noise ratio drops as you add unrelated files. The top 10 frames contain the bug in the TCC fixture 100% of the time.
Can I automate this into a Slack bot for on-call?
Yes. Wire the Sentry webhook to trigger the prompt, post TURN 1 to Slack immediately, and post TURN 3 when the confidence threshold passes. The on-call engineer confirms or rejects TURN 1 before the fix is applied. This is the pattern described in the agent loop retry policy post.
What about Go and Rust panics?
The prompt accepts any language. Go panics include a goroutine stack that maps directly to the stack trace format the prompt expects. Rust panics with backtraces (set RUST_BACKTRACE=1) work the same way. The “frame closest to the data corruption” rule is language-agnostic.
Related
The debug scores by model are on the Claude Opus 4.7 review and the GPT-5.3-Codex review. The retry policy that wraps multi-turn calls like this one is on the agent loop retry policy post. The structured-output format for the localized-root response slots into the strict-JSON prompt if you want to pipe this into a dashboard.
One-line takeaway
Split localize, verify, and fix into three turns, demand three independent signals, bias toward the data-corruption frame, set a confidence threshold, and supply the source for the top 10 frames. Then the model finds the root cause in 3 turns, not 7.