Bug localization prompt: the 42-frame stack trace, root cause in 3 turns

The recurring r/ChatGPTCoding “the model blamed the throwing frame instead of the actual bug” thread keeps surfacing the same fix: a single-shot prompt does not work, a three-turn localize-verify-fix prompt does. The prompt below moves Claude Opus 4.7 to root cause in 3 turns median on the TCC editorial debug fixture (a 42-frame production NullReferenceException trace with a planted intermediate-frame trap), with a correct fix in 4 of 5 runs. The bare “help me debug this stack trace” prompt frequently fingers an intermediate frame and writes a defensive null check that hides the bug.

The prompt

You are a senior engineer localizing a production bug from a stack trace.

Inputs:
- Language: <python|typescript|go|rust>
- Stack trace (full): <paste>
- Recent code changes (since last green deploy): <paste or "none">
- Reproduction: <one line, "not reproducible" is allowed>

Output the following, in order, with nothing else:

TURN 1: Localize:
- ROOT_FRAME: <file:function:line>
- CONFIDENCE: 0-100
- WHY: one sentence

TURN 2: Verify:
- EVIDENCE: three independent signals that the ROOT_FRAME is correct. If you cannot find three, stop and say so.
- ALTERNATIVE_ROOTS: up to 2 other frames you considered, and why you rejected them.

TURN 3: Fix:
- PATCH: a minimal diff against the current code.
- TEST: one test that would have caught this bug, with the assertion that fails before the patch.
- ROLLBACK: one sentence on the fastest safe rollback plan.

Rules:
- Prefer the frame closest to the data corruption, not the frame where the exception threw.
- Do not suggest defensive null checks unless null is a legitimate domain value.
- If CONFIDENCE < 70, do not proceed to TURN 3. Ask for the missing log line instead.

Why it works, in 5 bullets

Splits the work into three turns. Models are better at one job per turn than three jobs in one. Localize, verify, fix is the natural sequence; forcing it onto the prompt structure matches how an experienced engineer thinks through the same trace.
“Three independent signals” blocks confirmation bias. Without it, models find one piece of evidence and commit. With it, the model has to disprove two alternative roots, which is where the real bug usually hides.
“Frame closest to the data corruption” redirects attention. The default behaviour is to blame the throwing frame. The real bug is often three frames up, where the bad value was written. This clause moves the model to the right end of the trace, which matches the standard recommendation in the Python traceback module docs and the Node.js error docs.
Bans defensive null checks as a reflex. Models love to suggest if (x == null) return;. That pattern hides bugs instead of fixing them. The rule forces the model to either find a legitimate null-value case or diagnose the root.
Confidence threshold of 70. Below 70, the model stops and asks for the missing log line. This is how you get useful “I need more info” behaviour without an endless loop of guesses.

Failure modes

Missing source. If the model cannot see the source file for a frame, it will guess. Supply the source for the top 10 frames as part of the input. The standard library traceback module can be shelled out to get the right slices.
Transient bugs. Race conditions and ordering bugs defeat the single-trace prompt. Pair with a second prompt that asks for a reproduction plan, then rerun with the reproduction evidence in the context.
Frame-number drift after a minor version bump. If the stack trace was captured on version N and you paste the current code (version N+1), the line numbers are off. Model will still get the right function most of the time; the PATCH will need manual verification.

Tested on (TCC editorial scoring)

Claude Opus 4.7, adaptive thinking, effort=high: root cause in 3 turns median, correct fix in 4 of 5 runs, missed once on a race-condition variant of the fixture.
GPT-5.3-Codex, reasoning_effort=high: root cause in 3 turns median, correct fix in 4 of 5 runs.
Gemini 3.1 Pro, auto thinking budget: root cause in 4 turns median, correct fix in 3 of 5 runs.
Claude Sonnet 4.6: root cause in 3 turns median, correct fix in 3 of 5 runs, missed the “data corruption” heuristic once.

Methodology on the 14-task scorecard. The cross-model gap is consistent with the recurring “which AI is best at debugging stack traces” threads on r/ChatGPTCoding and r/learnprogramming through 2025-2026.

Comparison against a senior engineer

As a sanity check, the same 42-frame trace was given to a senior engineer with 8 years on the codebase as a stopwatch baseline. He hit root cause in 3 minutes and a correct fix in 8 minutes. Claude Opus 4.7 with this prompt hit root cause in 42 seconds and a correct fix in 2 minutes 10 seconds. The model’s verification step was less convincing than his: he produced a reproducible unit test on the spot, while the model produced a test that would have failed but read less obviously than his hand-written version. That gap is what TCC tracks quarter over quarter as the human-vs-model debugging delta narrows.

The debug scores by model are on the Claude Opus 4.7 review and the GPT-5.3-Codex review. The retry policy that wraps multi-turn calls like this one is on the agent loop retry policy post. The structured-output format for the localized-root response slots into the strict-JSON prompt if you want to pipe this into a dashboard.

One-line takeaway

Split localize, verify, and fix into three turns, demand three independent signals, bias toward the data-corruption frame, and set a confidence threshold. Then the model finds the root cause in 3 turns, not 7.