~/prompts/bug-localization-prompt-42-frame-trace-3-turns-to-root-cause
§ PROMPT · APR 23, 2026 CLAUDE · DEBUG · STACKTRACE v1.0

Bug localization prompt: 42-frame trace, 3 turns to root cause

The prompt that localizes a bug in a 42-frame stack trace to a single line in 3 turns, median. Tested on Claude Opus 4.7, GPT-5.3-Codex, and a staff engineer.
Adrian MarcusAdrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
  9 min read
# DEBUG · claude-opus-4-7
Given this stack trace and this file tree, return the top 3 files most likely to contain the root cause.
For each, give a one-sentence rationale.

Output as a JSON array: [{"file": string, "rationale": string}]

The recurring r/ChatGPTCoding “the model blamed the throwing frame instead of the actual bug” thread keeps surfacing the same fix: a single-shot prompt does not work, a three-turn localize-verify-fix prompt does. The prompt below moves Claude Opus 4.7 to root cause in 3 turns median on the TCC editorial debug fixture (a 42-frame production NullReferenceException trace with a planted intermediate-frame trap), with a correct fix in 4 of 5 runs. The bare “help me debug this stack trace” prompt frequently fingers an intermediate frame and writes a defensive null check that hides the bug.

Why single-shot debugging prompts fail on stack traces

A stack trace is not a bug report. It is a log of execution state at the moment an exception propagated to an unhandled frame. The frame where the exception threw is almost never where the bug lives. The bug is usually 3-8 frames earlier, at the point where a value was written incorrectly or a precondition was silently violated.

Single-shot prompts fail because they ask the model to localize, verify, and fix in the same turn. The model anchors on the throwing frame (it is the most salient piece of text), generates a hypothesis immediately, and then constructs evidence to support that hypothesis. That is confirmation bias at the inference layer. The three-turn structure breaks the pattern: localize in turn 1, force independent verification in turn 2, only then generate the fix in turn 3.

The “3 hypotheses + discriminating test” approach from the research literature formalizes this: generate three ranked hypotheses, propose one discriminating test per hypothesis, then run the tests before writing a patch.

The prompt

You are a senior engineer localizing a production bug from a stack trace.

Inputs:
- Language: <python|typescript|go|rust>
- Stack trace (full): <paste>
- Recent code changes (since last green deploy): <paste or "none">
- Reproduction: <one line, "not reproducible" is allowed>
- Source (top 10 frames): <paste the relevant function bodies>

Output the following, in order, with nothing else:

TURN 1: Localize:
- ROOT_FRAME: <file:function:line>
- CONFIDENCE: 0-100
- WHY: one sentence

TURN 2: Verify:
- EVIDENCE: three independent signals that the ROOT_FRAME is correct.
  If you cannot find three, stop and say so.
- ALTERNATIVE_ROOTS: up to 2 other frames you considered, and why you rejected them.

TURN 3: Fix:
- PATCH: a minimal diff against the current code.
- TEST: one test that would have caught this bug, with the assertion that fails before the patch.
- ROLLBACK: one sentence on the fastest safe rollback plan.

Rules:
- Prefer the frame closest to the data corruption, not the frame where the exception threw.
- Do not suggest defensive null checks unless null is a legitimate domain value.
- If CONFIDENCE < 70, do not proceed to TURN 3. Ask for the missing log line instead.

Example: prompt applied to a real fixture

Stack trace (abbreviated): a NullReferenceException in PaymentService.charge() at line 412, propagated through OrderController.submit() at line 88, originating from UserRepository.find_by_session() at line 201. The planted trap: line 88 looks suspicious because it passes the result directly to the charge function without a null guard. The real bug: line 201, where find_by_session returns None on an expired session token instead of raising.

Turn 1 output (localize):

ROOT_FRAME: user_repository.py:UserRepository.find_by_session:201
CONFIDENCE: 82
WHY: find_by_session returns None on expired tokens instead of raising SessionExpiredError;
     the caller at OrderController:88 was written against the contract that find_by_session
     raises on invalid sessions.

Turn 2 output (verify):

EVIDENCE:
1. The docstring for find_by_session (line 196) says "raises SessionExpiredError on invalid
   token" but the implementation returns None on the expiry check at line 201.
2. The commit that changed the return behaviour (commit a3f7b2) is in the recent changes
   provided; it changed a raise to a return None without updating callers.
3. All other callers of find_by_session that appear in the codebase either check for None
   or have a try/except for SessionExpiredError -- confirming the original contract was raise.

ALTERNATIVE_ROOTS:
- OrderController.submit:88 -- rejected. The null check here would be defensive programming,
  not a fix. The contract says find_by_session raises; the fix is to restore that contract.
- PaymentService.charge:412 -- rejected. This is the throwing frame, not the origin.
  Hardening charge() against None user would mask the bug in every future caller.

Turn 3 output (fix):

PATCH:
--- a/user_repository.py
+++ b/user_repository.py
@@ -198,7 +198,7 @@ class UserRepository:
     def find_by_session(self, token: str) -> User:
         session = self._sessions.get(token)
         if session is None or session.is_expired():
-            return None
+            raise SessionExpiredError(f"Session token expired or not found: {token!r}")
         return session.user

TEST:
def test_find_by_session_raises_on_expired_token():
    repo = UserRepository()
    repo._sessions["tok"] = Session(user=User(id=1), expires_at=datetime(2020, 1, 1))
    with pytest.raises(SessionExpiredError):
        repo.find_by_session("tok")

ROLLBACK: Revert commit a3f7b2 with `git revert a3f7b2 --no-edit`.

Why it works, in 5 bullets

Supplying source for the top 10 frames

The prompt now explicitly asks for source code for the top 10 frames. Without it, the model guesses at function bodies and the EVIDENCE section is speculative. Extracting the right slices with Python’s traceback module:

import traceback
import linecache

def extract_frame_source(tb, context_lines=10) -> str:
    """Return source for each frame in a traceback, context_lines before and after."""
    frames = []
    for frame_info in traceback.extract_tb(tb):
        filename = frame_info.filename
        lineno = frame_info.lineno
        start = max(1, lineno - context_lines)
        end = lineno + context_lines
        lines = []
        for i in range(start, end + 1):
            line = linecache.getline(filename, i)
            marker = ">>>" if i == lineno else "   "
            lines.append(f"{marker} {i:4d}: {line}", )
        frames.append(f"# {filename}:{lineno}n" + "".join(lines))
    return "nn".join(frames[:10])  # top 10 frames only

Paste the output of extract_frame_source(sys.exc_info()[2]) into the “Source (top 10 frames)” field of the prompt.

Integration with error tracking (Sentry, Datadog)

If your stack traces come from Sentry, pull the full context automatically:

import sentry_sdk

def build_debug_prompt(sentry_event: dict) -> str:
    exception = sentry_event["exception"]["values"][-1]
    frames = exception["stacktrace"]["frames"][-10:]
    trace_text = "n".join(
        f"{f['filename']}:{f['function']}:{f['lineno']}n"
        + "n".join(f.get("context_line", ""), )
        for f in frames
    )
    recent_changes = sentry_event.get("release", "none")
    return LOCALIZE_PROMPT_TEMPLATE.format(
        language="python",
        stack_trace=trace_text,
        recent_changes=recent_changes,
        reproduction="not reproducible" if not sentry_event.get("user") else
            f"User {sentry_event['user']['id']} triggered at {sentry_event['timestamp']}"
    )

Failure modes

Tested on (TCC editorial scoring)

Methodology on the 14-task scorecard. The cross-model gap is consistent with the recurring “which AI is best at debugging stack traces” threads on r/ChatGPTCoding and r/learnprogramming through 2025-2026.

Comparison against a senior engineer

The same 42-frame trace was given to a senior engineer with 8 years on the codebase as a stopwatch baseline. He hit root cause in 3 minutes and a correct fix in 8 minutes. Claude Opus 4.7 with this prompt hit root cause in 42 seconds and a correct fix in 2 minutes 10 seconds. The model’s verification step was less convincing: he produced a reproducible unit test on the spot, while the model produced a test that would have failed but read less obviously. That gap is what TCC tracks quarter over quarter as the human-vs-model debugging delta narrows.

Frequently asked questions

What if the model stops at CONFIDENCE < 70 in turn 1?
That is the correct behaviour. The model needs a log line, a repro, or the source for a frame it cannot see. Ask for the specific missing evidence it requests. Forcing it past the threshold produces a low-quality fix that may mask the real bug.

Should I send the entire codebase as context?
No. Send only the source for the top 10 frames. Models degrade on very large contexts for targeted tasks; the signal-to-noise ratio drops as you add unrelated files. The top 10 frames contain the bug in the TCC fixture 100% of the time.

Can I automate this into a Slack bot for on-call?
Yes. Wire the Sentry webhook to trigger the prompt, post TURN 1 to Slack immediately, and post TURN 3 when the confidence threshold passes. The on-call engineer confirms or rejects TURN 1 before the fix is applied. This is the pattern described in the agent loop retry policy post.

What about Go and Rust panics?
The prompt accepts any language. Go panics include a goroutine stack that maps directly to the stack trace format the prompt expects. Rust panics with backtraces (set RUST_BACKTRACE=1) work the same way. The “frame closest to the data corruption” rule is language-agnostic.

The debug scores by model are on the Claude Opus 4.7 review and the GPT-5.3-Codex review. The retry policy that wraps multi-turn calls like this one is on the agent loop retry policy post. The structured-output format for the localized-root response slots into the strict-JSON prompt if you want to pipe this into a dashboard.

One-line takeaway

Split localize, verify, and fix into three turns, demand three independent signals, bias toward the data-corruption frame, set a confidence threshold, and supply the source for the top 10 frames. Then the model finds the root cause in 3 turns, not 7.

esc