§ PROMPT · APR 23, 2026 CLAUDE · CODE-REVIEW · REVIEW v1.0

Architecture code review prompt: 3 real issues, 0 false positives

The architecture-review prompt that flagged 3 real issues and ignored a planted false-positive trap on my 600-line PR. What to include, what to remove, and the 4 models I tested.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
# REVIEW · claude-opus-4-7
You are reviewing a pull request in a large TypeScript codebase.
You will receive: the diff, the full contents of every file in the diff, and
the file tree.

Your output is three sections:
1. Does the change achieve its stated intent?
2. What invariants in the surrounding module does it break?
3. Three smallest fixes, ranked by impact.

The recurring r/ChatGPTCoding “AI PR review keeps flagging style nits and missing the auth bug” thread has a consistent answer: the default “review this PR” prompt does not work, and the fix is to anchor the model on production impact and force a non-issues section. The prompt below catches 3 of 3 real issues on the TCC code-review fixture (a 600-line PR with a missing auth check, an N+1 query, a race condition, and a planted false-positive trap) and does not fall for the trap in any of 5 runs with Claude Opus 4.7. The bare “review this PR” prompt catches 1 to 2 of the 3 real issues and falls for the trap in 3 of 5 runs.

Why bare “review this PR” fails

There are two structural problems with the default prompt. First, the model has no threshold for what counts as an issue worth flagging, so it treats a variable name it dislikes the same as a missing authentication check. Second, models are trained to be helpful and thorough, which means they add commentary even when no commentary is warranted. The result is a review that looks authoritative but buries the two real bugs under twelve style observations.

A modern AI-assisted code review pipeline has three layers: static analysis (ESLint, SonarQube, Semgrep) catches 40-60% of deterministic issues, ML-based pattern detection (CodeQL) catches another slice, and the LLM gets the residual: semantic issues that require understanding business logic, cross-file interactions, and production context. This prompt is the LLM layer. It should not duplicate what Semgrep already caught.
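One way to keep the LLM layer from repeating the static-analysis layer is to drop LLM findings that land on a location a deterministic tool already reported. The sketch below is illustrative only: the dict shape (`file`, `line`) and the `drop_static_duplicates` helper are assumptions, not part of any tool's output format, so a real pipeline would first map Semgrep or ESLint results into this shape.

```python
def drop_static_duplicates(llm_findings, static_findings, line_window=0):
    """Keep only LLM findings that no static tool reported at the same spot.

    Both inputs are lists of dicts with hypothetical keys "file" and "line".
    line_window widens the match to nearby lines (0 means exact line only).
    """
    covered = {(f["file"], f["line"]) for f in static_findings}
    kept = []
    for finding in llm_findings:
        nearby = any(
            (finding["file"], finding["line"] + delta) in covered
            for delta in range(-line_window, line_window + 1)
        )
        if not nearby:
            kept.append(finding)
    return kept
```

Matching on location alone is crude: two different problems can share a line, so a small window is probably safer than a large one.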

The prompt

You are a staff engineer reviewing a pull request. You have one job: flag issues that will cause a production incident within 90 days.

Output format, exact:
---
ISSUES:
- [severity: high|medium|low] [file:line] <one-sentence description> <why it will break>
...
NON_ISSUES_CONSIDERED:
- [file:line] <what I thought might be wrong> <why I concluded it is fine>
---

Rules:
1. Do not comment on style, naming, or formatting unless it causes a correctness issue.
2. Prefer high-confidence issues. If you are less than 70% sure, list it under NON_ISSUES_CONSIDERED.
3. For every issue flagged, name the production scenario that triggers it.
4. Cap the output at 8 ISSUES. If you find more, keep the highest severity.
5. Read the PR end to end before emitting the first bullet. Do not stream.

Example output for a real fixture

Input: a 120-line Python diff that adds an admin endpoint without checking the is_staff flag.

---
ISSUES:
- [severity: high] [views/admin.py:34] Missing authorization check on /api/admin/export
  Any authenticated user can call this endpoint and download the full user table.
  Production scenario: attacker logs in with a free account, hits /api/admin/export, gets PII.

- [severity: medium] [services/report.py:88] N+1 query in generate_report()
  The loop on line 88 executes one SELECT per user_id. At 10k users this hits the DB 10k
  times per report request.
  Production scenario: report generation times out under normal load after user base passes 5k.

NON_ISSUES_CONSIDERED:
- [views/admin.py:21] The missing input validation on start_date looked suspicious.
  Concluded fine: the ORM parameter binding sanitizes it, and the field is type-coerced to date
  before the query runs.
---

Architecture review variant

For larger PRs that touch multiple services, the architecture variant zooms out from individual lines to flag structural issues:

You are a senior software architect reviewing a pull request for structural soundness.

Context:
- System: <one-sentence description>
- PR scope: <what this PR is trying to do>
- Team size: <N developers>

Output format, exact:
---
ARCHITECTURAL_CONCERNS:
- [severity: high|medium|low] <one-sentence description> <production consequence>

STRENGTHS:
- <one thing the PR does well structurally>

RECOMMENDED_REFACTOR:
- One sentence: the single structural change with the highest impact on maintainability.
---

Rules:
1. Evaluate coupling: could you change one component without touching three others?
2. Evaluate error propagation: do failures in one component cascade silently to another?
3. Do not flag line-level issues. Architectural review only.
4. If the PR is structurally sound, ARCHITECTURAL_CONCERNS should be empty.
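The Context block has three placeholders that need filling per PR. A minimal sketch of doing that in code, assuming the values come from the PR description or a config file; `build_context` and its validation rules are hypothetical and only there to catch an unfilled placeholder before the prompt is sent.

```python
ARCH_CONTEXT_TEMPLATE = (
    "Context:\n"
    "- System: {system}\n"
    "- PR scope: {pr_scope}\n"
    "- Team size: {team_size} developers\n"
)


def build_context(system: str, pr_scope: str, team_size: int) -> str:
    """Fill the Context block of the architecture variant."""
    for name, value in (("system", system), ("pr_scope", pr_scope)):
        cleaned = value.strip()
        # The template asks for one-sentence values, so reject multi-line input.
        if not cleaned or "\n" in cleaned:
            raise ValueError(f"{name} must be a non-empty single line")
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return ARCH_CONTEXT_TEMPLATE.format(
        system=system.strip(),
        pr_scope=pr_scope.strip(),
        team_size=team_size,
    )
```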

Why the base prompt works, in 5 bullets

Failure modes

Tested on (TCC editorial scoring)

Methodology on the 14-task scorecard. The cross-model pattern (Anthropic leads, OpenAI close, Gemini behind on flow-sensitive tasks) is consistent with the public SWE-bench Pro leaderboard.

Wiring it into a PR workflow

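# .github/workflows/ai-review.yml
on: pull_request
# createComment needs write access to the PR thread; checkout only needs read.
permissions:
  contents: read
  pull-requests: write
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      # Full history so origin/main is available for the three-dot diff.
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      # Diff against the merge base with main and pipe it to the review script.
      - run: |
          diff=$(git diff --unified=5 origin/main...HEAD)
          echo "$diff" | python scripts/ai_review.py > review.md
      # Post the review as a PR comment.
      - uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('review.md', 'utf8');
            // Await the call so a failed API request fails the step.
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });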

The ai_review.py script wraps the prompt above, splits long diffs at 80k tokens by module boundary, and concatenates the ISSUES sections before posting. Keep the NON_ISSUES_CONSIDERED section local for debugging; posting all of it clutters the PR thread. Run the architecture variant separately when the PR touches more than 3 files across service boundaries.
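The splitting step could look roughly like the sketch below. It is not the actual ai_review.py. It assumes a "module" is the top-level directory of a changed path, estimates tokens as characters divided by 4 rather than using a real tokenizer, and relies on unquoted `diff --git` headers, so paths containing spaces are not handled. All function names are illustrative.

```python
import re
from collections import defaultdict

DIFF_HEADER_RE = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def split_by_file(diff: str) -> list[tuple[str, str]]:
    """Return (path, file_diff) pairs, one per 'diff --git' section."""
    sections: list[tuple[str, str]] = []
    path, buf = None, []
    for line in diff.splitlines(keepends=True):
        match = DIFF_HEADER_RE.match(line.rstrip("\n"))
        if match:
            if path is not None:
                sections.append((path, "".join(buf)))
            path, buf = match.group(2), []
        if path is not None:
            buf.append(line)
    if path is not None:
        sections.append((path, "".join(buf)))
    return sections


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return len(text) // 4 + 1


def chunk_by_module(diff: str, max_tokens: int = 80_000) -> list[str]:
    """Group file diffs by top-level directory, then pack modules into chunks."""
    by_module: dict[str, list[str]] = defaultdict(list)
    for path, file_diff in split_by_file(diff):
        by_module[path.split("/", 1)[0]].append(file_diff)

    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for module_files in by_module.values():
        module_diff = "".join(module_files)
        cost = estimate_tokens(module_diff)
        # Start a new chunk rather than splitting a module across two calls.
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(module_diff)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

A single module larger than the budget still ends up as one oversized chunk in this sketch; handling that case would mean splitting inside the module, at the cost of the cross-file context the module boundary was meant to preserve.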

Model-specific tips

Frequently asked questions

Should the AI review replace human review?
No. The AI layer runs before human review and filters out the issues that are cheap to catch programmatically. Humans focus on intent, business logic, and the things models still miss (subtle race conditions, domain-specific invariants). The TCC pipeline gates CI on AI review but still requires human approval before merge.

What is the false-negative rate on security issues?
On the TCC fixture: 0 missed security issues with Opus 4.7, 1 missed with Codex at medium effort. That fixture has one security issue; a real PR may have more. Do not use this prompt as your only security gate. Semgrep or Snyk should run in parallel.

The prompt says “do not stream”. Does that hurt latency?
On Claude you can still use streaming at the API level. The instruction is to the model, not to the API call. It tells the model to reason before emitting, not to buffer the entire response. In practice the first token appears within the normal latency window.

Can I add our team’s coding conventions?
Yes, as a separate section appended to the prompt. Keep conventions separate from the core rules so you can update them without touching the working prompt. Format: “Team conventions (apply only when they cause a correctness or production risk): […]”.
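A small sketch of keeping the conventions separate in code, assuming the core prompt is held as one string and the conventions as a list that can change independently. `with_conventions` is a hypothetical helper; only the section header text comes from the format above.

```python
CONVENTIONS_HEADER = (
    "Team conventions (apply only when they cause a correctness "
    "or production risk):"
)


def with_conventions(core_prompt: str, conventions: list[str]) -> str:
    """Append team conventions as their own section after the core prompt."""
    items = [c.strip() for c in conventions if c.strip()]
    if not items:
        # No conventions: send the working prompt unchanged.
        return core_prompt
    bullets = "\n".join(f"- {item}" for item in items)
    return f"{core_prompt.rstrip()}\n\n{CONVENTIONS_HEADER}\n{bullets}\n"
```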

The 14-task methodology is on the editorial process page. The retry policy for the API call wrapping this prompt is on the agent loop retry policy post. The model that currently posts the best code-review score on the TCC suite is Claude Opus 4.7.

One-line takeaway

Anchor the model on a 90-day incident horizon, force a non-issues section, cap output at 8, and the AI review stops flagging your comments and starts catching the missing auth check.
