Architecture-level code review prompt: the one that catches 3 real issues and skips the false positive

The recurring r/ChatGPTCoding “AI PR review keeps flagging style nits and missing the auth bug” thread has a consistent answer: the default “review this PR” prompt does not work, and the fix is to anchor the model on production impact and force a non-issues section. The prompt below catches 3 of 3 real issues on the TCC code-review fixture (a 600-line PR with a missing auth check, an N+1 query, a race condition, and a planted false-positive trap) and does not fall for the trap in any of 5 runs with Claude Opus 4.7. The bare “review this PR” prompt catches 1 to 2 of the real issues and falls for the trap in 3 of 5 runs.

The prompt

You are a staff engineer reviewing a pull request. You have one job: flag issues that will cause a production incident within 90 days.

Output format, exact:
---
ISSUES:
- [severity: high|medium|low] [file:line] <one-sentence description> <why it will break>
...
NON_ISSUES_CONSIDERED:
- [file:line] <what I thought might be wrong> <why I concluded it is fine>
---

Rules:
1. Do not comment on style, naming, or formatting unless it causes a correctness issue.
2. Prefer high-confidence issues. If you are less than 70% sure, list it under NON_ISSUES_CONSIDERED.
3. For every issue flagged, name the production scenario that triggers it.
4. Cap the output at 8 ISSUES. If you find more, keep the highest severity.
5. Read the PR end to end before emitting the first bullet. Do not stream.
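The output format requested above is regular enough to parse mechanically. The following is a minimal sketch of such a parser, not the article's own tooling: the `Finding` dataclass, the regular expressions, and `parse_review` are hypothetical names, and the sketch assumes file paths contain no colon and that each bullet stays on one line.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: the model follows the requested format exactly, one bullet per line.
ISSUE_RE = re.compile(r"^- \[severity: (high|medium|low)\] \[([^\]:]+):(\d+)\] (.+)$")
NON_ISSUE_RE = re.compile(r"^- \[([^\]:]+):(\d+)\] (.+)$")


@dataclass
class Finding:
    severity: Optional[str]  # None for NON_ISSUES_CONSIDERED entries
    file: str
    line: int
    text: str  # description plus the "why" clause, unsplit


def parse_review(raw: str) -> tuple[list[Finding], list[Finding]]:
    """Return (issues, non_issues). Lines that do not match the format are skipped."""
    issues: list[Finding] = []
    non_issues: list[Finding] = []
    section = None
    for line in raw.splitlines():
        line = line.strip()
        if line == "ISSUES:":
            section = "issues"
        elif line == "NON_ISSUES_CONSIDERED:":
            section = "non_issues"
        elif section == "issues" and (m := ISSUE_RE.match(line)):
            issues.append(Finding(m[1], m[2], int(m[3]), m[4]))
        elif section == "non_issues" and (m := NON_ISSUE_RE.match(line)):
            non_issues.append(Finding(None, m[1], int(m[2]), m[3]))
    return issues, non_issues
```

Skipping unparseable lines rather than raising is a deliberate choice for a sketch: a malformed bullet should not take down the whole review, though a real pipeline would probably want to log it.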

Why it works, in 5 bullets

“90 days” is a concrete time horizon. It replaces “important” with a measurable threshold. The model drops style nits and focuses on the class of bug that sits in production for a sprint before it fires.
The NON_ISSUES_CONSIDERED section is the false-positive filter. Models love to flag “suspicious” code. Forcing them to list the suspicious thing and then explain why it is fine cuts the false-positive rate roughly 3x on the TCC fixture and matches what the Claude Code review feature ships internally.
Severity tagging anchors priority. High/medium/low on every issue gives downstream tooling (Slack post, CI gate, dashboard) a sort key. The cap at 8 keeps the review readable.
“Read end to end before emitting the first bullet” blocks the streaming-think trap. Without it, the model starts generating bullets from the first file and forgets the cross-file interactions.
The “production scenario” clause filters speculative issues. If the model cannot name a scenario where the bug fires, the bug is not an issue; it is an opinion.
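The severity tag and the cap described above are easy to enforce again on the consuming side, in case the model overshoots. This is a sketch under assumptions: issues are taken to be dicts with a `severity` key, and `should_block_merge` encodes one hypothetical gate policy (block only on high severity), not a policy the article prescribes.

```python
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}
MAX_ISSUES = 8  # mirrors rule 4 of the prompt


def prioritize(issues: list[dict]) -> list[dict]:
    """Sort by severity and keep the top MAX_ISSUES.

    sorted() is stable, so the model's original order is preserved
    within each severity level. Unknown severities sort last.
    """
    ranked = sorted(
        issues,
        key=lambda issue: SEVERITY_RANK.get(issue["severity"], len(SEVERITY_RANK)),
    )
    return ranked[:MAX_ISSUES]


def should_block_merge(issues: list[dict]) -> bool:
    """Hypothetical CI gate: fail the check only when a high-severity issue exists."""
    return any(issue["severity"] == "high" for issue in issues)
```

A Slack post or dashboard could consume the same `prioritize` output, so the sort key is defined once.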

Failure modes

Small PRs (under 50 lines). The prompt is overkill. The model will stretch to find issues. Keep a separate “small PR” prompt with “return an empty list if the PR is correct” in the rules.
Vendor-specific framework code. If your PR uses a framework the model is unfamiliar with (internal DSL, a niche ORM), the false-positive rate jumps. Supply the framework’s reference pages in the context.
Race conditions without a timing model. The model will catch obvious concurrency issues and miss subtle ones. On the TCC fixture it catches the shared-counter race; a 2024 variant with a distributed lock around a cache was missed by every model. Pair the review with a property-based test when concurrency is the main risk; the property-based test prompt covers the harness.
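The small-PR case above can be routed automatically before the model is called. The sketch below assumes a unified diff as input; `count_changed_lines` and `pick_prompt` are hypothetical helpers, and the 50-line threshold is taken from the guidance above.

```python
SMALL_PR_THRESHOLD = 50  # changed lines, per the "under 50 lines" guidance


def count_changed_lines(diff: str) -> int:
    """Count added and removed lines in a unified diff.

    The "+++" and "---" file headers are excluded. This is an approximation:
    a removed line whose own content starts with "--" also appears as "---..."
    and is not counted.
    """
    return sum(
        1
        for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def pick_prompt(diff: str, full_prompt: str, small_prompt: str) -> str:
    """Use the small-PR prompt below the threshold, the full prompt otherwise."""
    if count_changed_lines(diff) < SMALL_PR_THRESHOLD:
        return small_prompt
    return full_prompt
```

The same switch point is a natural place to attach framework reference pages to the context when the changed paths touch an internal DSL or a niche ORM.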

Tested on (TCC editorial scoring)

Claude Opus 4.7, adaptive thinking, effort=high: 3 of 3 real issues, 0 of 5 runs fell for the trap.
Claude Sonnet 4.6: 3 of 3 real issues, 1 of 5 runs fell for the trap.
GPT-5.3-Codex at reasoning_effort=high: 3 of 3 real issues, 1 of 5 runs fell for the trap.
Gemini 3.1 Pro, auto thinking budget: 2 of 3 real issues (missed the race), 2 of 5 runs fell for the trap.
Aider architect mode with Claude Opus 4.7 as the planner: 3 of 3 real issues, 0 of 5 runs fell for the trap. Same as the bare Opus call.

The methodology is described on the 14-task scorecard. The cross-model pattern (Anthropic leads, OpenAI close, Gemini behind on flow-sensitive tasks) is consistent with the public SWE-bench Pro leaderboard.

Wiring it into a PR workflow

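# .github/workflows/ai-review.yml
on: pull_request
permissions:
  contents: read
  pull-requests: write   # needed to post the review comment
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }   # full history so the merge base is available
      # Diff against the PR's actual base branch rather than a hard-coded main.
      # ai_review.py also needs its model API key; pass it via env from a repo secret.
      - run: |
          diff=$(git diff --unified=5 "origin/${{ github.base_ref }}...HEAD")
          echo "$diff" | python scripts/ai_review.py > review.md
      - uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('review.md', 'utf8');
            // await so an API failure fails the step instead of being dropped
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });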

The ai_review.py script wraps the prompt above, splits long diffs at 80k tokens, and concatenates the ISSUES sections before posting. Keep the NON_ISSUES_CONSIDERED section local for debugging; posting all of it clutters the PR thread.
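The splitting and concatenation steps could look roughly like the sketch below. It is not the actual `ai_review.py`: the function names are hypothetical, the model call is left out, and token counts are estimated at 4 characters per token, which is a rough heuristic rather than a real tokenizer.

```python
import re

MAX_TOKENS = 80_000   # per-chunk budget from the text
CHARS_PER_TOKEN = 4   # assumption: crude estimate, swap in a real tokenizer if available


def split_diff(diff: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Greedily pack per-file diffs into chunks under the token budget.

    Splits only at "diff --git" headers so a file is never cut in half.
    A single file larger than the budget becomes its own oversized chunk.
    """
    budget = max_tokens * CHARS_PER_TOKEN
    files = [f for f in re.split(r"(?m)^(?=diff --git )", diff) if f]
    chunks: list[str] = []
    current = ""
    for file_diff in files:
        if current and len(current) + len(file_diff) > budget:
            chunks.append(current)
            current = ""
        current += file_diff
    if current:
        chunks.append(current)
    return chunks


def extract_issues(review: str) -> list[str]:
    """Return the bullet lines between ISSUES: and NON_ISSUES_CONSIDERED:."""
    bullets: list[str] = []
    in_issues = False
    for line in review.splitlines():
        stripped = line.strip()
        if stripped == "ISSUES:":
            in_issues = True
        elif stripped == "NON_ISSUES_CONSIDERED:":
            in_issues = False
        elif in_issues and stripped.startswith("- "):
            bullets.append(stripped)
    return bullets


def merge_reviews(reviews: list[str]) -> str:
    """Concatenate the ISSUES sections of per-chunk reviews into one comment body."""
    merged = [bullet for review in reviews for bullet in extract_issues(review)]
    return "ISSUES:\n" + "\n".join(merged)
```

One trade-off to keep in mind: chunking by file means the model no longer sees the whole PR at once, so cross-file interactions that span two chunks can be missed, and the merged list can exceed the cap of 8 unless it is trimmed again after merging.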

The 14-task methodology is on the editorial process page. The retry policy for the API call wrapping this prompt is on the agent loop retry policy post. The model that currently posts the best code-review score on the TCC suite is Claude Opus 4.7.

One-line takeaway

Anchor the model on a 90-day incident horizon, force a non-issues section, cap output at 8, and the AI review stops flagging your comments and starts catching the missing auth check.