The recurring r/ChatGPTCoding “AI PR review keeps flagging style nits and missing the auth bug” thread has a consistent answer: the default “review this PR” prompt does not work, and the fix is to anchor the model on production impact and force a non-issues section. The prompt below catches 3 of 3 real issues on the TCC code-review fixture (a 600-line PR with a missing auth check, an N+1 query, a race condition, and a planted false-positive trap) and does not fall for the trap in any of 5 runs with Claude Opus 4.7. The bare “review this PR” prompt catches 1-2 of the real issues and falls for the trap in 3 of 5 runs.
Why bare “review this PR” fails
There are two structural problems with the default prompt. First, the model has no threshold for what counts as an issue worth flagging, so it treats a variable name it dislikes the same as a missing authentication check. Second, models are trained to be helpful and thorough, which means they add commentary even when no commentary is warranted. The result is a review that looks authoritative but buries the two real bugs under twelve style observations.
A modern AI-assisted code review pipeline has three layers: static analysis (ESLint, SonarQube, Semgrep) catches 40-60% of deterministic issues, query-based semantic analysis (CodeQL) catches another slice, and the LLM gets the residual: semantic issues that require understanding business logic, cross-file interactions, and production context. This prompt is the LLM layer. It should not duplicate what Semgrep already caught.
The prompt
You are a staff engineer reviewing a pull request. You have one job: flag issues that will cause a production incident within 90 days.
Output format, exact:
---
ISSUES:
- [severity: high|medium|low] [file:line] <one-sentence description> <why it will break>
...
NON_ISSUES_CONSIDERED:
- [file:line] <what I thought might be wrong> <why I concluded it is fine>
---
Rules:
1. Do not comment on style, naming, or formatting unless it causes a correctness issue.
2. Prefer high-confidence issues. If you are less than 70% sure, list it under NON_ISSUES_CONSIDERED.
3. For every issue flagged, name the production scenario that triggers it.
4. Cap the output at 8 ISSUES. If you find more, keep the highest severity.
5. Read the PR end to end before emitting the first bullet. Do not stream.
Example output for a real fixture
Input: a 120-line Python diff that adds an admin endpoint without checking the is_staff flag.
---
ISSUES:
- [severity: high] [views/admin.py:34] Missing authorization check on /api/admin/export
  Any authenticated user can call this endpoint and download the full user table.
  Production scenario: attacker logs in with a free account, hits /api/admin/export, gets PII.
- [severity: medium] [services/report.py:88] N+1 query in generate_report()
  The loop on line 88 executes one SELECT per user_id. At 10k users this hits the DB 10k
  times per report request.
  Production scenario: report generation times out under normal load after user base passes 5k.
NON_ISSUES_CONSIDERED:
- [views/admin.py:21] The missing input validation on start_date looked suspicious.
  Concluded fine: the ORM parameter binding sanitizes it, and the field is type-coerced to date
  before the query runs.
---
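Because the output format is fixed, downstream tooling can parse it instead of treating the review as free text. The following is a minimal sketch of such a parser, not the TCC implementation: the `Finding` class and `parse_review` function are hypothetical names, and it assumes each bullet starts with `- ` and that any following lines without a bullet are continuations of that bullet.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Patterns follow the "Output format, exact" block of the prompt.
ISSUE_RE = re.compile(r"^- \[severity: (high|medium|low)\] \[(.+?):(\d+)\] (.*)$")
NON_ISSUE_RE = re.compile(r"^- \[(.+?):(\d+)\] (.*)$")


@dataclass
class Finding:  # hypothetical container, not part of the article's tooling
    file: str
    line: int
    text: str
    severity: Optional[str] = None  # None for NON_ISSUES_CONSIDERED entries


def parse_review(raw: str) -> Tuple[List[Finding], List[Finding]]:
    """Return (issues, non_issues) parsed from one model response."""
    issues: List[Finding] = []
    non_issues: List[Finding] = []
    section = None   # which section we are currently inside
    current = None   # last finding seen, so continuation lines can attach to it
    for line in raw.splitlines():
        s = line.strip()
        if not s or s == "---":
            continue
        if s == "ISSUES:":
            section, current = "issues", None
        elif s == "NON_ISSUES_CONSIDERED:":
            section, current = "non_issues", None
        elif section == "issues" and ISSUE_RE.match(s):
            sev, path, lineno, text = ISSUE_RE.match(s).groups()
            current = Finding(path, int(lineno), text, sev)
            issues.append(current)
        elif section == "non_issues" and NON_ISSUE_RE.match(s):
            path, lineno, text = NON_ISSUE_RE.match(s).groups()
            current = Finding(path, int(lineno), text)
            non_issues.append(current)
        elif current is not None:
            # Continuation line: the "why it will break" or scenario text.
            current.text += " " + s
    return issues, non_issues
```

Anything that fails to match either pattern is folded into the previous finding rather than dropped, so a slightly off-format response still yields usable output.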
Architecture review variant
For larger PRs that touch multiple services, the architecture variant zooms out from individual lines to flag structural issues:
You are a senior software architect reviewing a pull request for structural soundness.
Context:
- System: <one-sentence description>
- PR scope: <what this PR is trying to do>
- Team size: <N developers>
Output format, exact:
---
ARCHITECTURAL_CONCERNS:
- [severity: high|medium|low] <one-sentence description> <production consequence>
STRENGTHS:
- <one thing the PR does well structurally>
RECOMMENDED_REFACTOR:
- One sentence: the single structural change with the highest impact on maintainability.
---
Rules:
1. Evaluate coupling: could you change one component without touching three others?
2. Evaluate error propagation: do failures in one component cascade silently to another?
3. Do not flag line-level issues. Architectural review only.
4. If the PR is structurally sound, ARCHITECTURAL_CONCERNS should be empty.
Why the base prompt works, in 5 bullets
- “90 days” is a concrete time horizon. It replaces “important” with a measurable threshold. The model drops style nits and focuses on the class of bug that sits in production for a sprint before it fires.
- The NON_ISSUES_CONSIDERED section is the false-positive filter. Models love to flag “suspicious” code. Forcing them to list the suspicious thing and then explain why it is fine cuts the false-positive rate roughly threefold on the TCC fixture.
- Severity tagging anchors priority. High/medium/low on every issue gives downstream tooling (Slack post, CI gate, dashboard) a sort key. The cap at 8 keeps the review readable.
- “Read end to end before emitting the first bullet” blocks the streaming-think trap. Without it, the model starts generating bullets from the first file and forgets the cross-file interactions.
- The “production scenario” clause filters speculative issues. If the model cannot name a scenario where the bug fires, the bug is not an issue; it is an opinion.
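The severity tag and the cap of 8 are easy to enforce again on the consuming side, in case the model ignores rule 4. A minimal sketch, assuming issues have already been parsed into dicts with a `severity` key; `rank_and_cap` and `should_block_merge` are hypothetical helpers, and blocking only on high severity is one possible gate policy rather than the one TCC uses.

```python
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}
MAX_ISSUES = 8  # mirrors rule 4 of the prompt


def rank_and_cap(issues, limit=MAX_ISSUES):
    """Sort by severity and keep the top `limit` issues.

    sorted() is stable, so issues of equal severity keep the model's order.
    Unknown severities sort after "low".
    """
    ordered = sorted(
        issues,
        key=lambda issue: SEVERITY_RANK.get(issue["severity"], len(SEVERITY_RANK)),
    )
    return ordered[:limit]


def should_block_merge(issues):
    """Assumed CI gate policy: fail only if a high-severity issue is present."""
    return any(issue["severity"] == "high" for issue in issues)
```

The same sort key can order a Slack post or a dashboard, which is the point of tagging every issue.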
Failure modes
- Small PRs (under 50 lines). The prompt is overkill. The model will stretch to find issues. Keep a separate “small PR” prompt with “return an empty list if the PR is correct” in the rules.
- Vendor-specific framework code. If your PR uses a framework the model is unfamiliar with (internal DSL, a niche ORM), the false-positive rate jumps. Supply the framework’s reference pages in the context.
- Race conditions without a timing model. The model will catch obvious concurrency issues and miss subtle ones. On the TCC fixture it catches the shared-counter race; a distributed lock around a cache produced misses on every model. Pair with a property-based test for the concurrency invariants; the property-based test prompt covers the harness.
- PRs over 80k tokens. Split the diff at module boundaries, run the review prompt against each chunk, then concatenate the resulting ISSUES sections and deduplicate. Do not send the whole diff in a single call; models degrade on very long diffs.
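The split-and-deduplicate step for oversized PRs can be sketched as follows. This is an illustration under stated assumptions, not the article's script: a “module” is taken to be the top-level directory of each changed path, token counts are estimated at roughly 4 characters per token, and all function names are hypothetical. Paths containing ` b/` or spaces are not handled.

```python
import re

LOCATION_RE = re.compile(r"\[([^\]\s]+:\d+)\]")  # matches [file:line]


def estimate_tokens(text):
    # Assumption: ~4 characters per token. Swap in the model's tokenizer.
    return len(text) // 4


def split_by_file(diff):
    """Split a git unified diff into [path, patch] pairs at 'diff --git' headers."""
    patches = []
    for line in diff.splitlines(keepends=True):
        if line.startswith("diff --git "):
            path = line.rstrip("\n").split(" b/", 1)[-1]
            patches.append([path, ""])
        if patches:
            patches[-1][1] += line
    return patches


def chunk_diff(diff, budget=80_000):
    """Group patches by top-level directory, then pack modules into chunks.

    A module is never split, so a single module larger than `budget`
    still becomes one oversized chunk in this sketch.
    """
    modules = {}
    for path, patch in split_by_file(diff):
        modules.setdefault(path.split("/", 1)[0], []).append(patch)
    chunks, current, used = [], [], 0
    for patches in modules.values():
        text = "".join(patches)
        cost = estimate_tokens(text)
        if current and used + cost > budget:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks


def dedupe(issue_lines):
    """Keep the first issue per [file:line]; expects one line per issue."""
    seen, kept = set(), []
    for line in issue_lines:
        match = LOCATION_RE.search(line)
        key = match.group(1) if match else " ".join(line.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

Deduplicating on location rather than wording matters because two chunks that both include a shared file tend to describe the same bug in different words.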
Tested on (TCC editorial scoring)
- Claude Opus 4.7, adaptive thinking, effort=high: 3 of 3 real issues, 0 of 5 runs fell for the trap.
- Claude Sonnet 4.6: 3 of 3 real issues, 1 of 5 runs fell for the trap.
- GPT-5.3-Codex at reasoning_effort=high: 3 of 3 real issues, 1 of 5 runs fell for the trap.
- Gemini 3.1 Pro, auto thinking budget: 2 of 3 real issues (missed the race), 2 of 5 runs fell for the trap.
- Aider architect mode with Claude Opus 4.7 as the planner: 3 of 3 real issues, 0 of 5 runs fell for the trap. Same as the bare Opus call.
The methodology is described on the 14-task scorecard. The cross-model pattern (Anthropic leads, OpenAI close, Gemini behind on flow-sensitive tasks) is consistent with the public SWE-bench Pro leaderboard.
Wiring it into a PR workflow
# .github/workflows/ai-review.yml
on: pull_request
jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      # Full history so origin/main is available for the three-dot diff.
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      # Pipe the PR diff into the review script; the review lands in review.md.
      - run: |
          diff=$(git diff --unified=5 origin/main...HEAD)
          echo "$diff" | python scripts/ai_review.py > review.md
      # Post the review as a comment on the pull request.
      - uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const body = fs.readFileSync('review.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });
The ai_review.py script wraps the prompt above, splits long diffs at 80k tokens by module boundary, and concatenates the ISSUES sections before posting. Keep the NON_ISSUES_CONSIDERED section local for debugging; posting all of it clutters the PR thread. Run the architecture variant separately when the PR touches more than 3 files across service boundaries.
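The script itself is not reproduced here, so the following is only a sketch of two pieces it would need: separating what gets posted from what stays local, and deciding when to run the architecture variant. All function names are hypothetical. The trigger reads “more than 3 files across service boundaries” as more than 3 changed files spanning at least two top-level directories, which is an assumption about how services are laid out in the repository.

```python
from pathlib import PurePosixPath


def split_sections(review):
    """Return (issue_lines, non_issue_lines) from one review in the prompt's format."""
    issues, non_issues, target = [], [], None
    for line in review.splitlines():
        s = line.strip()
        if s == "ISSUES:":
            target = issues
        elif s == "NON_ISSUES_CONSIDERED:":
            target = non_issues
        elif s and s != "---" and target is not None:
            target.append(line.rstrip())
    return issues, non_issues


def build_comment(chunk_reviews):
    """Merge per-chunk reviews into (public_comment, local_debug_log).

    Only ISSUES lines go into the PR comment; NON_ISSUES_CONSIDERED lines
    are collected separately so they can be written to a local file.
    """
    public, debug = ["ISSUES:"], ["NON_ISSUES_CONSIDERED:"]
    for review in chunk_reviews:
        issues, non_issues = split_sections(review)
        public.extend(issues)
        debug.extend(non_issues)
    return "\n".join(public), "\n".join(debug)


def needs_architecture_review(paths, max_files=3):
    """Assumed trigger: more than `max_files` files across 2+ top-level dirs."""
    services = {PurePosixPath(p).parts[0] for p in paths}
    return len(paths) > max_files and len(services) > 1
```

In a workflow like the one above, the first return value of `build_comment` would be what ends up in review.md, and the second would go to a log or build artifact.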
Model-specific tips
- Claude Opus 4.7: Set thinking={"type":"enabled","budget_tokens":8000}. The reasoning pass is where the cross-file interaction detection happens. Without it, performance drops to Claude Sonnet 4.6 territory.
- GPT-5.3-Codex: Use reasoning_effort="high". At medium effort it misses the race condition on 2 of 5 runs on the TCC fixture.
- Gemini 3.1 Pro: Supply the full contents of every file touched by the diff, not just the diff lines. The 1M token context makes this cheap and it closes the gap on flow-sensitive detection.
Frequently asked questions
Should the AI review replace human review?
No. The AI layer runs before human review and filters out the issues that are cheap to catch programmatically. Humans focus on intent, business logic, and the things models still miss (subtle race conditions, domain-specific invariants). The TCC pipeline gates CI on AI review but still requires a human approval before merge.
What is the false-negative rate on security issues?
On the TCC fixture: 0 missed security issues with Opus 4.7, 1 missed with Codex at medium effort. That fixture has one security issue; a real PR may have more. Do not use this prompt as your only security gate. Semgrep or Snyk should run in parallel.
The prompt says “do not stream”. Does that hurt latency?
On Claude you can still use streaming at the API level. The instruction is to the model, not to the API call. It tells the model to reason before emitting, not to buffer the entire response. In practice the first token appears within the normal latency window.
Can I add our team’s coding conventions?
Yes, as a separate section appended to the prompt. Keep conventions separate from the core rules so you can update them without touching the working prompt. Format: “Team conventions (apply only when they cause a correctness or production risk): […]”.
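One way to keep the conventions separate is to store them as their own list and append them at call time. A small sketch; `with_conventions` is a hypothetical helper, and the header string is the format quoted in the answer above.

```python
CONVENTIONS_HEADER = (
    "Team conventions (apply only when they cause a correctness "
    "or production risk):"
)


def with_conventions(core_prompt, conventions):
    """Append team conventions as a separate section after the core prompt.

    The core prompt is returned unchanged when there is nothing to append,
    so the tested prompt text never has to be edited.
    """
    items = [c.strip() for c in conventions if c.strip()]
    if not items:
        return core_prompt
    bullets = "\n".join(f"- {item}" for item in items)
    return f"{core_prompt.rstrip()}\n\n{CONVENTIONS_HEADER}\n{bullets}\n"
```

The conventions can then live in a separate file in the repository and change without touching the prompt that was scored.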
Related
The 14-task methodology is on the editorial process page. The retry policy for the API call wrapping this prompt is on the agent loop retry policy post. The model that currently posts the best code-review score on the TCC suite is Claude Opus 4.7.
One-line takeaway
Anchor the model on a 90-day incident horizon, force a non-issues section, cap output at 8, and the AI review stops flagging your comments and starts catching the missing auth check.