Cursor 3 with Composer 2 scored 8.3 on agent tasks and 8.1 on refactoring. Every score, the parallel-agent behaviour, and the 3 settings that moved the numbers.
Adrian Marcus tested GPT-5.3-Codex on a 14-task suite: 9.0 on strict JSON, 8.7 on test-gen, and a costly loss on long-horizon agent planning. Full numbers inside.
Adrian Marcus scored Claude Opus 4.7 on a 63k-line TypeScript monorepo: 11/14 correct, median of 5 runs. Every miss pattern and the exact prompts are in the post.
The prompt that turns 220 lines of product requirements into a normalized Postgres schema with indexes, constraints, and migration order. Tested on 4 models.
The prompt that localizes a bug in a 42-frame stack trace to a single line in a median of 3 turns. Tested on Claude Opus 4.7, GPT-5.3-Codex, and a staff engineer.
The prompt that writes 6 Hypothesis invariants for a JSON-diff library on the first run, with shrink strategies. Tested on GPT-5.3-Codex, Claude Opus 4.7, and Aider.
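For readers new to property-based testing, here is a minimal sketch of the round-trip style of invariant that post deals in. The `diff`/`patch` functions are hypothetical stand-ins for a JSON-diff library under test (the trivial replace-wholesale implementation exists only so the properties run); Hypothesis generates the JSON inputs and shrinks any counterexample toward a minimal failing value.

```python
from hypothesis import given, strategies as st

# Hypothetical stand-in for the JSON-diff library under test: a trivial
# replace-wholesale diff, included only so the properties below are runnable.
def diff(a, b):
    return {"replace": b}

def patch(a, d):
    return d["replace"]

# Strategy for arbitrary JSON values. Floats are deliberately excluded so
# NaN != NaN cannot produce spurious round-trip failures.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children)
    | st.dictionaries(st.text(), children),
    max_leaves=20,
)

@given(a=json_values, b=json_values)
def test_patch_roundtrip(a, b):
    # Invariant: applying the diff of a -> b to a reproduces b exactly.
    assert patch(a, diff(a, b)) == b

@given(a=json_values)
def test_self_diff_is_noop(a):
    # Invariant: diffing a value against itself yields a no-op patch.
    assert patch(a, diff(a, a)) == a
```

Run under pytest; on a failure, Hypothesis's built-in shrinking minimizes the offending JSON value automatically.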
The 11-line prompt that drops LLM strict-JSON parse errors from 0.4% to under 0.01%. Paired with response_format, tested on GPT-5.3-Codex, Claude Opus 4.7, and Gemini.
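For context, this is the shape of that pairing, as a minimal sketch assuming the OpenAI Python SDK. The one-line system prompt is a placeholder, not the 11-line prompt from the post, and the model name is copied from the post as written.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5.3-codex",  # model name as given in the post
    response_format={"type": "json_object"},  # constrain output to valid JSON
    messages=[
        # Placeholder: the post's 11-line strict-JSON prompt goes here.
        {"role": "system", "content": "Reply with a single JSON object only."},
        {"role": "user", "content": "Extract name and version from: lodash 4.17.21"},
    ],
)

# Even at an under-0.01% failure rate, parse defensively rather than
# trusting the returned string blindly.
data = json.loads(resp.choices[0].message.content)
```

Note that JSON mode requires the messages themselves to mention JSON, which the prompt half of the pairing supplies anyway.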
The architecture-review prompt that flagged 3 real issues and ignored a planted false-positive trap on my 600-line PR. What to include, what to remove, and the 4 models I tested.
The 9-line prompt that moves my 5-step agent exit rate from 2/5 to 5/5 on Claude Opus 4.7. Why it works, where it fails, and the models I tested it on.
A step-by-step AI refactor on a 63k-line TypeScript monorepo. The prompt, the re-export trap, and why Claude Opus 4.7 + Aider beat the IDE agent by 3 call sites.