No benchmarks in a void. Every tool runs through the same 14-task suite, and scores are broken down by domain rather than averaged into one number.
| # | Tool | Overall | Refactor | Test-gen | Debug | Agent | Δ week | Verdict |
|---|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | 9.4 | 9.5 | 9.0 | 9.2 | 9.5 | +0.3 | top, 6 weeks |
| 02 | GPT-5.3-Codex | 9.0 | 8.7 | 9.2 | 8.6 | 8.9 | +0.4 | best test-gen |
| 03 | Cursor 3 / Composer 2 | 8.7 | 8.4 | 8.2 | 7.8 | 9.4 | +0.5 | best editor UX |
| 04 | Gemini 3.1 Pro | 8.5 | 7.3 | 7.8 | 6.8 | 8.0 | +0.7 | long-context + multimodal |
| 05 | Windsurf 2.0 | 8.3 | 7.9 | 8.3 | 7.7 | 8.6 | +0.2 | flow-first |
| 06 | Aider | 8.2 | 8.7 | 7.9 | 8.1 | 7.6 | +0.1 | terminal native |

Full reviews:

- Claude Opus 4.7 review: 11/14 on a real monorepo, and the 3 misses are all re-exports
- GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected
- Cursor 3 and Composer 2 review: parallel agents that do not cancel each other
- Gemini 3.1 Pro review: cheapest frontier token on the leaderboard, and where it still lags
- Windsurf 2.0 review: Cascade agent on the $20 Pro plan, and the 2 tasks where it beat Cursor
- Aider review: the terminal agent that still wins on diff quality
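For readers who want to slice the suite differently: the Overall column is not a plain mean of the four domain scores, and the suite's actual weighting isn't published here. As a minimal sketch, assuming hypothetical equal weights (the `WEIGHTS` values and function names below are illustrative, not the leaderboard's method), a per-row aggregate could be computed like this:

```python
# Hypothetical aggregation of one leaderboard row.
# WEIGHTS are an assumed equal split, NOT the suite's published weighting,
# so the result will not necessarily match the Overall column above.
DOMAINS = ("refactor", "test_gen", "debug", "agent")
WEIGHTS = {"refactor": 0.25, "test_gen": 0.25, "debug": 0.25, "agent": 0.25}

def overall(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the domain scores, rounded to one decimal like the table."""
    total = sum(weights[d] * scores[d] for d in DOMAINS)
    return round(total / sum(weights.values()), 1)

# Aider's domain scores from the table (row 06)
aider = {"refactor": 8.7, "test_gen": 7.9, "debug": 8.1, "agent": 7.6}
print(overall(aider))  # equal-weight mean: 8.1 (vs the table's 8.2 Overall)
```

The gap between the equal-weight mean and the published Overall is the point of the domain breakdown: a single number hides which domain a tool actually wins.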