§ AI REVIEWS · 6 TOOLS · UPDATED WEEKLY

AI coding tools, reviewed against real tasks

No benchmarks in the void. Every tool runs through the same 14-task suite. Scores are broken down by domain, not averaged into one number.

Google · v3.1 Pro
Gemini 3.1 Pro review: cheapest frontier token on the leaderboard, and where it still lags
A meaningful reasoning jump over 3 Pro. Best-in-class for long-context multimodal tasks; still trails Claude/GPT on tool-heavy agent runs.
context google review
overall 8.5
scaffold 7.9
refactor 7.3
Windsurf · v2.0
Windsurf 2.0 review: Cascade agent on the $20 Pro plan, and the 2 tasks where it beat Cursor
Cascade is still the flow-first agent. v2.0 closes most of the parallel-agent gap with Cursor via the new Agent Command Center.
codeium editor flow
overall 8.3
scaffold 8.5
refactor 7.9
open-source · v0.86.2
Aider review: the terminal agent that still wins on diff quality
Still the best terminal-native pair-programmer. Excellent diffs, model-agnostic, works well with Opus 4.7 and GPT-5.3-Codex.
aider diff terminal
overall 8.2
scaffold 8.0
refactor 8.7
Cursor · v3.1
Cursor 3 and Composer 2 review: parallel agents that do not cancel each other
Composer 2 is Cursor's first in-house coding model: 4× faster than comparably capable models and wired directly into the Agents Window.
agents cursor editor
overall 8.7
scaffold 9.3
refactor 8.4
OpenAI · v5.3-codex
GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected
Best-in-class test generation and the most steerable agent on the market. Pairs naturally with gpt-5.4 for general reasoning.
codegen openai testing
overall 9.0
scaffold 8.8
refactor 8.7
Anthropic · v4.7
Claude Opus 4.7 review: 11/14 on a real monorepo, and the 3 misses are all re-exports
The current best-in-class for large-codebase refactoring and sustained agent runs. 1M context at standard pricing.
anthropic refactor typescript
overall 9.4
scaffold 9.3
refactor 9.5
§ LEADERBOARD · APR 2026

Weekly leaderboard

scored against the 14-task suite
| # | Tool | Overall | Refactor | Test-gen | Debug | Agent | Δ week | Verdict |
|---|------|---------|----------|----------|-------|-------|--------|---------|
| 01 | Claude Opus 4.7 review: 11/14 on a real monorepo, and the 3 misses are all re-exports | 9.4 | 9.5 | 9.0 | 9.2 | 9.5 | +0.3 | top, 6 weeks |
| 02 | GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected | 9.0 | 8.7 | 9.2 | 8.6 | 8.9 | +0.4 | best test-gen |
| 03 | Cursor 3 and Composer 2 review: parallel agents that do not cancel each other | 8.7 | 8.4 | 8.2 | 7.8 | 9.4 | +0.5 | best editor UX |
| 04 | Gemini 3.1 Pro review: cheapest frontier token on the leaderboard, and where it still lags | 8.5 | 7.3 | 7.8 | 6.8 | 8.0 | +0.7 | long-context + multimodal |
| 05 | Windsurf 2.0 review: Cascade agent on the $20 Pro plan, and the 2 tasks where it beat Cursor | 8.3 | 7.9 | 8.3 | 7.7 | 8.6 | +0.2 | flow-first |
| 06 | Aider review: the terminal agent that still wins on diff quality | 8.2 | 8.7 | 7.9 | 8.1 | 7.6 | +0.1 | terminal native |
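
Because the scores are kept per domain, the same rows can be re-ranked by whichever column matters for your work. A minimal sketch in Python, assuming the leaderboard rows above are copied into plain records. The `Row` type and `rank_by` helper are hypothetical names used for illustration, not part of the suite, and the tool names are shortened from the review titles.

```python
from typing import NamedTuple


class Row(NamedTuple):
    tool: str
    overall: float
    refactor: float
    test_gen: float
    debug: float
    agent: float


# Scores copied from the leaderboard rows above (tool names shortened).
ROWS = [
    Row("Claude Opus 4.7", 9.4, 9.5, 9.0, 9.2, 9.5),
    Row("GPT-5.3-Codex", 9.0, 8.7, 9.2, 8.6, 8.9),
    Row("Cursor 3", 8.7, 8.4, 8.2, 7.8, 9.4),
    Row("Gemini 3.1 Pro", 8.5, 7.3, 7.8, 6.8, 8.0),
    Row("Windsurf 2.0", 8.3, 7.9, 8.3, 7.7, 8.6),
    Row("Aider", 8.2, 8.7, 7.9, 8.1, 7.6),
]


def rank_by(domain: str) -> list[str]:
    """Return tool names sorted from highest to lowest score in one domain.

    Ties keep their leaderboard order, because Python's sort is stable
    even with reverse=True.
    """
    ordered = sorted(ROWS, key=lambda row: getattr(row, domain), reverse=True)
    return [row.tool for row in ordered]


if __name__ == "__main__":
    for domain in ("overall", "refactor", "agent"):
        print(domain, "->", ", ".join(rank_by(domain)))
```

Sorting by `refactor` moves Aider from sixth overall into a tie for second, and sorting by `agent` puts Cursor 3 directly behind Claude Opus 4.7. A single averaged number would hide both shifts.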