No benchmarks in a void. Every tool runs through the same 14-task suite, and scores are broken down by domain rather than averaged into one number.
| # | Tool | Overall | Refactor | Test-gen | Debug | Agent | Δ week | Verdict |
|---|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | 9.4 | 9.5 | 9.0 | 9.2 | 9.5 | +0.3 | top, 6 weeks |
| 02 | GPT-5.3-Codex | 9.0 | 8.7 | 9.2 | 8.6 | 8.9 | +0.4 | best test-gen |
| 03 | Cursor 3 / Composer 2 | 8.7 | 8.4 | 8.2 | 7.8 | 9.4 | +0.5 | best editor UX |
| 04 | Gemini 3.1 Pro | 8.5 | 7.3 | 7.8 | 6.8 | 8.0 | +0.7 | long-context + multimodal |
| 05 | Windsurf 2.0 | 8.3 | 7.9 | 8.3 | 7.7 | 8.6 | +0.2 | flow-first |
| 06 | Aider | 8.2 | 8.7 | 7.9 | 8.1 | 7.6 | +0.1 | terminal native |

Full reviews:

- Claude Opus 4.7 review: 11/14 on a real monorepo, and the 3 misses are all re-exports
- GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected
- Cursor 3 and Composer 2 review: parallel agents that do not cancel each other
- Gemini 3.1 Pro review: cheapest frontier token on the leaderboard, and where it still lags
- Windsurf 2.0 review: Cascade agent on the $20 Pro plan, and the 2 tasks where it beat Cursor
- Aider review: the terminal agent that still wins on diff quality
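For readers who want to slice the suite differently: the Overall column is not a plain mean of the four domain scores, and the suite's actual weighting isn't published here. As a minimal sketch, assuming hypothetical equal weights (the `WEIGHTS` values and function names below are illustrative, not the leaderboard's method), a per-row aggregate could be computed like this:

```python
# Hypothetical aggregation of one leaderboard row.
# WEIGHTS are an assumed equal split, NOT the suite's published weighting,
# so the result will not necessarily match the Overall column above.
DOMAINS = ("refactor", "test_gen", "debug", "agent")
WEIGHTS = {"refactor": 0.25, "test_gen": 0.25, "debug": 0.25, "agent": 0.25}

def overall(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted average of the domain scores, rounded to one decimal like the table."""
    total = sum(weights[d] * scores[d] for d in DOMAINS)
    return round(total / sum(weights.values()), 1)

# Aider's domain scores from the table (row 06)
aider = {"refactor": 8.7, "test_gen": 7.9, "debug": 8.1, "agent": 7.6}
print(overall(aider))  # equal-weight mean: 8.1 (vs the table's 8.2 Overall)
```

The gap between the equal-weight mean and the published Overall is the point of the domain breakdown: a single number hides which domain a tool actually wins.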