§ EDITOR · PROFILE · WORKING ENGINEER. RUNS THE TOOLS ON REAL CODEBASES.

Adrian Marcus

Working engineer. Runs the tools on real codebases. · Remote

I'm Adrian Marcus. I write every review and guide on this site myself. I pay for my own subscriptions and I run the tools on real codebases, not on toy examples.

I've been shipping production software for over a decade across TypeScript, Python, Go, and Rust, mostly on developer tooling and large-scale web applications. I started The Coding Colosseum because the AI-coding reviews I kept finding were either press releases in disguise or one-shot benchmarks that never got rerun. So I built the thing I wanted: the same 14 tasks, rerun weekly, scored on the median of five runs, with every failure pattern written down.

If a tool drops a point between runs, I say so. If it earns one back, I say that too. No sponsored placements, no NDA previews, no affiliate deals that move scores.
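For the curious, the scoring step reduces to something like the sketch below. This is a minimal illustration of the process described above, not my actual harness; the function names, the example task, and the run values are placeholders.

    // Minimal sketch: five graded runs per task, report the median,
    // diff against last week. Task name and 0-10 scale are illustrative.

    function median(scores: number[]): number {
      const sorted = [...scores].sort((a, b) => a - b);
      const mid = Math.floor(sorted.length / 2);
      return sorted.length % 2 === 1
        ? sorted[mid]
        : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // The published score for a task is the median of exactly five runs.
    function scoreTask(runs: number[]): number {
      if (runs.length !== 5) throw new Error("expected exactly 5 runs");
      return median(runs);
    }

    // Week-over-week diff, so a dropped or regained point shows up in the log.
    function diffWeeks(
      prev: Map<string, number>,
      curr: Map<string, number>,
    ): string[] {
      const changes: string[] = [];
      for (const [task, score] of curr) {
        const before = prev.get(task);
        if (before !== undefined && before !== score) {
          changes.push(`${task}: ${before} -> ${score}`);
        }
      }
      return changes;
    }

    // Example: a task that lost a point between weekly runs of the suite.
    const lastWeek = new Map([["strict-json", 9.0]]);
    const thisWeek = new Map([
      ["strict-json", scoreTask([8.5, 9.0, 8.5, 8.0, 9.5])],
    ]);
    console.log(diffWeeks(lastWeek, thisWeek)); // ["strict-json: 9 -> 8.5"]

The median, not the mean, is what keeps one lucky or unlucky run from moving a published score.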

§ PUBLISHED · 25 ARTICLES

Recent work

  1. AI REVIEWS

    Cursor 3 and Composer 2 review: parallel agents that do not cancel each other

Cursor 3 with Composer 2 scored 8.3 on agent tasks and 8.1 on refactoring. Every score, the parallel-agent behavior, and the 3 settings that moved the numbers.

  2. AI REVIEWS

    GPT-5.3-Codex review: 9/10 on strict JSON, and the test-gen score nobody expected

    Adrian Marcus tested GPT-5.3-Codex on a 14-task suite: 9.0 on strict JSON, 8.7 on test-gen, and a costly loss on long-horizon agent planning. Full numbers inside.

  3. AI REVIEWS

    Claude Opus 4.7 review: 11/14 on a real monorepo, and the 3 misses are all re-exports

    Adrian Marcus scored Claude Opus 4.7 on a 63k-line TypeScript monorepo: 11/14 correct, median of 5 runs. Every miss pattern and the exact prompts are in the post.

  4. PROMPTS

    Schema design prompt: normalized Postgres schema from 220 lines of prose

    The prompt that turns 220 lines of product requirements into a normalized Postgres schema with indexes, constraints, and migration order. Tested on 4 models.

  5. PROMPTS

    Bug localization prompt: the 42-frame stack trace, root cause in 3 turns

    The prompt that localizes a bug in a 42-frame stack trace to a single line in 3 turns, median. Tested on Claude Opus 4.7, GPT-5.3-Codex, and a staff engineer.

  6. PROMPTS

    Property-based test generation prompt: 6 invariants on the first run

    The prompt that writes 6 Hypothesis invariants for a JSON-diff library on the first run, with shrink strategies. Tested on GPT-5.3-Codex, Claude Opus 4.7, and Aider.

  7. PROMPTS

    Strict JSON prompt: the 11 lines that drop parse errors to 0.01%

The 11-line prompt that drops LLM strict-JSON parse errors from 0.4% to under 0.01%. Paired with response_format, tested on GPT-5.3-Codex, Claude Opus 4.7, and Gemini. A sketch of the pairing follows this list.

  8. PROMPTS

    Architecture-level code review prompt: the one that catches 3 real issues and skips the false positive

    The architecture-review prompt that flagged 3 real issues and ignored a planted false-positive trap on my 600-line PR. What to include, what to remove, and the 4 models I tested.

  9. PROMPTS

    Bounded agent planner prompt: force the give-up, save the bill

    The 9-line prompt that moves my 5-step agent exit rate from 2/5 to 5/5 on Claude Opus 4.7. Why it works, where it fails, and the models I tested it on.

  10. GUIDES

    Refactor legacy TypeScript with AI: a rename across 14 call sites, without the re-export trap

    A step-by-step AI refactor on a 63k-line TypeScript monorepo. The prompt, the re-export trap, and why Claude Opus 4.7 + Aider beat the IDE agent by 3 call sites.
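One note on item 7 above, since the pairing comes up a lot: the prompt rides in the system message while response_format constrains decoding. Here is a rough sketch of that shape, assuming the OpenAI Node client; the prompt lines and the model name are placeholders, not the 11 lines or the models from the post.

    import OpenAI from "openai";

    const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

    // Placeholder for the article's 11-line prompt; only the intent is shown.
    const STRICT_JSON_PROMPT = [
      "Respond with a single JSON object and nothing else.",
      "No markdown fences, no commentary, no trailing text.",
    ].join("\n");

    async function main() {
      const completion = await client.chat.completions.create({
        model: "gpt-4o", // stand-in model name
        response_format: { type: "json_object" }, // constrains output to valid JSON
        messages: [
          { role: "system", content: STRICT_JSON_PROMPT },
          { role: "user", content: "Summarize this diff as JSON with keys summary and risk." },
        ],
      });

      // The prompt is the belt, response_format the suspenders; the parse
      // should now fail only in rare cases like truncated output.
      const parsed = JSON.parse(completion.choices[0].message.content ?? "{}");
      console.log(parsed);
    }

    main().catch(console.error);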

Showing page 2