About the editor
I’m Adrian Marcus. I write every review, guide, and prompt on this site myself. I pay for my own Claude, GPT-5, Gemini, Cursor, and Aider subscriptions out of the same budget I use for side projects. Nothing on The Coding Colosseum is sponsored, previewed by a vendor, or moved by an affiliate deal.
I’ve been shipping production software for over a decade. Most of that work was on developer tooling and large-scale web apps, across TypeScript, Python, Go, and Rust. I started this site because the AI-coding reviews I kept finding were either press releases reformatted into listicles, or one-shot benchmarks that nobody reran when the next model dropped a month later. The prompt-engineering posts were worse: a screenshot of a ChatGPT tab and three bullet points of advice.
What the site publishes
Three things, in order of priority:
- AI coding reviews. Every tool I cover runs through the same 14-task suite. Scores are the median of 5 runs, not best-of-N. Per-task scores are visible; single-number averages are not. The full methodology lives on the editorial process page.
- Production guides. Hands-on writeups of a specific task on a specific codebase with a specific tool. Every prompt I used is in a code block. Every failure is called out in-line, not buried.
- Prompt library. Only prompts that survived 30 days in a real production pipeline get published. Each one ships with its failure modes and the models it was tested against.
How I test a tool
The same 14 tasks, every time. Refactors across a 63k-line TypeScript monorepo, test-gen with property-based assertions, debugging non-trivial production stack traces, schema design from prose requirements, bounded-budget agent planning, and nine more. Tasks are versioned in a private fixture repo with deterministic inputs. I run each task 5 times with a clean context, record the transcripts, and score each run 0-10 on a rubric that is also public.
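To make the roll-up concrete, here is a minimal Python sketch of the scoring step described above. The task names, the data layout, and the `task_score` helper are illustrative stand-ins, not the actual harness in the fixture repo.

```python
from statistics import median

RUNS_PER_TASK = 5                # each task runs 5 times with a clean context
RUBRIC_MIN, RUBRIC_MAX = 0, 10   # each run is scored 0-10 against the rubric

def task_score(run_scores: list[float]) -> float:
    """Median of the 5 per-run rubric scores for one task, not best-of-N."""
    assert len(run_scores) == RUNS_PER_TASK
    assert all(RUBRIC_MIN <= s <= RUBRIC_MAX for s in run_scores)
    return median(run_scores)

# Hypothetical rubric scores for two of the 14 tasks.
runs = {
    "ts-monorepo-refactor": [6, 7, 5, 7, 8],
    "prod-stack-trace-debug": [4, 9, 3, 4, 5],
}

# Per-task medians are what gets published; there is no single-number average.
per_task = {task: task_score(scores) for task, scores in runs.items()}
print(per_task)  # {'ts-monorepo-refactor': 7, 'prod-stack-trace-debug': 4}
```

The second task shows why the median matters: a best-of-5 score would report the lucky 9, while the median reports the 4 you would actually see on a typical run.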
Every tool gets the same tasks. Every tool gets the same clock on the same hardware. The one thing that changes is the model or the client. When a tool ships an update I’ve already tested, I rerun the suite within a week and publish a dated delta. A full example of a rerun and the resulting score change lives on the Claude Opus 4.7 review.
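For a rough sense of what a dated delta contains, here is an illustrative sketch; the field names and structure are stand-ins for the published format, and the numbers are made up.

```python
from datetime import date

def score_delta(previous: dict[str, float], rerun: dict[str, float]) -> dict:
    """Per-task changes between the last published medians and a rerun of the suite."""
    changes = {
        task: {"old": previous[task], "new": rerun[task], "delta": rerun[task] - previous[task]}
        for task in previous
        if task in rerun and rerun[task] != previous[task]
    }
    return {"rerun_date": date.today().isoformat(), "changes": changes}

# Hypothetical per-task medians before and after a tool update.
published    = {"ts-monorepo-refactor": 7, "prod-stack-trace-debug": 4}
after_update = {"ts-monorepo-refactor": 8, "prod-stack-trace-debug": 4}

print(score_delta(published, after_update))
# {'rerun_date': '...', 'changes': {'ts-monorepo-refactor': {'old': 7, 'new': 8, 'delta': 1}}}
```

Only the tasks whose medians moved show up in the delta; unchanged tasks keep their previously published scores.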
How I cover vendors
The same way I’d cover a compiler. I read the official docs (Anthropic, OpenAI, Google AI, Aider), I open a public paid account, and I run the 14 tasks. No NDA previews. No advance copies. I never ask a vendor to comment on a review before publication, and I don’t accept changes to a score in exchange for coverage. Corrections are a different matter: if a score is wrong, I fix it, date it, and keep the old number visible in the changelog on the affected post.
Where I draw the line
I don’t benchmark public evals against each other. I don’t run HumanEval again. I don’t replot HELM scores as color-graded bars. The 14-task suite is the suite. If a tool wins there, it wins here. If it loses, it loses.
I also don’t pretend my suite is the whole picture. It covers the coding work I actually do, plus tasks that showed up in threads on Hacker News and r/ExperiencedDevs more than once. If you build infra for self-driving cars, the suite undersells the tool you need. If you ship web apps, it’s close.
Contact and corrections
Wrong score, dead link, broken fixture, or a claim that’s gone stale? Email corrections@thecodingcolosseum.com with a link and two lines. I fix it in under a week and keep a dated changelog on the affected post. Editorial, partnerships, or speaking: editor@thecodingcolosseum.com.
That’s the whole operation. One person, one suite, one repo of fixtures, rerun weekly.