How I score LLM pipelines without an LLM-as-judge. Deterministic graders, property-based checks, and the 4 reasons a judge model keeps biting you in production.
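A minimal sketch of what a deterministic grader plus property-based checks can look like without any judge model. All names here are hypothetical illustrations, not the post's actual harness: the grader does normalized substring matching against a gold answer, and the property checks enforce structural invariants (valid JSON, required fields, citations restricted to known sources).

```python
import json
import re

def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, strip punctuation so comparison is deterministic.
    return re.sub(r"[^a-z0-9 ]", "", re.sub(r"\s+", " ", text.lower())).strip()

def grade_answer(answer: str, gold: str) -> bool:
    # Deterministic grader: pass iff the normalized gold string appears in the answer.
    return normalize(gold) in normalize(answer)

def check_properties(raw_output: str, allowed_ids: set[str]) -> list[str]:
    # Property-based checks: invariants that hold for every valid output,
    # independent of the question being asked.
    failures = []
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if "answer" not in obj:
        failures.append("missing 'answer' field")
    cited = set(obj.get("citations", []))
    if not cited <= allowed_ids:
        failures.append(f"cites unknown sources: {sorted(cited - allowed_ids)}")
    return failures
```

Because both checks are pure functions of the output string, a regression in the pipeline reproduces exactly on re-run, which is the main thing a judge model cannot give you.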
The chunk size, overlap, rerank, and top-k values that moved my retrieval accuracy from 74% to 91%. Tested on a 1,400-chunk corpus with a ground-truth answer set.
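A sketch of the shape of such a sweep: a config object holding the four knobs and a recall metric to score each setting against the ground-truth set. The default values below are placeholders, not the post's winning configuration.

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    # Placeholder values; the real numbers come from sweeping each knob
    # against a ground-truth answer set.
    chunk_size: int = 512   # tokens per chunk
    overlap: int = 64       # token overlap between adjacent chunks
    top_k: int = 20         # candidates fetched before reranking
    rerank_keep: int = 5    # chunks kept after the reranker

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    # Fraction of ground-truth chunks present in the retrieved set.
    if not gold_ids:
        return 1.0
    return len(gold_ids & set(retrieved_ids)) / len(gold_ids)
```

The point of the dataclass is that every accuracy number in the sweep is attributable to one recorded config, so the 74%-to-91% delta is reproducible rather than anecdotal.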
Autonomous coding agents still fail 1 in 9 production runs on my suite. The three failure modes that cause it, and where a bounded planner is the honest answer.
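"Bounded planner" in the sense of a hard step budget can be sketched in a few lines; this is an illustrative loop under assumed semantics, not the post's agent. The agent gets a fixed number of planning steps and the loop terminates either on an explicit finish action or when the budget runs out, never on an open-ended condition.

```python
def run_bounded(plan_step, max_steps: int = 8) -> dict:
    # Bounded planner: a hard step budget replaces the open agent loop.
    # plan_step sees the current state and returns the next action string.
    state = {"done": False, "history": []}
    for _ in range(max_steps):
        action = plan_step(state)
        state["history"].append(action)
        if action == "finish":
            state["done"] = True
            break
    # If the budget is exhausted, state["done"] stays False and the caller
    # must handle the partial result explicitly instead of looping forever.
    return state
```

The honest part is the return value: an unfinished run is surfaced as `done: False` rather than retried silently.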
Three years of shipping LLM structured outputs in production. The one pattern that survived, the three that did not, and the strict-JSON failure rate I run at today.
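One way to make "strict JSON" concrete, as an illustrative sketch rather than the post's surviving pattern: reject anything that is not exactly the expected shape, and measure the failure rate directly from the rejects. The `REQUIRED` schema below is hypothetical.

```python
import json

# Hypothetical expected shape: exactly these keys, with these value types.
REQUIRED = {"name": str, "score": float}

def parse_strict(raw: str) -> dict:
    # Strict parse: malformed JSON, extra keys, missing keys, and wrong
    # value types are all hard failures, not best-effort repairs.
    obj = json.loads(raw)
    if set(obj) != set(REQUIRED):
        raise ValueError(f"unexpected keys: {set(obj) ^ set(REQUIRED)}")
    for key, typ in REQUIRED.items():
        if not isinstance(obj[key], typ):
            raise ValueError(f"{key!r} is not {typ.__name__}")
    return obj

def failure_rate(outputs: list[str]) -> float:
    # Fraction of model outputs rejected by the strict parser.
    failed = 0
    for raw in outputs:
        try:
            parse_strict(raw)
        except (ValueError, json.JSONDecodeError):
            failed += 1
    return failed / len(outputs)
```

Counting rejects at the parser gives a single number to track release over release, which is what makes a claim like "the failure rate I run at today" measurable at all.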
The retry policy that cut my agent-loop 429 rate from 6% to 0.2% across 4 vendors. Jitter, step-budget interlock, and the one thing you should never retry.
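The three ingredients named above can be sketched together; this is a minimal illustration under assumed semantics (`call` returns a status code and a result, the budget is a mutable counter), not the post's exact policy. Full-jitter exponential backoff spreads retries out, the step-budget interlock stops retries from starving the rest of the agent loop, and non-retryable statuses fail immediately.

```python
import random
import time

# Statuses worth retrying. Client errors like 400/401 are the "never retry"
# case: repeating them burns budget without ever succeeding.
RETRYABLE = {429, 500, 502, 503}

def call_with_retry(call, budget: dict, max_attempts: int = 5, base: float = 0.5):
    # call() -> (status, result). budget["steps"] is the agent's shared
    # step budget; each retry consumes one step (the interlock).
    for attempt in range(max_attempts):
        status, result = call()
        if status == 200:
            return result
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        if budget["steps"] <= 0:
            raise RuntimeError("step budget exhausted; giving up retry")
        budget["steps"] -= 1
        # Full jitter: sleep uniform(0, base * 2^attempt), capped at 8s,
        # so concurrent agents do not retry in lockstep after a 429 burst.
        time.sleep(min(random.uniform(0, base * 2 ** attempt), 8.0))
    raise RuntimeError("retries exhausted")
```

Charging retries against the same budget as planning steps is the design choice that matters: a vendor outage then degrades the run instead of silently multiplying its length.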