Every review and guide on this site is built on the same 14 tasks, run the same way, scored on the same rubric. This page documents the editorial process end to end: what the tasks are, how I score them, how often I rerun them, and how I handle corrections. I rewrite this page whenever the process changes, and I date every revision at the bottom.
The 14-task suite
The suite covers the kinds of work a working engineer does in a week, not the kinds of work that look good in a demo. Every task has a deterministic fixture, a graded expected output, and a time budget. Tasks are versioned in a private repo; fixtures are available on request to reviewers who want to reproduce a score.
| # | Task | Domain | Budget |
|---|---|---|---|
| 1 | Cross-file rename across a 63k-line TypeScript monorepo (14 call sites, including 4 re-exported types) | Refactor | 15 min |
| 2 | Extract a pure function from a 900-line React component while preserving prop semantics | Refactor | 10 min |
| 3 | Remove a legacy Redux slice and migrate 27 call sites to Zustand | Refactor | 25 min |
| 4 | Write property-based tests (Hypothesis or fast-check) for a JSON-diff library with 6 invariants | Test-gen | 20 min |
| 5 | Generate integration tests for a FastAPI service with 3 external deps (mock the deps, cover 8 paths) | Test-gen | 20 min |
| 6 | Debug a production NullReferenceException with a 42-frame stack trace; identify the root cause and write the fix | Debug | 15 min |
| 7 | Reproduce and fix a race condition between two async handlers sharing a cache | Debug | 25 min |
| 8 | Design a normalized Postgres schema from 220 lines of prose product requirements; include indexes | Schema | 20 min |
| 9 | Write a migration script that backfills a new column on a 40M-row table without long locks | Schema | 15 min |
| 10 | Review a 600-line PR with 11 files, flag 3 real issues and 1 false-positive trap I planted | Code review | 15 min |
| 11 | Plan and execute a 5-step agent loop with a bounded step budget; give-up exit must fire on step 4 | Agent | 20 min |
| 12 | Call 4 tools in sequence (search, read, edit, verify) and recover from one tool error mid-flight | Agent | 15 min |
| 13 | Emit strict JSON matching a 40-property schema for 100 adversarial inputs without a single parse error | Structured output | 10 min |
| 14 | Answer 12 retrieval queries over a 1,400-chunk RAG corpus with an unambiguous correct chunk per query | RAG | 15 min |
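For a concrete sense of what "a graded expected output" means, here is a minimal sketch in the spirit of task 13. It is illustrative only: the required keys stand in for the real 40-property schema, and `grade_structured_output` is a hypothetical helper, not the fixture code.

```python
# Hypothetical grader in the spirit of task 13: every adversarial input must
# come back as strict JSON that carries the schema's required keys.
# REQUIRED_KEYS is a stand-in for the real 40-property schema.
import json

REQUIRED_KEYS = {"id", "status", "payload"}

def grade_structured_output(responses: list[str]) -> int:
    """Count responses that parse as JSON objects and satisfy the schema."""
    passed = 0
    for raw in responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a single parse error fails that input
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            passed += 1
    return passed

# Task 13 demands 100/100; anything less shows up in the rubric score.
```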
Domains roll up into the four scores you see on every review: Refactor (tasks 1-3), Test-gen (4-5), Debug (6-7), and a combined “Agent & tool-use” bucket (11-12). Schema (8-9), Code review (10), Structured output (13), and RAG (14) get per-task numbers on the detail page but do not appear on the leaderboard summary.
Scoring rubric (0-10, per task)
- 10: complete, production-grade, no human intervention needed.
- 8-9: correct, one small style fix or one human touch.
- 6-7: mostly correct, one graded check fails, trivial to patch.
- 4-5: partial, significant rework needed, tests still failing.
- 2-3: plausible-looking output that does not satisfy the spec.
- 0-1: refuses, loops without progress, or produces unrelated output.
Runs, medians, not best-of-N
Every task runs 5 times. I publish the median. Not the best run, not the worst, not the average. Best-of-N flatters brittle tools; averages get wrecked by a single 0. The median tells you what you can expect on an ordinary Tuesday.
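With made-up scores, the difference between the three summaries is easy to see:

```python
# Five runs of one task, with invented scores; one run collapsed to 0.
from statistics import mean, median

runs = [8, 7, 8, 0, 9]

print(max(runs))     # best-of-N: 9   -- flatters a brittle tool
print(mean(runs))    # mean:      6.4 -- wrecked by the single 0
print(median(runs))  # median:    8   -- the ordinary-Tuesday number
```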
Temperatures, thinking-effort levels, and agent budgets are set to the vendor defaults unless the vendor offers a clearly documented “coding” preset, in which case that preset is used and the setting is recorded in the review. No custom system prompts beyond what the vendor ships in its coding agent.
Cadence
- Weekly rerun. The full suite runs every Monday. Deltas greater than 0.3 points on any domain score trigger an updated review (a sketch of the check follows this list).
- Same-week rerun on model drops. When a vendor ships a model I already cover, I rerun the suite inside 7 days and publish a dated delta on the affected review.
- Quarterly task review. I keep the 14 tasks fixed for at least 90 days. If a task becomes trivial for every model (score >= 9 for 3 weeks running), I swap it for a harder one and publish the retirement + replacement in the changelog.
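The weekly-rerun trigger is simple to express. The domain names below match the leaderboard buckets, but every score value is invented:

```python
# Hypothetical weekly delta check: flag any leaderboard domain whose score
# moved by more than 0.3 points since the last run. All values are invented.
DELTA_THRESHOLD = 0.3

last_week = {"Refactor": 7.5, "Test-gen": 8.0, "Debug": 6.5, "Agent & tool-use": 7.0}
this_week = {"Refactor": 7.1, "Test-gen": 8.0, "Debug": 6.6, "Agent & tool-use": 7.0}

needs_update = [
    d for d in last_week
    if abs(this_week[d] - last_week[d]) > DELTA_THRESHOLD
]
print(needs_update)  # ['Refactor'] -- a 0.4-point drop, so the review gets a dated delta
```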
What I do not do
- Run public benchmarks (HumanEval, SWE-bench, and friends). Public evals are useful for researchers. They are a noisy signal for engineers picking a tool this week.
- Accept preview access or review embargoes. Every review is written against the shipped public version.
- Take money for placement. No sponsorships, no vendor-funded posts. Affiliate links (used sparingly for tools I already score well) never move a score. The affiliate disclosure is on every affected post.
- Chart scores on a single-number leaderboard. The Aider leaderboard is one useful signal; it is not the right lens for picking a refactoring tool. My leaderboard shows per-domain numbers, or it does not ship.
Corrections policy
If a score is wrong, email corrections@thecodingcolosseum.com with a link to the review and the two lines of evidence that disprove it. I fix it inside 7 days. Corrections keep the old number visible with a strikethrough, show the new number next to it, and carry a dated changelog entry on the affected post. I do not silently re-score.
The 14 tasks sit in a private fixture repo that I share with any reviewer who asks in good faith. The private repo is the single source of truth. If the tasks on this page drift from the fixture, the fixture wins.