Every review and guide on this site is built on the same 14 tasks, run the same way, scored on the same rubric. This page documents the editorial process end to end: what the tasks are, how I score them, how often I rerun them, and how I handle corrections. I rewrite this page whenever the process changes, and I date every revision at the bottom.
The 14-task suite
The suite covers the kinds of work a working engineer does in a week, not the kinds of work that look good in a demo. Every task has a deterministic fixture, a graded expected output, and a time budget. Tasks are versioned in a private repo; fixtures are available on request to reviewers who want to reproduce a score.
| # | Task | Domain | Budget |
|---|---|---|---|
| 1 | Cross-file rename across a 63k-line TypeScript monorepo (14 call sites, including 4 re-exported types) | Refactor | 15 min |
| 2 | Extract a pure function from a 900-line React component while preserving prop semantics | Refactor | 10 min |
| 3 | Remove a legacy Redux slice and migrate 27 call sites to Zustand | Refactor | 25 min |
| 4 | Write property-based tests (Hypothesis or fast-check) for a JSON-diff library with 6 invariants | Test-gen | 20 min |
| 5 | Generate integration tests for a FastAPI service with 3 external deps (mock the deps, cover 8 paths) | Test-gen | 20 min |
| 6 | Debug a production NullReferenceException with a 42-frame stack trace; identify the root cause and write the fix | Debug | 15 min |
| 7 | Reproduce and fix a race condition between two async handlers sharing a cache | Debug | 25 min |
| 8 | Design a normalized Postgres schema from 220 lines of prose product requirements; include indexes | Schema | 20 min |
| 9 | Write a migration script that backfills a new column on a 40M-row table without long locks | Schema | 15 min |
| 10 | Review a 600-line PR with 11 files, flag 3 real issues and 1 false-positive trap I planted | Code review | 15 min |
| 11 | Plan and execute a 5-step agent loop with a bounded step budget; give-up exit must fire on step 4 | Agent | 20 min |
| 12 | Call 4 tools in sequence (search, read, edit, verify) and recover from one tool error mid-flight | Agent | 15 min |
| 13 | Emit strict JSON matching a 40-property schema for 100 adversarial inputs without a single parse error | Structured output | 10 min |
| 14 | Answer 12 retrieval queries over a 1,400-chunk RAG corpus with an unambiguous correct chunk per query | RAG | 15 min |
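For a concrete sense of what "a graded expected output" means, here is a minimal sketch in the spirit of task 13. It is illustrative only: the required keys stand in for the real 40-property schema, and `grade_structured_output` is a hypothetical helper, not the fixture code.

```python
# Hypothetical grader in the spirit of task 13: every adversarial input must
# come back as strict JSON that carries the schema's required keys.
# REQUIRED_KEYS is a stand-in for the real 40-property schema.
import json

REQUIRED_KEYS = {"id", "status", "payload"}

def grade_structured_output(responses: list[str]) -> int:
    """Count responses that parse as JSON objects and satisfy the schema."""
    passed = 0
    for raw in responses:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a single parse error fails that input
        if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys():
            passed += 1
    return passed

# Task 13 demands 100/100; anything less shows up in the rubric score.
```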
Domains roll up into the four scores you see on every review: Refactor (tasks 1-3), Test-gen (4-5), Debug (6-7), and a combined “Agent & tool-use” bucket (11-12). Schema (8-9), Code review (10), Structured output (13), and RAG (14) get per-task numbers on the detail page but do not appear on the leaderboard summary.
Scoring rubric (0-10, per task)
- 10: complete, production-grade, no human intervention needed.
- 8-9: correct, one small style fix or one human touch.
- 6-7: mostly correct, one graded check fails, trivial to patch.
- 4-5: partial, significant rework needed, tests still failing.
- 2-3: plausible-looking output that does not satisfy the spec.
- 0-1: refuses, loops without progress, or produces unrelated output.
Runs, medians, not best-of-N
Every task runs 5 times. I publish the median. Not the best run, not the worst, not the average. Best-of-N flatters brittle tools; averages get wrecked by a single 0. The median tells you what you can expect on an ordinary Tuesday.
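With made-up scores, the difference between the three summaries is easy to see:

```python
# Five runs of one task, with invented scores; one run collapsed to 0.
from statistics import mean, median

runs = [8, 7, 8, 0, 9]

print(max(runs))     # best-of-N: 9   -- flatters a brittle tool
print(mean(runs))    # mean:      6.4 -- wrecked by the single 0
print(median(runs))  # median:    8   -- the ordinary-Tuesday number
```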
Temperatures, thinking-effort levels, and agent budgets are set to the vendor defaults unless the vendor offers a clearly documented “coding” preset, in which case that preset is used and the setting is recorded in the review. No custom system prompts beyond what the vendor ships in its coding agent.
Cadence
- Weekly rerun. The full suite runs every Monday. Deltas greater than 0.3 points on any domain score trigger an updated review (a sketch of the check follows this list).
- Same-week rerun on model drops. When a vendor ships a model I already cover, I rerun the suite inside 7 days and publish a dated delta on the affected review.
- Quarterly task review. I keep the 14 tasks fixed for at least 90 days. If a task becomes trivial for every model (score >= 9 for 3 weeks running), I swap it for a harder one and publish the retirement + replacement in the changelog.
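The weekly-rerun trigger is simple to express. The domain names below match the leaderboard buckets, but every score value is invented:

```python
# Hypothetical weekly delta check: flag any leaderboard domain whose score
# moved by more than 0.3 points since the last run. All values are invented.
DELTA_THRESHOLD = 0.3

last_week = {"Refactor": 7.5, "Test-gen": 8.0, "Debug": 6.5, "Agent & tool-use": 7.0}
this_week = {"Refactor": 7.1, "Test-gen": 8.0, "Debug": 6.6, "Agent & tool-use": 7.0}

needs_update = [
    d for d in last_week
    if abs(this_week[d] - last_week[d]) > DELTA_THRESHOLD
]
print(needs_update)  # ['Refactor'] -- a 0.4-point drop, so the review gets a dated delta
```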
What I do not do
- Run public benchmarks (HumanEval, SWE-bench, and friends). Public evals are useful for researchers. They are a noisy signal for engineers picking a tool this week.
- Accept preview access or review embargoes. Every review is written against the shipped public version.
- Take money for placement. No sponsorships, no vendor-funded posts. Affiliate links (used sparingly for tools I already score well) never move a score. The affiliate disclosure is on every affected post.
- Chart scores on a single-number leaderboard. The Aider leaderboard is one useful signal; it is not the right lens for picking a refactoring tool. My leaderboard shows per-domain numbers, or it does not ship.
Corrections policy
If a score is wrong, email corrections@thecodingcolosseum.com with a link to the review and the two lines of evidence that disprove it. I fix it inside 7 days. Corrections keep the old number visible with a strikethrough, show the new number next to it, and carry a dated changelog entry on the affected post. I do not silently re-score.
The 14 tasks sit in a private fixture repo that I share with any reviewer who asks in good faith. The private repo is the single source of truth. If the tasks on this page drift from the fixture, the fixture wins.