Updated April 23, 2026

Editorial process and 14-task methodology

How I score AI coding tools: a fixed 14-task suite, 5 runs per task, median only, rerun weekly. Every fixture, rubric, and changelog entry is public.

Every review and guide on this site is built on the same 14 tasks, run the same way, scored on the same rubric. This page documents the editorial process end to end: what the tasks are, how I score them, how often I rerun them, and how I handle corrections. I rewrite this page whenever the process changes, and I date every revision at the bottom.

The 14-task suite

The suite covers the kinds of work a working engineer does in a week, not the kinds of work that look good in a demo. Every task has a deterministic fixture, a graded expected output, and a time budget. Tasks are versioned in a private repo; fixtures are public on request for reviewers who want to reproduce a score.
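For concreteness, here is a minimal sketch of what one entry in that fixture repo could look like. Every field name below (task_id, budget_minutes, repo_snapshot, expected_output, rubric) is illustrative, an assumption about the shape of a fixture rather than the actual manifest format.

```python
# A sketch of one fixture manifest, assuming a dataclass-style record.
# All field names are illustrative, not the real repo layout.
from dataclasses import dataclass, field

@dataclass
class TaskFixture:
    task_id: int            # 1-14, matching the table below
    domain: str             # "Refactor", "Test-gen", "Debug", ...
    budget_minutes: int     # hard time budget for a single run
    repo_snapshot: str      # pinned input state so every run sees identical code
    prompt: str             # the task statement handed to the tool
    expected_output: str    # path to the graded reference output
    rubric: dict = field(default_factory=dict)  # per-criterion point weights

# Example shaped like task 1 in the table below (values are placeholders).
task_01 = TaskFixture(
    task_id=1,
    domain="Refactor",
    budget_minutes=15,
    repo_snapshot="fixtures/ts-monorepo@pinned",
    prompt="Rename the target symbol across all 14 call sites, including re-exported types.",
    expected_output="fixtures/task-01/expected.diff",
)
```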

#  | Task | Domain | Budget
1  | Cross-file rename across a 63k-line TypeScript monorepo (14 call sites, including 4 re-exported types) | Refactor | 15 min
2  | Extract a pure function from a 900-line React component while preserving prop semantics | Refactor | 10 min
3  | Remove a legacy Redux slice and migrate 27 call sites to Zustand | Refactor | 25 min
4  | Write property-based tests (Hypothesis or fast-check) for a JSON-diff library with 6 invariants | Test-gen | 20 min
5  | Generate integration tests for a FastAPI service with 3 external deps (mock the deps, cover 8 paths) | Test-gen | 20 min
6  | Debug a production NullReferenceException with a 42-frame stack trace; identify the root cause and write the fix | Debug | 15 min
7  | Reproduce and fix a race condition between two async handlers sharing a cache | Debug | 25 min
8  | Design a normalized Postgres schema from 220 lines of prose product requirements; include indexes | Schema | 20 min
9  | Write a migration script that backfills a new column on a 40M-row table without long locks | Schema | 15 min
10 | Review a 600-line PR with 11 files; flag 3 real issues and 1 false-positive trap I planted | Code review | 15 min
11 | Plan and execute a 5-step agent loop with a bounded step budget; the give-up exit must fire on step 4 | Agent | 20 min
12 | Call 4 tools in sequence (search, read, edit, verify) and recover from one tool error mid-flight | Agent | 15 min
13 | Emit strict JSON matching a 40-property schema for 100 adversarial inputs without a single parse error (checker sketch below) | Structured output | 10 min
14 | Answer 12 retrieval queries over a 1,400-chunk RAG corpus with an unambiguous correct chunk per query | RAG | 15 min
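As an example of how a pass condition stays deterministic, here is a rough sketch of the kind of check task 13 implies: every one of the 100 adversarial inputs must produce output that parses as JSON and validates against the 40-property schema. The jsonschema dependency and the function shape are my assumptions, not the actual harness.

```python
# Hypothetical checker for task 13: every output must parse as JSON and
# validate against the 40-property schema; a single failure fails the task.
import json
from jsonschema import Draft202012Validator

def task13_passes(outputs: list[str], schema: dict) -> bool:
    validator = Draft202012Validator(schema)
    for raw in outputs:                       # 100 adversarial inputs -> 100 outputs
        try:
            obj = json.loads(raw)             # one parse error fails the task
        except json.JSONDecodeError:
            return False
        if next(validator.iter_errors(obj), None) is not None:
            return False                      # any schema violation fails too
    return True
```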

Domains roll up into the four scores you see on every review: Refactor (tasks 1-3), Test-gen (4-5), Debug (6-7), and a combined “Agent & tool-use” bucket (11-12). Schema (8-9), Code review (10), Structured output (13), and RAG (14) get per-task numbers on the detail page but do not appear on the leaderboard summary.
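A minimal sketch of that roll-up: the task-to-bucket mapping comes from the paragraph above, while averaging the per-task medians within a bucket is my assumption about the formula, not something documented here.

```python
# Sketch of the leaderboard roll-up from per-task medians (0-10 each).
from statistics import mean

LEADERBOARD_BUCKETS = {
    "Refactor":         [1, 2, 3],
    "Test-gen":         [4, 5],
    "Debug":            [6, 7],
    "Agent & tool-use": [11, 12],
}

def leaderboard_scores(task_medians: dict[int, float]) -> dict[str, float]:
    """Collapse per-task medians into the four summary scores."""
    return {
        bucket: round(mean(task_medians[t] for t in tasks), 1)
        for bucket, tasks in LEADERBOARD_BUCKETS.items()
    }
```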

Scoring rubric (0-10, per task)

Runs, medians, not best-of-N

Every task runs 5 times. I publish the median. Not the best run, not the worst, not the average. Best-of-N flatters brittle tools; averages get wrecked by a single 0. The median tells you what you can expect on an ordinary Tuesday.
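A toy illustration of the difference, using five hypothetical run scores for one task:

```python
# Why the published number is the median of 5 runs rather than best-of-N or the mean.
from statistics import mean, median

runs = [7, 8, 0, 7, 8]              # five runs of one task, scored 0-10

print(max(runs))                    # best-of-N: 8   -> flatters a brittle tool
print(round(mean(runs), 1))         # average:   6.0 -> wrecked by the single 0
print(median(runs))                 # published: 7   -> the ordinary-Tuesday number
```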

Temperatures, thinking-effort levels, and agent budgets are set to the vendor defaults unless the vendor offers a clearly documented “coding” preset, in which case that preset is used and the setting is recorded in the review. No custom system prompts beyond what the vendor ships in its coding agent.
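In practice this could be captured as a small per-run settings record; the field names below are illustrative placeholders, not a published format.

```python
# Hypothetical per-run settings record: nothing tuned beyond vendor defaults
# or a documented "coding" preset, and whatever was used gets written down.
run_settings = {
    "temperature": "vendor default",        # never hand-tuned
    "thinking_effort": "vendor default",
    "agent_step_budget": "vendor default",
    "preset": "coding",                     # only when the vendor documents one
    "system_prompt": "vendor-shipped coding agent, unmodified",
    "noted_in_review": True,
}
```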

Cadence

What I do not do

Corrections policy

If a score is wrong, email corrections@thecodingcolosseum.com with a link and the two lines of evidence that disprove it. I fix it within 7 days. Corrections keep the old number visible with a strikethrough, show the new number next to it, and carry a dated changelog entry on the affected post. I do not silently re-score.

The 14 tasks sit in a private fixture repo that I share with any reviewer who asks in good faith. The private repo is the single source of truth. If the tasks on this page drift from the fixture, the fixture wins.
