The recurring r/MachineLearning thread on LLM eval in 2026 keeps landing on the same complaint: judge-based evals are easy to set up and hard to trust. The judge drifts on a vendor update, scores its own family higher (a finding the original LLM-as-judge paper documented and follow-up community work has corroborated through 2025), runs slow, and hides structural bugs. The TCC editorial regression-detection track measures the same gap. A 140-line deterministic harness using schema validators, executable graders, and property-based assertions catches 8 of 8 planted regressions in a blind test, with 0 false positives over 500 no-change runs. The LLM-as-judge alternative on the same fixture catches 6 of 8 regressions and raises 11 false positives.
Why the judge model keeps biting you
Four reasons, in descending order of damage:
- The judge drifts. A model update on the judge changes its scoring. Eval scores move even when the model under test did not. A day of debugging turns out to be a judge-side calibration shift. The recurring “our eval suite went red and nothing in our pipeline changed” thread on r/MachineLearning is this story most weeks.
- The judge plays favorites. A judge scores outputs from its own model family higher than comparable outputs from other vendors; this self-preference bias has been measured in the TCC editorial track and in the MT-Bench paper’s follow-up community analyses.
- The judge is slow and expensive. A 3,200-case eval run with a judge model costs ~$22 and takes ~38 minutes on the TCC fixture. The same run with deterministic graders costs $0 and takes 4.1 seconds.
- The judge hides structural bugs. A judge smooths over a malformed JSON output (“it got most of the fields right”). A parser throws. The parser’s behavior is the one you want in an eval harness; a two-line repro follows this list.
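A two-line illustration, with an invented truncated payload:

import json

bad = '{"title": "Q3 report", "fields": ["revenue", "churn"'  # truncated model output, invented for illustration
json.loads(bad)  # raises json.JSONDecodeError on the spot; a judge would likely still award partial credit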
What replaces the judge
Four grader types, from cheapest to most involved:
- Exact match or normalized match. The cheapest win. Normalize whitespace, case, and trailing punctuation. Works for classification, enum extraction, and canonical-form outputs.
- Schema validation plus field-level equality. For structured output: parse into a schema, compare field by field. Use the same schema the production code uses. Reject anything that does not parse.
- Property-based assertions. For open-ended output: check invariants, not exact strings. “The summary is at most 160 chars”, “the JSON is a subset of the original”, “the code compiles under tsc”, “the test file imports pytest”. Tools: Hypothesis for Python, fast-check for TypeScript. A short sketch follows this list.
- Executable graders. For code tasks: run the code. Compile, lint, execute the test suite, diff the output. The single most honest grader in the stack.
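Two of these fit in a few lines. A minimal sketch, assuming nothing beyond the standard library and Hypothesis: a normalize helper for the normalized-match grader, a recursive json_subset check for the “JSON is a subset of the original” invariant, and a Hypothesis property that pins down normalize ignoring surrounding whitespace. The helper names are illustrative, not part of the harness below.

from hypothesis import given, strategies as st

def normalize(s: str) -> str:
    # normalized match: collapse whitespace, lower-case, drop trailing punctuation
    return " ".join(s.split()).lower().rstrip(".!?")

def json_subset(small, big) -> bool:
    # invariant: every key/value the model emitted also appears in the source record
    if isinstance(small, dict):
        return isinstance(big, dict) and all(
            k in big and json_subset(v, big[k]) for k, v in small.items()
        )
    if isinstance(small, list):
        return isinstance(big, list) and all(
            any(json_subset(x, y) for y in big) for x in small
        )
    return small == big

@given(st.text(), st.text(alphabet=" \t\n", max_size=5))
def test_normalize_ignores_surrounding_whitespace(s, pad):
    # a property, not an example table: must hold for every string Hypothesis generates
    assert normalize(pad + s + pad) == normalize(s)

In the harness, the subset check reads as a one-line assert against whatever field carries the original record for the case.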
The harness, in 140 lines
import pathlib, json, subprocess, statistics, pytest
from jsonschema import Draft202012Validator

FIXTURE = pathlib.Path("evals/fixtures.jsonl")
SCHEMA = json.load(open("evals/schema.json"))

# call_model(input) is assumed to be the project's wrapper around the model under test.

def load_cases():
    with FIXTURE.open() as f:
        for line in f:
            yield json.loads(line)

@pytest.fixture(params=list(load_cases()))
def case(request):
    return request.param

def test_schema_valid(case):
    out = call_model(case["input"])
    Draft202012Validator(SCHEMA).validate(out)

def test_required_fields(case):
    out = call_model(case["input"])
    for field in case["required_fields"]:
        assert field in out, f"missing {field}"

def test_no_placeholders(case):
    out = call_model(case["input"])
    forbidden = ("TODO", "FIXME", "your api key", "lorem ipsum")
    for token in forbidden:
        assert token.lower() not in json.dumps(out).lower()

def test_classification_exact(case):
    if case["type"] != "classification":
        pytest.skip("not a classification case")
    out = call_model(case["input"])
    assert out["label"].strip().lower() == case["label"].strip().lower()

def test_generated_code_compiles(case):
    if case["type"] != "code_gen":
        pytest.skip("not a code-gen case")
    out = call_model(case["input"])
    pathlib.Path("/tmp/eval_out.ts").write_text(out["code"])
    r = subprocess.run(["tsc", "--noEmit", "/tmp/eval_out.ts"], capture_output=True)
    assert r.returncode == 0, r.stderr.decode()

def test_summary_length(case):
    if case["type"] != "summary":
        pytest.skip("not a summary case")
    out = call_model(case["input"])
    assert len(out["summary"]) <= 160

# Aggregated regression gate: fail the run if the median pass rate over the last
# 5 runs drops more than 1 percentage point below the last green baseline.
def test_regression_threshold():
    baseline = json.load(open("evals/baseline.json"))
    current = statistics.median(baseline["pass_rates"][-5:])
    assert current >= baseline["green"] - 0.01
That is the whole thing. pytest -n 8 runs the suite in 4.1 seconds on an M3 Pro. The last test is the one that turns an eval suite into a regression gate: it fails the run if the median pass rate over the last 5 runs is more than 1 percentage point below the green baseline.
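The listing leaves implicit how evals/baseline.json gets written. A minimal sketch, assuming only the shape that test_regression_threshold already implies (a pass_rates history plus a green watermark); record_run and its call site are illustrative, not part of the measured harness:

import json, pathlib

BASELINE = pathlib.Path("evals/baseline.json")

def record_run(passed: int, total: int, run_was_green: bool) -> None:
    # append this run's pass rate; only a fully green run advances the watermark,
    # so a slow slide in pass rate cannot quietly lower the bar it is measured against
    data = json.loads(BASELINE.read_text()) if BASELINE.exists() else {"pass_rates": [], "green": 1.0}
    data["pass_rates"].append(passed / total)
    if run_was_green:
        data["green"] = data["pass_rates"][-1]
    BASELINE.write_text(json.dumps(data, indent=2))

Call it from CI after the suite finishes, with the pass/fail counts pytest reports.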
Where the harness beats the judge, in numbers
| Metric | Deterministic harness | LLM-as-judge |
|---|---|---|
| Regressions caught (out of 8 planted) | 8 | 6 |
| False positives over 500 no-change runs | 0 | 11 |
| Cost per 3,200-case run | $0 | $22 |
| Wall time | 4.1s | 38 min |
| Reproducibility at 100 reruns | 100% | 88% |
When an LLM judge earns its keep
One case: open-ended creative writing, where the output is too unconstrained for a property check. Use a judge here, but set two guardrails (a sketch of both follows the list):
- Same judge model, locked at a pinned version. If the version moves, recalibrate against the ground-truth set.
- Cross-vendor. Do not let a Claude-family judge grade a Claude-family output. Use OpenAI to judge Claude, or Gemini to judge GPT-5.
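A minimal sketch of both guardrails with the OpenAI Python SDK, grading output from a non-OpenAI model; the snapshot name, rubric wording, and integer-only scoring are illustrative choices, not settings from the TCC track:

from openai import OpenAI

JUDGE_MODEL = "gpt-4.1-2025-04-14"  # pin an exact dated snapshot, never a floating alias
client = OpenAI()

def judge_creative(sample: str, rubric: str) -> int:
    # cross-vendor guardrail: this judge only grades outputs from another vendor's model
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Score the text from 1 to 10 against this rubric. Reply with the integer only.\n{rubric}"},
            {"role": "user", "content": sample},
        ],
    )
    return int(resp.choices[0].message.content.strip())

If the pinned snapshot is retired, re-run the ground-truth set against its replacement and compare score distributions before trusting any new numbers.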
For everything else, skip the judge. The openai/evals repo is a reasonable reference for the shape of a deterministic grader set if you want to read an existing harness in one sitting.
What the threads are saying
The top r/MachineLearning threads about LLM eval in 2026 agree on one thing: judge-based evals are easy to set up and hard to trust. The loudest complaint on Hacker News about commercial eval platforms is that the vendor’s judge changes without notice and nobody publishes the drift. The most useful GitHub thread this quarter is the one on the openai/evals issue tracker about deterministic graders for structured output; the pattern is exactly the Draft202012Validator + field-equality combo above.
Related
The strict-JSON prompt pairs with the schema-validator grader. The property-based test prompt writes the Hypothesis strategies the harness depends on. The GPT-5.3-Codex review has the numbers for a model that passes the strict-JSON slot on deterministic grading alone.
Verdict
Deterministic graders plus a schema validator plus an executable check catch 100% of the regressions planted on the TCC editorial fixture and raise zero false positives across 500 runs. A judge model is a comfortable substitute for reading the output format. Read the format. Write the grader. Keep the judge as a last resort, not a default.