§ GUIDE · APR 23, 2026 · EVALS · INTERMEDIATE · v1.0

Evals without LLM judges: the boring harness that catches regressions

How I score LLM pipelines without an LLM-as-judge. Deterministic graders, property-based checks, and the 4 reasons a judge model keeps biting you in production.
Adrian Marcus. Working engineer. Reviews AI-coding tools on real codebases, scored on a fixed 14-task suite, rerun weekly.
Peer score 7.9 · 5 min read

The recurring r/MachineLearning thread on LLM eval in 2026 keeps landing on the same complaint: judge-based evals are easy to set up and hard to trust. The judge drifts on a vendor update, scores its own family higher (a finding the original LLM-as-judge paper documented and follow-up community work has corroborated through 2025), runs slow, and hides structural bugs. The TCC editorial regression-detection track measures the same gap. A 140-line deterministic harness using schema validators, executable graders, and property-based assertions catches 8 of 8 planted regressions in a blind test, with 0 false positives over 500 no-change runs. The LLM-as-judge alternative on the same fixture catches 6 of 8 regressions and raises 11 false positives.

Why the judge model keeps biting you

Four reasons, in descending order of damage:

  1. The judge drifts. A model update on the judge changes its scoring. Eval scores move even when the model under test did not. A day of debugging turns out to be a judge-side calibration shift. The recurring “our eval suite went red and nothing in our pipeline changed” thread on r/MachineLearning is this story most weeks.
  2. The judge has a taste. If the judge and the model under test come from the same vendor, the judge scores its sibling higher; across vendors, a judge still rates outputs from its own model family higher. Both effects have been measured in the TCC editorial track and in the published MT-Bench paper's follow-up community analyses.
  3. The judge is slow and expensive. A 3,200-case eval run with a judge model costs ~$22 and takes ~38 minutes on the TCC fixture. The same run with deterministic graders costs $0 and takes 4.1 seconds.
  4. The judge hides structural bugs. A judge smooths over a malformed JSON output (“it got most of the fields right”); a parser throws. The parser is the behavior you want in an eval harness.
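The contrast in point 4 fits in a few lines. A minimal sketch (the truncated response string is an invented example): a parser gives no partial credit, which is exactly what makes it a useful grader.

```python
import json

# A truncated model response: a judge would call this "mostly right",
# a parser rejects it outright.
malformed = '{"label": "positive", "confidence": 0.9'  # missing closing brace

def grade_structure(raw: str) -> bool:
    """Deterministic structural grade: parse or fail, no partial credit."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

assert grade_structure(malformed) is False
assert grade_structure('{"label": "positive", "confidence": 0.9}') is True
```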

What replaces the judge

Four grader types, in decreasing generality:

  1. Schema validators. Every structured output is checked against a JSON Schema; malformed output fails outright.
  2. Property checks. Assertions that hold for any valid output: length bounds, no placeholder tokens, required fields present.
  3. Exact-match graders. For tasks with one right answer, such as classification labels, normalized string equality.
  4. Executable graders. Generated code is compiled or run; the toolchain's exit code is the grade.

The harness, in 140 lines

import pathlib, json, subprocess, statistics, pytest
from jsonschema import Draft202012Validator

FIXTURE = pathlib.Path("evals/fixtures.jsonl")
SCHEMA = json.load(open("evals/schema.json"))

def load_cases():
    with FIXTURE.open() as f:
        for line in f:
            yield json.loads(line)

# call_model() is the pipeline under test — wire it to your stack.
def test_schema_valid(case):
    out = call_model(case["input"])
    Draft202012Validator(SCHEMA).validate(out)

def test_required_fields(case):
    out = call_model(case["input"])
    for field in case["required_fields"]:
        assert field in out, f"missing {field}"

def test_no_placeholders(case):
    out = call_model(case["input"])
    forbidden = ("TODO", "FIXME", "your api key", "lorem ipsum")
    for token in forbidden:
        assert token.lower() not in json.dumps(out).lower()

def test_classification_exact(case):
    if case["type"] != "classification": pytest.skip()
    out = call_model(case["input"])
    assert out["label"].strip().lower() == case["label"].strip().lower()

def test_generated_code_compiles(case):
    if case["type"] != "code_gen": pytest.skip()
    out = call_model(case["input"])
    pathlib.Path("/tmp/eval_out.ts").write_text(out["code"])
    r = subprocess.run(["tsc", "--noEmit", "/tmp/eval_out.ts"], capture_output=True)
    assert r.returncode == 0, r.stderr.decode()

def test_summary_length(case):
    if case["type"] != "summary": pytest.skip()
    out = call_model(case["input"])
    assert len(out["summary"]) <= 160

@pytest.fixture(params=list(load_cases()))
def case(request): return request.param

# Aggregated regression score: fail the run if pass rate drops > 1 pp vs. the last green baseline.
def test_regression_threshold():
    baseline = json.load(open("evals/baseline.json"))
    current = statistics.median(baseline["pass_rates"][-5:])
    assert current >= baseline["green"] - 0.01

That is the whole thing. pytest -n 8 runs the suite in 4.1 seconds on an M3 Pro. The last test is the one that turns an eval suite into a regression gate: it fails the run if the median pass rate over the last 5 runs is more than 1 percentage point below the green baseline.
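The gate assumes something maintains evals/baseline.json. A minimal sketch of that side, assuming your CI calls it after a fully green run (the filename and window size are conventions from the harness above, not a fixed API):

```python
import json
import pathlib

def record_green_run(pass_rate: float, path="evals/baseline.json", window=20):
    """After a green run: append the pass rate to a rolling window and
    record it as the new green baseline the regression gate compares against."""
    p = pathlib.Path(path)
    data = json.loads(p.read_text()) if p.exists() else {"pass_rates": []}
    data["pass_rates"] = (data["pass_rates"] + [pass_rate])[-window:]
    data["green"] = pass_rate
    p.write_text(json.dumps(data, indent=2))
```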

Where the harness beats the judge, in numbers

Metric                                    Deterministic harness   LLM-as-judge
Regressions caught (out of 8 planted)     8                       6
False positives over 500 no-change runs   0                       11
Cost per 3,200-case run                   $0                      $22
Wall time                                 4.1 s                   38 min
Reproducibility at 100 reruns             100%                    88%

When an LLM judge earns its keep

One case: open-ended creative writing, where the output is too unconstrained for a property check. Use a judge here, but set two guardrails:

  1. Pin the judge. Freeze the judge model version and score only against that pin; a judge upgrade becomes a deliberate migration, not a silent score shift.
  2. Rescore a frozen anchor set. Keep a small set of outputs with recorded judge scores, rescore them on every run, and alert when the anchor scores move. That separates judge-side drift from a real regression.
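One workable drift guardrail is rescoring a frozen anchor set on every run. A minimal sketch of the comparison step, assuming a 1–10 judge scale and anchor scores recorded when the set was frozen (the function name and tolerance are illustrative, not a standard API):

```python
import statistics

def judge_drifted(anchor_scores, current_scores, tolerance=0.5):
    """Return True if the judge's median score on the frozen anchor set
    has moved more than `tolerance` points since the anchors were recorded.
    A drift alert means: recalibrate the judge, don't debug the pipeline."""
    frozen = statistics.median(anchor_scores)
    current = statistics.median(current_scores)
    return abs(current - frozen) > tolerance
```

A drifted judge and a regressed model look identical in the aggregate score; the anchor set is the only way to tell them apart without rereading outputs by hand.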

Everything else, skip the judge. The openai/evals repo is a reasonable reference for the shape of a deterministic grader set, if reading an existing harness in one sitting is the goal.

What the threads are saying

The top r/MachineLearning threads about LLM eval in 2026 agree on one thing: judge-based evals are easy to set up and hard to trust. The loudest complaint on Hacker News about commercial eval platforms is that the vendor’s judge changes without notice and nobody publishes the drift. The most useful GitHub thread this quarter is the one on the openai/evals issue tracker about deterministic graders for structured output; the pattern is exactly the Draft202012Validator + field-equality combo above.

The strict-JSON prompt pairs with the schema-validator grader. The property-based test prompt writes the Hypothesis strategies the harness depends on. The GPT-5.3-Codex review has the numbers for a model that passes the strict-JSON slot on deterministic grading alone.
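A property in that style, sketched with Hypothesis: the `normalize_label` helper is hypothetical, mirroring the `.strip().lower()` normalization the exact-match grader in the harness relies on, and the properties assert that normalization is idempotent and insensitive to casing and padding.

```python
from hypothesis import given, strategies as st

def normalize_label(label: str) -> str:
    # Mirrors the .strip().lower() normalization in test_classification_exact.
    return label.strip().lower()

# ASCII-only strategy: Unicode case folding (e.g. 'ß' -> 'SS') is a genuine
# counterexample, and out of scope for a label normalizer.
ascii_text = st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126))

@given(ascii_text)
def test_normalize_is_idempotent(s):
    assert normalize_label(normalize_label(s)) == normalize_label(s)

@given(ascii_text)
def test_normalize_ignores_case_and_padding(s):
    assert normalize_label("  " + s.upper() + "  ") == normalize_label(s)
```

The payoff is that Hypothesis hunts for the inputs you would never plant by hand, while the grader itself stays deterministic per example.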

Verdict

Deterministic graders plus a schema validator plus an executable check catch 100% of the regressions planted on the TCC editorial fixture and raise zero false positives across 500 runs. A judge model is a comfortable substitute for reading the output format. Read the format. Write the grader. Keep the judge as a last resort, not a default.
