GUIDE · APR 23, 2026 · EVALS · INTERMEDIATE · v1.0

Evals without LLM judges: a harness that catches regressions

How I score LLM pipelines without an LLM-as-judge. Deterministic graders, property-based checks, and the 4 reasons a judge model keeps biting you in production.
Ryan Calloway, Staff contributor
Peer score 7.9 · 10 min read

The recurring r/MachineLearning thread on LLM eval in 2026 keeps landing on the same complaint: judge-based evals are easy to set up and hard to trust. The judge drifts on a vendor update, scores its own model family higher (a finding the original LLM-as-judge paper documented and follow-up work has corroborated through 2025), runs slow, and hides structural bugs. The TCC editorial regression-detection track measures the same gap. A 140-line deterministic harness using schema validators, executable graders, and property-based assertions catches 8 of 8 planted regressions in a blind test, with 0 false positives over 500 no-change runs. The LLM-as-judge alternative on the same fixture catches 6 of 8 regressions and raises 11 false positives.

Why the judge model keeps biting you

Four reasons, in descending order of damage:

  1. The judge drifts. A model update on the judge changes its scoring. Eval scores move even when the model under test did not. A day of debugging turns out to be a judge-side calibration shift. The recurring “our eval suite went red and nothing in our pipeline changed” thread on r/MachineLearning is this story most weeks.
  2. The judge has a taste. When the judge and the model under test come from the same vendor, the judge scores the sibling higher. When they come from different vendors, the judge still scores outputs from its own family higher. Both patterns have been measured in the TCC editorial track and in the published MT-Bench follow-up analyses.
  3. The judge is slow and expensive. A 3,200-case eval run with a judge model costs ~$22 and takes ~38 minutes on the TCC fixture. The same run with deterministic graders costs $0 and takes 4.1 seconds.
  4. The judge hides structural bugs. A judge smooths over malformed JSON output (“it got most of the fields right”). A parser throws. The parser is the one you want in the eval harness.
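To make the fourth point concrete, here is a minimal sketch of a strict structural check, using only the standard library. The sample output and the helper name `parse_or_fail` are hypothetical, chosen for illustration; the point is only that a parser rejects what a lenient reader would wave through.

```python
import json


def parse_or_fail(raw: str) -> dict:
    """Strict structural check: raise on anything that is not a JSON object."""
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed


# Hypothetical model output: the fields look right, but the trailing comma
# makes it invalid JSON.
malformed = '{"label": "refund", "confidence": 0.93,}'

try:
    parse_or_fail(malformed)
    structurally_valid = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    structurally_valid = False

print(structurally_valid)
```

A judge asked "is this a good answer?" may well pass that output. The parser cannot, which is why it belongs in the harness.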

What replaces the judge

Four grader types cover most evaluation needs, in decreasing order of generality:

The harness, in 140 lines

import json
import pathlib
import statistics
import subprocess

import pytest
from jsonschema import Draft202012Validator

FIXTURE = pathlib.Path("evals/fixtures.jsonl")
SCHEMA = json.loads(pathlib.Path("evals/schema.json").read_text())
BASELINE = pathlib.Path("evals/baseline.json")


def call_model(model_input):
    """Placeholder: this excerpt does not show call_model.

    Wire it to your pipeline so it returns the parsed output (a dict) for one input.
    """
    raise NotImplementedError("connect call_model to the pipeline under test")


def load_cases():
    with FIXTURE.open() as f:
        for line in f:
            if line.strip():  # tolerate blank lines in the JSONL file
                yield json.loads(line)


# One parametrized fixture feeds every per-case test below.
@pytest.fixture(params=list(load_cases()))
def case(request):
    return request.param


def test_schema_valid(case):
    out = call_model(case["input"])
    Draft202012Validator(SCHEMA).validate(out)  # raises ValidationError on mismatch


def test_required_fields(case):
    out = call_model(case["input"])
    for field in case["required_fields"]:
        assert field in out, f"missing {field}"


def test_no_placeholders(case):
    out = call_model(case["input"])
    forbidden = ("TODO", "FIXME", "your api key", "lorem ipsum")
    serialized = json.dumps(out).lower()
    for token in forbidden:
        assert token.lower() not in serialized, f"placeholder text: {token}"


def test_classification_exact(case):
    if case["type"] != "classification":
        pytest.skip("not a classification case")
    out = call_model(case["input"])
    assert out["label"].strip().lower() == case["label"].strip().lower()


def test_generated_code_compiles(case, tmp_path):
    if case["type"] != "code_gen":
        pytest.skip("not a code_gen case")
    out = call_model(case["input"])
    # tmp_path is unique per test, so parallel workers never share a file.
    src = tmp_path / "eval_out.ts"
    src.write_text(out["code"])
    r = subprocess.run(["tsc", "--noEmit", str(src)], capture_output=True, text=True)
    # tsc prints its diagnostics to stdout, so include both streams in the message.
    assert r.returncode == 0, r.stdout + r.stderr


def test_summary_length(case):
    if case["type"] != "summary":
        pytest.skip("not a summary case")
    out = call_model(case["input"])
    assert len(out["summary"]) <= 160


def test_regression_threshold():
    """Fail the run if pass rate drops more than 1 pp vs. the last green baseline."""
    baseline = json.loads(BASELINE.read_text())
    pass_rates = baseline["pass_rates"]
    current = statistics.median(pass_rates[-5:])  # median of the last 5 recorded runs
    assert current >= baseline["green"] - 0.01

With pytest-xdist installed, pytest -n 8 runs the suite in 4.1 seconds on an M3 Pro. The last test is the one that turns an eval suite into a regression gate: it fails the run if the median pass rate over the last 5 runs recorded in evals/baseline.json drops more than 1 percentage point below the green baseline.

Where the harness beats the judge, in numbers

| Metric | Deterministic harness | LLM-as-judge |
| --- | --- | --- |
| Regressions caught (out of 8 planted) | 8 | 6 |
| False positives over 500 no-change runs | 0 | 11 |
| Cost per 3,200-case run | $0 | $22 |
| Wall time | 4.1 s | 38 min |
| Reproducibility at 100 reruns | 100% | 88% |

Building the ground-truth fixture

The harness is only as good as its fixture. A fixture built on synthetic “ideal” inputs misses the distribution your model actually serves in production. Here is how to build one that actually catches regressions:

  1. Start with production logs. Pull 200 real queries from production logs and their accepted outputs. These are harder to fake than synthetic inputs and they capture the distribution the model actually sees.
  2. Label edge cases explicitly. Add 50 cases that probe known failure modes: ambiguous inputs, missing required fields, near-threshold lengths, multilingual content, potential injection attempts.
  3. Separate the regression set from the development set. Keep 80% of cases in the development fixture used during prompt iteration, and 20% in a locked regression set used only in CI. The locked set should never be used to guide prompt changes, or it will overfit.
  4. Record accepted output, not ideal output. If the current production model returns a specific JSON structure, that is what the fixture captures. Regressions are changes from current accepted behavior, not changes from some theoretical ideal.
  5. Version the fixture. Store it in git alongside the prompt. When the prompt changes, the fixture log shows you what changed and when. This is the audit trail that makes debugging a “suite went red” incident tractable.
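Step 3 is the one that tends to erode over time, so it helps to make the split mechanical. The sketch below assigns each case to the development or locked set by hashing a stable identifier, so a case never migrates between sets when the fixture grows. The `id` field and the function names are assumptions for illustration; the fixture format in this guide does not specify an identifier.

```python
import hashlib


def assign_split(case_id: str, locked_fraction: float = 0.20) -> str:
    """Deterministically assign a case to 'dev' or 'locked' by hashing its id."""
    digest = hashlib.sha256(case_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # map first 32 bits to [0, 1)
    return "locked" if bucket < locked_fraction else "dev"


def split_fixture(cases: list[dict]) -> dict[str, list[dict]]:
    """Partition cases into the development fixture and the locked regression set."""
    splits: dict[str, list[dict]] = {"dev": [], "locked": []}
    for case in cases:
        splits[assign_split(case["id"])].append(case)  # "id" is a hypothetical field
    return splits


if __name__ == "__main__":
    demo = [{"id": f"case-{i}", "input": "..."} for i in range(250)]
    parts = split_fixture(demo)
    print(len(parts["dev"]), len(parts["locked"]))
```

Hashing rather than shuffling means the assignment needs no stored random seed and stays stable across machines. The split is only approximately 80/20 on a small fixture.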

Integrating with CI/CD

The eval harness belongs in the same CI pipeline as unit tests. A prompt change that drops the pass rate by more than 1 percentage point is a regression, treated the same as a code change that breaks a unit test.

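name: eval-regression
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/model/**'
      - 'evals/**'
  # The baseline step below checks for refs/heads/main, which only a push event
  # to main can satisfy; pull_request runs use a refs/pull/... ref.
  push:
    branches: [main]
    paths:
      - 'prompts/**'
      - 'src/model/**'
      - 'evals/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      # pytest-json-report provides the --json-report flags used below.
      - run: pip install pytest pytest-xdist pytest-json-report jsonschema hypothesis
      - run: pytest evals/ -n 8 --tb=short --json-report --json-report-file=evals/results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Update baseline if green
        if: success() && github.ref == 'refs/heads/main'
        run: python evals/update_baseline.py evals/results.json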

The last step updates the green baseline when a merge to main passes. This keeps the regression gate calibrated to current expected performance rather than to a stale baseline from six months ago.
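The workflow calls evals/update_baseline.py, which this guide does not show. What follows is a guess at its shape, not the actual script. It assumes the report comes from the pytest-json-report plugin, whose top-level `summary` object carries per-outcome counts such as `passed` and `failed`, and it assumes the baseline file holds the `pass_rates` list and `green` value that the regression test reads. The ratchet (`green` only moves up) is my reading of the intent.

```python
import json
import statistics
import sys
from pathlib import Path

BASELINE = Path("evals/baseline.json")


def pass_rate(report: dict) -> float:
    """Pass rate over tests that actually ran; skipped tests are excluded."""
    summary = report["summary"]
    passed = summary.get("passed", 0)
    ran = passed + summary.get("failed", 0) + summary.get("error", 0)
    return passed / ran if ran else 0.0


def updated_baseline(baseline: dict, rate: float, window: int = 5) -> dict:
    """Append this run's rate and ratchet `green` upward, never downward."""
    rates = list(baseline.get("pass_rates", [])) + [rate]
    recent = statistics.median(rates[-window:])
    green = max(baseline.get("green", 0.0), recent)
    return {"pass_rates": rates, "green": green}


def main(report_path: str) -> None:
    report = json.loads(Path(report_path).read_text())
    baseline = json.loads(BASELINE.read_text()) if BASELINE.exists() else {}
    new = updated_baseline(baseline, pass_rate(report))
    BASELINE.write_text(json.dumps(new, indent=2) + "\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

One caveat: a file written on a CI runner disappears when the job ends. Persisting the new baseline needs an extra step (a commit back to the repository or an uploaded artifact), which the workflow above does not include.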

Grader patterns for common output types

Here are the grader patterns I use for the most common task types:

Classification tasks:

def grader_classification(output: dict, expected_label: str) -> bool:
    label = output.get("label", "").strip().lower()
    return label == expected_label.strip().lower()

Structured extraction:

from jsonschema import Draft202012Validator, ValidationError

def grader_structured(output: dict, schema: dict, required_fields: list[str]) -> tuple[bool, str]:
    try:
        Draft202012Validator(schema).validate(output)
    except ValidationError as e:
        return False, str(e.message)
    for field in required_fields:
        if field not in output:
            return False, f"missing required field: {field}"
    return True, ""

Code generation:

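import pathlib
import subprocess
import tempfile

def grader_code_compiles_ts(code: str) -> tuple[bool, str]:
    # A per-call temp directory avoids collisions between parallel workers
    # and is cleaned up on exit.
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "eval_out.ts"
        src.write_text(code)
        try:
            result = subprocess.run(
                ["tsc", "--noEmit", "--strict", str(src)],
                capture_output=True, text=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            return False, "tsc timed out after 30s"
    # tsc prints diagnostics to stdout, so return both streams.
    return result.returncode == 0, result.stdout + result.stderr

def grader_tests_pass(code: str, test_file: str) -> tuple[bool, str]:
    # The temp directory lives under the working directory so the project's Jest
    # config applies (assumption: it is set up to run TypeScript, e.g. via ts-jest).
    with tempfile.TemporaryDirectory(dir=".") as tmp:
        tmp_dir = pathlib.Path(tmp).resolve()
        (tmp_dir / "eval_impl.ts").write_text(code)
        test_path = tmp_dir / "eval.test.ts"  # name matches Jest's default testMatch
        test_path.write_text(test_file)
        try:
            # No --passWithNoTests: a run that finds no tests must fail, not pass.
            result = subprocess.run(
                ["npx", "jest", "--runTestsByPath", str(test_path)],
                capture_output=True, text=True, timeout=60
            )
        except subprocess.TimeoutExpired:
            return False, "jest timed out after 60s"
    # Jest writes its test report to stderr; keep the tail of both streams.
    return result.returncode == 0, (result.stdout + result.stderr)[-500:]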

Summary tasks:

def grader_summary(output: dict, max_chars: int = 160, must_not_contain: list[str] | None = None) -> tuple[bool, str]:
    text = output.get("summary", "")
    if len(text) > max_chars:
        return False, f"summary too long: {len(text)} > {max_chars}"
    if must_not_contain:
        for phrase in must_not_contain:
            if phrase.lower() in text.lower():
                return False, f"summary contains forbidden phrase: {phrase}"
    return True, ""

Common mistakes with deterministic graders

Evaluation tooling in 2026

If you want a framework rather than a raw harness:

When an LLM judge earns its keep

One case: genuinely open-ended creative output where the output is too unconstrained for a property check. Use a judge here, but with two guardrails:

Everything else: skip the judge. The loudest complaint on Hacker News about commercial eval platforms in 2026 is that the vendor’s judge changes without notice and nobody publishes the calibration drift. That complaint does not apply to a harness you own and version-control.

Frequently asked questions

How many cases do I need in the fixture? The TCC editorial track uses 200 production cases plus 50 targeted edge cases. That size catches 8 of 8 planted regressions on the fixture. If you only have 20 cases, you will catch obvious regressions but miss subtle ones. Build toward 200 production cases as quickly as the logging infrastructure allows.

What if I genuinely cannot write a deterministic grader? Start with the properties you can check: length bounds, schema validity, absence of placeholder text, compilation. These catch the majority of regressions. Reserve the judge for the residual ambiguity you cannot reduce to a rule. Even a hybrid (deterministic graders plus one judge question per case) is significantly better than a pure judge approach.
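A rough sketch of that hybrid, under stated assumptions: the deterministic checks run first, and the judge is consulted only for outputs that already pass every rule. The `judge` callable is a stand-in for whatever judge call you use, and the check functions and their names are illustrative, not part of the harness above.

```python
from typing import Callable, Optional

Check = Callable[[dict], tuple[bool, str]]


def has_summary(output: dict) -> tuple[bool, str]:
    text = output.get("summary")
    ok = isinstance(text, str) and bool(text.strip())
    return ok, ("" if ok else "summary missing or empty")


def within_length(output: dict, max_chars: int = 160) -> tuple[bool, str]:
    text = str(output.get("summary", ""))
    ok = len(text) <= max_chars
    return ok, ("" if ok else f"summary too long: {len(text)} > {max_chars}")


def grade_hybrid(
    output: dict,
    checks: list[Check],
    judge: Optional[Callable[[dict], bool]] = None,
) -> tuple[bool, str]:
    """Run deterministic checks first; call the judge only if all of them pass."""
    for check in checks:
        ok, reason = check(output)
        if not ok:
            return False, f"deterministic: {reason}"  # judge is never consulted
    if judge is None:
        return True, ""
    return (True, "") if judge(output) else (False, "judge: rejected")
```

Ordering it this way means structural failures are never laundered through the judge, and the judge's cost is paid only on the residual cases.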

How do I handle non-deterministic model outputs? Run each fixture case 3 times and check that the grader passes on at least 2 of 3 runs. A grader that requires an exact string match on a stochastic output will produce false positives. Graders that check structure, schema, compilation, and property invariants are robust to output variation.
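The 2-of-3 rule can be sketched as a small wrapper. `run_case` here is a hypothetical zero-argument callable that calls the model once, applies the grader, and returns True on a pass. The early exits are an optimization I added, not something the rule requires.

```python
from typing import Callable


def passes_majority(
    run_case: Callable[[], bool], attempts: int = 3, required: int = 2
) -> bool:
    """Return True if at least `required` of up to `attempts` runs pass."""
    passes = 0
    for i in range(attempts):
        if run_case():
            passes += 1
        if passes >= required:
            return True  # enough passes already; skip remaining runs
        remaining = attempts - i - 1
        if passes + remaining < required:
            return False  # cannot reach the quota any more
    return False
```

Stopping early saves a model call whenever the first two runs agree, which matters once the fixture has a few hundred cases.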

What is the right green baseline? Start by running the harness against the current production model with no changes. The median pass rate across 5 runs is your green baseline. Set the regression threshold at current minus 1 percentage point. As the model improves over time, update the baseline upward; never lower it to make a regression disappear.

The strict-JSON prompt pairs with the schema-validator grader. The property-based test prompt writes the Hypothesis strategies the harness depends on. The GPT-5.3-Codex review has the numbers for a model that passes the strict-JSON slot on deterministic grading alone.

Verdict

Deterministic graders plus a schema validator plus an executable check catch 100% of the regressions planted on the TCC editorial fixture and raise zero false positives across 500 runs. A judge model is a comfortable substitute for actually reading the output format. Read the format. Write the grader. Wire it into CI so a prompt change that breaks structured output never reaches production. Keep the judge as a last resort for genuinely unconstrained creative output, not as a default for everything else.
