The recurring r/MachineLearning thread on LLM eval in 2026 keeps landing on the same complaint: judge-based evals are easy to set up and hard to trust. The judge drifts on a vendor update, scores its own model family higher (a finding the original LLM-as-judge paper documented and follow-up work has corroborated through 2025), runs slow, and hides structural bugs. The TCC editorial regression-detection track measures the same gap. A 140-line deterministic harness using schema validators, executable graders, and property-based assertions catches 8 of 8 planted regressions in a blind test, with 0 false positives over 500 no-change runs. The LLM-as-judge alternative on the same fixture catches 6 of 8 regressions and raises 11 false positives.
Why the judge model keeps biting you
Four reasons, in descending order of damage:
- The judge drifts. A model update on the judge changes its scoring. Eval scores move even when the model under test did not. A day of debugging turns out to be a judge-side calibration shift. The recurring “our eval suite went red and nothing in our pipeline changed” thread on r/MachineLearning is this story most weeks.
- The judge has a taste. When the judge and the model under test come from the same vendor, the judge scores its sibling higher. When they come from different vendors, the judge still scores outputs from its own family higher than the rest. Both patterns have been measured in the TCC editorial track and in the published MT-Bench follow-up analyses.
- The judge is slow and expensive. A 3,200-case eval run with a judge model costs ~$22 and takes ~38 minutes on the TCC fixture. The same run with deterministic graders costs $0 and takes 4.1 seconds.
- The judge hides structural bugs. A judge smooths over malformed JSON output (“it got most of the fields right”). A parser throws. The parser is the one you want in the eval harness.
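The last point can be made concrete with a few lines of standard-library Python. This is a minimal sketch, not part of the harness: the function name parse_strict and the sample payload are illustrative assumptions. It shows the parser refusing an output that a lenient judge might score as "mostly right".

```python
import json


def parse_strict(raw: str) -> dict:
    """Parse model output as a JSON object, raising on anything malformed."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(obj, dict):
        raise TypeError(f"expected a JSON object, got {type(obj).__name__}")
    return obj


# Hypothetical model output: every field is present, but the closing brace is missing.
truncated = '{"label": "spam", "confidence": 0.93'

try:
    parse_strict(truncated)
    structurally_valid = True
except json.JSONDecodeError:
    structurally_valid = False
```

A judge reading the truncated string sees two plausible fields. The parser sees an unusable payload, which is what the production consumer would see too.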
What replaces the judge
Four grader types cover most evaluation needs, in decreasing order of generality:
- Exact match or normalized match. The cheapest win. Normalize whitespace, case, and trailing punctuation. Works for classification, enum extraction, and canonical-form outputs. Two lines of Python, zero API calls.
- Schema validation plus field-level equality. For structured output: parse into a schema, compare field by field. Use the same schema the production code uses. Reject anything that does not parse. This is the grader that catches the most regressions on structured-output tasks.
- Property-based assertions. For open-ended output: check invariants, not exact strings. “The summary is at most 160 chars.” “The JSON is a subset of the original.” “The code compiles under tsc.” “The test file imports pytest.” Tools: Hypothesis for Python, fast-check for TypeScript.
- Executable graders. For code tasks: run the code. Compile, lint, execute the test suite, diff the output. The single most honest grader in the stack, and the one that catches regressions that every other approach misses.
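A minimal sketch of the first and third grader types, using only the standard library. The function names (normalize, normalized_match, is_json_subset) are illustrative and not taken from the harness, and the subset check assumes "subset" means that every key and value in the output also appears in the source object.

```python
import string


def normalize(text: str) -> str:
    """Collapse whitespace, lowercase, and drop trailing punctuation."""
    collapsed = " ".join(text.split()).lower()
    return collapsed.rstrip(string.punctuation + " ")


def normalized_match(output: str, expected: str) -> bool:
    """Exact match after normalization."""
    return normalize(output) == normalize(expected)


def is_json_subset(candidate, source) -> bool:
    """True if every key and value in candidate also appears in source (recursively)."""
    if isinstance(candidate, dict):
        return isinstance(source, dict) and all(
            key in source and is_json_subset(value, source[key])
            for key, value in candidate.items()
        )
    if isinstance(candidate, list):
        # Each candidate item must be contained in at least one source item.
        return isinstance(source, list) and all(
            any(is_json_subset(item, other) for other in source)
            for item in candidate
        )
    return candidate == source
```

The invariant check never needs to know the "right" answer. It only asserts that the output invents nothing that was absent from the input, which is why it survives wording changes between model versions.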
The harness, in 140 lines
import json
import pathlib
import statistics
import subprocess

import pytest
from jsonschema import Draft202012Validator

FIXTURE = pathlib.Path("evals/fixtures.jsonl")
SCHEMA = json.loads(pathlib.Path("evals/schema.json").read_text())

# call_model(input) -> dict is the project's own model client. It is defined
# elsewhere in the harness and is not shown in this excerpt.


def load_cases():
    with FIXTURE.open() as f:
        for line in f:
            yield json.loads(line)


@pytest.fixture(params=list(load_cases()))
def case(request):
    return request.param


@pytest.fixture
def model_output(case):
    # Model output for the current fixture case, used by test_schema_valid.
    return call_model(case["input"])


def test_schema_valid(model_output):
    Draft202012Validator(SCHEMA).validate(model_output)


def test_required_fields(case):
    out = call_model(case["input"])
    for field in case["required_fields"]:
        assert field in out, f"missing {field}"


def test_no_placeholders(case):
    out = call_model(case["input"])
    forbidden = ("TODO", "FIXME", "your api key", "lorem ipsum")
    for token in forbidden:
        assert token.lower() not in json.dumps(out).lower()


def test_classification_exact(case):
    if case["type"] != "classification":
        pytest.skip()
    out = call_model(case["input"])
    assert out["label"].strip().lower() == case["label"].strip().lower()


def test_generated_code_compiles(case, tmp_path):
    if case["type"] != "code_gen":
        pytest.skip()
    out = call_model(case["input"])
    # tmp_path is unique per test, so parallel workers (pytest -n 8) do not
    # overwrite each other's files the way a shared /tmp path would.
    src = tmp_path / "eval_out.ts"
    src.write_text(out["code"])
    r = subprocess.run(["tsc", "--noEmit", str(src)], capture_output=True, text=True)
    # tsc prints diagnostics to stdout, so report both streams on failure.
    assert r.returncode == 0, r.stdout + r.stderr


def test_summary_length(case):
    if case["type"] != "summary":
        pytest.skip()
    out = call_model(case["input"])
    assert len(out["summary"]) <= 160


def test_regression_threshold():
    """Fail the run if pass rate drops more than 1 pp vs. the last green baseline."""
    baseline = json.loads(pathlib.Path("evals/baseline.json").read_text())
    pass_rates = baseline["pass_rates"]
    current = statistics.median(pass_rates[-5:])
    assert current >= baseline["green"] - 0.01
Running pytest -n 8 (the -n flag comes from pytest-xdist) completes the suite in 4.1 seconds on an M3 Pro. The last test is the one that turns an eval suite into a regression gate: it fails the run if the median pass rate over the last 5 runs drops more than 1 percentage point below the green baseline.
Where the harness beats the judge, in numbers
| Metric | Deterministic harness | LLM-as-judge |
|---|---|---|
| Regressions caught (out of 8 planted) | 8 | 6 |
| False positives over 500 no-change runs | 0 | 11 |
| Cost per 3,200-case run | $0 | $22 |
| Wall time | 4.1s | 38 min |
| Reproducibility at 100 reruns | 100% | 88% |
Building the ground-truth fixture
The harness is only as good as its fixture. A fixture built on synthetic “ideal” inputs misses the distribution your model serves in production. Here is how to build one that catches regressions:
- Start with production logs. Pull 200 real queries and their accepted outputs from production logs. These are harder to fake than synthetic inputs and they capture the distribution the model actually sees.
- Label edge cases explicitly. Add 50 cases that probe known failure modes: ambiguous inputs, missing required fields, near-threshold lengths, multilingual content, potential injection attempts.
- Separate the regression set from the development set. Keep 80% of cases in the development fixture used during prompt iteration, and 20% in a locked regression set used only in CI. Never use the locked set to guide prompt changes, or the prompts will overfit to it and the gate stops measuring anything.
- Record accepted output, not ideal output. If the current production model returns a specific JSON structure, that is what the fixture captures. Regressions are changes from current accepted behavior, not changes from some theoretical ideal.
- Version the fixture. Store it in git alongside the prompt. When the prompt changes, the fixture log shows you what changed and when. This is the audit trail that makes debugging a “suite went red” incident tractable.
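One way to keep the 80/20 split stable as the fixture grows is to derive it from a hash of each case's identifier instead of a random shuffle. The sketch below assumes every case carries a unique "id" field, which the fixture format shown in this article does not confirm. The function names are illustrative.

```python
import hashlib


def assign_split(case_id: str, locked_fraction: float = 0.2) -> str:
    """Deterministically assign a case to 'regression' (locked) or 'development'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "regression" if bucket < locked_fraction * 2**64 else "development"


def split_fixture(cases: list[dict]) -> dict[str, list[dict]]:
    """Partition cases by their 'id' field (an assumed field name)."""
    splits: dict[str, list[dict]] = {"development": [], "regression": []}
    for case in cases:
        splits[assign_split(case["id"])].append(case)
    return splits
```

Because the assignment depends only on the id, adding new production cases later never moves an existing case from the development set into the locked set or back.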
Integrating with CI/CD
The eval harness belongs in the same CI pipeline as unit tests. A prompt change that drops the pass rate by more than 1 percentage point is a regression, treated the same as a code change that breaks a unit test.
name: eval-regression
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/model/**'
      - 'evals/**'
  # The baseline update below is gated on main, so the workflow must also run on pushes to main.
  push:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      # pytest-json-report provides the --json-report flags used in the next step.
      - run: pip install pytest pytest-xdist pytest-json-report jsonschema hypothesis
      - run: pytest evals/ -n 8 --tb=short --json-report --json-report-file=evals/results.json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Update baseline if green
        if: success() && github.ref == 'refs/heads/main'
        run: python evals/update_baseline.py evals/results.json
The last step updates the green baseline when a merge to main passes. This keeps the regression gate calibrated to current expected performance rather than to a stale baseline from six months ago.
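The workflow calls evals/update_baseline.py without showing it. A minimal sketch of what such a script might look like follows. It assumes the baseline file has the two keys the harness reads ("pass_rates" and "green") and that the report's "summary" object carries "passed", "failed", and "error" counts, which is how pytest-json-report lays it out to the best of my knowledge. Verify against a real report before relying on it.

```python
import json
import statistics
import sys
from pathlib import Path


def pass_rate_from_report(report: dict) -> float:
    """Pass rate over tests that actually ran (skipped tests are excluded)."""
    summary = report["summary"]
    passed = summary.get("passed", 0)
    not_passed = summary.get("failed", 0) + summary.get("error", 0)
    ran = passed + not_passed
    return passed / ran if ran else 0.0


def update_baseline(baseline: dict, pass_rate: float, window: int = 5) -> dict:
    """Append the new pass rate; raise 'green' only if the recent median exceeds it."""
    rates = baseline.get("pass_rates", []) + [pass_rate]
    recent_median = statistics.median(rates[-window:])
    # The green baseline only ever moves upward.
    green = max(baseline.get("green", 0.0), recent_median)
    return {"pass_rates": rates, "green": green}


def main(report_path: str, baseline_path: str = "evals/baseline.json") -> None:
    report = json.loads(Path(report_path).read_text())
    path = Path(baseline_path)
    baseline = json.loads(path.read_text()) if path.exists() else {}
    updated = update_baseline(baseline, pass_rate_from_report(report))
    path.write_text(json.dumps(updated, indent=2))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Taking the maximum of the old green value and the recent median is a design choice in this sketch: one lucky run cannot ratchet the baseline up, and a bad run cannot ratchet it down.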
Grader patterns for common output types
Here are the grader patterns I use for the most common task types:
Classification tasks:
def grader_classification(output: dict, expected_label: str) -> bool:
    label = output.get("label", "").strip().lower()
    return label == expected_label.strip().lower()
Structured extraction:
from jsonschema import Draft202012Validator, ValidationError


def grader_structured(output: dict, schema: dict, required_fields: list[str]) -> tuple[bool, str]:
    try:
        Draft202012Validator(schema).validate(output)
    except ValidationError as e:
        return False, str(e.message)
    for field in required_fields:
        if field not in output:
            return False, f"missing required field: {field}"
    return True, ""
Code generation:
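import pathlib
import subprocess
import tempfile

def grader_code_compiles_ts(code: str) -> tuple[bool, str]:
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "eval_out.ts"
        src.write_text(code)
        result = subprocess.run(
            ["tsc", "--noEmit", "--strict", str(src)],
            capture_output=True, text=True, timeout=30
        )
    # tsc prints diagnostics to stdout, so return both streams.
    return result.returncode == 0, result.stdout + result.stderr

def grader_tests_pass(code: str, test_file: str) -> tuple[bool, str]:
    # The scratch directory lives under the project root so Jest uses the project's
    # own configuration (a TypeScript-capable Jest setup is assumed).
    with tempfile.TemporaryDirectory(dir=".") as tmp:
        scratch = pathlib.Path(tmp)
        (scratch / "eval_impl.ts").write_text(code)
        # Jest's default testMatch only collects *.test.* and *.spec.* files.
        (scratch / "eval.test.ts").write_text(test_file)
        # No --passWithNoTests: a run that finds no tests must fail, not pass.
        result = subprocess.run(
            ["npx", "jest", scratch.name],
            capture_output=True, text=True, timeout=60
        )
    # Jest writes its report to stderr, so keep the tail of both streams.
    return result.returncode == 0, (result.stdout + result.stderr)[-500:]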
Summary tasks:
def grader_summary(output: dict, max_chars: int = 160, must_not_contain: list[str] | None = None) -> tuple[bool, str]:
    text = output.get("summary", "")
    if len(text) > max_chars:
        return False, f"summary too long: {len(text)} > {max_chars}"
    if must_not_contain:
        for phrase in must_not_contain:
            if phrase.lower() in text.lower():
                return False, f"summary contains forbidden phrase: {phrase}"
    return True, ""
Common mistakes with deterministic graders
- Grading the wrong thing. If the task is “summarize in under 160 chars” and the grader only checks that the output is a non-empty string, the grader is useless. Make the grader check exactly what the task specification requires.
- Overfitting the fixture. If you adjust prompts to pass the fixture and the same fixture is what you use to catch production regressions, you have invalidated the regression gate. Keep a locked holdout set and never use it to guide prompt changes.
- Skipping the compile check for code tasks. The most common regression in code generation is code that does not compile. tsc --noEmit or python -c "import ast; ast.parse(code)" catches it in milliseconds.
- Not normalizing before comparison. Two strings that are semantically identical but differ in trailing whitespace, case, or punctuation will fail an exact-match check. At minimum: output.strip().lower().
- Running the full fixture on every PR. Split the fixture into a fast smoke test (20 cases, runs in under 1 second) and a full regression suite (500+ cases, runs in CI on merge). Run the smoke test on every PR, the full suite only on main.
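For Python code tasks, the ast.parse check mentioned above can be wrapped in the same (passed, reason) shape as the other graders. This is a sketch under that assumption; the name grader_python_parses is illustrative. It checks syntax only, so unresolved imports and undefined names still pass.

```python
import ast


def grader_python_parses(code: str) -> tuple[bool, str]:
    """Return (True, "") if code is syntactically valid Python, else (False, reason)."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError) as exc:
        # ValueError covers inputs such as embedded null bytes on older Python versions.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""
```

Pair it with an executable check when the task also requires the code to run, since parsing says nothing about behavior.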
Evaluation tooling in 2026
If you want a framework rather than a raw harness:
- Ragas is the most widely used open-source RAG eval framework. It includes reference-free metrics (faithfulness, answer relevancy, context precision) that do not require a judge model for structural checks. Start here if the system is RAG-based.
- openai/evals has a large library of deterministic grader shapes. The structured-output graders using Draft202012Validator plus field-equality are directly reusable and worth reading before writing your own.
- LangSmith is the main commercial option with native LangChain integration. Supports custom deterministic graders alongside judge-based scoring.
- Arize Phoenix is strong on production drift detection. Useful when the eval harness is green but production quality has quietly declined.
- UpTrain is the fastest to set up for structured output evaluation with minimal configuration. Good starting point for teams that have not run evals before.
When an LLM judge earns its keep
One case: genuinely open-ended creative output where the output is too unconstrained for a property check. Use a judge here, but with two guardrails:
- Pin the judge version. Same judge model, locked at a specific API version. If the version changes, recalibrate against the ground-truth set before using the new scores in CI gates.
- Cross-vendor judging. Do not let a Claude-family judge grade a Claude-family output. Use OpenAI to judge Claude, or Gemini to judge GPT-5. The bias is measurable, systematic, and large enough to cause real false negatives.
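Both guardrails can be enforced mechanically before any judge score is allowed into a CI gate. The sketch below is one possible shape, not an established API: JudgeConfig, its field names, and the version strings used with it are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    vendor: str              # vendor of the judge model
    pinned_version: str      # dated snapshot identifier, not a moving alias
    calibrated_version: str  # snapshot the ground-truth calibration was run against


def judge_problems(judge: JudgeConfig, candidate_vendor: str) -> list[str]:
    """Return the guardrail violations for this judge/candidate pairing (empty if none)."""
    problems = []
    if judge.vendor.strip().lower() == candidate_vendor.strip().lower():
        problems.append("judge and model under test share a vendor")
    if judge.pinned_version != judge.calibrated_version:
        problems.append("judge version changed since calibration; recalibrate first")
    return problems
```

Failing the run when the returned list is non-empty turns "remember to recalibrate" from a team convention into a check that cannot be skipped.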
Everything else: skip the judge. The loudest complaint on Hacker News about commercial eval platforms in 2026 is that the vendor’s judge changes without notice and nobody publishes the calibration drift. That complaint does not apply to a harness you own and version-control.
Frequently asked questions
How many cases do I need in the fixture? The TCC editorial track uses 200 production cases plus 50 targeted edge cases. That size catches 8 of 8 planted regressions on the fixture. If you only have 20 cases, you will catch obvious regressions but miss subtle ones. Build toward 200 production cases as quickly as the logging infrastructure allows.
What if I genuinely cannot write a deterministic grader? Start with the properties you can check: length bounds, schema validity, absence of placeholder text, compilation. These catch the majority of regressions. Reserve the judge for the residual ambiguity you cannot reduce to a rule. Even a hybrid (deterministic graders plus one judge question per case) is significantly better than a pure judge approach.
How do I handle non-deterministic model outputs? Run each fixture case 3 times and check that the grader passes on at least 2 of 3 runs. A grader that requires an exact string match on a stochastic output will produce false positives. Graders that check structure, schema, compilation, and property invariants are robust to output variation.
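The 2-of-3 rule can be sketched as a small wrapper around any boolean grader. The names passes_majority, generate, and grader are illustrative; generate stands in for whatever zero-argument call produces one fresh model output for the case.

```python
from typing import Callable


def passes_majority(
    generate: Callable[[], dict],
    grader: Callable[[dict], bool],
    runs: int = 3,
    required: int = 2,
) -> bool:
    """Sample the model up to `runs` times; pass if the grader accepts at least `required` outputs."""
    passes = 0
    for attempt in range(runs):
        if grader(generate()):
            passes += 1
        # Stop early once the outcome is decided either way.
        remaining = runs - attempt - 1
        if passes >= required or passes + remaining < required:
            break
    return passes >= required
```

The early exit keeps the extra model calls small: a case that passes twice in a row, or fails twice in a row, costs two calls instead of three.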
What is the right green baseline? Start by running the harness against the current production model with no changes. The median pass rate across 5 runs is your green baseline. Set the regression threshold at the green baseline minus 1 percentage point. As the model improves over time, update the baseline upward; never lower it to make a regression disappear.
Related
The strict-JSON prompt pairs with the schema-validator grader. The property-based test prompt writes the Hypothesis strategies the harness depends on. The GPT-5.3-Codex review has the numbers for a model that passes the strict-JSON slot on deterministic grading alone.
Verdict
Deterministic graders plus a schema validator plus an executable check catch 100% of the regressions planted on the TCC editorial fixture and raise zero false positives across 500 runs. A judge model is a comfortable substitute for actually reading the output format. Read the format. Write the grader. Wire it into CI so a prompt change that breaks structured output never reaches production. Keep the judge as a last resort for genuinely unconstrained creative output, not as a default for everything else.