How I score LLM pipelines without an LLM-as-judge. Deterministic graders, property-based checks, and the 4 reasons a judge model keeps biting you in production.
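A minimal sketch of what a deterministic grader plus property-based checks can look like without any judge model. All names here are hypothetical illustrations, not the post's actual harness: the grader does normalized substring matching against a gold answer, and the property checks enforce structural invariants (valid JSON, required fields, citations restricted to known sources).

```python
import json
import re

def normalize(text: str) -> str:
    # Lowercase, collapse whitespace, strip punctuation so comparison is deterministic.
    return re.sub(r"[^a-z0-9 ]", "", re.sub(r"\s+", " ", text.lower())).strip()

def grade_answer(answer: str, gold: str) -> bool:
    # Deterministic grader: pass iff the normalized gold string appears in the answer.
    return normalize(gold) in normalize(answer)

def check_properties(raw_output: str, allowed_ids: set[str]) -> list[str]:
    # Property-based checks: invariants that hold for every valid output,
    # independent of the question being asked.
    failures = []
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if "answer" not in obj:
        failures.append("missing 'answer' field")
    cited = set(obj.get("citations", []))
    if not cited <= allowed_ids:
        failures.append(f"cites unknown sources: {sorted(cited - allowed_ids)}")
    return failures
```

Because both checks are pure functions of the output string, a regression in the pipeline reproduces exactly on re-run, which is the main thing a judge model cannot give you.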
The chunk size, overlap, rerank, and top-k values that moved my retrieval accuracy from 74% to 91%. Tested on a 1,400-chunk corpus with a ground-truth answer set.
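A sketch of the shape of such a sweep: a config object holding the four knobs and a recall metric to score each setting against the ground-truth set. The default values below are placeholders, not the post's winning configuration.

```python
from dataclasses import dataclass

@dataclass
class RetrievalConfig:
    # Placeholder values; the real numbers come from sweeping each knob
    # against a ground-truth answer set.
    chunk_size: int = 512   # tokens per chunk
    overlap: int = 64       # token overlap between adjacent chunks
    top_k: int = 20         # candidates fetched before reranking
    rerank_keep: int = 5    # chunks kept after the reranker

def recall_at_k(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    # Fraction of ground-truth chunks present in the retrieved set.
    if not gold_ids:
        return 1.0
    return len(gold_ids & set(retrieved_ids)) / len(gold_ids)
```

The point of the dataclass is that every accuracy number in the sweep is attributable to one recorded config, so the 74%-to-91% delta is reproducible rather than anecdotal.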
Autonomous coding agents still fail 1 in 9 production runs on my suite. The three failure modes that cause it, and where a bounded planner is the honest answer.
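"Bounded planner" in the sense of a hard step budget can be sketched in a few lines; this is an illustrative loop under assumed semantics, not the post's agent. The agent gets a fixed number of planning steps and the loop terminates either on an explicit finish action or when the budget runs out, never on an open-ended condition.

```python
def run_bounded(plan_step, max_steps: int = 8) -> dict:
    # Bounded planner: a hard step budget replaces the open agent loop.
    # plan_step sees the current state and returns the next action string.
    state = {"done": False, "history": []}
    for _ in range(max_steps):
        action = plan_step(state)
        state["history"].append(action)
        if action == "finish":
            state["done"] = True
            break
    # If the budget is exhausted, state["done"] stays False and the caller
    # must handle the partial result explicitly instead of looping forever.
    return state
```

The honest part is the return value: an unfinished run is surfaced as `done: False` rather than retried silently.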
Three years of shipping LLM structured outputs in production. The one pattern that survived, the three that did not, and the strict-JSON failure rate I run at today.
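One way to make "strict JSON" concrete, as an illustrative sketch rather than the post's surviving pattern: reject anything that is not exactly the expected shape, and measure the failure rate directly from the rejects. The `REQUIRED` schema below is hypothetical.

```python
import json

# Hypothetical expected shape: exactly these keys, with these value types.
REQUIRED = {"name": str, "score": float}

def parse_strict(raw: str) -> dict:
    # Strict parse: malformed JSON, extra keys, missing keys, and wrong
    # value types are all hard failures, not best-effort repairs.
    obj = json.loads(raw)
    if set(obj) != set(REQUIRED):
        raise ValueError(f"unexpected keys: {set(obj) ^ set(REQUIRED)}")
    for key, typ in REQUIRED.items():
        if not isinstance(obj[key], typ):
            raise ValueError(f"{key!r} is not {typ.__name__}")
    return obj

def failure_rate(outputs: list[str]) -> float:
    # Fraction of model outputs rejected by the strict parser.
    failed = 0
    for raw in outputs:
        try:
            parse_strict(raw)
        except (ValueError, json.JSONDecodeError):
            failed += 1
    return failed / len(outputs)
```

Counting rejects at the parser gives a single number to track release over release, which is what makes a claim like "the failure rate I run at today" measurable at all.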
The retry policy that cut my agent-loop 429 rate from 6% to 0.2% across 4 vendors. Jitter, step-budget interlock, and the one thing you should never retry.
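The three ingredients named above can be sketched together; this is a minimal illustration under assumed semantics (`call` returns a status code and a result, the budget is a mutable counter), not the post's exact policy. Full-jitter exponential backoff spreads retries out, the step-budget interlock stops retries from starving the rest of the agent loop, and non-retryable statuses fail immediately.

```python
import random
import time

# Statuses worth retrying. Client errors like 400/401 are the "never retry"
# case: repeating them burns budget without ever succeeding.
RETRYABLE = {429, 500, 502, 503}

def call_with_retry(call, budget: dict, max_attempts: int = 5, base: float = 0.5):
    # call() -> (status, result). budget["steps"] is the agent's shared
    # step budget; each retry consumes one step (the interlock).
    for attempt in range(max_attempts):
        status, result = call()
        if status == 200:
            return result
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        if budget["steps"] <= 0:
            raise RuntimeError("step budget exhausted; giving up retry")
        budget["steps"] -= 1
        # Full jitter: sleep uniform(0, base * 2^attempt), capped at 8s,
        # so concurrent agents do not retry in lockstep after a 429 burst.
        time.sleep(min(random.uniform(0, base * 2 ** attempt), 8.0))
    raise RuntimeError("retries exhausted")
```

Charging retries against the same budget as planning steps is the design choice that matters: a vendor outage then degrades the run instead of silently multiplying its length.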