APR 23
10 min
eval
Evals without LLM judges: a harness that catches regressions
How I score LLM pipelines without an LLM-as-judge. Deterministic graders, property-based checks, and the 4 reasons a judge model keeps biting you in production.
7.9
peer score
APR 23
10 min
chunking
RAG defaults 2026: chunks, rerankers, 3 settings that matter
The chunk size, overlap, rerank, and top-k values that moved my retrieval accuracy from 74% to 91%. Tested on a 1,400-chunk corpus with a ground-truth answer set.
8.1
peer score
APR 23
13 min
intermediate
Structured outputs, three years in: the one pattern that survived
Three years of shipping LLM structured outputs in production. The one pattern that survived, the three that did not, and the strict-JSON failure rate I run at today.
9.1
peer score