APR 23
11 min
analysis
Long-context evals diverge from reality: the 1M-token gap
Vendor-reported 1M-context benchmark scores keep beating what I measure on my production RAG task by 30+ points. The three reasons the benchmarks lie, and what I trust instead.
APR 23
12 min
context
Gemini 3.1 Pro review: cheapest frontier token, 4 places it lags
Gemini 3.1 Pro scored 7.8 on refactoring and 7.9 on structured output at $0.21 per task. The domains where cheap wins and where you need to route traffic elsewhere.