Long-context evals keep diverging from reality: the 1M-token number nobody earns

Every vendor shipping a 1M+ context window in 2026 publishes a near-perfect “needle in a haystack” number. The recurring r/LocalLLaMA and Hacker News threads on long-context evaluation have spent the last 18 months pointing out the same gap: needle scores measure a structural lookup, while production RAG has to answer paraphrased queries against a corpus full of distractors. The TCC editorial RAG fixture (1,400 chunks, 12 queries with ground-truth answers, English + German) measures the divergence directly. Vendor needle numbers sit at 98% or higher; the same models score 71-76% top-1 on the raw long-context path of the TCC task. The gap is not shrinking. It is growing.

The divergence, in numbers

Model	Vendor-claimed needle recall @ 1M	TCC RAG top-1 (raw long-context)	Gap
Claude Opus 4.7 (1M)	>99%	73%	26 pp
Gemini 3.1 Pro (1M)	>99%	75%	24 pp
GPT-5.3-Codex (400k)	>98%	71%	27 pp
GPT-5.5 (1M)	>99%	76%	23 pp

Note: the TCC top-1 here is the raw long-context answer without a reranker. With the reranker stack from the RAG defaults guide, top-1 moves into the 88-91% range for the models tested. That is the actual signal. Vendor needle numbers describe a benchmark that does not match production retrieval; the reranker stack is what closes the gap.
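The Gap column is plain subtraction in percentage points. The sketch below recomputes it from the table, treating each vendor figure as a floor (">99%" becomes 99), so every gap is a lower bound. The variable and function names are illustrative only.

```python
# Figures copied from the table above. Vendor values are lower bounds
# (">99%" is entered as 99), so each computed gap is a minimum.
rows = [
    ("Claude Opus 4.7 (1M)", 99, 73),
    ("Gemini 3.1 Pro (1M)", 99, 75),
    ("GPT-5.3-Codex (400k)", 98, 71),
    ("GPT-5.5 (1M)", 99, 76),
]


def gap_pp(vendor_floor: int, measured: int) -> int:
    """Gap in percentage points between the vendor floor and the measured top-1."""
    return vendor_floor - measured


gaps = {model: gap_pp(vendor, measured) for model, vendor, measured in rows}

for model, gap in gaps.items():
    print(f"{model}: at least {gap} pp")
```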

Three reasons the benchmarks lie

Needles are structural, not semantic. The standard test inserts an unambiguous sentence (“The secret key is BANANA47”) into a long irrelevant passage. Production retrieval is never this clean: real queries are paraphrases, real corpora have boilerplate, and the “correct” chunks sometimes half-overlap with distractor chunks that also mention the query entity.
Position bias is real, and the benchmarks front-load the good cases. Every frontier model is better at recalling content from the first 20% and the last 20% of the context. The middle 60% drops 10-15 points. The public benchmarks average across positions, which hides the mid-context drop; the RULER benchmark explicitly tests this and shows numbers consistently below the vendor needle scores.
“Single needle” versus “multiple hops”. The TCC fixture has several queries where the correct answer requires reading two chunks 40k tokens apart. The public single-needle numbers do not test this. The openai/evals repository has a multi-hop variant that tracks real-world RAG better; vendors rarely publish numbers on it.
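To make the first point concrete, here is a minimal sketch of how a single-needle test case is typically assembled. It is not any vendor's harness: `build_haystack`, the filler sentences, and the depth convention are assumptions made for illustration. The needle is one verbatim sentence dropped at a chosen depth, which is why an exact string match is enough to find it.

```python
def build_haystack(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` among `filler` sentences at a fractional depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


# Hypothetical filler; a real test would use long irrelevant prose.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret key is BANANA47."

haystack = build_haystack(filler, needle, depth=0.5)

# The needle survives verbatim, so a plain substring check finds it.
found = "BANANA47" in haystack
```

A paraphrased production query has no such verbatim anchor, which is the difference the section describes.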

What to trust instead

Own corpus, own ground truth. One-time cost to label 50-200 queries against a representative corpus. The score from that labelled set is the only number that tracks production outcome.
Position-spread tests. Ask the same question with the correct chunk at 10%, 30%, 50%, 70%, 90%. If the model answers at 10% and 90% but not 50%, it is a position-bias story, not a context story.
Multi-hop queries on real data. If the production use case needs cross-document reasoning, benchmark cross-document reasoning. Not needles. The RULER benchmark's multi-hop tests are the closest public analogue.
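The position-spread test can be sketched as a small harness. Everything named here is an assumption for illustration: `ask` stands in for whatever function calls the model, answers are checked by substring match, and `edge_only_stub` is a toy that only "reads" the first and last 20% of the context so the harness can be run without an API.

```python
from typing import Callable


def position_spread(
    ask: Callable[[str, str], str],
    distractors: list[str],
    gold_chunk: str,
    question: str,
    expected: str,
    depths: tuple[float, ...] = (0.1, 0.3, 0.5, 0.7, 0.9),
) -> dict[float, bool]:
    """Ask the same question with the gold chunk placed at each depth."""
    results = {}
    for depth in depths:
        index = round(depth * len(distractors))
        chunks = distractors[:index] + [gold_chunk] + distractors[index:]
        answer = ask("\n\n".join(chunks), question)
        results[depth] = expected.lower() in answer.lower()
    return results


def looks_like_position_bias(results: dict[float, bool]) -> bool:
    """True when both edge depths succeed and some interior depth fails."""
    depths = sorted(results)
    edges_ok = results[depths[0]] and results[depths[-1]]
    interior_fails = any(not results[d] for d in depths[1:-1])
    return edges_ok and interior_fails


def edge_only_stub(context: str, question: str) -> str:
    """Toy stand-in for a model: sees only the first and last 20% of the context."""
    cut = len(context) // 5
    visible = context[:cut] + context[-cut:]
    return "Berlin" if "Berlin" in visible else "unknown"


distractors = [f"Distractor chunk {i:02d} with boilerplate text." for i in range(20)]
results = position_spread(
    edge_only_stub,
    distractors,
    gold_chunk="The office moved to Berlin in 2024.",
    question="Where did the office move?",
    expected="Berlin",
)
biased = looks_like_position_bias(results)
```

With a real model, swap `edge_only_stub` for the actual client call and use production chunks as distractors; the pass/fail pattern across depths is the signal.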

The implication

The 1M-context and “bigger is coming” marketing numbers are not wrong in the narrow sense. They describe a best-case scenario that does not match any production RAG pipeline shipped since 2023. When a vendor claims a RAG problem is solved by raising the context window, it is not. The retrieval stack (chunking, rerank, hybrid) still does most of the work. See the RAG defaults cheatsheet for the 91% recipe.

What the threads are saying

Several Hacker News threads through Q1 2026 converged on this divergence; the top comment on the most-read one is “needle-in-a-haystack is the MNIST of long-context evals”. On r/LocalLLaMA, the RULER benchmark is cited as a better long-context metric; it tests harder retrieval patterns and shows smaller numbers than the vendor decks. The Anthropic docs for Claude Opus 4.7 do publish RULER-style numbers alongside needle numbers; Google's Gemini docs publish fewer.

The full RAG-accuracy stack and the settings that move the TCC top-1 from 74% to 91% are in the RAG defaults guide. The Gemini 3.1 Pro review has the 1M-native RAG number, which is the highest measured on the TCC task.

One-line takeaway

Do not plan a RAG pipeline against a vendor needle-in-a-haystack number. Label 100 of the team's own queries, run them against three models, and pick the one that wins on the team's data at the position that matters.
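A minimal sketch of that comparison, under stated assumptions: each model is wrapped as a function from question to answer, correctness is a case-insensitive substring match against the labelled answer (a real setup may need a stricter grader), and the `canned` stubs with their queries are invented placeholders, not TCC data.

```python
from typing import Callable

LabelledQuery = tuple[str, str]  # (question, expected answer substring)
Model = Callable[[str], str]


def top1_accuracy(model: Model, labelled: list[LabelledQuery]) -> float:
    """Fraction of labelled queries whose answer contains the expected string."""
    if not labelled:
        raise ValueError("labelled set is empty")
    hits = sum(expected.lower() in model(q).lower() for q, expected in labelled)
    return hits / len(labelled)


def pick_winner(
    models: dict[str, Model], labelled: list[LabelledQuery]
) -> tuple[str, dict[str, float]]:
    """Score every model on the same labelled set and return the best one."""
    scores = {name: top1_accuracy(fn, labelled) for name, fn in models.items()}
    return max(scores, key=scores.get), scores


def canned(answers: dict[str, str]) -> Model:
    """Hypothetical stub: replies from a fixed lookup instead of calling an API."""
    return lambda question: answers.get(question, "")


# Placeholder labelled set; in practice this is the team's own 100 queries.
labelled = [
    ("q1", "Berlin"),
    ("q2", "1,400"),
    ("q3", "German"),
    ("q4", "reranker"),
]
models = {
    "model_a": canned({"q1": "Berlin", "q2": "about 1,400 chunks"}),
    "model_b": canned({"q1": "It is Berlin.", "q2": "1,400", "q3": "German"}),
}

winner, scores = pick_winner(models, labelled)
```

Running the same labelled set through every candidate keeps the comparison on the team's data rather than on a published needle score.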