The recurring r/MachineLearning and Hacker News threads on production RAG in 2026 converge on the same advice: bump chunk size to 768, add a reranker, and stop benchmarking embedding models. The TCC editorial RAG fixture (12 retrieval queries over a 1,400-chunk corpus, mixed Markdown + PDF, English + German, Pinecone index, Cohere Rerank v4.0, OpenAI text-embedding-3-large by default) measures the same gap the threads describe. Naive defaults (512-token chunks, no overlap, top-k 5, no reranker) hit 74% of queries right. The settings below move it to 91% on the median of 5 runs. Without touching the embedding model.
The settings that moved the number
| Setting | Default | TCC editorial value | Accuracy delta |
|---|---|---|---|
| Chunk size (tokens) | 512 | 768 | +5.2 pp |
| Chunk overlap (tokens) | 0 | 128 | +3.1 pp |
| Retriever top-k | 5 | 20 (then rerank to 3) | +6.0 pp |
| Reranker | none | Cohere Rerank v4.0 | +4.5 pp |
| Hybrid BM25 weight | 0 | 0.35 | +1.8 pp (noisy) |
| Boilerplate filter | none | regex on headers + footers | +2.3 pp |
The deltas sum to almost 23 points, more than the 17-point gap between the naive baseline and the tuned stack, because they are not additive. In practice the first four get the system to 91%. Boilerplate filtering and the BM25 hybrid add small but real gains on top.
Why chunk size matters more than chunk strategy
The TCC fixture compared semantic chunking (break on sentence boundaries), structural chunking (break on Markdown headings), and naive fixed-size chunking at 512, 768, and 1024 tokens. On the TCC corpus, fixed-size at 768 with 128 overlap beat every semantic variant by 2-4 points. The same pattern is reported in the Pinecone chunking guide: fixed-size with overlap holds up well on prose-heavy corpora; semantic chunking only wins on highly structured documents.
The reason is alignment. Semantic chunkers split on prose sentence boundaries; the 12 queries in the TCC fixture are about technical facts that cross those boundaries. A 768-token fixed chunk captures the fact plus one sentence of context on either side. A 512-token chunk often splits the fact from its surrounding definitions. A 1024-token chunk pulls in too much boilerplate and dilutes the embedding.
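For reference, the fixed-size variant is only a few lines. A minimal sketch, assuming a tiktoken tokenizer (cl100k_base); this is an illustration, not the fixture's exact splitter:

```python
import tiktoken

def chunk_fixed(text: str, size: int = 768, overlap: int = 128) -> list[str]:
    """Fixed-size token windows with overlap; 768/128 are the TCC editorial values."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: any tokenizer that round-trips text works
    tokens = enc.encode(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks
```

The 128-token overlap is what keeps a fact and the sentence that defines it inside at least one shared chunk.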
Mileage will vary by corpus. The only way to know is to run a ground-truth eval, which brings up the next point.
Rerankers are the biggest single lever
Top-k 5 without a reranker is the single worst default in the RAG stack. It forces the embedding model to be both a recall and a precision model. Embedding models are trained for recall. They pull in the right chunk inside the top-20 consistently; they do not order it at top-1 reliably.
The fix: set top-k to 20 on the retriever, then pass those 20 to a reranker and take the top 3. On the TCC fixture, Cohere Rerank v4.0 moved the top-1 accuracy from 77% to 91%. The total cost is an extra ~25ms of latency and $0.002 per query. The recurring “RAG in production” thread on r/MachineLearning has settled on this as table-stakes; the most-upvoted comment on the Q1 2026 thread reads “if you are not reranking, you are not doing RAG, you are doing vector search”.
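A minimal retrieve-then-rerank sketch against a Pinecone index and the Cohere SDK. The index name, metadata field, and model string are assumptions for illustration, not values pulled from the fixture:

```python
import cohere
from pinecone import Pinecone

pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("rag-corpus")            # hypothetical index name
co = cohere.Client("COHERE_API_KEY")

def retrieve(query_text: str, query_embedding: list[float], final_k: int = 3) -> list[str]:
    # Step 1: over-fetch from the vector index (recall).
    hits = index.query(vector=query_embedding, top_k=20, include_metadata=True)
    docs = [m.metadata["text"] for m in hits.matches]   # assumes chunk text stored under "text"

    # Step 2: rerank the 20 candidates and keep 3 (precision).
    reranked = co.rerank(
        model="rerank-v4.0",              # assumption: substitute your vendor's current model id
        query=query_text,
        documents=docs,
        top_n=final_k,
    )
    return [docs[r.index] for r in reranked.results]
```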
Hybrid BM25 helps on specific corpora
If the corpus has long technical strings that BM25 matches well (product SKUs, error codes, function signatures, class names) and embeddings miss (because the tokenizer chops them up), hybrid helps. On the TCC corpus, a weight of 0.35 on BM25 and 0.65 on embeddings was the ratio that worked. Below 0.2, BM25 does nothing. Above 0.5, BM25 starts pulling in keyword-match false positives and the number drops.
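A minimal sketch of the weighted fusion, assuming both score lists cover the same candidate set and are min-max normalized (the fixture's exact normalization is not documented here); rank_bm25 supplies the lexical side:

```python
from rank_bm25 import BM25Okapi

def hybrid_scores(query_tokens: list[str], doc_tokens: list[list[str]],
                  dense_scores: list[float], bm25_weight: float = 0.35) -> list[float]:
    """Weighted sum of normalized BM25 and dense scores over the same candidates."""
    lexical = BM25Okapi(doc_tokens).get_scores(query_tokens)

    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    lex_n, dense_n = minmax(list(lexical)), minmax(list(dense_scores))
    return [bm25_weight * l + (1 - bm25_weight) * d for l, d in zip(lex_n, dense_n)]
```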
If the corpus is prose-heavy legal or medical text, skip BM25. On two internal corpora the TCC editorial track tested, hybrid dropped accuracy.
Embedding model choice in 2026
The embedding model matters less than chunking. On the TCC corpus, Voyage 4 Large was 0.8 points ahead of OpenAI text-embedding-3-large, and both were within a point of bge-large-en-v1.5 (open, self-hosted). Unless the deployment runs at a scale where the per-vector cost matters, use the one the existing vendor bills cleanly. Spend the engineering budget on the reranker and chunker.
The boilerplate filter that saved 2.3 points
PDFs in the TCC corpus had repeating headers (“Confidential draft — do not distribute”) and footers (“Page 12 of 84”) in every chunk. The embedding model treated those as signal. Four of the top-20 retrievals on an “error handling” query were boilerplate pages that happened to have the word “error” somewhere in the body.
The fix is a pre-indexing regex pass that strips the known boilerplate before chunking. Three patterns, a handful of lines of Python, 2.3 points of accuracy.
```python
import re

BOILERPLATE = [
    re.compile(r"Confidential draft.*?distribute", re.I),
    re.compile(r"Page \d+ of \d+"),
    re.compile(r"Copyright \d{4}.*?All rights reserved\.", re.I),
]

def clean(text: str) -> str:
    """Strip known header/footer boilerplate before chunking and indexing."""
    for pat in BOILERPLATE:
        text = pat.sub(" ", text)
    # Collapse the whitespace holes the substitutions leave behind.
    return re.sub(r"\s+", " ", text).strip()
```
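Wired into the indexing path, the order matters: clean first, then chunk, then embed. A sketch reusing chunk_fixed from the earlier snippet; embed_and_upsert is a hypothetical stand-in for the embedding and upsert step:

```python
def index_document(raw_text: str) -> None:
    cleaned = clean(raw_text)                          # strip boilerplate before chunking
    for chunk in chunk_fixed(cleaned, size=768, overlap=128):
        embed_and_upsert(chunk)                        # hypothetical: embed + write to the index
```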
What the threads are saying
r/LocalLLaMA threads through 2026 converge on the same advice: bump chunk size to 768, add a reranker, do not burn time benchmarking embedding models. Hacker News threads on RAG in production in Q1 2026 spent most of the word count on the reranker step. The Pinecone learn series still has the best long-form writeups on hybrid tuning if the underlying math is needed.
Cheatsheet and methodology
The short version of this piece, suitable for a Slack paste, is on the RAG defaults cheatsheet. The harness used to score the 12 queries is the same one on the methodology page, in the RAG (task 14) section.
Verdict
Bump chunk size to 768 with 128 overlap. Retrieve top-20, rerank with Cohere Rerank v4.0, take top-3. Run a BM25 hybrid at 0.35 only if the corpus has technical strings. Strip boilerplate before indexing. That is the default stack in April 2026, and it earns 17 points of accuracy over the naive path.
Frequently asked questions
What chunk size should I start with for RAG in 2026?
Start with 700 to 1000 tokens per chunk with 10 to 15 percent overlap. With 1M-token models (Claude Opus 4.7, Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4) widely available, larger chunks pay off when your retrieval is noisy, but start small and measure recall first.
Do I still need a reranker if my embedding recall is already high?
Skip reranking only if your offline embedding recall at 50 is above about 0.95 on a real eval set. Otherwise, a cross-encoder rerank step from top-50 to top-8 is still the highest-ROI addition, even with frontier LLMs.
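Recall at k is cheap to measure once there are labeled query-to-chunk pairs. A minimal sketch; the data shapes are assumptions, not the TCC harness format:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 50) -> float:
    """Mean per-query recall: share of gold chunk ids found in the top-k retrieved ids."""
    per_query = []
    for ranked, gold in zip(retrieved, relevant):
        if gold:
            per_query.append(len(gold & set(ranked[:k])) / len(gold))
    return sum(per_query) / len(per_query) if per_query else 0.0
```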
Should I pass more context now that 1M-token models exist?
Not automatically. Public long-context benchmarks saturate, but real performance on 60k-line codebases still degrades past about 128k tokens. Keep your retrieval tight and treat 1M as a safety margin, not a default.