CHEATSHEET · APR 23, 2026 · RAG · RETRIEVAL · v1.0

RAG defaults 2026 cheatsheet: copy, paste, ship

The RAG parameter defaults that moved my top-1 accuracy from 74% to 91% in 2026. Chunk size, overlap, rerank, hybrid BM25, and the 5 flags people forget.
Ryan Calloway, Staff contributor
  10 min read

The Slack-pasteable version of TCC’s RAG defaults for 2026. The full reasoning, ablations, and per-parameter charts are on the RAG defaults guide. This cheatsheet has the table, the copy-paste YAML config, working LlamaIndex code with correct 2026 import paths, the latency budget, the cost breakdown, common errors with fixes, and the watch-outs that trip up most production deployments.

The settings that move top-1 accuracy

| Parameter | Naive default | Ship in 2026 | Direction of move |
| --- | --- | --- | --- |
| Chunk size (tokens) | 512 | 768 | +~5 pp on technical corpora |
| Chunk overlap (tokens) | 0 | 128 | +~3 pp; recovers boundary context |
| Chunking strategy | semantic | fixed-size | +~2 pp; simpler, more predictable |
| Retriever top-k (before rerank) | 5 | 20 | +~6 pp when paired with rerank |
| Reranker | none | Cohere Rerank v4.0, final k=3 | +~4-5 pp; the single biggest lever |
| Embedding model | ada-002 era | Voyage 4 Large or text-embedding-3-large | +~1 pp; modern embeddings still help |
| Hybrid BM25 weight | 0 | 0.35 (technical corpus only) | +~2 pp on code/SKU; 0 on prose |
| Boilerplate strip | off | on | +~2 pp; essential for PDF/HTML extraction |

Deltas are from TCC editorial ablations on a mixed-technical corpus. Your fixture will produce different absolute numbers; the direction holds. If you only have time for one change, set top-k to 20 and add Cohere Rerank v4.0. Together those two rows account for roughly 10-11 pp in the table above, more than any other single change.

YAML config: paste into rag.yaml

# rag.yaml — production defaults 2026
chunker:
  strategy: fixed_size
  size_tokens: 768
  overlap_tokens: 128
  strip_boilerplate: true
  normalize_whitespace: true   # run re.sub(r"\s+", " ", text).strip() before chunking

embed:
  provider: voyage
  model: voyage-4-large
  # Alternative: openai / text-embedding-3-large at $0.13/M tokens
  # Budget option: bge-large-en-v1.5 (self-hosted, within 1 pp)

retrieve:
  top_k: 20
  hybrid_bm25_weight: 0.35    # set to 0 on prose-only corpora; try 0.5 on pure code/SKU

rerank:
  provider: cohere
  model: rerank-v4.0-pro
  final_k: 3

eval:
  ground_truth_path: evals/ragtruth.jsonl
  min_top_1: 0.88
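
A quick sanity check on the config can catch inconsistent values before any indexing starts. The sketch below mirrors a few of the YAML keys in a plain dict so it runs without a YAML parser; the `validate_rag_config` helper and the rules it enforces are illustrative assumptions, not part of any library.

```python
from typing import Any

# Mirrors part of rag.yaml as a plain dict; in practice this would come from a YAML loader.
RAG_DEFAULTS: dict[str, Any] = {
    "chunker": {"strategy": "fixed_size", "size_tokens": 768, "overlap_tokens": 128},
    "retrieve": {"top_k": 20, "hybrid_bm25_weight": 0.35},
    "rerank": {"final_k": 3},
    "eval": {"min_top_1": 0.88},
}

def validate_rag_config(cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems: list[str] = []
    chunker, retrieve = cfg["chunker"], cfg["retrieve"]
    if not 0 <= chunker["overlap_tokens"] < chunker["size_tokens"]:
        problems.append("overlap_tokens must be >= 0 and smaller than size_tokens")
    if not 0.0 <= retrieve["hybrid_bm25_weight"] <= 1.0:
        problems.append("hybrid_bm25_weight must be between 0 and 1")
    if cfg["rerank"]["final_k"] > retrieve["top_k"]:
        problems.append("rerank.final_k cannot exceed retrieve.top_k")
    if not 0.0 < cfg["eval"]["min_top_1"] <= 1.0:
        problems.append("eval.min_top_1 must be a fraction in (0, 1]")
    return problems

print(validate_rag_config(RAG_DEFAULTS) or "config OK")
```

Running a check like this at startup is cheaper than discovering a reversed overlap/size pair after a full reindex.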

Working LlamaIndex code with correct 2026 import paths

LlamaIndex 0.10.0+ restructured imports significantly. Several common import paths changed in ways that fail silently or throw confusing errors. This is the code that actually works:

# LlamaIndex v0.10.0+: import paths for the split-package layout
# pip install llama-index-core llama-index-embeddings-openai \
#     llama-index-postprocessor-cohere-rerank llama-index-retrievers-bm25
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.embeddings.openai import OpenAIEmbedding
# The integration package is llama-index-postprocessor-cohere-rerank
# (singular "postprocessor"); a mistyped module path raises ModuleNotFoundError
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.retrievers.bm25 import BM25Retriever
import os, logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ProductionRAG:
    def __init__(self, docs_path: str):
        embed_model = OpenAIEmbedding(
            api_key=os.environ["OPENAI_API_KEY"],
            model="text-embedding-3-large"  # specify model explicitly; omitting causes mismatch
        )
        splitter = SentenceSplitter(chunk_size=768, chunk_overlap=128)
        documents = SimpleDirectoryReader(docs_path).load_data()
        logger.info(f"Loaded {len(documents)} documents")

        # Pass embed_model= explicitly; otherwise the index falls back to the global
        # Settings.embed_model, which can silently differ from the model used at query time
        self.index = VectorStoreIndex.from_documents(
            documents,
            transformations=[splitter],  # v0.10+ takes the splitter via transformations=
            embed_model=embed_model
        )

    def query(self, q: str, bm25_weight: float = 0.35) -> str:
        vector_ret = self.index.as_retriever(similarity_top_k=20)
        # docstore= argument is required; omitting it throws AttributeError
        bm25_ret = BM25Retriever.from_defaults(
            docstore=self.index.docstore,
            similarity_top_k=20
        )
        fusion_ret = QueryFusionRetriever(
            retrievers=[vector_ret, bm25_ret],
            retriever_weights=[1.0 - bm25_weight, bm25_weight],
            mode="relative_score",  # retriever_weights is used by the relative-score fusion modes
            num_queries=1,          # 1 disables LLM query generation; fuse the original query only
            similarity_top_k=20
        )

        reranker = CohereRerank(
            api_key=os.environ["COHERE_API_KEY"],
            top_n=3
        )

        # index.as_query_engine() builds its own retriever, so wire the fusion
        # retriever into a RetrieverQueryEngine directly
        query_engine = RetrieverQueryEngine.from_args(
            fusion_ret,
            node_postprocessors=[reranker]
        )

        response = query_engine.query(q)
        return str(response)

# Usage
if __name__ == "__main__":
    rag = ProductionRAG("./docs")
    print(rag.query("What are the retry settings for agent loops?"))

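The three import/parameter traps that catch most people:

  1. The CohereRerank import path. It ships in its own integration package, so check the import against the installed package before debugging anything else.
  2. embed_model= on VectorStoreIndex.from_documents. Pass it explicitly so indexing and querying use the same embedding model.
  3. docstore= on BM25Retriever.from_defaults. Pass index.docstore rather than relying on defaults.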

Embedding model comparison

| Model | Cost per 1M tokens | Context window | Best for |
| --- | --- | --- | --- |
| text-embedding-3-small | $0.02 | 8K | Default; best cost/quality balance |
| text-embedding-3-large | $0.13 | 8K | Complex queries; higher dimensions |
| voyage-4-large | $0.06 | 32K | Long documents; 2.2x cheaper than OAI large |
| cohere embed-v3 | $0.10 | 512 | Multilingual corpora |
| bge-large-en-v1.5 | $0 self-hosted | 512 | Cost-sensitive; within 1 pp of commercial |

Architecture patterns comparison

| Pattern | When to use | Pros | Cons |
| --- | --- | --- | --- |
| Dense retrieval (vector DB) | General semantic search | Fast; scales to millions of docs | Misses exact keyword matches |
| Hybrid (vector + BM25) | Mixed semantic and keyword | Captures both; +~2 pp on technical | Higher latency; needs weight tuning |
| Reranking pipeline | High-stakes queries, top-1 must be right | 18-42% precision boost | Adds 50-200ms latency per batch |
| Agentic RAG | Multi-document synthesis, complex reasoning | Handles iterative retrieval | 1-3 extra LLM calls per query |

Latency budget

Latency compounds quickly. Know the budget for each component before adding them:

| Component | Typical latency |
| --- | --- |
| Vector search (Pinecone, 1M vectors) | ~10ms |
| BM25 retrieval (precomputed offline) | ~5ms |
| Embedding query (text-embedding-3-large) | ~100ms |
| Reranking (Cohere, 20 candidates) | ~50-200ms |
| LLM generation (varies by output length) | 500ms to 5s |
| Simple RAG total (P50 target) | Under 2 seconds |
| Agentic RAG total (P50 target) | Under 8 seconds |
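
To see how the components add up against the P50 target, here is a minimal budget check. The per-stage numbers are midpoints read off the table; the 1500 ms generation figure is an assumed value inside the 500 ms to 5 s range, and the stages are summed sequentially, which is a worst case since vector search and BM25 can run in parallel.

```python
# Midpoint estimates in milliseconds, taken from the latency table.
SIMPLE_RAG_MS = {
    "embed_query": 100,
    "vector_search": 10,
    "bm25": 5,
    "rerank": 125,           # midpoint of 50-200 ms
    "llm_generation": 1500,  # assumed typical value; varies with output length
}

def total_latency_ms(components: dict[str, float]) -> float:
    """Sum stage latencies as if every stage ran one after another."""
    return sum(components.values())

def within_budget(components: dict[str, float], budget_ms: float) -> bool:
    """True when the sequential total fits inside the budget."""
    return total_latency_ms(components) <= budget_ms

total = total_latency_ms(SIMPLE_RAG_MS)
print(f"simple RAG estimate: {total:.0f} ms; under 2 s: {within_budget(SIMPLE_RAG_MS, 2000)}")
```

Swapping in measured P50 values per stage turns this into a cheap check to run before adding another component to the pipeline.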

Production cost breakdown (100K queries/month)

| Component | Typical monthly cost |
| --- | --- |
| Embedding (queries, text-embedding-3-small at $0.02/M) | ~$2-5 |
| Vector DB (Pinecone, 1M vectors) | ~$70 |
| Reranking (Cohere, 100K queries) | ~$100 |
| LLM generation (GPT-5 or Claude, varies) | $200-2,000 |

LLM generation is almost always the dominant cost. Reducing context from 4,000 to 2,000 tokens per query by reranking to k=3 before generation roughly halves LLM cost at scale. The reranker cost pays for itself in LLM savings above ~30K queries per month.
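
The arithmetic behind the context-halving claim can be sketched as below. The $2.50 per million input tokens is a hypothetical price chosen only for illustration, the reranker figure is the ~$100 from the table, and only the input-token side of the LLM bill is modeled; output tokens are unaffected by reranking.

```python
def llm_input_cost(queries: int, context_tokens: int, usd_per_million_tokens: float) -> float:
    """Monthly input-token cost in USD for a fixed context size per query."""
    return queries * context_tokens * usd_per_million_tokens / 1_000_000

QUERIES = 100_000
PRICE = 2.50            # hypothetical USD per 1M input tokens
RERANK_MONTHLY = 100.0  # reranking cost from the table

before = llm_input_cost(QUERIES, 4_000, PRICE)   # 4,000 context tokens per query
after = llm_input_cost(QUERIES, 2_000, PRICE)    # 2,000 after reranking to k=3
net = (before - after) - RERANK_MONTHLY

print(f"input cost before: ${before:,.0f}, after: ${after:,.0f}, net saving: ${net:,.0f}")
```

Plugging in your actual model price and measured context sizes shows whether the reranker is a net saving on your traffic.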

Common errors and fixes

| Error | Cause | Fix |
| --- | --- | --- |
| Vector dimension mismatch: expected 1536, got 384 | Embedding model changed mid-pipeline; index and query use different models | Recompute all document embeddings with the new model. Always specify model= explicitly. |
| No results returned from retriever | top_k greater than total indexed nodes, or query and docs have no semantic overlap | Check len(index.docstore.docs). Lower top_k. Test embedding directly: embed_model.get_text_embedding("test"). |
| Context length exceeded | Too many or too large chunks for the LLM context window | Reduce chunk_size or top_k. Reranking to k=3 before generation usually fixes this. |
| Stale embeddings after document update | Documents edited in-place without reindexing | Delete old doc from vector store, add new doc. Always version documents and log rebuild times. |
| Poor retrieval quality on PDFs | Ragged whitespace and repeating headers/footers from PDF extraction | Run re.sub(r"\s+", " ", text).strip() and strip boilerplate patterns before chunking. |
| AttributeError on BM25Retriever | Missing docstore= argument (changed in v0.10+) | Use BM25Retriever.from_defaults(docstore=index.docstore). |

5 flags most people forget

  1. Normalize whitespace before chunking. PDF extraction leaves ragged spacing. Run re.sub(r"\s+", " ", text).strip() before chunking. The top reply on most r/MachineLearning “RAG quality is bad on PDFs” threads points here.
  2. Store the chunk byte offset at index time. When you show a retrieved chunk to a user, you will want to link to the source document at the right position. Reconstructing the offset from the chunk text later is a multi-day project.
  3. Precompute the BM25 index offline. Building the BM25 index in the request path blocks query execution on large corpora. Precompute it offline and cache it. Same rule applies to embedding computation for large document sets.
  4. Chunk overlap grows the index. At 128 tokens of overlap on a 1,400-chunk corpus, the index grows ~18%. Budget vector store storage accordingly before enabling overlap.
  5. RAG does not guarantee truthfulness. The LLM can still ignore retrieved chunks, invent citations, or misinterpret documents. Validate outputs on structured extraction tasks. Do not rely on RAG as a hallucination prevention layer by itself.
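
Flag 2 (storing the chunk byte offset at index time) can be sketched with a small fixed-size chunker. This is an illustration under stated assumptions: tokens are approximated by whitespace-delimited words rather than a model tokenizer, and the `Chunk` and `chunk_with_offsets` names are hypothetical, not from any library.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    byte_start: int  # offset into the UTF-8 encoded source document
    byte_end: int

def chunk_with_offsets(text: str, size: int = 768, overlap: int = 128) -> list[Chunk]:
    """Fixed-size chunking over whitespace-delimited tokens, keeping byte offsets."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    # (char_start, char_end) for every whitespace-delimited token
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chunks: list[Chunk] = []
    step = size - overlap
    for i in range(0, len(spans), step):
        window = spans[i:i + size]
        char_start, char_end = window[0][0], window[-1][1]
        chunks.append(Chunk(
            text=text[char_start:char_end],
            # Encoding the prefix converts character offsets to byte offsets
            byte_start=len(text[:char_start].encode("utf-8")),
            byte_end=len(text[:char_end].encode("utf-8")),
        ))
        if i + size >= len(spans):
            break  # this window already reached the end of the document
    return chunks

doc = "héllo " + " ".join(f"w{i}" for i in range(9))
for c in chunk_with_offsets(doc, size=4, overlap=1):
    print(c.byte_start, c.byte_end, repr(c.text))
```

Storing `byte_start` and `byte_end` as chunk metadata lets a UI slice the original file directly. Re-encoding the prefix for every chunk is quadratic in document length, which is acceptable for a sketch but worth replacing with a running counter on large files.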

Watch-outs

Boilerplate stripper (copy-paste ready)

import re

# Add patterns for your specific corpus
BOILERPLATE_PATTERNS = [
    re.compile(r"Confidential draft.*?distribute", re.I | re.S),
    re.compile(r"Page \d+ of \d+"),
    re.compile(r"Copyright \d{4}.*?All rights reserved\.", re.I | re.S),
    re.compile(r"---\s*(?:header|footer|navigation)\s*---", re.I),
]

def clean_doc(text: str) -> str:
    """Strip boilerplate and normalize whitespace before chunking."""
    for pat in BOILERPLATE_PATTERNS:
        text = pat.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

Frequently asked questions

Should I start with semantic or fixed-size chunking? Start with fixed-size at 768 tokens with 128 overlap. This is the baseline that beats semantic chunking on most technical corpora in 2026 benchmarks. Add semantic chunking only if you have ground-truth evidence that it helps on your specific data. The overhead of semantic chunking (3-5x more vector fragments, higher embedding cost, more retrieval noise) makes it a net negative until you have proof it helps.

When should I use top-k 5 instead of top-k 20? If you are not using a reranker, top-k 5 is the right call. Passing 20 chunks directly to the LLM without reranking bloats the context with noise. The pattern that works: retrieve 20, rerank to 3, pass 3 to the LLM. If you cannot add a reranker right now, use top-k 3-5 and accept lower recall until the reranker is in place.

What is the minimum viable RAG setup? A fixed-size chunker at 512-768 tokens with 10% overlap, any modern embedding model (text-embedding-3-small is $0.02/M tokens), a vector store (Chroma runs locally for free), and top-k 5 retrieval without a reranker. This gets you to ~74% accuracy on the TCC fixture. Add the reranker next; it is the single change that moves the number most.

Is Voyage AI worth the extra cost over text-embedding-3-small? For most corpora, no. On the TCC corpus, Voyage 4 Large was 0.8 points ahead of text-embedding-3-small at 3x the cost. That is not a good trade-off for most applications. Voyage earns its keep on long-document retrieval (legal, financial) where its 32K context window handles documents that other models must chunk more aggressively. Outside that use case, text-embedding-3-small is the right default.

How do I know if BM25 hybrid is helping? Run the eval before and after enabling BM25 at weight 0.35. If retrieval precision increases by more than 1 percentage point on your labeled eval set, keep it. If it is flat or negative, the corpus is prose-heavy and BM25 is adding noise. Never enable hybrid based on intuition alone; the BM25 weight needs to be tuned per corpus and validated against ground truth.

My RAG results look good on test queries but bad in production. Why? The most common causes: the test queries do not match the production query distribution, the production corpus has more boilerplate than the test corpus, or the embedding model is trained on a different domain. Pull 100 real production queries and build an eval set from them. Synthetic test queries almost always underrepresent the long tail of phrasing variations that real users actually send.

Reranker comparison in 2026

| Reranker | Type | Latency (20 candidates) | Cost | Best for |
| --- | --- | --- | --- | --- |
| Cohere Rerank v4.0 | API (managed) | ~25-50ms | ~$0.002/query | General use; lowest friction |
| Voyage AI Rerank-2 | API (managed) | ~30-60ms | Similar to Cohere | Long documents; if Voyage is the embedder |
| BGE-Reranker-v2-M3 | Open (self-hosted) | ~80-150ms on GPU | Infra cost only | Cost-sensitive; high volume |
| BAAI/bge-reranker-large | Open (self-hosted) | ~100-200ms on GPU | Infra cost only | Maximum accuracy; budget for GPU |

The full methodology, ground-truth set, and per-parameter ablations are in the RAG defaults guide. The eval harness wrapped around every RAG change is on the evals-without-judges post. The Pinecone learn series has the long-form explanation of hybrid scoring if you want the underlying math on reciprocal rank fusion.
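
For a feel of what reciprocal rank fusion does before reading the long-form math, here is a minimal sketch of the plain, unweighted form: each document scores the sum of 1 / (k + rank) across the ranked lists it appears in. The constant k = 60 is the commonly used default; the document IDs are made up. The 0.35 BM25 weight in the config implies a weighted blend, which this sketch does not model.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs; higher fused score means better combined rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical results from the two retrievers for the same query
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that both retrievers rank highly (doc_a, doc_c) rise above documents that only one retriever found, which is the behavior hybrid search relies on.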

Measuring whether the defaults are working

These default settings earn 91% top-1 accuracy on the TCC editorial fixture. Your corpus will produce different absolute numbers. Here is the minimum setup to measure whether the settings are working on your data:

  1. Label 100 real production queries. For each query, mark which document chunk should appear in the top-3 retrievals. This is your ground truth. Do not use synthetic queries for this; they will not match the distribution of your actual users.
  2. Run two retrieval passes per query: one with naive defaults, one with 2026 defaults. Record whether the correct chunk appears in the top-3 for each pass.
  3. Compute top-3 recall rate. Fraction of queries where the correct chunk is in the top-3. This is your primary metric. The 2026 defaults should improve it by 10-15 percentage points over the naive baseline on technical corpora.
  4. Set a regression gate. Use the same ground-truth eval set to gate future changes. Any parameter change that drops top-3 recall by more than 1 percentage point is a regression. Treat it the same as a test failure in CI.

The evals-without-judges post covers the full harness for wiring this measurement into CI. The short version: deterministic retrieval evaluation (does the correct chunk appear in top-k?) costs $0 and runs in seconds. Do not use an LLM judge to evaluate retrieval quality; the chunk is either in the top-k or it is not.
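
The measurement steps above can be sketched in a few lines. The data shapes here are assumptions for illustration: `ground_truth` maps each query to the ID of its labeled chunk, and `retrieved` maps each query to the ranked chunk IDs a pipeline returned. The function names are hypothetical.

```python
def top_k_recall(ground_truth: dict[str, str], retrieved: dict[str, list[str]], k: int = 3) -> float:
    """Fraction of queries whose labeled chunk ID appears in the first k retrieved IDs."""
    if not ground_truth:
        raise ValueError("ground_truth is empty")
    hits = sum(1 for q, gold in ground_truth.items() if gold in retrieved.get(q, [])[:k])
    return hits / len(ground_truth)

def passes_gate(baseline: float, candidate: float, max_drop_pp: float = 1.0) -> bool:
    """True unless recall dropped by more than max_drop_pp percentage points."""
    # Small tolerance so a drop of exactly max_drop_pp is not failed by float rounding
    return (baseline - candidate) * 100 <= max_drop_pp + 1e-9

# Tiny made-up example: 4 labeled queries, one pipeline run
ground_truth = {"q1": "c1", "q2": "c2", "q3": "c3", "q4": "c4"}
retrieved = {
    "q1": ["c1", "x", "y"],       # hit at rank 1
    "q2": ["x", "y", "c2"],       # hit at rank 3
    "q3": ["x", "y", "z", "c3"],  # correct chunk at rank 4, outside top-3
    "q4": [],                     # nothing retrieved
}

recall = top_k_recall(ground_truth, retrieved, k=3)
print(f"top-3 recall: {recall:.2f}")
print("gate passed:", passes_gate(baseline=0.90, candidate=0.895))
```

Because the check is a pure function of IDs, it can run in CI on every config change without any API calls.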

One-line takeaway

Fixed-size 768-token chunks with 128 overlap, retrieve top-20, rerank to top-3 with Cohere Rerank v4.0, BM25 hybrid at 0.35 only on technical corpora, boilerplate strip and whitespace normalization on by default, chunk byte offsets stored at index time. That is the production-grade RAG default in 2026.
