RAG Chunking Strategies: What Actually Works at Scale

Why Chunking Matters More Than Your Model

You can swap GPT-4o for Claude, or vice versa, and see only modest quality differences. Change your chunking strategy, though, and retrieval accuracy can swing by 30% or more. Chunking determines what the retrieval system can find, and if retrieval fails, no amount of generation quality can compensate.

The problem is that chunking failures are silent. The system still returns answers; they're just based on the wrong context. Without systematic evaluation, you won't know your chunking strategy is the bottleneck.

Fixed-Size Chunking: The Baseline

Split every document into chunks of N tokens with M tokens of overlap. It's simple, predictable, and works surprisingly well as a starting point. The typical sweet spot is 512 tokens with 50-token overlap.

Where it fails: documents with variable structure. A 512-token chunk that starts in the middle of one section and ends in the middle of another contains a mix of topics that confuses the embedding model. The resulting vector doesn't represent either topic well.
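
A minimal sketch of the sliding window described above, assuming the document has already been tokenized. The function name chunk_fixed and the synthetic token list are illustrative, not from any particular library; in practice the tokens would come from the tokenizer that matches your embedding model.

```python
def chunk_fixed(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Synthetic stand-in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1200)]
fixed_chunks = chunk_fixed(tokens)
chunk_lengths = [len(c) for c in fixed_chunks]
```

Note that the final chunk is usually shorter than the rest. Whether to keep it, pad it, or merge it into the previous chunk is a design choice the sketch leaves open.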

Semantic Chunking: Follow the Topic Boundaries

Instead of splitting at fixed intervals, split where the topic changes. The approach: embed consecutive sentences, compute cosine similarity between adjacent embeddings, and split where similarity drops below a threshold.

This produces chunks that are topically coherent, which improves embedding quality. The trade-off is variable chunk sizes — some chunks are 100 tokens, others are 2,000. This requires your retrieval pipeline to handle variable-length inputs gracefully.
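
The split rule can be sketched in a few lines. This is an illustration under loud assumptions: toy_embed is a bag-of-words stand-in for a real embedding model, the 0.3 threshold is arbitrary, and the sentences are made up. Only the control flow (compare adjacent sentences, split on a similarity drop) reflects the approach described above.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Bag-of-words counts. A stand-in for a real embedding model (assumption)."""
    return Counter(w.strip(".,?!").lower() for w in sentence.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.3):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # topic changed: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks


sentences = [
    "The contract term is five years.",
    "The contract term renews automatically.",
    "Invoices are due within thirty days.",
    "Late invoices accrue interest after thirty days.",
]
topic_chunks = semantic_chunks(sentences)
```

With a real embedding model the threshold needs tuning per corpus, and a maximum chunk length is worth adding so a long single-topic passage does not produce one oversized chunk.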

Hierarchical Chunking: Multiple Granularities

Hierarchical chunking is the strategy that consistently outperforms single-level chunking. Index the same document at multiple levels: document-level summaries, section-level chunks, and paragraph-level chunks. At retrieval time, match at the appropriate granularity based on the query.

Broad questions ("What is this document about?") → Match against document summaries
Section questions ("What does the compliance section say?") → Match against section chunks
Specific questions ("What is the threshold for Category A?") → Match against paragraph chunks

It is tempting to dismiss hierarchical chunking as three times the storage cost for a marginal improvement. That undersells it: it's the difference between a system that handles diverse query types and one that only works for a narrow query pattern.
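
One possible shape for a three-level index, as a sketch. The Chunk record, the build_hierarchy helper, the ID scheme, and the sample document are all hypothetical. The summary is assumed to exist already (for example, written by a summarization step), and how a query gets routed to a level is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    level: str                 # "document", "section", or "paragraph"
    parent_id: Optional[str]   # link to the enclosing chunk, None at the top
    text: str


def build_hierarchy(doc_id, summary, sections):
    """Index one document at three granularities.

    sections: mapping of section title -> list of paragraph strings.
    """
    chunks = [Chunk(doc_id, "document", None, summary)]
    for s_idx, (title, paragraphs) in enumerate(sections.items()):
        section_id = f"{doc_id}/s{s_idx}"
        section_text = title + "\n" + "\n".join(paragraphs)
        chunks.append(Chunk(section_id, "section", doc_id, section_text))
        for p_idx, paragraph in enumerate(paragraphs):
            chunks.append(
                Chunk(f"{section_id}/p{p_idx}", "paragraph", section_id, paragraph)
            )
    return chunks


# Illustrative input only.
index = build_hierarchy(
    "policy",
    summary="Internal policy covering compliance and reporting.",
    sections={
        "Compliance": ["Category A items need sign-off.", "Audits occur quarterly."],
        "Reporting": ["Reports are filed monthly."],
    },
)
paragraph_level = [c for c in index if c.level == "paragraph"]
```

Keeping parent_id on every record means a paragraph-level hit can be expanded to its section or document when the generator needs more context.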

Document-Aware Chunking: Respect the Structure

Legal contracts, financial reports, and technical manuals have explicit structure: sections, subsections, tables, and exhibits. Document-aware chunking parses this structure and uses it as the primary split boundary.

For PDFs, this means using layout analysis to identify headers, paragraphs, tables, and figures. Each structural element becomes a chunk, enriched with metadata about its position in the document hierarchy. A table stays intact as a single chunk rather than being split across arbitrary boundaries.
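
To make the chunk-per-element idea concrete, here is a sketch that starts after layout analysis has run. The element format (dicts with type, text, and a heading level) is a hypothetical stand-in for whatever your layout parser emits, and structure_chunks is an illustrative name.

```python
def structure_chunks(elements):
    """Turn parsed layout elements into chunks tagged with their heading path.

    elements: list of dicts with "type" ("header", "paragraph", "table",
    "figure") and "text"; headers also carry "level" (1 = top level).
    This input shape is an assumption, not a real parser's output.
    """
    path = []    # current heading trail, e.g. ["Fees", "Late Fees"]
    chunks = []
    for el in elements:
        if el["type"] == "header":
            # Drop headings at the same or deeper level, then descend.
            path = path[: el["level"] - 1] + [el["text"]]
        else:
            # Each non-header element becomes exactly one chunk, so a
            # table is never split across chunk boundaries.
            chunks.append({"type": el["type"], "text": el["text"], "path": list(path)})
    return chunks


# Illustrative input only.
elements = [
    {"type": "header", "level": 1, "text": "Fees"},
    {"type": "paragraph", "text": "Fees are billed monthly."},
    {"type": "header", "level": 2, "text": "Late Fees"},
    {"type": "table", "text": "Days late | Fee\n30 | 1%\n60 | 2%"},
    {"type": "header", "level": 1, "text": "Termination"},
    {"type": "paragraph", "text": "Either party may terminate with notice."},
]
doc_chunks = structure_chunks(elements)
```

The path metadata can be stored alongside the vector for filtering, or prepended to the chunk text before embedding so the vector carries the section context.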

The Evaluation Framework

Choosing a chunking strategy without evaluation is guessing. The evaluation framework I use:

Build a golden dataset of 200+ question-answer pairs from your actual document corpus
For each question, annotate which specific passages contain the answer
Run each chunking strategy and measure retrieval precision@k: what percentage of the top k retrieved chunks contain an annotated passage?
Measure end-to-end answer quality: does the generated answer correctly reflect the source material?

Run this evaluation every time you change your chunking strategy, add new document types, or update your embedding model. What works for legal documents may fail for technical manuals.
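
The retrieval metric from the steps above can be sketched as follows. The golden-set shape, the chunk IDs, and the canned retriever are hypothetical; a real run would call your retriever and map each annotated passage to the chunk IDs that contain it.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are annotated as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / k


def mean_precision_at_k(golden, retrieve, k):
    """Average precision@k over every question in the golden dataset."""
    scores = [
        precision_at_k(retrieve(item["question"]), item["relevant"], k)
        for item in golden
    ]
    return sum(scores) / len(scores)


# Tiny illustrative golden set; a real one would have 200+ entries.
golden = [
    {"question": "q1", "relevant": {"c1", "c4"}},
    {"question": "q2", "relevant": {"c7"}},
]

# Canned results standing in for a real retriever (assumption).
canned = {
    "q1": ["c1", "c2", "c4", "c9"],
    "q2": ["c3", "c7", "c8", "c5"],
}
score = mean_precision_at_k(golden, canned.__getitem__, k=4)
```

Running the same golden set against each chunking strategy gives directly comparable scores, provided the relevance annotations are re-mapped to each strategy's chunk IDs.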
