The Demo-to-Production Gap

There's a pattern I see in every enterprise RAG project. The proof-of-concept works beautifully: 50 documents, clean text, straightforward questions. The team demos it to leadership, gets funding, and then reality hits. The production corpus has 500,000 documents. They're PDFs with tables, scanned images, multi-column layouts, and regulatory jargon. The queries aren't clean either. Users ask ambiguous questions, reference internal acronyms, and expect the system to understand context from previous interactions.

The naive RAG pipeline that worked in the demo now returns irrelevant chunks, hallucinates citations, and occasionally surfaces confidential documents to unauthorized users. This is the demo-to-production gap, and closing it requires rethinking every layer of the pipeline.

Layer 1: Ingestion Is the Hardest Problem

Most teams underinvest in document ingestion. They use a basic PDF parser, chunk by token count, and embed everything into a vector database. This approach fails because it destroys the structure that makes documents meaningful.

Structure-Aware Parsing

In our pharma deployment, we used a multi-modal ingestion pipeline that treats different document types differently. Regulatory submissions get parsed with layout-aware models that preserve table structures and section hierarchies. Clinical trial reports get entity extraction for drug names, dosages, and adverse events. Internal SOPs get tagged with department and version metadata.

The key principle: parsing is not a generic problem. Every document type in your corpus deserves its own ingestion strategy.

Semantic Chunking

Fixed-size chunking (e.g., 512 tokens with 50-token overlap) is the default in every tutorial and the wrong choice for production. We use semantic chunking that respects document boundaries: sections, paragraphs, table rows. A chunk should contain a complete thought, not an arbitrary slice of text that starts mid-sentence.
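As a rough illustration of the idea, the sketch below groups whole paragraphs into chunks and never cuts inside one. It is a minimal example under stated assumptions, not the pipeline described here: the function name `semantic_chunks`, the `max_words` budget, and the use of word counts in place of real token counts are all hypothetical.

```python
import re


def semantic_chunks(text: str, max_words: int = 200) -> list[str]:
    """Group whole paragraphs into chunks without splitting inside a paragraph.

    Assumption: word count stands in for token count. A real pipeline would
    measure length with the embedding model's own tokenizer.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A paragraph longer than the budget is kept whole rather than cut.
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The same loop could run over sections or table rows instead of paragraphs; the point is that the split positions come from the document's structure, not from a token counter.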

For regulatory documents, we additionally create hierarchical chunks: a summary chunk for each section and detailed chunks for sub-sections. This allows the retrieval system to match at different levels of granularity depending on the query.
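One way to picture the hierarchy is a flat list of chunks where each detail chunk points back to its section summary. The sketch below is illustrative only: the `Chunk` fields, the `build_hierarchy` helper, and the ID scheme are assumptions, and the summary text is taken as an input rather than generated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    level: str                      # "summary" or "detail" (hypothetical labels)
    parent_id: Optional[str] = None  # detail chunks point at their section summary


def build_hierarchy(section_id: str, summary: str, subsections: list[str]) -> list[Chunk]:
    """Return one summary chunk for the section plus one detail chunk per sub-section."""
    chunks = [Chunk(chunk_id=section_id, text=summary, level="summary")]
    for i, body in enumerate(subsections, start=1):
        chunks.append(
            Chunk(
                chunk_id=f"{section_id}.{i}",
                text=body,
                level="detail",
                parent_id=section_id,
            )
        )
    return chunks
```

With the parent link stored alongside each detail chunk, a retriever that matches a detail chunk can also pull in the section summary for context, and a broad query can match the summary directly.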

Layer 2: Hybrid Search Is Non-Negotiable

Pure vector search fails for enterprise use cases. When a user asks for "the Q3 2025 FDA 483 observation for facility 7," they need exact matching, not semantic similarity. Conversely, when they ask "what are the common quality issues across our manufacturing sites," they need semantic understanding.

Our production architecture uses a three-signal retrieval system:

These three signals are combined using reciprocal rank fusion, with weights tuned on a domain-specific evaluation set. Getting the weights right matters more than choosing the "best" embedding model.
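For readers unfamiliar with reciprocal rank fusion, here is a minimal weighted variant. The signal names are placeholders, the weights are made up for illustration, and `k = 60` is a commonly used default rather than a value from this deployment.

```python
from collections import defaultdict


def weighted_rrf(
    rankings: dict[str, list[str]],
    weights: dict[str, float],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Each signal contributes weight / (k + rank) for every document it returns,
    where rank starts at 1. Documents missing from a signal get nothing from it.
    """
    scores: dict[str, float] = defaultdict(float)
    for signal, ranked_ids in rankings.items():
        weight = weights.get(signal, 1.0)  # unweighted signals default to 1.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder signals and weights, for illustration only.
fused = weighted_rrf(
    rankings={
        "signal_a": ["doc1", "doc2", "doc3"],
        "signal_b": ["doc2", "doc4"],
        "signal_c": ["doc3", "doc2"],
    },
    weights={"signal_a": 1.0, "signal_b": 1.5, "signal_c": 0.5},
)
```

Because the fusion uses only ranks, the signals do not need comparable score scales, which is what makes the weights the main thing left to tune.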

Layer 3: Re-Ranking Changes Everything

Initial retrieval is fast but coarse. We retrieve 50 candidates and then re-rank them using a cross-encoder model that scores the relevance of each chunk against the full query. This step consistently improves retrieval accuracy by 15-20 percentage points in our benchmarks.

The cross-encoder is expensive per-query, which is why it runs on the top-50 candidates rather than the full corpus. The two-stage retrieve-then-rerank architecture gives you both speed and precision.
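The two-stage shape can be sketched independently of any particular model. In the example below both scorers are passed in as plain functions; `cheap_score` and `expensive_score` are hypothetical stand-ins for the first-stage retriever and the cross-encoder, and the parameter names are assumptions.

```python
from typing import Callable

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def retrieve_then_rerank(
    query: str,
    corpus: list[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    n_candidates: int = 50,
    top_k: int = 5,
) -> list[str]:
    """Stage 1 scores everything cheaply; stage 2 re-scores only the shortlist."""
    # Stage 1: fast, coarse ranking over the whole corpus.
    # (A real system would query an index here instead of sorting the corpus.)
    candidates = sorted(corpus, key=lambda c: cheap_score(query, c), reverse=True)
    candidates = candidates[:n_candidates]

    # Stage 2: the expensive scorer runs on at most n_candidates chunks.
    reranked = sorted(candidates, key=lambda c: expensive_score(query, c), reverse=True)
    return reranked[:top_k]
```

The cost of the second stage is bounded by `n_candidates` no matter how large the corpus grows, which is the property the two-stage design relies on.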

Retrieval accuracy is the ceiling for your entire RAG system. If the right information isn't in the context window, no amount of prompt engineering will save you.

Layer 4: Guardrails for Regulated Industries

In pharma, a hallucinated citation isn't just embarrassing; it's a compliance violation. Our guardrails layer addresses three categories of risk:

Citation Verification

Every claim in the generated response must be traceable to a specific chunk in the retrieved context. We implement this as a post-generation validation step that checks each factual statement against the source chunks using an NLI (natural language inference) model. Statements that can't be grounded in the sources get flagged or removed.
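A minimal sketch of that validation step follows. The NLI model is abstracted as a function `entails(premise, hypothesis)` returning a score between 0 and 1; that interface, the `threshold` value, and the assumption that the response has already been split into individual claims are all hypothetical.

```python
from typing import Callable


def verify_claims(
    claims: list[str],
    source_chunks: list[str],
    entails: Callable[[str, str], float],
    threshold: float = 0.9,
) -> tuple[list[str], list[str]]:
    """Split claims into (grounded, flagged) by their best entailment score.

    A claim counts as grounded if at least one source chunk entails it with a
    score at or above the threshold. With no source chunks, every claim is flagged.
    """
    grounded: list[str] = []
    flagged: list[str] = []
    for claim in claims:
        best = max((entails(chunk, claim) for chunk in source_chunks), default=0.0)
        if best >= threshold:
            grounded.append(claim)
        else:
            flagged.append(claim)
    return grounded, flagged


def toy_entails(premise: str, hypothesis: str) -> float:
    """Toy stand-in for an NLI model: exact substring match, for demonstration only."""
    return 1.0 if hypothesis.lower() in premise.lower() else 0.0
```

Whether flagged claims are removed, rewritten, or shown with a warning is a policy decision that sits outside this function.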

Access Control Enforcement

Before any chunk enters the context window, it passes through an access control check. The user's role, department, and clearance level are matched against the document's classification. This happens at retrieval time, not generation time, so the LLM never sees information the user isn't authorized to access.
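The check can be pictured as a filter that sits between retrieval and prompt assembly. The data model below is an assumption made for illustration: the field names, the integer clearance scale, and the rule that all three conditions must hold are hypothetical, and a real deployment would take them from its own classification scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    roles: frozenset
    department: str
    clearance: int


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    allowed_roles: frozenset
    department: Optional[str]  # None means not restricted to one department
    min_clearance: int


def is_authorized(user: User, chunk: RetrievedChunk) -> bool:
    """Assumed policy: role, department, and clearance must all match."""
    role_ok = bool(user.roles & chunk.allowed_roles)
    dept_ok = chunk.department is None or chunk.department == user.department
    clearance_ok = user.clearance >= chunk.min_clearance
    return role_ok and dept_ok and clearance_ok


def filter_for_user(user: User, chunks: list[RetrievedChunk]) -> list[RetrievedChunk]:
    """Drop unauthorized chunks before anything is placed in the context window."""
    return [chunk for chunk in chunks if is_authorized(user, chunk)]
```

Only the output of `filter_for_user` would be handed to the prompt builder, so an unauthorized chunk never reaches the model in the first place.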

Output Filtering

Even with perfect retrieval and grounding, the LLM can occasionally generate outputs that violate compliance policies. We use NeMo Guardrails to enforce topic boundaries and prevent the system from offering medical advice, making regulatory predictions, or generating content outside its authorized scope.

Layer 5: Evaluation Is Continuous

Production RAG systems need continuous evaluation, not just pre-launch testing. We maintain a living evaluation dataset of 500+ question-answer pairs, annotated by domain experts. Every week, we run the evaluation suite and track four metrics: retrieval precision@10, answer faithfulness (grounding score), answer relevance, and latency at p95.
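Two of these metrics are simple enough to show directly. The sketch below is a plain-Python illustration, not the evaluation suite itself; the function names are hypothetical, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def latency_p95(latencies: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Averaging `precision_at_k` over every question in the evaluation set gives the weekly precision@10 figure; faithfulness and answer relevance need a model-based or human judgment and are not shown.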

When retrieval precision drops, it usually means the corpus has grown in a direction the embedding model wasn't tuned for. When faithfulness drops, it usually means users are asking questions the retrieved context doesn't cover, so the LLM fills the gap on its own. These signals drive continuous improvement of the pipeline.

The Numbers That Matter

After 18 months in production, our pharma RAG system delivers: 95.3% retrieval accuracy on the evaluation set, 97.1% citation faithfulness (the share of generated claims grounded in retrieved sources), sub-3-second response latency at p95 with 500K+ documents, and zero unauthorized information disclosures. These numbers didn't come from choosing the right model. They came from engineering every layer of the pipeline with production constraints in mind.

Need a RAG System That Works at Scale?

I've architected production RAG pipelines across pharma, financial services, and logistics.
