
Production RAG Systems

Architecture patterns for building retrieval-augmented generation systems that perform reliably at scale — hybrid retrieval, re-ranking, caching, evaluation, and deployment strategies.

Why Most RAG Systems Fail in Production

The gap between a RAG demo and a production RAG system is enormous:

RAG Demo                 | Production RAG
-------------------------|------------------------------------
Single document source   | Dozens of sources, multiple formats
Simple similarity search | Hybrid retrieval + re-ranking
No quality measurement   | Automated evaluation pipelines
No cost tracking         | Per-query cost monitoring
Static knowledge         | Continuous ingestion + indexing
No access control        | Row-level document security
Manual updates           | CI/CD for knowledge base changes

Architecture Diagram

RAG Application Layer

User Query
    │
    ▼
┌──────────────┐
│ Query Router │ ← Route by intent (search / QA / chat)
└──────┬───────┘
       │
┌──────▼──────────────────────────────────────────┐
│               Retrieval Pipeline                 │
│                                                  │
│  ┌──────────┐  ┌──────────────┐  ┌────────────┐ │
│  │ Query    │  │ Hybrid       │  │ Re-ranker  │ │
│  │ Transform│→ │ Search       │→ │ (Cross-    │ │
│  │ (HyDE,   │  │ (BM25 +      │  │ Encoder)   │ │
│  │ expand)  │  │ Semantic)    │  │            │ │
│  └──────────┘  └──────────────┘  └─────┬──────┘ │
│                                        │        │
│  ┌─────────────────────────────────────▼───────┐│
│  │               Context Builder               ││
│  │  • Chunk deduplication                      ││
│  │  • Relevance filtering (score > threshold)  ││
│  │  • Context window optimization              ││
│  │  • Citation source tracking                 ││
│  └─────────────────────┬───────────────────────┘│
└────────────────────────┼────────────────────────┘
                         │
┌────────────────────────▼────────────────────────┐
│                Generation Pipeline               │
│                                                  │
│  ┌──────────┐  ┌──────────┐  ┌─────────────────┐│
│  │ Prompt   │  │ LLM      │  │ Output          ││
│  │ Builder  │→ │ Inference│→ │ Validation      ││
│  │ (with    │  │ (cached  │  │ (Guardrails,    ││
│  │ context) │  │ routing) │  │ citations)      ││
│  └──────────┘  └──────────┘  └─────────────────┘│
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────┐
│                 Observability Layer                 │
│ Langfuse traces · Retrieval quality · Cost tracking │
└─────────────────────────────────────────────────────┘
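
The Query Router step at the top of the diagram can be sketched as a minimal intent classifier. The `Intent` enum and keyword rules below are illustrative placeholders, not part of any framework; a production router would typically use a small classifier or an LLM call:

```python
from enum import Enum

class Intent(Enum):
    SEARCH = "search"
    QA = "qa"
    CHAT = "chat"

def route_query(query: str) -> Intent:
    """Classify a query into a coarse intent with keyword heuristics.

    These rules are only placeholders for the routing step; swap in a
    classifier or LLM call for real traffic.
    """
    q = query.lower().strip()
    if any(w in q for w in ("find", "search", "list", "show me")):
        return Intent.SEARCH
    if q.endswith("?") or q.startswith(("what", "how", "why", "when", "who")):
        return Intent.QA
    return Intent.CHAT
```

Routing lets each intent use a different retrieval depth and prompt template, rather than forcing every query through the full pipeline.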

Retrieval Patterns

1. Hybrid Search (BM25 + Semantic)

Combine keyword matching with embedding-based similarity for more robust retrieval:

from haystack.components.retrievers import (
    InMemoryBM25Retriever,
    InMemoryEmbeddingRetriever,
)
from haystack.components.joiners import DocumentJoiner

# BM25 for exact keyword / entity matches
bm25_retriever = InMemoryBM25Retriever(document_store=doc_store)

# Semantic for conceptual similarity
embedding_retriever = InMemoryEmbeddingRetriever(document_store=doc_store)

# Fuse results with reciprocal rank fusion
joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")

When to use: Always in production. Pure semantic search misses exact keyword matches; pure BM25 misses semantic intent.
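
The `DocumentJoiner` handles the fusion internally; as a sketch of what reciprocal rank fusion actually computes (using the standard k = 60 constant), assuming each retriever returns an ordered list of document IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank(d)).

    Documents that rank well in several lists accumulate the highest
    fused score; k = 60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
# doc_b (ranks 2 and 1) edges out doc_a (ranks 1 and 3)
```

Because RRF uses only ranks, not raw scores, it needs no score normalization between BM25 and cosine similarity, which is why it is the default fusion choice.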

2. Query Transformation

Improve retrieval quality by transforming the user query before search:

Technique                  | How It Works                               | When to Use
---------------------------|--------------------------------------------|-------------------------------
HyDE                       | Generate a hypothetical answer, embed that | Abstract or conceptual queries
Query expansion            | Add synonyms and related terms             | Short or ambiguous queries
Sub-question decomposition | Break complex query into sub-queries       | Multi-faceted questions
Step-back prompting        | Generate a more general query first        | Narrow technical questions
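
As one example from the table, HyDE can be sketched as below. The function name `hyde_retrieve` and the injected callables are illustrative, not a real API; the toy stand-ins only demonstrate the control flow:

```python
def hyde_retrieve(query, generate, embed, vector_search):
    """HyDE: search with the embedding of a hypothetical answer.

    `generate`, `embed`, and `vector_search` are injected callables so
    any LLM, embedding model, and vector store can be plugged in.
    """
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return vector_search(embed(hypothetical))

# Deterministic toy stand-ins, just to show the control flow:
fake_generate = lambda prompt: "RAG augments generation with retrieved context."
fake_embed = lambda text: [float(len(text.split()))]   # not a real embedding
fake_search = lambda vec: [f"doc-{int(vec[0])}"]

results = hyde_retrieve("What is RAG?", fake_generate, fake_embed, fake_search)
```

The key idea: a hypothetical answer lives in the same embedding neighborhood as real answer passages, so it retrieves better than an abstract question would.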

3. Re-ranking

Re-rank retrieval results using a cross-encoder model for higher precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_documents(query: str, docs: list, top_k: int = 5):
    """Re-rank retrieved documents using cross-encoder."""
    pairs = [(query, doc.content) for doc in docs]
    scores = reranker.predict(pairs)

    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

Impact: Re-ranking typically improves retrieval precision by 10-25% over embedding-only retrieval.

Chunking Strategies

The single biggest factor in RAG quality is how you chunk documents:

Strategy       | Chunk Size                              | Best For
---------------|-----------------------------------------|---------------------------------
Fixed-size     | 256-512 tokens                          | General purpose, quick setup
Sentence-based | Natural sentence boundaries             | Conversational content
Semantic       | Embedding-based boundary detection      | Technical documentation
Recursive      | Hierarchical splitting with overlap     | Long-form content, code docs
Document-aware | Respect headings, sections, paragraphs  | Structured documents (PDF, DOCX)

Rule of thumb: Start with recursive chunking (512 tokens, 50-token overlap), then optimize based on retrieval quality metrics.
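
A simplified sketch of the 512-token / 50-token-overlap rule, using whitespace tokens as a stand-in for model tokens (`chunk_with_overlap` is an illustrative name; real recursive splitters additionally prefer paragraph and sentence boundaries, and count with the embedding model's tokenizer):

```python
def chunk_with_overlap(text: str, chunk_size: int = 512, overlap: int = 50):
    """Split text into overlapping windows of whitespace tokens.

    The overlap ensures that facts straddling a chunk boundary still
    appear intact in at least one chunk.
    """
    tokens = text.split()
    step = chunk_size - overlap
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, max(len(tokens) - overlap, 1), step)
    ]
```

From there, tune `chunk_size` and `overlap` against the retrieval quality metrics in the evaluation pipeline below, not by intuition.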

Caching Architecture

Production RAG benefits from multi-layer caching:

User Query
     │
     ▼
┌───────────────────┐
│ Semantic Cache    │ ← Cache similar queries (embedding similarity > 0.95)
│ (Redis + vectors) │
└────────┬──────────┘
         │ miss
         ▼
┌───────────────────┐
│ Retrieval Cache   │ ← Cache retrieval results for identical queries
│ (Redis/Memcached) │
└────────┬──────────┘
         │ miss
         ▼
┌────────────────────┐
│ LLM Response Cache │ ← Cache deterministic (temp=0) responses
│ (Redis)            │
└────────────────────┘

Impact: Semantic caching can reduce LLM costs by 30-60% for applications with repetitive query patterns.
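
The semantic cache layer reduces to a nearest-neighbor lookup with a similarity threshold. This is a toy in-memory sketch; `SemanticCache` is an illustrative name, and a production version would store vectors in Redis behind a vector index and use a real embedding model for `embed`:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Toy in-memory semantic cache keyed on embedding similarity."""

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached_response)

    def get(self, query: str):
        q_vec = self.embed(query)
        for vec, response in self.entries:
            if cosine(q_vec, vec) >= self.threshold:
                return response  # hit: skip retrieval and the LLM entirely
        return None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The 0.95 threshold is the trade-off knob: lower it and hit rates rise but paraphrased-yet-different queries start getting stale answers.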

Evaluation Pipeline

Measure RAG quality automatically to detect degradation:

Metric                | What It Measures                                | Target
----------------------|-------------------------------------------------|------------------
Retrieval Recall@k    | Relevant docs in top-k results                  | > 0.8
Retrieval Precision@k | Fraction of top-k that are relevant             | > 0.6
Answer Relevance      | Does the answer address the query?              | > 0.8 (LLM-judge)
Faithfulness          | Is the answer supported by retrieved context?   | > 0.9 (LLM-judge)
Answer Completeness   | Does the answer cover all aspects of the query? | > 0.7
Latency (p95)         | End-to-end response time                        | < 3 s
Cost per query        | Token + compute cost                            | Within budget

The LLM-judged metrics can be scored per trace, for example with Langfuse:

from langfuse import Langfuse

langfuse = Langfuse()

def evaluate_rag_response(trace_id, query, context_docs, response):
    """Score RAG response quality.

    llm_judge is an application-defined helper that prompts a judge
    model with the given criteria and returns a score in [0, 1].
    """

    # Faithfulness: Is the response grounded in retrieved context?
    faithfulness = llm_judge(
        criteria="Is the response fully supported by the provided context?",
        response=response,
        context=context_docs,
    )
    langfuse.score(trace_id=trace_id, name="faithfulness", value=faithfulness)

    # Relevance: Does the response actually answer the query?
    relevance = llm_judge(
        criteria="Does the response directly answer the user's question?",
        response=response,
        query=query,
    )
    langfuse.score(trace_id=trace_id, name="relevance", value=relevance)
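
The retrieval-side metrics in the table reduce to set arithmetic once each evaluation query has a labeled set of relevant documents; a minimal sketch:

```python
def recall_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k results."""
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    rel = set(relevant)
    return sum(1 for doc in retrieved[:k] if doc in rel) / k
```

Run these over a fixed golden set of queries on every index or chunking change to catch retrieval regressions before they reach generation quality.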

Example Stack Configurations

Startup RAG Stack

# Minimal production RAG
retrieval:
  framework: LlamaIndex
  vector_store: Pinecone (managed)
  embedding: text-embedding-3-small
  search: semantic only

generation:
  model: gpt-4o-mini

observability:
  tool: Langfuse Cloud

security:
  input: basic rate limiting
  output: none (internal tool)

Enterprise RAG Stack

# Full production RAG
retrieval:
  framework: Haystack
  vector_store: Elasticsearch (self-hosted)
  embedding: text-embedding-3-large + fine-tuned
  search: hybrid (BM25 + semantic)
  reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
  caching: Redis semantic cache

generation:
  model: gpt-4o (primary) + Claude (fallback)
  prompt_management: Langfuse versioned prompts

observability:
  tracing: Langfuse (self-hosted)
  evaluation: Arize Phoenix (dev) + automated evals
  dashboards: Grafana

security:
  input: Lakera Guard (prompt injection)
  output: Guardrails AI (PII, toxicity)
  access: row-level document security
  audit: compliance logging

Deployment Checklist

  • Implement hybrid retrieval (BM25 + semantic) with reciprocal rank fusion
  • Add cross-encoder re-ranking for top-k results
  • Set up document processing pipeline with appropriate chunking strategy
  • Deploy observability (Langfuse) with retrieval quality metrics
  • Implement semantic caching for cost optimization
  • Create automated evaluation pipeline (faithfulness, relevance)
  • Add citation tracking for verifiable responses
  • Set up CI/CD for knowledge base updates
  • Implement access control for sensitive documents
  • Configure alerts for quality degradation and cost anomalies

Related Resources

Category        | Tool                   | Purpose
----------------|------------------------|-----------------------------------------------
Vector Database | Pinecone               | Managed vector database for production RAG
Vector Database | Weaviate               | Open-source vector search engine
Vector Database | Qdrant                 | High-performance vector similarity search
Observability   | Langfuse               | RAG pipeline tracing and evaluation
RAG Platforms   | RAG Platforms          | Haystack, LlamaIndex, and RAG framework tools
Comparison      | Pinecone vs Weaviate   | Vector database comparison for RAG systems
Comparison      | Pinecone vs Qdrant     | Managed vs open-source vector search
Comparison      | Haystack vs LlamaIndex | RAG framework comparison