
Production RAG Architecture Blueprint: Retrieval-Augmented Generation at Scale

10 min read · Dinesh K, DevOps & AIOps Consultant

Pattern: Retrieval-Augmented Generation
Complexity: Enterprise
Infra Target: Kubernetes / GPU
Latency Profile: P99 ≤ 3s E2E
Production Characteristics: Production Ready · Observability First · Kubernetes Native · Security Hardened · Latency Critical · Enterprise Pattern

RAG systems fail in production for predictable reasons: retrieval quality degrades silently, embedding drift goes undetected, LLM latency spikes under load, and observability is bolted on after incidents. This blueprint addresses all four with a complete operational architecture.

System Architecture

The production RAG stack decomposes into six vertical layers. Each layer is independently scalable and has a defined observability contract — latency, error rate, and quality signal.

Client: API Gateway (Auth / JWT), Rate Limiter, SDK / Webhook
Orchestration: RAG Orchestrator, Query Router, Context Builder, Guard Rails
Retrieval: Embedding Service, ANN Index (HNSW), Re-ranker, Metadata Filter
Generation: LLM Router, Prompt Template, Stream Handler, Response Cache
Storage: pgvector, Document Store (S3), Redis Cache, Postgres (metadata)
Observability: OTel Collector, Prometheus, Loki, Tempo

Request Flow

Every production RAG query passes through six deterministic phases. Each phase adds measurable latency and is an explicit instrumentation point.

User Query → Query Embed (vectorize) → ANN Search (retrieve) → Re-rank (score) → Build Context (assemble) → LLM Generate
Architecture Note

The embedding step accounts for 40–120ms in typical deployments. Run a dedicated embedding microservice with connection pooling — never co-locate the embedding model with your inference cluster or you will see cascading latency spikes under load.
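
A dedicated embedding service is easiest to consume through a client that keeps its HTTP connections pooled, so repeated embed calls reuse TCP connections instead of paying handshake latency each time. A minimal sketch, assuming a hypothetical `/v1/embed` endpoint that accepts `{"texts": [...]}` and returns `{"vectors": [...]}` (the endpoint shape, payload, and timeout value are illustrative, not any specific service's API):

```python
import requests
from requests.adapters import HTTPAdapter

class EmbeddingClient:
    """Thin client for a dedicated embedding microservice (sketch)."""

    def __init__(self, base_url: str, pool_size: int = 32, timeout_s: float = 0.5):
        self.base_url = base_url.rstrip("/")
        self.timeout_s = timeout_s
        # Pooled session: reuse connections across calls instead of
        # opening a new TCP (and TLS) connection per request.
        self.session = requests.Session()
        adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    @staticmethod
    def batch(texts, max_batch=64):
        """Split texts into service-sized batches."""
        return [texts[i:i + max_batch] for i in range(0, len(texts), max_batch)]

    def embed(self, texts):
        vectors = []
        for chunk in self.batch(texts):
            resp = self.session.post(
                f"{self.base_url}/v1/embed",
                json={"texts": chunk},
                timeout=self.timeout_s,  # fail fast: embed budget is 40-120ms
            )
            resp.raise_for_status()
            vectors.extend(resp.json()["vectors"])
        return vectors
```

The short timeout is deliberate: with a 40–120ms embed budget, a slow embedding call should fail fast and surface in metrics rather than stall the whole request.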

Retrieval Layer Design

The retrieval layer is where most RAG systems degrade silently. Embedding model drift, index staleness, and retrieval diversity failure are the three root causes of output quality regression — none of them surface in standard HTTP metrics.

Query Embed: text-embed-3
pgvector: HNSW Index
Re-ranker: cross-encoder
Filter Engine: metadata / ACL

The HNSW index delivers sub-10ms ANN search at 95th percentile for corpora up to 50M vectors. Beyond that threshold, shard the index horizontally — query all shards in parallel and merge ranked results before re-ranking.
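
The sharded fan-out reduces to a parallel map over shards followed by a global top-k merge. A minimal illustration (the `FakeShard` class and the `(score, doc_id)` hit shape are assumptions for the sketch, not pgvector's API):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

class FakeShard:
    """Stand-in for one horizontal slice of the HNSW index (illustration only)."""

    def __init__(self, hits):
        self._hits = hits  # list of (score, doc_id); higher score = more similar

    def search(self, query_vec, k):
        return sorted(self._hits, reverse=True)[:k]

def search_sharded(shards, query_vec, k=10):
    """Query every shard in parallel, then merge ranked results into a global top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: s.search(query_vec, k), shards))
    # Each shard returns its local top-k; the global top-k is the k best
    # hits across all shards, so merging local top-k lists is lossless.
    return heapq.nlargest(
        k,
        (hit for part in partials for hit in part),
        key=lambda hit: hit[0],
    )
```

Because each shard already returns its local top-k, the merge never needs more than `k × num_shards` candidates, which keeps the fan-in step cheap relative to the ANN search itself.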

Observability Stack

Production RAG requires four telemetry signal categories beyond standard HTTP metrics: query quality signals, retrieval health signals, generation safety signals, and infrastructure signals.

Query Latency (P95): target ≤ 1.8s
Retrieval MRR: 0.87, stable
Cache Hit: 99.4%, up 3.2%
Guard Block Rate: 2.1%, normal
Embed P99: 412ms, rising (watch)
Req / Min: 6.8k, growing

Instrument every retrieval call with OpenTelemetry trace context from day one. Add custom span attributes for rag.query_tokens, rag.retrieved_chunks, rag.rerank_score_top1, and rag.context_tokens. These four attributes power the dashboard that tells you whether your RAG system is actually working.

Deployment Pipeline

The RAG deployment pipeline extends a standard CI/CD pipeline with three RAG-specific gates: retrieval quality validation, index warm-up, and canary answer quality observation.

01. Document Ingestion
Parse, chunk, and embed source documents. Validate chunk quality scores and metadata completeness before index insertion. Reject chunks below the quality threshold.
Tooling: S3, LangChain, pgvector

02. Index Warm-Up & Quality Gate
Run the benchmark query set against the new index. Gate deployment on P95 retrieval latency ≤ 150ms and MRR ≥ 0.82 on the evaluation corpus.
Tooling: HNSW, Eval Suite, A/B Gate

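
The step-02 gate reduces to two numbers computed over the benchmark run. A minimal sketch of the check (function names are assumptions; thresholds mirror the text, and `statistics.quantiles` with `n=20` yields the 95th-percentile cut point):

```python
import statistics

def mrr(ranked_results, relevant_ids):
    """Mean Reciprocal Rank over an evaluation query set.

    ranked_results: one ranked doc-id list per query.
    relevant_ids:   the single relevant doc id per query.
    """
    total = 0.0
    for ranked, rel in zip(ranked_results, relevant_ids):
        for pos, doc_id in enumerate(ranked, start=1):
            if doc_id == rel:
                total += 1.0 / pos
                break  # queries with no hit contribute 0
    return total / len(relevant_ids)

def quality_gate(latencies_ms, ranked_results, relevant_ids,
                 p95_budget_ms=150.0, mrr_floor=0.82):
    """Return (passed, metrics) for the index warm-up gate."""
    p95 = statistics.quantiles(latencies_ms, n=20)[18]  # 19th of 19 cut points = P95
    score = mrr(ranked_results, relevant_ids)
    passed = p95 <= p95_budget_ms and score >= mrr_floor
    return passed, {"p95_ms": p95, "mrr": score}
```

Wiring this into CI as a hard gate means a bad index build fails the pipeline instead of degrading answers in production.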
03. Canary Rollout (5% Traffic)
Route 5% of production traffic to the new RAG version for 30 minutes. Monitor answer quality scores, latency percentiles, and guard block rates. Automated rollback on SLO breach.
Tooling: Argo Rollouts, Prometheus, SLO Gate

04. Full Production Promotion
Promote to 100% after the observation window clears. Retain the previous index snapshot for 24 hours with a sub-5-minute restore path.
Tooling: Kubernetes, Snapshot, Rollback Ready
Observability First

Retrieval failures are silent by default. You will not know your RAG system is returning stale or irrelevant context without active distributed tracing. Add trace context from the API gateway through to the vector DB query — a single trace should show every retrieval call and its scores.

Production Readiness Indicators

A RAG system reaches production-ready status when it satisfies four operational contracts:

  1. Retrieval quality — Mean Reciprocal Rank (MRR) ≥ 0.80 on the evaluation set, measured at least weekly
  2. Latency budget — P99 end-to-end ≤ 3s under peak load (95th percentile traffic)
  3. Observability coverage — All four signal categories instrumented: queries, retrievals, generations, guard events
  4. Rollback capability — Previous index snapshot retained with a documented, tested sub-5-minute restore path
Operational Risk: Embedding Drift

The most common silent production failure: embedding model updated, retrieval quality degrades over 48 hours, P99 latency remains stable, users report "worse answers" before any metric fires an alert. Prevent this with weekly embedding drift detection — embed your evaluation corpus with the new model and compare MRR against the baseline before any model update reaches production.
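
The weekly drift check described above can be sketched as an MRR comparison against the recorded baseline, with a small regression tolerance. `embed_fn` and `retrieve_fn` are placeholders for your embedding service and index client, and the 0.02 tolerance is an assumed default, not a recommendation from any benchmark:

```python
def check_embedding_drift(embed_fn, retrieve_fn, eval_queries,
                          baseline_mrr, max_regression=0.02):
    """Gate a model update on evaluation-set MRR versus the stored baseline.

    eval_queries: list of (query_text, relevant_doc_id) pairs.
    embed_fn:     query text -> vector, using the candidate embedding model.
    retrieve_fn:  vector -> ranked list of doc ids from the index.
    """
    total = 0.0
    for query, relevant in eval_queries:
        ranked = retrieve_fn(embed_fn(query))
        # Position of the relevant doc in the ranking, or None if missing.
        rank = next((i for i, d in enumerate(ranked, start=1) if d == relevant), None)
        if rank is not None:
            total += 1.0 / rank
    current_mrr = total / len(eval_queries)
    passed = (baseline_mrr - current_mrr) <= max_regression
    return passed, current_mrr
```

Run this in the model-update pipeline: a failing check blocks the rollout before users ever see the "worse answers" phase of the failure mode.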