AI Observability Stack

How to build production-grade observability for LLM applications — tracing, evaluation, cost tracking, and quality monitoring.

Why Standard APM Falls Short

Traditional APM tools (Datadog, New Relic, Grafana) treat LLM calls as opaque HTTP requests. They can tell you latency and error rates, but not:

  • Which prompt version caused quality degradation
  • Whether retrieval is returning relevant documents
  • How much each feature costs in tokens
  • If the LLM is hallucinating more than last week
  • Why a specific user got a bad response

LLM observability requires trace-level understanding of the AI pipeline.

Architecture Diagram

┌────────────────────────────────────────────────┐
│               Application Layer                │
│                                                │
│  User Query → Retrieval → LLM → Post-process   │
│        │          │         │        │         │
│        └──────────┴─────────┴────────┘         │
│                       │                        │
│               OpenTelemetry Spans              │
└───────────────────────┬────────────────────────┘
                        │
            ┌───────────┴───────────┐
            │                       │
    ┌───────▼───────┐       ┌───────▼───────┐
    │   Langfuse    │       │    Phoenix    │
    │  (Production  │       │   (Dev-time   │
    │  Monitoring)  │       │   Analysis)   │
    ├───────────────┤       ├───────────────┤
    │ Traces        │       │ RAG Quality   │
    │ Costs         │       │ Embeddings    │
    │ Prompts       │       │ Evals         │
    │ Evals         │       │ Hallucination │
    └───────┬───────┘       └───────┬───────┘
            │                       │
    ┌───────▼───────────────────────▼───────┐
    │         Alerting & Dashboards         │
    │  ┌───────────┐         ┌───────────┐  │
    │  │  Grafana  │         │ Slack/PD  │  │
    │  │Dashboards │         │  Alerts   │  │
    │  └───────────┘         └───────────┘  │
    └───────────────────────────────────────┘

The Four Pillars of LLM Observability

1. Trace-Level Monitoring

Every LLM interaction must be traceable from user input to response:

from langfuse.decorators import observe, langfuse_context

@observe()
def handle_query(user_id: str, query: str) -> str:
    """Traced end-to-end: retrieval → augmentation → generation."""

    # Each step becomes a span
    docs = retrieve_documents(query)
    context = format_context(docs)
    prompt = build_prompt(query, context)
    response = call_llm(prompt)

    # Add metadata for analysis
    langfuse_context.update_current_trace(
        user_id=user_id,
        metadata={"doc_count": len(docs), "model": "gpt-4"},
    )

    return response

@observe()
def retrieve_documents(query: str) -> list:
    """Retrieval span with relevance tracking."""
    results = vector_store.similarity_search(query, k=5)

    langfuse_context.update_current_observation(
        metadata={"num_results": len(results)}
    )

    return results
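Under the hood, a decorator like `@observe` wraps each function and records the call as a span nested beneath whichever span is currently open. This is a stdlib-only sketch of that mechanism (not Langfuse's actual implementation; the function names are illustrative):

```python
import functools
import time

spans = []   # (path, duration) records; most deeply nested spans finish first
_stack = []  # names of the spans currently open

def observe(fn):
    """Record each call to fn as a span nested under the caller's span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        _stack.append(fn.__name__)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            spans.append(("/".join(_stack), time.perf_counter() - start))
            _stack.pop()
    return wrapper

@observe
def retrieve(query):
    return [query]

@observe
def handle(query):
    return retrieve(query)

handle("what is observability?")
# spans now holds the nested "handle/retrieve" span and the root "handle" span
```

The real SDK does the same bookkeeping with context variables (so it survives async code) and ships the spans to a backend instead of a list.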

2. Cost Analytics

Track token usage and cost per user, feature, and model:

# Tag every LLM call with cost dimensions
import openai

@observe()
def call_llm(prompt: str, feature: str = "default") -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

    # Langfuse auto-tracks tokens, but add business context
    langfuse_context.update_current_observation(
        metadata={
            "feature": feature,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        }
    )

    return response.choices[0].message.content
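Langfuse derives dollar cost from these token counts and its model price list. If you also need the number in your own code, say, to feed a Prometheus counter, a small helper is enough. The prices below are illustrative placeholders, not current rates; check your provider's pricing page:

```python
# Illustrative per-1K-token prices in USD. These are NOT current rates;
# look them up on your provider's pricing page before relying on them.
PRICING = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    rates = PRICING[model]
    return (input_tokens / 1000) * rates["input"] + (output_tokens / 1000) * rates["output"]

# Example: 1,000 prompt tokens + 500 completion tokens on gpt-4
cost = estimate_cost_usd("gpt-4", 1000, 500)  # 0.03 + 0.03 = 0.06
```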

3. Quality Evaluation

Automated quality scoring for LLM outputs:

from langfuse import Langfuse

langfuse = Langfuse()

def evaluate_response(trace_id: str, output: str, expected: str):
    """Score LLM output quality."""

    # Semantic similarity score
    similarity = compute_similarity(output, expected)
    langfuse.score(
        trace_id=trace_id,
        name="semantic_similarity",
        value=similarity,
    )

    # Factuality check via LLM-as-judge
    factuality = llm_judge_factuality(output)
    langfuse.score(
        trace_id=trace_id,
        name="factuality",
        value=factuality,
    )

4. Prompt Management

Version-controlled prompts linked to production performance:

# Fetch versioned prompt from Langfuse
prompt = langfuse.get_prompt("rag-answer-v2")

# Use in production — linked to traces for A/B comparison
response = call_llm(
    prompt.compile(context=context, query=query)
)

Key Metrics to Track

Metric                    What It Measures      Alert Threshold
Latency (p50/p95/p99)     User experience       p95 > 5s
Token usage per request   Cost efficiency       > 2x baseline
Retrieval relevance       RAG quality           Score < 0.5
Hallucination rate        Output reliability    > 5%
Error rate                System reliability    > 1%
Cost per user             Unit economics        > budget
Prompt injection blocks   Security              Any increase
Evaluation scores         Quality over time     Declining trend
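Several of these thresholds are percentile-based. Prometheus histograms give you percentiles via `histogram_quantile`, but if you compute them yourself from raw latency samples, the nearest-rank method is the simplest:

```python
import math

def percentile(samples: list, p: float):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.2, 3.0, 6.5]  # seconds
p95 = percentile(latencies, 95)  # 6.5s here, which would trip the "p95 > 5s" alert
```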

Alerting Patterns for LLM Systems

Define alerts that catch LLM-specific failures before users report them:

# prometheus-llm-alerts.yaml
groups:
  - name: llm-observability-alerts
    rules:
      # Quality degradation — catch before users complain
      - alert: LLMQualityDrop
        expr: |
          (rate(llm_quality_score_sum[1h]) / rate(llm_quality_score_count[1h])) < 0.7
          and
          (rate(llm_quality_score_sum[24h]) / rate(llm_quality_score_count[24h])) > 0.8
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "LLM quality score dropped below 0.7 (24h avg was above 0.8)"

      # Cost spike — prevent runaway spending
      - alert: LLMCostSpike
        expr: |
          sum(increase(llm_cost_usd_total[1h])) >
          2 * avg_over_time(sum(increase(llm_cost_usd_total[1h]))[7d:1h])
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "LLM cost is 2x the 7-day hourly average"

      # Retrieval quality — RAG returning irrelevant docs
      - alert: LLMRetrievalQualityLow
        expr: rate(llm_retrieval_relevance_score_sum[1h]) / rate(llm_retrieval_relevance_score_count[1h]) < 0.5
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: "RAG retrieval relevance score below 0.5 — check index freshness"

Production Dashboard Design

Structure your Grafana dashboard with these panels for complete LLM visibility:

Row       Panels                                                          Purpose
Overview  Request rate, error rate, active requests, SLO burn             At-a-glance health
Latency   p50/p95/p99 time series, latency heatmap                        Performance tracking
Cost      Hourly cost by model, tokens breakdown, cost per request        Spend monitoring
Quality   Quality score trend, hallucination rate, evaluation pass rate   Output reliability
RAG       Retrieval relevance, cache hit rate, document count             Retrieval health

# Export dashboard-ready metrics from your application
from prometheus_client import Histogram, Counter

llm_request_duration = Histogram(
    "llm_request_duration_seconds",
    "LLM request duration",
    ["model", "feature"],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0],
)

llm_cost_total = Counter(
    "llm_cost_usd_total",
    "Total LLM cost in USD",
    ["model", "feature"],
)

llm_retrieval_relevance = Histogram(
    "llm_retrieval_relevance_score",
    "RAG retrieval relevance score",
    ["collection"],
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
)

Implementation Checklist

  • Deploy Langfuse (self-hosted or cloud) for production tracing
  • Instrument all LLM calls with OpenTelemetry spans
  • Set up cost tracking per user, feature, and model
  • Implement automated evaluation pipelines (quality scores)
  • Create Grafana dashboards for business metrics
  • Configure alerts for latency, cost, and quality anomalies
  • Set up prompt versioning and A/B testing
  • Use Phoenix for development-time RAG quality analysis

Recommended Tools

Category       Tool                    Purpose
Tracing        Langfuse                Open-source LLM tracing and analytics
Evaluation     Arize Phoenix           LLM observability and evaluation platform
Observability  AI Observability Tools  Comprehensive observability tool directory
Security       SlashLLM                Gateway with built-in usage tracking and cost controls
Monitoring     LangSmith               LLM development, testing, and monitoring platform