Observability Stack Guide
Observability is the ability to understand your system's internal state from its external outputs. This guide covers building a production observability stack.
The Three Pillars
Metrics, logs, and traces are the three pillars: metrics tell you that something is wrong, logs tell you why, and traces tell you where in the request path. The stack below collects all three through a single pipeline and correlates them in Grafana.
```
┌──────────────────────────────────────────────────────────┐
│                  Observability Platform                  │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │                      Grafana                       │  │
│  │     Dashboards │ Alerts │ Explore │ Correlate      │  │
│  └──────┬──────────────────┬──────────────────┬───────┘  │
│         │                  │                  │          │
│  ┌──────▼─────┐     ┌──────▼─────┐     ┌──────▼─────┐    │
│  │ Prometheus │     │    Loki    │     │   Tempo    │    │
│  │ (Metrics)  │     │   (Logs)   │     │  (Traces)  │    │
│  └──────┬─────┘     └──────┬─────┘     └──────┬─────┘    │
│         │                  │                  │          │
│  ┌──────▼──────────────────▼──────────────────▼───────┐  │
│  │              OpenTelemetry Collector               │  │
│  │             Receive │ Process │ Export             │  │
│  └─────────────────────────┬──────────────────────────┘  │
│                            │                             │
│  ┌─────────────────────────▼──────────────────────────┐  │
│  │                    Applications                    │  │
│  │     SDK Instrumentation │ Auto-instrumentation     │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
```
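All telemetry enters through the OpenTelemetry Collector before fanning out to the three backends. A minimal Collector configuration for this layout might look like the following sketch; the endpoints and the exporter names are illustrative (in particular, newer Collector releases replace the `loki` exporter with `otlphttp`, and Prometheus must be started with its remote-write receiver enabled):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    # requires Prometheus to run with --web.enable-remote-write-receiver
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Each signal gets its own pipeline, so you can add signal-specific processors (sampling for traces, attribute redaction for logs) without affecting the others.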
Deployment with Helm
Prometheus Stack
```bash
# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=changeme \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=100Gi
```
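The same settings can live in a values file, which is easier to review and version than a string of `--set` flags. The keys below simply restate the flags above:

```yaml
# values.yaml — install with:
#   helm install monitoring prometheus-community/kube-prometheus-stack \
#     --namespace monitoring --create-namespace -f values.yaml
grafana:
  adminPassword: changeme
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
```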
Loki for Logs
```bash
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=50Gi
```
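Once Promtail is shipping logs, they are queryable from Grafana Explore with LogQL. Two illustrative queries; the label names depend on your Promtail relabeling config, and `namespace`/`app` here assume the labels Promtail derives from Kubernetes metadata by default:

```logql
# Error logs from one app
{namespace="production", app="api-gateway"} |= "error"

# Log-derived error rate over 5 minutes
sum(rate({app="api-gateway"} |= "error" [5m]))
```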
Tempo for Traces
```bash
helm install tempo grafana/tempo \
  --namespace monitoring \
  --set tempo.storage.trace.backend=s3 \
  --set tempo.storage.trace.s3.bucket=my-tempo-traces
```
OpenTelemetry Instrumentation
Python Application
```python
# Install: pip install opentelemetry-api opentelemetry-sdk \
#     opentelemetry-instrumentation-flask opentelemetry-exporter-otlp
from flask import Flask

from opentelemetry import trace, metrics
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Set up tracing: batch spans and export to the collector over OTLP/gRPC
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(trace_provider)

# Set up metrics: push to the collector every 10 seconds
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317"),
    export_interval_millis=10000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[metric_reader]))

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # auto-creates a server span per request

# Usage
tracer = trace.get_tracer("my-service")
meter = metrics.get_meter("my-service")
request_counter = meter.create_counter(
    "requests_total",
    description="Total requests",
)

@app.route("/api/process")
def process():
    with tracer.start_as_current_span("process-request") as span:
        span.set_attribute("user.id", user_id)  # user_id, do_processing: application-specific
        request_counter.add(1, {"endpoint": "/api/process"})
        result = do_processing()
        return result
```
SLO/SLI Configuration
Define SLOs
| Service | SLI | SLO Target |
|---|---|---|
| API Gateway | Availability (2xx / total) | 99.9% |
| API Gateway | Latency (p99 < 500ms) | 99.0% |
| Payment Service | Availability | 99.99% |
| Search Service | Latency (p95 < 200ms) | 95.0% |
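A quick way to sanity-check these targets is to translate them into allowed downtime per 30-day window. A small stdlib-only helper (illustrative, not part of the stack):

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the error budget allows per window."""
    return (1 - slo) * window_days * 24 * 60

# e.g. 99.90% -> 43.2 min / 30 days
for target in (0.999, 0.9999, 0.95):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.1f} min / 30 days")
```

Note how unforgiving the Payment Service's 99.99% target is: roughly four minutes of downtime a month, which is less than one bad deploy.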
Prometheus Recording Rules
```yaml
# SLO recording rules
groups:
  - name: slo-rules
    interval: 30s
    rules:
      # Availability SLI (5-minute window; define analogous rules for the
      # 1h, 6h, and 3d windows used by the burn-rate alerts below)
      - record: sli:availability:ratio:5m
        expr: |
          sum(rate(http_requests_total{status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Latency SLI (% of requests under 500ms)
      - record: sli:latency:ratio:5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

      # Error budget remaining, averaged over the full 30-day SLO window
      - record: slo:error_budget:remaining
        expr: |
          1 - (
            (1 - avg_over_time(sli:availability:ratio:5m[30d]))
            /
            (1 - 0.999)
          )
```
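The error-budget arithmetic in that last rule is easy to verify by hand: remaining budget is one minus the ratio of the observed error rate to the allowed error rate. In plain Python, with hypothetical availability numbers just to show the formula:

```python
def error_budget_remaining(slo: float, availability: float) -> float:
    """Fraction of the error budget left: 1 - (observed errors / allowed errors)."""
    return 1 - (1 - availability) / (1 - slo)

# With a 99.9% SLO, serving 99.95% leaves half the budget...
print(error_budget_remaining(0.999, 0.9995))  # ≈ 0.5
# ...while 99.8% means the budget is overspent (negative remaining)
print(error_budget_remaining(0.999, 0.998))   # ≈ -1.0
```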
Alerting Best Practices
Multi-Window Multi-Burn Rate Alerts
A burn rate of 1 consumes the error budget exactly over the 30-day SLO window; a burn rate of 14.4 exhausts it in about two days. Each alert pairs a long window (to confirm the burn is sustained, not a blip) with a short window (so the alert resolves quickly once the problem stops). The `sli:availability:ratio:<window>` series are assumed to come from recording rules evaluated over those windows.
```yaml
groups:
  - name: slo-alerts
    rules:
      # Fast burn: 2% of the 30-day budget consumed in 1 hour
      - alert: SLOBurnRateFast
        expr: |
          (
            1 - sli:availability:ratio:1h > 14.4 * (1 - 0.999)
          )
          and
          (
            1 - sli:availability:ratio:5m > 14.4 * (1 - 0.999)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate burning SLO budget fast"

      # Slow burn: 10% of the 30-day budget consumed in 3 days
      - alert: SLOBurnRateSlow
        expr: |
          (
            1 - sli:availability:ratio:3d > 1.0 * (1 - 0.999)
          )
          and
          (
            1 - sli:availability:ratio:6h > 1.0 * (1 - 0.999)
          )
        for: 1h
        labels:
          severity: warning
```
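The 14.4 and 1.0 multipliers follow directly from the budget-consumption targets: the burn rate that spends a given fraction of the budget in a given window is that fraction scaled by the ratio of the SLO period to the window. A stdlib-only sketch of the arithmetic:

```python
def burn_rate(budget_fraction: float, window_hours: float,
              slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_hours / window_hours

print(burn_rate(0.02, 1))   # ≈ 14.4 -> fast-burn threshold (2% in 1 hour)
print(burn_rate(0.10, 72))  # ≈ 1.0  -> slow-burn threshold (10% in 3 days)
```

If you change the SLO window or the consumption targets, recompute the multipliers rather than reusing 14.4 and 1.0.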
Key Dashboards to Build
- Service Overview — Request rate, error rate, latency (RED method)
- Infrastructure — CPU, memory, disk, network per node
- SLO Dashboard — Error budget burn rate, SLI trends
- Kubernetes — Pod health, deployment status, resource utilization
- On-Call — Active alerts, recent incidents, runbook links
Next Steps
- AIOps Architecture — Add AI to your observability
- Kubernetes Operations — Cluster management
- CI/CD Pipelines — Integrated deployments