Lab: Build an AIOps Monitoring Pipeline
Build an AIOps monitoring pipeline that collects metrics with Prometheus, flags latency and error-rate anomalies with a Python detector, and reports them as alerts.
Duration: 2-3 hours
Level: Intermediate
Prerequisites: Docker, Python 3.10+, basic Kubernetes knowledge
What You'll Build
┌──────────┐    ┌────────────┐    ┌───────────────┐
│  App +   │───▶│ Prometheus │───▶│    Anomaly    │
│ Metrics  │    │  (Scrape)  │    │   Detector    │
└──────────┘    └────────────┘    │   (Python)    │
                      │           └───────┬───────┘
                      ▼                   │
                ┌────────────┐    ┌───────▼───────┐
                │  Grafana   │    │     Alert     │
                │ Dashboard  │    │    Manager    │
                └────────────┘    └───────────────┘
Step 1: Set Up the Application
Create a sample app that exposes Prometheus metrics:
# app.py
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

ENDPOINTS = ['/api/users', '/api/orders', '/api/health']

# Define metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'app_request_duration_seconds',
    'Request latency',
    ['endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)
ERROR_RATE = Gauge(
    'app_error_rate',
    'Error rate since startup (errors / total requests)'
)


def simulate_traffic():
    """Simulate realistic traffic with occasional anomalies."""
    total_count = 0
    error_count = 0
    while True:
        # Normal traffic
        endpoint = random.choice(ENDPOINTS)
        latency = random.gauss(0.1, 0.02)  # ~100ms avg
        # Inject an anomaly on 0.3% of requests
        # (about one every 40 s at the ~8 requests/s generated below)
        if random.random() < 0.003:
            latency = random.gauss(2.0, 0.5)  # Spike to 2s
            status = '500'
        else:
            status = '200'
        REQUEST_COUNT.labels(
            method='GET', endpoint=endpoint, status=status
        ).inc()
        REQUEST_LATENCY.labels(endpoint=endpoint).observe(
            max(latency, 0.001)
        )
        # Track totals locally across all endpoints rather than reading
        # the counters' private _value attribute
        total_count += 1
        if status == '500':
            error_count += 1
        ERROR_RATE.set(error_count / total_count)
        time.sleep(random.uniform(0.05, 0.2))


if __name__ == '__main__':
    start_http_server(8000)
    print("Metrics server started on :8000")
    simulate_traffic()
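The `app_error_rate` gauge is derived from totals that only grow, so it reflects the error ratio since startup rather than recent behaviour. One way to get a more recent view is a sliding window over the last N requests. The sketch below is illustrative only: `SlidingErrorRate` and its `window` parameter are hypothetical names, not part of the lab code.

```python
from collections import deque


class SlidingErrorRate:
    """Error ratio over the most recent `window` requests (hypothetical helper)."""

    def __init__(self, window: int = 500):
        # A deque with maxlen discards the oldest outcome automatically
        self._outcomes = deque(maxlen=window)

    def record(self, status: str) -> float:
        """Record one response status and return the updated error ratio."""
        # Treat any 5xx status as an error
        self._outcomes.append(status.startswith('5'))
        return sum(self._outcomes) / len(self._outcomes)


if __name__ == '__main__':
    tracker = SlidingErrorRate(window=4)
    for status in ['200', '500', '200', '200', '200']:
        rate = tracker.record(status)
        print(f"status={status} windowed_error_rate={rate:.2f}")
```

In `simulate_traffic()`, the value returned by `record(status)` could be passed to `ERROR_RATE.set(...)` in place of the cumulative ratio.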
Step 2: Docker Compose Stack
# docker-compose.yml
services:
  app:
    build: .  # expects a Dockerfile in this directory that runs app.py
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
  anomaly-detector:
    build:
      context: .
      dockerfile: Dockerfile.detector  # expects an image that runs detector.py
    environment:
      - PROMETHEUS_URL=http://prometheus:9090
      # Unbuffered stdout so print() output shows up in `docker compose logs` immediately
      - PYTHONUNBUFFERED=1
volumes:
  grafana-data:
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app:8000']
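Once the stack is running, it can help to confirm that Prometheus is actually scraping `app:8000`. The sketch below reads the `/api/v1/targets` endpoint of the Prometheus HTTP API and summarizes target health. The function names `summarize_targets` and `fetch_targets` are hypothetical, and the sample payload is trimmed to the fields used here; a live response carries more.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> dict:
    """Map each active target's "job/instance" to its reported health."""
    targets = payload.get("data", {}).get("activeTargets", [])
    summary = {}
    for target in targets:
        labels = target.get("labels", {})
        key = f'{labels.get("job", "?")}/{labels.get("instance", "?")}'
        summary[key] = target.get("health", "unknown")
    return summary


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the targets payload from a running Prometheus (stack must be up)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Offline sample so the sketch runs without the stack
    sample = {
        "status": "success",
        "data": {
            "activeTargets": [
                {"labels": {"job": "app", "instance": "app:8000"}, "health": "up"}
            ]
        },
    }
    print(summarize_targets(sample))
```

With the stack up, `summarize_targets(fetch_targets())` gives the same summary from live data; a target reported as `down` usually points to a wrong hostname or port in `prometheus.yml`.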
Step 3: Anomaly Detector
# detector.py
import os
import time
from datetime import datetime

import numpy as np
import requests

PROMETHEUS_URL = os.getenv("PROMETHEUS_URL", "http://localhost:9090")


def query_prometheus(query: str) -> list:
    """Run an instant query against Prometheus and return the result list."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": query},
        timeout=10
    )
    response.raise_for_status()
    data = response.json()
    return data.get("data", {}).get("result", [])


def query_range(query: str, duration_minutes: int = 30) -> list:
    """Run a range query over the last `duration_minutes` minutes."""
    # The HTTP API takes Unix timestamps or RFC 3339 times, not "now-30m"
    end = time.time()
    start = end - duration_minutes * 60
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": "15s"},
        timeout=10
    )
    response.raise_for_status()
    return response.json().get("data", {}).get("result", [])


def detect_latency_anomaly(threshold_std: float = 3.0) -> bool:
    """Detect latency anomalies using the z-score method."""
    # Average request latency per endpoint over a 1-minute window
    results = query_range(
        'rate(app_request_duration_seconds_sum[1m]) / '
        'rate(app_request_duration_seconds_count[1m])'
    )
    for result in results:
        # Skip NaN samples, which appear when the request rate is zero
        values = [float(v[1]) for v in result["values"] if v[1] != "NaN"]
        if len(values) < 10:
            continue
        mean = np.mean(values)
        std = np.std(values)
        current = values[-1]
        if std == 0:
            continue  # flat series: z-score is undefined
        z_score = (current - mean) / std
        if abs(z_score) > threshold_std:
            endpoint = result["metric"].get("endpoint", "unknown")
            print(f"[ANOMALY] {datetime.now()} | "
                  f"endpoint={endpoint} | "
                  f"latency={current:.3f}s | "
                  f"z_score={z_score:.2f} | "
                  f"mean={mean:.3f}s")
            return True
    return False


def detect_error_spike() -> bool:
    """Detect a sudden increase in error rate."""
    results = query_prometheus('app_error_rate')
    for result in results:
        error_rate = float(result["value"][1])
        # 5% error threshold. The simulator fails about 0.3% of requests,
        # so lower this value to see the alert fire during the lab.
        if error_rate > 0.05:
            print(f"[ALERT] Error rate spike: {error_rate:.2%}")
            return True
    return False


if __name__ == "__main__":
    print("Anomaly detector started...")
    while True:
        try:
            detect_latency_anomaly()
            detect_error_spike()
        except requests.RequestException as exc:
            # Prometheus may still be starting; retry on the next cycle
            print(f"[WARN] Prometheus query failed: {exc}")
        time.sleep(30)
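To see why a 3-sigma threshold separates a latency spike from normal jitter, the z-score calculation can be tried offline on hand-written numbers. This is a minimal sketch using only the standard library. The helper name `z_score` and the sample values are made up for illustration, and unlike the detector it computes the baseline from the history alone, excluding the newest sample.

```python
import statistics


def z_score(history: list[float], current: float) -> float:
    """Return how many standard deviations `current` lies from the history mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)  # population std, the same default as np.std
    if std == 0:
        return 0.0  # flat history: treat as "no deviation" instead of dividing by zero
    return (current - mean) / std


if __name__ == "__main__":
    # Illustrative average latencies in seconds, centred on ~100 ms
    history = [0.10, 0.11, 0.09, 0.10, 0.12, 0.08, 0.10, 0.11, 0.09, 0.10]
    for sample in (0.11, 0.25):
        score = z_score(history, sample)
        print(f"latency={sample:.2f}s z_score={score:.2f} anomaly={abs(score) > 3.0}")
```

A value inside the usual jitter stays well under the threshold, while a jump to 250 ms lands far above it.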
Step 4: Run the Lab
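# Start the stack (add --build after editing app.py or detector.py)
docker compose up -d
# View metrics (`open` is the macOS command; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:8000/metrics
# Access Prometheus
open http://localhost:9090
# Access Grafana (admin/admin)
open http://localhost:3000
# Watch anomaly detector logs
docker compose logs -f anomaly-detector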
Step 5: Create Grafana Dashboard
- Add Prometheus as a data source (http://prometheus:9090)
- Create a dashboard with these panels:
| Panel | Query | Type |
|---|---|---|
| Request Rate | rate(app_requests_total[1m]) | Time series |
| Latency P95 | histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) | Time series |
| Error Rate | app_error_rate | Gauge |
| Request Count | app_requests_total | Stat |
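The Latency P95 panel relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. The sketch below mirrors that idea in plain Python so the arithmetic can be followed by hand. It is a simplified illustration, not Prometheus's implementation, and the bucket counts are invented; only the bucket bounds come from `app.py`.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


if __name__ == "__main__":
    # Invented cumulative counts over the bucket bounds defined in app.py
    buckets = [(0.01, 0), (0.05, 2), (0.1, 50), (0.25, 98),
               (0.5, 98), (1.0, 98), (2.5, 99), (math.inf, 100)]
    print(f"p95 ~ {histogram_quantile(0.95, buckets):.3f}s")
```

Because the estimate is interpolated, its precision depends on bucket width: with the lab's buckets, any P95 between 100 ms and 250 ms is resolved only within that one bucket.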
Challenge Extensions
- Add Slack alerts — send anomaly notifications to a Slack channel
- ML-based detection — replace z-score with Isolation Forest
- Auto-remediation — restart pods when anomalies are detected
- Dashboard as code — export Grafana dashboard as JSON
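For the Slack alerts extension, one possible starting point is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below uses only the standard library. The function names and the `SLACK_WEBHOOK_URL` environment variable are assumptions made for illustration; the webhook URL itself has to come from your own Slack workspace.

```python
import json
import os
import urllib.request


def build_slack_payload(endpoint: str, latency: float, z_score: float) -> bytes:
    """Encode an anomaly as the JSON body sent to an incoming webhook."""
    text = (f"Latency anomaly on {endpoint}: "
            f"{latency:.3f}s (z-score {z_score:.2f})")
    return json.dumps({"text": text}).encode("utf-8")


def send_slack_alert(webhook_url: str, payload: bytes) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    payload = build_slack_payload("/api/orders", 2.013, 4.57)
    # SLACK_WEBHOOK_URL is a hypothetical variable name; nothing is sent if unset
    webhook_url = os.getenv("SLACK_WEBHOOK_URL")
    if webhook_url:
        print("Slack responded with status", send_slack_alert(webhook_url, payload))
    else:
        print("Dry run:", payload.decode("utf-8"))
```

In `detector.py`, a call like this could sit next to the `[ANOMALY]` print inside `detect_latency_anomaly()`, with the variable passed in through the `environment` section of the compose file.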
Next Steps
- AIOps Architecture — Production AIOps design
- Observability Stack — Full observability guide