DevOps for AI Systems

How to build reliable CI/CD pipelines, testing strategies, and deployment infrastructure for AI applications — including autonomous agents, RAG pipelines, and LLM-powered services.

Why Agent DevOps Is Different

Traditional microservices are largely deterministic: the same input produces the same output. AI agents are:

Non-deterministic: The same prompt can produce different tool calls
Multi-step: Agents chain decisions, so per-step errors compound
Resource-unpredictable: Token usage varies widely between executions
Tool-dependent: Agents interact with external APIs, databases, and file systems
Hard to test: Success criteria are often qualitative, not binary
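
The multi-step point can be made concrete with a little arithmetic. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real agent steps are usually correlated), so treat it as an illustration rather than a model. The function name is hypothetical.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps succeed independently with the same probability,
    which is a simplification for illustration.
    """
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        rate = chain_success_probability(0.95, n)
        print(f"{n:>2} steps at 95% each: {rate:.1%}")
```

Under these assumptions, a 10-step chain at 95% per step finishes without any error only about 60% of the time.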

DevOps Concern	Microservice	AI Agent
Testing	Unit + integration	Unit + integration + eval
Deployment	Blue-green / canary	Shadow mode + gradual rollout
Monitoring	Latency, errors	Latency, errors, quality, cost, tool success
Rollback	Code version	Code + model + prompt version
Cost control	Compute	Compute + tokens + API calls
Safety	Input validation	Input validation + output guardrails + action limits

Architecture Diagram

┌──────────────────────────────────────────────────────┐
│                    Agent Gateway                      │
│  ┌─────────┐  ┌──────────┐  ┌────────────────────┐  │
│  │ Auth &   │  │ Rate     │  │ Action Approval    │  │
│  │ Routing  │  │ Limiting │  │ (Human-in-loop)    │  │
│  └────┬─────┘  └────┬─────┘  └────────┬───────────┘  │
│       └──────────────┴────────────────┘               │
└───────────────────────┬──────────────────────────────┘
                        │
          ┌─────────────┴─────────────┐
          │                           │
    ┌─────▼──────┐            ┌───────▼──────┐
    │  CrewAI     │            │  AutoGen     │
    │  Agents     │            │  Agents      │
    ├────────────┤            ├──────────────┤
    │ Role-based │            │ Multi-agent  │
    │ Sequential │            │ Conversational│
    │ Workflows  │            │ Workflows    │
    └─────┬──────┘            └───────┬──────┘
          │                           │
    ┌─────▼───────────────────────────▼──────┐
    │            Tool Registry                │
    │  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐  │
    │  │Search│ │ DB   │ │ API  │ │ Code │  │
    │  │Tools │ │Tools │ │Tools │ │Exec  │  │
    │  └──────┘ └──────┘ └──────┘ └──────┘  │
    └─────────────────────────────────────────┘

Testing Strategy

Level 1: Unit Tests (Deterministic)

Test tool functions and data transformations independently:

import pytest

def test_search_tool_returns_results():
    """Tool functions are deterministic — test like normal code."""
    results = search_knowledge_base("kubernetes pod crash")
    assert len(results) > 0
    assert all("kubernetes" in r.content.lower() for r in results)

def test_prompt_template_formatting():
    """Verify prompt templates produce valid prompts."""
    prompt = build_agent_prompt(
        role="SRE",
        task="Investigate high latency",
        context="p99 latency spiked to 5s"
    )
    assert "SRE" in prompt
    assert "high latency" in prompt

Level 2: Integration Tests (Semi-Deterministic)

Test agent workflows with mocked LLM responses:

from unittest.mock import patch

def test_incident_investigation_workflow():
    """Test agent decision flow with controlled LLM responses."""
    mock_responses = [
        "I'll check the metrics dashboard first.",
        "High CPU on node-3. Let me check pod logs.",
        "Root cause: memory leak in payment-service v2.3.1.",
    ]
    
    with patch("agent.call_llm", side_effect=mock_responses):
        result = run_incident_agent(
            alert="High latency on payment-service"
        )
    
    assert result.root_cause is not None
    assert "payment-service" in result.root_cause
    assert len(result.tools_used) >= 2
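
The test above relies on how `unittest.mock` handles a list passed as `side_effect`: each call returns the next item in order. Here is a minimal, self-contained sketch of that mechanism. `run_toy_agent` is a hypothetical stand-in for a real agent loop, not part of any framework.

```python
from unittest.mock import Mock


def run_toy_agent(call_llm, alert: str, max_steps: int = 5) -> list[str]:
    """Call the LLM repeatedly until it reports a root cause or steps run out."""
    transcript = []
    for _ in range(max_steps):
        reply = call_llm(f"Alert: {alert}\nHistory: {transcript}")
        transcript.append(reply)
        if reply.lower().startswith("root cause"):
            break
    return transcript


# A Mock with a list side_effect returns one item per call, in order.
scripted_llm = Mock(side_effect=[
    "I'll check the metrics dashboard first.",
    "High CPU on node-3. Let me check pod logs.",
    "Root cause: memory leak in payment-service v2.3.1.",
])

transcript = run_toy_agent(scripted_llm, alert="High latency on payment-service")
print(scripted_llm.call_count, transcript[-1])
```

If the agent makes more calls than there are scripted replies, the mock raises `StopIteration`, so unexpected extra LLM calls show up as a test failure rather than passing silently.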

Level 3: Evaluation Tests (Non-Deterministic)

Run agents against real LLMs and score outputs:

def test_agent_quality_evaluation():
    """Run 20 scenarios and assert quality thresholds."""
    scenarios = load_test_scenarios("incident_investigation")
    assert scenarios, "No scenarios loaded"  # Avoid dividing by zero below
    scores = []
    
    for scenario in scenarios:
        result = run_incident_agent(alert=scenario.alert)
        score = evaluate_response(
            result.summary,
            scenario.expected_findings,
            criteria=["accuracy", "completeness", "actionability"]
        )
        scores.append(score)
    
    avg_score = sum(scores) / len(scores)
    assert avg_score >= 0.75, f"Quality below threshold: {avg_score:.2f}"
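
The `evaluate_response` helper is not shown here, and in practice it might wrap an LLM judge or a rubric. As a rough, deterministic stand-in, the sketch below scores a summary by the fraction of expected findings it mentions. Both function names are hypothetical, and substring matching is a crude proxy for real quality scoring.

```python
def keyword_coverage_score(summary: str, expected_findings: list[str]) -> float:
    """Fraction of expected findings mentioned in the summary (case-insensitive)."""
    if not expected_findings:
        raise ValueError("expected_findings must not be empty")
    text = summary.lower()
    hits = sum(1 for finding in expected_findings if finding.lower() in text)
    return hits / len(expected_findings)


def pass_rate(scores: list[float], threshold: float) -> float:
    """Share of scenarios whose score meets the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    score = keyword_coverage_score(
        "Memory leak in payment-service caused OOM kills",
        ["memory leak", "payment-service", "rollback"],
    )
    print(f"coverage: {score:.2f}")
```

Reporting a pass rate alongside the average can help, since a good mean may hide a few badly failed scenarios.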

CI/CD Pipeline

# .github/workflows/agent-ci.yml
name: Agent CI/CD
on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest tests/unit/ -v

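  integration-tests:
    runs-on: ubuntu-latest
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt  # Each job starts on a fresh runner
      # --mock-llm is a project-defined pytest option, not a pytest built-in
      - run: pytest tests/integration/ -v --mock-llm

  evaluation:
    runs-on: ubuntu-latest
    needs: integration-tests
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      # --real-llm is also project-defined; this job needs LLM API credentials
      - run: pytest tests/eval/ -v --real-llm
      # Fails the job if a quality regression is detected
      - run: python scripts/compare_eval_results.py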

  deploy-shadow:
    needs: evaluation
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4  # Needed so the k8s/ manifests exist on the runner
      - run: |
          # Deploy to shadow environment
          # Mirror production traffic, compare results
          kubectl apply -f k8s/shadow-deployment.yaml

  promote-production:
    needs: deploy-shadow
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production  # Manual approval if the environment has required reviewers
    steps:
      - uses: actions/checkout@v4
      - run: kubectl apply -f k8s/production-deployment.yaml
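
The pipeline calls `scripts/compare_eval_results.py`, but its contents are not shown. One plausible shape for such a script is sketched below, assuming baseline and current scores are stored as flat JSON objects mapping metric names to floats. The file layout, function names, and the 0.02 tolerance are all assumptions for illustration.

```python
import json
from pathlib import Path


def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`, with the drop size.

    A metric missing from `current` is scored as 0.0, so it counts as a regression.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


def main(baseline_path: str, current_path: str) -> int:
    """Compare two JSON score files and return a process exit code."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    regressions = find_regressions(baseline, current)
    for metric, drop in regressions.items():
        print(f"REGRESSION {metric}: dropped by {drop}")
    return 1 if regressions else 0  # Non-zero exit code fails the CI step
```

A CI entry point would pass the result of `main(...)` to `sys.exit`. The tolerance absorbs run-to-run noise from non-deterministic evaluations, so it is worth tuning against observed variance.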

Deployment Patterns

Shadow Mode

Run new agent version alongside production without affecting users:

Mirror all production requests to shadow agent
Compare shadow responses to production responses
Score quality delta automatically
Promote only if shadow quality >= production quality
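
The steps above can be sketched as a small wrapper plus a promotion check. This is a simplified illustration with hypothetical names: it calls the shadow agent synchronously for brevity, whereas a real deployment would more likely mirror traffic asynchronously so that the shadow adds no user-facing latency.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class ShadowResult:
    response: str                   # What the user receives (always from production)
    shadow_response: Optional[str]  # Shadow output, kept only for comparison
    shadow_error: Optional[str]     # Set if the shadow agent raised


def handle_with_shadow(request: str,
                       production: Callable[[str], str],
                       shadow: Callable[[str], str]) -> ShadowResult:
    """Serve from production; run the shadow agent on the same request."""
    response = production(request)
    try:
        return ShadowResult(response, shadow(request), None)
    except Exception as exc:  # A shadow failure must never reach the user
        return ShadowResult(response, None, repr(exc))


def should_promote(production_scores: list[float],
                   shadow_scores: list[float]) -> bool:
    """Promote only if the shadow's mean quality is at least production's."""
    if not production_scores or not shadow_scores:
        return False  # No evidence, no promotion
    return mean(shadow_scores) >= mean(production_scores)
```

One design point worth checking is that the shadow agent's tools are read-only or stubbed. Otherwise mirrored requests could trigger real side effects twice.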

Gradual Rollout with Quality Gates

Phase 1: 5% traffic  → Monitor 1 hour → Quality check
Phase 2: 25% traffic → Monitor 2 hours → Quality check
Phase 3: 50% traffic → Monitor 4 hours → Quality check
Phase 4: 100% traffic

Each phase requires:

Error rate < 1%
Quality scores >= baseline
Cost per request within 20% of baseline
No critical safety incidents
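
These four requirements translate fairly directly into a gate function. The sketch below uses hypothetical names and reads "within 20% of baseline" as a two-sided bound, on the assumption that a sharp cost drop can also signal that the agent is skipping work. A one-sided check would be an equally reasonable reading.

```python
from dataclasses import dataclass


@dataclass
class PhaseMetrics:
    error_rate: float          # Fraction of failed requests, e.g. 0.004
    quality_score: float       # Mean evaluation score for the phase
    cost_per_request: float    # In any consistent currency unit
    critical_incidents: int


def gate_failures(current: PhaseMetrics, baseline: PhaseMetrics) -> list[str]:
    """Return the failed gate checks; an empty list means the phase may advance."""
    failures = []
    if current.error_rate >= 0.01:
        failures.append("error rate >= 1%")
    if current.quality_score < baseline.quality_score:
        failures.append("quality below baseline")
    cost_delta = abs(current.cost_per_request - baseline.cost_per_request)
    if cost_delta > 0.20 * baseline.cost_per_request:
        failures.append("cost outside 20% of baseline")
    if current.critical_incidents > 0:
        failures.append("critical safety incident")
    return failures
```

Returning the list of failed checks, rather than a bare boolean, makes it easier to report why a rollout was halted.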

Operational Safety

Action Budgets

Limit what agents can do per execution:

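from dataclasses import dataclass, field

@dataclass
class AgentSafetyConfig:
    max_llm_calls: int = 10        # Prevent infinite loops
    max_tool_calls: int = 20       # Limit external interactions
    max_tokens: int = 50_000       # Cost ceiling
    max_execution_time: int = 300  # Timeout in seconds (5 minutes)

    # Dangerous actions require human approval.
    # default_factory gives each config its own list instead of a shared one.
    requires_approval: list[str] = field(default_factory=lambda: [
        "delete_resource",
        "modify_production",
        "send_notification",
    ])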
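
A config only helps if something enforces it at runtime. As a minimal sketch of one way to do that, the hypothetical `BudgetTracker` below counts usage per run and raises once a limit would be exceeded. Its defaults mirror the limits above, but the class and function names are illustrative, not part of any framework.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run would exceed one of its configured limits."""


class BudgetTracker:
    def __init__(self, max_llm_calls: int = 10, max_tool_calls: int = 20,
                 max_tokens: int = 50_000):
        self.limits = {"llm_calls": max_llm_calls, "tool_calls": max_tool_calls,
                       "tokens": max_tokens}
        self.used = {name: 0 for name in self.limits}

    def charge(self, resource: str, amount: int = 1) -> None:
        """Raise if this charge would push usage past the limit; otherwise record it."""
        if self.used[resource] + amount > self.limits[resource]:
            raise BudgetExceeded(
                f"{resource} budget exceeded: "
                f"{self.used[resource]} + {amount} > {self.limits[resource]}"
            )
        self.used[resource] += amount


def needs_approval(action: str, requires_approval: set[str]) -> bool:
    """True if the action must be confirmed by a human before it runs."""
    return action in requires_approval
```

An agent loop would call `tracker.charge("llm_calls")` before each model call and `tracker.charge("tokens", n)` after it, then treat `BudgetExceeded` as a signal to stop and report instead of retrying.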

Circuit Breakers

Automatically disable agents that malfunction:

class AgentCircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
    
    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.trip()
    
    def trip(self):
        """Disable agent, alert on-call, switch to fallback."""
        disable_agent()
        alert_oncall("Agent circuit breaker tripped")
        enable_fallback_workflow()
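
The class above stores `reset_timeout` but does not show how it is used. One common approach, sketched here under that assumption, is to block requests while the breaker is open and allow a trial request once the timeout has elapsed. `TimedCircuitBreaker` is a hypothetical name, and the alerting and fallback hooks are left out to keep the example self-contained.

```python
import time
from typing import Callable


class TimedCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock       # Injectable so tests can control time
        self.failures = 0
        self.opened_at = None    # None means the breaker is closed

    def allow_request(self) -> bool:
        """Closed: allow. Open: block until reset_timeout has elapsed, then allow a trial."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A success closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count failures since the last success; open or re-open at the threshold."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Because the failure count is only cleared on success, a failed trial request re-opens the breaker immediately and restarts the timeout.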

RAG and LLM Pipeline CI/CD

Agent workflows are not the only AI systems that need CI/CD. RAG pipelines and LLM-powered APIs also require specialized deployment practices:

RAG Pipeline Deployment Checklist

# rag-ci-checks.yml — CI checks for RAG systems
rag_pipeline_checks:
  - name: "Embedding model version"
    check: "Verify embedding model matches production index"
    
  - name: "Retrieval quality"
    check: "Run eval suite against 50 test queries, assert recall > 0.8"
    
  - name: "Index freshness"
    check: "Verify vector index was rebuilt after document updates"
    
  - name: "Chunk size consistency"
    check: "Validate chunk size and overlap match production config"
    
  - name: "Prompt template diff"
    check: "Show prompt changes in PR review, run A/B eval"
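
The retrieval quality check depends on a recall metric. A minimal sketch of recall@k follows, assuming each test query comes with a labeled set of relevant document IDs. The function names and data shapes are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per test query."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(recall_at_k(r, rel, k) for r, rel in results) / len(results)
```

A CI check could then assert `mean_recall_at_k(results, k) > 0.8` over the test queries, with `k` matching the number of chunks the production pipeline passes to the model.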

Prompt Version Management

Track prompt templates as versioned artifacts alongside code:

# prompts/v2.3/rag_answer.yaml  (checked into Git)
# name: rag-answer
# version: 2.3
# model: gpt-4o
# temperature: 0.3

PROMPT_TEMPLATE = """You are a helpful assistant. Answer the user's question
using ONLY the provided context. If the context doesn't contain the answer,
say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

Every prompt change triggers:

Automated evaluation against test dataset
Quality score comparison vs current production prompt
PR review with before/after quality metrics
Staged rollout (5% → 25% → 100%) with rollback on quality drop
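
Two pieces of this workflow are easy to sketch: detecting that a prompt has changed, and assigning a stable slice of users to the new version during a staged rollout. The example below uses only the standard library. The function names are hypothetical, and hashing user IDs is just one way to get deterministic buckets.

```python
import hashlib
import json


def prompt_fingerprint(template: str, model: str, temperature: float) -> str:
    """Short content hash that changes whenever the template or its settings change."""
    payload = json.dumps(
        {"template": template, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def in_rollout(user_id: str, rollout_percent: int) -> bool:
    """Deterministically assign a user to the new prompt for `rollout_percent` of users."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # Stable bucket in 0..99
    return bucket < rollout_percent
```

Because each user's bucket is fixed, anyone included at 5% stays included at 25% and 100%, which keeps their experience consistent as the rollout widens. Logging the fingerprint with each request makes it possible to attribute a quality drop to a specific prompt version.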

Implementation Checklist

Implement three-level testing (unit, integration, evaluation)
Set up CI/CD pipeline with quality gates
Deploy shadow mode infrastructure
Configure action budgets and safety limits
Add circuit breakers for production agents
Set up cost monitoring and per-request budgets
Implement gradual rollout with automated quality checks
Create runbooks for agent incidents

Recommended Tools

Category	Tool	Purpose
Agent Frameworks	AI Agent Frameworks	Production agent orchestration platforms
Observability	Langfuse	Agent execution tracing and evaluation
Observability	AI Observability Tools	Monitoring tools for AI agent systems
Security	SlashLLM	Guardrails and cost controls for agent pipelines
Comparison	CrewAI vs AutoGen	Agent framework comparison for deployment decisions
