LLM Evaluation & Testing Architecture

Overview

LLM applications cannot be tested with traditional unit tests. The outputs are non-deterministic, quality is subjective, and regressions are subtle — a model update or prompt change can degrade answer quality without triggering any error. Production LLM systems require a dedicated evaluation architecture that measures output quality continuously, catches regressions before deployment, and provides confidence that prompt changes improve rather than harm performance.

This playbook covers the infrastructure for systematic LLM evaluation — from offline benchmark suites to online production monitoring, including automated scoring, human review workflows, and CI/CD integration that gates deployments on quality metrics.

Three evaluation dimensions require different approaches: correctness (does the output answer the question accurately), safety (does the output follow policies and avoid harmful content), and quality (is the output well-structured, concise, and useful). Each dimension needs its own evaluation methodology and metrics.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                   Evaluation Triggers                           │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────────────────┐│
│  │ CI/CD      │  │ Prompt       │  │ Model Update             ││
│  │ Pipeline   │  │ Change PR    │  │ (Provider Release)       ││
│  └────────────┘  └──────────────┘  └──────────────────────────┘│
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                  Dataset Management                             │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│  │ Golden       │  │ Production   │  │ Adversarial           │ │
│  │ Test Sets    │  │ Samples      │  │ Test Cases            │ │
│  │ (curated)    │  │ (sampled)    │  │ (edge cases)          │ │
│  └──────────────┘  └──────────────┘  └───────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                  Evaluation Pipeline                            │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│  │ Run LLM      │  │ Score        │  │ Compare               │ │
│  │ Inference    │  │ Outputs      │  │ Against Baseline      │ │
│  └──────────────┘  └──────────────┘  └───────────────────────┘ │
│                                                                 │
│  Scoring Methods:                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│  │ LLM-as-Judge │  │ Heuristic    │  │ Human Review          │ │
│  │ (automated)  │  │ (regex/code) │  │ (annotation UI)       │ │
│  └──────────────┘  └──────────────┘  └───────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                  Quality Gate                                   │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐ │
│  │ Pass/Fail    │  │ Regression   │  │ Report                │ │
│  │ Decision     │  │ Detection    │  │ Generation            │ │
│  └──────────────┘  └──────────────┘  └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

Evaluation Triggers initiate evaluation runs. CI/CD pipelines run evaluations on every prompt/code change. Prompt change PRs trigger comparison against the current baseline. Provider model updates (e.g., OpenAI releases a new GPT-4 version) trigger regression checks.

Dataset Management maintains curated test sets. Golden test sets contain human-verified input/expected-output pairs for core use cases. Production samples are randomly sampled from live traffic to represent real-world distribution. Adversarial test cases cover edge cases, prompt injections, and known failure modes.

Evaluation Pipeline runs inference against test sets and scores outputs. Three scoring methods: LLM-as-Judge uses a separate LLM (typically GPT-4) to evaluate quality against rubrics. Heuristic scoring applies deterministic checks (JSON validity, keyword presence, length constraints). Human review provides ground-truth annotation for high-stakes decisions.

Quality Gate makes the deploy/no-deploy decision. It compares scores against baselines, detects regressions across evaluation dimensions, and generates reports for stakeholders.

Infrastructure Components

Component	Purpose	Implementation
Test dataset store	Versioned storage for evaluation datasets	S3 + DVC, PostgreSQL, Langfuse datasets
Evaluation runner	Execute LLM inference across test sets	LangSmith evaluations, Langfuse, custom scripts
LLM-as-Judge	Automated quality scoring with LLM	GPT-4 with scoring rubrics, custom evaluators
Heuristic evaluators	Deterministic output checks	Python functions (regex, JSON schema, assertions)
Human review UI	Annotation interface for human scoring	Langfuse annotation, Argilla, Label Studio
Baseline store	Historical evaluation scores for comparison	PostgreSQL, S3 + Parquet
CI/CD integration	Run evaluations in pipeline, gate deployments	GitHub Actions, GitLab CI, custom hooks
Report generator	Evaluation summary with regressions highlighted	Custom templates, Langfuse dashboards
Production monitor	Online quality scoring on sampled traffic	Langfuse online scoring, custom pipeline
Safety evaluator	Test for harmful, biased, or policy-violating outputs	SlashLLM red teaming, custom safety suite

Recommended Tools

Evaluation Platforms

Layer	Recommended	Alternative
Evaluation framework	LangSmith — datasets, evaluators, comparison views	Langfuse evaluations
Production tracing with scoring	Langfuse — attach scores to production traces	LangSmith monitoring
Human annotation	Argilla — open-source annotation platform	Label Studio, Langfuse annotation queue
Safety testing	SlashLLM — red teaming and adversarial testing	Garak, custom prompt injection suite

Evaluation Methods

Method	Best For	Trade-offs
LLM-as-Judge	Scalable quality scoring, style assessment	Costs tokens, judge LLM has its own biases
Heuristic/code	Format validation, keyword checks, length	Limited to measurable properties
Human review	Ground-truth quality, subjective dimensions	Expensive, slow, does not scale
Reference comparison	Factual accuracy (compare to known answer)	Requires curated reference answers
Pairwise comparison	Comparing two model/prompt versions	Requires paired outputs, LLM judge

CI/CD Integration

Layer	Recommended	Alternative
Pipeline	GitHub Actions with evaluation step	GitLab CI, Jenkins
Quality gate	Custom script reading LangSmith/Langfuse scores	Deployment webhook
Alerting	Slack/PagerDuty on regression detection	Email reports

Deployment Workflow

Phase 1 — Build Evaluation Foundation (Week 1-2)

Curate initial golden test set — 50-100 input/expected-output pairs covering core use cases
Implement LLM-as-Judge evaluator with typed rubrics (correctness 1-5, helpfulness 1-5, safety pass/fail)
Run first baseline evaluation and record scores in version-controlled baseline store
Set up LangSmith or Langfuse evaluation project

Example LLM-as-Judge Rubric:

correctness_rubric = """
Score the response's factual correctness on a scale of 1-5:
5: Completely accurate, all facts verifiable
4: Mostly accurate, minor details may be imprecise
3: Partially accurate, mix of correct and incorrect
2: Mostly inaccurate, key facts wrong
1: Completely inaccurate or fabricated

Input: {input}
Expected: {expected_output}
Actual: {actual_output}

Score (1-5):
Explanation:
"""

Phase 2 — CI/CD Integration (Week 3-4)

Add evaluation step to CI/CD pipeline — triggered on prompt changes and code changes
Compare new evaluation scores against stored baseline
Implement quality gate — block deployment if any dimension regresses beyond threshold
Generate evaluation report as PR comment showing dimension-by-dimension comparison
Add adversarial test cases — prompt injection attempts, edge cases, boundary inputs

GitHub Actions Integration Example:

- name: Run LLM Evaluation
  run: |
    python eval/run_evaluation.py \
      --dataset eval/golden_test_set.jsonl \
      --baseline eval/baseline_scores.json \
      --output eval/results.json

- name: Check Quality Gate
  run: |
    python eval/quality_gate.py \
      --results eval/results.json \
      --threshold-correctness 3.8 \
      --threshold-safety 1.0 \
      --max-regression 0.2

Phase 3 — Production Quality Monitoring (Month 2+)

Sample 1-5% of production traffic for automated evaluation
Apply LLM-as-Judge scoring to sampled traces in Langfuse
Build quality dashboard showing daily scores across evaluation dimensions
Set up alerts when production quality drops below evaluation thresholds
Implement human review queue — route low-scoring production outputs to annotators
Build feedback loop — add corrected production samples to golden test set

Phase 4 — Advanced Evaluation (Month 3+)

Implement pairwise evaluation for A/B testing prompt versions
Build RAG-specific evaluators: retrieval relevance, answer faithfulness, context utilization
Add safety evaluation suite using SlashLLM red teaming capabilities
Create domain-specific evaluators for specialized use cases
Build evaluation leaderboard comparing model/prompt versions across all dimensions

Security Considerations

Evaluation data security — Golden test sets may contain sensitive data from production. Encrypt datasets at rest and restrict access to the evaluation pipeline.
LLM-as-Judge manipulation — If evaluation inputs are derived from user data, adversaries could craft inputs that score artificially high with the judge LLM. Use diverse judges and heuristic cross-checks.
Safety evaluation coverage — Ensure evaluation includes adversarial prompts, prompt injection attacks, and policy-violating inputs. Safety evaluation is not optional — it gates production deployment.
Evaluation cost — LLM-as-Judge evaluations consume tokens. Budget for evaluation costs (typically 5-10% of production LLM spend) and use caching for repeated evaluations.
Bias in evaluation — LLM judges can have systematic biases (preferring verbose responses, specific formatting). Calibrate judges with human-annotated reference scores and monitor for drift.

Overview​

Architecture Diagram​

Infrastructure Components​

Recommended Tools​

Evaluation Platforms​

Evaluation Methods​

CI/CD Integration​

Deployment Workflow​

Phase 1 — Build Evaluation Foundation (Week 1-2)​

Phase 2 — CI/CD Integration (Week 3-4)​

Phase 3 — Production Quality Monitoring (Month 2+)​

Phase 4 — Advanced Evaluation (Month 3+)​

Security Considerations​

Related Guides​