LLM Evaluation & Testing Architecture
Overview
LLM applications cannot be adequately tested with traditional unit tests alone. Outputs are non-deterministic, quality is partly subjective, and regressions are subtle: a model update or prompt change can degrade answer quality without raising any error. Production LLM systems therefore need a dedicated evaluation architecture that measures output quality continuously, catches regressions before deployment, and provides confidence that prompt changes improve rather than harm performance.
This playbook covers the infrastructure for systematic LLM evaluation — from offline benchmark suites to online production monitoring, including automated scoring, human review workflows, and CI/CD integration that gates deployments on quality metrics.
Three evaluation dimensions require different approaches: correctness (does the output answer the question accurately), safety (does the output follow policies and avoid harmful content), and quality (is the output well-structured, concise, and useful). Each dimension needs its own evaluation methodology and metrics.
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│                       Evaluation Triggers                       │
│  ┌────────────┐  ┌──────────────┐  ┌──────────────────────────┐ │
│  │ CI/CD      │  │ Prompt       │  │ Model Update             │ │
│  │ Pipeline   │  │ Change PR    │  │ (Provider Release)       │ │
│  └────────────┘  └──────────────┘  └──────────────────────────┘ │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                       Dataset Management                        │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Golden       │  │ Production   │  │ Adversarial           │  │
│  │ Test Sets    │  │ Samples      │  │ Test Cases            │  │
│  │ (curated)    │  │ (sampled)    │  │ (edge cases)          │  │
│  └──────────────┘  └──────────────┘  └───────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                       Evaluation Pipeline                       │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Run LLM      │  │ Score        │  │ Compare               │  │
│  │ Inference    │  │ Outputs      │  │ Against Baseline      │  │
│  └──────────────┘  └──────────────┘  └───────────────────────┘  │
│                                                                 │
│  Scoring Methods:                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ LLM-as-Judge │  │ Heuristic    │  │ Human Review          │  │
│  │ (automated)  │  │ (regex/code) │  │ (annotation UI)       │  │
│  └──────────────┘  └──────────────┘  └───────────────────────┘  │
└────────────────────────────┬────────────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────────────┐
│                          Quality Gate                           │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────────────┐  │
│  │ Pass/Fail    │  │ Regression   │  │ Report                │  │
│  │ Decision     │  │ Detection    │  │ Generation            │  │
│  └──────────────┘  └──────────────┘  └───────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Evaluation Triggers initiate evaluation runs. CI/CD pipelines run evaluations on every prompt/code change. Prompt change PRs trigger comparison against the current baseline. Provider model updates (e.g., OpenAI releases a new GPT-4 version) trigger regression checks.
Dataset Management maintains curated test sets. Golden test sets contain human-verified input/expected-output pairs for core use cases. Production samples are randomly sampled from live traffic to represent real-world distribution. Adversarial test cases cover edge cases, prompt injections, and known failure modes.
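One way to keep the three dataset sources in a single store is to tag each record with its origin. The sketch below is illustrative only: the `EvalCase` fields, the `source` labels, and the JSONL layout are assumptions made for this example, not a format defined by any tool named in this playbook.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical source labels mirroring the three dataset types above.
VALID_SOURCES = {"golden", "production", "adversarial"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source: str                             # one of VALID_SOURCES
    input: str
    expected_output: Optional[str] = None   # production samples may not have one


def parse_cases(lines: Iterable[str]) -> list[EvalCase]:
    """Parse JSONL lines into EvalCase records, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        if row["source"] not in VALID_SOURCES:
            raise ValueError(f"line {line_no}: unknown source {row['source']!r}")
        # Golden cases are human-verified pairs, so an expected output is required.
        if row["source"] == "golden" and not row.get("expected_output"):
            raise ValueError(f"line {line_no}: golden case needs expected_output")
        cases.append(EvalCase(
            case_id=row["id"],
            source=row["source"],
            input=row["input"],
            expected_output=row.get("expected_output"),
        ))
    return cases
```

Validating at load time means a malformed golden case fails the run immediately instead of silently producing a meaningless score.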
Evaluation Pipeline runs inference against test sets and scores the outputs using three methods. LLM-as-Judge uses a separate LLM (typically GPT-4) to evaluate quality against rubrics. Heuristic scoring applies deterministic checks (JSON validity, keyword presence, length constraints). Human review provides ground-truth annotation for high-stakes decisions.
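The heuristic checks mentioned above (JSON validity, keyword presence, length constraints) can be plain Python functions. This is a minimal sketch using only the standard library; the function names and default limits are assumptions chosen for illustration.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_keywords(output: str, keywords: list[str]) -> bool:
    """True if every keyword appears as a whole word (case-insensitive)."""
    return all(
        re.search(rf"\b{re.escape(keyword)}\b", output, flags=re.IGNORECASE)
        for keyword in keywords
    )


def within_length(output: str, min_chars: int = 1, max_chars: int = 2000) -> bool:
    """True if the character count falls inside the allowed range."""
    return min_chars <= len(output) <= max_chars


def run_heuristics(output: str, keywords: list[str]) -> dict[str, bool]:
    """Run all checks and return one pass/fail flag per check."""
    return {
        "json_valid": is_valid_json(output),
        "keywords_present": contains_keywords(output, keywords),
        "length_ok": within_length(output),
    }
```

Because these checks are deterministic and cost nothing to run, they can run on every test case before any LLM-as-Judge call is made.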
Quality Gate makes the deploy/no-deploy decision. It compares scores against baselines, detects regressions across evaluation dimensions, and generates reports for stakeholders.
Infrastructure Components
| Component | Purpose | Implementation |
|---|---|---|
| Test dataset store | Versioned storage for evaluation datasets | S3 + DVC, PostgreSQL, Langfuse datasets |
| Evaluation runner | Execute LLM inference across test sets | LangSmith evaluations, Langfuse, custom scripts |
| LLM-as-Judge | Automated quality scoring with LLM | GPT-4 with scoring rubrics, custom evaluators |
| Heuristic evaluators | Deterministic output checks | Python functions (regex, JSON schema, assertions) |
| Human review UI | Annotation interface for human scoring | Langfuse annotation, Argilla, Label Studio |
| Baseline store | Historical evaluation scores for comparison | PostgreSQL, S3 + Parquet |
| CI/CD integration | Run evaluations in pipeline, gate deployments | GitHub Actions, GitLab CI, custom hooks |
| Report generator | Evaluation summary with regressions highlighted | Custom templates, Langfuse dashboards |
| Production monitor | Online quality scoring on sampled traffic | Langfuse online scoring, custom pipeline |
| Safety evaluator | Test for harmful, biased, or policy-violating outputs | SlashLLM red teaming, custom safety suite |
Recommended Tools
Evaluation Platforms
| Layer | Recommended | Alternative |
|---|---|---|
| Evaluation framework | LangSmith — datasets, evaluators, comparison views | Langfuse evaluations |
| Production tracing with scoring | Langfuse — attach scores to production traces | LangSmith monitoring |
| Human annotation | Argilla — open-source annotation platform | Label Studio, Langfuse annotation queue |
| Safety testing | SlashLLM — red teaming and adversarial testing | Garak, custom prompt injection suite |
Evaluation Methods
| Method | Best For | Trade-offs |
|---|---|---|
| LLM-as-Judge | Scalable quality scoring, style assessment | Costs tokens, judge LLM has its own biases |
| Heuristic/code | Format validation, keyword checks, length | Limited to measurable properties |
| Human review | Ground-truth quality, subjective dimensions | Expensive, slow, does not scale |
| Reference comparison | Factual accuracy (compare to known answer) | Requires curated reference answers |
| Pairwise comparison | Comparing two model/prompt versions | Requires paired outputs and a judge (typically an LLM) |
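For the reference comparison method in the table above, two common deterministic scores are normalized exact match and token-overlap F1. The sketch below is one possible implementation, assuming whitespace tokenization and punctuation stripping are acceptable for the domain; it is not taken from any specific evaluation framework.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(actual: str, expected: str) -> bool:
    """True if both strings contain the same tokens in the same order."""
    return normalize(actual) == normalize(expected)


def token_f1(actual: str, expected: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    actual_tokens, expected_tokens = normalize(actual), normalize(expected)
    if not actual_tokens or not expected_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(actual_tokens == expected_tokens)
    overlap = sum((Counter(actual_tokens) & Counter(expected_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual_tokens)
    recall = overlap / len(expected_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token F1 penalizes verbose answers that bury the correct fact, so a long response containing a one-word reference answer scores well below 1.0 even though it is not wrong. That is a limitation to keep in mind when setting thresholds.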
CI/CD Integration
| Layer | Recommended | Alternative |
|---|---|---|
| Pipeline | GitHub Actions with evaluation step | GitLab CI, Jenkins |
| Quality gate | Custom script reading LangSmith/Langfuse scores | Deployment webhook |
| Alerting | Slack/PagerDuty on regression detection | Email reports |
Deployment Workflow
Phase 1 — Build Evaluation Foundation (Week 1-2)
- Curate initial golden test set — 50-100 input/expected-output pairs covering core use cases
- Implement LLM-as-Judge evaluator with typed rubrics (correctness 1-5, helpfulness 1-5, safety pass/fail)
- Run first baseline evaluation and record scores in version-controlled baseline store
- Set up LangSmith or Langfuse evaluation project
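Recording the first baseline can be as simple as averaging per-case scores by dimension and writing the result to a JSON file that is committed alongside the prompts. The following is a sketch under assumed names: the payload keys (`prompt_version`, `n_cases`, `scores`) are hypothetical, not a schema required by LangSmith or Langfuse.

```python
import json
from collections import defaultdict
from statistics import mean


def aggregate_scores(case_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average per-case scores into one mean per evaluation dimension."""
    by_dimension = defaultdict(list)
    for scores in case_scores:
        for dimension, value in scores.items():
            by_dimension[dimension].append(value)
    return {dim: round(mean(values), 3) for dim, values in by_dimension.items()}


def baseline_json(case_scores: list[dict[str, float]], prompt_version: str) -> str:
    """Serialize a baseline record; sorted keys keep version-control diffs stable."""
    payload = {
        "prompt_version": prompt_version,   # hypothetical label for the prompt revision
        "n_cases": len(case_scores),
        "scores": aggregate_scores(case_scores),
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```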
Example LLM-as-Judge Rubric:
correctness_rubric = """
Score the response's factual correctness on a scale of 1-5:
5: Completely accurate, all facts verifiable
4: Mostly accurate, minor details may be imprecise
3: Partially accurate, mix of correct and incorrect
2: Mostly inaccurate, key facts wrong
1: Completely inaccurate or fabricated
Input: {input}
Expected: {expected_output}
Actual: {actual_output}
Score (1-5):
Explanation:
"""
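To use a rubric like the one above, the template is filled per test case, sent to the judge model, and the reply is parsed back into a score. The sketch below keeps the model call abstract: `judge` is any callable that takes a prompt string and returns the reply text, which is an assumption made so the example does not depend on a particular provider SDK. The rubric is abbreviated so the block is self-contained.

```python
import re
from typing import Callable, NamedTuple

# Abbreviated copy of the rubric, kept short for the example.
RUBRIC = (
    "Score the response's factual correctness on a scale of 1-5.\n"
    "Input: {input}\n"
    "Expected: {expected_output}\n"
    "Actual: {actual_output}\n"
    "Score (1-5):\n"
    "Explanation:\n"
)


class Judgement(NamedTuple):
    score: int
    explanation: str


def parse_judgement(reply: str) -> Judgement:
    """Extract the 1-5 score and the explanation; raise if no score is found."""
    score_match = re.search(r"Score(?:\s*\(1-5\))?\s*:\s*([1-5])\b", reply)
    if score_match is None:
        raise ValueError("judge reply has no parsable score")
    explanation_match = re.search(r"Explanation\s*:\s*(.*)", reply, flags=re.DOTALL)
    explanation = explanation_match.group(1).strip() if explanation_match else ""
    return Judgement(int(score_match.group(1)), explanation)


def judge_correctness(
    judge: Callable[[str], str],   # hypothetical wrapper around the judge model
    question: str,
    expected_output: str,
    actual_output: str,
) -> Judgement:
    """Fill the rubric, call the judge, and parse its reply."""
    prompt = RUBRIC.format(
        input=question,
        expected_output=expected_output,
        actual_output=actual_output,
    )
    return parse_judgement(judge(prompt))
```

Raising on an unparsable reply, rather than defaulting to a score, keeps malformed judge output from quietly skewing the averages.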
Phase 2 — CI/CD Integration (Week 3-4)
- Add evaluation step to CI/CD pipeline — triggered on prompt changes and code changes
- Compare new evaluation scores against stored baseline
- Implement quality gate — block deployment if any dimension regresses beyond threshold
- Generate evaluation report as PR comment showing dimension-by-dimension comparison
- Add adversarial test cases — prompt injection attempts, edge cases, boundary inputs
GitHub Actions Integration Example:
- name: Run LLM Evaluation
  run: |
    python eval/run_evaluation.py \
      --dataset eval/golden_test_set.jsonl \
      --baseline eval/baseline_scores.json \
      --output eval/results.json
- name: Check Quality Gate
  run: |
    python eval/quality_gate.py \
      --results eval/results.json \
      --threshold-correctness 3.8 \
      --threshold-safety 1.0 \
      --max-regression 0.2
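The `quality_gate.py` script invoked above is not shown in this playbook. The following is a sketch of the decision logic it might contain, assuming results and baseline are flat mappings from dimension name to mean score; the function name and data shapes are hypothetical.

```python
def check_quality_gate(
    results: dict[str, float],
    baseline: dict[str, float],
    thresholds: dict[str, float],
    max_regression: float,
) -> list[str]:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []

    # Absolute floors, e.g. correctness >= 3.8 and safety >= 1.0.
    for dimension, minimum in thresholds.items():
        score = results.get(dimension)
        if score is None:
            failures.append(f"{dimension}: no score reported")
        elif score < minimum:
            failures.append(f"{dimension}: {score:.2f} is below threshold {minimum:.2f}")

    # Relative regression against the stored baseline.
    # A small epsilon avoids false failures from floating-point rounding.
    for dimension, old_score in baseline.items():
        new_score = results.get(dimension)
        if new_score is not None and old_score - new_score > max_regression + 1e-9:
            failures.append(
                f"{dimension}: regressed from {old_score:.2f} to {new_score:.2f}"
            )
    return failures
```

In a CI step, the script would exit with a non-zero status when the returned list is non-empty, which is what causes GitHub Actions to mark the step as failed and block the deployment.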
Phase 3 — Production Quality Monitoring (Month 2+)
- Sample 1-5% of production traffic for automated evaluation
- Apply LLM-as-Judge scoring to sampled traces in Langfuse
- Build quality dashboard showing daily scores across evaluation dimensions
- Set up alerts when production quality drops below evaluation thresholds
- Implement human review queue — route low-scoring production outputs to annotators
- Build feedback loop — add corrected production samples to golden test set
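Two of the steps above, sampling a fixed share of traffic and routing low-scoring outputs to annotators, reduce to small decision functions. The sketch below uses hash-based sampling so the same trace is always either in or out of the sample; the function names, the default 2% rate, and the score layout are assumptions for illustration.

```python
import hashlib


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def needs_human_review(scores: dict[str, float], floors: dict[str, float]) -> bool:
    """True if any dimension is missing or falls below its review floor."""
    return any(
        scores.get(dimension, float("-inf")) < floor
        for dimension, floor in floors.items()
    )
```

Hashing the trace ID instead of calling a random number generator makes the sample reproducible across services and reruns, which helps when a score on the dashboard needs to be traced back to a specific request.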
Phase 4 — Advanced Evaluation (Month 3+)
- Implement pairwise evaluation for A/B testing prompt versions
- Build RAG-specific evaluators: retrieval relevance, answer faithfulness, context utilization
- Add safety evaluation suite using SlashLLM red teaming capabilities
- Create domain-specific evaluators for specialized use cases
- Build evaluation leaderboard comparing model/prompt versions across all dimensions
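Pairwise evaluation can be sketched as asking a judge which of two outputs is better, then repeating the question with the order swapped. The swap is a common guard against judges that favor whichever answer appears first. The `judge` interface here (returning `"first"`, `"second"`, or `"tie"`) is an assumption for the example, not an API of any specific platform.

```python
from collections import Counter
from typing import Callable

# Assumed interface: judge(question, first_output, second_output)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, output_a: str, output_b: str) -> str:
    """Return "A", "B", or "tie"; a win counts only if it survives the order swap."""
    forward = judge(question, output_a, output_b)
    backward = judge(question, output_b, output_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def win_rates(verdicts: list[str]) -> dict[str, float]:
    """Share of A wins, B wins, and ties across a list of verdicts."""
    if not verdicts:
        return {}
    counts = Counter(verdicts)
    return {label: counts[label] / len(verdicts) for label in ("A", "B", "tie")}
```

Treating inconsistent verdicts as ties is a conservative choice: it lowers the apparent win rate of both versions but avoids crediting a prompt change for what is only an ordering effect.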
Security Considerations
- Evaluation data security — Golden test sets may contain sensitive data from production. Encrypt datasets at rest and restrict access to the evaluation pipeline.
- LLM-as-Judge manipulation — If evaluation inputs are derived from user data, adversaries could craft inputs that score artificially high with the judge LLM. Use diverse judges and heuristic cross-checks.
- Safety evaluation coverage — Ensure evaluation includes adversarial prompts, prompt injection attacks, and policy-violating inputs. Safety evaluation is not optional — it gates production deployment.
- Evaluation cost — LLM-as-Judge evaluations consume tokens. Budget for evaluation costs (typically 5-10% of production LLM spend) and use caching for repeated evaluations.
- Bias in evaluation — LLM judges can have systematic biases (preferring verbose responses, specific formatting). Calibrate judges with human-annotated reference scores and monitor for drift.
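Calibrating a judge against human-annotated reference scores, as the last item suggests, can start with a few summary statistics over cases scored by both. This is a minimal sketch; the metric names are chosen for this example, and which values count as acceptable is left to the team.

```python
from statistics import mean


def calibration_report(
    judge_scores: list[float], human_scores: list[float]
) -> dict[str, float]:
    """Compare judge scores with human scores for the same cases, in the same order."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [judge - human for judge, human in zip(judge_scores, human_scores)]
    return {
        # Positive values mean the judge scores higher than humans on average.
        "mean_bias": mean(diffs),
        "mean_abs_error": mean(abs(diff) for diff in diffs),
        # Fraction of cases where judge and human gave the identical score.
        "exact_agreement": mean(1.0 if diff == 0 else 0.0 for diff in diffs),
    }
```

Recomputing this report on a fresh human-annotated batch at regular intervals is one way to monitor the drift mentioned above: a growing `mean_bias` suggests the judge has shifted relative to human raters.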