Pillar 3: Quality & Lifecycle
Philosophy
"Reliability is a Feature" - Reliability competes with velocity for engineering resources. Treat it as a first-class requirement, not an afterthought.
Quality in AI systems cannot rely on "vibes" or spot checks. Unlike traditional software where correctness is binary, AI correctness is subjective and probabilistic. Quality & Lifecycle practices move development from intuition to rigorous, measurable engineering.
The goal: Measurable, reproducible, and improvable systems through automated testing and feedback loops.
Core Concepts
1. Evals-Driven Deployments
Principle: Never deploy without passing a regression suite. Vibes are not a deployment strategy.
CI/CD gates for AI systems must measure output correctness, not just code correctness.
Deployment Pipeline:
```mermaid
graph TD
    A[Code Changes] --> B[Unit Tests]
    B -->|Pass| C[Offline Evals]
    C -->|Pass| D[Staging]
    D --> E[Online Evals]
    E -->|Pass| F[Canary 5%]
    F --> G[Monitor 24h]
    G -->|OK| H[Gradual 50%]
    H --> I[Monitor 48h]
    I -->|OK| J[Full 100%]
    C -->|Fail| K[Block]
    E -->|Fail| K
    G -->|Degrade| L[Rollback]
    I -->|Degrade| L
    style K fill:#f44336
    style L fill:#ff9800
```
Eval-Driven Deployment Checklist:
| Stage | Action | Pass Criteria | Rollback Trigger |
|---|---|---|---|
| Offline Evals | Run on golden dataset | Accuracy >95% | <95% accuracy |
| Staging | Deploy to internal users | No crashes | Critical errors |
| Canary (5%) | Monitor P95 latency, error rate | Within 10% of baseline | >10% degradation |
| Gradual (50%) | Monitor hallucination rate | <0.1% | >0.15% |
| Full (100%) | Monitor user satisfaction | >80% | <75% |
Rollback Triggers: Automatic rollback if any metric degrades beyond threshold.
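The rollback triggers above can be encoded as a simple threshold check that a deployment controller polls on each monitoring interval. A minimal sketch: the metric names and thresholds mirror the checklist but are illustrative, not prescriptive.

```python
# Sketch of an automated rollback gate. Metric names and thresholds
# are illustrative; wire them to your own monitoring system.
ROLLBACK_THRESHOLDS = {
    "latency_degradation_pct": 10.0,  # canary: P95 latency vs. baseline
    "hallucination_rate": 0.0015,     # gradual rollout: 0.15% ceiling
    "user_satisfaction": 0.75,        # full rollout: rollback below 75%
}

def should_rollback(metrics: dict) -> bool:
    """Return True if any monitored metric breaches its threshold."""
    if metrics.get("latency_degradation_pct", 0.0) > ROLLBACK_THRESHOLDS["latency_degradation_pct"]:
        return True
    if metrics.get("hallucination_rate", 0.0) > ROLLBACK_THRESHOLDS["hallucination_rate"]:
        return True
    if metrics.get("user_satisfaction", 1.0) < ROLLBACK_THRESHOLDS["user_satisfaction"]:
        return True
    return False
```

A controller would call this after each monitoring window (24h for canary, 48h for gradual) and route to the Rollback node on `True`.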
2. Golden Datasets
Principle: Your eval suite is only as good as your test data.
Golden datasets are curated regression suites of inputs with labeled expected outputs. They're the foundation of offline evals.
Composition:
- 60% Core Capabilities: Common queries representing primary use cases
- 30% Edge Cases: Rare but important scenarios (e.g., ambiguous inputs, multi-step reasoning)
- 10% Adversarial: Jailbreak attempts, prompt injection, nonsense inputs
Maintenance Triggers:
- Production failures (add failed examples weekly)
- HITL escalations (add human-corrected cases)
- Quarterly review (remove stale examples, add new patterns)
Size Guidelines:
| System Complexity | Minimum Dataset Size | Recommended |
|---|---|---|
| Simple classifier | 50 examples | 100-200 |
| Multi-turn agent | 100 examples | 200-500 |
| Complex workflow | 200 examples | 500-1000 |
Version Control: Store in Git with semantic versioning (v1.2.3). Track changes in CHANGELOG.
Example Structure:
```python
golden_dataset = [
    {
        "id": "refund-001",
        "input": "I want to refund order #12345",
        "expected_action": "initiate_refund",
        "expected_confidence": ">0.8",
        "tags": ["refund", "core"],
        "added_date": "2024-01-15",
    },
    # ... more examples
]
```
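Given this structure, an offline eval reduces to comparing predicted actions against `expected_action` and reporting accuracy plus failing example IDs. A minimal runner sketch; the `predict` callable and the stub predictor in the usage example are illustrative stand-ins for a real agent:

```python
def run_offline_eval(golden_dataset, predict):
    """Run the agent (via `predict`) over the golden dataset and
    return overall accuracy plus the ids of failing examples."""
    failures = []
    for example in golden_dataset:
        if predict(example["input"]) != example["expected_action"]:
            failures.append(example["id"])
    accuracy = 1.0 - len(failures) / len(golden_dataset)
    return accuracy, failures

# Usage with a trivial stub predictor (illustrative only):
dataset = [
    {"id": "refund-001", "input": "I want to refund order #12345",
     "expected_action": "initiate_refund"},
    {"id": "status-001", "input": "Where is my order?",
     "expected_action": "check_status"},
]
accuracy, failures = run_offline_eval(
    dataset,
    lambda text: "initiate_refund" if "refund" in text else "check_status",
)
```

Returning the failing IDs (not just the score) is what lets CI reports point straight at the regressing examples.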
3. Unit Testing Agents
Principle: Test components in isolation before testing end-to-end workflows.
Three Types of Unit Tests:
| Test Type | What It Tests | Example |
|---|---|---|
| Tool Calling | Agent selects correct tool | Query "What's the weather?" → calls getWeather() |
| Prompt Adherence | Agent follows instructions | "Respond in JSON" → output is valid JSON |
| Synthetic Data | Agent handles edge cases | Empty input, special characters, long text |
Example: Tool Calling Test
```python
def test_tool_selection(agent):
    """Assert the agent routes each query to the expected tool.
    Assumes `agent.process` returns an object with a `tool_called`
    attribute naming the tool it invoked."""
    test_cases = [
        {"input": "What's the weather?", "expected_tool": "getWeather"},
        {"input": "Send email to john@example.com", "expected_tool": "sendEmail"},
        {"input": "Book a flight to NYC", "expected_tool": "bookFlight"},
    ]
    for case in test_cases:
        output = agent.process(case["input"])
        assert output.tool_called == case["expected_tool"], case["input"]
```
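The prompt-adherence row of the table ("Respond in JSON" → output is valid JSON) can be tested the same way. A minimal validity helper, applied to the agent's raw text response:

```python
import json

def is_valid_json(text: str) -> bool:
    """Prompt-adherence check for agents instructed to respond in
    JSON: return True only if `text` parses as a JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

The same pattern extends to other format instructions (e.g. "respond in under 100 words") by swapping in the appropriate predicate.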
4. Online vs Offline Evals
Principle: Offline evals catch regressions. Online evals catch unknowns.
| Aspect | Offline Evals | Online Evals |
|---|---|---|
| When | Pre-deployment (CI/CD) | Post-deployment (production) |
| Data | Golden dataset (known examples) | Live traffic (real users) |
| Cost | Cheap (run on fixed dataset) | Expensive (run on all traffic) |
| Purpose | Catch regressions | Detect drift and unknowns |
| Feedback Loop | Immediate (blocks deployment) | Delayed (triggers alerts) |
Offline Eval Strategy:
- Run on every pull request
- Block merge if accuracy drops >5%
- Fast feedback (<5 minutes)
Online Eval Strategy:
- Sample 10% of production traffic
- Run async (don't block user responses)
- Alert if hallucination rate >0.1%
- Feed failures back to golden dataset
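The 10% sampling rule above can be implemented with deterministic hashing, so the same request always gets the same sampling decision (which makes online-eval results reproducible when debugging). A sketch, keyed on a request ID:

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically sample `rate` of traffic for online evals.
    Hashing the request id maps it to a uniform bucket in [0, 1);
    ids below `rate` are sampled, and the decision never changes
    for a given id."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the eval runs async, the sampled requests are queued for scoring after the user response is sent, never in the request path.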
5. Feedback Loops for Continuous Improvement
Principle: Production failures should automatically improve your system.
```mermaid
graph LR
    A[Production Traffic] --> B[Collect Failures]
    B --> C[HITL Review]
    C --> D[Update Golden Dataset]
    D --> E[Retrain/Fine-Tune]
    E --> F[Deploy New Version]
    F --> A
```

Feedback Loop Velocity:
| Activity | Frequency | Owner |
|---|---|---|
| Failure Collection | Real-time | Automated monitoring |
| HITL Review | Daily | Human reviewers |
| Golden Dataset Updates | Weekly | ML engineers |
| Model Retraining | Monthly | ML engineers |
Key Metrics:
- Feedback Loop Latency: Time from production failure to golden dataset inclusion (target: <7 days)
- Dataset Growth Rate: New examples added per month (target: 10-20%)
- Improvement Rate: Accuracy gain per retrain cycle (target: 1-3%)
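Feedback-loop latency can be computed directly from timestamps on each failure record. A minimal sketch, assuming each record carries the failure date and (once triaged) the date it entered the golden dataset; the field names are illustrative:

```python
from datetime import datetime

def feedback_loop_latency_days(records):
    """Average days from production failure to golden-dataset
    inclusion, over records that have both timestamps. Returns
    None if no record has been included yet."""
    fmt = "%Y-%m-%d"
    latencies = [
        (datetime.strptime(r["added_to_dataset"], fmt)
         - datetime.strptime(r["failed_at"], fmt)).days
        for r in records
        if r.get("added_to_dataset")
    ]
    return sum(latencies) / len(latencies) if latencies else None
```

Tracking this number weekly makes the <7-day target enforceable rather than aspirational.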
Metrics & Observability
Track these metrics to measure quality and lifecycle maturity:
| Metric | Target | Measurement |
|---|---|---|
| Golden Dataset Accuracy | >95% | % correct predictions on golden dataset |
| Deployment Success Rate | >90% | % deployments that don't rollback |
| User Satisfaction | >80% | NPS, thumbs up/down, explicit feedback |
| Feedback Loop Latency | <7 days | Time from failure to dataset inclusion |
| Eval Runtime | <5 minutes | P95 time for offline evals in CI/CD |
| Cost per Eval | <$1 | Average cost to run golden dataset eval |
Observability Requirements:
- Chain of Thought (CoT) Logging: Capture agent reasoning, not just inputs/outputs
- Cost Tracking: Monitor per-workflow and per-tenant costs
- Eval Dashboards: Real-time view of offline/online eval results
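The first two requirements (CoT logging and cost tracking) can share one structured trace record per workflow step. A minimal sketch; the field names and the `logger` callable are illustrative assumptions, not a fixed schema:

```python
import json
import time

def log_trace(logger, workflow, tenant, reasoning_steps,
              tokens_in, tokens_out, cost_usd):
    """Emit one structured trace record per agent step. `logger` is
    any callable that accepts a string (e.g. a log handler or a
    queue producer)."""
    record = {
        "ts": time.time(),
        "workflow": workflow,
        "tenant": tenant,
        "reasoning": reasoning_steps,            # CoT: log the steps, not just I/O
        "tokens": {"in": tokens_in, "out": tokens_out},
        "cost_usd": cost_usd,                    # rolls up per-workflow and per-tenant
    }
    logger(json.dumps(record))
```

Keeping the record flat and JSON-encoded makes it easy to aggregate per-tenant costs and to replay reasoning traces when an online eval flags a failure.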
Common Pitfalls
1. **No Golden Dataset**
   - Problem: Deploying changes without regression testing
   - Fix: Build a golden dataset with 100+ examples before the first deployment
2. **Static Golden Dataset**
   - Problem: Dataset becomes stale and no longer reflects production queries
   - Fix: Weekly updates from production failures and HITL escalations
3. **Insufficient Coverage**
   - Problem: Golden dataset only tests happy paths, not edge cases
   - Fix: Follow the 60% core / 30% edge / 10% adversarial distribution
4. **Ignoring Online Metrics**
   - Problem: Offline evals pass but production performance degrades
   - Fix: Set up online eval sampling and drift alerts
5. **Slow Feedback Loops**
   - Problem: Months pass between a production failure and a model improvement
   - Fix: Automate failure collection and update the dataset weekly
6. **No Eval-Driven Gates**
   - Problem: Deploying to production without passing evals
   - Fix: Block deployments if offline evals fail
This pillar is part of the AI Reliability Engineering (AIRE) Standards. Licensed under the AIRE Standards License v1.0.