# LLM-as-Judge Architecture

**Version:** 2.0.0 | **Status:** Production | **Last Updated:** 2026-02-03
## Overview

LLM-as-Judge uses a strong LLM to score the outputs of other AI systems. This module provides the G-Eval and QAG evaluation patterns, bias-mitigation strategies, and production utilities, with enterprise-grade security throughout.
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     LLM-as-Judge Components                     │
├─────────────────────────────────────────────────────────────────┤
│  Evaluation Patterns            Production Utilities            │
│  ├── gEval()                    ├── JudgeCircuitBreaker         │
│  ├── qagEvaluate()              ├── evaluateWithRetry (60s cap) │
│  ├── mitigatedPairwiseEval()    ├── runCanaryEvaluations        │
│  └── panelEvaluation()          └── LOG_LEVEL configuration     │
├─────────────────────────────────────────────────────────────────┤
│  Security Layer                                                 │
│  ├── 14 prompt injection patterns + Unicode TR39                │
│  ├── Input validation (64KB total, 10KB/field, 20 context)      │
│  ├── Safe JSON parsing (depth=5, optimized iteration)           │
│  └── 30s default timeout on all LLM calls                       │
└─────────────────────────────────────────────────────────────────┘
```
## Core Types

```typescript
interface EvaluationEvent {
  timestamp: string;
  evaluationName: string;   // "relevance", "faithfulness"
  scoreValue?: number;      // 0.0 - 1.0
  scoreLabel?: string;      // "pass", "fail"
  explanation?: string;
  evaluator?: string;
  evaluatorType?: 'llm' | 'human' | 'rule' | 'classifier';
  traceId?: string;
  sessionId?: string;
  durationMs?: number;
  errorType?: string;
}

interface GEvalConfig {
  name: string;
  criteria: string;
  evaluationParams: ('input' | 'output' | 'context' | 'expectedOutput')[];
  temperature?: number;     // 0.1-0.2 for consistency
}

interface LLMProvider {
  generate(prompt: string, options?: { temperature?: number; logprobs?: boolean }): Promise<GenerateResult>;
}
```
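`LLMProvider` keeps the judge model pluggable. A minimal adapter sketch, assuming an illustrative `GenerateResult` shape and a hypothetical `callModelApi` stand-in for a concrete SDK call (neither is part of this module's API; the real `GenerateResult` lives in `src/lib/llm-as-judge.ts`):

```typescript
// Illustrative GenerateResult shape; the real definition may differ.
interface GenerateResult {
  text: string;
  logprobs?: { token: string; logprob: number }[]; // present only when requested
}

// Hypothetical stand-in for a real provider SDK call.
declare function callModelApi(req: {
  prompt: string;
  temperature: number;
  logprobs: boolean;
}): Promise<GenerateResult>;

// Uses the LLMProvider interface declared above.
const judgeProvider: LLMProvider = {
  async generate(prompt, options) {
    return callModelApi({
      prompt,
      temperature: options?.temperature ?? 0.1, // low temperature keeps judgments consistent
      logprobs: options?.logprobs ?? false,
    });
  },
};
```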
## Evaluation Patterns

### G-Eval (Chain-of-Thought + Logprobs)

Input → Generate eval steps → Evaluate with CoT → Normalize via logprobs → Score
- `buildEvalPrompt()` - Constructs evaluation prompts with input sanitization
- `normalizeWithLogprobs()` - Probability-weighted score normalization (sketch below)
- `gEval()` - Full implementation
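The normalization step is what makes G-Eval scores stable: instead of trusting the single sampled score token, weight each candidate score by its token probability and take the expectation. A minimal sketch, assuming a 1-5 rating scale rescaled to the module's 0.0-1.0 range (the logprob entry shape is an assumption, not this module's API):

```typescript
// Logprob-weighted score normalization (the G-Eval trick).
interface TokenLogprob {
  token: string;
  logprob: number; // natural-log probability from the provider
}

function normalizeWithLogprobsSketch(scoreTokenAlternatives: TokenLogprob[]): number {
  let weightedSum = 0;
  let totalProb = 0;
  for (const { token, logprob } of scoreTokenAlternatives) {
    const score = Number.parseInt(token.trim(), 10);
    if (Number.isNaN(score) || score < 1 || score > 5) continue; // skip non-score tokens
    const p = Math.exp(logprob);
    weightedSum += p * score;
    totalProb += p;
  }
  if (totalProb === 0) throw new Error('no numeric score tokens in logprobs');
  // Expected score under the token distribution, rescaled from 1-5 to 0.0-1.0.
  return (weightedSum / totalProb - 1) / 4;
}
```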
### QAG (Question-Answer Generation)

Output → Extract statements → Generate yes/no questions → Answer from context → Score
- `extractStatements()` - Atomic claims extraction
- `generateVerificationQuestion()` - Statement-to-question conversion
- `answerQuestion()` - Context-based verification
- `qagEvaluate()` - Full faithfulness evaluation (scoring sketch below)
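The final faithfulness score is the fraction of the output's atomic claims that the context supports. A minimal scoring sketch, with an assumed verdict shape (the real `qagEvaluate()` return type may differ):

```typescript
// One verdict per extracted statement; shape is an illustrative assumption.
interface QagVerdict {
  statement: string;
  question: string;
  answer: 'yes' | 'no' | 'idk'; // "idk" = the context is silent on the claim
}

function qagScoreSketch(verdicts: QagVerdict[]): number {
  if (verdicts.length === 0) return 1; // nothing claimed, nothing unfaithful
  const supported = verdicts.filter((v) => v.answer === 'yes').length;
  return supported / verdicts.length;
}
```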
### Bias Mitigation

| Strategy | Function | Description |
|---|---|---|
| Position bias | `mitigatedPairwiseEval()` | Double evaluation with order swap (sketch below) |
| Multi-judge | `panelEvaluation()` | Median score from multiple models |
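A minimal sketch of the order-swap idea behind `mitigatedPairwiseEval()`: run the comparison twice with positions swapped, and only declare a winner when both orders agree (`judgePair` is a hypothetical stand-in for a single pairwise LLM call, not this module's API):

```typescript
type Winner = 'A' | 'B' | 'tie';

// Hypothetical single pairwise comparison; returns which position won.
declare function judgePair(first: string, second: string): Promise<'first' | 'second'>;

async function pairwiseWithSwapSketch(a: string, b: string): Promise<Winner> {
  const pass1 = await judgePair(a, b); // A shown first
  const pass2 = await judgePair(b, a); // B shown first
  const winner1: Winner = pass1 === 'first' ? 'A' : 'B';
  const winner2: Winner = pass2 === 'first' ? 'B' : 'A';
  // Disagreement across orders suggests position bias, so report a tie.
  return winner1 === winner2 ? winner1 : 'tie';
}
```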
## Production Utilities

### Circuit Breaker

```typescript
const breaker = new JudgeCircuitBreaker({
  threshold: 5,          // failures before opening
  resetTimeout: 30000,   // ms before retry
  fallbackModel: 'gpt-4o-mini'
});
```
- Does not count transient 429 rate-limit errors toward the failure threshold
- Supports switching to the configured fallback model
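A minimal sketch of the state machine such a breaker implies, based on the documented behavior above; the `execute()` wrapper and the `status` field on errors are illustrative assumptions, and fallback switching is omitted for brevity:

```typescript
class CircuitBreakerSketch {
  private failures = 0;
  private openedAt: number | null = null;

  constructor(private threshold: number, private resetTimeout: number) {}

  async execute<T>(call: () => Promise<T>): Promise<T> {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('circuit open: judge temporarily unavailable');
      }
      this.openedAt = null; // half-open: let one probe call through
    }
    try {
      const result = await call();
      this.failures = 0; // success closes the circuit again
      return result;
    } catch (err) {
      const status = (err as { status?: number }).status;
      if (status !== 429) this.failures += 1; // rate limits are transient, not counted
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw err;
    }
  }
}
```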
### Retry Logic

```typescript
await evaluateWithRetry(testCase, config);
// - Exponential backoff (capped at 60s)
// - Preserves the error.cause chain
// - Validates the returned score range
```
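A minimal sketch of the backoff schedule, matching the documented 60s cap and `error.cause` preservation (the attempt count and 1s base delay are assumptions):

```typescript
async function retrySketch<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delayMs = Math.min(1000 * 2 ** attempt, 60_000); // 1s, 2s, 4s, ... capped at 60s
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Preserve the last underlying failure on the cause chain.
  throw new Error('evaluation failed after retries', { cause: lastError });
}
```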
### Canary Evaluations

```typescript
const report = await runCanaryEvaluations();
// Built-in test cases:
// - Perfect answer (should score high)
// - Hallucination (should score low)
// - Off-topic (should score very low)
```
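Canary results are most useful as a deploy gate: if the judge mis-scores known-good or known-bad cases, its scores on real traffic cannot be trusted. A sketch, assuming an illustrative report shape (not the documented return type):

```typescript
interface CanaryReportSketch {
  cases: { name: string; passed: boolean }[];
}

function assertCanariesPass(report: CanaryReportSketch): void {
  const failed = report.cases.filter((c) => !c.passed);
  if (failed.length > 0) {
    // A judge that scores hallucinations high (or perfect answers low) is
    // itself broken; block rollout rather than trust its scores.
    throw new Error(`canary failures: ${failed.map((c) => c.name).join(', ')}`);
  }
}
```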
## Security

| Protection | Implementation |
|---|---|
| Prompt injection | 14 detection patterns + Unicode TR39 normalization |
| Size limits | 64KB total, 10KB/field, 20 context items |
| JSON safety | Depth limit (5), optimized parsing |
| Timeouts | 30s default on all LLM calls |
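A sketch of the size-limit checks from the table; the limits are the documented ones, while the function name and input shape are assumptions:

```typescript
const MAX_TOTAL_BYTES = 64 * 1024; // 64KB across all fields and context
const MAX_FIELD_BYTES = 10 * 1024; // 10KB per field
const MAX_CONTEXT_ITEMS = 20;

const utf8Bytes = (s: string): number => new TextEncoder().encode(s).length;

function validateSizesSketch(fields: Record<string, string>, context: string[]): void {
  if (context.length > MAX_CONTEXT_ITEMS) {
    throw new Error(`too many context items: ${context.length} > ${MAX_CONTEXT_ITEMS}`);
  }
  let total = context.reduce((sum, item) => sum + utf8Bytes(item), 0);
  for (const [name, value] of Object.entries(fields)) {
    const bytes = utf8Bytes(value);
    if (bytes > MAX_FIELD_BYTES) {
      throw new Error(`field "${name}" exceeds the 10KB per-field limit`);
    }
    total += bytes;
  }
  if (total > MAX_TOTAL_BYTES) {
    throw new Error('combined input exceeds the 64KB total limit');
  }
}
```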
## Configuration

| Variable | Default | Description |
|---|---|---|
| `LLM_JUDGE_LOG_LEVEL` | `warn` | One of `debug`, `info`, `warn`, `error`, `silent` |
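One plausible way the level variable gates output, assuming the conventional severity ordering of the listed levels (this is a sketch, not the module's logger):

```typescript
const LEVELS = ['debug', 'info', 'warn', 'error', 'silent'] as const;
type LogLevel = (typeof LEVELS)[number];

// Fall back to the documented default when the variable is unset or invalid.
const envLevel = process.env.LLM_JUDGE_LOG_LEVEL as LogLevel | undefined;
const configured: LogLevel = envLevel && LEVELS.includes(envLevel) ? envLevel : 'warn';

function shouldLog(messageLevel: Exclude<LogLevel, 'silent'>): boolean {
  // Emit only messages at or above the configured severity.
  return LEVELS.indexOf(messageLevel) >= LEVELS.indexOf(configured);
}
```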
## Files

| File | Lines | Description |
|---|---|---|
| `src/lib/llm-as-judge.ts` | 1,699 | Main implementation |
| `src/lib/llm-as-judge.test.ts` | 3,171 | Test suite (108 tests) |
## Test Coverage

| Category | Tests |
|---|---|
| Security utilities | 28 |
| G-Eval pattern | 9 |
| QAG pattern | 11 |
| Bias mitigation | 17 |
| Production utilities | 27 |
| Canary evaluations | 7 |
| Performance benchmarks | 5 |
| Logging configuration | 4 |