LLM-as-Judge Architecture

Version: 2.0.0 | Status: Production | Last Updated: 2026-02-03

Overview

LLM-as-Judge uses LLM judges to score the outputs of other AI systems. This module provides the G-Eval and QAG evaluation patterns, bias-mitigation strategies, and production utilities, backed by an enterprise-grade security layer.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    LLM-as-Judge Components                       │
├─────────────────────────────────────────────────────────────────┤
│  Evaluation Patterns          Production Utilities              │
│  ├── gEval()                  ├── JudgeCircuitBreaker           │
│  ├── qagEvaluate()            ├── evaluateWithRetry (60s cap)   │
│  ├── mitigatedPairwiseEval()  ├── runCanaryEvaluations          │
│  └── panelEvaluation()        └── LOG_LEVEL configuration       │
├─────────────────────────────────────────────────────────────────┤
│  Security Layer                                                  │
│  ├── 14 prompt injection patterns + Unicode TR39                │
│  ├── Input validation (64KB total, 10KB/field, 20 context)      │
│  ├── Safe JSON parsing (depth=5, optimized iteration)           │
│  └── 30s default timeout on all LLM calls                       │
└─────────────────────────────────────────────────────────────────┘

Core Types

interface EvaluationEvent {
  timestamp: string;
  evaluationName: string;        // "relevance", "faithfulness"
  scoreValue?: number;           // 0.0 - 1.0
  scoreLabel?: string;           // "pass", "fail"
  explanation?: string;
  evaluator?: string;
  evaluatorType?: 'llm' | 'human' | 'rule' | 'classifier';
  traceId?: string;
  sessionId?: string;
  durationMs?: number;
  errorType?: string;
}

interface GEvalConfig {
  name: string;
  criteria: string;
  evaluationParams: ('input' | 'output' | 'context' | 'expectedOutput')[];
  temperature?: number;          // 0.1-0.2 for consistency
}

interface LLMProvider {
  generate(prompt: string, options?: { temperature?: number; logprobs?: boolean }): Promise<GenerateResult>;
}
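
GenerateResult is defined in the module itself; the shape sketched below is an assumption for illustration only. A stub provider with that assumed shape can be handy for local testing:

// Assumed shape of GenerateResult (the real type lives in src/lib/llm-as-judge.ts).
interface GenerateResult {
  text: string;
  logprobs?: { token: string; logprob: number }[];   // present when options.logprobs is true
}

// Stub provider for local testing: returns a fixed verdict regardless of the prompt.
const stubProvider: LLMProvider = {
  async generate(_prompt, options) {
    return {
      text: JSON.stringify({ score: 8, reason: 'stub verdict' }),
      logprobs: options?.logprobs ? [{ token: '8', logprob: -0.05 }] : undefined,
    };
  },
};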

Evaluation Patterns

G-Eval (Chain-of-Thought + Logprobs)

Input → Generate eval steps → Evaluate with CoT → Normalize via logprobs → Score
  • buildEvalPrompt() - Constructs prompts with sanitization
  • normalizeWithLogprobs() - Probability-weighted normalization
  • gEval() - Full implementation
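
A usage sketch of the flow above. The gEval() call shape and import path are assumptions, not the module's documented signature, and provider stands for any LLMProvider instance (e.g. the stub shown earlier):

import { gEval, type GEvalConfig } from './llm-as-judge';   // path assumed

const relevance: GEvalConfig = {
  name: 'relevance',
  criteria: 'Does the output directly and completely address the user input?',
  evaluationParams: ['input', 'output'],
  temperature: 0.1,                      // low temperature keeps judge scores stable
};

// Hypothetical call shape: config, judge provider, then the test case fields.
const event = await gEval(relevance, provider, {
  input: 'What is the capital of France?',
  output: 'Paris is the capital of France.',
});
// event.scoreValue is a logprob-weighted score normalized to 0.0 - 1.0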

QAG (Question-Answer Generation)

Output → Extract statements → Generate yes/no questions → Answer from context → Score
  • extractStatements() - Atomic claims extraction
  • generateVerificationQuestion() - Statement to question
  • answerQuestion() - Context-based verification
  • qagEvaluate() - Full faithfulness evaluation
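
A faithfulness-check sketch; the qagEvaluate() argument shape is an assumption based on the step list above, and provider is any LLMProvider:

import { qagEvaluate } from './llm-as-judge';   // path assumed

// The output makes two claims; only the completion date is supported by the context,
// so a faithfulness score around 0.5 would be expected.
const result = await qagEvaluate(
  {
    output: 'The Eiffel Tower is 330 m tall and was completed in 1889.',
    context: ['The Eiffel Tower was completed in 1889.'],
  },
  provider,
);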

Bias Mitigation

Strategy         Function                   Description
Position bias    mitigatedPairwiseEval()    Double evaluation with order swap
Multi-judge      panelEvaluation()          Median score from multiple models
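
Usage sketches with call shapes assumed from the table above; the candidate outputs, test case, and judge providers are placeholders:

// Position bias: the same pair is judged twice with the order swapped,
// and the two verdicts are reconciled into one result.
const pairwise = await mitigatedPairwiseEval(
  { input: question, outputA: candidateA, outputB: candidateB },
  provider,
);

// Multi-judge panel: each model scores independently and the median is reported,
// which dampens any single judge's systematic bias.
const panel = await panelEvaluation(testCase, [gpt4oJudge, claudeJudge, geminiJudge]);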

Production Utilities

Circuit Breaker

const breaker = new JudgeCircuitBreaker({
  threshold: 5,           // failures before opening
  resetTimeout: 30000,    // ms before retry
  fallbackModel: 'gpt-4o-mini'
});
  • Treats 429 rate-limit errors as transient; they do not count toward the failure threshold
  • Supports fallback model switching
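
A sketch of routing judge calls through the breaker, assuming it exposes an execute()-style wrapper (the method name is not documented above):

// Hypothetical usage: after 5 consecutive failures the breaker opens and routes
// calls to the fallback model until the 30s reset timeout elapses.
const event = await breaker.execute(() => gEval(relevance, provider, testCase));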

Retry Logic

await evaluateWithRetry(testCase, config);
// - Exponential backoff (capped at 60s)
// - Preserves error.cause chain
// - Validates score range
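
The backoff doubles per attempt and is capped at 60 seconds; a minimal sketch of that schedule (the actual implementation lives in llm-as-judge.ts):

// Delay before attempt n (0-based): base * 2^n, never more than 60s.
const backoffMs = (attempt: number, baseMs = 1000): number =>
  Math.min(baseMs * 2 ** attempt, 60_000);
// attempt 0 → 1s, 1 → 2s, 2 → 4s, ..., 6 and beyond → capped at 60s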

Canary Evaluations

const report = await runCanaryEvaluations();
// Built-in test cases:
// - Perfect answer (should score high)
// - Hallucination (should score low)
// - Off-topic (should score very low)
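
A sketch of gating a deploy on the canary report; the report fields used here (allPassed, failures) are assumptions about its shape:

const canary = await runCanaryEvaluations();

// Fail CI if the judge can no longer separate good answers from hallucinated
// or off-topic ones, since downstream scores would be untrustworthy.
if (!canary.allPassed) {
  console.error('Judge canaries failed:', canary.failures);
  process.exit(1);
}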

Security

Protection          Implementation
Prompt injection    14 detection patterns + Unicode normalization
Size limits         64KB total, 10KB/field, 20 context items
JSON safety         Depth limit (5), optimized parsing
Timeouts            30s default on all LLM calls
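
An illustrative size-limit check mirroring the documented limits; the constant and function names are hypothetical, not the module's exports:

const MAX_TOTAL_BYTES = 64 * 1024;    // 64KB across all fields
const MAX_FIELD_BYTES = 10 * 1024;    // 10KB per field
const MAX_CONTEXT_ITEMS = 20;         // maximum context entries

function validateInputSize(fields: Record<string, string>, context: string[]): void {
  if (context.length > MAX_CONTEXT_ITEMS) throw new Error('Too many context items');
  let total = 0;
  for (const [key, value] of Object.entries(fields)) {
    const bytes = Buffer.byteLength(value, 'utf8');
    if (bytes > MAX_FIELD_BYTES) throw new Error(`Field "${key}" exceeds 10KB`);
    total += bytes;
  }
  if (total > MAX_TOTAL_BYTES) throw new Error('Combined input exceeds 64KB');
}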

Configuration

Variable              Default   Description
LLM_JUDGE_LOG_LEVEL   warn      debug, info, warn, error, silent
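
Set the variable before the module is loaded; 'warn' applies when it is unset:

// Verbose output while debugging locally; use 'silent' to suppress judge logs in tests.
process.env.LLM_JUDGE_LOG_LEVEL = 'debug';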

Files

File                           Lines   Description
src/lib/llm-as-judge.ts        1,699   Main implementation
src/lib/llm-as-judge.test.ts   3,171   Test suite (108 tests)

Test Coverage

Category                 Tests
Security utilities          28
G-Eval pattern               9
QAG pattern                 11
Bias mitigation             17
Production utilities        27
Canary evaluations           7
Performance benchmarks       5
Logging configuration        4