Agent-as-Judge Architecture

Version: 2.0.0 · Status: Production · Last Updated: 2026-02-03

Overview

Agent-as-Judge evaluates agentic AI systems using autonomous judge agents equipped with planning, tool use, memory, and multi-agent collaboration. It addresses the limitations of single-pass LLM evaluation for complex agent trajectories.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Agent-as-Judge Components                     │
├─────────────────────────────────────────────────────────────────┤
│  Judge Classes                  Tool Verification                │
│  ├── AgentJudge (base)         ├── verifyToolCall()             │
│  ├── ProceduralJudge           ├── verifyToolCalls()            │
│  └── ReactiveJudge             └── Weighted scoring             │
├─────────────────────────────────────────────────────────────────┤
│  Step Scoring                   Trajectory Analysis              │
│  ├── scoreStep()               ├── analyzeTrajectory()          │
│  ├── aggregateStepScores()     ├── Redundancy detection         │
│  └── Weighted aggregation      └── Loopiness metrics            │
├─────────────────────────────────────────────────────────────────┤
│  Multi-Agent Collaboration      Production Utilities             │
│  ├── collectiveConsensus()     ├── AgentEvalTimeoutError        │
│  ├── Convergence detection     ├── withAgentTimeout()           │
│  └── Variance tracking         └── LRU memory management        │
└─────────────────────────────────────────────────────────────────┘

Core Types

// src/backends/index.ts

interface StepScore {
  step: string | number;       // Step identifier
  score: number;               // Score for this step (0-1)
  evidence?: EvidenceValue;    // Supporting evidence
  explanation?: string;        // Human-readable explanation
}

interface ToolVerification {
  toolName: string;            // Actual tool called
  toolCallId?: string;         // Unique call ID
  toolCorrect: boolean;        // Correct tool selected?
  argsCorrect: boolean;        // Arguments valid?
  resultCorrect?: boolean;     // Result matched expectations?
  score: number;               // Overall score (0-1)
  expectedTool?: string;       // Expected tool
  evidence?: EvidenceValue;    // Supporting evidence
}

interface TrajectoryMetrics {
  trajectoryLength: number;    // Number of steps
  redundancyScore?: number;    // Repeated action detection
  loopinessScore?: number;     // Circular path detection
  efficiency?: number;         // Steps vs optimal path
}
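The StepScore shape above feeds the aggregateStepScores() utility named in the component diagram. A minimal sketch of weighted aggregation, assuming per-step weights default to 1; the weight-map parameter is illustrative, not the library's actual signature:

```typescript
interface StepScore {
  step: string | number;       // Step identifier
  score: number;               // Score for this step (0-1)
  explanation?: string;        // Human-readable explanation
}

// Hypothetical weighted aggregation: unweighted steps count as weight 1.
function aggregateStepScores(
  scores: StepScore[],
  weights?: Map<string | number, number>
): number {
  if (scores.length === 0) return 0;
  let total = 0;
  let weightSum = 0;
  for (const s of scores) {
    const w = weights?.get(s.step) ?? 1;
    total += s.score * w;
    weightSum += w;
  }
  return total / weightSum;
}

const overall = aggregateStepScores(
  [
    { step: 'plan', score: 1.0 },
    { step: 'tool_call', score: 0.5 },
  ],
  new Map([['plan', 2]])
);
// (1.0*2 + 0.5*1) / 3 ≈ 0.833
```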

Judge Patterns

Procedural Judge

Fixed evaluation pipeline with predefined stages.

Input → Stage 1 → Stage 2 → ... → Stage N → Aggregate
  • Early termination on critical failures
  • Context passed between stages
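The fixed pipeline above can be sketched as an ordered stage list with early termination; the stage names, StageResult shape, and criticalFailure flag are illustrative assumptions, not the actual ProceduralJudge API:

```typescript
interface StageResult {
  score: number;               // Stage score (0-1)
  criticalFailure?: boolean;   // Aborts the remaining stages
}

type Stage = (input: string, context: Map<string, StageResult>) => StageResult;

// Hypothetical fixed pipeline: each stage sees prior results via context,
// a critical failure short-circuits to 0, otherwise stage scores are averaged.
function runProceduralJudge(input: string, stages: [string, Stage][]): number {
  const context = new Map<string, StageResult>();
  for (const [name, stage] of stages) {
    const result = stage(input, context);
    context.set(name, result);
    if (result.criticalFailure) return 0; // early termination
  }
  const results = [...context.values()];
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}

const verdict = runProceduralJudge('agent output', [
  ['format', () => ({ score: 0.8 })],
  ['accuracy', () => ({ score: 0.6 })],
]);
// average of 0.8 and 0.6 ≈ 0.7
```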

Reactive Judge

Adaptive evaluation based on intermediate feedback.

Input → Router → [Selected Specialists] → Deep Dive (if needed) → Synthesize
  • Dynamic specialist routing
  • Memory-tracked state
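One way to sketch the dynamic routing above: select specialists whose trigger keyword appears in the trajectory, then deep-dive only when the initial pass scores low. The trigger-matching rule and threshold are illustrative assumptions:

```typescript
type Specialist = (trajectory: string[]) => number; // returns a 0-1 score

// Hypothetical router: run only the specialists whose trigger matches a step.
function runReactiveJudge(
  trajectory: string[],
  specialists: Map<string, Specialist>,
  deepDiveThreshold = 0.5
): number {
  const selected = [...specialists.entries()].filter(([trigger]) =>
    trajectory.some((step) => step.includes(trigger))
  );
  if (selected.length === 0) return 1; // nothing flagged for review
  const scores = selected.map(([, judge]) => judge(trajectory));
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  // "Deep dive" stand-in: when the initial pass is low, take the worst score.
  return mean < deepDiveThreshold ? Math.min(...scores) : mean;
}

const specialists = new Map<string, Specialist>([['tool', () => 0.9]]);
const score = runReactiveJudge(['tool:search', 'respond'], specialists);
```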

Consensus Judge

Multi-agent debate with convergence detection.

Judges → Round 1 → ... → Round N → Median (when variance < threshold)
  • Variance tracking per round
  • Early convergence exit
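The convergence loop can be sketched with per-round variance tracking and a median fallback; the round-update rule (judges revising halfway toward the group mean) is a stand-in for an actual debate-and-rescore step:

```typescript
// Population variance of a score vector.
function variance(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical debate loop: stop early once the judges agree closely enough,
// then return the median of the final round.
function collectiveConsensus(
  initial: number[],
  maxRounds = 5,
  varianceThreshold = 0.001
): number {
  let scores = [...initial];
  for (let round = 0; round < maxRounds; round++) {
    if (variance(scores) < varianceThreshold) break; // early convergence exit
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    scores = scores.map((s) => s + 0.5 * (mean - s)); // revise toward the mean
  }
  return median(scores);
}
```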

Metric Categories

Category      Metrics                                                        Scope
Single-turn   Task completion, tool correctness, argument validity           Per-action
Multi-turn    Conversation completeness, turn relevancy, context retention   Session
Multi-agent   Handoff correctness, collaboration efficiency, role adherence  Cross-agent

MCP Integration

Query Filters

Parameter       Type     Description
agentId         string   Subject agent ID (max 128 chars)
agentName       string   Subject agent name (max 256 chars)
evaluatorType   enum     llm, human, rule, classifier

Response Fields

Field               Type                  Description
stepScores          StepScore[]           Per-step evaluation scores
toolVerifications   ToolVerification[]    Tool call verification results
trajectoryLength    number                Steps in agent trajectory

Example Queries

// Query agent evaluations by subject
const agentEvals = await obs_query_evaluations({
  agentId: 'agent-123',
  evaluatorType: 'llm',
  aggregation: 'avg',
  groupBy: ['evaluationName']
});

// Task completion by agent
const taskScores = await obs_query_evaluations({
  evaluationName: 'task_completion',
  agentName: 'CustomerSupportBot',
  aggregation: 'p95'
});
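Long-running evaluations can be bounded with the withAgentTimeout() utility listed in the component diagram. A plausible sketch, assuming a promise-racing shape; the actual signature may differ:

```typescript
class AgentEvalTimeoutError extends Error {
  constructor(ms: number) {
    super(`Agent evaluation timed out after ${ms}ms`);
    this.name = 'AgentEvalTimeoutError';
  }
}

// Race the evaluation against a timer; clear the timer on either outcome.
async function withAgentTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new AgentEvalTimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```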

Export Integration

All export tools support agent evaluation results:

Platform        Agent-Specific Features
Langfuse        Step scores as span annotations
Confident AI    Agent trajectory in test case metadata
Arize Phoenix   Tool verifications as evaluation feedback
Datadog         Agent metrics with ML app segmentation

Files

File                             Lines    Description
src/lib/agent-as-judge.ts        ~600     Main implementation
src/lib/agent-as-judge.test.ts   ~1,200   Test suite
src/backends/index.ts            -        StepScore, ToolVerification types

Test Coverage

Category               Tests
Tool verification      25+
Step scoring           20+
Trajectory analysis    15+
Consensus evaluation   20+
Judge classes          30+
Error handling         10+