Agent-as-Judge Architecture

Version: 2.0.0 · Status: Production · Last Updated: 2026-02-03

Overview

Agent-as-Judge evaluates agentic AI systems using autonomous judge agents equipped with planning, tool use, memory, and multi-agent collaboration. It addresses the limitations of single-pass LLM evaluation for complex agent trajectories.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Agent-as-Judge Components                     │
├─────────────────────────────────────────────────────────────────┤
│  Judge Classes                  Tool Verification                │
│  ├── AgentJudge (base)         ├── verifyToolCall()             │
│  ├── ProceduralJudge           ├── verifyToolCalls()            │
│  └── ReactiveJudge             └── Weighted scoring             │
├─────────────────────────────────────────────────────────────────┤
│  Step Scoring                   Trajectory Analysis              │
│  ├── scoreStep()               ├── analyzeTrajectory()          │
│  ├── aggregateStepScores()     ├── Redundancy detection         │
│  └── Weighted aggregation      └── Loopiness metrics            │
├─────────────────────────────────────────────────────────────────┤
│  Multi-Agent Collaboration      Production Utilities             │
│  ├── collectiveConsensus()     ├── AgentEvalTimeoutError        │
│  ├── Convergence detection     ├── withAgentTimeout()           │
│  └── Variance tracking         └── LRU memory management        │
└─────────────────────────────────────────────────────────────────┘

Core Types

// src/backends/index.ts

interface StepScore {
  step: string | number;       // Step identifier
  score: number;               // Score for this step (0-1)
  evidence?: EvidenceValue;    // Supporting evidence
  explanation?: string;        // Human-readable explanation
}

interface ToolVerification {
  toolName: string;            // Actual tool called
  toolCallId?: string;         // Unique call ID
  toolCorrect: boolean;        // Correct tool selected?
  argsCorrect: boolean;        // Arguments valid?
  resultCorrect?: boolean;     // Result matched expectations?
  score: number;               // Overall score (0-1)
  expectedTool?: string;       // Expected tool
  evidence?: EvidenceValue;    // Supporting evidence
}

interface TrajectoryMetrics {
  trajectoryLength: number;    // Number of steps
  redundancyScore?: number;    // Repeated action detection
  loopinessScore?: number;     // Circular path detection
  efficiency?: number;         // Steps vs optimal path
}
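The StepScore shape above feeds the aggregateStepScores() utility named in the component diagram. A minimal sketch of weighted aggregation, assuming per-step weights default to 1; the weight-map parameter is illustrative, not the library's actual signature:

```typescript
interface StepScore {
  step: string | number;       // Step identifier
  score: number;               // Score for this step (0-1)
  explanation?: string;        // Human-readable explanation
}

// Hypothetical weighted aggregation: unweighted steps count as weight 1.
function aggregateStepScores(
  scores: StepScore[],
  weights?: Map<string | number, number>
): number {
  if (scores.length === 0) return 0;
  let total = 0;
  let weightSum = 0;
  for (const s of scores) {
    const w = weights?.get(s.step) ?? 1;
    total += s.score * w;
    weightSum += w;
  }
  return total / weightSum;
}

const overall = aggregateStepScores(
  [
    { step: 'plan', score: 1.0 },
    { step: 'tool_call', score: 0.5 },
  ],
  new Map([['plan', 2]])
);
// (1.0*2 + 0.5*1) / 3 ≈ 0.833
```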

Judge Patterns

Procedural Judge

Fixed evaluation pipeline with predefined stages.

Input → Stage 1 → Stage 2 → ... → Stage N → Aggregate
  • Early termination on critical failures
  • Context passed between stages
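The fixed pipeline above can be sketched as an ordered stage list with early termination; the stage names, StageResult shape, and criticalFailure flag are illustrative assumptions, not the actual ProceduralJudge API:

```typescript
interface StageResult {
  score: number;               // Stage score (0-1)
  criticalFailure?: boolean;   // Aborts the remaining stages
}

type Stage = (input: string, context: Map<string, StageResult>) => StageResult;

// Hypothetical fixed pipeline: each stage sees prior results via context,
// a critical failure short-circuits to 0, otherwise stage scores are averaged.
function runProceduralJudge(input: string, stages: [string, Stage][]): number {
  const context = new Map<string, StageResult>();
  for (const [name, stage] of stages) {
    const result = stage(input, context);
    context.set(name, result);
    if (result.criticalFailure) return 0; // early termination
  }
  const results = [...context.values()];
  return results.reduce((sum, r) => sum + r.score, 0) / results.length;
}

const verdict = runProceduralJudge('agent output', [
  ['format', () => ({ score: 0.8 })],
  ['accuracy', () => ({ score: 0.6 })],
]);
// average of 0.8 and 0.6 ≈ 0.7
```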

Reactive Judge

Adaptive evaluation based on intermediate feedback.

Input → Router → [Selected Specialists] → Deep Dive (if needed) → Synthesize
  • Dynamic specialist routing
  • Memory-tracked state
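One way to sketch the dynamic routing above: select specialists whose trigger keyword appears in the trajectory, then deep-dive only when the initial pass scores low. The trigger-matching rule and threshold are illustrative assumptions:

```typescript
type Specialist = (trajectory: string[]) => number; // returns a 0-1 score

// Hypothetical router: run only the specialists whose trigger matches a step.
function runReactiveJudge(
  trajectory: string[],
  specialists: Map<string, Specialist>,
  deepDiveThreshold = 0.5
): number {
  const selected = [...specialists.entries()].filter(([trigger]) =>
    trajectory.some((step) => step.includes(trigger))
  );
  if (selected.length === 0) return 1; // nothing flagged for review
  const scores = selected.map(([, judge]) => judge(trajectory));
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  // "Deep dive" stand-in: when the initial pass is low, take the worst score.
  return mean < deepDiveThreshold ? Math.min(...scores) : mean;
}

const specialists = new Map<string, Specialist>([['tool', () => 0.9]]);
const score = runReactiveJudge(['tool:search', 'respond'], specialists);
```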

Consensus Judge

Multi-agent debate with convergence detection.

Judges → Round 1 → ... → Round N → Median (when variance < threshold)
  • Variance tracking per round
  • Early convergence exit
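The convergence loop can be sketched with per-round variance tracking and a median fallback; the round-update rule (judges revising halfway toward the group mean) is a stand-in for an actual debate-and-rescore step:

```typescript
// Population variance of a score vector.
function variance(xs: number[]): number {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  return xs.reduce((a, x) => a + (x - mean) ** 2, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical debate loop: stop early once the judges agree closely enough,
// then return the median of the final round.
function collectiveConsensus(
  initial: number[],
  maxRounds = 5,
  varianceThreshold = 0.001
): number {
  let scores = [...initial];
  for (let round = 0; round < maxRounds; round++) {
    if (variance(scores) < varianceThreshold) break; // early convergence exit
    const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
    scores = scores.map((s) => s + 0.5 * (mean - s)); // revise toward the mean
  }
  return median(scores);
}
```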

Metric Categories

Category      Metrics                                                        Scope
Single-turn   Task completion, tool correctness, argument validity           Per-action
Multi-turn    Conversation completeness, turn relevancy, context retention   Session
Multi-agent   Handoff correctness, collaboration efficiency, role adherence  Cross-agent

MCP Integration

Query Filters

Parameter       Type     Description
agentId         string   Subject agent ID (max 128 chars)
agentName       string   Subject agent name (max 256 chars)
evaluatorType   enum     llm, human, rule, classifier

Response Fields

Field               Type                  Description
stepScores          StepScore[]           Per-step evaluation scores
toolVerifications   ToolVerification[]    Tool call verification results
trajectoryLength    number                Steps in agent trajectory

Example Queries

// Query agent evaluations by subject
const agentEvals = await obs_query_evaluations({
  agentId: 'agent-123',
  evaluatorType: 'llm',
  aggregation: 'avg',
  groupBy: ['evaluationName']
});

// Task completion by agent
const taskScores = await obs_query_evaluations({
  evaluationName: 'task_completion',
  agentName: 'CustomerSupportBot',
  aggregation: 'p95'
});
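Long-running evaluations can be bounded with the withAgentTimeout() utility listed in the component diagram. A plausible sketch, assuming a promise-racing shape; the actual signature may differ:

```typescript
class AgentEvalTimeoutError extends Error {
  constructor(ms: number) {
    super(`Agent evaluation timed out after ${ms}ms`);
    this.name = 'AgentEvalTimeoutError';
  }
}

// Race the evaluation against a timer; clear the timer on either outcome.
async function withAgentTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new AgentEvalTimeoutError(ms)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```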

Export Integration

All export tools support agent evaluation results:

Platform        Agent-Specific Features
Langfuse        Step scores as span annotations
Confident AI    Agent trajectory in test case metadata
Arize Phoenix   Tool verifications as evaluation feedback
Datadog         Agent metrics with ML app segmentation

Files

File                             Lines    Description
src/lib/agent-as-judge.ts        ~600     Main implementation
src/lib/agent-as-judge.test.ts   ~1,200   Test suite
src/backends/index.ts            -        StepScore, ToolVerification types

Test Coverage

Category               Tests
Tool verification      25+
Step scoring           20+
Trajectory analysis    15+
Consensus evaluation   20+
Judge classes          30+
Error handling         10+