Quality Evaluation Architecture
Version: 2.0.0 | Status: Production | Last Updated: 2026-02-03
Overview
Quality evaluation addresses the “invisible failure” problem where LLM systems appear operational but produce low-quality outputs. This layer provides evaluation event storage, multi-platform export, and LLM-as-Judge patterns.
Architecture
┌──────────────────────────────────────────────────────────────────────────────┐
│ Quality Evaluation Layer │
├──────────────────────────────────────────────────────────────────────────────┤
│ Storage & Query │
│ ├── EvaluationResult schema (OTel GenAI semantic conventions) │
│ ├── queryEvaluations() - JSONL backend with streaming aggregation │
│ └── obs_query_evaluations - MCP tool with filters + aggregations │
├──────────────────────────────────────────────────────────────────────────────┤
│ Platform Exports │
│ ├── Langfuse obs_export_langfuse OTLP + basic auth │
│ ├── Confident AI obs_export_confident API key + env tagging │
│ ├── Arize Phoenix obs_export_phoenix Bearer auth + project org │
│ └── Datadog obs_export_datadog Two-phase spans + eval metrics │
├──────────────────────────────────────────────────────────────────────────────┤
│ LLM-as-Judge (see quality/llm-as-judge.md) │
│ ├── G-Eval (CoT + logprobs) ├── Circuit breaker │
│ ├── QAG (faithfulness) ├── Retry with backoff │
│ ├── Position bias mitigation └── Canary evaluations │
│ └── Multi-judge panels │
├──────────────────────────────────────────────────────────────────────────────┤
│ Compliance │
│ └── obs_query_verifications - EU AI Act human verification tracking │
└──────────────────────────────────────────────────────────────────────────────┘
Core Types
```typescript
// src/backends/index.ts
interface EvaluationResult {
  timestamp: string;
  evaluationName: string;   // gen_ai.evaluation.name (required)
  scoreValue?: number;      // gen_ai.evaluation.score.value
  scoreLabel?: string;      // gen_ai.evaluation.score.label
  scoreUnit?: string;       // "ratio_0_1", "percentage"
  explanation?: string;     // gen_ai.evaluation.explanation
  evaluator?: string;       // Model, human, or system identity
  evaluatorType?: 'llm' | 'human' | 'rule' | 'classifier';
  responseId?: string;      // gen_ai.response.id (correlation)
  traceId?: string;
  spanId?: string;
  sessionId?: string;
}

type EvaluatorType = 'llm' | 'human' | 'rule' | 'classifier';
```
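For reference, the event below shows a populated EvaluationResult. The values are illustrative; the name, score, and explanation mirror the trace example later in this document.

```typescript
// Illustrative evaluation event (example values, not real output).
const relevanceEval: EvaluationResult = {
  timestamp: new Date().toISOString(),
  evaluationName: 'Relevance',                  // gen_ai.evaluation.name
  scoreValue: 0.92,                             // gen_ai.evaluation.score.value
  scoreLabel: 'relevant',                       // gen_ai.evaluation.score.label
  scoreUnit: 'ratio_0_1',
  explanation: 'Response addresses query',      // gen_ai.evaluation.explanation
  evaluator: 'claude-3-opus',                   // judge model identity
  evaluatorType: 'llm',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',  // example W3C trace id
  spanId: '00f067aa0ba902b7',
};
```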
MCP Tools
obs_query_evaluations
Query evaluation events with filtering and aggregation.
| Parameter | Type | Description |
|---|---|---|
| evaluationName | string | Substring match |
| scoreMin / scoreMax | number | Score range filter |
| scoreLabel | string | Exact match |
| evaluatorType | enum | llm, human, rule, classifier |
| aggregation | enum | avg, min, max, count, p50, p95, p99 |
| groupBy | array | evaluationName, scoreLabel, evaluator |
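As a concrete illustration, the arguments below combine the filters and aggregations from the table into a single query; the specific values are hypothetical.

```typescript
// Hypothetical obs_query_evaluations arguments: low-scoring LLM-judged
// relevance evaluations, aggregated as p95 per evaluation name.
const queryArgs = {
  evaluationName: 'Relevance',   // substring match
  scoreMin: 0,                   // score range filter
  scoreMax: 0.5,
  evaluatorType: 'llm',
  aggregation: 'p95',
  groupBy: ['evaluationName'],
};
```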
Export Tools
All export tools share common filters plus platform-specific options:
| Tool | Auth | Key Features |
|---|---|---|
| obs_export_langfuse | Basic auth (pk:sk) | OTLP /v1/traces, retry with backoff |
| obs_export_confident | API key | Environment tagging, metric collections |
| obs_export_phoenix | Bearer token | Project organization, legacy auth support |
| obs_export_datadog | DD_API_KEY | Two-phase export, auto metric type detection |
obs_query_verifications
Query human verification events recorded for EU AI Act compliance.
| Parameter | Type | Description |
|---|---|---|
| sessionId | string | Session filter |
| verificationType | enum | approval, rejection, override, review |
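A sketch of a compliance query using the parameters above; the session id is a placeholder.

```typescript
// Hypothetical obs_query_verifications arguments: all human rejections
// recorded for one session.
const verificationArgs = {
  sessionId: 'session-1234',       // placeholder session id
  verificationType: 'rejection',
};
```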
Environment Variables
Langfuse
| Variable | Required | Description |
|---|---|---|
| LANGFUSE_ENDPOINT | Yes | API endpoint URL |
| LANGFUSE_PUBLIC_KEY | Yes | Public key (pk-lf-…) |
| LANGFUSE_SECRET_KEY | Yes | Secret key (sk-lf-…) |
Confident AI
| Variable | Required | Description |
|---|---|---|
| CONFIDENT_ENDPOINT | No | Custom endpoint |
| CONFIDENT_API_KEY | Yes | API key |
Arize Phoenix
| Variable | Required | Description |
|---|---|---|
| PHOENIX_COLLECTOR_ENDPOINT | Yes | Collector URL |
| PHOENIX_API_KEY | Yes | Bearer token |
Datadog
| Variable | Required | Description |
|---|---|---|
| DD_API_KEY | Yes | Datadog API key |
| DD_SITE | No | Site (datadoghq.com, eu, us3, us5, ap1) |
| DD_LLMOBS_ML_APP | No | LLM application name |
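The sketch below shows one way to fail fast when a variable marked "Yes" in the tables above is missing before attempting an export. The helper itself is illustrative, not part of the codebase.

```typescript
// Required variables per platform, mirroring the "Required: Yes" rows above.
const REQUIRED_ENV: Record<string, string[]> = {
  langfuse: ['LANGFUSE_ENDPOINT', 'LANGFUSE_PUBLIC_KEY', 'LANGFUSE_SECRET_KEY'],
  confident: ['CONFIDENT_API_KEY'],
  phoenix: ['PHOENIX_COLLECTOR_ENDPOINT', 'PHOENIX_API_KEY'],
  datadog: ['DD_API_KEY'],
};

// Illustrative guard: throw before export if any required variable is unset.
function assertExportEnv(platform: keyof typeof REQUIRED_ENV): void {
  const missing = REQUIRED_ENV[platform].filter((name) => !process.env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing env vars for ${platform}: ${missing.join(', ')}`);
  }
}
```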
OTel Event Structure
```
Trace: Customer Support Query
├── Span: invoke_agent CustomerSupportBot
│   ├── Span: chat claude-3-opus
│   │   └── Event: gen_ai.evaluation.result
│   │       ├── gen_ai.evaluation.name: "Relevance"
│   │       ├── gen_ai.evaluation.score.value: 0.92
│   │       ├── gen_ai.evaluation.score.label: "relevant"
│   │       └── gen_ai.evaluation.explanation: "Response addresses query"
│   │
│   └── Span: execute_tool lookup_customer
│       └── Event: gen_ai.evaluation.result
│           ├── gen_ai.evaluation.name: "ToolCorrectness"
│           └── gen_ai.evaluation.score.label: "pass"
```
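For orientation, a minimal sketch using the standard @opentelemetry/api span API shows how such an event can be attached to the chat span. The tracer name and values are illustrative, and this is not necessarily how the codebase emits them.

```typescript
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('quality-evaluation');

tracer.startActiveSpan('chat claude-3-opus', (span) => {
  // ... model call and evaluation happen here ...
  span.addEvent('gen_ai.evaluation.result', {
    'gen_ai.evaluation.name': 'Relevance',
    'gen_ai.evaluation.score.value': 0.92,
    'gen_ai.evaluation.score.label': 'relevant',
    'gen_ai.evaluation.explanation': 'Response addresses query',
  });
  span.end();
});
```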
Security
| Feature | Implementation |
|---|---|
| DNS rebinding protection | URL validation, HTTPS-only for exports |
| Memory protection | 600MB OOM threshold, MAX_AGGREGATION_GROUPS=10,000 |
| Credential sanitization | Masked in logs and error messages |
| Timestamp validation | Year 2000-3000 range |
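As an example of the timestamp rule, a check along these lines rejects values outside the accepted range; the function is a sketch, not the actual implementation.

```typescript
// Sketch: accept only parseable timestamps whose year falls in 2000-3000.
function isTimestampInRange(timestamp: string): boolean {
  const parsed = new Date(timestamp);
  if (Number.isNaN(parsed.getTime())) return false;
  const year = parsed.getUTCFullYear();
  return year >= 2000 && year <= 3000;
}
```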
File Structure
```
src/
├── backends/
│   ├── index.ts                 # EvaluationResult, EvaluatorType types
│   └── local-jsonl.ts           # queryEvaluations() method
├── lib/
│   ├── constants.ts             # Export env vars, HttpStatus
│   ├── export-utils.ts          # Shared export utilities
│   ├── langfuse-export.ts       # Langfuse OTLP export
│   ├── confident-export.ts      # Confident AI export
│   ├── phoenix-export.ts        # Arize Phoenix export
│   ├── datadog-export.ts        # Datadog LLM Obs export
│   ├── llm-as-judge.ts          # LLM-as-Judge patterns
│   └── verification-events.ts   # EU AI Act compliance
└── tools/
    ├── query-evaluations.ts     # obs_query_evaluations
    ├── export-langfuse.ts       # obs_export_langfuse
    ├── export-confident.ts      # obs_export_confident
    ├── export-phoenix.ts        # obs_export_phoenix
    ├── export-datadog.ts        # obs_export_datadog
    └── query-verifications.ts   # obs_query_verifications
```
Test Coverage
| Component | Tests | File |
|---|---|---|
| Query evaluations | 45+ | query-evaluations.test.ts |
| Langfuse export | 71 | langfuse-export.test.ts |
| Confident AI export | 40+ | confident-export.test.ts |
| Phoenix export | 50+ | phoenix-export.test.ts |
| Datadog export | 160 | datadog-export.test.ts |
| LLM-as-Judge | 108 | llm-as-judge.test.ts |
| Verifications | 30+ | query-verifications.test.ts |
Related Documentation
- LLM-as-Judge Architecture - G-Eval, QAG, bias mitigation
- Agent-as-Judge Architecture - Multi-agent evaluation patterns