Quality Evaluation Architecture

Version: 2.0.0 Status: Production Last Updated: 2026-02-03

Overview

Quality evaluation addresses the “invisible failure” problem where LLM systems appear operational but produce low-quality outputs. This layer provides evaluation event storage, multi-platform export, and LLM-as-Judge patterns.

Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                        Quality Evaluation Layer                               │
├──────────────────────────────────────────────────────────────────────────────┤
│  Storage & Query                                                              │
│  ├── EvaluationResult schema (OTel GenAI semantic conventions)               │
│  ├── queryEvaluations() - JSONL backend with streaming aggregation           │
│  └── obs_query_evaluations - MCP tool with filters + aggregations            │
├──────────────────────────────────────────────────────────────────────────────┤
│  Platform Exports                                                             │
│  ├── Langfuse        obs_export_langfuse    OTLP + basic auth               │
│  ├── Confident AI    obs_export_confident   API key + env tagging           │
│  ├── Arize Phoenix   obs_export_phoenix     Bearer auth + project org       │
│  └── Datadog         obs_export_datadog     Two-phase spans + eval metrics  │
├──────────────────────────────────────────────────────────────────────────────┤
│  LLM-as-Judge (see quality/llm-as-judge.md)                                  │
│  ├── G-Eval (CoT + logprobs)    ├── Circuit breaker                         │
│  ├── QAG (faithfulness)         ├── Retry with backoff                       │
│  ├── Position bias mitigation   └── Canary evaluations                       │
│  └── Multi-judge panels                                                       │
├──────────────────────────────────────────────────────────────────────────────┤
│  Compliance                                                                   │
│  └── obs_query_verifications - EU AI Act human verification tracking         │
└──────────────────────────────────────────────────────────────────────────────┘

Core Types

// src/backends/index.ts

interface EvaluationResult {
  timestamp: string;
  evaluationName: string;       // gen_ai.evaluation.name (required)
  scoreValue?: number;          // gen_ai.evaluation.score.value
  scoreLabel?: string;          // gen_ai.evaluation.score.label
  scoreUnit?: string;           // "ratio_0_1", "percentage"
  explanation?: string;         // gen_ai.evaluation.explanation
  evaluator?: string;           // Model, human, system identity
  evaluatorType?: 'llm' | 'human' | 'rule' | 'classifier';
  responseId?: string;          // gen_ai.response.id (correlation)
  traceId?: string;
  spanId?: string;
  sessionId?: string;
}

type EvaluatorType = 'llm' | 'human' | 'rule' | 'classifier';
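As a usage sketch, an event can be built and validated before it is written to the backend. `makeEvaluation` is a hypothetical helper, not part of the actual backend API, and the interface is trimmed here to the fields the sketch uses.

```typescript
// Trimmed copy of the EvaluationResult shape for this sketch.
interface EvaluationResult {
  timestamp: string;
  evaluationName: string;
  scoreValue?: number;
  scoreLabel?: string;
  evaluatorType?: 'llm' | 'human' | 'rule' | 'classifier';
}

// Hypothetical helper: defaults the timestamp and enforces the one
// required field (gen_ai.evaluation.name).
function makeEvaluation(
  partial: Omit<EvaluationResult, 'timestamp'> & { timestamp?: string }
): EvaluationResult {
  if (!partial.evaluationName) {
    throw new Error('evaluationName (gen_ai.evaluation.name) is required');
  }
  return {
    ...partial,
    timestamp: partial.timestamp ?? new Date().toISOString(),
  };
}
```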

MCP Tools

obs_query_evaluations

Query evaluation events with filtering and aggregation.

| Parameter | Type | Description |
|-----------|------|-------------|
| evaluationName | string | Substring match |
| scoreMin / scoreMax | number | Score range filter |
| scoreLabel | string | Exact match |
| evaluatorType | enum | llm, human, rule, classifier |
| aggregation | enum | avg, min, max, count, p50, p95, p99 |
| groupBy | array | evaluationName, scoreLabel, evaluator |
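The aggregation semantics can be sketched as a single pass that buckets scores by the groupBy key and computes statistics per bucket. The names here are illustrative, not the actual `queryEvaluations()` implementation (which streams JSONL rather than holding an in-memory array).

```typescript
// Illustrative aggregation pass: group scores, then compute count/avg/p95.
interface EvalRecord { evaluationName: string; scoreValue?: number }

function aggregate(records: EvalRecord[], groupBy: 'evaluationName') {
  const groups = new Map<string, number[]>();
  for (const r of records) {
    if (r.scoreValue === undefined) continue;  // unlabeled-only events are skipped
    const key = r[groupBy];
    const bucket = groups.get(key) ?? [];
    bucket.push(r.scoreValue);
    groups.set(key, bucket);
  }
  const out: Record<string, { count: number; avg: number; p95: number }> = {};
  for (const [key, scores] of groups) {
    scores.sort((a, b) => a - b);
    // Nearest-rank percentile; a real implementation may interpolate.
    const idx = Math.min(scores.length - 1, Math.ceil(0.95 * scores.length) - 1);
    out[key] = {
      count: scores.length,
      avg: scores.reduce((s, v) => s + v, 0) / scores.length,
      p95: scores[idx],
    };
  }
  return out;
}
```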

Export Tools

All export tools share common filters plus platform-specific options:

| Tool | Auth | Key Features |
|------|------|--------------|
| obs_export_langfuse | Basic auth (pk:sk) | OTLP /v1/traces, retry with backoff |
| obs_export_confident | API key | Environment tagging, metric collections |
| obs_export_phoenix | Bearer token | Project organization, legacy auth support |
| obs_export_datadog | DD_API_KEY | Two-phase export, auto metric type detection |
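The retry-with-backoff behavior shared by the exporters can be sketched independently of any one platform. `exportWithRetry` and its parameters are illustrative, not the actual export-utils API; the transport is injected so the retry logic stands alone.

```typescript
// Illustrative retry loop: retry only transient failures (5xx, 429)
// with exponential backoff, and return the final HTTP status.
async function exportWithRetry(
  sendFn: () => Promise<number>,  // injected transport; returns HTTP status
  maxAttempts = 3,
  baseDelayMs = 100
): Promise<number> {
  let status = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    status = await sendFn();
    if (status < 500 && status !== 429) return status;  // success or permanent error
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  return status;
}
```

For Langfuse, `sendFn` would POST the OTLP payload to /v1/traces with an `Authorization: Basic base64(pk:sk)` header built from the public and secret keys.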

obs_query_verifications

Query human verification events for EU AI Act compliance tracking.

| Parameter | Type | Description |
|-----------|------|-------------|
| sessionId | string | Session filter |
| verificationType | enum | approval, rejection, override, review |

Environment Variables

Langfuse

| Variable | Required | Description |
|----------|----------|-------------|
| LANGFUSE_ENDPOINT | Yes | API endpoint URL |
| LANGFUSE_PUBLIC_KEY | Yes | Public key (pk-lf-…) |
| LANGFUSE_SECRET_KEY | Yes | Secret key (sk-lf-…) |

Confident AI

| Variable | Required | Description |
|----------|----------|-------------|
| CONFIDENT_ENDPOINT | No | Custom endpoint |
| CONFIDENT_API_KEY | Yes | API key |

Arize Phoenix

| Variable | Required | Description |
|----------|----------|-------------|
| PHOENIX_COLLECTOR_ENDPOINT | Yes | Collector URL |
| PHOENIX_API_KEY | Yes | Bearer token |

Datadog

| Variable | Required | Description |
|----------|----------|-------------|
| DD_API_KEY | Yes | Datadog API key |
| DD_SITE | No | Site (datadoghq.com, eu, us3, us5, ap1) |
| DD_LLMOBS_ML_APP | No | LLM application name |
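Resolving these variables typically looks like the sketch below. The DD_SITE default of datadoghq.com matches the table above; the function name and error text are assumptions, not the actual constants.ts API.

```typescript
// Illustrative env resolution for the Datadog exporter: the required
// key fails fast, optional variables fall back to documented defaults.
function resolveDatadogConfig(env: Record<string, string | undefined>) {
  const apiKey = env['DD_API_KEY'];
  if (!apiKey) {
    throw new Error('DD_API_KEY is required for obs_export_datadog');
  }
  return {
    apiKey,
    site: env['DD_SITE'] ?? 'datadoghq.com',  // default Datadog site
    mlApp: env['DD_LLMOBS_ML_APP'],           // optional app name
  };
}
```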

OTel Event Structure

Trace: Customer Support Query
├── Span: invoke_agent CustomerSupportBot
│   ├── Span: chat claude-3-opus
│   │   └── Event: gen_ai.evaluation.result
│   │       ├── gen_ai.evaluation.name: "Relevance"
│   │       ├── gen_ai.evaluation.score.value: 0.92
│   │       ├── gen_ai.evaluation.score.label: "relevant"
│   │       └── gen_ai.evaluation.explanation: "Response addresses query"
│   │
│   └── Span: execute_tool lookup_customer
│       └── Event: gen_ai.evaluation.result
│           ├── gen_ai.evaluation.name: "ToolCorrectness"
│           └── gen_ai.evaluation.score.label: "pass"
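With a real OTel tracer, the events above would be attached via `span.addEvent('gen_ai.evaluation.result', attrs)`. A sketch of building that attribute map from an evaluation result (the helper name is hypothetical; the attribute keys follow the OTel GenAI semantic conventions shown in the tree):

```typescript
// Hypothetical mapper: EvaluationResult fields -> span event attributes.
// Optional fields are omitted rather than emitted as undefined.
function toEventAttributes(e: {
  evaluationName: string;
  scoreValue?: number;
  scoreLabel?: string;
  explanation?: string;
}): Record<string, string | number> {
  const attrs: Record<string, string | number> = {
    'gen_ai.evaluation.name': e.evaluationName,
  };
  if (e.scoreValue !== undefined) attrs['gen_ai.evaluation.score.value'] = e.scoreValue;
  if (e.scoreLabel) attrs['gen_ai.evaluation.score.label'] = e.scoreLabel;
  if (e.explanation) attrs['gen_ai.evaluation.explanation'] = e.explanation;
  return attrs;
}
```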

Security

| Feature | Implementation |
|---------|----------------|
| DNS rebinding | URL validation, HTTPS-only for exports |
| Memory protection | 600MB OOM threshold, MAX_AGGREGATION_GROUPS=10,000 |
| Credential sanitization | Masked in logs, error messages |
| Timestamp validation | Year 2000-3000 range |
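A masking sketch for the credential-sanitization row. The exact patterns the implementation redacts are not documented here, so the regexes below (Langfuse pk-lf-/sk-lf- keys, 32-character hex keys) are assumptions for illustration only.

```typescript
// Illustrative credential masker: redact values that look like API keys
// before a message reaches logs or error output.
function maskCredentials(message: string): string {
  return message
    .replace(/\b(pk|sk)-lf-[A-Za-z0-9-]+/g, '$1-lf-***')  // Langfuse keys
    .replace(/\b[0-9a-f]{32}\b/g, '***');                 // 32-hex API keys
}
```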

File Structure

src/
├── backends/
│   ├── index.ts                # EvaluationResult, EvaluatorType types
│   └── local-jsonl.ts          # queryEvaluations() method
├── lib/
│   ├── constants.ts            # Export env vars, HttpStatus
│   ├── export-utils.ts         # Shared export utilities
│   ├── langfuse-export.ts      # Langfuse OTLP export
│   ├── confident-export.ts     # Confident AI export
│   ├── phoenix-export.ts       # Arize Phoenix export
│   ├── datadog-export.ts       # Datadog LLM Obs export
│   ├── llm-as-judge.ts         # LLM-as-Judge patterns
│   └── verification-events.ts  # EU AI Act compliance
└── tools/
    ├── query-evaluations.ts    # obs_query_evaluations
    ├── export-langfuse.ts      # obs_export_langfuse
    ├── export-confident.ts     # obs_export_confident
    ├── export-phoenix.ts       # obs_export_phoenix
    ├── export-datadog.ts       # obs_export_datadog
    └── query-verifications.ts  # obs_query_verifications

Test Coverage

| Component | Tests | File |
|-----------|-------|------|
| Query evaluations | 45+ | query-evaluations.test.ts |
| Langfuse export | 71 | langfuse-export.test.ts |
| Confident AI export | 40+ | confident-export.test.ts |
| Phoenix export | 50+ | phoenix-export.test.ts |
| Datadog export | 160 | datadog-export.test.ts |
| LLM-as-Judge | 108 | llm-as-judge.test.ts |
| Verifications | 30+ | query-verifications.test.ts |