Quality Metrics Dashboard

Version: 2.0.1 | Status: Production | Last Updated: 2026-02-06

Overview

Programmatic quality monitoring across 7 pre-defined LLM evaluation metrics with configurable alert thresholds.

  • Entry point: computeDashboardSummary() consumes evaluation results grouped by metric name
  • Output: QualityDashboardSummary with aggregated values, triggered alerts, and health status
  • Extensible: MetricConfigBuilder fluent API for custom metrics
  • Integrations: SigNoz, Grafana, Datadog

Architecture

┌──────────────────────────────────────────────────────────────────────────────┐
│                     Quality Metrics Dashboard Pipeline                       │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Data Source                                                                 │
│  ├── EvaluationResult[] from JSONL backend / obs_query_evaluations           │
│  └── Map<string, EvaluationResult[]> keyed by metric name                    │
│                                                                              │
│         │                                                                    │
│         ▼                                                                    │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                    computeDashboardSummary()                            │  │
│  │                                                                        │  │
│  │  For each metric in QUALITY_METRICS + customMetrics:                    │  │
│  │  ┌───────────────┐  ┌───────────────────┐  ┌───────────────────────┐  │  │
│  │  │ Extract       │  │ Compute           │  │ Check Alert           │  │  │
│  │  │ scoreValue[]  │─▶│ Aggregations      │─▶│ Thresholds            │  │  │
│  │  │               │  │ (avg,p50,p95,...) │  │ (warning/critical)    │  │  │
│  │  └───────────────┘  └───────────────────┘  └──────────┬────────────┘  │  │
│  │                                                        │              │  │
│  │                                            ┌───────────▼───────────┐  │  │
│  │                                            │ Determine Health      │  │  │
│  │                                            │ Status                │  │  │
│  │                                            └───────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│         │                                                                    │
│         ▼                                                                    │
│  Output: QualityDashboardSummary                                             │
│  ├── overallStatus (worst of all metrics)                                    │
│  ├── metrics[] (QualityMetricResult per metric)                              │
│  ├── alerts[] (all triggered alerts with metricName)                         │
│  └── summary { totalMetrics, healthyMetrics, warningMetrics, ... }           │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Built-in Metrics

7 pre-defined metrics ship with recommended thresholds drawn from common LLM evaluation practice. They are exported as the constant QUALITY_METRICS: Record<string, QualityMetricConfig>.

| Metric | Display Name | Unit | Range | Aggregations |
|---|---|---|---|---|
| relevance | Response Relevance | score | 0-1 | avg, p50, p95, min, count |
| task_completion | Task Completion Rate | rate | 0-1 | avg, p50, count |
| tool_correctness | Tool Selection Accuracy | rate | 0-1 | avg, p50, count |
| hallucination | Hallucination Rate | rate | 0-1 | avg, p95, max, count |
| evaluation_latency | Evaluation Latency | seconds | 0-60 | avg, p50, p95, p99, max, count |
| faithfulness | Response Faithfulness | score | 0-1 | avg, p50, p95, count |
| coherence | Response Coherence | score | 0-1 | avg, p50, p95, count |
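
Each entry is a full QualityMetricConfig, so built-in configs can be read directly off the exported record:

import { QUALITY_METRICS } from './lib/quality-metrics.js';

// Read a built-in config straight from the exported record.
const relevance = QUALITY_METRICS['relevance'];
console.log(relevance.displayName);   // 'Response Relevance'
console.log(relevance.aggregations);  // ['avg', 'p50', 'p95', 'min', 'count']
console.log(relevance.range);         // { min: 0, max: 1 }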

Aggregation System

| Aggregation | Description | Algorithm |
|---|---|---|
| avg | Arithmetic mean | sum / count |
| min | Minimum value | First element of sorted array |
| max | Maximum value | Last element of sorted array |
| count | Sample count | Array length (always computed when data exists) |
| p50 | 50th percentile (median) | R-7 linear interpolation |
| p95 | 95th percentile | R-7 linear interpolation |
| p99 | 99th percentile | R-7 linear interpolation |

Percentile calculation (R-7): rank = (percentile / 100) * (n - 1), where n is the sample count; the result is linearly interpolated between sorted[floor(rank)] and sorted[ceil(rank)] of the ascending-sorted scores. All values except count are rounded to 4 decimal places.
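
A minimal sketch of this rule (illustrative helper, not the module's internals):

// Sketch of the R-7 percentile rule described above. Scores are sorted
// ascending, then the fractional rank is linearly interpolated.
function percentile(scores: number[], p: number): number | null {
  if (scores.length === 0) return null;
  const sorted = [...scores].sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  const weight = rank - lower;
  const value = sorted[lower] + weight * (sorted[upper] - sorted[lower]);
  // All values except count are rounded to 4 decimal places.
  return Math.round(value * 10000) / 10000;
}

percentile([0.78, 0.85, 0.92], 50); // 0.85 (median)
percentile([0.78, 0.85, 0.92], 95); // 0.913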

Alert Thresholds

Warning Thresholds

| Metric | Aggregation | Direction | Threshold | Message Template |
|---|---|---|---|---|
| relevance | p50 | below | 0.7 | Relevance p50 ({value}) below 0.7 threshold |
| task_completion | avg | below | 0.85 | Task completion rate ({value}) below 85% target |
| tool_correctness | avg | below | 0.95 | Tool correctness ({value}) below 95% target |
| hallucination | avg | above | 0.1 | Hallucination rate ({value}) above 10% threshold |
| evaluation_latency | p95 | above | 5.0 | Evaluation latency p95 ({value}s) exceeds 5s target |
| faithfulness | p50 | below | 0.8 | Faithfulness p50 ({value}) below 0.8 threshold |
| coherence | p50 | below | 0.75 | Coherence p50 ({value}) below 0.75 threshold |

Critical Thresholds

| Metric | Aggregation | Direction | Threshold | Message Template |
|---|---|---|---|---|
| relevance | p50 | below | 0.5 | Relevance p50 ({value}) critically low |
| task_completion | avg | below | 0.70 | Task completion rate ({value}) critically low |
| tool_correctness | avg | below | 0.85 | Tool correctness ({value}) critically low |
| hallucination | avg | above | 0.2 | Hallucination rate ({value}) critically high |
| evaluation_latency | p95 | above | 10.0 | Evaluation latency p95 ({value}s) critically high |
| faithfulness | p50 | below | 0.6 | Faithfulness p50 ({value}) critically low |
Coherence has no critical threshold by default; add one via MetricConfigBuilder if needed.
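
For example, a coherence config with a critical threshold could be passed via the customMetrics parameter. This is a sketch: the 0.5 cutoff is illustrative, and it assumes a same-named custom entry takes precedence over the built-in config.

import { createMetricConfig, computeDashboardSummary } from './lib/quality-metrics.js';
import type { EvaluationResult } from './backends/index.js';

// Sketch: coherence with the warning threshold from the table above plus an
// illustrative critical threshold at 0.5 (not a recommended value).
const coherenceWithCritical = createMetricConfig('coherence')
  .displayName('Response Coherence')
  .aggregations('avg', 'p50', 'p95', 'count')
  .alertBelow('p50', 0.75, 'warning', 'Coherence p50 ({value}) below 0.75 threshold')
  .alertBelow('p50', 0.5, 'critical', 'Coherence p50 ({value}) critically low')
  .build();

// registerQualityMetric() would throw here ('coherence' is a built-in name),
// so pass the override through the customMetrics parameter instead.
declare const evaluationsByMetric: Map<string, EvaluationResult[]>;
const summary = computeDashboardSummary(evaluationsByMetric, {
  coherence: coherenceWithCritical,
});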

Alert Direction

  • below: Fires when the computed value is less than the threshold. Used for quality metrics where higher is better (relevance, faithfulness, task completion, tool correctness, coherence).
  • above: Fires when the computed value exceeds the threshold. Used for metrics where lower is better (hallucination rate, evaluation latency).

The {value} placeholder in message templates is replaced with the actual value formatted to 4 decimal places.

Triggered alerts within each metric are sorted by severity: critical first, then warning, then info.
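
The check itself reduces to one comparison per threshold plus the severity sort. A minimal sketch of that behavior (illustrative types, not the module's checkAlertThresholds source):

// Sketch of the direction check and severity sort described above.
type Severity = 'info' | 'warning' | 'critical';
const SEVERITY_ORDER: Record<Severity, number> = { critical: 0, warning: 1, info: 2 };

interface ThresholdSketch {
  aggregation: string;
  value: number;
  direction: 'above' | 'below';
  severity: Severity;
  message: string; // template with {value} placeholder
}

function checkThresholdsSketch(
  values: Record<string, number | null>,
  thresholds: ThresholdSketch[],
) {
  return thresholds
    .filter((t) => {
      const actual = values[t.aggregation];
      if (actual === null || actual === undefined) return false;
      // 'below' fires on strictly less than; 'above' on strictly greater.
      return t.direction === 'below' ? actual < t.value : actual > t.value;
    })
    .map((t) => ({
      ...t,
      // {value} is replaced with the value formatted to 4 decimal places.
      message: t.message.replace('{value}', values[t.aggregation]!.toFixed(4)),
    }))
    .sort((a, b) => SEVERITY_ORDER[a.severity] - SEVERITY_ORDER[b.severity]);
}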

Health Status Hierarchy

After evaluating all thresholds, each metric receives a health status:

| Priority | Status | Condition |
|---|---|---|
| 1 | no_data | No evaluation scores available |
| 2 | critical | Any triggered alert has severity critical |
| 3 | warning | Any triggered alert has severity warning |
| 4 | healthy | No alerts triggered, data present |

The dashboard overallStatus is the worst status across all metrics. If any metric is critical, the dashboard is critical. If all metrics have no data, the dashboard is no_data.
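
A sketch of that rollup, inferred from the rules above and the Usage Examples output (mixed healthy and no_data metrics roll up to healthy):

// Sketch of the dashboard-level rollup. critical beats warning beats healthy;
// no_data ranks weakest, so the dashboard reports no_data only when every
// metric lacks data.
type HealthStatus = 'healthy' | 'warning' | 'critical' | 'no_data';
const RANK: Record<HealthStatus, number> = { critical: 3, warning: 2, healthy: 1, no_data: 0 };

function overallStatusSketch(statuses: HealthStatus[]): HealthStatus {
  return statuses.reduce((worst, s) => (RANK[s] > RANK[worst] ? s : worst), 'no_data');
}

overallStatusSketch(['healthy', 'no_data', 'no_data']); // 'healthy'
overallStatusSketch(['healthy', 'critical']);           // 'critical'
overallStatusSketch(['no_data', 'no_data']);            // 'no_data'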

Core Interfaces

// src/lib/quality-metrics.ts

type AlertSeverity = 'info' | 'warning' | 'critical';
type ThresholdDirection = 'above' | 'below';
// Referenced below; members match the Aggregation System table above.
type EvaluationAggregation = 'avg' | 'min' | 'max' | 'count' | 'p50' | 'p95' | 'p99';

interface QualityMetricConfig {
  name: string;                    // Metric name (matches gen_ai.evaluation.name)
  displayName: string;             // Human-readable display name
  description: string;             // Description for dashboard tooltips
  aggregations: EvaluationAggregation[];  // Functions to compute
  alerts: AlertThreshold[];        // Alert thresholds
  range: { min: number; max: number };    // Expected score range
  unit: 'score' | 'rate' | 'seconds' | 'percentage';
}

interface AlertThreshold {
  aggregation: EvaluationAggregation;  // Which aggregation to monitor
  value: number;                        // Threshold value
  direction: 'above' | 'below';        // Alert condition direction
  severity: 'info' | 'warning' | 'critical';
  message: string;                      // Template with {value} placeholder
}

interface QualityMetricResult {
  name: string;
  displayName: string;
  values: Record<EvaluationAggregation, number | null>;  // Computed aggregations
  sampleCount: number;             // Number of evaluations used
  alerts: TriggeredAlert[];        // Alerts that fired
  status: 'healthy' | 'warning' | 'critical' | 'no_data';
  period?: { start: string; end: string };
}

interface TriggeredAlert {
  severity: 'info' | 'warning' | 'critical';
  message: string;                 // Formatted message with actual value
  aggregation: EvaluationAggregation;
  threshold: number;               // Configured threshold
  actualValue: number;             // Current value
  direction: 'above' | 'below';
}

interface QualityDashboardSummary {
  overallStatus: 'healthy' | 'warning' | 'critical' | 'no_data';
  metrics: QualityMetricResult[];
  alerts: Array<TriggeredAlert & { metricName: string }>;
  summary: {
    totalMetrics: number;
    healthyMetrics: number;
    warningMetrics: number;
    criticalMetrics: number;
    noDataMetrics: number;
  };
  timestamp: string;               // ISO timestamp when computed
}

Usage Examples

Computing a Dashboard Summary

import { computeDashboardSummary } from './lib/quality-metrics.js';
import type { EvaluationResult } from './backends/index.js';

const evaluationsByMetric = new Map<string, EvaluationResult[]>();

evaluationsByMetric.set('relevance', [
  { timestamp: '2026-02-06T10:00:00Z', evaluationName: 'relevance', scoreValue: 0.85 },
  { timestamp: '2026-02-06T10:01:00Z', evaluationName: 'relevance', scoreValue: 0.92 },
  { timestamp: '2026-02-06T10:02:00Z', evaluationName: 'relevance', scoreValue: 0.78 },
]);

evaluationsByMetric.set('hallucination', [
  { timestamp: '2026-02-06T10:00:00Z', evaluationName: 'hallucination', scoreValue: 0.05 },
  { timestamp: '2026-02-06T10:01:00Z', evaluationName: 'hallucination', scoreValue: 0.08 },
]);

const dashboard = computeDashboardSummary(evaluationsByMetric);
// dashboard.overallStatus: 'healthy'
// dashboard.summary: { totalMetrics: 7, healthyMetrics: 2, noDataMetrics: 5, ... }
// dashboard.alerts: []

// With optional parameters:
const period = { start: '2026-02-06T00:00:00Z', end: '2026-02-06T23:59:59Z' };
const dashboardWithPeriod = computeDashboardSummary(evaluationsByMetric, undefined, period);

Creating a Custom Metric

import {
  createMetricConfig,
  registerQualityMetric,
  getAllQualityMetrics,
  computeDashboardSummary,
} from './lib/quality-metrics.js';

const toxicity = createMetricConfig('toxicity')
  .displayName('Toxicity Score')
  .description('Measures harmful or toxic content in responses')
  .aggregations('avg', 'p50', 'p95', 'max', 'count')
  .range(0, 1)
  .unit('score')
  .alertAbove('avg', 0.1, 'warning', 'Toxicity avg ({value}) above 10% threshold')
  .alertAbove('avg', 0.25, 'critical', 'Toxicity avg ({value}) critically high')
  .build();

// Option 1: Register in module-scoped registry (for getQualityMetric / getAllQualityMetrics)
registerQualityMetric(toxicity);

// Option 2: Pass custom metrics directly to computeDashboardSummary
const dashboard = computeDashboardSummary(evaluationsByMetric, { toxicity });

// Bridge: use the registry to feed custom metrics into the dashboard
const allMetrics = getAllQualityMetrics();
const dashboardFromRegistry = computeDashboardSummary(evaluationsByMetric, allMetrics);

Note: registerQualityMetric populates a module-scoped registry for getQualityMetric / getAllQualityMetrics. It does not automatically feed into computeDashboardSummary. Pass custom metrics explicitly via the second parameter, or bridge with getAllQualityMetrics().

Interpreting Alerts

import {
  computeDashboardSummary,
  formatMetricValue,
  getQualityMetric,
} from './lib/quality-metrics.js';

const dashboard = computeDashboardSummary(evaluationsByMetric);

for (const alert of dashboard.alerts) {
  console.log(`[${alert.severity.toUpperCase()}] ${alert.metricName}: ${alert.message}`);
  // [CRITICAL] relevance: Relevance p50 (0.4500) critically low
}

for (const metric of dashboard.metrics) {
  const config = getQualityMetric(metric.name);
  if (config && metric.values.avg !== null) {
    console.log(`${metric.displayName}: ${formatMetricValue(metric.values.avg, config.unit)}`);
  }
}

Edge Cases

  • Empty data: Metrics with no evaluations return no_data status and null aggregation values
  • Null scores: Evaluations where scoreValue is undefined or null are filtered out before aggregation
  • Empty Map: Calling computeDashboardSummary(new Map()) returns all 7 built-in metrics with no_data status (exercised in the sketch after this list)
  • Duplicate registration: registerQualityMetric() throws if metric name matches a built-in or already-registered custom metric
  • Unregister built-in: unregisterQualityMetric('relevance') returns false because built-in metrics are not stored in the custom registry
  • Registry scope: The custom metric registry is module-scoped. In multi-worker environments, each worker maintains its own registry
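
The empty-Map and duplicate-registration cases can be exercised directly; a minimal sketch (error message text is illustrative):

import {
  computeDashboardSummary,
  registerQualityMetric,
  QUALITY_METRICS,
} from './lib/quality-metrics.js';

// Empty Map: every built-in metric is reported with no_data status.
const empty = computeDashboardSummary(new Map());
// empty.overallStatus          -> 'no_data'
// empty.summary.noDataMetrics  -> 7

// Duplicate registration: re-registering a built-in name throws.
try {
  registerQualityMetric(QUALITY_METRICS['relevance']);
} catch (err) {
  console.error('registration rejected:', (err as Error).message);
}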

Value Formatting

formatMetricValue() produces display strings based on unit type:

| Unit | Format | Example Input | Example Output |
|---|---|---|---|
| score | 4 decimal places | 0.8567 | 0.8567 |
| rate | Percentage, 1 decimal | 0.95 | 95.0% |
| percentage | Percentage, 1 decimal | 0.85 | 85.0% |
| seconds | 2 decimals + "s" | 3.456 | 3.46s |
| (null) | N/A literal | null | N/A |
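
A minimal sketch mirroring these rules (illustrative, not the module's formatMetricValue source):

// Sketch of the documented formatting rules.
type Unit = 'score' | 'rate' | 'seconds' | 'percentage';

function formatValueSketch(value: number | null, unit: Unit): string {
  if (value === null) return 'N/A';
  switch (unit) {
    case 'rate':
    case 'percentage':
      return `${(value * 100).toFixed(1)}%`; // 0.95 -> "95.0%"
    case 'seconds':
      return `${value.toFixed(2)}s`;         // 3.456 -> "3.46s"
    default:
      return value.toFixed(4);               // 0.8567 -> "0.8567"
  }
}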

MetricConfigBuilder API

Fluent builder for creating custom QualityMetricConfig objects. Defaults: aggregations: ['avg', 'count'], range: { min: 0, max: 1 }, unit: 'score', alerts: [].

| Method | Signature | Description |
|---|---|---|
| displayName() | (name: string): this | Set display name |
| description() | (desc: string): this | Set description |
| aggregations() | (...aggs: EvaluationAggregation[]): this | Set aggregations to compute |
| range() | (min: number, max: number): this | Set expected value range |
| unit() | (unit: 'score' \| 'rate' \| 'seconds' \| 'percentage'): this | Set measurement unit |
| alertBelow() | (agg, value, severity, message?): this | Alert when aggregation falls below value |
| alertAbove() | (agg, value, severity, message?): this | Alert when aggregation exceeds value |
| build() | (): QualityMetricConfig | Finalize and return config |
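
A config built with only identity fields picks up those defaults; a quick sketch ('readability' is a hypothetical metric name):

import { createMetricConfig } from './lib/quality-metrics.js';

// Unset fields fall back to the defaults listed above.
const minimal = createMetricConfig('readability')
  .displayName('Readability')
  .build();
// minimal.aggregations -> ['avg', 'count']   (default)
// minimal.range        -> { min: 0, max: 1 } (default)
// minimal.unit         -> 'score'            (default)
// minimal.alerts       -> []                 (default)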

Computation Functions

| Function | Signature | Description |
|---|---|---|
| computeDashboardSummary | (evaluationsByMetric: Map<string, EvaluationResult[]>, customMetrics?: Record<string, QualityMetricConfig>, period?: { start: string; end: string }) => QualityDashboardSummary | Compute dashboard across all built-in + custom metrics. |
| computeQualityMetric | (evaluations: EvaluationResult[], config: QualityMetricConfig, period?: { start: string; end: string }) => QualityMetricResult | Compute a single metric from evaluation results. |
| computeAggregations | (scores: number[], aggregations: EvaluationAggregation[]) => Record<EvaluationAggregation, number \| null> | Compute aggregation values from raw scores. |
| checkAlertThresholds | (values: Record<EvaluationAggregation, number \| null>, thresholds: AlertThreshold[]) => TriggeredAlert[] | Check values against thresholds, return triggered alerts (sorted by severity). |
| determineHealthStatus | (alerts: TriggeredAlert[], hasData: boolean) => 'healthy' \| 'warning' \| 'critical' \| 'no_data' | Derive health status from triggered alerts. |
| formatMetricValue | (value: number \| null, unit: QualityMetricConfig['unit']) => string | Format a metric value for display. |
| createMetricConfig | (name: string) => MetricConfigBuilder | Create a fluent builder for custom metric configs. |

Metric Registration API

| Function | Signature | Description |
|---|---|---|
| registerQualityMetric | (config: QualityMetricConfig) => void | Register custom metric. Validates via Zod schema. Throws if name exists. |
| unregisterQualityMetric | (name: string) => boolean | Remove custom metric. Returns true if removed. Cannot remove built-in metrics. |
| getAllQualityMetrics | () => Record<string, QualityMetricConfig> | Get all metrics (built-in + custom). |
| getQualityMetric | (name: string) => QualityMetricConfig \| undefined | Get a specific metric by name. Checks built-in first, then custom. |

Zod validation via the exported qualityMetricConfigSchema enforces: name (1-100 chars), displayName (1-200 chars), description (0-1000 chars, empty string allowed), aggregations (min 1 entry), message (max 500 chars).
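
The documented constraints correspond roughly to the following sketch (not the exported qualityMetricConfigSchema verbatim; nested shapes are inferred from Core Interfaces):

import { z } from 'zod';

// Sketch of the documented validation constraints.
const aggregationSchema = z.enum(['avg', 'min', 'max', 'count', 'p50', 'p95', 'p99']);

const metricConfigSketch = z.object({
  name: z.string().min(1).max(100),
  displayName: z.string().min(1).max(200),
  description: z.string().max(1000), // empty string allowed
  aggregations: z.array(aggregationSchema).min(1),
  alerts: z.array(
    z.object({
      aggregation: aggregationSchema,
      value: z.number(),
      direction: z.enum(['above', 'below']),
      severity: z.enum(['info', 'warning', 'critical']),
      message: z.string().max(500),
    }),
  ),
  range: z.object({ min: z.number(), max: z.number() }),
  unit: z.enum(['score', 'rate', 'seconds', 'percentage']),
});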

Files

All tests in src/lib/quality-metrics.test.ts.

| File | Lines | Description |
|---|---|---|
| src/lib/quality-metrics.ts | 721 | Metric configs, aggregation, alerting, builder, registration |
| src/lib/quality-metrics.test.ts | 523 | Test suite (50 tests) |

Test Coverage

| Category | Tests |
|---|---|
| Pre-defined metrics (QUALITY_METRICS) | 6 |
| Aggregation computation | 10 |
| Alert threshold checking | 6 |
| Health status determination | 4 |
| Single metric computation | 5 |
| Dashboard summary | 5 |
| Metric registration | 7 |
| Value formatting | 5 |
| MetricConfigBuilder | 2 |
| Total | 50 |