Technical White Paper v1.0

Abstract

As Large Language Model (LLM) applications transition from experimental deployments to production-critical systems, the need for standardized observability practices has become paramount. This paper examines the current state of LLM observability standards, with particular focus on OpenTelemetry’s emerging GenAI semantic conventions, agent tracking methodologies, and quality measurement frameworks. We evaluate the observability-toolkit MCP server against these industry standards, identifying alignment areas and gaps. This document serves as an index to deeper technical analyses across five key domains: semantic conventions, agent observability, quality metrics, performance optimization, and tooling ecosystem.

Keywords: LLM observability, OpenTelemetry, GenAI semantic conventions, agent tracking, AI quality metrics, distributed tracing


1. Introduction

1.1 Problem Statement

The rapid adoption of LLM-based applications has outpaced the development of observability tooling, creating a fragmented landscape where teams rely on vendor-specific instrumentation, proprietary formats, and ad-hoc monitoring solutions. This fragmentation leads to:

  • Vendor lock-in through non-standard telemetry formats
  • Incomplete visibility into multi-step agent workflows
  • Inability to compare performance across providers and models
  • Quality blind spots where systems appear operational but produce low-quality outputs

1.2 Scope

This paper focuses on three primary areas:

  1. Standardization: OpenTelemetry GenAI semantic conventions (v1.39.0)
  2. Agent Tracking: Multi-turn, tool-use, and reasoning chain observability
  3. Quality Measurement: Production evaluation metrics beyond latency and throughput

1.3 Methodology

Research was conducted through:

  • Analysis of OpenTelemetry specification documents and GitHub discussions
  • Review of industry tooling (Langfuse, Arize Phoenix, Datadog LLM Observability)
  • Examination of academic literature on hallucination detection and LLM evaluation
  • Comparative analysis against the observability-toolkit MCP server implementation

2. Background: The Evolution of LLM Observability

2.1 Traditional ML Observability vs. LLM Observability

Traditional machine learning observability focused on:

  • Model accuracy metrics (precision, recall, F1)
  • Feature drift detection
  • Inference latency and throughput
  • Resource utilization

LLM applications introduce fundamentally different observability challenges:

Dimension         | Traditional ML           | LLM Applications
------------------|--------------------------|---------------------------------------
Input Nature      | Structured features      | Unstructured natural language
Output Nature     | Discrete classes/values  | Free-form generated text
Evaluation        | Ground truth comparison  | Subjective quality assessment
Cost Model        | Compute-based            | Token-based pricing
Failure Modes     | Classification errors    | Hallucinations, toxicity, irrelevance
Execution Pattern | Single inference         | Multi-turn, tool-augmented chains

2.2 The Three Pillars Extended

The traditional observability pillars (metrics, traces, logs) require extension for LLM systems:

┌─────────────────────────────────────────────────────────────────┐
│                    LLM Observability Pillars                     │
├─────────────────┬─────────────────┬─────────────────────────────┤
│     TRACES      │     METRICS     │           LOGS              │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ • Prompt chains │ • Token usage   │ • Prompt/completion content │
│ • Tool calls    │ • Latency (TTFT)│ • Error details             │
│ • Agent loops   │ • Cost per req  │ • Reasoning chains          │
│ • Retrieval     │ • Quality scores│ • Human feedback            │
└─────────────────┴─────────────────┴─────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    EVALUATION LAYER (NEW)                        │
├─────────────────────────────────────────────────────────────────┤
│ • Hallucination detection    • Answer relevancy                 │
│ • Factual accuracy           • Task completion                  │
│ • Tool correctness           • Safety/toxicity                  │
└─────────────────────────────────────────────────────────────────┘

2.3 Key Industry Developments (2024-2026)

Date     | Development                          | Impact
---------|--------------------------------------|-------------------------------
Apr 2024 | OTel GenAI SIG formation             | Standardization effort begins
Jun 2024 | GenAI semantic conventions draft     | Initial attribute definitions
Oct 2024 | Langfuse OTel support                | Open-source adoption
Dec 2024 | Datadog native OTel GenAI support    | Enterprise validation
Jan 2025 | OTel v1.37+ GenAI conventions        | Production-ready standards
Mar 2025 | Agent framework conventions proposed | Multi-agent standardization
Dec 2025 | OTel v1.39 GenAI conventions         | Agent/tool span semantics

3. OpenTelemetry GenAI Semantic Conventions

Deep Dive Reference: See Appendix A: OTel GenAI Attribute Reference

3.1 Overview

The OpenTelemetry GenAI semantic conventions (currently in Development status) establish a standardized schema for:

  • Spans: LLM inference calls, tool executions, agent invocations
  • Metrics: Token usage histograms, operation duration, latency breakdowns
  • Events: Input/output messages, system instructions, tool definitions
  • Attributes: Model parameters, provider metadata, conversation context

3.2 Core Span Attributes

3.2.1 Required Attributes

Attribute             | Type   | Description         | Example
----------------------|--------|---------------------|----------------------------------
gen_ai.operation.name | string | Operation type      | chat, invoke_agent, execute_tool
gen_ai.provider.name  | string | Provider identifier | anthropic, openai, aws.bedrock

3.2.2 Conditionally Required Attributes

Attribute              | Condition         | Type   | Example
-----------------------|-------------------|--------|------------------------------
gen_ai.request.model   | If available      | string | claude-3-opus-20240229
gen_ai.conversation.id | When available    | string | conv_5j66UpCpwteGg4YSxUnt7lPY
error.type             | If error occurred | string | timeout, rate_limit

3.2.3 Recommended Attributes

Attribute                      | Type     | Description
-------------------------------|----------|-----------------------------
gen_ai.request.temperature     | double   | Sampling temperature
gen_ai.request.max_tokens      | int      | Maximum output tokens
gen_ai.request.top_p           | double   | Nucleus sampling parameter
gen_ai.response.model          | string   | Actual model that responded
gen_ai.response.finish_reasons | string[] | Why generation stopped
gen_ai.usage.input_tokens      | int      | Prompt token count
gen_ai.usage.output_tokens     | int      | Completion token count
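
To make the attribute tables concrete, the sketch below attaches these attributes to a span using the OpenTelemetry Python API. The tracer name and the shape of the response dictionary are illustrative assumptions, not part of the convention.

from opentelemetry import trace

tracer = trace.get_tracer("genai.example")

def record_chat_span(response: dict) -> None:
    # Span naming convention from the spec: "{operation} {model}"
    with tracer.start_as_current_span("chat claude-3-opus-20240229") as span:
        # Required attributes
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        # Conditionally required attributes
        span.set_attribute("gen_ai.request.model", "claude-3-opus-20240229")
        # Recommended attributes (values and response keys are placeholders)
        span.set_attribute("gen_ai.request.temperature", 0.2)
        span.set_attribute("gen_ai.request.max_tokens", 1024)
        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.response.finish_reasons", response["finish_reasons"])
        span.set_attribute("gen_ai.usage.input_tokens", response["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["output_tokens"])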

3.3 Operation Types

The specification defines seven standard operation names:

gen_ai.operation.name:
├── chat                 # Chat completion (most common)
├── text_completion      # Legacy completion API
├── generate_content     # Multimodal generation
├── embeddings           # Vector embeddings
├── create_agent         # Agent instantiation
├── invoke_agent         # Agent execution
└── execute_tool         # Tool/function execution

3.4 Provider Identifiers

Standardized gen_ai.provider.name values:

Provider         | Value           | Notes
-----------------|-----------------|---------------
Anthropic        | anthropic       | Claude models
OpenAI           | openai          | GPT models
AWS Bedrock      | aws.bedrock     | Multi-model
Azure OpenAI     | azure.ai.openai | Azure-hosted
Google Gemini    | gcp.gemini      | AI Studio API
Google Vertex AI | gcp.vertex_ai   | Enterprise API
Cohere           | cohere          |
Mistral AI       | mistral_ai      |

3.5 Standard Metrics

3.5.1 Client Metrics

Metric                           | Type      | Unit    | Buckets
---------------------------------|-----------|---------|-----------------------------------------------------
gen_ai.client.token.usage        | Histogram | {token} | [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, …]
gen_ai.client.operation.duration | Histogram | s       | [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, …]

3.5.2 Server Metrics (for model hosting)

Metric                              | Type      | Unit | Purpose
------------------------------------|-----------|------|--------------------------
gen_ai.server.request.duration      | Histogram | s    | Total request time
gen_ai.server.time_to_first_token   | Histogram | s    | Prefill + queue latency
gen_ai.server.time_per_output_token | Histogram | s    | Decode phase performance
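
As an illustration, the sketch below records gen_ai.client.operation.duration as an OpenTelemetry histogram and overrides the default buckets with the boundaries listed above (truncated where the table is truncated). The meter name and attribute values are assumptions.

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

# Override the default histogram buckets with the spec's boundaries (truncated).
duration_view = View(
    instrument_name="gen_ai.client.operation.duration",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28]
    ),
)

metrics.set_meter_provider(MeterProvider(views=[duration_view]))
meter = metrics.get_meter("genai.metrics.example")

duration_hist = meter.create_histogram(
    name="gen_ai.client.operation.duration",
    unit="s",
    description="GenAI operation duration",
)

# Record one observation with the attributes the convention recommends.
duration_hist.record(
    0.42,
    attributes={"gen_ai.operation.name": "chat", "gen_ai.provider.name": "anthropic"},
)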

3.6 Content Handling

The specification addresses sensitive content through three approaches:

  1. Default: Do not capture prompts/completions
  2. Opt-in attributes: Record on spans (gen_ai.input.messages, gen_ai.output.messages)
  3. External storage: Upload to secure storage, record references

Recommended for production:
┌─────────────────────────────────────────────────────────┐
│  Span: gen_ai.operation.name = "chat"                   │
│  ├── gen_ai.input.messages.uri = "s3://bucket/msg/123"  │
│  └── gen_ai.output.messages.uri = "s3://bucket/msg/124" │
└─────────────────────────────────────────────────────────┘
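
A minimal sketch of approach 3, matching the diagram above: message bodies are uploaded to object storage and only URI references are recorded on the span. The bucket name, helper function, and the boto3 dependency are assumptions for illustration.

import json
import uuid

import boto3
from opentelemetry import trace

tracer = trace.get_tracer("genai.content.example")
s3 = boto3.client("s3")
BUCKET = "example-llm-content"  # hypothetical bucket name


def upload_messages(messages: list[dict]) -> str:
    # Store the raw content outside the telemetry pipeline and return a reference.
    key = f"msg/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(messages).encode())
    return f"s3://{BUCKET}/{key}"


def traced_chat(input_messages, call_model):
    with tracer.start_as_current_span("chat") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        # Record references only; the backend never sees prompt/completion text.
        span.set_attribute("gen_ai.input.messages.uri", upload_messages(input_messages))
        output_messages = call_model(input_messages)  # placeholder for a provider client
        span.set_attribute("gen_ai.output.messages.uri", upload_messages(output_messages))
        return output_messages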

4. Agent Observability Standards

Deep Dive Reference: See Appendix B: Agent Span Hierarchies

4.1 The Agent Observability Challenge

AI agents introduce observability complexity through:

  • Non-deterministic execution: Same input may produce different tool call sequences
  • Multi-turn reasoning: Extended context across many LLM calls
  • Tool orchestration: External system interactions within agent loops
  • Framework diversity: LangGraph, CrewAI, AutoGen, etc. have different patterns

4.2 Agent Application vs. Framework Distinction

The OpenTelemetry specification distinguishes:

Concept           | Definition                         | Examples
------------------|------------------------------------|----------------------------------------
Agent Application | Specific AI-driven entity          | Customer support bot, coding assistant
Agent Framework   | Infrastructure for building agents | LangGraph, CrewAI, Claude Code

4.3 Agent Span Semantics

4.3.1 Agent Creation Span

Span: create_agent {agent_name}
├── gen_ai.operation.name: "create_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.agent.description: "Handles tier-1 support queries"

4.3.2 Agent Invocation Span

Span: invoke_agent {agent_name}
├── gen_ai.operation.name: "invoke_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.conversation.id: "conv_xyz789"
    │
    ├── Child Span: chat claude-3-opus
    │   └── gen_ai.operation.name: "chat"
    │
    ├── Child Span: execute_tool get_customer_info
    │   ├── gen_ai.tool.name: "get_customer_info"
    │   ├── gen_ai.tool.type: "function"
    │   └── gen_ai.tool.call.id: "call_abc"
    │
    └── Child Span: chat claude-3-opus
        └── gen_ai.operation.name: "chat"
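
A sketch of the hierarchy above using the OpenTelemetry Python API; context propagation makes the chat and execute_tool spans children of the invoke_agent span automatically. The call_model and call_tool callables are placeholders for a real provider client and tool runtime.

from opentelemetry import trace

tracer = trace.get_tracer("agent.example")

def invoke_agent(agent_id, agent_name, conversation_id, user_message, call_model, call_tool):
    with tracer.start_as_current_span(f"invoke_agent {agent_name}") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.id", agent_id)
        agent_span.set_attribute("gen_ai.agent.name", agent_name)
        agent_span.set_attribute("gen_ai.conversation.id", conversation_id)

        # First model turn: decides which tool to call.
        with tracer.start_as_current_span("chat claude-3-opus") as chat_span:
            chat_span.set_attribute("gen_ai.operation.name", "chat")
            tool_call = call_model(user_message)

        # Tool execution gets its own child span.
        with tracer.start_as_current_span(f"execute_tool {tool_call['name']}") as tool_span:
            tool_span.set_attribute("gen_ai.operation.name", "execute_tool")
            tool_span.set_attribute("gen_ai.tool.name", tool_call["name"])
            tool_span.set_attribute("gen_ai.tool.type", "function")
            tool_span.set_attribute("gen_ai.tool.call.id", tool_call["id"])
            tool_result = call_tool(tool_call)

        # Second model turn: produces the final answer from the tool result.
        with tracer.start_as_current_span("chat claude-3-opus") as final_span:
            final_span.set_attribute("gen_ai.operation.name", "chat")
            return call_model(user_message, tool_result)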

4.4 Tool Execution Attributes

Attribute                  | Type   | Description
---------------------------|--------|--------------------------------------
gen_ai.tool.name           | string | Tool identifier
gen_ai.tool.type           | string | function, extension, datastore
gen_ai.tool.description    | string | Human-readable description
gen_ai.tool.call.id        | string | Unique call identifier
gen_ai.tool.call.arguments | any    | Input parameters (opt-in, sensitive)
gen_ai.tool.call.result    | any    | Output (opt-in, sensitive)

4.5 Framework Instrumentation Approaches

Approach      | Pros                             | Cons                   | Examples
--------------|----------------------------------|------------------------|----------------------
Baked-in      | Zero config, consistent          | Bloat, version lag     | CrewAI
External OTel | Decoupled, community-maintained  | Integration complexity | OpenLLMetry
OTel Contrib  | Official support, best practices | Review queue delays    | instrumentation-genai

4.6 Claude Code as Agent System

Claude Code exhibits agent characteristics:

  • Multi-turn conversation management
  • Tool execution (Bash, Read, Write, Edit, etc.)
  • Reasoning chains across tool calls
  • Session-based context

Current gap: Claude Code telemetry doesn’t emit standardized agent spans.


5. Quality and Evaluation Metrics

Deep Dive Reference: See Appendix C: LLM Evaluation Frameworks

5.1 The Quality Visibility Problem

Traditional observability answers: “Is the system up and performing?”

LLM observability must also answer: “Is the system producing good outputs?”

System Status Matrix:
                    │ Quality: Good    │ Quality: Bad
────────────────────┼──────────────────┼──────────────────
Performance: Good   │ Healthy          │ INVISIBLE FAILURE
Performance: Bad    │ Investigate      │ Obvious failure

The “invisible failure” quadrant is uniquely dangerous for LLM systems.

5.2 Core Quality Metrics

Metric           | Description                           | Measurement Method
-----------------|---------------------------------------|------------------------------------
Answer Relevancy | Output addresses input intent         | LLM-as-judge, embedding similarity
Faithfulness     | Output grounded in provided context   | LLM-as-judge, NLI models
Hallucination    | Fabricated or false information       | LLM-as-judge, fact verification
Task Completion  | Agent accomplished stated goal        | Rule-based + LLM assessment
Tool Correctness | Correct tools called with valid args  | Deterministic validation
Toxicity/Safety  | Output meets safety guidelines        | Classifier models, guardrails

5.3 LLM-as-Judge Pattern

The dominant approach for quality evaluation:

┌─────────────────────────────────────────────────────────────┐
│                    LLM-as-Judge Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Production LLM Call                                         │
│  ┌──────────┐    ┌───────────┐    ┌──────────┐             │
│  │  Input   │───▶│  Model A  │───▶│  Output  │             │
│  └──────────┘    └───────────┘    └──────────┘             │
│       │                                │                     │
│       │         Evaluation LLM         │                     │
│       │    ┌───────────────────────┐   │                     │
│       └───▶│       Model B         │◀──┘                     │
│            │  (Judge: GPT-4, etc.) │                         │
│            └───────────┬───────────┘                         │
│                        │                                     │
│                        ▼                                     │
│            ┌───────────────────────┐                         │
│            │   Quality Scores      │                         │
│            │ • Relevancy: 0.85     │                         │
│            │ • Faithfulness: 0.92  │                         │
│            │ • Hallucination: 0.08 │                         │
│            └───────────────────────┘                         │
└─────────────────────────────────────────────────────────────┘
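
A minimal sketch of this pipeline follows. The judge_model callable stands in for any evaluation LLM client (GPT-4 or otherwise), and the rubric prompt and 0-1 scale are illustrative rather than a standardized format.

import json

JUDGE_PROMPT = """You are an evaluation judge. Given a user input, the model's
output, and the retrieved context, return JSON with three fields scored 0 to 1:
"relevancy", "faithfulness", and "hallucination".

Input: {input}
Context: {context}
Output: {output}
"""

def evaluate_response(user_input: str, context: str, model_output: str, judge_model) -> dict:
    prompt = JUDGE_PROMPT.format(input=user_input, context=context, output=model_output)
    raw = judge_model(prompt)  # placeholder call to the evaluation LLM
    scores = json.loads(raw)   # assumes the judge returns strict JSON
    return {
        "relevancy": float(scores["relevancy"]),
        "faithfulness": float(scores["faithfulness"]),
        "hallucination": float(scores["hallucination"]),
    }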

5.4 Evaluation Tool Landscape (2026)

Tool            | Type        | Key Features
----------------|-------------|------------------------------------------------
Langfuse        | Open Source | Tracing, prompt management, evals, 19k+ stars
Arize Phoenix   | Open Source | OTel-native, OTLP ingestion, 7.8k stars
DeepEval        | Open Source | 14+ metrics, CI/CD integration
MLflow 3.0      | Open Source | GenAI evals, experiment tracking
Datadog LLM Obs | Commercial  | Hallucination detection, production monitoring
Braintrust      | Commercial  | Eval datasets, prompt playground

5.5 Production Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│                  Production Evaluation Flow                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. CAPTURE                2. EVALUATE              3. ITERATE   │
│  ┌─────────────┐          ┌─────────────┐         ┌───────────┐ │
│  │ Production  │          │ Async Eval  │         │ Feedback  │ │
│  │   Traces    │─────────▶│   Workers   │────────▶│   Loop    │ │
│  └─────────────┘          └─────────────┘         └───────────┘ │
│        │                        │                       │        │
│        │                        │                       │        │
│        ▼                        ▼                       ▼        │
│  ┌─────────────┐          ┌─────────────┐         ┌───────────┐ │
│  │   Span +    │          │   Quality   │         │  Prompt   │ │
│  │  Metadata   │          │   Scores    │         │ Iteration │ │
│  └─────────────┘          └─────────────┘         └───────────┘ │
│                                                                  │
│  Promote interesting traces to evaluation datasets               │
└─────────────────────────────────────────────────────────────────┘
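
A sketch of the capture-evaluate stage of this flow: completed production spans are queued and scored off the hot path, then persisted keyed by trace ID so scores can be joined back to traces. The queue contents and the score_store interface are assumed stand-ins, not a specific backend API.

import asyncio

async def eval_worker(trace_queue: asyncio.Queue, evaluate, score_store) -> None:
    while True:
        span_record = await trace_queue.get()  # captured production trace (dict)
        scores = evaluate(
            span_record["input"],
            span_record.get("context", ""),
            span_record["output"],
        )
        # Persist keyed by trace id so dashboards can join quality scores to traces.
        await score_store.put(span_record["trace_id"], scores)
        trace_queue.task_done()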

5.6 Hallucination Detection Challenges

Recent research (arXiv:2504.18114) reveals limitations:

  • Metrics often fail to align with human judgments
  • Inconsistent gains with model parameter scaling
  • Myopic view focusing on surface-level patterns
  • GPT-4 as judge yields best overall results but has cost implications

6. Comparative Analysis: observability-toolkit MCP

6.1 Architecture Overview

The observability-toolkit MCP server provides:

┌─────────────────────────────────────────────────────────────┐
│                  observability-toolkit v1.6.0               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Data Sources:                                               │
│  ├── Local JSONL files (~/.claude/telemetry/)               │
│  └── SigNoz Cloud API (optional)                            │
│                                                              │
│  Tools:                                                      │
│  ├── obs_query_traces      - Distributed trace queries      │
│  ├── obs_query_metrics     - Metric aggregation             │
│  ├── obs_query_logs        - Log search with boolean ops    │
│  ├── obs_query_llm_events  - LLM-specific event queries     │
│  ├── obs_health_check      - System health + cache stats    │
│  ├── obs_context_stats     - Context window utilization     │
│  └── obs_get_trace_url     - SigNoz trace viewer links      │
│                                                              │
│  Performance Features:                                       │
│  ├── LRU query caching                                       │
│  ├── File indexing (.idx sidecars)                          │
│  ├── Gzip compression support                               │
│  ├── Streaming with early termination                       │
│  └── Circuit breaker for SigNoz                             │
└─────────────────────────────────────────────────────────────┘

6.2 OTel GenAI Compliance Matrix

Requirement                    | Spec           | Implementation     | Status
-------------------------------|----------------|--------------------|----------
gen_ai.operation.name          | Required       | Not captured       | Gap
gen_ai.provider.name           | Required       | Uses gen_ai.system | Partial
gen_ai.request.model           | Cond. Required | Captured           | Compliant
gen_ai.conversation.id         | Cond. Required | Not captured       | Gap
gen_ai.usage.input_tokens      | Recommended    | Captured           | Compliant
gen_ai.usage.output_tokens     | Recommended    | Captured           | Compliant
gen_ai.response.model          | Recommended    | Not captured       | Gap
gen_ai.response.finish_reasons | Recommended    | Not captured       | Gap
gen_ai.request.temperature     | Recommended    | Not captured       | Gap
gen_ai.request.max_tokens      | Recommended    | Not captured       | Gap

Compliance Score: 4/10 attributes implemented (3 compliant, 1 partial)

6.3 Agent Tracking Analysis

Capability                               | Spec Requirement | Implementation  | Status
-----------------------------------------|------------------|-----------------|--------
Agent spans (create_agent, invoke_agent) | Defined          | Not implemented | Gap
Tool execution spans (execute_tool)      | Defined          | Not implemented | Gap
gen_ai.agent.id                          | Recommended      | Not captured    | Gap
gen_ai.agent.name                        | Recommended      | Not captured    | Gap
gen_ai.tool.name                         | Recommended      | Not captured    | Gap
gen_ai.tool.call.id                      | Recommended      | Not captured    | Gap
Session correlation                      | Custom           | Uses session.id | Partial

Agent Compliance: Minimal - relies on generic trace attributes

6.4 Metrics Compliance

Metric                              | Spec                    | Implementation            | Status
------------------------------------|-------------------------|---------------------------|--------
gen_ai.client.token.usage           | Histogram w/ buckets    | Flat storage              | Partial
gen_ai.client.operation.duration    | Histogram w/ buckets    | Available via traces      | Partial
gen_ai.server.time_to_first_token   | Histogram               | Not implemented           | Gap
gen_ai.server.time_per_output_token | Histogram               | Not implemented           | Gap
Aggregation support                 | sum, avg, p50, p95, p99 | sum, avg, min, max, count | Partial

6.5 Quality/Eval Capabilities

Capability                | Industry Standard | Implementation    | Status
--------------------------|-------------------|-------------------|--------
Hallucination detection   | LLM-as-judge      | Not implemented   | Gap
Answer relevancy scoring  | Automated eval    | Not implemented   | Gap
Human feedback collection | Annotation API    | Not implemented   | Gap
Eval dataset management   | Trace promotion   | Not implemented   | Gap
Cost tracking             | Price * tokens    | Token counts only | Partial
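
The cost-tracking gap reduces to joining stored token counts with a per-model price table. A minimal sketch follows; the prices are placeholders, not current list prices, and the model names are illustrative.

# Illustrative cost estimation layer: token counts joined with a price table.
PRICE_PER_MTOK = {
    # model name: (input USD per 1M tokens, output USD per 1M tokens) -- placeholders
    "claude-3-opus-20240229": (15.00, 75.00),
    "gpt-4o": (2.50, 10.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICE_PER_MTOK[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price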

6.6 Strengths Relative to Industry

Strength                 | Description                            | Competitive Position
-------------------------|----------------------------------------|---------------------
Multi-directory scanning | Aggregates telemetry across locations  | Unique
Gzip support             | Transparent compression handling       | Standard
Index files              | Fast lookups via .idx sidecars         | Above average
Query caching            | LRU with TTL and stats                 | Standard
OTLP export              | Standard format output                 | Compliant
Local-first              | No cloud dependency required           | Differentiator
Claude Code integration  | Purpose-built for CC sessions          | Unique

7. Recommendations and Roadmap

7.1 Priority Matrix

                        Impact
                    Low         High
                ┌───────────┬───────────┐
           High │ P3: Nice  │ P1: Do    │
    Effort      │  to have  │   First   │
                ├───────────┼───────────┤
            Low │ P4: Maybe │ P2: Quick │
                │   later   │    Wins   │
                └───────────┴───────────┘

7.2 Phase 1: OTel GenAI Compliance (P1/P2)

Goal: Achieve 80%+ compliance with GenAI semantic conventions

Task                                        | Priority | Effort | Impact
--------------------------------------------|----------|--------|-------
Add gen_ai.operation.name to LLM events     | P1       | Low    | High
Rename gen_ai.system → gen_ai.provider.name | P2       | Low    | Medium
Capture gen_ai.conversation.id              | P1       | Medium | High
Add gen_ai.response.model                   | P2       | Low    | Medium
Add gen_ai.response.finish_reasons          | P2       | Low    | Medium

Estimated effort: 1-2 development cycles
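
A hypothetical sketch of the Phase 1 renames applied to a stored LLM event record; the flat key names here are assumptions for illustration, not the toolkit's documented schema.

def migrate_llm_event(event: dict) -> dict:
    migrated = dict(event)
    # Rename gen_ai.system -> gen_ai.provider.name
    if "gen_ai.system" in migrated:
        migrated["gen_ai.provider.name"] = migrated.pop("gen_ai.system")
    # Default the newly required operation name when absent.
    migrated.setdefault("gen_ai.operation.name", "chat")
    return migrated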

7.3 Phase 2: Agent Observability (P1)

Goal: First-class support for agent/tool span semantics

Task                               | Priority | Effort | Impact
-----------------------------------|----------|--------|-------
Define agent span schema           | P1       | Medium | High
Tool execution span tracking       | P1       | Medium | High
Agent invocation correlation       | P1       | High   | High
Multi-agent workflow visualization | P3       | High   | Medium

Estimated effort: 2-3 development cycles

7.4 Phase 3: Metrics Enhancement (P2)

Goal: Standard histogram metrics with OTel bucket boundaries

Task                            | Priority | Effort | Impact
--------------------------------|----------|--------|-------
Implement histogram aggregation | P2       | Medium | Medium
Add p50/p95/p99 percentiles     | P2       | Low    | Medium
Time-to-first-token metric      | P2       | Medium | Medium
Cost estimation layer           | P3       | Low    | Low

Estimated effort: 1-2 development cycles
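
For the percentile task, a simple sketch over raw duration samples extracted from traces; once durations are stored as OTel histograms, percentiles would instead be estimated from bucket counts.

def percentile(samples: list[float], q: float) -> float:
    # Linear interpolation between the two nearest ranked samples.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    idx = (len(ordered) - 1) * q
    lo, hi = int(idx), min(int(idx) + 1, len(ordered) - 1)
    frac = idx - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

durations = [0.21, 0.35, 0.42, 0.58, 1.10, 2.40]  # example values in seconds
summary = {p: percentile(durations, p / 100) for p in (50, 95, 99)}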

7.5 Phase 4: Quality Layer (P3)

Goal: Optional integration with evaluation frameworks

Task                          | Priority | Effort | Impact
------------------------------|----------|--------|-------
Langfuse integration research | P3       | Low    | Medium
Eval score storage schema     | P3       | Medium | Medium
LLM-as-judge hook support     | P4       | High   | High
Human feedback API            | P4       | High   | Medium

Estimated effort: 3+ development cycles (optional)

7.6 Implementation Roadmap

2026 Q1                    Q2                    Q3
───────────────────────────────────────────────────────────
│ Phase 1: OTel Compliance │ Phase 2: Agents    │ Phase 3  │
│ ├── gen_ai.operation.name│ ├── Agent spans    │ Metrics  │
│ ├── gen_ai.provider.name │ ├── Tool spans     │ ├── Hist │
│ ├── gen_ai.conversation  │ └── Correlation    │ └── TTFT │
│ └── finish_reasons       │                    │          │
───────────────────────────────────────────────────────────
                                                     │
                                                     ▼
                                              Phase 4: Quality
                                              (Future/Optional)

8. Future Research Directions

8.1 Emerging Standards

  1. MCP Observability Conventions: As Model Context Protocol gains adoption, standardized telemetry for MCP tool calls may emerge
  2. Agentic System Semantics: OTel Issue #2664 proposes comprehensive agent conventions including tasks, artifacts, and memory
  3. Multi-Agent Coordination: Observability for agent-to-agent communication patterns

8.2 Quality Measurement Evolution

  1. Real-time hallucination detection: Moving from batch eval to streaming assessment
  2. Automated regression detection: Identifying quality degradation across model updates
  3. Domain-specific evaluators: Specialized judges for code, medical, legal domains

8.3 Cost Optimization

  1. Token budget observability: Real-time context window utilization tracking
  2. Model routing telemetry: Visibility into cost/quality tradeoff decisions
  3. Caching effectiveness: Measuring semantic cache hit rates

8.4 Privacy and Compliance

  1. Content redaction pipelines: OTel Collector processors for PII removal
  2. Audit trail requirements: Regulatory compliance for AI decisions
  3. Differential privacy: Aggregated telemetry without individual exposure

9. References

9.1 OpenTelemetry Specifications

  1. OpenTelemetry. “Semantic conventions for generative AI systems.” https://opentelemetry.io/docs/specs/semconv/gen-ai/

  2. OpenTelemetry. “Semantic conventions for generative client AI spans.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/

  3. OpenTelemetry. “Semantic conventions for generative AI metrics.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

  4. OpenTelemetry. “Gen AI Registry Attributes.” https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/

9.2 Industry Publications

  1. Liu, G. & Solomon, S. “AI Agent Observability - Evolving Standards and Best Practices.” OpenTelemetry Blog, March 2025. https://opentelemetry.io/blog/2025/ai-agent-observability/

  2. Jain, I. “An Introduction to Observability for LLM-based applications using OpenTelemetry.” OpenTelemetry Blog, June 2024. https://opentelemetry.io/blog/2024/llm-observability/

  3. Datadog. “Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions.” December 2025. https://www.datadoghq.com/blog/llm-otel-semantic-convention/

  4. Horovits, D. “OpenTelemetry for GenAI and the OpenLLMetry project.” Medium, November 2025. https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771

9.3 Evaluation and Quality

  1. Confident AI. “LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide.” https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation

  2. DeepEval. “Hallucination Metric Documentation.” https://deepeval.com/docs/metrics-hallucination

  3. “Evaluating Evaluation Metrics – The Mirage of Hallucination Detection.” arXiv:2504.18114, 2025.

9.4 Tools and Frameworks

  1. Langfuse. “OpenTelemetry (OTel) for LLM Observability.” https://langfuse.com/blog/2024-10-opentelemetry-for-llm-observability

  2. Traceloop. “OpenLLMetry: Open-source observability for GenAI.” https://github.com/traceloop/openllmetry

  3. Anthropic. “Building effective agents.” https://www.anthropic.com/research/building-effective-agents


10. Appendices

Appendix A: OTel GenAI Attribute Reference

Status: Index entry for future deep dive

Complete reference of all gen_ai.* attributes with:

  • Full attribute list with types and examples
  • Requirement levels by operation type
  • Provider-specific extensions
  • Migration guide from pre-1.37 conventions

Appendix B: Agent Span Hierarchies

Status: Index entry for future deep dive

Detailed span hierarchy patterns for:

  • Single-agent workflows
  • Multi-agent orchestration
  • Tool execution chains
  • Error propagation patterns
  • Correlation strategies

Appendix C: LLM Evaluation Frameworks

Status: Index entry for future deep dive

Comparative analysis of:

  • Langfuse evaluation capabilities
  • Arize Phoenix integration patterns
  • DeepEval metric implementations
  • Custom evaluator development
  • Production deployment patterns

Appendix D: observability-toolkit Schema Migration

Status: Index entry for future deep dive

Migration guide covering:

  • Current schema documentation
  • Target OTel-compliant schema
  • Backward compatibility strategy
  • Data migration procedures
  • Validation test suites

Appendix E: Cost Tracking Implementation

Status: Index entry for future deep dive

Cost observability implementation covering:

  • Provider pricing models
  • Token-to-cost calculation
  • Budget alerting patterns
  • Cost attribution by session/user
  • Optimization recommendations

Document History

Version | Date       | Author            | Changes
--------|------------|-------------------|---------------------
1.0     | 2026-01-29 | Research Analysis | Initial publication

This document was produced through systematic web research and comparative analysis. It represents the state of LLM observability standards as of January 2026 and should be reviewed periodically as the field evolves rapidly.
