Technical White Paper v1.0
Abstract
As Large Language Model (LLM) applications transition from experimental deployments to production-critical systems, the need for standardized observability practices has become paramount. This paper examines the current state of LLM observability standards, with particular focus on OpenTelemetry’s emerging GenAI semantic conventions, agent tracking methodologies, and quality measurement frameworks. We evaluate the observability-toolkit MCP server against these industry standards, identifying alignment areas and gaps. This document serves as an index to deeper technical analyses across five key domains: semantic conventions, agent observability, quality metrics, performance optimization, and tooling ecosystem.
Keywords: LLM observability, OpenTelemetry, GenAI semantic conventions, agent tracking, AI quality metrics, distributed tracing
1. Introduction
1.1 Problem Statement
The rapid adoption of LLM-based applications has outpaced the development of observability tooling, creating a fragmented landscape where teams rely on vendor-specific instrumentation, proprietary formats, and ad-hoc monitoring solutions. This fragmentation leads to:
- Vendor lock-in through non-standard telemetry formats
- Incomplete visibility into multi-step agent workflows
- Inability to compare performance across providers and models
- Quality blind spots where systems appear operational but produce low-quality outputs
1.2 Scope
This paper focuses on three primary areas:
- Standardization: OpenTelemetry GenAI semantic conventions (v1.39.0)
- Agent Tracking: Multi-turn, tool-use, and reasoning chain observability
- Quality Measurement: Production evaluation metrics beyond latency and throughput
1.3 Methodology
Research was conducted through:
- Analysis of OpenTelemetry specification documents and GitHub discussions
- Review of industry tooling (Langfuse, Arize Phoenix, Datadog LLM Observability)
- Examination of academic literature on hallucination detection and LLM evaluation
- Comparative analysis against the observability-toolkit MCP server implementation
2. Background: The Evolution of LLM Observability
2.1 Traditional ML Observability vs. LLM Observability
Traditional machine learning observability focused on:
- Model accuracy metrics (precision, recall, F1)
- Feature drift detection
- Inference latency and throughput
- Resource utilization
LLM applications introduce fundamentally different observability challenges:
| Dimension | Traditional ML | LLM Applications |
|---|---|---|
| Input Nature | Structured features | Unstructured natural language |
| Output Nature | Discrete classes/values | Free-form generated text |
| Evaluation | Ground truth comparison | Subjective quality assessment |
| Cost Model | Compute-based | Token-based pricing |
| Failure Modes | Classification errors | Hallucinations, toxicity, irrelevance |
| Execution Pattern | Single inference | Multi-turn, tool-augmented chains |
2.2 The Three Pillars Extended
The traditional observability pillars (metrics, traces, logs) require extension for LLM systems:
┌─────────────────────────────────────────────────────────────────┐
│ LLM Observability Pillars │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ TRACES │ METRICS │ LOGS │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ • Prompt chains │ • Token usage │ • Prompt/completion content │
│ • Tool calls │ • Latency (TTFT)│ • Error details │
│ • Agent loops │ • Cost per req │ • Reasoning chains │
│ • Retrieval │ • Quality scores│ • Human feedback │
└─────────────────┴─────────────────┴─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION LAYER (NEW) │
├─────────────────────────────────────────────────────────────────┤
│ • Hallucination detection • Answer relevancy │
│ • Factual accuracy • Task completion │
│ • Tool correctness • Safety/toxicity │
└─────────────────────────────────────────────────────────────────┘
2.3 Key Industry Developments (2024-2026)
| Date | Development | Impact |
|---|---|---|
| Apr 2024 | OTel GenAI SIG formation | Standardization effort begins |
| Jun 2024 | GenAI semantic conventions draft | Initial attribute definitions |
| Oct 2024 | Langfuse OTel support | Open-source adoption |
| Dec 2024 | Datadog native OTel GenAI support | Enterprise validation |
| Jan 2025 | OTel v1.37+ GenAI conventions | Production-ready standards |
| Mar 2025 | Agent framework conventions proposed | Multi-agent standardization |
| Dec 2025 | OTel v1.39 GenAI conventions | Agent/tool span semantics |
3. OpenTelemetry GenAI Semantic Conventions
Deep Dive Reference: See Appendix A: OTel GenAI Attribute Reference
3.1 Overview
The OpenTelemetry GenAI semantic conventions (currently in Development status) establish a standardized schema for:
- Spans: LLM inference calls, tool executions, agent invocations
- Metrics: Token usage histograms, operation duration, latency breakdowns
- Events: Input/output messages, system instructions, tool definitions
- Attributes: Model parameters, provider metadata, conversation context
3.2 Core Span Attributes
3.2.1 Required Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| gen_ai.operation.name | string | Operation type | chat, invoke_agent, execute_tool |
| gen_ai.provider.name | string | Provider identifier | anthropic, openai, aws.bedrock |
3.2.2 Conditionally Required Attributes
| Attribute | Condition | Type | Example |
|---|---|---|---|
| gen_ai.request.model | If available | string | claude-3-opus-20240229 |
| gen_ai.conversation.id | When available | string | conv_5j66UpCpwteGg4YSxUnt7lPY |
| error.type | If error occurred | string | timeout, rate_limit |
3.2.3 Recommended Attributes
| Attribute | Type | Description |
|---|---|---|
| gen_ai.request.temperature | double | Sampling temperature |
| gen_ai.request.max_tokens | int | Maximum output tokens |
| gen_ai.request.top_p | double | Nucleus sampling parameter |
| gen_ai.response.model | string | Actual model that responded |
| gen_ai.response.finish_reasons | string[] | Why generation stopped |
| gen_ai.usage.input_tokens | int | Prompt token count |
| gen_ai.usage.output_tokens | int | Completion token count |
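To make the attribute set concrete, the following sketch shows how client-side instrumentation might record these attributes with the OpenTelemetry Python SDK. The span name follows the "{operation} {model}" pattern used throughout the conventions; the call_model function is a hypothetical placeholder for a real provider client.

```python
# Minimal sketch: recording GenAI span attributes with the OpenTelemetry Python SDK.
# `call_model` is a hypothetical placeholder for a provider client.
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.instrumentation")

def instrumented_chat(messages, model="claude-3-opus-20240229"):
    # Span name follows the "{operation} {model}" pattern used by the conventions.
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", 0.7)
        span.set_attribute("gen_ai.request.max_tokens", 1024)

        response = call_model(model=model, messages=messages)  # hypothetical

        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.response.finish_reasons", response["finish_reasons"])
        span.set_attribute("gen_ai.usage.input_tokens", response["usage"]["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["usage"]["output_tokens"])
        return response
```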
3.3 Operation Types
The specification defines seven standard operation names:
gen_ai.operation.name:
├── chat # Chat completion (most common)
├── text_completion # Legacy completion API
├── generate_content # Multimodal generation
├── embeddings # Vector embeddings
├── create_agent # Agent instantiation
├── invoke_agent # Agent execution
└── execute_tool # Tool/function execution
3.4 Provider Identifiers
Standardized gen_ai.provider.name values:
| Provider | Value | Notes |
|---|---|---|
| Anthropic | anthropic | Claude models |
| OpenAI | openai | GPT models |
| AWS Bedrock | aws.bedrock | Multi-model |
| Azure OpenAI | azure.ai.openai | Azure-hosted |
| Google Gemini | gcp.gemini | AI Studio API |
| Google Vertex AI | gcp.vertex_ai | Enterprise API |
| Cohere | cohere | |
| Mistral AI | mistral_ai | |
3.5 Standard Metrics
3.5.1 Client Metrics
| Metric | Type | Unit | Buckets |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram | {token} | [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, …] |
| gen_ai.client.operation.duration | Histogram | s | [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, …] |
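The sketch below illustrates how these client metrics might be emitted with the OpenTelemetry Python SDK, using a View to apply the advisory bucket boundaries above. Exporter and reader wiring is environment-specific and omitted, and the recorded values are illustrative.

```python
# Minimal sketch: emitting gen_ai.client.* histograms with spec-aligned buckets.
# Metric reader/exporter wiring is omitted; recorded values are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

token_view = View(
    instrument_name="gen_ai.client.token.usage",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[1, 4, 16, 64, 256, 1024, 4096, 16384, 65536]),
)
metrics.set_meter_provider(MeterProvider(views=[token_view]))
meter = metrics.get_meter("example.genai.metrics")

token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}", description="Tokens per operation")
duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s", description="Operation duration")

# After each LLM call, record both token directions plus the duration.
attrs = {"gen_ai.operation.name": "chat", "gen_ai.provider.name": "anthropic"}
token_usage.record(812, attributes={**attrs, "gen_ai.token.type": "input"})
token_usage.record(245, attributes={**attrs, "gen_ai.token.type": "output"})
duration.record(1.84, attributes=attrs)
```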
3.5.2 Server Metrics (for model hosting)
| Metric | Type | Unit | Purpose |
|---|---|---|---|
| gen_ai.server.request.duration | Histogram | s | Total request time |
| gen_ai.server.time_to_first_token | Histogram | s | Prefill + queue latency |
| gen_ai.server.time_per_output_token | Histogram | s | Decode phase performance |
3.6 Content Handling
The specification addresses sensitive content through three approaches:
- Default: Do not capture prompts/completions
- Opt-in attributes: Record content on spans (gen_ai.input.messages, gen_ai.output.messages)
- External storage: Upload to secure storage, record references
Recommended for production:
┌─────────────────────────────────────────────────────────┐
│ Span: gen_ai.operation.name = "chat" │
│ ├── gen_ai.input.messages.uri = "s3://bucket/msg/123" │
│ └── gen_ai.output.messages.uri = "s3://bucket/msg/124" │
└─────────────────────────────────────────────────────────┘
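A minimal sketch of the external-storage approach follows. The upload_messages helper and the .uri attribute suffix mirror the diagram above and are illustrative rather than normative.

```python
# Illustrative sketch of the external-storage approach shown above.
# `upload_messages` is a hypothetical helper; swap in your object-store client.
import json
import uuid
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.content")

def upload_messages(messages) -> str:
    """Persist message content outside the telemetry pipeline, return a URI."""
    key = f"msg/{uuid.uuid4()}"
    # e.g. s3_client.put_object(Bucket="bucket", Key=key, Body=json.dumps(messages))
    return f"s3://bucket/{key}"

def chat_with_content_refs(messages):
    with tracer.start_as_current_span("chat claude-3-opus") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.input.messages.uri", upload_messages(messages))
        output = call_model(messages)  # hypothetical provider call
        span.set_attribute("gen_ai.output.messages.uri", upload_messages(output))
        return output
```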
4. Agent Observability Standards
Deep Dive Reference: See Appendix B: Agent Span Hierarchies
4.1 The Agent Observability Challenge
AI agents introduce observability complexity through:
- Non-deterministic execution: Same input may produce different tool call sequences
- Multi-turn reasoning: Extended context across many LLM calls
- Tool orchestration: External system interactions within agent loops
- Framework diversity: LangGraph, CrewAI, AutoGen, etc. have different patterns
4.2 Agent Application vs. Framework Distinction
The OpenTelemetry specification distinguishes:
| Concept | Definition | Examples |
|---|---|---|
| Agent Application | Specific AI-driven entity | Customer support bot, coding assistant |
| Agent Framework | Infrastructure for building agents | LangGraph, CrewAI, Claude Code |
4.3 Agent Span Semantics
4.3.1 Agent Creation Span
Span: create_agent {agent_name}
├── gen_ai.operation.name: "create_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.agent.description: "Handles tier-1 support queries"
4.3.2 Agent Invocation Span
Span: invoke_agent {agent_name}
├── gen_ai.operation.name: "invoke_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.conversation.id: "conv_xyz789"
│
├── Child Span: chat claude-3-opus
│ └── gen_ai.operation.name: "chat"
│
├── Child Span: execute_tool get_customer_info
│ ├── gen_ai.tool.name: "get_customer_info"
│ ├── gen_ai.tool.type: "function"
│ └── gen_ai.tool.call.id: "call_abc"
│
└── Child Span: chat claude-3-opus
└── gen_ai.operation.name: "chat"
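The hierarchy above can be reproduced with ordinary nested spans. The following sketch is a minimal illustration; run_llm and get_customer_info are hypothetical stand-ins for the model client and the tool implementation.

```python
# Minimal sketch of the invoke_agent -> chat / execute_tool hierarchy shown above.
# `run_llm` and `get_customer_info` are hypothetical stand-ins.
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.agent")

def invoke_agent(user_message, conversation_id):
    with tracer.start_as_current_span("invoke_agent CustomerSupportAgent") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.id", "agent_abc123")
        agent_span.set_attribute("gen_ai.agent.name", "CustomerSupportAgent")
        agent_span.set_attribute("gen_ai.conversation.id", conversation_id)

        # First model turn decides whether a tool is needed.
        with tracer.start_as_current_span("chat claude-3-opus") as chat_span:
            chat_span.set_attribute("gen_ai.operation.name", "chat")
            tool_call = run_llm(user_message)  # hypothetical

        # Tool execution gets its own child span.
        with tracer.start_as_current_span("execute_tool get_customer_info") as tool_span:
            tool_span.set_attribute("gen_ai.operation.name", "execute_tool")
            tool_span.set_attribute("gen_ai.tool.name", "get_customer_info")
            tool_span.set_attribute("gen_ai.tool.type", "function")
            tool_span.set_attribute("gen_ai.tool.call.id", tool_call["id"])
            tool_result = get_customer_info(**tool_call["arguments"])  # hypothetical

        # Second model turn produces the final answer from the tool result.
        with tracer.start_as_current_span("chat claude-3-opus") as chat_span:
            chat_span.set_attribute("gen_ai.operation.name", "chat")
            return run_llm(user_message, tool_result)  # hypothetical
```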
4.4 Tool Execution Attributes
| Attribute | Type | Description |
|---|---|---|
| gen_ai.tool.name | string | Tool identifier |
| gen_ai.tool.type | string | function, extension, datastore |
| gen_ai.tool.description | string | Human-readable description |
| gen_ai.tool.call.id | string | Unique call identifier |
| gen_ai.tool.call.arguments | any | Input parameters (opt-in, sensitive) |
| gen_ai.tool.call.result | any | Output (opt-in, sensitive) |
4.5 Framework Instrumentation Approaches
| Approach | Pros | Cons | Examples |
|---|---|---|---|
| Baked-in | Zero config, consistent | Bloat, version lag | CrewAI |
| External OTel | Decoupled, community-maintained | Integration complexity | OpenLLMetry |
| OTel Contrib | Official support, best practices | Review queue delays | instrumentation-genai |
4.6 Claude Code as Agent System
Claude Code exhibits agent characteristics:
- Multi-turn conversation management
- Tool execution (Bash, Read, Write, Edit, etc.)
- Reasoning chains across tool calls
- Session-based context
Current gap: Claude Code telemetry doesn’t emit standardized agent spans.
5. Quality and Evaluation Metrics
Deep Dive Reference: See Appendix C: LLM Evaluation Frameworks
5.1 The Quality Visibility Problem
Traditional observability answers: “Is the system up and performing?”
LLM observability must also answer: “Is the system producing good outputs?”
System Status Matrix:
│ Quality: Good │ Quality: Bad
────────────────────┼──────────────────┼──────────────────
Performance: Good │ Healthy │ INVISIBLE FAILURE
Performance: Bad │ Investigate │ Obvious failure
The “invisible failure” quadrant is uniquely dangerous for LLM systems.
5.2 Core Quality Metrics
| Metric | Description | Measurement Method |
|---|---|---|
| Answer Relevancy | Output addresses input intent | LLM-as-judge, embedding similarity |
| Faithfulness | Output grounded in provided context | LLM-as-judge, NLI models |
| Hallucination | Fabricated or false information | LLM-as-judge, fact verification |
| Task Completion | Agent accomplished stated goal | Rule-based + LLM assessment |
| Tool Correctness | Correct tools called with valid args | Deterministic validation |
| Toxicity/Safety | Output meets safety guidelines | Classifier models, guardrails |
5.3 LLM-as-Judge Pattern
The dominant approach for quality evaluation:
┌─────────────────────────────────────────────────────────────┐
│ LLM-as-Judge Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ Production LLM Call │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Input │───▶│ Model A │───▶│ Output │ │
│ └──────────┘ └───────────┘ └──────────┘ │
│ │ │ │
│ │ Evaluation LLM │ │
│ │ ┌───────────────────────┐ │ │
│ └───▶│ Model B │◀──┘ │
│ │ (Judge: GPT-4, etc.) │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Quality Scores │ │
│ │ • Relevancy: 0.85 │ │
│ │ • Faithfulness: 0.92 │ │
│ │ • Hallucination: 0.08 │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
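In code, the judge step reduces to a second model call that receives the production input, output, and any retrieved context and returns structured scores. The sketch below is illustrative; the prompt wording and the judge_model_call helper are assumptions rather than part of any specific framework.

```python
# Illustrative LLM-as-judge step: a second model scores the production output.
# `judge_model_call` is a hypothetical wrapper around whichever judge model is used.
import json

JUDGE_PROMPT = """Rate the assistant's answer on three 0-1 scales and reply as JSON:
{{"relevancy": ..., "faithfulness": ..., "hallucination": ...}}

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def evaluate_output(question: str, context: str, answer: str) -> dict:
    raw = judge_model_call(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    scores = json.loads(raw)  # assumes the judge returns pure JSON
    # Scores can then be attached to the originating trace/span as attributes
    # (e.g. eval.relevancy, eval.faithfulness) or stored alongside it.
    return scores
```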
5.4 Evaluation Tool Landscape (2026)
| Tool | Type | Key Features |
|---|---|---|
| Langfuse | Open Source | Tracing, prompt management, evals, 19k+ stars |
| Arize Phoenix | Open Source | OTel-native, OTLP ingestion, 7.8k stars |
| DeepEval | Open Source | 14+ metrics, CI/CD integration |
| MLflow 3.0 | Open Source | GenAI evals, experiment tracking |
| Datadog LLM Obs | Commercial | Hallucination detection, production monitoring |
| Braintrust | Commercial | Eval datasets, prompt playground |
5.5 Production Evaluation Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Production Evaluation Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. CAPTURE 2. EVALUATE 3. ITERATE │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Production │ │ Async Eval │ │ Feedback │ │
│ │ Traces │─────────▶│ Workers │────────▶│ Loop │ │
│ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Span + │ │ Quality │ │ Prompt │ │
│ │ Metadata │ │ Scores │ │ Iteration │ │
│ └─────────────┘ └─────────────┘ └───────────┘ │
│ │
│ Promote interesting traces to evaluation datasets │
└─────────────────────────────────────────────────────────────────┘
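A short sketch of the capture-to-evaluate hand-off, assuming completed spans are pulled from a queue and scored asynchronously so production latency is unaffected; the queue, the score store, and the evaluate_output judge from the previous sketch are assumed components.

```python
# Sketch of an asynchronous evaluation worker: score completed spans off the hot path.
# `span_queue`, `score_store`, and `evaluate_output` (from the judge sketch above)
# are assumed to exist in the surrounding system.
import asyncio

async def eval_worker(span_queue: asyncio.Queue, score_store: dict):
    while True:
        span = await span_queue.get()
        scores = evaluate_output(
            question=span["input"], context=span.get("context", ""),
            answer=span["output"])
        score_store[span["trace_id"]] = scores
        # Low-scoring or otherwise interesting traces can be promoted to an
        # evaluation dataset at this point.
        span_queue.task_done()
```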
5.6 Hallucination Detection Challenges
Recent research (arXiv:2504.18114) reveals limitations:
- Automated metrics often fail to align with human judgments
- Gains from scaling judge model parameters are inconsistent
- Many detectors take a myopic view, focusing on surface-level patterns
- GPT-4 as judge yields the best overall results, but at significant cost
6. Comparative Analysis: observability-toolkit MCP
6.1 Architecture Overview
The observability-toolkit MCP server provides:
┌─────────────────────────────────────────────────────────────┐
│ observability-toolkit v1.6.0 │
├─────────────────────────────────────────────────────────────┤
│ │
│ Data Sources: │
│ ├── Local JSONL files (~/.claude/telemetry/) │
│ └── SigNoz Cloud API (optional) │
│ │
│ Tools: │
│ ├── obs_query_traces - Distributed trace queries │
│ ├── obs_query_metrics - Metric aggregation │
│ ├── obs_query_logs - Log search with boolean ops │
│ ├── obs_query_llm_events - LLM-specific event queries │
│ ├── obs_health_check - System health + cache stats │
│ ├── obs_context_stats - Context window utilization │
│ └── obs_get_trace_url - SigNoz trace viewer links │
│ │
│ Performance Features: │
│ ├── LRU query caching │
│ ├── File indexing (.idx sidecars) │
│ ├── Gzip compression support │
│ ├── Streaming with early termination │
│ └── Circuit breaker for SigNoz │
└─────────────────────────────────────────────────────────────┘
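For context, the local-first query pattern (JSONL scanning with transparent gzip handling and early termination) can be reduced to something like the sketch below. This is an illustration of the pattern only, not the toolkit's actual implementation, and the directory layout is an assumption.

```python
# Illustration of local-first JSONL telemetry scanning with gzip support and
# early termination. Not the toolkit's actual code; the file layout is an assumption.
import gzip
import json
from pathlib import Path

def query_events(telemetry_dir: Path, predicate, limit: int = 100):
    results = []
    for path in sorted(telemetry_dir.glob("*.jsonl*")):
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt") as fh:
            for line in fh:
                event = json.loads(line)
                if predicate(event):
                    results.append(event)
                    if len(results) >= limit:  # early termination
                        return results
    return results

# Example: find events for one session (attribute key assumed).
# query_events(Path.home() / ".claude" / "telemetry",
#              lambda e: e.get("session.id") == "abc123")
```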
6.2 OTel GenAI Compliance Matrix
| Requirement | Spec | Implementation | Status |
|---|---|---|---|
| gen_ai.operation.name | Required | Not captured | Gap |
| gen_ai.provider.name | Required | Uses gen_ai.system | Partial |
| gen_ai.request.model | Cond. Required | Captured | Compliant |
| gen_ai.conversation.id | Cond. Required | Not captured | Gap |
| gen_ai.usage.input_tokens | Recommended | Captured | Compliant |
| gen_ai.usage.output_tokens | Recommended | Captured | Compliant |
| gen_ai.response.model | Recommended | Not captured | Gap |
| gen_ai.response.finish_reasons | Recommended | Not captured | Gap |
| gen_ai.request.temperature | Recommended | Not captured | Gap |
| gen_ai.request.max_tokens | Recommended | Not captured | Gap |
Compliance Score: 3 of 10 attributes fully compliant, plus 1 partial (gen_ai.system in place of gen_ai.provider.name)
6.3 Agent Tracking Analysis
| Capability | Spec Requirement | Implementation | Status |
|---|---|---|---|
| Agent spans (create_agent, invoke_agent) | Defined | Not implemented | Gap |
| Tool execution spans (execute_tool) | Defined | Not implemented | Gap |
| gen_ai.agent.id | Recommended | Not captured | Gap |
| gen_ai.agent.name | Recommended | Not captured | Gap |
| gen_ai.tool.name | Recommended | Not captured | Gap |
| gen_ai.tool.call.id | Recommended | Not captured | Gap |
| Session correlation | Custom | Uses session.id | Partial |
Agent Compliance: Minimal - relies on generic trace attributes
6.4 Metrics Compliance
| Metric | Spec | Implementation | Status |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram w/ buckets | Flat storage | Partial |
| gen_ai.client.operation.duration | Histogram w/ buckets | Available via traces | Partial |
| gen_ai.server.time_to_first_token | Histogram | Not implemented | Gap |
| gen_ai.server.time_per_output_token | Histogram | Not implemented | Gap |
| Aggregation support | sum, avg, p50, p95, p99 | sum, avg, min, max, count | Partial |
6.5 Quality/Eval Capabilities
| Capability | Industry Standard | Implementation | Status |
|---|---|---|---|
| Hallucination detection | LLM-as-judge | Not implemented | Gap |
| Answer relevancy scoring | Automated eval | Not implemented | Gap |
| Human feedback collection | Annotation API | Not implemented | Gap |
| Eval dataset management | Trace promotion | Not implemented | Gap |
| Cost tracking | Price * tokens | Token counts only | Partial |
6.6 Strengths Relative to Industry
| Strength | Description | Competitive Position |
|---|---|---|
| Multi-directory scanning | Aggregates telemetry across locations | Unique |
| Gzip support | Transparent compression handling | Standard |
| Index files | Fast lookups via .idx sidecars | Above average |
| Query caching | LRU with TTL and stats | Standard |
| OTLP export | Standard format output | Compliant |
| Local-first | No cloud dependency required | Differentiator |
| Claude Code integration | Purpose-built for CC sessions | Unique |
7. Recommendations and Roadmap
7.1 Priority Matrix
Impact
Low High
┌───────────┬───────────┐
High │ P3: Nice │ P1: Do │
Effort │ to have │ First │
├───────────┼───────────┤
Low │ P4: Maybe │ P2: Quick │
│ later │ Wins │
└───────────┴───────────┘
7.2 Phase 1: OTel GenAI Compliance (P1/P2)
Goal: Achieve 80%+ compliance with GenAI semantic conventions
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Add gen_ai.operation.name to LLM events | P1 | Low | High |
| Rename gen_ai.system → gen_ai.provider.name | P2 | Low | Medium |
| Capture gen_ai.conversation.id | P1 | Medium | High |
| Add gen_ai.response.model | P2 | Low | Medium |
| Add gen_ai.response.finish_reasons | P2 | Low | Medium |
Estimated effort: 1-2 development cycles
7.3 Phase 2: Agent Observability (P1)
Goal: First-class support for agent/tool span semantics
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Define agent span schema | P1 | Medium | High |
| Tool execution span tracking | P1 | Medium | High |
| Agent invocation correlation | P1 | High | High |
| Multi-agent workflow visualization | P3 | High | Medium |
Estimated effort: 2-3 development cycles
7.4 Phase 3: Metrics Enhancement (P2)
Goal: Standard histogram metrics with OTel bucket boundaries
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Implement histogram aggregation | P2 | Medium | Medium |
| Add p50/p95/p99 percentiles | P2 | Low | Medium |
| Time-to-first-token metric | P2 | Medium | Medium |
| Cost estimation layer | P3 | Low | Low |
Estimated effort: 1-2 development cycles
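For the percentile tasks in this phase, a common approach is linear interpolation within explicit histogram buckets. The sketch below assumes the OTel-style layout of N ascending upper bounds plus an overflow bucket; the example counts are invented.

```python
# Minimal sketch: estimate a percentile from explicit-bucket histogram data
# via linear interpolation inside the target bucket. Inputs are illustrative.
def percentile_from_buckets(boundaries, counts, q):
    """boundaries: N ascending upper bounds; counts: N+1 bucket counts; q in [0, 1]."""
    total = sum(counts)
    target = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= target and count > 0:
            lower = boundaries[i - 1] if i > 0 else 0.0
            # The overflow bucket has no upper bound; clamp to the last boundary.
            upper = boundaries[i] if i < len(boundaries) else boundaries[-1]
            fraction = (target - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    return boundaries[-1]

# p95 of operation durations with the 0.01..1.28 s boundaries above (counts invented):
# percentile_from_buckets([0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28],
#                         [3, 10, 25, 40, 30, 12, 5, 2, 1], 0.95)
```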
7.5 Phase 4: Quality Layer (P3)
Goal: Optional integration with evaluation frameworks
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Langfuse integration research | P3 | Low | Medium |
| Eval score storage schema | P3 | Medium | Medium |
| LLM-as-judge hook support | P4 | High | High |
| Human feedback API | P4 | High | Medium |
Estimated effort: 3+ development cycles (optional)
7.6 Implementation Roadmap
2026 Q1 Q2 Q3
───────────────────────────────────────────────────────────
│ Phase 1: OTel Compliance │ Phase 2: Agents │ Phase 3 │
│ ├── gen_ai.operation.name│ ├── Agent spans │ Metrics │
│ ├── gen_ai.provider.name │ ├── Tool spans │ ├── Hist │
│ ├── gen_ai.conversation │ └── Correlation │ └── TTFT │
│ └── finish_reasons │ │ │
───────────────────────────────────────────────────────────
│
▼
Phase 4: Quality
(Future/Optional)
8. Future Research Directions
8.1 Emerging Standards
- MCP Observability Conventions: As Model Context Protocol gains adoption, standardized telemetry for MCP tool calls may emerge
- Agentic System Semantics: OTel Issue #2664 proposes comprehensive agent conventions including tasks, artifacts, and memory
- Multi-Agent Coordination: Observability for agent-to-agent communication patterns
8.2 Quality Measurement Evolution
- Real-time hallucination detection: Moving from batch eval to streaming assessment
- Automated regression detection: Identifying quality degradation across model updates
- Domain-specific evaluators: Specialized judges for code, medical, legal domains
8.3 Cost Optimization
- Token budget observability: Real-time context window utilization tracking
- Model routing telemetry: Visibility into cost/quality tradeoff decisions
- Caching effectiveness: Measuring semantic cache hit rates
8.4 Privacy and Compliance
- Content redaction pipelines: OTel Collector processors for PII removal
- Audit trail requirements: Regulatory compliance for AI decisions
- Differential privacy: Aggregated telemetry without individual exposure
9. References
9.1 OpenTelemetry Specifications
OpenTelemetry. “Semantic conventions for generative AI systems.” https://opentelemetry.io/docs/specs/semconv/gen-ai/
OpenTelemetry. “Semantic conventions for generative client AI spans.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
OpenTelemetry. “Semantic conventions for generative AI metrics.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/
OpenTelemetry. “Gen AI Registry Attributes.” https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/
9.2 Industry Publications
Liu, G. & Solomon, S. “AI Agent Observability - Evolving Standards and Best Practices.” OpenTelemetry Blog, March 2025. https://opentelemetry.io/blog/2025/ai-agent-observability/
Jain, I. “An Introduction to Observability for LLM-based applications using OpenTelemetry.” OpenTelemetry Blog, June 2024. https://opentelemetry.io/blog/2024/llm-observability/
Datadog. “Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions.” December 2025. https://www.datadoghq.com/blog/llm-otel-semantic-convention/
Horovits, D. “OpenTelemetry for GenAI and the OpenLLMetry project.” Medium, November 2025. https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771
9.3 Evaluation and Quality
Confident AI. “LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide.” https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
DeepEval. “Hallucination Metric Documentation.” https://deepeval.com/docs/metrics-hallucination
“Evaluating Evaluation Metrics – The Mirage of Hallucination Detection.” arXiv:2504.18114, 2025.
9.4 Tools and Frameworks
Langfuse. “OpenTelemetry (OTel) for LLM Observability.” https://langfuse.com/blog/2024-10-opentelemetry-for-llm-observability
Traceloop. “OpenLLMetry: Open-source observability for GenAI.” https://github.com/traceloop/openllmetry
Anthropic. “Building effective agents.” https://www.anthropic.com/research/building-effective-agents
10. Appendices
Appendix A: OTel GenAI Attribute Reference
Status: Index entry for future deep dive
Complete reference of all gen_ai.* attributes with:
- Full attribute list with types and examples
- Requirement levels by operation type
- Provider-specific extensions
- Migration guide from pre-1.37 conventions
Appendix B: Agent Span Hierarchies
Status: Index entry for future deep dive
Detailed span hierarchy patterns for:
- Single-agent workflows
- Multi-agent orchestration
- Tool execution chains
- Error propagation patterns
- Correlation strategies
Appendix C: LLM Evaluation Frameworks
Status: Index entry for future deep dive
Comparative analysis of:
- Langfuse evaluation capabilities
- Arize Phoenix integration patterns
- DeepEval metric implementations
- Custom evaluator development
- Production deployment patterns
Appendix D: observability-toolkit Schema Migration
Status: Index entry for future deep dive
Migration guide covering:
- Current schema documentation
- Target OTel-compliant schema
- Backward compatibility strategy
- Data migration procedures
- Validation test suites
Appendix E: Cost Tracking Implementation
Status: Index entry for future deep dive
Cost observability implementation covering:
- Provider pricing models
- Token-to-cost calculation
- Budget alerting patterns
- Cost attribution by session/user
- Optimization recommendations
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-29 | Research Analysis | Initial publication |
This document was produced through systematic web research and comparative analysis. It represents the state of LLM observability standards as of January 2026 and should be reviewed periodically as the field evolves rapidly.