Technical White Paper v1.0
Abstract
As Large Language Model (LLM) applications transition from experimental deployments to production-critical systems, the need for standardized observability practices has become paramount. This paper examines the current state of LLM observability standards, with particular focus on OpenTelemetry’s emerging GenAI semantic conventions, agent tracking methodologies, and quality measurement frameworks. We evaluate the observability-toolkit MCP server against these industry standards, identifying alignment areas and gaps. This document serves as an index to deeper technical analyses across five key domains: semantic conventions, agent observability, quality metrics, performance optimization, and tooling ecosystem.
Keywords: LLM observability, OpenTelemetry, GenAI semantic conventions, agent tracking, AI quality metrics, distributed tracing
1. Introduction
1.1 Problem Statement
The rapid adoption of LLM-based applications has outpaced the development of observability tooling, creating a fragmented landscape where teams rely on vendor-specific instrumentation, proprietary formats, and ad-hoc monitoring solutions. This fragmentation leads to:
- Vendor lock-in through non-standard telemetry formats
- Incomplete visibility into multi-step agent workflows
- Inability to compare performance across providers and models
- Quality blind spots where systems appear operational but produce low-quality outputs
1.2 Scope
This paper focuses on three primary areas:
- Standardization: OpenTelemetry GenAI semantic conventions (v1.39.0)
- Agent Tracking: Multi-turn, tool-use, and reasoning chain observability
- Quality Measurement: Production evaluation metrics beyond latency and throughput
1.3 Methodology
Research was conducted through:
- Analysis of OpenTelemetry specification documents and GitHub discussions
- Review of industry tooling (Langfuse, Arize Phoenix, Datadog LLM Observability)
- Examination of academic literature on hallucination detection and LLM evaluation
- Comparative analysis against the observability-toolkit MCP server implementation
2. Background: The Evolution of LLM Observability
2.1 Traditional ML Observability vs. LLM Observability
Traditional machine learning observability focused on:
- Model accuracy metrics (precision, recall, F1)
- Feature drift detection
- Inference latency and throughput
- Resource utilization
LLM applications introduce fundamentally different observability challenges:
| Dimension | Traditional ML | LLM Applications |
|---|---|---|
| Input Nature | Structured features | Unstructured natural language |
| Output Nature | Discrete classes/values | Free-form generated text |
| Evaluation | Ground truth comparison | Subjective quality assessment |
| Cost Model | Compute-based | Token-based pricing |
| Failure Modes | Classification errors | Hallucinations, toxicity, irrelevance |
| Execution Pattern | Single inference | Multi-turn, tool-augmented chains |
2.2 The Three Pillars Extended
The traditional observability pillars (metrics, traces, logs) require extension for LLM systems:
┌─────────────────────────────────────────────────────────────────┐
│ LLM Observability Pillars │
├─────────────────┬─────────────────┬─────────────────────────────┤
│ TRACES │ METRICS │ LOGS │
├─────────────────┼─────────────────┼─────────────────────────────┤
│ • Prompt chains │ • Token usage │ • Prompt/completion content │
│ • Tool calls │ • Latency (TTFT)│ • Error details │
│ • Agent loops │ • Cost per req │ • Reasoning chains │
│ • Retrieval │ • Quality scores│ • Human feedback │
└─────────────────┴─────────────────┴─────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATION LAYER (NEW) │
├─────────────────────────────────────────────────────────────────┤
│ • Hallucination detection • Answer relevancy │
│ • Factual accuracy • Task completion │
│ • Tool correctness • Safety/toxicity │
└─────────────────────────────────────────────────────────────────┘
2.3 Key Industry Developments (2024-2026)
| Date | Development | Impact |
|---|---|---|
| Apr 2024 | OTel GenAI SIG formation | Standardization effort begins |
| Jun 2024 | GenAI semantic conventions draft | Initial attribute definitions |
| Oct 2024 | Langfuse OTel support | Open-source adoption |
| Dec 2024 | Datadog native OTel GenAI support | Enterprise validation |
| Jan 2025 | OTel v1.37+ GenAI conventions | Production-ready standards |
| Mar 2025 | Agent framework conventions proposed | Multi-agent standardization |
| Dec 2025 | OTel v1.39 GenAI conventions | Agent/tool span semantics |
3. OpenTelemetry GenAI Semantic Conventions
Deep Dive Reference: See Appendix A: OTel GenAI Attribute Reference
3.1 Overview
The OpenTelemetry GenAI semantic conventions (currently in Development status) establish a standardized schema for:
- Spans: LLM inference calls, tool executions, agent invocations
- Metrics: Token usage histograms, operation duration, latency breakdowns
- Events: Input/output messages, system instructions, tool definitions
- Attributes: Model parameters, provider metadata, conversation context
3.2 Core Span Attributes
3.2.1 Required Attributes
| Attribute | Type | Description | Example |
|---|---|---|---|
| gen_ai.operation.name | string | Operation type | chat, invoke_agent, execute_tool |
| gen_ai.provider.name | string | Provider identifier | anthropic, openai, aws.bedrock |
3.2.2 Conditionally Required Attributes
| Attribute | Condition | Type | Example |
|---|---|---|---|
| gen_ai.request.model | If available | string | claude-3-opus-20240229 |
| gen_ai.conversation.id | When available | string | conv_5j66UpCpwteGg4YSxUnt7lPY |
| error.type | If error occurred | string | timeout, rate_limit |
3.2.3 Recommended Attributes
| Attribute | Type | Description |
|---|---|---|
| gen_ai.request.temperature | double | Sampling temperature |
| gen_ai.request.max_tokens | int | Maximum output tokens |
| gen_ai.request.top_p | double | Nucleus sampling parameter |
| gen_ai.response.model | string | Actual model that responded |
| gen_ai.response.finish_reasons | string[] | Why generation stopped |
| gen_ai.usage.input_tokens | int | Prompt token count |
| gen_ai.usage.output_tokens | int | Completion token count |
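To make the attribute set concrete, the following sketch shows how client-side instrumentation might record these attributes with the OpenTelemetry Python SDK. The span name follows the "{operation} {model}" pattern used throughout the conventions; the call_model function is a hypothetical placeholder for a real provider client.

```python
# Minimal sketch: recording GenAI span attributes with the OpenTelemetry Python SDK.
# `call_model` is a hypothetical placeholder for a provider client.
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.instrumentation")

def instrumented_chat(messages, model="claude-3-opus-20240229"):
    # Span name follows the "{operation} {model}" pattern used by the conventions.
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", 0.7)
        span.set_attribute("gen_ai.request.max_tokens", 1024)

        response = call_model(model=model, messages=messages)  # hypothetical

        span.set_attribute("gen_ai.response.model", response["model"])
        span.set_attribute("gen_ai.response.finish_reasons", response["finish_reasons"])
        span.set_attribute("gen_ai.usage.input_tokens", response["usage"]["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", response["usage"]["output_tokens"])
        return response
```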
3.3 Operation Types
The specification defines seven standard operation names:
gen_ai.operation.name:
├── chat # Chat completion (most common)
├── text_completion # Legacy completion API
├── generate_content # Multimodal generation
├── embeddings # Vector embeddings
├── create_agent # Agent instantiation
├── invoke_agent # Agent execution
└── execute_tool # Tool/function execution
3.4 Provider Identifiers
Standardized gen_ai.provider.name values:
| Provider | Value | Notes |
|---|---|---|
| Anthropic | anthropic | Claude models |
| OpenAI | openai | GPT models |
| AWS Bedrock | aws.bedrock | Multi-model |
| Azure OpenAI | azure.ai.openai | Azure-hosted |
| Google Gemini | gcp.gemini | AI Studio API |
| Google Vertex AI | gcp.vertex_ai | Enterprise API |
| Cohere | cohere | |
| Mistral AI | mistral_ai | |
3.5 Standard Metrics
3.5.1 Client Metrics
| Metric | Type | Unit | Buckets |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram | {token} | [1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, …] |
| gen_ai.client.operation.duration | Histogram | s | [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, …] |
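The sketch below illustrates how these client metrics might be emitted with the OpenTelemetry Python SDK, using a View to apply the advisory bucket boundaries above. Exporter and reader wiring is environment-specific and omitted, and the recorded values are illustrative.

```python
# Minimal sketch: emitting gen_ai.client.* histograms with spec-aligned buckets.
# Metric reader/exporter wiring is omitted; recorded values are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View, ExplicitBucketHistogramAggregation

token_view = View(
    instrument_name="gen_ai.client.token.usage",
    aggregation=ExplicitBucketHistogramAggregation(
        boundaries=[1, 4, 16, 64, 256, 1024, 4096, 16384, 65536]),
)
metrics.set_meter_provider(MeterProvider(views=[token_view]))
meter = metrics.get_meter("example.genai.metrics")

token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}", description="Tokens per operation")
duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s", description="Operation duration")

# After each LLM call, record both token directions plus the duration.
attrs = {"gen_ai.operation.name": "chat", "gen_ai.provider.name": "anthropic"}
token_usage.record(812, attributes={**attrs, "gen_ai.token.type": "input"})
token_usage.record(245, attributes={**attrs, "gen_ai.token.type": "output"})
duration.record(1.84, attributes=attrs)
```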
3.5.2 Server Metrics (for model hosting)
| Metric | Type | Unit | Purpose |
|---|---|---|---|
| gen_ai.server.request.duration | Histogram | s | Total request time |
| gen_ai.server.time_to_first_token | Histogram | s | Prefill + queue latency |
| gen_ai.server.time_per_output_token | Histogram | s | Decode phase performance |
3.6 Content Handling
The specification addresses sensitive content through three approaches:
- Default: Do not capture prompts/completions
- Opt-in attributes: Record content on spans (gen_ai.input.messages, gen_ai.output.messages)
- External storage: Upload to secure storage, record references
Recommended for production:
┌─────────────────────────────────────────────────────────┐
│ Span: gen_ai.operation.name = "chat" │
│ ├── gen_ai.input.messages.uri = "s3://bucket/msg/123" │
│ └── gen_ai.output.messages.uri = "s3://bucket/msg/124" │
└─────────────────────────────────────────────────────────┘
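A minimal sketch of the external-storage approach follows. The upload_messages helper and the .uri attribute suffix mirror the diagram above and are illustrative rather than normative.

```python
# Illustrative sketch of the external-storage approach shown above.
# `upload_messages` is a hypothetical helper; swap in your object-store client.
import json
import uuid
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.content")

def upload_messages(messages) -> str:
    """Persist message content outside the telemetry pipeline, return a URI."""
    key = f"msg/{uuid.uuid4()}"
    # e.g. s3_client.put_object(Bucket="bucket", Key=key, Body=json.dumps(messages))
    return f"s3://bucket/{key}"

def chat_with_content_refs(messages):
    with tracer.start_as_current_span("chat claude-3-opus") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.input.messages.uri", upload_messages(messages))
        output = call_model(messages)  # hypothetical provider call
        span.set_attribute("gen_ai.output.messages.uri", upload_messages(output))
        return output
```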
4. Agent Observability Standards
Deep Dive Reference: See Appendix B: Agent Span Hierarchies
4.1 The Agent Observability Challenge
AI agents introduce observability complexity through:
- Non-deterministic execution: Same input may produce different tool call sequences
- Multi-turn reasoning: Extended context across many LLM calls
- Tool orchestration: External system interactions within agent loops
- Framework diversity: LangGraph, CrewAI, AutoGen, etc. have different patterns
4.2 Agent Application vs. Framework Distinction
The OpenTelemetry specification distinguishes:
| Concept | Definition | Examples |
|---|---|---|
| Agent Application | Specific AI-driven entity | Customer support bot, coding assistant |
| Agent Framework | Infrastructure for building agents | LangGraph, CrewAI, Claude Code |
4.3 Agent Span Semantics
4.3.1 Agent Creation Span
Span: create_agent {agent_name}
├── gen_ai.operation.name: "create_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.agent.description: "Handles tier-1 support queries"
4.3.2 Agent Invocation Span
Span: invoke_agent {agent_name}
├── gen_ai.operation.name: "invoke_agent"
├── gen_ai.agent.id: "agent_abc123"
├── gen_ai.agent.name: "CustomerSupportAgent"
└── gen_ai.conversation.id: "conv_xyz789"
│
├── Child Span: chat claude-3-opus
│ └── gen_ai.operation.name: "chat"
│
├── Child Span: execute_tool get_customer_info
│ ├── gen_ai.tool.name: "get_customer_info"
│ ├── gen_ai.tool.type: "function"
│ └── gen_ai.tool.call.id: "call_abc"
│
└── Child Span: chat claude-3-opus
└── gen_ai.operation.name: "chat"
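The hierarchy above can be reproduced with ordinary nested spans. The following sketch is a minimal illustration; run_llm and get_customer_info are hypothetical stand-ins for the model client and the tool implementation.

```python
# Minimal sketch of the invoke_agent -> chat / execute_tool hierarchy shown above.
# `run_llm` and `get_customer_info` are hypothetical stand-ins.
from opentelemetry import trace

tracer = trace.get_tracer("example.genai.agent")

def invoke_agent(user_message, conversation_id):
    with tracer.start_as_current_span("invoke_agent CustomerSupportAgent") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("gen_ai.agent.id", "agent_abc123")
        agent_span.set_attribute("gen_ai.agent.name", "CustomerSupportAgent")
        agent_span.set_attribute("gen_ai.conversation.id", conversation_id)

        # First model turn decides whether a tool is needed.
        with tracer.start_as_current_span("chat claude-3-opus") as chat_span:
            chat_span.set_attribute("gen_ai.operation.name", "chat")
            tool_call = run_llm(user_message)  # hypothetical

        # Tool execution gets its own child span.
        with tracer.start_as_current_span("execute_tool get_customer_info") as tool_span:
            tool_span.set_attribute("gen_ai.operation.name", "execute_tool")
            tool_span.set_attribute("gen_ai.tool.name", "get_customer_info")
            tool_span.set_attribute("gen_ai.tool.type", "function")
            tool_span.set_attribute("gen_ai.tool.call.id", tool_call["id"])
            tool_result = get_customer_info(**tool_call["arguments"])  # hypothetical

        # Second model turn produces the final answer from the tool result.
        with tracer.start_as_current_span("chat claude-3-opus") as chat_span:
            chat_span.set_attribute("gen_ai.operation.name", "chat")
            return run_llm(user_message, tool_result)  # hypothetical
```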
4.4 Tool Execution Attributes
| Attribute | Type | Description |
|---|---|---|
| gen_ai.tool.name | string | Tool identifier |
| gen_ai.tool.type | string | function, extension, datastore |
| gen_ai.tool.description | string | Human-readable description |
| gen_ai.tool.call.id | string | Unique call identifier |
| gen_ai.tool.call.arguments | any | Input parameters (opt-in, sensitive) |
| gen_ai.tool.call.result | any | Output (opt-in, sensitive) |
4.5 Framework Instrumentation Approaches
| Approach | Pros | Cons | Examples |
|---|---|---|---|
| Baked-in | Zero config, consistent | Bloat, version lag | CrewAI |
| External OTel | Decoupled, community-maintained | Integration complexity | OpenLLMetry |
| OTel Contrib | Official support, best practices | Review queue delays | instrumentation-genai |
4.6 Claude Code as Agent System
Claude Code exhibits agent characteristics:
- Multi-turn conversation management
- Tool execution (Bash, Read, Write, Edit, etc.)
- Reasoning chains across tool calls
- Session-based context
Current gap: Claude Code telemetry doesn’t emit standardized agent spans.
5. Quality and Evaluation Metrics
Deep Dive Reference: See Appendix C: LLM Evaluation Frameworks
5.1 The Quality Visibility Problem
Traditional observability answers: “Is the system up and performing?”
LLM observability must also answer: “Is the system producing good outputs?”
System Status Matrix:
│ Quality: Good │ Quality: Bad
────────────────────┼──────────────────┼──────────────────
Performance: Good │ Healthy │ INVISIBLE FAILURE
Performance: Bad │ Investigate │ Obvious failure
The “invisible failure” quadrant is uniquely dangerous for LLM systems.
5.2 Core Quality Metrics
| Metric | Description | Measurement Method |
|---|---|---|
| Answer Relevancy | Output addresses input intent | LLM-as-judge, embedding similarity |
| Faithfulness | Output grounded in provided context | LLM-as-judge, NLI models |
| Hallucination | Fabricated or false information | LLM-as-judge, fact verification |
| Task Completion | Agent accomplished stated goal | Rule-based + LLM assessment |
| Tool Correctness | Correct tools called with valid args | Deterministic validation |
| Toxicity/Safety | Output meets safety guidelines | Classifier models, guardrails |
5.3 LLM-as-Judge Pattern
The dominant approach for quality evaluation:
┌─────────────────────────────────────────────────────────────┐
│ LLM-as-Judge Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ Production LLM Call │
│ ┌──────────┐ ┌───────────┐ ┌──────────┐ │
│ │ Input │───▶│ Model A │───▶│ Output │ │
│ └──────────┘ └───────────┘ └──────────┘ │
│ │ │ │
│ │ Evaluation LLM │ │
│ │ ┌───────────────────────┐ │ │
│ └───▶│ Model B │◀──┘ │
│ │ (Judge: GPT-4, etc.) │ │
│ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────┐ │
│ │ Quality Scores │ │
│ │ • Relevancy: 0.85 │ │
│ │ • Faithfulness: 0.92 │ │
│ │ • Hallucination: 0.08 │ │
│ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
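In code, the judge step reduces to a second model call that receives the production input, output, and any retrieved context and returns structured scores. The sketch below is illustrative; the prompt wording and the judge_model_call helper are assumptions rather than part of any specific framework.

```python
# Illustrative LLM-as-judge step: a second model scores the production output.
# `judge_model_call` is a hypothetical wrapper around whichever judge model is used.
import json

JUDGE_PROMPT = """Rate the assistant's answer on three 0-1 scales and reply as JSON:
{{"relevancy": ..., "faithfulness": ..., "hallucination": ...}}

Question: {question}
Retrieved context: {context}
Answer: {answer}"""

def evaluate_output(question: str, context: str, answer: str) -> dict:
    raw = judge_model_call(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    scores = json.loads(raw)  # assumes the judge returns pure JSON
    # Scores can then be attached to the originating trace/span as attributes
    # (e.g. eval.relevancy, eval.faithfulness) or stored alongside it.
    return scores
```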
5.4 Evaluation Tool Landscape (2026)
| Tool | Type | Key Features |
|---|---|---|
| Langfuse | Open Source | Tracing, prompt management, evals, 19k+ stars |
| Arize Phoenix | Open Source | OTel-native, OTLP ingestion, 7.8k stars |
| DeepEval | Open Source | 14+ metrics, CI/CD integration |
| MLflow 3.0 | Open Source | GenAI evals, experiment tracking |
| Datadog LLM Obs | Commercial | Hallucination detection, production monitoring |
| Braintrust | Commercial | Eval datasets, prompt playground |
5.5 Production Evaluation Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Production Evaluation Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. CAPTURE 2. EVALUATE 3. ITERATE │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Production │ │ Async Eval │ │ Feedback │ │
│ │ Traces │─────────▶│ Workers │────────▶│ Loop │ │
│ └─────────────┘ └─────────────┘ └───────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ Span + │ │ Quality │ │ Prompt │ │
│ │ Metadata │ │ Scores │ │ Iteration │ │
│ └─────────────┘ └─────────────┘ └───────────┘ │
│ │
│ Promote interesting traces to evaluation datasets │
└─────────────────────────────────────────────────────────────────┘
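A short sketch of the capture-to-evaluate hand-off, assuming completed spans are pulled from a queue and scored asynchronously so production latency is unaffected; the queue, the score store, and the evaluate_output judge from the previous sketch are assumed components.

```python
# Sketch of an asynchronous evaluation worker: score completed spans off the hot path.
# `span_queue`, `score_store`, and `evaluate_output` (from the judge sketch above)
# are assumed to exist in the surrounding system.
import asyncio

async def eval_worker(span_queue: asyncio.Queue, score_store: dict):
    while True:
        span = await span_queue.get()
        scores = evaluate_output(
            question=span["input"], context=span.get("context", ""),
            answer=span["output"])
        score_store[span["trace_id"]] = scores
        # Low-scoring or otherwise interesting traces can be promoted to an
        # evaluation dataset at this point.
        span_queue.task_done()
```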
5.6 Hallucination Detection Challenges
Recent research (arXiv:2504.18114) reveals limitations:
- Automated metrics often fail to align with human judgments
- Gains from scaling judge model parameters are inconsistent
- Many detectors take a myopic view, focusing on surface-level patterns
- GPT-4 as judge yields the best overall results, but at significant cost
6. Comparative Analysis: observability-toolkit MCP
6.1 Architecture Overview
The observability-toolkit MCP server provides:
┌─────────────────────────────────────────────────────────────┐
│ observability-toolkit v1.6.0 │
├─────────────────────────────────────────────────────────────┤
│ │
│ Data Sources: │
│ ├── Local JSONL files (~/.claude/telemetry/) │
│ └── SigNoz Cloud API (optional) │
│ │
│ Tools: │
│ ├── obs_query_traces - Distributed trace queries │
│ ├── obs_query_metrics - Metric aggregation │
│ ├── obs_query_logs - Log search with boolean ops │
│ ├── obs_query_llm_events - LLM-specific event queries │
│ ├── obs_health_check - System health + cache stats │
│ ├── obs_context_stats - Context window utilization │
│ └── obs_get_trace_url - SigNoz trace viewer links │
│ │
│ Performance Features: │
│ ├── LRU query caching │
│ ├── File indexing (.idx sidecars) │
│ ├── Gzip compression support │
│ ├── Streaming with early termination │
│ └── Circuit breaker for SigNoz │
└─────────────────────────────────────────────────────────────┘
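For context, the local-first query pattern (JSONL scanning with transparent gzip handling and early termination) can be reduced to something like the sketch below. This is an illustration of the pattern only, not the toolkit's actual implementation, and the directory layout is an assumption.

```python
# Illustration of local-first JSONL telemetry scanning with gzip support and
# early termination. Not the toolkit's actual code; the file layout is an assumption.
import gzip
import json
from pathlib import Path

def query_events(telemetry_dir: Path, predicate, limit: int = 100):
    results = []
    for path in sorted(telemetry_dir.glob("*.jsonl*")):
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt") as fh:
            for line in fh:
                event = json.loads(line)
                if predicate(event):
                    results.append(event)
                    if len(results) >= limit:  # early termination
                        return results
    return results

# Example: find events for one session (attribute key assumed).
# query_events(Path.home() / ".claude" / "telemetry",
#              lambda e: e.get("session.id") == "abc123")
```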
6.2 OTel GenAI Compliance Matrix
| Requirement | Spec | Implementation | Status |
|---|---|---|---|
| gen_ai.operation.name | Required | Not captured | Gap |
| gen_ai.provider.name | Required | Uses gen_ai.system | Partial |
| gen_ai.request.model | Cond. Required | Captured | Compliant |
| gen_ai.conversation.id | Cond. Required | Not captured | Gap |
| gen_ai.usage.input_tokens | Recommended | Captured | Compliant |
| gen_ai.usage.output_tokens | Recommended | Captured | Compliant |
| gen_ai.response.model | Recommended | Not captured | Gap |
| gen_ai.response.finish_reasons | Recommended | Not captured | Gap |
| gen_ai.request.temperature | Recommended | Not captured | Gap |
| gen_ai.request.max_tokens | Recommended | Not captured | Gap |
Compliance Score: 3 of 10 attributes fully compliant, plus 1 partial (gen_ai.system in place of gen_ai.provider.name)
6.3 Agent Tracking Analysis
| Capability | Spec Requirement | Implementation | Status |
|---|---|---|---|
| Agent spans (create_agent, invoke_agent) | Defined | Not implemented | Gap |
| Tool execution spans (execute_tool) | Defined | Not implemented | Gap |
| gen_ai.agent.id | Recommended | Not captured | Gap |
| gen_ai.agent.name | Recommended | Not captured | Gap |
| gen_ai.tool.name | Recommended | Not captured | Gap |
| gen_ai.tool.call.id | Recommended | Not captured | Gap |
| Session correlation | Custom | Uses session.id | Partial |
Agent Compliance: Minimal - relies on generic trace attributes
6.4 Metrics Compliance
| Metric | Spec | Implementation | Status |
|---|---|---|---|
| gen_ai.client.token.usage | Histogram w/ buckets | Flat storage | Partial |
| gen_ai.client.operation.duration | Histogram w/ buckets | Available via traces | Partial |
| gen_ai.server.time_to_first_token | Histogram | Not implemented | Gap |
| gen_ai.server.time_per_output_token | Histogram | Not implemented | Gap |
| Aggregation support | sum, avg, p50, p95, p99 | sum, avg, min, max, count | Partial |
6.5 Quality/Eval Capabilities
| Capability | Industry Standard | Implementation | Status |
|---|---|---|---|
| Hallucination detection | LLM-as-judge | Not implemented | Gap |
| Answer relevancy scoring | Automated eval | Not implemented | Gap |
| Human feedback collection | Annotation API | Not implemented | Gap |
| Eval dataset management | Trace promotion | Not implemented | Gap |
| Cost tracking | Price * tokens | Token counts only | Partial |
6.6 Strengths Relative to Industry
| Strength | Description | Competitive Position |
|---|---|---|
| Multi-directory scanning | Aggregates telemetry across locations | Unique |
| Gzip support | Transparent compression handling | Standard |
| Index files | Fast lookups via .idx sidecars | Above average |
| Query caching | LRU with TTL and stats | Standard |
| OTLP export | Standard format output | Compliant |
| Local-first | No cloud dependency required | Differentiator |
| Claude Code integration | Purpose-built for CC sessions | Unique |
7. Recommendations and Roadmap
7.1 Priority Matrix
Impact
Low High
┌───────────┬───────────┐
High │ P3: Nice │ P1: Do │
Effort │ to have │ First │
├───────────┼───────────┤
Low │ P4: Maybe │ P2: Quick │
│ later │ Wins │
└───────────┴───────────┘
7.2 Phase 1: OTel GenAI Compliance (P1/P2)
Goal: Achieve 80%+ compliance with GenAI semantic conventions
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Add gen_ai.operation.name to LLM events | P1 | Low | High |
| Rename gen_ai.system → gen_ai.provider.name | P2 | Low | Medium |
| Capture gen_ai.conversation.id | P1 | Medium | High |
| Add gen_ai.response.model | P2 | Low | Medium |
| Add gen_ai.response.finish_reasons | P2 | Low | Medium |
Estimated effort: 1-2 development cycles
7.3 Phase 2: Agent Observability (P1)
Goal: First-class support for agent/tool span semantics
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Define agent span schema | P1 | Medium | High |
| Tool execution span tracking | P1 | Medium | High |
| Agent invocation correlation | P1 | High | High |
| Multi-agent workflow visualization | P3 | High | Medium |
Estimated effort: 2-3 development cycles
7.4 Phase 3: Metrics Enhancement (P2)
Goal: Standard histogram metrics with OTel bucket boundaries
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Implement histogram aggregation | P2 | Medium | Medium |
| Add p50/p95/p99 percentiles | P2 | Low | Medium |
| Time-to-first-token metric | P2 | Medium | Medium |
| Cost estimation layer | P3 | Low | Low |
Estimated effort: 1-2 development cycles
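For the percentile tasks in this phase, a common approach is linear interpolation within explicit histogram buckets. The sketch below assumes the OTel-style layout of N ascending upper bounds plus an overflow bucket; the example counts are invented.

```python
# Minimal sketch: estimate a percentile from explicit-bucket histogram data
# via linear interpolation inside the target bucket. Inputs are illustrative.
def percentile_from_buckets(boundaries, counts, q):
    """boundaries: N ascending upper bounds; counts: N+1 bucket counts; q in [0, 1]."""
    total = sum(counts)
    target = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if cumulative + count >= target and count > 0:
            lower = boundaries[i - 1] if i > 0 else 0.0
            # The overflow bucket has no upper bound; clamp to the last boundary.
            upper = boundaries[i] if i < len(boundaries) else boundaries[-1]
            fraction = (target - cumulative) / count
            return lower + fraction * (upper - lower)
        cumulative += count
    return boundaries[-1]

# p95 of operation durations with the 0.01..1.28 s boundaries above (counts invented):
# percentile_from_buckets([0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28],
#                         [3, 10, 25, 40, 30, 12, 5, 2, 1], 0.95)
```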
7.5 Phase 4: Quality Layer (P3)
Goal: Optional integration with evaluation frameworks
| Task | Priority | Effort | Impact |
|---|---|---|---|
| Langfuse integration research | P3 | Low | Medium |
| Eval score storage schema | P3 | Medium | Medium |
| LLM-as-judge hook support | P4 | High | High |
| Human feedback API | P4 | High | Medium |
Estimated effort: 3+ development cycles (optional)
7.6 Implementation Roadmap
2026 Q1 Q2 Q3
───────────────────────────────────────────────────────────
│ Phase 1: OTel Compliance │ Phase 2: Agents │ Phase 3 │
│ ├── gen_ai.operation.name│ ├── Agent spans │ Metrics │
│ ├── gen_ai.provider.name │ ├── Tool spans │ ├── Hist │
│ ├── gen_ai.conversation │ └── Correlation │ └── TTFT │
│ └── finish_reasons │ │ │
───────────────────────────────────────────────────────────
│
▼
Phase 4: Quality
(Future/Optional)
8. Future Research Directions
8.1 Emerging Standards
- MCP Observability Conventions: As Model Context Protocol gains adoption, standardized telemetry for MCP tool calls may emerge
- Agentic System Semantics: OTel Issue #2664 proposes comprehensive agent conventions including tasks, artifacts, and memory
- Multi-Agent Coordination: Observability for agent-to-agent communication patterns
8.2 Quality Measurement Evolution
- Real-time hallucination detection: Moving from batch eval to streaming assessment
- Automated regression detection: Identifying quality degradation across model updates
- Domain-specific evaluators: Specialized judges for code, medical, legal domains
8.3 Cost Optimization
- Token budget observability: Real-time context window utilization tracking
- Model routing telemetry: Visibility into cost/quality tradeoff decisions
- Caching effectiveness: Measuring semantic cache hit rates
8.4 Privacy and Compliance
- Content redaction pipelines: OTel Collector processors for PII removal
- Audit trail requirements: Regulatory compliance for AI decisions
- Differential privacy: Aggregated telemetry without individual exposure
9. References
9.1 OpenTelemetry Specifications
OpenTelemetry. “Semantic conventions for generative AI systems.” https://opentelemetry.io/docs/specs/semconv/gen-ai/
OpenTelemetry. “Semantic conventions for generative client AI spans.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
OpenTelemetry. “Semantic conventions for generative AI metrics.” https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/
OpenTelemetry. “Gen AI Registry Attributes.” https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/
9.2 Industry Publications
Liu, G. & Solomon, S. “AI Agent Observability - Evolving Standards and Best Practices.” OpenTelemetry Blog, March 2025. https://opentelemetry.io/blog/2025/ai-agent-observability/
Jain, I. “An Introduction to Observability for LLM-based applications using OpenTelemetry.” OpenTelemetry Blog, June 2024. https://opentelemetry.io/blog/2024/llm-observability/
Datadog. “Datadog LLM Observability natively supports OpenTelemetry GenAI Semantic Conventions.” December 2025. https://www.datadoghq.com/blog/llm-otel-semantic-convention/
Horovits, D. “OpenTelemetry for GenAI and the OpenLLMetry project.” Medium, November 2025. https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771
9.3 Evaluation and Quality
Confident AI. “LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide.” https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
DeepEval. “Hallucination Metric Documentation.” https://deepeval.com/docs/metrics-hallucination
“Evaluating Evaluation Metrics – The Mirage of Hallucination Detection.” arXiv:2504.18114, 2025.
9.4 Tools and Frameworks
Langfuse. “OpenTelemetry (OTel) for LLM Observability.” https://langfuse.com/blog/2024-10-opentelemetry-for-llm-observability
Traceloop. “OpenLLMetry: Open-source observability for GenAI.” https://github.com/traceloop/openllmetry
Anthropic. “Building effective agents.” https://www.anthropic.com/research/building-effective-agents
10. Appendices
Appendix A: OTel GenAI Attribute Reference
Status: Index entry for future deep dive
Complete reference of all gen_ai.* attributes with:
- Full attribute list with types and examples
- Requirement levels by operation type
- Provider-specific extensions
- Migration guide from pre-1.37 conventions
Appendix B: Agent Span Hierarchies
Status: Index entry for future deep dive
Detailed span hierarchy patterns for:
- Single-agent workflows
- Multi-agent orchestration
- Tool execution chains
- Error propagation patterns
- Correlation strategies
Appendix C: LLM Evaluation Frameworks
Status: Index entry for future deep dive
Comparative analysis of:
- Langfuse evaluation capabilities
- Arize Phoenix integration patterns
- DeepEval metric implementations
- Custom evaluator development
- Production deployment patterns
Appendix D: observability-toolkit Schema Migration
Status: Index entry for future deep dive
Migration guide covering:
- Current schema documentation
- Target OTel-compliant schema
- Backward compatibility strategy
- Data migration procedures
- Validation test suites
Appendix E: Cost Tracking Implementation
Status: Index entry for future deep dive
Cost observability implementation covering:
- Provider pricing models
- Token-to-cost calculation
- Budget alerting patterns
- Cost attribution by session/user
- Optimization recommendations
Document History
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-01-29 | Research Analysis | Initial publication |
This document was produced through systematic web research and comparative analysis. It represents the state of LLM observability standards as of January 2026 and should be reviewed periodically as the field evolves rapidly.