Migration Anomaly Classification: When Your OTEL Dashboard Lies to You

The dashboard showed tool_correctness: 0.89 — CRITICAL. Alerts fired. And the system was doing exactly what it should: working through the tail end of a three-phase TypeScript migration. The session that followed was devoted to fixing the alert system itself: documenting why this happens, building a classification framework for expected anomalies, and laying the implementation groundwork so that future migrations don’t look like incidents.

Four deliverables came out of it — a Jekyll article for practitioners, a technical implementation plan, a statistical reference document, and backlog entries — and this report evaluates how well each one did its job.

Quality Scorecard

Seven metrics. Three from rule-based telemetry analysis, four from LLM-as-Judge evaluation of the session outputs. Together they cover tool behavior and content quality; task completion is the one metric with no data for this session.

The Headline

 RELEVANCE        ███████████████████░  0.96   healthy
 FAITHFULNESS     ██████████████████░░  0.91   healthy
 COHERENCE        ███████████████████░  0.94   healthy
 HALLUCINATION    ██████████████████░░  0.09   warning  (lower is better)
 TOOL ACCURACY    ████████████████████  0.99   healthy
 EVAL LATENCY     ████████████████████  0.005s healthy
 TASK COMPLETION  ░░░░░░░░░░░░░░░░░░░░   N/A   —

Dashboard status: warning — hallucination at 0.09 sits in the 0.05–0.1 warning band. The judge traced this to forward-looking quantitative estimates (per-type baseline values, false positive reduction percentages) stated without consistent “estimated” qualifiers throughout the documents. Not fabricated — reasonable engineering projections — but some care is needed distinguishing observed from projected values.
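As a rough illustration of how a lower-is-better score maps to a dashboard status, here is a minimal Python sketch. Only the 0.05–0.1 warning band comes from this report. Treating the band as inclusive, treating anything above it as critical, and rolling up to the worst individual status are assumptions, and the names `Status`, `hallucination_status`, and `dashboard_status` are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by severity so that max() picks the worst status.
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def hallucination_status(score: float, warn_at: float = 0.05, crit_at: float = 0.10) -> Status:
    """Lower is better. The warning band is from the report; the
    inclusive edges and the CRITICAL tier above it are assumptions."""
    if score > crit_at:
        return Status.CRITICAL
    if score >= warn_at:
        return Status.WARNING
    return Status.HEALTHY


def dashboard_status(statuses: list[Status]) -> Status:
    """Assumed roll-up rule: the dashboard shows the worst metric status."""
    return max(statuses, default=Status.HEALTHY)


# Six scored metrics from the headline: five healthy, plus hallucination at 0.09.
session = [Status.HEALTHY] * 5 + [hallucination_status(0.09)]
print(dashboard_status(session).name)
```

Under these assumptions a single warning-band metric is enough to move the whole dashboard to warning, which is the behavior described above.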

How We Measured

Tool correctness, evaluation latency, and task completion were derived automatically from OpenTelemetry trace spans. Every tool call emits a span; the rule engine checks whether it succeeded and how long it took. Task completion is marked N/A because this session used Write/Edit/Bash directly rather than TaskCreate/TaskUpdate, so there was no task lifecycle data from which to compute a ratio.

The content quality metrics come from LLM-as-Judge evaluation — a G-Eval pattern where an AI judge reads the session’s four output files and scores them on relevance, faithfulness, coherence, and hallucination. The judge verified commit hashes against git log, checked file paths, and validated statistical formulas before scoring.

Per-Output Breakdown

Each output was evaluated independently, then aggregated:

Document	Relevance	Faithfulness	Coherence	Hallucination
Jekyll article (migration anomaly, ~1,600 words)	0.97	0.88	0.96	0.12
Implementation plan (`/tmp/anomaly-implementation-plan.md`)	0.97	0.91	0.94	0.09
Classification reference (`EXPECTED_ANOMALY_CLASSIFICATION.md`)	0.98	0.90	0.97	0.10
Backlog entries (ANO-1, ANO-2 in BACKLOG.md)	0.93	0.95	0.88	0.05
Session Average	0.96	0.91	0.94	0.09
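The Session Average row can be reproduced from the four per-output rows. The sketch below assumes an unweighted mean and round-half-even rounding to two decimals; neither is stated in the report, and the rounding choice matters because the relevance and coherence means (0.9625 and 0.9375) fall exactly on a half.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Per-output judge scores copied from the table above (kept as strings
# so Decimal arithmetic is exact).
SCORES = {
    "Jekyll article":           {"relevance": "0.97", "faithfulness": "0.88", "coherence": "0.96", "hallucination": "0.12"},
    "Implementation plan":      {"relevance": "0.97", "faithfulness": "0.91", "coherence": "0.94", "hallucination": "0.09"},
    "Classification reference": {"relevance": "0.98", "faithfulness": "0.90", "coherence": "0.97", "hallucination": "0.10"},
    "Backlog entries":          {"relevance": "0.93", "faithfulness": "0.95", "coherence": "0.88", "hallucination": "0.05"},
}

METRICS = ("relevance", "faithfulness", "coherence", "hallucination")


def session_average(scores: dict, metric: str) -> Decimal:
    """Unweighted mean across outputs, rounded half-even to two decimals (assumed)."""
    values = [Decimal(row[metric]) for row in scores.values()]
    mean = sum(values) / Decimal(len(values))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


averages = {metric: session_average(SCORES, metric) for metric in METRICS}
for metric, value in averages.items():
    print(f"{metric:<14} {value}")
```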

What the Judge Found

Strongest piece: the classification reference doc. Coherence is 0.97: the taxonomy, statistical methods, and threshold tables flow into each other naturally. The MAD formula and CUSUM parameters were verified against the canonical sources (Iglewicz & Hoaglin 1993, Montgomery 2019; both real, both correctly cited). The ctx.addAttribute implementation pattern matches the actual hooks API. Edge cases (MAD=0, N<5) are handled explicitly. This is a document that could be handed to another engineer and used immediately.
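For readers unfamiliar with the MAD-based score, here is a minimal Python sketch of the modified z-score from Iglewicz & Hoaglin, with the two edge cases guarded. The formula, the 0.6745 constant, and the 3.5 cutoff are the standard ones. Returning `None` for MAD=0 and N<5 is an assumption about how the reference doc handles those cases, and the history values are hypothetical.

```python
from statistics import median
from typing import Optional

MIN_SAMPLES = 5       # below this, the N<5 edge case applies
CONSISTENCY = 0.6745  # Iglewicz & Hoaglin scaling constant
CUTOFF = 3.5          # commonly recommended |M| threshold for outliers


def modified_z_score(value: float, history: list[float]) -> Optional[float]:
    """M = 0.6745 * (x - median) / MAD, or None when the score is undefined."""
    if len(history) < MIN_SAMPLES:
        return None  # assumed handling: too little data to judge
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return None  # assumed handling: avoid dividing by zero
    return CONSISTENCY * (value - med) / mad


# Hypothetical tool_correctness history for one session type.
history = [0.88, 0.90, 0.87, 0.89, 0.86, 0.91, 0.88]
score = modified_z_score(0.89, history)
print(score is not None and abs(score) > CUTOFF)
```

The point of the median-based score is that a handful of unusual sessions in the history does not inflate the spread the way a standard deviation would.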

Most precise claim in the session: The global thresholds cited across all four documents — tool_correctness: 0.95, token_burn: 200K, hook_latency_p95: 500ms, task_completion: 0.60 — were all verified against check-thresholds.sh and constants.ts. When the article says “the 0.95 threshold fired during a migration session,” that is a true statement traceable to the actual configuration.

The hallucination warning: The per-type baseline table in the Jekyll article presents migration session characteristics (tool_correctness median 0.88, avg latency 2.1s, avg token burn 150K) as observed values. But session.type tagging does not exist yet — those values are engineering estimates based on qualitative experience with the TS migration, not measured medians from tagged session data. The article’s roadmap section acknowledges this correctly, but the table header does not carry a “(projected)” qualifier. Same pattern in the classification reference’s proposed per-type threshold table. The backlog entries avoid this issue entirely, scoring faithfulness 0.95 — the entries make no quantitative claims, just accurately summarize what was built and what is pending.
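One way to keep observed and projected values apart is to carry the provenance with each number so the qualifier cannot be dropped at render time. The sketch below is hypothetical: the `Baseline` type and its fields are illustrative and not part of any deliverable. The three values are the ones quoted above.

```python
from dataclasses import dataclass
from typing import Literal

Provenance = Literal["observed", "projected"]


@dataclass(frozen=True)
class Baseline:
    metric: str
    value: str
    provenance: Provenance

    def label(self) -> str:
        # The qualifier is derived from the data, not typed by hand.
        suffix = " (projected)" if self.provenance == "projected" else ""
        return f"{self.metric}{suffix}: {self.value}"


# Migration-session characteristics quoted in the article. They are
# estimates, because session.type tagging does not exist yet.
MIGRATION_BASELINES = [
    Baseline("tool_correctness median", "0.88", "projected"),
    Baseline("avg latency", "2.1s", "projected"),
    Baseline("avg token burn", "150K", "projected"),
]

for baseline in MIGRATION_BASELINES:
    print(baseline.label())
```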

Coherence note on the backlog: The BACKLOG.md header still reads “Last Updated: 2026-02-16 (Session 3)” after the new ANO-1/ANO-2 entries were appended. Minor, but a reader scanning the header would miss the 2026-02-23 additions.

Session Telemetry

Metric	Value
Session ID	`406a06d8-7daf-4970-937a-3f415a3a3f01`
Date	2026-02-23
Model	Claude Opus 4.6
Total Spans	141
Tool Calls	101
Input Tokens	43,246
Output Tokens	52,424
Cache Read Tokens	7,917,462
Total Tokens (excl. cache)	95,670
Hooks Observed	session-start, builtin-post-tool, builtin-pre-tool, agent-post-tool, agent-pre-tool, plugin-post-tool, plugin-pre-tool, notification, skill-activation-prompt, error-handling-reminder, token-metrics-extraction

The cache read token count (7.9M) reflects context reuse across a multi-file read-heavy session — the hooks system, existing backlog format, Jekyll article conventions, and session-start.ts were all read before writing, amortizing that context across all four output files.
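The token figures in the table can be checked with a few lines of arithmetic. This sketch only restates the numbers above; the variable names are illustrative.

```python
input_tokens = 43_246
output_tokens = 52_424
cache_read_tokens = 7_917_462

# The reported total excludes cache reads.
total_excl_cache = input_tokens + output_tokens
print(total_excl_cache)  # 95670

# How many cached tokens were re-read per freshly processed token.
cache_to_fresh = cache_read_tokens / total_excl_cache
print(f"{cache_to_fresh:.1f}x")
```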

Methodology Notes

Session 406a06d8 was identified from the 2026-02-23 trace file (traces-2026-02-23.jsonl) by matching session IDs across 141 spans. The compute-metrics script calculated tool_correctness from builtin-post-tool and mcp-post-tool spans. No task lifecycle spans were present, so task_completion returns null rather than a ratio.
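The two rule-based ratios described here can be sketched as follows. This is not the compute-metrics script. The span fields (`hook`, `tool`, `success`, `status`) are a hypothetical schema, and the sample spans are synthetic rather than taken from the trace file.

```python
from typing import Optional

# Hook names the report says feed tool_correctness.
TOOL_HOOKS = {"builtin-post-tool", "mcp-post-tool"}


def tool_correctness(spans: list[dict]) -> Optional[float]:
    """Share of post-tool spans that succeeded, or None with no tool spans."""
    tool_spans = [s for s in spans if s.get("hook") in TOOL_HOOKS]
    if not tool_spans:
        return None
    succeeded = sum(1 for s in tool_spans if s.get("success"))
    return succeeded / len(tool_spans)


def task_completion(spans: list[dict]) -> Optional[float]:
    """Completed tasks over created tasks, or None when no task was created."""
    created = [s for s in spans if s.get("tool") == "TaskCreate"]
    if not created:
        return None  # no lifecycle data: report N/A instead of 0.0
    completed = [
        s for s in spans
        if s.get("tool") == "TaskUpdate" and s.get("status") == "completed"
    ]
    return len(completed) / len(created)


# Synthetic session: ten Write calls, one of which failed, and no task spans.
spans = [
    {"hook": "builtin-post-tool", "tool": "Write", "success": i != 0}
    for i in range(10)
]
print(tool_correctness(spans), task_completion(spans))
```

Returning `None` rather than 0.0 for the missing ratio keeps a session without task spans from looking like a session where every task failed.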

The four output files were read in full by the LLM-as-Judge agent before scoring. Claims were cross-checked against: git log (commit hashes), ~/.claude/hooks/handlers/session-start.ts (attribute API), ~/.claude/skills/otel-improvement/scripts/check-thresholds.sh (threshold values), and standard statistical references (MAD, CUSUM). The judge flagged forward-looking estimates as a faithfulness risk, and they account for the nonzero hallucination scores, but it did not classify them as fabrications: the estimates are reasonable extrapolations from observed behavior, not invented figures.
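For context on the CUSUM reference, here is a minimal sketch of the textbook two-sided tabular CUSUM. The target, the allowance `k`, the decision interval `h`, and the sample series are hypothetical. Setting `k` to half a standard deviation and `h` to five standard deviations is a common textbook choice, not necessarily what the classification reference uses.

```python
def cusum_alarms(values: list[float], target: float, k: float, h: float) -> list[int]:
    """Two-sided tabular CUSUM. Returns indices where either sum exceeds h."""
    upper = lower = 0.0
    alarms = []
    for i, x in enumerate(values):
        # Accumulate only the drift beyond the allowance k on each side.
        upper = max(0.0, upper + (x - target - k))
        lower = max(0.0, lower + (target - k - x))
        if upper > h or lower > h:
            alarms.append(i)
            upper = lower = 0.0  # restart after a signal (assumed convention)
    return alarms


# Hypothetical tool_correctness series drifting below a 0.95 target,
# with an assumed standard deviation of 0.02.
sigma = 0.02
series = [0.95, 0.96, 0.94, 0.90, 0.89, 0.88, 0.89]
print(cusum_alarms(series, target=0.95, k=0.5 * sigma, h=5 * sigma))  # [5]
```

Unlike a fixed threshold, the alarm here comes from the accumulated shortfall over several sessions, so a sustained drift is caught even when no single value is extreme.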

The discrepancy between ctx.addAttribute (singular, at line 27 of session-start.ts) and ctx.addAttributes (plural, in the implementation plan) is a real API mismatch. The plan’s code snippets have been corrected to use the singular form, matching the actual call at session-start.ts:27.

Appendix A: OTEL Data

Trace Summary

Field	Value
Trace file	`~/.claude/telemetry/traces-2026-02-23.jsonl`
Session ID	`406a06d8-7daf-4970-937a-3f415a3a3f01`
Total spans	141
Tool spans	101
Hook types	11

Hook Span Distribution

Hook	Role
`session-start`	Environment checks, git status, task restore
`builtin-post-tool`	Write/Edit/Bash success tracking
`builtin-pre-tool`	Permission checks
`agent-post-tool`	Subagent completion tracking
`agent-pre-tool`	Subagent launch tracking
`plugin-post-tool`	Plugin call completion
`plugin-pre-tool`	Plugin call launch
`notification`	Alert dispatch
`skill-activation-prompt`	Skill matching
`error-handling-reminder`	Error pattern detection
`token-metrics-extraction`	Token usage capture

Token Breakdown

Category	Tokens
Input	43,246
Output	52,424
Cache Read	7,917,462
Total (excl. cache)	95,670

Rule-Based Metrics

Metric	Value	Threshold	Status
tool_correctness	0.9901	0.95	healthy
evaluation_latency	0.005s	1.0s	healthy
task_completion	N/A	0.90	N/A (no task lifecycle spans)
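The status column above follows from comparing each value to its threshold in the right direction. The sketch below is an assumption about that logic, not the contents of check-thresholds.sh. The function name and the single "alert" outcome are hypothetical, and severity tiers are not modeled.

```python
from typing import Optional


def rule_status(value: Optional[float], threshold: float, higher_is_better: bool) -> str:
    """Compare one metric to its threshold; a missing value yields N/A."""
    if value is None:
        return "N/A"
    passed = value >= threshold if higher_is_better else value <= threshold
    return "healthy" if passed else "alert"


# Rows from the table above: (name, value, threshold, higher_is_better).
ROWS = [
    ("tool_correctness", 0.9901, 0.95, True),
    ("evaluation_latency", 0.005, 1.0, False),
    ("task_completion", None, 0.90, True),
]

for name, value, threshold, higher in ROWS:
    print(f"{name:<20} {rule_status(value, threshold, higher)}")
```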

Appendix B: Readability Scores Index

Textstat analysis of this report (D5) and the Jekyll article (D1), front matter excluded.

D5 — Quality Report (this document)

Metric	Score
Flesch Reading Ease	32.1
Flesch-Kincaid Grade	11.6
Gunning Fog Index	14.1
SMOG Index	12.6
Coleman-Liau Index	17.0
Automated Readability Index	11.9
Dale-Chall Score	14.2
Linsear Write	14.6
Consensus Grade	11th-12th grade
Reading Time	32s
Word Count	284
Sentence Count	25
Avg Sentence Length	11.4 words

D1 — Jekyll Article (Expected Anomalies in OTEL Migration Metrics)

Metric	Score
Flesch Reading Ease	35.7
Flesch-Kincaid Grade	11.8
Gunning Fog Index	14.8
SMOG Index	13.6
Coleman-Liau Index	15.8
Automated Readability Index	12.7
Dale-Chall Score	13.8
Linsear Write	7.3
Consensus Grade	13th-14th grade
Reading Time	2m 33s
Word Count	1,647
Sentence Count	115
Avg Sentence Length	14.3 words

Readability Summary

Both documents score in the “difficult” range (Flesch Reading Ease 32-36), consistent with technical writing for practitioners who understand OTEL, SRE, and statistical methods. The Jekyll article has the higher consensus grade (13th-14th), in line with its longer sentences (14.3 vs. 11.4 words on average) and more domain-specific vocabulary. The report has the lower consensus grade (11th-12th), helped by short tabular content and bullet structure, even though its Flesch Reading Ease is slightly lower (32.1 vs. 35.7).
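The two Flesch scores in each table can be cross-checked against each other. The sketch below uses the standard Flesch Reading Ease and Flesch-Kincaid Grade formulas and assumes the Textstat figures above follow them. It back-solves the implied syllables per word from the reading-ease score, then recomputes the grade.

```python
def implied_syllables_per_word(reading_ease: float, words_per_sentence: float) -> float:
    """Invert FRE = 206.835 - 1.015 * wps - 84.6 * spw to solve for spw."""
    return (206.835 - 1.015 * words_per_sentence - reading_ease) / 84.6


def flesch_kincaid_grade(words_per_sentence: float, syllables_per_word: float) -> float:
    """Standard Flesch-Kincaid Grade formula."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# (word count, sentence count, Flesch Reading Ease) from the tables above.
DOCS = {
    "D5 report": (284, 25, 32.1),
    "D1 article": (1647, 115, 35.7),
}

grades = {}
for name, (words, sentences, reading_ease) in DOCS.items():
    wps = words / sentences
    spw = implied_syllables_per_word(reading_ease, wps)
    grades[name] = flesch_kincaid_grade(wps, spw)
    print(f"{name}: {wps:.1f} words/sentence, grade {grades[name]:.1f}")
```

Both recomputed grades land within rounding of the reported 11.6 and 11.8, so the two Flesch rows in each table are mutually consistent.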

Appendix C: Session Summary

Deliverables

ID	Artifact	Location	Status
D1	Jekyll article — expected anomalies in OTEL migration metrics	`~/code/personal-site/_work/2026-02-23-expected-anomalies-in-otel-migration-metrics.md`	Written
D2	Backlog entries — ANO-1 (implementation), ANO-2 (reference doc)	`docs/BACKLOG.md`	Appended
D3	Classification reference — session type taxonomy, MAD Z-score, CUSUM, per-type thresholds	`docs/EXPECTED_ANOMALY_CLASSIFICATION.md`	Written
D4	Implementation plan — session-start.ts, summarize-session.py, check-thresholds.sh, update-scorecard.sh	`/tmp/anomaly-implementation-plan.md`	Written
D5	Quality report (this document)	`_reports/2026-02-23-migration-anomaly-otel-expected-degradation-classification.md`	Published

LLM-as-Judge Scores

Metric	D1 Article	D2 Backlog	D3 Reference	D4 Plan	Average
Relevance	0.97	0.93	0.98	0.97	0.96
Faithfulness	0.88	0.95	0.90	0.91	0.91
Coherence	0.96	0.88	0.97	0.94	0.94
Hallucination	0.12	0.05	0.10	0.09	0.09

Key Findings

Strongest output: D3 (classification reference) — coherence 0.97, immediately usable by another engineer
Highest faithfulness: D2 (backlog entries) — 0.95, no quantitative claims to verify
Hallucination source: Forward-looking per-type baseline estimates presented without consistent “projected” qualifiers (D1, D3)
Actionable fix applied: Implementation plan corrected to use ctx.addAttribute (singular) matching actual API at session-start.ts:27
Dashboard status: warning (hallucination 0.09 in 0.05-0.1 band)