The dashboard showed tool_correctness: 0.89 — CRITICAL. Alerts fired. And the system was doing exactly what it should: working through the tail end of a three-phase TypeScript migration. The session that followed was devoted to fixing the alert system itself: documenting why this happens, building a classification framework for expected anomalies, and laying the implementation groundwork so that future migrations don’t look like incidents.

Four deliverables came out of it — a Jekyll article for practitioners, a technical implementation plan, a statistical reference document, and backlog entries — and this report evaluates how well each one did its job.

Quality Scorecard

Seven metrics. Three from rule-based telemetry analysis, four from LLM-as-Judge evaluation of the session outputs. Together they form a complete picture of how well this session did its job.

The Headline

 RELEVANCE        ███████████████████░  0.96   healthy
 FAITHFULNESS     ██████████████████░░  0.91   healthy
 COHERENCE        ███████████████████░  0.94   healthy
 HALLUCINATION    ██████████████████░░  0.09   warning  (lower is better)
 TOOL ACCURACY    ████████████████████  0.99   healthy
 EVAL LATENCY     ████████████████████  0.005s healthy
 TASK COMPLETION  ░░░░░░░░░░░░░░░░░░░░   N/A   —

Dashboard status: warning — hallucination at 0.09 sits in the 0.05–0.1 warning band. The judge traced this to forward-looking quantitative estimates (per-type baseline values, false positive reduction percentages) stated without consistent “estimated” qualifiers throughout the documents. Not fabricated — these are reasonable engineering projections — but care is needed to distinguish observed from projected values.
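For illustration, the band logic behind that status line can be sketched as a tiny classifier. The 0.05–0.1 warning band is stated in this report; treating scores above 0.1 as critical is an assumption for this sketch, not a documented threshold.

```python
def hallucination_status(score: float) -> str:
    """Map a hallucination score (lower is better) to a dashboard band.

    Bands: <0.05 healthy, 0.05-0.1 warning (per this report);
    >0.1 critical is an assumed cutoff for illustration.
    """
    if score < 0.05:
        return "healthy"
    if score <= 0.10:
        return "warning"
    return "critical"
```

At 0.09, the session lands in the warning band, which is exactly what the dashboard reported.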

How We Measured

Tool correctness, evaluation latency, and task completion were derived automatically from OpenTelemetry trace spans. Every tool call emits a span; the rule engine checks whether it succeeded and how long it took. Task completion is marked N/A because this session used Write/Edit/Bash directly rather than TaskCreate/TaskUpdate — there was no task lifecycle data from which to compute a ratio.

The content quality metrics come from LLM-as-Judge evaluation — a G-Eval pattern where an AI judge reads the session’s four output files and scores them on relevance, faithfulness, coherence, and hallucination. The judge verified commit hashes against git log, checked file paths, and validated statistical formulas before scoring.

Per-Output Breakdown

Each output was evaluated independently, then aggregated:

| Document | Relevance | Faithfulness | Coherence | Hallucination |
| --- | --- | --- | --- | --- |
| Jekyll article (migration anomaly, ~1,600 words) | 0.97 | 0.88 | 0.96 | 0.12 |
| Implementation plan (/tmp/anomaly-implementation-plan.md) | 0.97 | 0.91 | 0.94 | 0.09 |
| Classification reference (EXPECTED_ANOMALY_CLASSIFICATION.md) | 0.98 | 0.90 | 0.97 | 0.10 |
| Backlog entries (ANO-1, ANO-2 in BACKLOG.md) | 0.93 | 0.95 | 0.88 | 0.05 |
| Session Average | 0.96 | 0.91 | 0.94 | 0.09 |

What the Judge Found

Strongest piece: the classification reference doc. Coherence 0.97 — the taxonomy, statistical methods, and threshold tables flow into each other naturally. The MAD formula and CUSUM parameters were verified against the canonical sources (Iglewicz & Hoaglin 1993, Montgomery 2019 — both real, both correctly cited). The ctx.addAttribute implementation pattern matches the actual hooks API. Edge cases (MAD=0, N<5) are handled explicitly. This is a document that could be handed to another engineer and used immediately.
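The MAD-based modified Z-score the reference doc describes can be sketched as follows. The 0.6745 constant is the standard Iglewicz & Hoaglin scaling that makes MAD comparable to a standard deviation under normality; returning 0.0 for the MAD=0 edge case is this sketch's choice, and the reference doc may specify a different fallback.

```python
import statistics


def modified_z(value: float, history: list[float]) -> float:
    """Modified Z-score via median absolute deviation (Iglewicz & Hoaglin 1993).

    More robust to outliers than a mean/stddev Z-score, which matters when
    the history itself contains anomalous sessions.
    """
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    if mad == 0:
        # Degenerate history (all values identical): no spread to score
        # against. Fallback choice is illustrative only.
        return 0.0
    return 0.6745 * (value - med) / mad
```

A value of 0.80 against a history clustered around 0.90 with MAD 0.01 scores roughly -6.7 — well past the usual |3.5| flag threshold, which is the point: migration-style degradation is obvious under this lens.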

Most precise claim in the session: The global thresholds cited across all four documents — tool_correctness: 0.95, token_burn: 200K, hook_latency_p95: 500ms, task_completion: 0.60 — were all verified against check-thresholds.sh and constants.ts. When the article says “the 0.95 threshold fired during a migration session,” that is a true statement traceable to the actual configuration.

The hallucination warning: The per-type baseline table in the Jekyll article presents migration session characteristics (tool_correctness median 0.88, avg latency 2.1s, avg token burn 150K) as observed values. But session.type tagging does not exist yet — those values are engineering estimates based on qualitative experience with the TS migration, not measured medians from tagged session data. The article’s roadmap section acknowledges this correctly, but the table header does not carry a “(projected)” qualifier. Same pattern in the classification reference’s proposed per-type threshold table. The backlog entries avoid this issue entirely, scoring faithfulness 0.95 — the entries make no quantitative claims, just accurately summarize what was built and what is pending.

Coherence note on the backlog: The BACKLOG.md header still reads “Last Updated: 2026-02-16 (Session 3)” after the new ANO-1/ANO-2 entries were appended. Minor, but a reader scanning the header would miss the 2026-02-23 additions.

Session Telemetry

| Metric | Value |
| --- | --- |
| Session ID | 406a06d8-7daf-4970-937a-3f415a3a3f01 |
| Date | 2026-02-23 |
| Model | Claude Opus 4.6 |
| Total Spans | 141 |
| Tool Calls | 101 |
| Input Tokens | 43,246 |
| Output Tokens | 52,424 |
| Cache Read Tokens | 7,917,462 |
| Total Tokens (excl. cache read) | 95,670 |
| Hooks Observed | session-start, builtin-post-tool, builtin-pre-tool, agent-post-tool, agent-pre-tool, plugin-post-tool, plugin-pre-tool, notification, skill-activation-prompt, error-handling-reminder, token-metrics-extraction |

The cache read token count (7.9M) reflects context reuse across a multi-file read-heavy session — the hooks system, existing backlog format, Jekyll article conventions, and session-start.ts were all read before writing, amortizing that context across all four output files.

Methodology Notes

Session 406a06d8 was identified from today’s trace file (traces-2026-02-23.jsonl) by matching session IDs across 141 spans. The compute-metrics script calculated tool_correctness from builtin-post-tool and mcp-post-tool spans. No task lifecycle spans were present, so task_completion returns null rather than a ratio.
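A minimal sketch of that span-matching step, assuming spans are one-per-line JSON objects carrying a session.id attribute — the actual trace schema may nest this differently:

```python
import json


def session_span_count(path: str, session_id: str) -> int:
    """Count spans in a JSONL trace file belonging to one session.

    Assumed record shape (illustrative): {"attributes": {"session.id": "..."}}
    """
    count = 0
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the trace file
            span = json.loads(line)
            if span.get("attributes", {}).get("session.id") == session_id:
                count += 1
    return count
```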

The four output files were read in full by the LLM-as-Judge agent before scoring. Claims were cross-checked against: git log (commit hashes), ~/.claude/hooks/handlers/session-start.ts (attribute API), ~/.claude/skills/otel-improvement/scripts/check-thresholds.sh (threshold values), and standard statistical references (MAD, CUSUM). The judge flagged forward-looking estimates as a faithfulness risk but did not classify them as hallucinations — the distinction being that the estimates are reasonable extrapolations from observed behavior, not invented figures.

The note about ctx.addAttribute (singular, line 27) vs ctx.addAttributes (plural, as originally written in the implementation plan) flags a real API discrepancy — the plan’s code snippets have been corrected to use the singular form, matching the actual call at line 27.


Appendix A: OTEL Data

Trace Summary

| Field | Value |
| --- | --- |
| Trace file | ~/.claude/telemetry/traces-2026-02-23.jsonl |
| Session ID | 406a06d8-7daf-4970-937a-3f415a3a3f01 |
| Total spans | 141 |
| Tool spans | 101 |
| Hook types | 11 |

Hook Span Distribution

| Hook | Role |
| --- | --- |
| session-start | Environment checks, git status, task restore |
| builtin-post-tool | Write/Edit/Bash success tracking |
| builtin-pre-tool | Permission checks |
| agent-post-tool | Subagent completion tracking |
| agent-pre-tool | Subagent launch tracking |
| plugin-post-tool | Plugin call completion |
| plugin-pre-tool | Plugin call launch |
| notification | Alert dispatch |
| skill-activation-prompt | Skill matching |
| error-handling-reminder | Error pattern detection |
| token-metrics-extraction | Token usage capture |

Token Breakdown

| Category | Tokens |
| --- | --- |
| Input | 43,246 |
| Output | 52,424 |
| Cache Read | 7,917,462 |
| Total (excl. cache) | 95,670 |

Rule-Based Metrics

| Metric | Value | Threshold | Status |
| --- | --- | --- | --- |
| tool_correctness | 0.9901 | 0.95 | healthy |
| evaluation_latency | 0.005s | 1.0s | healthy |
| task_completion | N/A | 0.90 | N/A (no task lifecycle spans) |

Appendix B: Readability Scores Index

Textstat analysis of all session outputs (front matter excluded).

D5 — Quality Report (this document)

| Metric | Score |
| --- | --- |
| Flesch Reading Ease | 32.1 |
| Flesch-Kincaid Grade | 11.6 |
| Gunning Fog Index | 14.1 |
| SMOG Index | 12.6 |
| Coleman-Liau Index | 17.0 |
| Automated Readability Index | 11.9 |
| Dale-Chall Score | 14.2 |
| Linsear Write | 14.6 |
| Consensus Grade | 11th-12th grade |
| Reading Time | 32s |
| Word Count | 284 |
| Sentence Count | 25 |
| Avg Sentence Length | 11.4 words |

D1 — Jekyll Article (Expected Anomalies in OTEL Migration Metrics)

| Metric | Score |
| --- | --- |
| Flesch Reading Ease | 35.7 |
| Flesch-Kincaid Grade | 11.8 |
| Gunning Fog Index | 14.8 |
| SMOG Index | 13.6 |
| Coleman-Liau Index | 15.8 |
| Automated Readability Index | 12.7 |
| Dale-Chall Score | 13.8 |
| Linsear Write | 7.3 |
| Consensus Grade | 13th-14th grade |
| Reading Time | 2m 33s |
| Word Count | 1,647 |
| Sentence Count | 115 |
| Avg Sentence Length | 14.3 words |

Readability Summary

Both documents score in the “difficult” range (Flesch Reading Ease 32-36), consistent with technical writing for practitioners who understand OTEL, SRE, and statistical methods. The Jekyll article has a higher consensus grade (13th-14th) due to longer sentences and more domain-specific vocabulary. The report is slightly more accessible (11th-12th) due to shorter tabular content and bullet structure.


Appendix C: Session Summary

Deliverables

| ID | Artifact | Location | Status |
| --- | --- | --- | --- |
| D1 | Jekyll article — expected anomalies in OTEL migration metrics | ~/code/personal-site/_work/2026-02-23-expected-anomalies-in-otel-migration-metrics.md | Written |
| D2 | Backlog entries — ANO-1 (implementation), ANO-2 (reference doc) | docs/BACKLOG.md | Appended |
| D3 | Classification reference — session type taxonomy, MAD Z-score, CUSUM, per-type thresholds | docs/EXPECTED_ANOMALY_CLASSIFICATION.md | Written |
| D4 | Implementation plan — session-start.ts, summarize-session.py, check-thresholds.sh, update-scorecard.sh | /tmp/anomaly-implementation-plan.md | Written |
| D5 | Quality report (this document) | _reports/2026-02-23-migration-anomaly-otel-expected-degradation-classification.md | Published |

LLM-as-Judge Scores

| Metric | D1 Article | D2 Backlog | D3 Reference | D4 Plan | Average |
| --- | --- | --- | --- | --- | --- |
| Relevance | 0.97 | 0.93 | 0.98 | 0.97 | 0.96 |
| Faithfulness | 0.88 | 0.95 | 0.90 | 0.91 | 0.91 |
| Coherence | 0.96 | 0.88 | 0.97 | 0.94 | 0.94 |
| Hallucination | 0.12 | 0.05 | 0.10 | 0.09 | 0.09 |

Key Findings

  • Strongest output: D3 (classification reference) — coherence 0.97, immediately usable by another engineer
  • Highest faithfulness: D2 (backlog entries) — 0.95, no quantitative claims to verify
  • Hallucination source: Forward-looking per-type baseline estimates presented without consistent “projected” qualifiers (D1, D3)
  • Actionable fix applied: Implementation plan corrected to use ctx.addAttribute (singular) matching actual API at session-start.ts:27
  • Dashboard status: warning (hallucination 0.09 in 0.05-0.1 band)