Closing the Gaps: Hook Telemetry Fix Session
When observability tooling itself has observability gaps, the problem is self-referential in an uncomfortable way. This session set out to fix seven telemetry gaps in the Claude Code hook system — legacy field references letting errors slip by silently, missing OTEL spans leaving MCP session failures untracked, and a cache path mismatch that meant error reminders were reading from the wrong directory. Then a code reviewer ran on the changes, found four more issues, and those got fixed too. By the end, 917 tests passed, the dist was rebuilt, and a clean set of commits landed on main.
Quality Scorecard
Seven metrics. Three from rule-based telemetry analysis, four from LLM-as-Judge evaluation of the session outputs. Together they form a complete picture of how well this session did its job.
The Headline
RELEVANCE ██████████████████░░ 0.94 healthy
FAITHFULNESS ███████████████████░ 0.96 healthy
COHERENCE ███████████████████░ 0.96 healthy
HALLUCINATION ███████████████████░ 0.03 healthy (lower is better)
TOOL ACCURACY ███████████████████░ 0.97 healthy
EVAL LATENCY ████████████████████ 1.4ms healthy
TASK COMPLETION ████████████████████ 1.00 healthy
Dashboard status: healthy, with all seven metrics in the green. Tool accuracy is 0.97: 95 of 98 calls succeeded, and the 3 failures (Read and Bash calls) were non-blocking. LLM judge scores cluster tightly around 0.95 across relevance, faithfulness, and coherence, with hallucination at 0.03.
How We Measured
The first three metrics (tool correctness, shown as TOOL ACCURACY in the headline; evaluation latency; and task completion) were derived automatically from OpenTelemetry trace spans emitted by the hook system itself. Every tool call produces a hook:builtin-post-tool span with a builtin.success attribute; the rule engine aggregates those across the session.
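The aggregation described above can be sketched as a simple success ratio over post-tool spans. This is a minimal illustration, not the rule engine's actual code: the `HookSpan` shape, `makeSpan`, and `toolCorrectness` are hypothetical names, and only the span name and the `builtin.success` attribute come from this report.

```typescript
// Hypothetical span shape; the real hook spans likely carry more fields.
interface HookSpan {
  name: string;
  attributes: Record<string, string | number | boolean>;
}

// Hypothetical helper that builds a post-tool span for the example.
function makeSpan(success: boolean): HookSpan {
  return {
    name: "hook:builtin-post-tool",
    attributes: { "builtin.success": success },
  };
}

// Fraction of post-tool spans whose builtin.success attribute is true.
// Returns null when there are no post-tool spans, to avoid dividing by zero.
function toolCorrectness(spans: HookSpan[]): number | null {
  const toolSpans = spans.filter((s) => s.name === "hook:builtin-post-tool");
  if (toolSpans.length === 0) return null;
  const succeeded = toolSpans.filter(
    (s) => s.attributes["builtin.success"] === true,
  ).length;
  return succeeded / toolSpans.length;
}

// Rebuild this session's counts: 95 successful calls and 3 failures.
const sessionSpans: HookSpan[] = [
  ...Array.from({ length: 95 }, () => makeSpan(true)),
  ...Array.from({ length: 3 }, () => makeSpan(false)),
];

const correctness = toolCorrectness(sessionSpans);
console.log(correctness === null ? "n/a" : correctness.toFixed(2)); // 95 / 98 rounds to 0.97
```

Returning null for an empty session keeps "no tool calls" distinct from "every tool call failed".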
The content quality metrics come from LLM-as-Judge evaluation — a G-Eval pattern where an AI judge reads the session’s outputs and scores them along four criteria. For this session, the judge evaluated six outputs: four changed source files in ~/.claude/hooks/, the updated mcp-status.ts with its new @deprecated annotation, and the BACKLOG.md entries documenting deferred findings. All file paths and line numbers cited in the evaluation were cross-checked against live source.
Per-Output Breakdown
Each output was evaluated independently, then aggregated:
| Output | Relevance | Faithfulness | Coherence | Hallucination |
|---|---|---|---|---|
| handlers/post-tool.ts | 0.97 | 0.98 | 0.96 | 0.02 |
| lib/output-analyzer.ts | 0.96 | 0.95 | 0.97 | 0.03 |
| handlers/stop.ts | 0.96 | 0.97 | 0.95 | 0.03 |
| handlers/notification.ts | 0.93 | 0.92 | 0.95 | 0.05 |
| handlers/mcp-status.ts | 0.88 | 0.98 | 0.96 | 0.02 |
| docs/BACKLOG.md | 0.94 | 0.96 | 0.95 | 0.04 |
| Session Average | 0.94 | 0.96 | 0.96 | 0.03 |
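The Session Average row is consistent with an unweighted mean of the six per-output rows. The sketch below reproduces that arithmetic; treating the aggregation as a plain mean is an assumption inferred from the numbers, and `JudgeScores`, `mean`, and `sessionAverage` are illustrative names.

```typescript
interface JudgeScores {
  relevance: number;
  faithfulness: number;
  coherence: number;
  hallucination: number;
}

// Per-output scores copied from the table above, in row order.
const perOutput: JudgeScores[] = [
  { relevance: 0.97, faithfulness: 0.98, coherence: 0.96, hallucination: 0.02 }, // handlers/post-tool.ts
  { relevance: 0.96, faithfulness: 0.95, coherence: 0.97, hallucination: 0.03 }, // lib/output-analyzer.ts
  { relevance: 0.96, faithfulness: 0.97, coherence: 0.95, hallucination: 0.03 }, // handlers/stop.ts
  { relevance: 0.93, faithfulness: 0.92, coherence: 0.95, hallucination: 0.05 }, // handlers/notification.ts
  { relevance: 0.88, faithfulness: 0.98, coherence: 0.96, hallucination: 0.02 }, // handlers/mcp-status.ts
  { relevance: 0.94, faithfulness: 0.96, coherence: 0.95, hallucination: 0.04 }, // docs/BACKLOG.md
];

// Arithmetic mean of a non-empty list.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumption: every output counts equally, regardless of file size.
function sessionAverage(rows: JudgeScores[]): JudgeScores {
  return {
    relevance: mean(rows.map((r) => r.relevance)),
    faithfulness: mean(rows.map((r) => r.faithfulness)),
    coherence: mean(rows.map((r) => r.coherence)),
    hallucination: mean(rows.map((r) => r.hallucination)),
  };
}

const avg = sessionAverage(perOutput);
console.log(
  avg.relevance.toFixed(2),
  avg.faithfulness.toFixed(2),
  avg.coherence.toFixed(2),
  avg.hallucination.toFixed(2),
); // prints: 0.94 0.96 0.96 0.03
```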
What the Judge Found
post-tool.ts scored the highest overall (0.97 relevance, 0.98 faithfulness). The judge verified that resolveOutputText() is now called once, stored as resolvedOutput, and passed to handleMcpStatus in both MCP call sites — the exact fix the plan specified. No invented claims.
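The resolve-once pattern the judge verified can be sketched as follows. Only the names resolveOutputText, resolvedOutput, and handleMcpStatus come from this report; their signatures, the `ToolEvent` type, and the two call-site conditions are hypothetical stand-ins, since the real code in handlers/post-tool.ts is not reproduced here.

```typescript
// Hypothetical event shape for illustration only.
interface ToolEvent {
  toolName: string;
  output: unknown;
}

// Counter used only to show that resolution happens once per event.
let resolveCalls = 0;

// Stand-in: turn an arbitrary tool output into text.
function resolveOutputText(output: unknown): string {
  resolveCalls += 1;
  return typeof output === "string" ? output : JSON.stringify(output) ?? "";
}

// Stand-in: record what each MCP call site received.
const mcpStatusLog: string[] = [];
function handleMcpStatus(toolName: string, resolvedOutput: string): void {
  mcpStatusLog.push(`${toolName}: ${resolvedOutput}`);
}

function onPostTool(event: ToolEvent): void {
  // Resolve once and reuse the same string at both call sites,
  // rather than calling resolveOutputText separately at each one.
  const resolvedOutput = resolveOutputText(event.output);

  // Hypothetical call site 1: any MCP-prefixed tool.
  if (event.toolName.startsWith("mcp__")) {
    handleMcpStatus(event.toolName, resolvedOutput);
  }
  // Hypothetical call site 2: output that mentions an MCP session.
  if (resolvedOutput.includes("mcp session")) {
    handleMcpStatus(event.toolName, resolvedOutput);
  }
}

onPostTool({ toolName: "mcp__search", output: "mcp session closed" });
console.log(resolveCalls, mcpStatusLog.length); // prints: 1 2
```

Resolving once avoids repeated work on large outputs and guarantees both call sites see the same text.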
output-analyzer.ts earned a 0.97 coherence score. The judge confirmed each claimed addition: the 50KB pre-slice at line 86, the five new switch cases, the classifyError(safeOutput) fix at line 173, and the KillShell word-boundary regex. The judge also correctly noted that TaskOutput omits the \bfailed\b pattern from the generic path, which the BACKLOG records as M3: a conscious deferral, not a mistake.
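As a rough illustration of why the pre-slice and the word-boundary pattern matter, the sketch below caps the text before running a regex over it and matches "failed" only as a whole word. It is not the code from lib/output-analyzer.ts: `MAX_ANALYZED_CHARS` and `looksLikeFailure` are hypothetical, and the 50KB limit is approximated here as a character count rather than a byte count.

```typescript
// Assumption: the 50KB cap is approximated as 50 * 1024 UTF-16 code units.
const MAX_ANALYZED_CHARS = 50 * 1024;

// Word boundaries keep identifiers like "failedTests" from matching.
const FAILED_PATTERN = /\bfailed\b/i;

function looksLikeFailure(rawOutput: string): boolean {
  // Slice first so the regex never scans an arbitrarily large output.
  const safeOutput = rawOutput.slice(0, MAX_ANALYZED_CHARS);
  return FAILED_PATTERN.test(safeOutput);
}

console.log(looksLikeFailure("Build failed with 2 errors")); // true
console.log(looksLikeFailure("ran failedTests=0 suites")); // false
console.log(looksLikeFailure("x".repeat(60_000) + " failed")); // false: the match lies past the cap
```

The third call shows the trade-off of pre-slicing: an error that appears only after the cap goes undetected, in exchange for bounded analysis cost.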
stop.ts had the most changes (five distinct fixes) and held up well at 0.97 faithfulness. The judge traced each fix to its line: getTypeCheckCacheDir at line 419, instrumentHook wrapping at lines 956–975, isolated callLlmJudge try/catch blocks at lines 774–786, the parse-failure throw at line 877, and sessionIdForMcp narrowing at lines 946–947.
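Two of these fixes, the isolated try/catch blocks and the parse-failure throw, follow a common pattern: fail loudly inside each evaluation, but contain the failure so the other evaluations still run. The sketch below shows that pattern under stated assumptions. `parseScore` and `runIsolated` are hypothetical, the judge response format is invented for the example, and the real callLlmJudge is presumably asynchronous, whereas this version is synchronous to keep the illustration short.

```typescript
// Hypothetical parser: throw on malformed judge output instead of
// silently returning a default score.
function parseScore(raw: string): number {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error(`unparseable judge response: ${raw.slice(0, 40)}`);
  }
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("judge response is not an object");
  }
  const score = (parsed as { score?: unknown }).score;
  if (typeof score !== "number") {
    throw new Error("judge response has no numeric score");
  }
  return score;
}

// Run each task in its own try/catch so one failure cannot abort the rest.
function runIsolated<T>(tasks: Record<string, () => T>): {
  results: Record<string, T>;
  failures: Record<string, string>;
} {
  const results: Record<string, T> = {};
  const failures: Record<string, string> = {};
  for (const [name, task] of Object.entries(tasks)) {
    try {
      results[name] = task();
    } catch (err) {
      failures[name] = err instanceof Error ? err.message : String(err);
    }
  }
  return { results, failures };
}

const outcome = runIsolated({
  relevance: () => parseScore('{"score": 0.94}'),
  coherence: () => parseScore("not json"), // fails, but is contained
  faithfulness: () => parseScore('{"score": 0.96}'),
});
console.log(outcome.results); // relevance and faithfulness scores survive
console.log(outcome.failures); // coherence is reported as a parse failure
```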
mcp-status.ts scored lowest on relevance (0.88). The judge noted that adding a @deprecated JSDoc is a minimal intervention for a dead-code gap; a more complete fix might have removed or re-routed the function. That is a fair read, and it is why the finding was deliberately treated as low priority.
The BACKLOG entries were praised for accuracy: all four line number references were verified against live source, with a minor uncertainty on the H3 budget claim (the 1.5 cents per call figure would need cross-checking against quality-budget.ts).
Session Telemetry
| Metric | Value |
|---|---|
| Session ID | 99daa61f-c413-4692-9dab-4fa7f604e3ca |
| Date | 2026-02-28 |
| Model | claude-opus-4-6 |
| Duration | 38.3 min (15:20–15:58) |
| Total Spans | 169 |
| Tool Calls | 98 (success: 95, failed: 3) |
| Input Tokens | 755 |
| Output Tokens | 87,061 |
| Cache Read Tokens | 33,759,937 |
| Top Tools | Bash (44), Read (20), Edit (16), Grep (9), TaskUpdate (6), TaskCreate (3) |
| Commits | 6 (4 reviewed by post-commit-review hook) |
| Hooks Observed | builtin-post-tool, builtin-pre-tool, code-structure, error-handling-reminder, notification, post-commit-review, session-start, skill-activation-prompt, telemetry-alert-evaluation, token-metrics-extraction, tsc-check |
The cache read token count (33.8M) reflects heavy context reuse across the hook files: most of the large files were read once and then reused across subsequent edits without being re-read.
Methodology Notes
- tool_correctness computed from hook:builtin-post-tool spans with builtin.success attribute. The 3 failures were non-blocking: one Bash call and two Read calls (likely probing files that didn’t exist yet).
- task_completion reflects the in-session task tracker ratio: 3 tasks created, 3 completed = 1.0. The evaluation log records 0.5 from mid-session stop events captured before tasks resolved; the final ratio is 1.0.
- evaluation_latency is the median span duration (1.4ms) across all 169 hook spans — a measure of hook overhead, not LLM latency.
- LLM-as-Judge evaluation used the genai-quality-monitor agent with G-Eval scoring. Six source outputs were evaluated; each file/line reference was cross-checked against live source for faithfulness scoring.
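For reference, the latency and completion figures in these notes reduce to two small calculations. The sketch below is illustrative only: `median` and `taskCompletion` are hypothetical helpers, and the sample durations are made up to show how the median behaves, not taken from the session's spans.

```typescript
// Median of a non-empty list of span durations (milliseconds).
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Completed-to-created ratio; null when no tasks were created.
function taskCompletion(created: number, completed: number): number | null {
  return created === 0 ? null : completed / created;
}

// Made-up durations: one slow hook does not move the median.
console.log(median([0.9, 1.2, 1.4, 1.6, 250])); // 1.4

// This session's tracker counts: 3 created, 3 completed.
console.log(taskCompletion(3, 3)); // 1
```

Using the median rather than the mean keeps a single slow hook from dominating the overhead figure.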