Six Sessions, One Design Spec: Aggregate Telemetry for LLM Explainability Dashboard
How does a 1,463-line frontend design spec come into existence? Not in a single sitting. Over the course of eight days, six Claude Code sessions wove together platform research, codebase audits, regulatory analysis, and UX pattern extraction – then distilled it all into a production specification for an LLM evaluation explainability dashboard. This report traces the telemetry footprint of that entire arc, from the first Wiz.io research scrape on February 6th to the final git commit at 19:23 EST on Valentine’s Day.
Quality Scorecard
Seven metrics. Three from rule-based telemetry analysis across all six contributing sessions, four from LLM-as-Judge evaluation of the four deliverable documents. Together they form a complete picture of how well this multi-session workflow performed.
The Headline
RELEVANCE ██████████████████░░ 0.91 healthy
FAITHFULNESS ██████████████████░░ 0.89 healthy
COHERENCE ███████████████████░ 0.94 healthy
HALLUCINATION ███████████████████░ 0.06 warning (lower is better)
TOOL ACCURACY ████████████████████ 1.00 healthy
EVAL LATENCY ████████████████████ 3.9ms healthy
TASK COMPLETION ████████████████████ 1.00 healthy
Dashboard status: warning – Hallucination at 0.06 sits just above the 0.05 healthy threshold. A single function-reference inaccuracy (computeExecutiveView() vs the actual unified computeRoleView()) and unverifiable CHI 2025 citation content account for the score.
How We Measured
The first three metrics – tool correctness, evaluation latency, and task completion – were derived automatically from OpenTelemetry trace spans emitted by Claude Code’s hook pipeline. Every tool call (Write, Edit, Bash, TaskCreate, TaskUpdate, TaskOutput) produces pre/post spans; the rule engine checks builtin.success and measures duration. These metrics are aggregated across all six sessions (630 total spans, 393 tool spans).
The content quality metrics come from LLM-as-Judge evaluation using a G-Eval pattern. An AI judge read all four deliverable documents in full and cross-referenced claims against the actual codebase (quality-metrics.ts, llm-as-judge.ts, backends/index.ts), the source research documents, and external references (OTel attribute names, regulatory article numbers, platform feature claims). Line-level verification was performed where references cited specific code locations.
Per-Output Breakdown
Each output was evaluated independently, then aggregated:
| Document | Relevance | Faithfulness | Coherence | Hallucination |
|---|---|---|---|---|
llm-explainability-design.md (1,463 lines) | 0.95 | 0.88 | 0.96 | 0.08 |
llm-explainability-research.md (research) | 0.95 | 0.85 | 0.94 | 0.08 |
wiz-io-security-explainability-ux.md (research) | 0.82 | 0.90 | 0.93 | 0.05 |
quality-dashboard-ux-review.md (gap analysis) | 0.93 | 0.91 | 0.94 | 0.03 |
| Session Average | 0.91 | 0.89 | 0.94 | 0.06 |
What the Judge Found
Coherence was the standout signal (0.94 avg). All four documents demonstrate disciplined internal structure. The design spec uses a repeating pattern throughout its 16 sections: every component includes an anatomy diagram, TypeScript props interface, state definitions, accessibility notes, and data source mapping. Cross-references between documents use precise section numbers (e.g., “[Research Section 3, Pattern 1]”, “[UX Review Gap G1]”) – all verified as accurate.
The one faithfulness slip. The design spec references computeExecutiveView(), computeOperatorView(), and computeAuditorView() as three separate functions (line 1387). The codebase actually uses a single computeRoleView(summary, role) function. The types exist separately, so this is not fabrication – but it could mislead an implementer. This single inaccuracy accounts for most of the 0.08 hallucination score on the design spec.
Section 16 (Feature Engineering) is original work. The statistical methods – Gini coefficient for coverage uniformity, Pearson R for correlation discovery, composite quality index with configurable weights – are properly applied but extend beyond the source research. They are clearly labeled as “Proposed” but blur the boundary between “translating existing findings” and “original design contributions.”
The UX Review scored lowest hallucination (0.03). Because it makes claims primarily about the existing codebase, every assertion was directly verifiable. MetricConfigBuilder, AlertThreshold, and TriggeredAlert interfaces were all confirmed. Implementation status commit hashes provide an audit trail.
The Wiz.io research scored lowest relevance (0.82). Much of the document describes security-specific features (CSPM, CWPP, attack paths) that serve as contextual background rather than directly applicable patterns. The abstracted design patterns – toxic combinations, progressive disclosure, role-based views – are the pieces that directly informed the design spec.
Session Telemetry
Aggregate
| Metric | Value |
|---|---|
| Contributing Sessions | 6 |
| Date Range | 2026-02-06 to 2026-02-14 |
| Primary Model | claude-opus-4-6 (120 LLM calls) |
| Secondary Model | claude-haiku-4-5 (27 LLM calls) |
| Total Spans | 630 |
| Tool Calls | 393 (success: 393, failed: 0) |
| Input Tokens | 434,772 |
| Output Tokens | 831,080 |
| Cache Read Tokens | 753,838,706 |
| Cache Creation Tokens | 49,218,250 |
| Commit | e00ab1b |
Per-Session Breakdown
| # | Session ID (short) | Phase | Duration | Spans | Tool Calls | Role |
|---|---|---|---|---|---|---|
| S1 | 452e6359 | Research | 10 min | 8 | 3 | Wiz.io UX scraping (webscraping-research-analyst) |
| S2 | 919e6917 | Research | 7 hours | 130 | 68 | Main research: 8 subagent phases, iterative doc refinement |
| S3 | eea5c092 | Design | 2 hours | 311 | 195 | Orchestrator: code review + design spec, 5 tasks completed |
| S4 | 769b5ef9 | Design | 18 min | 21 | 10 | Fetch CHI conference + regulatory source material |
| S5 | bd0dd9fe | Design | 2.2 hours | 68 | 51 | Explore dashboard UI data flow |
| S6 | dbbe3b2e | Design | 16 min | 33 | 28 | Final file creation + git commit |
Tool Usage (Aggregate)
| Tool | Count | Sessions Used In |
|---|---|---|
| Edit | 135 | S2, S3, S4, S5, S6 |
| Bash | 139 | S3, S4, S5, S6 |
| TaskUpdate | 42 | S2, S3 |
| TaskOutput | 28 | S1, S2, S3, S4, S5 |
| TaskCreate | 16 | S2, S3 |
| Write | 14 | S1, S2, S3, S4, S6 |
Token Usage by Phase
| Phase | Model | LLM Calls | Input | Output | Cache Read | Cache Creation |
|---|---|---|---|---|---|---|
| Research (Feb 6) | opus-4-6 | 68 | 349,534 | 487,678 | 618,134,575 | 40,552,678 |
| Research (Feb 6) | haiku-4-5 | 5 | 538 | 5,034 | 8,678,825 | 1,235,592 |
| Design (Feb 14) | opus-4-6 | 52 | 84,622 | 338,116 | 125,320,108 | 6,959,494 |
| Design (Feb 14) | haiku-4-5 | 22 | 78 | 252 | 1,705,198 | 470,486 |
Session Timeline
Feb 6 11:50 ━━━━━━━━ S2: Research Main (130 spans, ~7h) ━━━━━━━━━━━ 18:55
Feb 6 12:14 ━━ S1: Wiz.io Research (8 spans, 10m) ━━ 12:24
Feb 14 16:55 ━━━━━━━ S5: Dashboard Explore (68 spans, ~2.2h) ━━━━━━━━ 19:10
Feb 14 16:57 ━━━━━━━ S3: Main Design (311 spans, ~2h) ━━━━━━━━ 18:55
Feb 14 18:56 ━━━━━ S4: Fetch Sources (21 spans, 18m) ━━━━━ 19:13
Feb 14 19:11 ━━━━━━ S6: Commit (33 spans, 16m) ━━━━━━ 19:27
^ commit e00ab1b @ 19:23
Rule-Based Metrics (Per Session)
| Session | tool_correctness | eval_latency (ms) | task_completion | Total Spans | Tool Spans |
|---|---|---|---|---|---|
S1 452e6359 | 1.00 | 2.96 | – | 8 | 2 |
S2 919e6917 | 1.00 | 2.89 | 0.00* | 130 | 68 |
S3 eea5c092 | 1.00 | 4.68 | 1.00 | 311 | 195 |
S4 769b5ef9 | 1.00 | 4.17 | – | 21 | 8 |
S5 bd0dd9fe | 1.00 | 4.82 | – | 68 | 48 |
S6 dbbe3b2e | 1.00 | 3.72 | – | 33 | 28 |
| Aggregate | 1.00 | 3.94 | 1.00 | 630 | 393 |
*S2’s task_completion = 0.00 is a telemetry tracking artifact: the session created 6 tasks via TaskCreate but completion signals were emitted via TaskUpdate spans that did not carry the expected builtin.task_status=completed attribute in the hook data. The design orchestrator session (S3) correctly tracked all 5 tasks to completion.
Evaluation Coverage
| Session | Rule-Based Evals | LLM-as-Judge Evals | Notes |
|---|---|---|---|
S1 452e6359 | 7 (4 latency, 2 correctness, 1 completion) | – | Short subagent, fully evaluated |
S2 919e6917 | 114 (61 latency, 51 correctness, 2 completion) | – | Heaviest evaluation coverage |
| S3-S6 (Design) | 0 | – | Evaluation pipeline stopped at 21:36 UTC, 19 min before design sessions began |
| All outputs | – | 4 outputs scored | LLM-as-Judge evaluated all deliverable documents |
Methodology Notes
- Telemetry source: Local JSONL files at
~/.claude/telemetry/(traces, logs, evaluations) supplemented by SigNoz Cloud query for cross-validation. - Session identification: Sessions were identified by correlating
session.idattributes in hook trace spans with git commit timestamps andagent.descriptionfields. File-path-level attribution was not available in hook spans; sessions were linked to outputs via temporal proximity togit commitand agent description matching. - Token metrics limitation: Token usage spans (
hook:token-metrics-extraction) do not carrysession.idand were attributed by phase time window rather than individual session. This means token numbers represent the full activity in each time window, not strictly the design-doc work. - Evaluation gap: The rule-based evaluation pipeline (
telemetry-rule-engine) stopped processing at 21:36 UTC on Feb 14. All four design-phase sessions (S3-S6) began after this cutoff and have zero rule-based evaluations in the evaluation JSONL files. The per-session rule-based metrics reported here are computed directly from trace span attributes, not from the evaluation pipeline. - Task completion interpretation: S2’s 0.00 task_completion reflects the ratio of
TaskUpdatespans withstatus=completedtoTaskCreatespans. The research session used tasks for tracking but completion signals may have been recorded differently. S3’s 1.00 reflects all 5 tasks tracked throughbuiltin.task_idandbuiltin.task_statusattributes. - LLM-as-Judge verification: The judge cross-referenced code-level claims against actual source files, verifying line references (
quality-metrics.ts:1298-1360,llm-as-judge.ts:247-282), function signatures, and interface definitions. Platform feature claims were validated against cited source URLs where possible.