How does a 1,463-line frontend design spec come into existence? Not in a single sitting. Over the course of eight days, six Claude Code sessions wove together platform research, codebase audits, regulatory analysis, and UX pattern extraction – then distilled it all into a production specification for an LLM evaluation explainability dashboard. This report traces the telemetry footprint of that entire arc, from the first Wiz.io research scrape on February 6th to the final git commit at 19:23 EST on Valentine’s Day.


Quality Scorecard

Seven metrics. Three from rule-based telemetry analysis across all six contributing sessions, four from LLM-as-Judge evaluation of the four deliverable documents. Together they form a complete picture of how well this multi-session workflow performed.

The Headline

 RELEVANCE       ██████████████████░░  0.91   healthy
 FAITHFULNESS    ██████████████████░░  0.89   healthy
 COHERENCE       ███████████████████░  0.94   healthy
 HALLUCINATION   ███████████████████░  0.06   warning  (lower is better)
 TOOL ACCURACY   ████████████████████  1.00   healthy
 EVAL LATENCY    ████████████████████  3.9ms  healthy
 TASK COMPLETION ████████████████████  1.00   healthy

Dashboard status: warning – Hallucination at 0.06 sits just above the 0.05 healthy threshold. A single function-reference inaccuracy (computeExecutiveView() vs the actual unified computeRoleView()) and unverifiable CHI 2025 citation content account for the score.


How We Measured

The first three metrics – tool correctness, evaluation latency, and task completion – were derived automatically from OpenTelemetry trace spans emitted by Claude Code’s hook pipeline. Every tool call (Write, Edit, Bash, TaskCreate, TaskUpdate, TaskOutput) produces pre/post spans; the rule engine checks builtin.success and measures duration. These metrics are aggregated across all six sessions (630 total spans, 393 tool spans).
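As a rough illustration, the rule checks reduce to a small fold over tool spans. The `ToolSpan` shape below is a simplified assumption for the sketch, not the real hook span schema:

```typescript
// Simplified sketch of the rule-based checks described above. The ToolSpan
// shape is an assumption; real hook spans carry many more attributes.
interface ToolSpan {
  tool: string;                                  // e.g. "Write", "Bash"
  durationMs: number;                            // post span minus pre span
  attributes: { "builtin.success": boolean };
}

function ruleBasedMetrics(spans: ToolSpan[]) {
  const total = spans.length;
  const succeeded = spans.filter(s => s.attributes["builtin.success"]).length;
  return {
    // fraction of tool calls whose builtin.success flag was true
    toolCorrectness: total === 0 ? 1 : succeeded / total,
    // mean pre/post span duration across tool calls
    evalLatencyMs:
      total === 0 ? 0 : spans.reduce((sum, s) => sum + s.durationMs, 0) / total,
  };
}
```

With 393 successful calls out of 393, this fold yields the 1.00 tool accuracy and ~3.9ms mean latency reported in the scorecard.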

The content quality metrics come from LLM-as-Judge evaluation using a G-Eval pattern. An AI judge read all four deliverable documents in full and cross-referenced claims against the actual codebase (quality-metrics.ts, llm-as-judge.ts, backends/index.ts), the source research documents, and external references (OTel attribute names, regulatory article numbers, platform feature claims). Line-level verification was performed where references cited specific code locations.
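In the G-Eval pattern, the final score is not a single sampled rating but an expectation over the probabilities the judge model assigns to each candidate rating. A minimal sketch of that aggregation step (the probabilities below are illustrative, not taken from this run):

```typescript
// G-Eval-style scoring: weight each candidate rating by the probability the
// judge model assigned to it, then take the expectation. Probabilities here
// are illustrative placeholders.
function gEvalScore(ratingProbs: Record<string, number>): number {
  return Object.entries(ratingProbs).reduce(
    (score, [rating, p]) => score + Number(rating) * p,
    0
  );
}

// A judge that mostly favors 0.9 yields a score near 0.9:
const score = gEvalScore({ "0.8": 0.2, "0.9": 0.5, "1.0": 0.3 });
```

This weighting is why judge scores land on fine-grained values like 0.91 rather than only on the discrete rating steps.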


Per-Output Breakdown

Each output was evaluated independently, then aggregated:

 Document                                         Relevance  Faithfulness  Coherence  Hallucination
 llm-explainability-design.md (1,463 lines)       0.95       0.88          0.96       0.08
 llm-explainability-research.md (research)        0.95       0.85          0.94       0.08
 wiz-io-security-explainability-ux.md (research)  0.82       0.90          0.93       0.05
 quality-dashboard-ux-review.md (gap analysis)    0.93       0.91          0.94       0.03
 Session Average                                  0.91       0.89          0.94       0.06
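The Session Average row is a plain unweighted mean of the four per-document scores, rounded to two decimals. As a quick check (values copied from the table above):

```typescript
// Recompute the Session Average row as an unweighted mean of the four
// per-document scores.
const perDoc: Record<string, number[]> = {
  relevance:     [0.95, 0.95, 0.82, 0.93],
  faithfulness:  [0.88, 0.85, 0.90, 0.91],
  coherence:     [0.96, 0.94, 0.93, 0.94],
  hallucination: [0.08, 0.08, 0.05, 0.03],
};

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

for (const [metric, scores] of Object.entries(perDoc)) {
  console.log(metric, mean(scores).toFixed(4));
}
// relevance 0.9125, faithfulness 0.8850, coherence 0.9425, hallucination 0.0600
// -> rounded to two decimals, these match the reported 0.91 / 0.89 / 0.94 / 0.06
```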

What the Judge Found

Coherence was the standout signal (0.94 avg). All four documents demonstrate disciplined internal structure. The design spec uses a repeating pattern throughout its 16 sections: every component includes an anatomy diagram, TypeScript props interface, state definitions, accessibility notes, and data source mapping. Cross-references between documents use precise section numbers (e.g., “[Research Section 3, Pattern 1]”, “[UX Review Gap G1]”) – all verified as accurate.

The one faithfulness slip. The design spec references computeExecutiveView(), computeOperatorView(), and computeAuditorView() as three separate functions (line 1387). The codebase actually uses a single computeRoleView(summary, role) function. The types exist separately, so this is not fabrication – but it could mislead an implementer. This single inaccuracy accounts for most of the 0.08 hallucination score on the design spec.

Section 16 (Feature Engineering) is original work. The statistical methods – Gini coefficient for coverage uniformity, Pearson R for correlation discovery, composite quality index with configurable weights – are properly applied but extend beyond the source research. They are clearly labeled as “Proposed” but blur the boundary between “translating existing findings” and “original design contributions.”
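For concreteness, the coverage-uniformity idea could look like the sketch below: a Gini coefficient over per-slice coverage counts, where 0 means perfectly uniform coverage and values near 1 mean coverage concentrated in a few slices. This is a hypothetical illustration of the "Proposed" Section 16 metric, not code from the spec:

```typescript
// Hypothetical sketch: Gini coefficient over per-slice coverage counts.
// 0 = perfectly uniform; near 1 = concentrated in a few slices.
function gini(values: number[]): number {
  const xs = [...values].sort((a, b) => a - b);
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  if (n === 0 || mean === 0) return 0;
  // Sorted-array form: sum_i (2i - n - 1) * x_i / (n^2 * mean), i is 1-based.
  const weighted = xs.reduce((acc, x, i) => acc + (2 * (i + 1) - n - 1) * x, 0);
  return weighted / (n * n * mean);
}

gini([5, 5, 5, 5]);  // uniform coverage  -> 0
gini([0, 0, 0, 10]); // one slice carries everything -> 0.75
```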

The UX review scored lowest on hallucination (0.03). Because it makes claims primarily about the existing codebase, every assertion was directly verifiable. The MetricConfigBuilder, AlertThreshold, and TriggeredAlert interfaces were all confirmed, and implementation-status commit hashes provide an audit trail.

The Wiz.io research scored lowest on relevance (0.82). Much of the document describes security-specific features (CSPM, CWPP, attack paths) that serve as contextual background rather than directly applicable patterns. The abstracted design patterns – toxic combinations, progressive disclosure, role-based views – are the pieces that directly informed the design spec.


Session Telemetry

Aggregate

 Metric                  Value
 Contributing Sessions   6
 Date Range              2026-02-06 to 2026-02-14
 Primary Model           claude-opus-4-6 (120 LLM calls)
 Secondary Model         claude-haiku-4-5 (27 LLM calls)
 Total Spans             630
 Tool Calls              393 (success: 393, failed: 0)
 Input Tokens            434,772
 Output Tokens           831,080
 Cache Read Tokens       753,838,706
 Cache Creation Tokens   49,218,250
 Commit                  e00ab1b

Per-Session Breakdown

 #   Session ID (short)  Phase     Duration   Spans  Tool Calls  Role
 S1  452e6359            Research  10 min     8      3           Wiz.io UX scraping (webscraping-research-analyst)
 S2  919e6917            Research  7 hours    130    68          Main research: 8 subagent phases, iterative doc refinement
 S3  eea5c092            Design    2 hours    311    195         Orchestrator: code review + design spec, 5 tasks completed
 S4  769b5ef9            Design    18 min     21     10          Fetch CHI conference + regulatory source material
 S5  bd0dd9fe            Design    2.2 hours  68     51          Explore dashboard UI data flow
 S6  dbbe3b2e            Design    16 min     33     28          Final file creation + git commit

Tool Usage (Aggregate)

 Tool        Count  Sessions Used In
 Edit        135    S2, S3, S4, S5, S6
 Bash        139    S3, S4, S5, S6
 TaskUpdate  42     S2, S3
 TaskOutput  28     S1, S2, S3, S4, S5
 TaskCreate  16     S2, S3
 Write       14     S1, S2, S3, S4, S6

Token Usage by Phase

 Phase             Model      LLM Calls  Input    Output   Cache Read   Cache Creation
 Research (Feb 6)  opus-4-6   68         349,534  487,678  618,134,575  40,552,678
 Research (Feb 6)  haiku-4-5  5          538      5,034    8,678,825    1,235,592
 Design (Feb 14)   opus-4-6   52         84,622   338,116  125,320,108  6,959,494
 Design (Feb 14)   haiku-4-5  22         78       252      1,705,198    470,486

Session Timeline

Feb 6  11:50 ━━━━━━━━ S2: Research Main (130 spans, ~7h) ━━━━━━━━━━━ 18:55
Feb 6  12:14 ━━ S1: Wiz.io Research (8 spans, 10m) ━━ 12:24

Feb 14 16:55 ━━━━━━━ S5: Dashboard Explore (68 spans, ~2.2h) ━━━━━━━━ 19:10
Feb 14 16:57 ━━━━━━━ S3: Main Design (311 spans, ~2h) ━━━━━━━━ 18:55
Feb 14 18:56 ━━━━━ S4: Fetch Sources (21 spans, 18m) ━━━━━ 19:13
Feb 14 19:11 ━━━━━━ S6: Commit (33 spans, 16m) ━━━━━━ 19:27
                                      ^ commit e00ab1b @ 19:23

Rule-Based Metrics (Per Session)

 Session      tool_correctness  eval_latency (ms)  task_completion  Total Spans  Tool Spans
 S1 452e6359  1.00              2.96               –                8            2
 S2 919e6917  1.00              2.89               0.00*            130          68
 S3 eea5c092  1.00              4.68               1.00             311          195
 S4 769b5ef9  1.00              4.17               –                21           8
 S5 bd0dd9fe  1.00              4.82               –                68           48
 S6 dbbe3b2e  1.00              3.72               –                33           28
 Aggregate    1.00              3.94               1.00             630          393

*S2’s task_completion = 0.00 is a telemetry tracking artifact: the session created 6 tasks via TaskCreate but completion signals were emitted via TaskUpdate spans that did not carry the expected builtin.task_status=completed attribute in the hook data. The design orchestrator session (S3) correctly tracked all 5 tasks to completion.
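Under this rule, task_completion reduces to a ratio of completed TaskUpdate spans over TaskCreate spans. A sketch under an assumed span shape (real spans carry builtin.task_id / builtin.task_status rather than a bare status field):

```typescript
// Sketch of the task_completion rule described in the footnote: completed
// TaskUpdate spans divided by TaskCreate spans. The span shape is a
// simplified assumption, not the real hook span schema.
interface TaskSpan {
  tool: "TaskCreate" | "TaskUpdate";
  status?: string; // stands in for builtin.task_status when present
}

function taskCompletion(spans: TaskSpan[]): number | null {
  const created = spans.filter(s => s.tool === "TaskCreate").length;
  if (created === 0) return null; // no tasks -> metric not applicable
  const completed = spans.filter(
    s => s.tool === "TaskUpdate" && s.status === "completed"
  ).length;
  return completed / created;
}
```

By this rule, S2's six TaskCreate spans with no status=completed updates score 0.00, while S3's five fully tracked tasks score 1.00.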


Evaluation Coverage

 Session         Rule-Based Evals                                LLM-as-Judge Evals  Notes
 S1 452e6359     7 (4 latency, 2 correctness, 1 completion)      –                   Short subagent, fully evaluated
 S2 919e6917     114 (61 latency, 51 correctness, 2 completion)  –                   Heaviest evaluation coverage
 S3-S6 (Design)  0                                               –                   Evaluation pipeline stopped at 21:36 UTC, 19 min before design sessions began
 All outputs     –                                               4 outputs scored    LLM-as-Judge evaluated all deliverable documents

Methodology Notes

  • Telemetry source: Local JSONL files at ~/.claude/telemetry/ (traces, logs, evaluations) supplemented by SigNoz Cloud query for cross-validation.
  • Session identification: Sessions were identified by correlating session.id attributes in hook trace spans with git commit timestamps and agent.description fields. File-path-level attribution was not available in hook spans; sessions were linked to outputs via temporal proximity to git commit and agent description matching.
  • Token metrics limitation: Token usage spans (hook:token-metrics-extraction) do not carry session.id and were attributed by phase time window rather than individual session. This means token numbers represent the full activity in each time window, not strictly the design-doc work.
  • Evaluation gap: The rule-based evaluation pipeline (telemetry-rule-engine) stopped processing at 21:36 UTC on Feb 14. All four design-phase sessions (S3-S6) began after this cutoff and have zero rule-based evaluations in the evaluation JSONL files. The per-session rule-based metrics reported here are computed directly from trace span attributes, not from the evaluation pipeline.
  • Task completion interpretation: S2’s 0.00 task_completion reflects the ratio of TaskUpdate spans with status=completed to TaskCreate spans. The research session used tasks for tracking but completion signals may have been recorded differently. S3’s 1.00 reflects all 5 tasks tracked through builtin.task_id and builtin.task_status attributes.
  • LLM-as-Judge verification: The judge cross-referenced code-level claims against actual source files, verifying line references (quality-metrics.ts:1298-1360, llm-as-judge.ts:247-282), function signatures, and interface definitions. Platform feature claims were validated against cited source URLs where possible.
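The time-window attribution described in the token-metrics note amounts to bucketing unattributed spans by phase boundaries. A rough sketch, under assumed span and phase shapes (the real hook:token-metrics-extraction spans carry more fields):

```typescript
// Sketch of time-window token attribution: token spans lack session.id, so
// each span is assigned to whichever phase window contains its timestamp.
// Span and phase shapes here are illustrative assumptions.
interface TokenSpan { timestampMs: number; inputTokens: number }
interface Phase { name: string; startMs: number; endMs: number }

function tokensByPhase(spans: TokenSpan[], phases: Phase[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const phase of phases) totals[phase.name] = 0;
  for (const span of spans) {
    const phase = phases.find(
      p => span.timestampMs >= p.startMs && span.timestampMs < p.endMs
    );
    if (phase) totals[phase.name] += span.inputTokens;
  }
  return totals;
}
```

This is also why the reported phase totals cover all activity in each window, not just the design-doc sessions.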