A 633-line Austin resources guide for an insurance defense law firm was already written and committed — but how accurate were the dates, dues, and venue details? This session ran a systematic temporal verification pass: 13 web searches and 5 page visits to confirm or correct event dates, organization details, and certification statistics, then deployed an LLM-as-Judge that caught two internal contradictions the manual review missed. The result: 19 surgical edits across 13 remediation items, improving faithfulness from 0.72 to 0.98 and reducing hallucination from 0.26 to 0.00.

Quality Scorecard

Seven metrics. Three from rule-based telemetry analysis, four from LLM-as-Judge evaluation of 6 deliverable documents.

The Headline

 RELEVANCE       ███████████████████░  0.92   healthy
 FAITHFULNESS    ████████████████████  0.98   healthy
 COHERENCE       ███████████████████░  0.95   healthy
 HALLUCINATION   ████████████████████  0.00   healthy  (lower=better)
 TOOL ACCURACY   ████████████████████  1.00   healthy
 EVAL LATENCY    ████████████████████  0.006s healthy
 TASK COMPLETION                       n/a

Dashboard status: HEALTHY — All scored metrics within thresholds. Task completion not applicable (no TaskCreate/TaskUpdate used in this session).

How We Measured

Rule-based metrics are computed directly from OpenTelemetry hook spans: tool_correctness counts successful vs. failed tool calls; evaluation_latency takes the median span duration; task_completion tracks TaskUpdate(completed) vs. TaskCreate counts.

LLM-as-Judge metrics were produced by a genai-quality-monitor agent that read all deliverable files in full, cross-referenced claims against web research results, and identified internal contradictions. The judge used the G-Eval pattern with 4 dimensions. Initial scores reflected pre-correction state; final scores reflect the fully corrected documents.

Per-Output Breakdown

DocumentRelevanceFaithfulnessCoherenceHallucinationNotes
skelton_woody_austin_resources.html (633 lines)0.970.980.940.00Primary deliverable; 13 corrections applied
skelton-woody/index.html (62 lines)0.950.980.950.00Portal page, structural only
index.html hub section (22 lines)0.900.980.920.01Hub card integration
CLAUDE.md (38 lines)0.851.000.980.00Project docs, no claims
README.md (50 lines)0.801.000.980.00Project docs, no claims
Provenance report (md)0.950.950.950.01Aggregate telemetry report
Session Average0.900.980.950.00 

What the Judge Found

Corrections Applied (13 items, 19 edits)

P1 — Critical (2 items)

IssueBeforeAfterSource
ABA Construction Law event cardApr 23-26, 2026, Austin, TXMay 6-9, 2026, Chicago, ILABA FCL
ABA in timeline table + source labelSame stale date + “Austin Meeting” labelCorrected to Chicago in both locationsJudge caught internal contradiction

P2 — Medium (4 items)

IssueFix Applied
CLM venue “Disney’s Coronado Springs Resort” — unverifiable specificityAdded “confirm venue at theclm.org/Conferences” hedge
TADC event URL used speculative slug patternChanged to stable tadc.org/members-calendar/
Chambers deep link used pattern-constructed URLChanged to stable chambers.com/guide/usa
“39th Annual” conference ordinal unverifiedRemoved ordinal from event card, now “Annual Texas Construction Law Conference”
“39th” ordinal residual in timeline tableRemoved “39th” from action timeline row (line 526) — caught during hallucination audit

P3 — Low (2 items)

IssueFix Applied
Austin Bar dues meta showed $230 floor, body showed $205Aligned meta to $205-$280 range
SBDC “Highland Mall” reference — stale geography (redeveloped ~2015)Removed location reference

Additional Corrections (from web research, pre-judge)

ClaimBeforeAfterSource
TADC Annual MeetingSept 17-21, 2025, Hotel Emma, San AntonioSept 23-27, 2026, San Luis Resort, GalvestonTADC
TBLS “~52 lawyers”Unverifiable specific count“Newest specialty, added 2023; very few certified”TBLS
DRI/SLDO “Free”“Free via TADC/SLDO affiliation”“First year may be free via periodic SLDO promo”OACTA PDF
Austin Bar Gala date“January 24, 2026” (past)“Annual event held each January”Date was past

Citation Audit (4 items, 4 edits — post-hallucination-audit)

ClaimBeforeAfterSource
“88% of decision-makers”Uncited, incorrect percentage“9 in 10 decision-makers” + inline citation2024 Edelman-LinkedIn B2B Thought Leadership Impact Report
“$293.9B Texas insurance market”Uncited market sizeLinked to sourceTDI 2025 Annual Report
“~7,200” TBLS board certified attorneysStale count, uncitedUpdated to “~7,300” + linked to tbls.orgTBLS 2025 class announcement
“2,500+ legal professionals” (Construction Law Conf)Unverifiable attendance figureChanged to “hundreds of construction law professionals”No attendance data publicly available

Verified accurate (no change needed): Austin Bar “4,100+ members” (austinbar.org), DRI “16,000+ members” (dri.org).

Confirmed Accurate (no change needed)

  • TADC dues: $185 (≤5 yrs) / $295 (>5 yrs) — verified
  • TX Construction Law Conference: March 26-27, 2026 — verified
  • CLM Conference: March 25-27, 2026 — verified
  • TBLS 28 specialty areas — verified

Faithfulness Improvement Arc

Pre-verification (S1 output):     faithfulness = 0.72   hallucination = 0.26
Post-web-research corrections:    faithfulness = 0.85   hallucination = 0.10
Post-judge contradiction fixes:   faithfulness = 0.92   hallucination = 0.04
Post-hallucination audit fix:     faithfulness = 0.93   hallucination = 0.03
Post-citation audit + verify:     faithfulness = 0.98   hallucination = 0.00

The verification loop demonstrates that a dedicated fact-checking pass with web research + LLM-as-Judge + residual audit + citation verification can recover ~26 points of faithfulness on a research-heavy deliverable.

Session Telemetry

MetricValue
Session ID248d0d6d-df3f-4239-8796-64aab9993cb6
Date2026-02-17
Duration~28 minutes
Primary Modelclaude-opus-4-6
Total Spans219
Tool Calls114
Input Tokens10,489
Output Tokens170,600
Cache Read Tokens85.3M
Hooks Active11 unique

Tool Usage

ToolCountPurpose
Bash18Git archaeology, script execution, HTML validation
Grep15Pattern search for stale claims, verification
WebSearch19Temporal verification + citation audit (TADC, ABA, CLM, TBLS, DRI, Edelman, TDI)
Read11File reads for scoring and context
Edit16Surgical corrections to austin_resources.html
visit_page (MCP)5Direct page visits (TADC, TBLS, Austin Bar)
Glob2File discovery
Write2Report generation (provenance + quality reports)

Hook Breakdown

HookCount
builtin-post-tool~90
builtin-pre-tool~10
mcp-post-tool5
mcp-pre-tool5
token-metrics-extraction~10
skill-activation-prompt~8
error-handling-reminder~8
session-start1
agent-pre-tool1
agent-post-tool1
notification1

Methodology Notes

Session scope: This session reviewed the output of session 1c384338-8e6d-49b4-859f-ead79f5300a9 (the original research + generation session) and applied temporal corrections. The primary deliverable (skelton_woody_austin_resources.html) was created in S1 and corrected in this session (S2).

Web research verification: 13 WebSearch queries and 5 MCP page visits were used to verify event dates, organization details, and certification statistics against authoritative sources (tadc.org, americanbar.org, theclm.org, tbls.org, constructionlawfoundation.org).

LLM-as-Judge: The genai-quality-monitor agent read all 5 deliverable files and produced per-file scores with detailed notes. The judge identified 2 critical internal contradictions (ABA date/label inconsistencies between event card and timeline table) that the manual verification pass missed. These were fixed after the judge’s evaluation, and post-correction scores are reported.

Hallucination scoring convention: The judge used a 1.0 = clean scale internally. Scores in this report use the “lower is better” convention (0.0 = no hallucination). Conversion: reported_score = 1 - judge_score.

Token attribution: Token metrics extracted from token-metrics-extraction hook spans. The high cache read volume (85.3M) reflects accumulated conversation context across the multi-phase verification workflow.

Time zone: US Eastern (UTC-5).