Session Date: 2026-02-28
Project: observability-toolkit
Focus: Skill auditor scorer script + agent definition optimization
Session Type: Implementation

Executive Summary

Created scripts/skill-audit-scorer.py, a standalone Python scorer that implements the full 6-dimension skill-auditor rubric (max 60 points) against plugin-level OTEL telemetry. Successfully tested against all 18 skills, producing a grade distribution of 9 B’s and 9 C’s. Then collapsed both agent-auditor.md and skill-auditor.md definitions by removing verbose rubric tables now superseded by their respective scorer scripts, reducing injection token cost by ~38%.

MetricValue
Skills scored18
Grade distributionB: 9, C: 9
agent-auditor.md lines137 → 85 (-38%)
skill-auditor.md lines146 → 89 (-39%)
skill-audit-scorer.py lines712
Commitdfddc95

Problem Statement

The agent-auditor had a scorer script (agent-audit-scorer.py) for automated Phase 1-3 scoring, but the skill-auditor lacked a parallel script — forcing the agent to parse telemetry inline via embedded Python snippets during every invocation. Both agent definitions also carried full rubric tables (scoring thresholds, formulas, sub-metric weights) that were now redundant with their scripts, inflating injection size by ~700 tokens each.

Implementation Details

1. Skill Audit Scorer (scripts/skill-audit-scorer.py)

Built as a parallel to agent-audit-scorer.py with key adaptations for skill telemetry:

Dimension 1 (Telemetry Health) — Uses hook:plugin-pre-tool / plugin-post-tool spans keyed on plugin.name. Replaces agent error rate + trend with:

  • Empty injection rate (25%) — plugin.output_size == 0 / total
  • Activation conversion (15%) — from hook:skill-activation-prompt event spans

Dimension 6 (Efficiency) — Replaces agent duration/amplification/output/background with:

  • Injection size percentile (35%) — ranked against all skills
  • Model cost (30%) — haiku=10, sonnet=7, opus=3, none=5
  • Tool scope (35%) — fewer restricted tools = higher score

Frontmatter parsing — Handles YAML list allowed-tools (not CSV tools), tags, full model IDs (claude-sonnet-4-6sonnet).

Self-contained — no dependency on agent-telemetry-audit.py (unlike the agent scorer).

2. Agent Definition Collapse

Replaced verbose per-dimension sections (scoring tables, formulas, inline code) with a compact 6-row summary table in both files:

| # | Dimension | What it measures |
|---|-----------|-----------------|
| 1 | Telemetry Health | Usage frequency, error rate, ... |
| 2 | Definition Quality | Frontmatter completeness, ... |
...

Additional changes to skill-auditor.md:

  • Removed inline Python telemetry snippet (14 lines)
  • Added Phase 4: Rewrite (mirroring agent-auditor)
  • Added Write/Edit to tools (needed for rewrites)
  • Updated guidelines: read-only → rewrite C/D skills
  • Added scorer script reference section

3. Restored Scripts

scripts/agent-audit-scorer.py and scripts/parse/agent-telemetry-audit.py existed in git history (commits e25d148, b66021d) but were not in the working tree. Restored from those commits.

Testing and Verification

Ran --all --days 30 successfully across all 18 skills:

python3 scripts/skill-audit-scorer.py --all --days 30

Top scorers: ralph-wiggum (46.9/60 B), backlog-migrate (45.7/60 B), git-commit-smart (44.0/60 B). Lowest: otel-output-provenance (26.2/60 C), session-report (28.9/60 C).

Single-skill JSON output verified:

python3 scripts/skill-audit-scorer.py otel-session-summary --json --days 7

Token Count Comparison

File~TokensChars
agent-auditor.md~1,6085,510
skill-auditor.md~1,6695,401
skill-audit-scorer.py~8,60026,311

The script is ~5x larger than either agent definition, but runs at zero LLM inference cost.

Files Modified/Created

FileActionLines
scripts/skill-audit-scorer.pyCreated712
scripts/agent-audit-scorer.pyRestored584
scripts/parse/agent-telemetry-audit.pyRestored318
~/.claude/agents/agent-auditor.mdCollapsed137 → 85
~/.claude/agents/skill-auditor.mdCollapsed + rewrite phase146 → 89

Decisions

  • Choice: Self-contained skill scorer (no import of agent-telemetry-audit)
  • Rationale: Plugin spans use different span names and attribute keys than agent spans; sharing the parser would require abstraction that adds complexity for minimal benefit
  • Alternative: Refactor agent-telemetry-audit to support both span types
  • Trade-off: ~50 lines of duplicated file-discovery logic

References

  • scripts/skill-audit-scorer.py — new skill scorer
  • scripts/agent-audit-scorer.py — existing agent scorer (parallel structure)
  • ~/.claude/agents/skill-auditor.md — skill auditor agent definition
  • ~/.claude/agents/agent-auditor.md — agent auditor agent definition
  • src/lib/skill-auditor-scoring.ts — TypeScript scoring module (shared functions)
  • Commit dfddc95 — session commit