Session Date: 2026-02-25
Project: ast-grep-mcp
Focus: End-to-end live validation of condense feature tools
Session Type: Validation | Bugfix

Executive Summary

Ran all 6 condense MCP tools against the production tcad-scraper codebase (219 files, 1.1 MB of TypeScript/JS/Python). Every tool executed successfully with zero errors. During validation, discovered and fixed a serialization gap where per_language_stats in condense_pack output was missing byte-level metrics — only line counts were emitted. After the fix, all 117 condense tests continue to pass.

The ai_chat strategy achieved 15.7% actual reduction (43,827 tokens saved) against a theoretical 85% estimate, confirming the JS/TS surface extractor’s brace-matching heuristic retains more code than the theoretical model assumes — documented as a future improvement area.

Key Metrics

| Metric | Value |
| --- | --- |
| Tools validated | 6 / 6 |
| Target codebase | tcad-scraper (219 files, 1.1 MB) |
| Errors encountered | 0 |
| Bugs found and fixed | 1 (per-language byte stats) |
| Tests passing | 117 / 117 |
| Best actual reduction | 15.7% (ai_chat) |
| Tokens saved (ai_chat) | 43,827 |

Tool-by-Tool Results

1. condense_normalize (batch, 212 TS files)

Processed in 5 batches of 50. Completed in 0.4s.

| Metric | Value |
| --- | --- |
| Files processed | 212 |
| Normalizations applied | 11,494 |
| Files with changes | 210 / 212 |
| Byte delta | +1,774 (0.16% expansion) |

Quote canonicalization was the dominant transform. Net byte expansion is expected — normalization targets compression consistency, not direct size reduction. Top file: continuous-batch-scraper.ts with 1,159 normalizations.
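As an illustration of the dominant transform, a minimal quote-canonicalization pass might look like the following. This is a hedged sketch, not the actual implementation: the real normalizer is AST-aware, while this regex version only handles simple literals with no escapes or embedded quotes.

```python
import re

# Match single-quoted literals that contain no quotes, backslashes, or
# newlines; anything trickier is left for an AST-aware pass.
SIMPLE_SINGLE = re.compile(r"'([^'\"\\\n]*)'")

def canonicalize_quotes(source: str) -> tuple[str, int]:
    """Return (normalized source, number of normalizations applied)."""
    return SIMPLE_SINGLE.subn(lambda m: f'"{m.group(1)}"', source)

src = "import x from 'lib';\nconst a = 'ok';"
out, n = canonicalize_quotes(src)
# n == 2; both literals now use double quotes
```

Counting substitutions per file (via `subn`) is what lets a batch run report totals like the 11,494 normalizations above.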

2. condense_strip (batch, 219 files)

| Metric | Value |
| --- | --- |
| Files processed | 219 |
| Lines removed | 164 |
| Line reduction | 0.40% |
| Files with removals | 20 / 219 |
| Elapsed | 0.6s |

Removed console.log, debugger, print(), and pdb.set_trace statements. Top file: setup-test-db.ts (37 lines removed). Codebase is relatively clean — only 0.4% dead code.
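A line-based sketch of the stripping step, for illustration only: the actual tool is AST-based (so it also catches multi-line calls), and the pattern list here is assumed from the statement types named above.

```python
import re

# Drop whole lines that consist solely of a known debug statement.
DEBUG_LINE = re.compile(
    r"^\s*(console\.log\(.*\);?|debugger;?|print\(.*\)|pdb\.set_trace\(\))\s*$"
)

def strip_debug_lines(source: str) -> tuple[str, int]:
    """Return (stripped source, number of lines removed)."""
    lines = source.splitlines()
    kept = [line for line in lines if not DEBUG_LINE.match(line)]
    return "\n".join(kept), len(lines) - len(kept)

code = "const x = 1;\nconsole.log(x);\ndebugger;\nreturn x;"
out, removed = strip_debug_lines(code)
# removed == 2
```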

3. condense_extract_surface (212 TS files)

| Metric | Value |
| --- | --- |
| Files processed | 212 |
| Condensed lines | 33,938 |
| Reduction | 15.0% |
| Output size | 946,780 chars |
| Elapsed | 22.5s |

Kept only export declarations with brace-matched blocks. Test files with describe/it (no export prefix) fall back to keeping all lines, limiting reduction.
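The heuristic and its fallback can be pictured with a small sketch. `extract_surface` here is a hypothetical simplification of the JS/TS surface extractor, not its actual code; note that, like the real extractor, it keeps entire brace-matched export blocks, bodies included, which is why actual reduction trails the theoretical model.

```python
def extract_surface(source: str) -> list[str]:
    """Keep lines belonging to export declarations and their brace-matched
    blocks. Files with no exports (e.g. describe/it test files) keep all
    lines, matching the observed fallback."""
    lines = source.splitlines()
    if not any(l.lstrip().startswith("export") for l in lines):
        return lines  # fallback: no exports, keep the whole file
    kept, depth, keeping = [], 0, False
    for line in lines:
        if not keeping and line.lstrip().startswith("export"):
            keeping = True
        if keeping:
            kept.append(line)
            depth += line.count("{") - line.count("}")
            if depth <= 0:
                keeping, depth = False, 0
    return kept
```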

4. condense_pack (all 4 strategies)

| Strategy | Condensed | Reduction | Tokens (est.) | Time |
| --- | --- | --- | --- | --- |
| ai_chat | 938,923 B | 15.7% | 234,730 | 10.6s |
| ai_analysis | 1,102,420 B | 1.1% | 275,605 | 19.2s |
| archival | 1,102,420 B | 1.1% | 275,605 | 14.9s |
| polyglot | 938,923 B | 15.7% | 234,730 | 15.5s |

ai_chat and polyglot produce identical output (all files are code, no config/text routing divergence). ai_analysis and archival are identical (both lossless, normalize+strip only).
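One way to picture the equivalences observed above is as ordered pass lists; the pass names below are illustrative, inferred from the results rather than taken from the source.

```python
# Hypothetical strategy-to-pass mapping. The report confirms only that
# ai_analysis and archival are lossless (normalize + strip), while
# ai_chat and polyglot additionally extract the export surface for code
# files; polyglot's config/text routing never diverged on this codebase.
STRATEGIES = {
    "ai_chat":     ["normalize", "strip", "extract_surface"],
    "polyglot":    ["normalize", "strip", "extract_surface"],
    "ai_analysis": ["normalize", "strip"],
    "archival":    ["normalize", "strip"],
}

def explain(strategy: str) -> str:
    """Render a strategy's pass pipeline as a readable arrow chain."""
    return " -> ".join(STRATEGIES[strategy])
```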

Per-language breakdown (ai_chat):

| Language | Files | Reduction |
| --- | --- | --- |
| TypeScript | 212 | 15.6% |
| JavaScript | 4 | 26.4% |
| Python | 3 | 23.5% |

5. condense_estimate

| Strategy | Est. Bytes | Est. Tokens | Theoretical Reduction |
| --- | --- | --- | --- |
| ai_chat | 167,134 | 41,783 | ~85% |
| ai_analysis | 668,538 | 167,134 | ~40% |
| archival | 779,961 | 194,990 | ~30% |
| polyglot | 389,980 | 97,495 | ~65% |

Top reduction candidate: continuous-batch-scraper.ts (1,769 lines, 4.3% of codebase).
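The estimator's arithmetic can be reproduced as a sketch: a fixed theoretical reduction ratio per strategy, then roughly 4 bytes per token. The constant name mirrors STRATEGY_REDUCTION_RATIOS mentioned later in this report; the 1,114,230-byte input is back-computed from the table and is an assumption of this sketch.

```python
# Theoretical reduction per strategy, in percent (integer arithmetic
# keeps the sketch exact and float-free).
STRATEGY_REDUCTION_RATIOS = {
    "ai_chat": 85,
    "ai_analysis": 40,
    "archival": 30,
    "polyglot": 65,
}
BYTES_PER_TOKEN = 4

def estimate(total_bytes: int, strategy: str) -> tuple[int, int]:
    """Return (estimated bytes, estimated tokens) for a strategy."""
    est_bytes = total_bytes * (100 - STRATEGY_REDUCTION_RATIOS[strategy]) // 100
    return est_bytes, est_bytes // BYTES_PER_TOKEN

# estimate(1_114_230, "ai_chat") reproduces the table row: (167134, 41783)
```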

Bonus: condense_normalize on ~/reports/ (32 files)

Also validated against the reports site (JS, Python, CSS files):

| Metric | Value |
| --- | --- |
| Files processed | 32 |
| Normalizations applied | 59 |
| Files with changes | 16 / 32 |
| Byte reduction | 75 (0.02%) |

Bug Found and Fixed

Problem: per_language_stats in condense_pack output only serialized files_processed, original_lines, condensed_lines — missing byte-level metrics entirely. This caused per-language stats to appear as all zeros when accessing original_bytes/condensed_bytes keys.

Root cause: LanguageCondenseStats dataclass had no byte fields, and condense_pack_impl only aggregated line counts per language.

Fix (2 files):

src/ast_grep_mcp/models/condense.py:11-12 — Added fields:

```python
original_bytes: int = 0
condensed_bytes: int = 0
```

src/ast_grep_mcp/features/condense/service.py:399-400 — Aggregate bytes:

```python
stats.original_bytes += file_result["original_bytes"]
stats.condensed_bytes += file_result["condensed_bytes"]
```

src/ast_grep_mcp/features/condense/service.py:431-436 — Serialize with computed reduction:

"original_bytes": s.original_bytes,
"condensed_bytes": s.condensed_bytes,
"reduction_pct": round((1.0 - s.condensed_bytes / s.original_bytes) * 100, 1)

All 117 condense tests pass after the fix.
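For reference, the fixed aggregation and serialization logic can be reproduced as a self-contained sketch. Field and key names follow the fix above; the zero-byte guard is an addition in this sketch and is not claimed to be in the service code.

```python
from dataclasses import dataclass

@dataclass
class LanguageCondenseStats:
    files_processed: int = 0
    original_lines: int = 0
    condensed_lines: int = 0
    original_bytes: int = 0   # added by the fix
    condensed_bytes: int = 0  # added by the fix

def serialize(s: LanguageCondenseStats) -> dict:
    """Emit per-language byte metrics plus a computed reduction_pct."""
    return {
        "original_bytes": s.original_bytes,
        "condensed_bytes": s.condensed_bytes,
        "reduction_pct": round((1.0 - s.condensed_bytes / s.original_bytes) * 100, 1)
        if s.original_bytes else 0.0,  # guard: avoid dividing by zero
    }

stats = LanguageCondenseStats(original_bytes=1000, condensed_bytes=843)
# serialize(stats)["reduction_pct"] == 15.7
```

Before the fix, only the three line-count fields existed, so reading `original_bytes`/`condensed_bytes` from the serialized stats yielded zeros.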

Estimate vs Actual Gap

| Strategy | Estimated Reduction | Actual Reduction | Gap |
| --- | --- | --- | --- |
| ai_chat | ~85% | 15.7% | 69.3 pp |
| ai_analysis | ~40% | 1.1% | 38.9 pp |

The estimator uses theoretical STRATEGY_REDUCTION_RATIOS constants. The actual JS/TS surface extractor keeps entire brace-matched export blocks (including function bodies), and test files with describe/it fall back to keeping everything. This is the primary improvement target for the next phase.

Files Modified

| File | Change |
| --- | --- |
| src/ast_grep_mcp/models/condense.py:11-12 | Added original_bytes, condensed_bytes fields |
| src/ast_grep_mcp/features/condense/service.py:399-400 | Aggregate byte counts per language |
| src/ast_grep_mcp/features/condense/service.py:429-436 | Serialize byte metrics + reduction_pct |

Git Context

d97d782 refactor(condense): remove unused CondenseDefaults constants and standardize field naming
1ffc15b feat(condense): implement P9 — condense_train_dictionary tool (zstd)
9a09893 fix(condense): address critical/high code review findings

6. condense_train_dictionary (TypeScript)

| Metric | Value |
| --- | --- |
| Dictionary path | .condense/dictionaries/dict_typescript.zdict |
| Dictionary size | 112,640 B (110 KB) |
| Samples used | 200 |
| Total sample bytes | 999,080 B (~1 MB) |
| Est. compression improvement | 15.0% |
| Elapsed | 13.2s |

Trained a zstd dictionary on 200 TypeScript files from tcad-scraper, written to tcad-scraper/.condense/dictionaries/dict_typescript.zdict. The dictionary captures repeated cross-file patterns (import paths, type annotations, test boilerplate) that standard zstd cannot exploit. Usage:

```sh
zstd -D .condense/dictionaries/dict_typescript.zdict <file>
```

The 15% estimated improvement applies on top of standard zstd compression ratios — most effective for small-to-medium files (<100KB) with consistent coding patterns across the codebase.

References