A comprehensive guide to minimizing context usage, optimizing token consumption, and maximizing efficiency when working with Claude Code and the Claude API.


Executive Summary

Context management is now recognized as “effectively the #1 job” for engineers building AI agents. As Anthropic emphasizes: “Claude is already smart enough–intelligence is not the bottleneck, context is.” Research shows that for many LLMs, performance degrades significantly as context length increases, with 11 out of 12 tested models dropping below 50% performance at 32k tokens.

Key metrics from optimization efforts:

  • 54-62% reduction in startup tokens through tiered documentation
  • 85% reduction in MCP tool overhead with Tool Search
  • 84% reduction in token consumption with context editing
  • 90% cost reduction possible with prompt caching
  • 37-85% token reduction with Programmatic Tool Calling (PTC)

1. Token Optimization Strategies

1.1 Token-Efficient Tool Use

Claude 4 models have built-in token-efficient tool use that saves an average of 14% in output tokens (up to 70%) while also reducing latency. For Claude Sonnet 3.7 users, enable the beta header:

anthropic-beta: token-efficient-tools-2025-02-19
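
With the Python SDK, the header can be passed per request via extra_headers. A minimal sketch (the tool definition is illustrative):

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    # Opt in to token-efficient tool use (needed on Claude Sonnet 3.7 only)
    extra_headers={"anthropic-beta": "token-efficient-tools-2025-02-19"},
)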

1.2 Programmatic Tool Calling (PTC)

PTC allows Claude to write code that calls tools programmatically within a code execution environment, rather than requiring round-trips through the model for each tool invocation.
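
The sketch below is a conceptual illustration of why this saves tokens, not the real PTC API: the helper stands in for a sandboxed tool call, the loop runs entirely inside the execution environment, and only the final printed line re-enters Claude's context.

def fetch_expense_total(employee_id: str) -> float:
    # Stand-in for a tool call executed inside the sandbox (hypothetical).
    return 120.0

employee_ids = [f"emp-{i}" for i in range(1000)]

# 1,000 intermediate results stay inside the execution environment...
total = sum(fetch_expense_total(eid) for eid in employee_ids)

# ...and only this one line of output returns to the model's context.
print(f"Total expenses: ${total:,.2f}")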

Benefits:

  • 85.6% token reduction demonstrated (110,473 to 15,919 tokens)
  • 37% average reduction on complex research tasks
  • Keeps intermediate results out of Claude’s context
  • Substantially reduces end-to-end latency

1.3 Dynamic/Lazy Context Loading

Instead of loading verbose documentation upfront, use triggers to load detailed context on-demand.
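
As an illustration of the pattern (the trigger words, file paths, and loader itself are hypothetical):

DOC_TRIGGERS = {
    "deploy": "docs/deployment.md",
    "migration": "docs/database-migrations.md",
    "auth": "docs/authentication.md",
}

def build_context(task: str, base_context: str) -> str:
    # Start from a lean, always-loaded core (e.g., a minimal CLAUDE.md).
    context = base_context
    for keyword, path in DOC_TRIGGERS.items():
        if keyword in task.lower():
            # Inject the detailed doc only when the task mentions its trigger.
            with open(path) as f:
                context += f"\n\n# {path}\n{f.read()}"
    return context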

Results from one project:

  • Initial context reduced from 7,584 to 3,434 tokens (54% reduction)
  • Improved tool discovery and enforcement
  • Monthly cost for 5 developers doing 100 sessions/day dropped to $72 (62% token savings)

1.4 Hybrid Model Approach

Reserve expensive, high-reasoning models (Claude Opus 4.5) for:

  • High-level planning
  • Architectural design
  • Final code review

Use faster, cheaper models (Sonnet, Haiku) for:

  • High-frequency implementation work
  • Basic syntax validation and linting
  • Simple text transformations
  • Data parsing and quick status checks
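
A minimal sketch of this routing policy; the task types and model IDs are illustrative placeholders, not confirmed identifiers:

TASK_MODEL = {
    "planning": "claude-opus-4-5",        # high-level planning
    "architecture": "claude-opus-4-5",    # architectural design
    "final-review": "claude-opus-4-5",    # final code review
    "implementation": "claude-sonnet-4-5",
    "lint": "claude-haiku-4-5",
    "parsing": "claude-haiku-4-5",
}

def pick_model(task_type: str) -> str:
    # Default to the cheapest tier; escalate only for listed high-value work.
    return TASK_MODEL.get(task_type, "claude-haiku-4-5")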

2. Efficient Tool Usage Patterns

2.1 Parallel vs Sequential Tool Calls

Use parallel calls when:

  • Operations are independent with no dependencies
  • Multiple searches or reads can run simultaneously
  • You need to gather information from multiple sources

Use sequential calls when:

  • One operation depends on another’s result
  • Order of execution matters
  • You need to chain operations (e.g., mkdir before cp)

Best Practice: When multiple independent pieces of information are needed and all commands are likely to succeed, make all independent calls in the same request block.

2.2 Batching Strategies

For immediate parallel execution:

# Use async/await to run multiple independent calls concurrently
# All questions run concurrently, completing in roughly the
# time of the slowest individual request
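
A minimal sketch of this pattern with the Python SDK (the model ID is a placeholder):

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def ask(question: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

async def main() -> None:
    questions = [
        "Summarize the failing tests in one sentence.",
        "List the exported functions in src/auth/login.ts.",
        "What does the Dockerfile build stage do?",
    ]
    # All questions run concurrently; wall time is roughly that of the
    # slowest individual request.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"Q: {question}\nA: {answer}\n")

asyncio.run(main())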

For non-urgent bulk operations:

  • Use the Message Batches API (50% cost reduction)
  • Limited to 100,000 requests or 256 MB per batch
  • Most batches complete within 1 hour
  • Ideal for: evaluations, content moderation, data analysis, bulk generation
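
For example, submitting a small moderation batch with the Python SDK (hedged sketch; the model ID is a placeholder):

from anthropic import Anthropic

client = Anthropic()

comments = ["Great post!", "Buy cheap pills now", "This broke my build"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"moderate-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder model ID
                "max_tokens": 64,
                "messages": [{
                    "role": "user",
                    "content": f"Classify as spam or not spam: {comment}",
                }],
            },
        }
        for i, comment in enumerate(comments)
    ]
)
print(batch.id, batch.processing_status)  # poll until "ended", then fetch results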

2.3 Subagent Delegation

Use subagents when:

  • The task produces verbose output you do not need in your main context
  • You want to enforce specific tool restrictions or permissions
  • The work is self-contained and can return a summary
  • Running tests, fetching documentation, or processing log files

Built-in subagent types:

  • Explore: For searching/understanding codebases without making changes
  • General-purpose: For tasks requiring both exploration and modification

Thoroughness levels:

  • quick: Targeted lookups
  • medium: Balanced exploration
  • very thorough: Comprehensive analysis

Limitations:

  • Subagents cannot spawn other subagents
  • Subagents start with a blank slate (“handoff problem”)
  • Provide detailed briefs to avoid “context amnesia”

Pro tip: To maximize subagent usage, explicitly specify which steps should be delegated to subagents in your instructions.
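
For example, an instruction that forces delegation (wording illustrative):

Use an Explore subagent (thoroughness: medium) to map how errors are
handled across src/api/, and have it return at most 10 bullet points.
Do not load the files it reads into the main conversation.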


3. Context Window Management

3.1 Understanding Context Limits

Tier | Context Window | Notes
Standard | 200,000 tokens | Default for most users
Advanced (Tier 4+) | 1,000,000 tokens | Premium pricing applies
Premium pricing threshold | >200K tokens | 2x input, 1.5x output pricing

Critical insight: Avoid using the final 20% of your context window for complex tasks. Quality notably declines for memory-intensive operations.

3.2 Built-in Commands

Command | Purpose | When to Use
/context | Visualizes context usage as colored grid | Before deciding to compact; identify MCP server consumption
/clear | Wipes conversation history | Between tasks; after commits; when <50% of context is relevant
/compact | Summarizes conversation and starts fresh | At 70% capacity; at logical breakpoints; during long sessions
/cost | Shows token usage statistics | To understand patterns and identify optimization opportunities

3.3 Compaction Strategies

Auto-compact: Triggers automatically at ~95% capacity.

Manual compact best practices:

  • Compact at 70% capacity before hitting limits
  • Add custom instructions: /compact focus on authentication logic
  • Compact at logical breakpoints (feature complete, tests passing)

“Document & Clear” method for large tasks:

  1. Have Claude dump its plan and progress into a .md file
  2. /clear the state
  3. Start a new session by telling Claude to read the .md and continue
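
An illustrative handoff file:

# Task: migrate session auth to JWT

## Done
- Replaced session middleware in src/middleware/auth.ts

## Next steps
- Update the login controller
- Run `npm test` and fix any failures

## Key findings
- Token refresh logic lives in src/auth/refresh.ts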

3.4 Context Editing (September 2025 Feature)

Anthropic’s context editing automatically clears stale tool calls while preserving conversation flow. In testing, it enabled agents to complete workflows that would otherwise fail due to context exhaustion while reducing token consumption by 84%.


4. Tool-Specific Optimizations

4.1 File Reading (Read Tool)

Default limits:

  • Maximum: 2,000 lines per read operation
  • Token limit: 25,000 tokens (hardcoded)
  • Lines longer than 2,000 characters are truncated

When files exceed limits:

Use offset and limit parameters to read specific portions of the file,
or use the GrepTool to search for specific content.

Chunking strategies:

  • Focus on one directory at a time
  • Use specific queries: "explain the QueryContext class in velox/core/query.h"
  • Read only the portions you need with offset and limit parameters (see the sketch below)
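
A plain-Python analogue of offset/limit reading (illustrative; not the Read tool itself):

from itertools import islice

def read_chunk(path: str, offset: int = 0, limit: int = 2000) -> str:
    # Return `limit` lines starting at line `offset`, without loading
    # the whole file into memory.
    with open(path) as f:
        return "".join(islice(f, offset, offset + limit))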

Environment variable for larger files:

export MAX_MCP_OUTPUT_TOKENS=250000

Warning: After 2-3 context compactions, Claude may revert to using grep/wc/partial reads instead of complete file reading. Monitor for this behavior.

4.2 Search Tools (Grep, Glob)

Grep Tool best practices:

Technique | Example | Benefit
Use type parameter | type: "py" | More efficient than glob patterns
Use output_mode wisely | files_with_matches (default) | Only returns paths, not content
Use head_limit | head_limit: 10 | Limits results to first N entries
Use literal patterns | -F "literal.string" | Faster than regex for exact matches
Pre-filter by file type | rg "pattern" -t py | Much faster than post-filtering

Glob patterns for filtering:

*.log          # Log files only
!*.min.js      # Exclude minified JS
src/**         # Only src directory tree
*test*         # Include test files
!*node_modules* # Exclude node_modules

Key principle: Always prefer Grep, Glob, or Task tools over direct find/grep bash commands.

4.3 Bash Commands

Output limiting strategies:

# Truncate test output
npm test 2>&1 | tail -30

# Filter for errors/warnings only
npm run build 2>&1 | grep -i "error\|warning" || echo "Build succeeded"

# Limit output to N lines
command | head -100

Configuration:

  • BASH_MAX_OUTPUT_LENGTH: Controls character-based truncation for long outputs

Memory warning: Claude Code stores all bash output in memory for the entire session. Large outputs (90GB+ reported) can crash the application. Always truncate verbose commands.

Implement output truncation in code:

def truncate_output(output: str, max_lines: int = 100) -> str:
    """Cap output at max_lines, noting how many lines were dropped."""
    lines = output.split('\n')
    if len(lines) > max_lines:
        kept = '\n'.join(lines[:max_lines])
        return kept + f'\n... [truncated {len(lines) - max_lines} lines]'
    return output

4.4 When to Use Each Tool

Scenario | Recommended Tool
Find files by name pattern | Glob
Search file contents | Grep
Read known file | Read (with offset/limit for large files)
Execute commands | Bash (with output truncation)
Open-ended exploration | Task/Subagent
Multiple rounds of search | Task tool
Verbose operations | Delegate to subagent

5. Response Formatting

5.1 Requesting Concise Outputs

In CLAUDE.md or prompts:

## Response Guidelines
- Provide concise, actionable responses
- Omit verbose explanations unless requested
- Use bullet points over paragraphs
- Return only relevant code snippets, not entire files
- Summarize large outputs before presenting

5.2 Structured Output Requests

When analyzing code, return:
1. One-line summary
2. Key findings (3-5 bullets max)
3. Recommended actions

Do NOT include:
- Full file contents
- Verbose explanations
- Redundant information

6. Caching and Reuse Strategies

6.1 Prompt Caching (API)

Pricing structure:

Cache Type | Cost vs Base
5-minute cache write | 1.25x
1-hour cache write | 2x
Cache read | 0.1x (90% savings)

Implementation:

# Place static content at the beginning
# Mark end of reusable content with cache_control
# Minimum block size: 1,024 tokens
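
A minimal sketch with the Python SDK (the model ID is a placeholder; cache_control is the API's cache marker):

from anthropic import Anthropic

client = Anthropic()

LONG_STATIC_INSTRUCTIONS = open("docs/style-guide.md").read()  # large, stable text

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,  # needs >= 1,024 tokens to be cacheable
        "cache_control": {"type": "ephemeral"},  # marks the end of the reusable prefix
    }],
    messages=[{"role": "user", "content": "Review src/auth/login.ts against the style guide."}],
)

# Cache performance fields on the usage object
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
print(response.usage.input_tokens)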

Best use cases:

  • Extended conversations with long instructions
  • Uploaded documents
  • Agentic tool use with iterative code changes
  • Talking to books, papers, documentation

Monitor cache performance via response fields:

  • cache_creation_input_tokens
  • cache_read_input_tokens
  • input_tokens

6.2 Avoiding Redundant Operations

Principles:

  1. Read files once, reference by line numbers thereafter
  2. Cache search results mentally; do not repeat the same grep
  3. Use CLAUDE.md for information that persists across sessions
  4. Store findings in external files for multi-session projects

Pattern for large tasks:

1. Search/read once at the start
2. Document findings in a scratchpad file
3. Reference the scratchpad instead of re-reading
4. Clear context while preserving scratchpad

7. Prompt Engineering for Efficiency

7.1 CLAUDE.md Best Practices

What to include:

  • Project context (one-liner orientation)
  • Code style preferences (specific, not vague)
  • Commands (test, build, lint, deploy)
  • Project-specific gotchas and warnings
  • Things Claude should NOT do

What NOT to include:

  • Information needed only occasionally (put in docs/ instead)
  • Verbose explanations
  • Everything marked as “IMPORTANT” (dilutes emphasis)

Structure:

# Project: [One-line description]

## Tech Stack
- [Framework]
- [Database]
- [Key dependencies]

## Commands
- Test: `npm test`
- Build: `npm run build`
- Lint: `npm run lint`

## Code Style
- 2-space indentation
- Named exports preferred
- ES modules (not CommonJS)

## IMPORTANT: Do Not
- Modify the migrations folder directly
- Use any deprecated APIs
- Create excessive comments

File locations (hierarchy order):

  1. Project root CLAUDE.md (shared via version control)
  2. .claude/CLAUDE.md (subdirectory alternative)
  3. ~/.claude/CLAUDE.md (user-level defaults)
  4. CLAUDE.local.md (private, auto-gitignored)

7.2 Writing Efficient Prompts

Minimize back-and-forth by:

  • Providing complete context upfront
  • Specifying expected output format
  • Including constraints and boundaries
  • Listing files/directories to focus on
  • Stating what NOT to do

Example efficient prompt:

Fix the authentication bug in src/auth/login.ts

Context:
- Users report 401 errors on valid credentials
- Issue started after commit abc123
- Related files: src/auth/login.ts, src/middleware/auth.ts

Requirements:
- Do not modify the session schema
- Add debug logging to track the issue
- Write a test case for the fix

Output format:
1. Root cause (1-2 sentences)
2. Code changes (diff format)
3. Test case

7.3 Progressive Disclosure

Let agents navigate and retrieve data autonomously. Each interaction yields context that informs the next decision. Agents can assemble understanding layer by layer, maintaining only what is necessary in working memory.


8. MCP Server Optimization

8.1 The Problem

MCP tool definitions can consume massive context:

  • 5-server setup: ~55K tokens before conversation starts
  • Jira alone: ~17K tokens
  • One reported case: 134K tokens of tool definitions before optimization

8.2 Tool Search

Tool Search was introduced to reduce this overhead by 85% by loading tool definitions on-demand rather than upfront.

Configuration:

# Auto mode (default) - activates when tools exceed threshold
ENABLE_TOOL_SEARCH=auto

# Custom threshold (5%)
ENABLE_TOOL_SEARCH=auto:5

# Disable entirely
ENABLE_TOOL_SEARCH=false

Performance improvements:

  • Opus 4: 49% to 74% accuracy
  • Opus 4.5: 79.5% to 88.1% accuracy
  • 46.9% reduction in total agent tokens (51K to 8.5K)

8.3 Manual Optimization Strategies

Disable unused MCP servers:

  1. Use /context to identify consumption
  2. Disable with @server-name disable or /mcp
  3. Re-enable only when needed

Tool consolidation:

  • Example: mcp-omnisearch reduced from 20 tools (14,214 tokens) to 8 tools (5,663 tokens)
  • Combine similar functionality
  • Build scoped, narrow-purpose servers

Present MCP servers as code APIs (see the sketch after this list):

  • Agents write code to interact with servers
  • Load only needed tools
  • Process data in execution environment before passing to model
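
A hedged sketch of the idea; mcp_call is a hypothetical stand-in for whatever client the execution environment exposes:

def mcp_call(server: str, tool: str, **kwargs) -> list[dict]:
    # Hypothetical stand-in: a real implementation would invoke the MCP server.
    # Dummy rows are returned here so the sketch runs.
    return [
        {"key": f"PROJ-{i}", "status": "open",
         "priority": "high" if i % 7 == 0 else "low"}
        for i in range(500)
    ]

# 500 rows are filtered here, inside the execution environment...
issues = mcp_call("jira", "search_issues", query="project = PROJ AND status = open")
high_priority = [row["key"] for row in issues if row["priority"] == "high"]

# ...and only this short digest is passed back to the model.
print(f"{len(high_priority)} high-priority open issues: {high_priority[:10]}")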

9. Monitoring and Measurement

9.1 Key Commands

Command | Information Provided
/cost | Token usage statistics for current session
/context | Visual context usage grid
/doctor | Diagnose context-related issues

9.2 Metrics to Track

  • Tokens per session
  • Context utilization percentage
  • Compaction frequency
  • Cost per task type
  • Time-to-context-limit

9.3 Warning Signs

  • Frequent auto-compaction triggering
  • Degraded response quality
  • Claude reverting to sampling files instead of full reads
  • “Context low” errors
  • Memory usage spikes (90GB+ indicates output retention issues)

Quick Reference Card

Daily Workflow

1. Start session
   - /context to check baseline
   - Disable unused MCP servers

2. During work
   - Use subagents for verbose operations
   - Truncate bash output
   - Read files with offset/limit for large files
   - Use Grep output_mode: "files_with_matches"

3. Between tasks
   - /clear if <50% context is relevant
   - /compact at 70% capacity

4. End of session
   - Document progress in .md file
   - /cost to review usage

Token Budget Guidelines

Operation | Estimated Tokens
CLAUDE.md (lean) | 500-1,000
CLAUDE.md (bloated) | 2,000-5,000
MCP server (typical) | 5,000-20,000
File read (2,000 lines) | 10,000-25,000
Grep results (content mode) | Varies widely
Bash output (untruncated) | Potentially unlimited

Emergency Actions

Problem | Solution
Context overflow imminent | /compact immediately
Performance degraded | /clear and restart
MCP consuming too much | /mcp to disable servers
Large file read failing | Use offset/limit parameters
Bash output overwhelming | Pipe to head or tail



This document was compiled from research conducted on January 19, 2026. Practices and features may evolve as Claude Code and the Claude API continue to be updated.