A comprehensive guide to minimizing context usage, optimizing token consumption, and maximizing efficiency when working with Claude Code and the Claude API.
Executive Summary
Context management is now recognized as “effectively the #1 job” for engineers building AI agents. As Anthropic emphasizes: “Claude is already smart enough–intelligence is not the bottleneck, context is.” Research shows that for many LLMs, performance degrades significantly as context length increases, with 11 out of 12 tested models dropping below 50% performance at 32k tokens.
Key metrics from optimization efforts:
- 54-62% reduction in startup tokens through tiered documentation
- 85% reduction in MCP tool overhead with Tool Search
- 84% reduction in token consumption with context editing
- 90% cost reduction possible with prompt caching
- 37-85% token reduction with Programmatic Tool Calling (PTC)
1. Token Optimization Strategies
1.1 Token-Efficient Tool Use
Claude 4 models have built-in token-efficient tool use that saves an average of 14% in output tokens (up to 70%) while also reducing latency. For Claude Sonnet 3.7 users, enable the beta header:
anthropic-beta: token-efficient-tools-2025-02-19
1.2 Programmatic Tool Calling (PTC)
PTC allows Claude to write code that calls tools programmatically within a code execution environment, rather than requiring round-trips through the model for each tool invocation.
Benefits:
- 85.6% token reduction demonstrated (110,473 to 15,919 tokens)
- 37% average reduction on complex research tasks
- Keeps intermediate results out of Claude’s context
- Substantially reduces end-to-end latency
1.3 Dynamic/Lazy Context Loading
Instead of loading verbose documentation upfront, use triggers to load detailed context on-demand.
Results from one project:
- Initial context reduced from 7,584 to 3,434 tokens (54% reduction)
- Improved tool discovery and enforcement
- Monthly cost for 5 developers doing 100 sessions/day dropped to $72 (62% token savings)
1.4 Hybrid Model Approach
Reserve expensive, high-reasoning models (Claude Opus 4.5) for:
- High-level planning
- Architectural design
- Final code review
Use faster, cheaper models (Sonnet, Haiku) for:
- High-frequency implementation work
- Basic syntax validation and linting
- Simple text transformations
- Data parsing and quick status checks
2. Efficient Tool Usage Patterns
2.1 Parallel vs Sequential Tool Calls
Use parallel calls when:
- Operations are independent with no dependencies
- Multiple searches or reads can run simultaneously
- You need to gather information from multiple sources
Use sequential calls when:
- One operation depends on another’s result
- Order of execution matters
- You need to chain operations (e.g.,
mkdirbeforecp)
Best Practice: When multiple independent pieces of information are needed and all commands are likely to succeed, make all independent calls in the same request block.
2.2 Batching Strategies
For immediate parallel execution:
# Use async/await to run multiple independent calls concurrently
# All questions run concurrently, completing in roughly the
# time of the slowest individual request
For non-urgent bulk operations:
- Use the Message Batches API (50% cost reduction)
- Limited to 100,000 requests or 256 MB per batch
- Most batches complete within 1 hour
- Ideal for: evaluations, content moderation, data analysis, bulk generation
2.3 Subagent Delegation
Use subagents when:
- The task produces verbose output you do not need in your main context
- You want to enforce specific tool restrictions or permissions
- The work is self-contained and can return a summary
- Running tests, fetching documentation, or processing log files
Built-in subagent types:
- Explore: For searching/understanding codebases without making changes
- General-purpose: For tasks requiring both exploration and modification
Thoroughness levels:
quick: Targeted lookupsmedium: Balanced explorationvery thorough: Comprehensive analysis
Limitations:
- Subagents cannot spawn other subagents
- Subagents start with a blank slate (“handoff problem”)
- Provide detailed briefs to avoid “context amnesia”
Pro tip: To maximize subagent usage, explicitly specify which steps should be delegated to subagents in your instructions.
3. Context Window Management
3.1 Understanding Context Limits
| Tier | Context Window | Notes |
|---|---|---|
| Standard | 200,000 tokens | Default for most users |
| Advanced (Tier 4+) | 1,000,000 tokens | Premium pricing applies |
| Premium pricing threshold | >200K tokens | 2x input, 1.5x output pricing |
Critical insight: Avoid using the final 20% of your context window for complex tasks. Quality notably declines for memory-intensive operations.
3.2 Built-in Commands
| Command | Purpose | When to Use |
|---|---|---|
/context | Visualizes context usage as colored grid | Before deciding to compact; identify MCP server consumption |
/clear | Wipes conversation history | Between tasks; after commits; when <50% of context is relevant |
/compact | Summarizes conversation and starts fresh | At 70% capacity; at logical breakpoints; during long sessions |
/cost | Shows token usage statistics | To understand patterns and identify optimization opportunities |
3.3 Compaction Strategies
Auto-compact: Triggers automatically at ~95% capacity.
Manual compact best practices:
- Compact at 70% capacity before hitting limits
- Add custom instructions:
/compact focus on authentication logic - Compact at logical breakpoints (feature complete, tests passing)
“Document & Clear” method for large tasks:
- Have Claude dump its plan and progress into a
.mdfile /clearthe state- Start a new session by telling Claude to read the
.mdand continue
3.4 Context Editing (September 2025 Feature)
Anthropic’s context editing automatically clears stale tool calls while preserving conversation flow. In testing, it enabled agents to complete workflows that would otherwise fail due to context exhaustion while reducing token consumption by 84%.
4. Tool-Specific Optimizations
4.1 File Reading (Read Tool)
Default limits:
- Maximum: 2,000 lines per read operation
- Token limit: 25,000 tokens (hardcoded)
- Lines longer than 2,000 characters are truncated
When files exceed limits:
Use offset and limit parameters to read specific portions of the file,
or use the GrepTool to search for specific content.
Chunking strategies:
- Focus on one directory at a time
- Use specific queries:
"explain the QueryContext class in velox/core/query.h" - Read only the portions you need with
offsetandlimitparameters
Environment variable for larger files:
export MAX_MCP_OUTPUT_TOKENS=250000
Warning: After 2-3 context compactions, Claude may revert to using grep/wc/partial reads instead of complete file reading. Monitor for this behavior.
4.2 Search Tools (Grep, Glob)
Grep Tool best practices:
| Technique | Example | Benefit |
|---|---|---|
Use type parameter | type: "py" | More efficient than glob patterns |
Use output_mode wisely | files_with_matches (default) | Only returns paths, not content |
Use head_limit | head_limit: 10 | Limits results to first N entries |
| Use literal patterns | -F "literal.string" | Faster than regex for exact matches |
| Pre-filter by file type | rg "pattern" -t py | Much faster than post-filtering |
Glob patterns for filtering:
*.log # Log files only
!*.min.js # Exclude minified JS
src/** # Only src directory tree
*test* # Include test files
!*node_modules* # Exclude node_modules
Key principle: Always prefer Grep, Glob, or Task tools over direct find/grep bash commands.
4.3 Bash Commands
Output limiting strategies:
# Truncate test output
npm test 2>&1 | tail -30
# Filter for errors/warnings only
npm run build 2>&1 | grep -i "error\|warning" || echo "Build succeeded"
# Limit output to N lines
command | head -100
Configuration:
BASH_MAX_OUTPUT_LENGTH: Controls character-based truncation for long outputs
Memory warning: Claude Code stores all bash output in memory for the entire session. Large outputs (90GB+ reported) can crash the application. Always truncate verbose commands.
Implement output truncation in code:
def truncate_output(output, max_lines=100):
lines = output.split('\n')
if len(lines) > max_lines:
return '\n'.join(lines[:max_lines]) + f'\n... [truncated {len(lines) - max_lines} lines]'
return output
4.4 When to Use Each Tool
| Scenario | Recommended Tool |
|---|---|
| Find files by name pattern | Glob |
| Search file contents | Grep |
| Read known file | Read (with offset/limit for large files) |
| Execute commands | Bash (with output truncation) |
| Open-ended exploration | Task/Subagent |
| Multiple rounds of search | Task tool |
| Verbose operations | Delegate to subagent |
5. Response Formatting
5.1 Requesting Concise Outputs
In CLAUDE.md or prompts:
## Response Guidelines
- Provide concise, actionable responses
- Omit verbose explanations unless requested
- Use bullet points over paragraphs
- Return only relevant code snippets, not entire files
- Summarize large outputs before presenting
5.2 Structured Output Requests
When analyzing code, return:
1. One-line summary
2. Key findings (3-5 bullets max)
3. Recommended actions
Do NOT include:
- Full file contents
- Verbose explanations
- Redundant information
6. Caching and Reuse Strategies
6.1 Prompt Caching (API)
Pricing structure:
| Cache Type | Cost vs Base |
|---|---|
| 5-minute cache write | 1.25x |
| 1-hour cache write | 2x |
| Cache read | 0.1x (90% savings) |
Implementation:
# Place static content at the beginning
# Mark end of reusable content with cache_control
# Minimum block size: 1,024 tokens
Best use cases:
- Extended conversations with long instructions
- Uploaded documents
- Agentic tool use with iterative code changes
- Talking to books, papers, documentation
Monitor cache performance via response fields:
cache_creation_input_tokenscache_read_input_tokensinput_tokens
6.2 Avoiding Redundant Operations
Principles:
- Read files once, reference by line numbers thereafter
- Cache search results mentally–do not repeat the same grep
- Use CLAUDE.md for information that persists across sessions
- Store findings in external files for multi-session projects
Pattern for large tasks:
1. Search/read once at the start
2. Document findings in a scratchpad file
3. Reference the scratchpad instead of re-reading
4. Clear context while preserving scratchpad
7. Prompt Engineering for Efficiency
7.1 CLAUDE.md Best Practices
What to include:
- Project context (one-liner orientation)
- Code style preferences (specific, not vague)
- Commands (test, build, lint, deploy)
- Project-specific gotchas and warnings
- Things Claude should NOT do
What NOT to include:
- Information needed only occasionally (put in
docs/instead) - Verbose explanations
- Everything marked as “IMPORTANT” (dilutes emphasis)
Structure:
# Project: [One-line description]
## Tech Stack
- [Framework]
- [Database]
- [Key dependencies]
## Commands
- Test: `npm test`
- Build: `npm run build`
- Lint: `npm run lint`
## Code Style
- 2-space indentation
- Named exports preferred
- ES modules (not CommonJS)
## IMPORTANT: Do Not
- Modify the migrations folder directly
- Use any deprecated APIs
- Create excessive comments
File locations (hierarchy order):
- Project root
CLAUDE.md(shared via version control) .claude/CLAUDE.md(subdirectory alternative)~/.claude/CLAUDE.md(user-level defaults)CLAUDE.local.md(private, auto-gitignored)
7.2 Writing Efficient Prompts
Minimize back-and-forth by:
- Providing complete context upfront
- Specifying expected output format
- Including constraints and boundaries
- Listing files/directories to focus on
- Stating what NOT to do
Example efficient prompt:
Fix the authentication bug in src/auth/login.ts
Context:
- Users report 401 errors on valid credentials
- Issue started after commit abc123
- Related files: src/auth/login.ts, src/middleware/auth.ts
Requirements:
- Do not modify the session schema
- Add debug logging to track the issue
- Write a test case for the fix
Output format:
1. Root cause (1-2 sentences)
2. Code changes (diff format)
3. Test case
7.3 Progressive Disclosure
Let agents navigate and retrieve data autonomously. Each interaction yields context that informs the next decision. Agents can assemble understanding layer by layer, maintaining only what is necessary in working memory.
8. MCP Server Optimization
8.1 The Problem
MCP tool definitions can consume massive context:
- 5-server setup: ~55K tokens before conversation starts
- Jira alone: ~17K tokens
- One reported case: 134K tokens of tool definitions before optimization
8.2 MCP Tool Search
Introduced to reduce token overhead by 85% by loading tools on-demand rather than upfront.
Configuration:
# Auto mode (default) - activates when tools exceed threshold
ENABLE_TOOL_SEARCH=auto
# Custom threshold (5%)
ENABLE_TOOL_SEARCH=auto:5
# Disable entirely
ENABLE_TOOL_SEARCH=false
Performance improvements:
- Opus 4: 49% to 74% accuracy
- Opus 4.5: 79.5% to 88.1% accuracy
- 46.9% reduction in total agent tokens (51K to 8.5K)
8.3 Manual Optimization Strategies
Disable unused MCP servers:
- Use
/contextto identify consumption - Disable with
@server-name disableor/mcp - Re-enable only when needed
Tool consolidation:
- Example: mcp-omnisearch reduced from 20 tools (14,214 tokens) to 8 tools (5,663 tokens)
- Combine similar functionality
- Build scoped, narrow-purpose servers
Present MCP servers as code APIs:
- Agents write code to interact with servers
- Load only needed tools
- Process data in execution environment before passing to model
9. Monitoring and Measurement
9.1 Key Commands
| Command | Information Provided |
|---|---|
/cost | Token usage statistics for current session |
/context | Visual context usage grid |
/doctor | Diagnose context-related issues |
9.2 Metrics to Track
- Tokens per session
- Context utilization percentage
- Compaction frequency
- Cost per task type
- Time-to-context-limit
9.3 Warning Signs
- Frequent auto-compaction triggering
- Degraded response quality
- Claude reverting to sampling files instead of full reads
- “Context low” errors
- Memory usage spikes (90GB+ indicates output retention issues)
Quick Reference Card
Daily Workflow
1. Start session
- /context to check baseline
- Disable unused MCP servers
2. During work
- Use subagents for verbose operations
- Truncate bash output
- Read files with offset/limit for large files
- Use Grep output_mode: "files_with_matches"
3. Between tasks
- /clear if <50% context is relevant
- /compact at 70% capacity
4. End of session
- Document progress in .md file
- /cost to review usage
Token Budget Guidelines
| Operation | Estimated Tokens |
|---|---|
| CLAUDE.md (lean) | 500-1,000 |
| CLAUDE.md (bloated) | 2,000-5,000 |
| MCP server (typical) | 5,000-20,000 |
| File read (2000 lines) | 10,000-25,000 |
| Grep results (content mode) | Varies widely |
| Bash output (untruncated) | Potentially unlimited |
Emergency Actions
| Problem | Solution |
|---|---|
| Context overflow imminent | /compact immediately |
| Performance degraded | /clear and restart |
| MCP consuming too much | /mcp to disable servers |
| Large file read failing | Use offset/limit parameters |
| Bash output overwhelming | Pipe to head or tail |
Sources
Anthropic Official Documentation
- Context Windows - Claude Docs
- Token-efficient Tool Use - Claude Docs
- Claude Code: Best Practices for Agentic Coding
- Effective Context Engineering for AI Agents
- Introducing Advanced Tool Use
- Prompt Caching
- Bash Tool - Claude Docs
- Memory Tool - Claude Docs
- Create Custom Subagents - Claude Code Docs
- Manage Claude’s Memory - Claude Code Docs
- Manage Costs Effectively - Claude Code Docs
- Batch Processing - Claude Docs
- Programmatic Tool Calling (PTC)
Community Guides and Analysis
- Stop Wasting Tokens: How to Optimize Claude Code Context by 60%
- Claude Code Context Optimization: 54% Reduction
- Optimizing Token Efficiency in Claude Code Workflows
- How Claude Code Got Better by Protecting More Context
- Managing Claude Code Context - MCPcat Guide
- The Complete Guide to CLAUDE.md
- Claude Skills Solve the Context Window Problem
- How I Use Every Claude Code Feature
- Claude Code Compaction
- Optimising MCP Server Context Usage
- What is MCP Tool Search?
- Claude Code Just Cut MCP Context Bloat by 46.9%
- Task/Agent Tools - ClaudeLog
- A Practical Guide to Subagents in Claude Code
- Best Practices for Claude Code Subagents
- How to Optimize Claude Code Token Usage
Research and Academic
- Context Engineering Guide - Prompt Engineering Guide
- Cutting Through the Noise: Smarter Context Management - JetBrains Research
- LLM Context Management Guide
- Context Engineering in LLM-Based Agents
GitHub Issues and Feature Requests
- File Content Exceeds Maximum Token Limit
- Optimizing Claude Code for Large Codebases
- Claude Code Ingests Massive Tool Outputs Without Truncation
- Improve Claude Code Token Management with MCP Servers
This document was compiled from research conducted on January 19, 2026. Practices and features may evolve as Claude Code and the Claude API continue to be updated.