A comprehensive guide to minimizing context usage, optimizing token consumption, and maximizing efficiency when working with Claude Code and the Claude API.


Executive Summary

Context management is now recognized as “effectively the #1 job” for engineers building AI agents. As Anthropic emphasizes: “Claude is already smart enough–intelligence is not the bottleneck, context is.” Research shows that for many LLMs, performance degrades significantly as context length increases, with 11 out of 12 tested models dropping below 50% performance at 32k tokens.

Key metrics from optimization efforts:

  • 54-62% reduction in startup tokens through tiered documentation
  • 85% reduction in MCP tool overhead with Tool Search
  • 84% reduction in token consumption with context editing
  • 90% cost reduction possible with prompt caching
  • 37-85% token reduction with Programmatic Tool Calling (PTC)

1. Token Optimization Strategies

1.1 Token-Efficient Tool Use

Claude 4 models have built-in token-efficient tool use that saves an average of 14% in output tokens (up to 70%) while also reducing latency. For Claude Sonnet 3.7 users, enable the beta header:

anthropic-beta: token-efficient-tools-2025-02-19
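
With the Python SDK, the header can be passed per request via extra_headers. A minimal sketch (the tool definition is illustrative):

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    tools=[{
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    # Opt in to token-efficient tool use (needed on Claude Sonnet 3.7 only)
    extra_headers={"anthropic-beta": "token-efficient-tools-2025-02-19"},
)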

1.2 Programmatic Tool Calling (PTC)

PTC allows Claude to write code that calls tools programmatically within a code execution environment, rather than requiring round-trips through the model for each tool invocation.
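
The sketch below is a conceptual illustration of why this saves tokens, not the real PTC API: the helper stands in for a sandboxed tool call, the loop runs entirely inside the execution environment, and only the final printed line re-enters Claude's context.

def fetch_expense_total(employee_id: str) -> float:
    # Stand-in for a tool call executed inside the sandbox (hypothetical).
    return 120.0

employee_ids = [f"emp-{i}" for i in range(1000)]

# 1,000 intermediate results stay inside the execution environment...
total = sum(fetch_expense_total(eid) for eid in employee_ids)

# ...and only this one line of output returns to the model's context.
print(f"Total expenses: ${total:,.2f}")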

Benefits:

  • 85.6% token reduction demonstrated (110,473 to 15,919 tokens)
  • 37% average reduction on complex research tasks
  • Keeps intermediate results out of Claude’s context
  • Substantially reduces end-to-end latency

1.3 Dynamic/Lazy Context Loading

Instead of loading verbose documentation upfront, use triggers to load detailed context on-demand.
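
As an illustration of the pattern (the trigger words, file paths, and loader itself are hypothetical):

DOC_TRIGGERS = {
    "deploy": "docs/deployment.md",
    "migration": "docs/database-migrations.md",
    "auth": "docs/authentication.md",
}

def build_context(task: str, base_context: str) -> str:
    # Start from a lean, always-loaded core (e.g., a minimal CLAUDE.md).
    context = base_context
    for keyword, path in DOC_TRIGGERS.items():
        if keyword in task.lower():
            # Inject the detailed doc only when the task mentions its trigger.
            with open(path) as f:
                context += f"\n\n# {path}\n{f.read()}"
    return context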

Results from one project:

  • Initial context reduced from 7,584 to 3,434 tokens (54% reduction)
  • Improved tool discovery and enforcement
  • Monthly cost for 5 developers doing 100 sessions/day dropped to $72 (62% token savings)

1.4 Hybrid Model Approach

Reserve expensive, high-reasoning models (Claude Opus 4.5) for:

  • High-level planning
  • Architectural design
  • Final code review

Use faster, cheaper models (Sonnet, Haiku) for:

  • High-frequency implementation work
  • Basic syntax validation and linting
  • Simple text transformations
  • Data parsing and quick status checks
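
A minimal sketch of this routing policy; the task types and model IDs are illustrative placeholders, not confirmed identifiers:

TASK_MODEL = {
    "planning": "claude-opus-4-5",        # high-level planning
    "architecture": "claude-opus-4-5",    # architectural design
    "final-review": "claude-opus-4-5",    # final code review
    "implementation": "claude-sonnet-4-5",
    "lint": "claude-haiku-4-5",
    "parsing": "claude-haiku-4-5",
}

def pick_model(task_type: str) -> str:
    # Default to the cheapest tier; escalate only for listed high-value work.
    return TASK_MODEL.get(task_type, "claude-haiku-4-5")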

2. Efficient Tool Usage Patterns

2.1 Parallel vs Sequential Tool Calls

Use parallel calls when:

  • Operations are independent with no dependencies
  • Multiple searches or reads can run simultaneously
  • You need to gather information from multiple sources

Use sequential calls when:

  • One operation depends on another’s result
  • Order of execution matters
  • You need to chain operations (e.g., mkdir before cp)

Best Practice: When multiple independent pieces of information are needed and all commands are likely to succeed, make all independent calls in the same request block.

2.2 Batching Strategies

For immediate parallel execution:

# Use async/await to run multiple independent calls concurrently
# All questions run concurrently, completing in roughly the
# time of the slowest individual request
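
A minimal sketch of this pattern with the Python SDK (the model ID is a placeholder):

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def ask(question: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

async def main() -> None:
    questions = [
        "Summarize the failing tests in one sentence.",
        "List the exported functions in src/auth/login.ts.",
        "What does the Dockerfile build stage do?",
    ]
    # All questions run concurrently; wall time is roughly that of the
    # slowest individual request.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for question, answer in zip(questions, answers):
        print(f"Q: {question}\nA: {answer}\n")

asyncio.run(main())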

For non-urgent bulk operations:

  • Use the Message Batches API (50% cost reduction)
  • Limited to 100,000 requests or 256 MB per batch
  • Most batches complete within 1 hour
  • Ideal for: evaluations, content moderation, data analysis, bulk generation
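
For example, submitting a small moderation batch with the Python SDK (hedged sketch; the model ID is a placeholder):

from anthropic import Anthropic

client = Anthropic()

comments = ["Great post!", "Buy cheap pills now", "This broke my build"]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"moderate-{i}",
            "params": {
                "model": "claude-haiku-4-5",  # placeholder model ID
                "max_tokens": 64,
                "messages": [{
                    "role": "user",
                    "content": f"Classify as spam or not spam: {comment}",
                }],
            },
        }
        for i, comment in enumerate(comments)
    ]
)
print(batch.id, batch.processing_status)  # poll until "ended", then fetch results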

2.3 Subagent Delegation

Use subagents when:

  • The task produces verbose output you do not need in your main context
  • You want to enforce specific tool restrictions or permissions
  • The work is self-contained and can return a summary
  • Running tests, fetching documentation, or processing log files

Built-in subagent types:

  • Explore: For searching/understanding codebases without making changes
  • General-purpose: For tasks requiring both exploration and modification

Thoroughness levels:

  • quick: Targeted lookups
  • medium: Balanced exploration
  • very thorough: Comprehensive analysis

Limitations:

  • Subagents cannot spawn other subagents
  • Subagents start with a blank slate (“handoff problem”)
  • Provide detailed briefs to avoid “context amnesia”

Pro tip: To maximize subagent usage, explicitly specify which steps should be delegated to subagents in your instructions.
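
For example, an instruction that forces delegation (wording illustrative):

Use an Explore subagent (thoroughness: medium) to map how errors are
handled across src/api/, and have it return at most 10 bullet points.
Do not load the files it reads into the main conversation.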


3. Context Window Management

3.1 Understanding Context Limits

Tier | Context Window | Notes
Standard | 200,000 tokens | Default for most users
Advanced (Tier 4+) | 1,000,000 tokens | Premium pricing applies
Premium pricing threshold | >200K tokens | 2x input, 1.5x output pricing

Critical insight: Avoid using the final 20% of your context window for complex tasks. Quality notably declines for memory-intensive operations.

3.2 Built-in Commands

Command | Purpose | When to Use
/context | Visualizes context usage as colored grid | Before deciding to compact; identify MCP server consumption
/clear | Wipes conversation history | Between tasks; after commits; when <50% of context is relevant
/compact | Summarizes conversation and starts fresh | At 70% capacity; at logical breakpoints; during long sessions
/cost | Shows token usage statistics | To understand patterns and identify optimization opportunities

3.3 Compaction Strategies

Auto-compact: Triggers automatically at ~95% capacity.

Manual compact best practices:

  • Compact at 70% capacity before hitting limits
  • Add custom instructions: /compact focus on authentication logic
  • Compact at logical breakpoints (feature complete, tests passing)

“Document & Clear” method for large tasks:

  1. Have Claude dump its plan and progress into a .md file
  2. /clear the state
  3. Start a new session by telling Claude to read the .md and continue
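
An illustrative handoff file:

# Task: migrate session auth to JWT

## Done
- Replaced session middleware in src/middleware/auth.ts

## Next steps
- Update the login controller
- Run `npm test` and fix any failures

## Key findings
- Token refresh logic lives in src/auth/refresh.ts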

3.4 Context Editing (September 2025 Feature)

Anthropic’s context editing automatically clears stale tool calls while preserving conversation flow. In testing, it enabled agents to complete workflows that would otherwise fail due to context exhaustion while reducing token consumption by 84%.


4. Tool-Specific Optimizations

4.1 File Reading (Read Tool)

Default limits:

  • Maximum: 2,000 lines per read operation
  • Token limit: 25,000 tokens (hardcoded)
  • Lines longer than 2,000 characters are truncated

When files exceed limits:

Use offset and limit parameters to read specific portions of the file,
or use the GrepTool to search for specific content.

Chunking strategies:

  • Focus on one directory at a time
  • Use specific queries: "explain the QueryContext class in velox/core/query.h"
  • Read only the portions you need with offset and limit parameters (see the sketch below)
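
A plain-Python analogue of offset/limit reading (illustrative; not the Read tool itself):

from itertools import islice

def read_chunk(path: str, offset: int = 0, limit: int = 2000) -> str:
    # Return `limit` lines starting at line `offset`, without loading
    # the whole file into memory.
    with open(path) as f:
        return "".join(islice(f, offset, offset + limit))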

Environment variable for larger files:

export MAX_MCP_OUTPUT_TOKENS=250000

Warning: After 2-3 context compactions, Claude may revert to using grep/wc/partial reads instead of complete file reading. Monitor for this behavior.

4.2 Search Tools (Grep, Glob)

Grep Tool best practices:

Technique | Example | Benefit
Use type parameter | type: "py" | More efficient than glob patterns
Use output_mode wisely | files_with_matches (default) | Only returns paths, not content
Use head_limit | head_limit: 10 | Limits results to first N entries
Use literal patterns | -F "literal.string" | Faster than regex for exact matches
Pre-filter by file type | rg "pattern" -t py | Much faster than post-filtering

Glob patterns for filtering:

*.log          # Log files only
!*.min.js      # Exclude minified JS
src/**         # Only src directory tree
*test*         # Include test files
!*node_modules* # Exclude node_modules

Key principle: Always prefer Grep, Glob, or Task tools over direct find/grep bash commands.

4.3 Bash Commands

Output limiting strategies:

# Truncate test output
npm test 2>&1 | tail -30

# Filter for errors/warnings only
npm run build 2>&1 | grep -i "error\|warning" || echo "Build succeeded"

# Limit output to N lines
command | head -100

Configuration:

  • BASH_MAX_OUTPUT_LENGTH: Controls character-based truncation for long outputs

Memory warning: Claude Code stores all bash output in memory for the entire session. Large outputs (90GB+ reported) can crash the application. Always truncate verbose commands.

Implement output truncation in code:

def truncate_output(output: str, max_lines: int = 100) -> str:
    """Cap output at max_lines, noting how many lines were dropped."""
    lines = output.split('\n')
    if len(lines) > max_lines:
        kept = '\n'.join(lines[:max_lines])
        return kept + f'\n... [truncated {len(lines) - max_lines} lines]'
    return output

4.4 When to Use Each Tool

Scenario | Recommended Tool
Find files by name pattern | Glob
Search file contents | Grep
Read known file | Read (with offset/limit for large files)
Execute commands | Bash (with output truncation)
Open-ended exploration | Task/Subagent
Multiple rounds of search | Task tool
Verbose operations | Delegate to subagent

5. Response Formatting

5.1 Requesting Concise Outputs

In CLAUDE.md or prompts:

## Response Guidelines
- Provide concise, actionable responses
- Omit verbose explanations unless requested
- Use bullet points over paragraphs
- Return only relevant code snippets, not entire files
- Summarize large outputs before presenting

5.2 Structured Output Requests

When analyzing code, return:
1. One-line summary
2. Key findings (3-5 bullets max)
3. Recommended actions

Do NOT include:
- Full file contents
- Verbose explanations
- Redundant information

6. Caching and Reuse Strategies

6.1 Prompt Caching (API)

Pricing structure:

Cache Type | Cost vs Base
5-minute cache write | 1.25x
1-hour cache write | 2x
Cache read | 0.1x (90% savings)

Implementation:

# Place static content at the beginning
# Mark end of reusable content with cache_control
# Minimum block size: 1,024 tokens
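
A minimal sketch with the Python SDK (the model ID is a placeholder; cache_control is the API's cache marker):

from anthropic import Anthropic

client = Anthropic()

LONG_STATIC_INSTRUCTIONS = open("docs/style-guide.md").read()  # large, stable text

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_STATIC_INSTRUCTIONS,  # needs >= 1,024 tokens to be cacheable
        "cache_control": {"type": "ephemeral"},  # marks the end of the reusable prefix
    }],
    messages=[{"role": "user", "content": "Review src/auth/login.ts against the style guide."}],
)

# Cache performance fields on the usage object
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
print(response.usage.input_tokens)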

Best use cases:

  • Extended conversations with long instructions
  • Uploaded documents
  • Agentic tool use with iterative code changes
  • Talking to books, papers, documentation

Monitor cache performance via response fields:

  • cache_creation_input_tokens
  • cache_read_input_tokens
  • input_tokens

6.2 Avoiding Redundant Operations

Principles:

  1. Read files once, reference by line numbers thereafter
  2. Cache search results mentally; do not repeat the same grep
  3. Use CLAUDE.md for information that persists across sessions
  4. Store findings in external files for multi-session projects

Pattern for large tasks:

1. Search/read once at the start
2. Document findings in a scratchpad file
3. Reference the scratchpad instead of re-reading
4. Clear context while preserving scratchpad

7. Prompt Engineering for Efficiency

7.1 CLAUDE.md Best Practices

What to include:

  • Project context (one-liner orientation)
  • Code style preferences (specific, not vague)
  • Commands (test, build, lint, deploy)
  • Project-specific gotchas and warnings
  • Things Claude should NOT do

What NOT to include:

  • Information needed only occasionally (put in docs/ instead)
  • Verbose explanations
  • Everything marked as “IMPORTANT” (dilutes emphasis)

Structure:

# Project: [One-line description]

## Tech Stack
- [Framework]
- [Database]
- [Key dependencies]

## Commands
- Test: `npm test`
- Build: `npm run build`
- Lint: `npm run lint`

## Code Style
- 2-space indentation
- Named exports preferred
- ES modules (not CommonJS)

## IMPORTANT: Do Not
- Modify the migrations folder directly
- Use any deprecated APIs
- Create excessive comments

File locations (hierarchy order):

  1. Project root CLAUDE.md (shared via version control)
  2. .claude/CLAUDE.md (subdirectory alternative)
  3. ~/.claude/CLAUDE.md (user-level defaults)
  4. CLAUDE.local.md (private, auto-gitignored)

7.2 Writing Efficient Prompts

Minimize back-and-forth by:

  • Providing complete context upfront
  • Specifying expected output format
  • Including constraints and boundaries
  • Listing files/directories to focus on
  • Stating what NOT to do

Example efficient prompt:

Fix the authentication bug in src/auth/login.ts

Context:
- Users report 401 errors on valid credentials
- Issue started after commit abc123
- Related files: src/auth/login.ts, src/middleware/auth.ts

Requirements:
- Do not modify the session schema
- Add debug logging to track the issue
- Write a test case for the fix

Output format:
1. Root cause (1-2 sentences)
2. Code changes (diff format)
3. Test case

7.3 Progressive Disclosure

Let agents navigate and retrieve data autonomously. Each interaction yields context that informs the next decision. Agents can assemble understanding layer by layer, maintaining only what is necessary in working memory.


8. MCP Server Optimization

8.1 The Problem

MCP tool definitions can consume massive context:

  • 5-server setup: ~55K tokens before conversation starts
  • Jira alone: ~17K tokens
  • One reported case: 134K tokens of tool definitions before optimization

8.2 Tool Search

Tool Search was introduced to reduce this overhead by 85% by loading tool definitions on-demand rather than upfront.

Configuration:

# Auto mode (default) - activates when tools exceed threshold
ENABLE_TOOL_SEARCH=auto

# Custom threshold (5%)
ENABLE_TOOL_SEARCH=auto:5

# Disable entirely
ENABLE_TOOL_SEARCH=false

Performance improvements:

  • Opus 4: 49% to 74% accuracy
  • Opus 4.5: 79.5% to 88.1% accuracy
  • 46.9% reduction in total agent tokens (51K to 8.5K)

8.3 Manual Optimization Strategies

Disable unused MCP servers:

  1. Use /context to identify consumption
  2. Disable with @server-name disable or /mcp
  3. Re-enable only when needed

Tool consolidation:

  • Example: mcp-omnisearch reduced from 20 tools (14,214 tokens) to 8 tools (5,663 tokens)
  • Combine similar functionality
  • Build scoped, narrow-purpose servers

Present MCP servers as code APIs (see the sketch after this list):

  • Agents write code to interact with servers
  • Load only needed tools
  • Process data in execution environment before passing to model
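
A hedged sketch of the idea; mcp_call is a hypothetical stand-in for whatever client the execution environment exposes:

def mcp_call(server: str, tool: str, **kwargs) -> list[dict]:
    # Hypothetical stand-in: a real implementation would invoke the MCP server.
    # Dummy rows are returned here so the sketch runs.
    return [
        {"key": f"PROJ-{i}", "status": "open",
         "priority": "high" if i % 7 == 0 else "low"}
        for i in range(500)
    ]

# 500 rows are filtered here, inside the execution environment...
issues = mcp_call("jira", "search_issues", query="project = PROJ AND status = open")
high_priority = [row["key"] for row in issues if row["priority"] == "high"]

# ...and only this short digest is passed back to the model.
print(f"{len(high_priority)} high-priority open issues: {high_priority[:10]}")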

9. Monitoring and Measurement

9.1 Key Commands

Command | Information Provided
/cost | Token usage statistics for current session
/context | Visual context usage grid
/doctor | Diagnose context-related issues

9.2 Metrics to Track

  • Tokens per session
  • Context utilization percentage
  • Compaction frequency
  • Cost per task type
  • Time-to-context-limit

9.3 Warning Signs

  • Frequent auto-compaction triggering
  • Degraded response quality
  • Claude reverting to sampling files instead of full reads
  • “Context low” errors
  • Memory usage spikes (90GB+ indicates output retention issues)

Quick Reference Card

Daily Workflow

1. Start session
   - /context to check baseline
   - Disable unused MCP servers

2. During work
   - Use subagents for verbose operations
   - Truncate bash output
   - Read files with offset/limit for large files
   - Use Grep output_mode: "files_with_matches"

3. Between tasks
   - /clear if <50% context is relevant
   - /compact at 70% capacity

4. End of session
   - Document progress in .md file
   - /cost to review usage

Token Budget Guidelines

Operation | Estimated Tokens
CLAUDE.md (lean) | 500-1,000
CLAUDE.md (bloated) | 2,000-5,000
MCP server (typical) | 5,000-20,000
File read (2,000 lines) | 10,000-25,000
Grep results (content mode) | Varies widely
Bash output (untruncated) | Potentially unlimited

Emergency Actions

Problem | Solution
Context overflow imminent | /compact immediately
Performance degraded | /clear and restart
MCP consuming too much | /mcp to disable servers
Large file read failing | Use offset/limit parameters
Bash output overwhelming | Pipe to head or tail



This document was compiled from research conducted on January 19, 2026. Practices and features may evolve as Claude Code and the Claude API continue to be updated.