Context Optimization Reference
This document provides a detailed technical reference for context optimization techniques and strategies.
Compaction Strategies
Summary-Based Compaction
Summary-based compaction replaces verbose content with concise summaries while preserving key information. The approach works by identifying sections that can be compressed, generating summaries that capture essential points, and replacing full content with summaries.
The effectiveness of compaction depends on what information is preserved. Critical decisions, user preferences, and current task state should never be compacted. Intermediate results and supporting evidence can be summarized more aggressively. Boilerplate, repeated information, and exploratory reasoning can often be removed entirely.
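As a minimal sketch of this idea (assuming a hypothetical per-message "kind" field and a `summarize` callable supplied by the caller), compaction might walk the message history, keep critical items verbatim, drop exploratory reasoning, and summarize the rest:

```python
from typing import Dict, List


def compact_history(messages: List[Dict], summarize) -> List[Dict]:
    """Replace compressible messages with summaries; keep critical ones verbatim.

    Assumes each message dict carries a "kind" field such as "decision",
    "preference", "task_state", or "exploration" (hypothetical schema).
    """
    never_compact = {"decision", "preference", "task_state"}
    compacted = []
    for msg in messages:
        if msg.get("kind") in never_compact:
            compacted.append(msg)  # preserve verbatim
        elif msg.get("kind") == "exploration":
            continue  # exploratory reasoning can often be removed entirely
        else:
            compacted.append({**msg, "content": summarize(msg["content"])})
    return compacted
```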
Token Budget Allocation
Effective context budgeting requires understanding how different context components consume tokens and allocating budget strategically:
| Component | Typical Range | Notes |
|---|---|---|
| System prompt | 500-2000 tokens | Stable across session |
| Tool definitions | 100-500 tokens per tool | Grows with tool count |
| Retrieved documents | Variable | Often largest consumer |
| Message history | Variable | Grows with conversation |
| Tool outputs | Variable | Can dominate context |
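One way to apply the table is as an explicit budget, allocating a fraction of the context window to each component. The sketch below uses illustrative fractions and assumes a `count_tokens` helper supplied by the caller:

```python
from typing import Dict

# Illustrative allocation of the context window across components (fractions are assumptions)
BUDGET_FRACTIONS = {
    "system_prompt": 0.05,
    "tool_definitions": 0.05,
    "retrieved_documents": 0.40,
    "message_history": 0.30,
    "tool_outputs": 0.20,
}


def allocate_budget(context_limit: int) -> Dict[str, int]:
    """Translate fractional allocations into per-component token budgets."""
    return {name: int(context_limit * frac) for name, frac in BUDGET_FRACTIONS.items()}


def over_budget(components: Dict[str, str], context_limit: int, count_tokens) -> Dict[str, bool]:
    """Flag components whose token counts exceed their allocated budgets."""
    budgets = allocate_budget(context_limit)
    return {name: count_tokens(text) > budgets.get(name, 0) for name, text in components.items()}
```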
Compaction Thresholds
Trigger compaction at appropriate thresholds to maintain performance:
- Warning threshold at 70% of effective context limit
- Compaction trigger at 80% of effective context limit
- Aggressive compaction at 90% of effective context limit
The exact thresholds depend on model behavior and task characteristics. Some models show graceful degradation while others exhibit sharp performance cliffs.
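A simple sketch of threshold-based triggering, mirroring the thresholds listed above (function and label names are illustrative):

```python
def compaction_level(used_tokens: int, effective_limit: int) -> str:
    """Map context utilization to an action using the thresholds above."""
    utilization = used_tokens / effective_limit
    if utilization >= 0.9:
        return "aggressive_compaction"
    if utilization >= 0.8:
        return "compaction"
    if utilization >= 0.7:
        return "warning"
    return "ok"
```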
Observation Masking Patterns
Selective Masking
Not all observations warrant the same treatment. Mask observations that have served their purpose and are no longer needed for active reasoning. Keep observations that are central to the current task, observations from the most recent turn, and observations that may be referenced again.
Masking Implementation
```python
from typing import Dict, List


def selective_mask(observations: List[Dict], current_task: Dict) -> List[Dict]:
    """
    Selectively mask observations based on relevance to the current task.
    Returns observations with a "masked" field indicating masked content.
    """
    masked = []
    for obs in observations:
        # calculate_relevance, store_for_reference, and summarize_content
        # are helpers assumed to be defined elsewhere.
        relevance = calculate_relevance(obs, current_task)
        if relevance < 0.3 and obs["age"] > 3:
            # Low relevance and old: mask, keeping a reference and a summary
            masked.append({
                **obs,
                "masked": True,
                "reference": store_for_reference(obs["content"]),
                "summary": summarize_content(obs["content"]),
            })
        else:
            # Still relevant or recent: keep in full
            masked.append({**obs, "masked": False})
    return masked
```

KV-Cache Optimization
Prefix Stability
KV-cache hit rates depend on prefix stability. Stable prefixes enable cache reuse across requests. Dynamic prefixes invalidate cache and force recomputation.
Elements that should remain stable include system prompts, tool definitions, and frequently used templates. Elements that may vary include timestamps, session identifiers, and query-specific content.
Cache-Friendly Design
Design prompts to maximize cache hit rates:
- Place stable content at the beginning
- Use consistent formatting across requests
- Avoid dynamic content in prompts when possible
- Use placeholders for dynamic content
```python
from datetime import datetime

# Cache-unfriendly: a dynamic timestamp in the prompt invalidates the cached prefix
system_prompt = f"""
Current time: {datetime.now().isoformat()}
You are a helpful assistant.
"""

# Cache-friendly: stable prompt, with the dynamic time supplied separately when needed
system_prompt = """
You are a helpful assistant.
Current time is provided separately when relevant.
"""
```

Context Partitioning Strategies
Sub-Agent Isolation
Partition work across sub-agents to prevent any single context from growing too large. Each sub-agent operates with a clean context focused on its subtask.
Partition Planning
```python
from typing import Dict


def plan_partitioning(task: Dict, context_limit: int) -> Dict:
    """
    Plan how to partition a task based on context limits.
    Returns the partitioning strategy and subtask definitions.
    """
    # estimate_task_context and decompose_task are helpers assumed to be defined elsewhere.
    estimated_context = estimate_task_context(task)
    if estimated_context <= context_limit:
        # The task fits within a single context window
        return {
            "strategy": "single_agent",
            "subtasks": [task],
        }
    # The task exceeds the limit: plan a multi-agent approach
    subtasks = decompose_task(task)
    return {
        "strategy": "multi_agent",
        "subtasks": subtasks,
        "coordination": "hierarchical",
    }
```

Optimization Decision Framework
When to Optimize
Consider context optimization when context utilization exceeds 70%, when response quality degrades as conversations extend, when costs increase due to long contexts, or when latency increases with conversation length.
What Optimization to Apply
Choose optimization strategies based on context composition:
If tool outputs dominate context, apply observation masking. If retrieved documents dominate context, apply summarization or partitioning. If message history dominates context, apply compaction with summarization. If multiple components contribute, combine strategies.
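A sketch of this decision logic, assuming per-component token counts have already been measured (the 0.5 dominance threshold is an illustrative assumption):

```python
from typing import Dict, List


def choose_strategies(component_tokens: Dict[str, int]) -> List[str]:
    """Pick optimization strategies based on which components dominate the context."""
    total = sum(component_tokens.values()) or 1
    strategies = []
    if component_tokens.get("tool_outputs", 0) / total > 0.5:
        strategies.append("observation_masking")
    if component_tokens.get("retrieved_documents", 0) / total > 0.5:
        strategies.append("summarization_or_partitioning")
    if component_tokens.get("message_history", 0) / total > 0.5:
        strategies.append("compaction_with_summarization")
    if not strategies:
        # No single dominant component: combine strategies
        strategies = ["observation_masking", "compaction_with_summarization"]
    return strategies
```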
Evaluation of Optimization
After applying optimization, evaluate effectiveness:
- Measure token reduction achieved
- Measure quality preservation (output quality should not degrade)
- Measure latency improvement
- Measure cost reduction
Iterate on optimization strategies based on evaluation results.
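A minimal before/after comparison along these dimensions might look like the following (the metric field names and quality scorer are assumptions):

```python
from typing import Dict


def evaluate_optimization(before: Dict, after: Dict) -> Dict[str, float]:
    """Compare token count, quality, latency, and cost before and after optimization."""
    return {
        "token_reduction": 1 - after["tokens"] / before["tokens"],
        "quality_delta": after["quality_score"] - before["quality_score"],
        "latency_delta_ms": after["latency_ms"] - before["latency_ms"],
        "cost_reduction": 1 - after["cost_usd"] / before["cost_usd"],
    }
```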
Common Pitfalls
Over-Aggressive Compaction
Compacting too aggressively can remove critical information. Always preserve task goals, user preferences, and recent conversation context. Test compaction at increasing aggressiveness levels to find the optimal balance.
Masking Critical Observations
Masking observations that are still needed can cause errors. Track observation usage and only mask content that is no longer referenced. Consider keeping references to masked content that could be retrieved if needed.
Ignoring Attention Distribution
The lost-in-middle phenomenon means that information placement matters. Place critical information at attention-favored positions (beginning and end of context). Use explicit markers to highlight important content.
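As an illustrative sketch, context assembly can place critical items at the start and end of the context and wrap them in explicit markers (the marker format and function are assumptions, not a standard convention):

```python
from typing import List


def assemble_context(critical: List[str], supporting: List[str]) -> str:
    """Place critical content at attention-favored positions (beginning and end)."""
    head = "\n".join(f"[IMPORTANT]\n{item}" for item in critical)
    body = "\n".join(supporting)
    tail = "\n".join(f"[IMPORTANT - RESTATED]\n{item}" for item in critical)
    return "\n\n".join(part for part in (head, body, tail) if part)
```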
Premature Optimization
Not all contexts require optimization. Adding optimization machinery has overhead. Optimize only when context limits actually constrain agent performance.
Monitoring and Alerting
Key Metrics
Track these metrics to understand optimization needs:
- Context token count over time
- Cache hit rates for repeated patterns
- Response quality metrics by context size
- Cost per conversation by context length
- Latency by context size
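These metrics can be captured per turn in a simple record, for example (field names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class ContextMetrics:
    """Per-turn metrics for understanding optimization needs."""
    context_tokens: int
    cache_hit_rate: float   # 0.0-1.0 across repeated prefixes
    quality_score: float    # task-specific quality metric
    cost_usd: float         # cost attributed to this turn
    latency_ms: float
```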
Alert Thresholds
Set alerts for:
- Context utilization above 80%
- Cache hit rate below 50%
- Quality score drop of more than 10%
- Cost increase above baseline
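A sketch of the corresponding checks, using the ContextMetrics record above (how the baselines are established is left as an assumption):

```python
from typing import List


def check_alerts(metrics: ContextMetrics, context_limit: int,
                 baseline_quality: float, baseline_cost: float) -> List[str]:
    """Return the alerts triggered by the thresholds above."""
    alerts = []
    if metrics.context_tokens / context_limit > 0.8:
        alerts.append("context_utilization_above_80_percent")
    if metrics.cache_hit_rate < 0.5:
        alerts.append("cache_hit_rate_below_50_percent")
    if metrics.quality_score < baseline_quality * 0.9:
        alerts.append("quality_drop_over_10_percent")
    if metrics.cost_usd > baseline_cost:
        alerts.append("cost_above_baseline")
    return alerts
```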
Integration Patterns
Integration with Agent Framework
Integrate optimization into agent workflow:
```python
from typing import Dict


class OptimizingAgent:
    def __init__(self, context_limit: int = 80000):
        self.context_limit = context_limit
        self.optimizer = ContextOptimizer()

    def process(self, user_input: str, context: Dict) -> Dict:
        # Check whether the context needs optimization before calling the model
        if self.optimizer.should_compact(context):
            context = self.optimizer.compact(context)
        # Process with the optimized context
        response = self._call_model(user_input, context)
        # Track metrics for this turn
        self.optimizer.record_metrics(context, response)
        return response
```

Integration with Memory Systems
Connect optimization with memory systems:
```python
from typing import Dict


class MemoryAwareOptimizer:
    def __init__(self, memory_system, context_limit: int, importance_threshold: float = 0.5):
        self.memory = memory_system
        self.limit = context_limit
        # Memories below this importance are moved out of the active context
        # (the 0.5 default is illustrative)
        self.importance_threshold = importance_threshold

    def optimize_context(self, current_context: Dict, task: str) -> Dict:
        # Check which task-relevant information is already held in memory
        relevant_memories = self.memory.retrieve(task)
        # Move information out of the context when it is not needed there
        for mem in relevant_memories:
            if mem["importance"] < self.importance_threshold:
                # Keep a reference so the memory can be retrieved if needed
                current_context = remove_from_context(current_context, mem)
        return current_context
```

Performance Benchmarks
Compaction Performance
Compaction should reduce token count while preserving quality. Target:
- 50-70% token reduction for aggressive compaction
- Less than 5% quality degradation from compaction
- Less than 10% latency increase from compaction overhead
Masking Performance
Observation masking should reduce token count significantly:
- 60-80% reduction in masked observations
- Less than 2% quality impact from masking
- Near-zero latency overhead
Cache Performance
KV-cache optimization should improve cost and latency:
- 70%+ cache hit rate for stable workloads
- 50%+ cost reduction from cache hits
- 40%+ latency reduction from cache hits