the problem: finite windows, infinite tasks

every llm has a context window limit. even at 128k or 200k tokens, an agent executing a multi-step workflow -- reading files, calling apis, reasoning over results -- will exhaust that budget fast. the naive approach of stuffing everything into the prompt works for simple chatbots but collapses for agents that need to operate over extended sessions, interact with dozens of tools, and maintain coherent state across hundreds of turns.

context management is not an optimization. it is a core system design concern. an agent without a context strategy is an agent that degrades unpredictably as conversations grow.

prefix caching

most agent interactions share a common prefix: the system prompt, tool definitions, persistent instructions, and persona configuration. prefix caching exploits this by hashing the prompt prefix and reusing the cached kv-state for subsequent requests that share the same prefix.

the practical implication for system design: structure your prompts so that the static portion comes first and the dynamic portion comes last. system instructions, tool schemas, and few-shot examples should be at the top. the conversation history and current user message go at the bottom. this maximizes cache hit rate because the prefix remains stable across turns.

anthropic's api supports this explicitly with cache_control breakpoints. openai implements it automatically. either way, the design principle is the same: front-load the stable content.

append-only context with front-trimming

treat the conversation as an append-only log. new messages, tool calls, and observations get appended to the end. when the total token count approaches the limit, trim from the front -- the oldest messages get dropped first.

this works because recent context is almost always more relevant than distant context. but naive front-trimming has failure modes: it can drop the system prompt, lose critical early instructions, or remove context that the agent is still referencing. the fix is to pin certain messages as non-evictable. the system prompt, any user-defined instructions, and the most recent n turns should be protected from trimming.

a practical implementation maintains three zones: a pinned prefix (system prompt + instructions), a sliding window of recent turns (last 10-20 exchanges), and an evictable middle zone where old turns get dropped as needed.

tool masking

tool definitions consume significant token budget. an agent with 50 available tools might spend 8-10k tokens just on tool schemas before any conversation begins. tool masking addresses this by only including tool definitions that are relevant to the current agent state.

implementation approaches range from simple to sophisticated:

  • state-based masking: define which tools are available at each stage of the workflow. a "research" phase exposes search and read tools; an "execution" phase exposes write and deploy tools.
  • classifier-based masking: use a lightweight model to predict which tools are likely needed based on the current user message, then only inject those tool schemas.
  • lazy loading: start with a minimal tool set and let the agent request additional tools when it identifies what it needs.

the tradeoff is between token savings and capability. overly aggressive masking can prevent the agent from discovering that a tool exists. a good heuristic is to always include a "meta" tool that lists all available tools, so the agent can request access to ones not currently loaded.

context compression

instead of dropping old messages entirely, compress them. summarize the first 50 turns of a conversation into a paragraph that captures the key decisions, facts established, and current state. this preserves the agent's understanding of the session history without consuming proportional token budget.

the compression can be done by the same llm (self-summarization) or by a smaller, cheaper model dedicated to compression. the summary replaces the original messages in the context, reducing perhaps 20k tokens to 500.

critical implementation detail: compress incrementally. do not re-summarize the entire history every turn. instead, when the evictable zone exceeds a threshold, summarize the oldest batch and prepend the summary to the existing compressed history. this keeps summarization costs linear rather than quadratic.

hierarchical memory

production agents need memory at multiple time scales:

  • short-term memory: the current context window. holds the active conversation, recent tool results, and working state. fast access, limited capacity.
  • medium-term memory: session-level persistence. stores compressed summaries of the current session, extracted facts and decisions, and intermediate results. survives context trimming but not session boundaries. typically implemented as a structured store the agent can query.
  • long-term memory: persistent storage across sessions. user preferences, learned procedures, project-specific knowledge, and accumulated facts. backed by a database or vector store. the agent retrieves from long-term memory at the start of each session or when it needs historical context.

the retrieval mechanism matters. short-term is implicit (it is the context). medium-term uses explicit tool calls to read/write a scratchpad or state store. long-term uses semantic search over a vector database, filtered by relevance to the current task.

practical system design

putting these together for a production agent:

  • structure prompts for prefix caching: static content first, dynamic content last.
  • implement three-zone context: pinned prefix, evictable middle with compression, protected recent window.
  • mask tools based on workflow state, keeping a meta-tool for discovery.
  • compress incrementally, never re-summarize the full history.
  • layer memory: context window for immediate work, session store for medium-term, vector db for long-term.
  • monitor token usage per zone and set alerts when compression frequency spikes -- this indicates the agent is doing too much in a single session and should be decomposed into sub-tasks.

context management is what separates a demo agent from a production agent. the demo works because the conversation is short. production conversations are long, messy, and unpredictable. designing for that reality upfront saves significant rework later.