AI Coding Agent Context Compaction Guide 2026: Cut Token Waste Without Losing the Thread
AI coding agents are getting better at long tasks, but they're still bad at one very human thing: knowing what to forget. Leave a Claude Code, Codex CLI, Cursor, or Cline session running long enough and the conversation turns into a junk drawer. Old errors, stale plans, abandoned branches, pasted logs, duplicate file snippets, and three versions of the same decision all sit in the context window charging rent.
That matters because context isn't free. Long history makes every next request slower, pricier, and easier to derail. Worse, a bloated session can hide the one instruction that actually matters: "don't touch billing code," "keep the API OpenAI-compatible," or "the test database is read-only." Context compaction is how you keep the useful memory and cut the waste.
This guide gives you a practical compaction pattern for coding agents in 2026. It works whether you use hosted tools or your own API stack, and it pairs nicely with token counting, cost limits, and multi-model routing.
What Context Compaction Means
Context compaction is not generic summarization. A normal summary says, "we debugged the auth flow." A useful coding-agent compaction says:
- What the user asked for.
- Which files changed and why.
- What decisions are locked.
- What constraints must not be violated.
- Which commands already ran and what failed.
- The exact next action.
That's the difference between a nice recap and a continuation artifact. Your goal is not to make the transcript shorter. Your goal is to let the next model call behave as if it still understands the job.
When to Compact
Don't wait until the context window is on fire. By then, the agent is usually already confused. Use thresholds.
| Trigger | Why It Helps | Suggested Action |
|---|---|---|
| 50% context used | Cheap early cleanup | Write a working summary |
| 70% context used | Risk of drift rises fast | Compact before more code changes |
| After a milestone | Old exploration becomes noise | Replace discovery with decisions |
| Before model switch | New model lacks hidden assumptions | Pass a compact handoff note |
| Before parallel agents | Subagents need tight scope | Give each agent only relevant state |
My rule: if a coding agent has spent more than 20 minutes exploring, compact before asking it to make broad edits. Exploration context is expensive and often misleading.
The Six-Part Compaction Template
Use a fixed template. Free-form summaries drift. This one is short enough to paste into most agent tools and structured enough to survive handoffs.
# Working Context
Goal: [one sentence]
Current status: [done / in progress / blocked]
Locked decisions:
- [decision + reason]
Files touched:
- path: [change summary]
Important constraints:
- [must keep / must avoid]
Commands run:
- [command] => [result]
Next action:
- [single next step]
Notice what's missing: emotional narration, every failed hypothesis, and full stack traces. Keep raw logs in files or issue comments. Put only the result in the compacted context.
Bad vs Good Compaction
Bad compaction sounds helpful but loses the task:
We worked on the billing page and fixed some bugs. Tests mostly pass. Continue from there.
That forces the next agent to rediscover everything. A good version is boring and precise:
# Working Context
Goal: Add invoice export to the billing dashboard.
Current status: UI complete; backend endpoint still missing.
Locked decisions:
- Export format is CSV first, PDF later.
- Do not change payment provider code.
Files touched:
- src/billing/InvoiceTable.tsx: added Export CSV button and selected row state.
- src/billing/types.ts: added InvoiceExportRow.
Important constraints:
- Existing pagination behavior must stay unchanged.
- User timezone must be used for date labels.
Commands run:
- npm test -- InvoiceTable => passed
Next action:
- Implement GET /api/billing/invoices/export with auth check.
This is the shape you want. It's not poetic. It lets the work continue.
Python: Estimate When to Compact
You don't need perfect tokenizer accuracy to make better decisions. A rough threshold is enough for automation. If you're using an OpenAI-compatible endpoint, count recent messages before each call and compact when the estimated input gets too large.
def rough_tokens(text: str) -> int:
return max(1, len(text) // 4)
def should_compact(messages, max_context_tokens=200_000, threshold=0.60):
total = 0
for msg in messages:
content = msg.get("content", "")
if isinstance(content, list):
content = "\n".join(part.get("text", "") for part in content if isinstance(part, dict))
total += rough_tokens(str(content))
return total > max_context_tokens * threshold, total
compact, used = should_compact(conversation_messages)
if compact:
print(f"Context is about {used:,} tokens. Compact before the next edit.")
For production billing, use a real tokenizer or your provider's usage fields. For a guardrail, this simple estimate catches the expensive cases early.
Node.js: Build a Compaction Prompt
When you ask a model to compact context, be strict about the output. Don't ask for "a concise summary." Ask for a continuation note with fields.
const compactionPrompt = (transcript) => `
You are compacting an AI coding-agent session.
Keep facts needed to continue the task. Drop chatter, duplicate logs, and dead ends.
Return Markdown with exactly these headings:
Goal, Current Status, Locked Decisions, Files Touched, Constraints, Commands Run, Next Action.
Transcript:
${transcript}
`;
async function compactSession(client, transcript) {
const response = await client.chat.completions.create({
model: "gpt-5.5",
messages: [{ role: "user", content: compactionPrompt(transcript) }],
temperature: 0.1,
max_tokens: 900
});
return response.choices[0].message.content;
}
If your endpoint supports multiple models, don't waste a premium reasoning model on every compaction. Use a cheaper model for straightforward compression, then reserve Claude Opus or GPT-5.5 for risky architecture decisions. KissAPI's OpenAI-compatible routing makes that kind of split easier because the app code can keep one request format while you change the model behind it.
What Not to Remove
Most bad compaction failures come from deleting constraints. Be aggressive with logs, but conservative with promises. Keep these even if they look small:
- User corrections: "Not React, use Vue" or "don't migrate the database."
- Security boundaries: read-only systems, production credentials, privacy constraints.
- Business rules: pricing math, quota logic, refund behavior, billing dates.
- Failed commands: only when the failure changes the next step.
- External state: deployed file paths, server names, feature flags, migration IDs.
Delete "I checked three files and maybe the issue is caching." Keep "the real bug is stale Redis key invoice:v2:{id}; local cache was a red herring." Specific beats long.
Cost Control: Compact, Then Route
Compaction is only one lever. The bigger win comes from combining it with routing and budgets:
- Count tokens before the call.
- Compact when history crosses your threshold.
- Route easy continuation work to a cheaper model.
- Reserve expensive models for design, debugging, and high-risk edits.
- Stop the run when task spend crosses a fixed budget.
Before running a long coding-agent job, plug your expected input and output into the API cost calculator. Then compare the model you plan to use with pages like Claude Sonnet 4.6 and GPT-5.5. The expensive mistake is not choosing a premium model. The expensive mistake is feeding that model 80,000 stale tokens every turn.
A Simple Operating Policy
Here's a sane default for teams using AI coding agents daily:
- At task start, write a short goal and constraints block.
- After discovery, compact into decisions and next action.
- After every major file change, update files touched and tests run.
- At 60% context, compact before continuing.
- At handoff, pass only the compacted context plus relevant files.
This sounds bureaucratic until it saves a two-hour debugging session. Coding agents don't need every word you've said. They need the right state at the right time.
FAQ
What is context compaction for AI coding agents?
It's the process of replacing a long coding-agent conversation with a shorter working summary that preserves task goal, decisions, constraints, changed files, commands run, open risks, and the next action.
When should I compact a coding agent session?
Compact around 50–70% of the usable context window, after a major milestone, before switching models, or before handing work to another agent. Waiting until the model is already confused is too late.
Does context compaction reduce API cost?
Yes, when it's done carefully. It cuts repeated history tokens and keeps future requests smaller. The catch: sloppy compaction can lose constraints and create rework, which costs more than the tokens you saved.
Keep Agent Costs Predictable
Use KissAPI to run Claude, GPT, and other models through one OpenAI-compatible endpoint, then pair compaction with routing and budget checks. Start with free credit at kissapi.ai/register.
Start Free