Claude Prompt Caching API Cost Optimization Guide (2026): Cut Token Spend Without Downgrading Models
Most teams overpay for Claude for one boring reason: they keep sending the same giant prompt prefix on every request. Think policy instructions, style rules, product docs, schema notes, and formatting constraints. You write it once, then accidentally bill it hundreds of times per hour.
Prompt caching fixes that. And when you apply it correctly, the savings are real. In some workloads, it reduces input spend by 40–70% without touching model quality. No downgrade to a smaller model. No weird summarization hacks. Just better request design.
This guide shows how to do it in production with curl, Python, and Node.js. We'll cover what to cache, what not to cache, how to measure hit rate, and where most teams mess it up.
What Claude Prompt Caching Actually Does
Prompt caching stores reusable prompt blocks so later requests can read them from cache instead of re-processing the same input tokens every time. Cached entries are ephemeral: they expire after roughly five minutes of inactivity, with the timer refreshed on every hit, so the win comes when part of your prompt is stable across many calls in a short window.
- Good cache candidates: system instructions, long tool descriptions, static policy text, fixed codebase conventions.
- Bad cache candidates: user-specific data, timestamps, request IDs, dynamic search results.
If your "shared prefix" changes every request, you won't get a hit, and you won't save money.
Cost Math: Why This Is Worth Your Time
Let's use a practical example: a coding assistant endpoint with a 9,000-token instruction prefix and a 1,500-token user payload.
| Scenario | Effective Input Tokens per Request | Relative Input Cost |
|---|---|---|
| No caching | 10,500 | 100% |
| Caching enabled, low hit rate (30%) | ~9,600 | ~92% |
| Caching enabled, solid hit rate (70%) | ~5,500 | ~52% |
| Caching enabled, high hit rate (90%) | ~3,400 | ~33% |
Figures fold in Anthropic's ephemeral-cache pricing: a miss re-writes the 9,000-token prefix at 1.25x the base input price, and a hit reads it back at 0.1x. Notice that a low hit rate barely saves anything, because every miss pays the write premium.
Same user experience. Same model. Lower input bill. This is one of the few API optimizations that pays back immediately.
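To sanity-check numbers like these against your own traffic, fold in the two multipliers Anthropic documents for the ephemeral cache: a miss re-writes the prefix at 1.25x the base input price, and a hit reads it back at 0.1x. A minimal sketch, using the same 9,000/1,500 token split as the example above:

```python
# Sketch: effective input-token cost per request under prompt caching.
# Assumes Anthropic's documented ephemeral-cache multipliers:
# cache writes bill at 1.25x the base input price, reads at 0.1x.

def effective_input_tokens(prefix: int, payload: int, hit_rate: float) -> float:
    """Return billed-equivalent input tokens per request."""
    hit_cost = prefix * 0.10 + payload    # cache read + fresh payload
    miss_cost = prefix * 1.25 + payload   # cache write + fresh payload
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

# Note the 0% row: caching with no reuse costs MORE than no caching,
# because every request pays the write premium.
for rate in (0.0, 0.3, 0.7, 0.9):
    cost = effective_input_tokens(9000, 1500, rate)
    print(f"hit rate {rate:.0%}: ~{cost:,.0f} effective tokens "
          f"({cost / 10500:.0%} of uncached)")
```

Plug in your own prefix size and measured hit rate before committing to a rollout.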
Minimal curl Example (Claude Messages API)
Here's a compact request shape that marks the reusable system block as cacheable. (In production this block would be your long instruction set; blocks below the model's minimum cacheable length are silently not cached.)

```bash
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-sonnet-4-6",
    "max_tokens": 900,
    "system": [
      {
        "type": "text",
        "text": "You are a strict senior code reviewer. Return issues grouped by severity with exact line references.",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Review this diff and flag security issues first..."
      }
    ]
  }'
```
Do not include moving values (current timestamp, request UUID, session-specific hints) in the cached block. Put those in the user message instead.
Python Pattern: Keep a Versioned Stable Prefix
This pattern works well for PR review, ticket triage, and QA validation endpoints.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

PROMPT_VERSION = "review-v3"

SYSTEM_PREFIX = f"""
[{PROMPT_VERSION}] You are a backend reviewer.
Rules:
1) Prioritize security and data-loss risks.
2) Give exact file and line references.
3) End with a merge recommendation.
""".strip()


def review_patch(diff_text: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=900,
        system=[{
            "type": "text",
            "text": SYSTEM_PREFIX,
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{
            "role": "user",
            "content": f"Review this patch:\n\n{diff_text}",
        }],
    )
    usage = getattr(resp, "usage", None)
    if usage:
        print("cache_creation_input_tokens:", getattr(usage, "cache_creation_input_tokens", 0))
        print("cache_read_input_tokens:", getattr(usage, "cache_read_input_tokens", 0))
    return resp.content[0].text
```
Why the version tag? Because you will tweak instructions over time. Versioning avoids mystery behavior when one worker uses old text and another uses new text.
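One low-effort way to enforce that discipline is to derive the tag from the prompt text itself, so the version can never drift out of sync with the content. A sketch, with `SYSTEM_PREFIX` standing in for your real cached block:

```python
# Sketch: derive the version tag from a hash of the prompt text, so two
# workers can never silently disagree about which instructions they run.
# SYSTEM_PREFIX here is a placeholder for your real cached block.
import hashlib

SYSTEM_PREFIX = "You are a backend reviewer. Prioritize security risks."

def prompt_version(text: str, label: str = "review") -> str:
    # First 8 hex chars of SHA-256 is plenty to distinguish revisions.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{label}-{digest}"

print(prompt_version(SYSTEM_PREFIX))  # stable tag like "review-1a2b3c4d"
```

Log the tag alongside each request and any behavior mystery reduces to a diff between two known prompt revisions.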
Node.js Pattern: Separate Stable and Volatile Content
```javascript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const STABLE_POLICY = `
You are a support classifier.
Return JSON with: priority, owner, summary, and next_action.
Follow SLA policy exactly.
`.trim();

export async function classifyTicket(ticketText, customerTier) {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 500,
    system: [{
      type: "text",
      text: STABLE_POLICY,
      cache_control: { type: "ephemeral" },
    }],
    messages: [{
      role: "user",
      // Volatile, per-request values live here, not in the cached block.
      content: `Customer tier: ${customerTier}\n\nTicket:\n${ticketText}`,
    }],
  });
  return msg.content[0].text;
}
```
Keep the stable policy block stable. If you inject customer tier into the cached block, you hurt hit rate for no reason.
Production Rules That Improve Cache Hit Rate Fast
- Freeze your shared prefix. Small edits, including whitespace churn, can break reuse.
- Move dynamic values out of cached blocks. User IDs, dates, and request metadata belong in the volatile section.
- Version prompts intentionally. Roll from v3 to v4 on purpose, not by accident.
- Track hit metrics per endpoint. Don't rely on one global average.
- Use routing discipline. If this endpoint depends on caching behavior, keep its request format strict.
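For the per-endpoint tracking rule, something this small is enough. It's a sketch that assumes you feed in the `cache_read_input_tokens` and `cache_creation_input_tokens` values read off each response's usage block; the endpoint names are illustrative:

```python
# Sketch: per-endpoint cache metrics, fed from the usage block each
# Messages API response carries. Hit rate here is token-level:
# tokens served from cache / (tokens served from cache + tokens written).
from collections import defaultdict

class CacheStats:
    def __init__(self):
        self.by_endpoint = defaultdict(
            lambda: {"reads": 0, "writes": 0, "fresh": 0, "calls": 0}
        )

    def record(self, endpoint, cache_read, cache_write, fresh_input):
        s = self.by_endpoint[endpoint]
        s["reads"] += cache_read
        s["writes"] += cache_write
        s["fresh"] += fresh_input
        s["calls"] += 1

    def hit_rate(self, endpoint):
        s = self.by_endpoint[endpoint]
        total = s["reads"] + s["writes"]
        return s["reads"] / total if total else 0.0
```

Call `record("pr-review", usage.cache_read_input_tokens, usage.cache_creation_input_tokens, usage.input_tokens)` after each response and export `hit_rate` per endpoint to your metrics system.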
Common Mistakes (I See These Weekly)
- Caching tiny prompts: Anthropic enforces a minimum cacheable length (roughly 1,024 tokens on Sonnet and Opus-class models, higher on Haiku). Shorter blocks are silently not cached, so marking them buys you nothing.
- Mixing personalization into system text: Makes every request unique and kills cache reuse.
- Ignoring usage fields: If you don't inspect usage metrics, you won't know whether caching is helping.
- No fallback path: Cost optimization and reliability are separate. Keep a backup route for spikes.
Many teams solve the fallback side with an OpenAI-compatible secondary endpoint. For example, you can keep your primary Claude flow and still maintain a spare route through KissAPI for burst traffic or regional issues. Different problem, same operational mindset: don't wait for incident day to wire it up.
Should You Cache Every Claude Endpoint?
No. Cache where there is real prefix reuse. Skip it for highly personalized chat threads where each request is mostly fresh context. A quick rule of thumb:
If your stable prefix is at least 30% of request input and gets reused across many calls, caching is usually worth it. Below that, test first before adding complexity.
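You can make that test concrete. Under the ephemeral cache's documented pricing (1.25x writes, 0.1x reads), a quick break-even sketch:

```python
# Sketch: does caching reduce spend for a given prefix size and expected
# hit rate? Assumes the documented ephemeral-cache multipliers
# (1.25x base input price on a miss, 0.1x on a hit).

def caching_saves(prefix_tokens: int, payload_tokens: int, hit_rate: float) -> bool:
    uncached = prefix_tokens + payload_tokens
    cached = (hit_rate * (0.10 * prefix_tokens + payload_tokens)
              + (1 - hit_rate) * (1.25 * prefix_tokens + payload_tokens))
    return cached < uncached

# Algebra: savings require 1.15 * hit_rate > 0.25, i.e. a hit rate
# above ~22%, independent of prefix size.
```

Working the algebra through, break-even depends only on hit rate (above roughly 22%); prefix share then controls how large the savings get, which is why the 30%-of-input guideline is about magnitude, not break-even.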
Final take: prompt caching is one of the rare optimizations that makes both finance and engineering happy. You keep model quality, reduce input waste, and buy extra headroom before your next traffic jump.
Want a Backup Route Before the Next Traffic Spike?
Create a free account at kissapi.ai/register and keep an OpenAI-compatible fallback ready while you optimize Claude spend.
Start Free