Prompt Caching Cost Controls for Multi-Tenant Agents
Bottom Line
Prompt caching is a cost control only when your tenant boundaries, prompt layout, and observability agree. Keep reusable context at the front, isolate cache routing by tenant policy, and alert on cached-token regressions before they become margin leaks.
Key Takeaways
- OpenAI prompt caching begins at 1,024 prompt tokens and reports cached_tokens.
- Put static tools, schemas, policies, and examples before tenant-specific request data.
- Use tenant-aware prompt_cache_key values; never treat caching as an auth boundary.
- Track cache hit ratio, uncached input spend, and p95 latency per tenant and agent version.
Prompt caching can cut repeated-input cost and latency for AI agents, but multi-tenant systems make it easy to lose the savings or blur operational boundaries. The fix is not just “make prompts longer.” You need deterministic prompt assembly, tenant-aware cache routing, budget checks, and verification that cached tokens are actually showing up in provider usage metadata.
Prerequisites
Bottom Line
Treat prompt caching as a production cost-control layer, not a model feature you hope will happen. Cache hit rate should be measured per tenant, per agent, and per prompt template version.
- An agent service that sends repeated system prompts, tool definitions, schemas, examples, or policy text.
- Access to provider usage metadata; for OpenAI, read usage.prompt_tokens_details.cached_tokens (Chat Completions) or usage.input_tokens_details.cached_tokens (Responses API).
- A tenant identifier, agent identifier, and prompt template version available at request time.
- A billing config that stores current model input prices outside application code.
OpenAI’s official guidance says Prompt Caching is automatic on recent models, starts when prompts contain 1,024 or more tokens, and depends on exact prefix matches. Anthropic exposes cache_control for automatic or explicit breakpoints, while Gemini’s Interactions API uses implicit caching for supported models. The common engineering rule is the same: stable prefix first, variable context last.
1. Map Cacheable Prefixes
Start by splitting each agent prompt into stable and variable regions. A cacheable prefix should change slowly and be identical across many requests. In a multi-tenant product, that does not always mean global reuse. The prefix may include tenant-specific policy, contractual constraints, retrieval rules, or tool availability.
Build a prompt manifest
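type PromptManifest = {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
  staticBlocks: string[]; // stable prefix: instructions, tool schemas, policy
  dynamicBlocks: string[]; // per-request context: user question, retrieved snippets
};

// Static blocks always come first so the cacheable prefix is identical across requests.
export function assemblePrompt(manifest: PromptManifest) {
  return [
    ...manifest.staticBlocks,
    "--- dynamic request context below ---",
    ...manifest.dynamicBlocks,
  ].join("\n\n");
}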
The manifest gives you a controlled place to review cache behavior during releases. Before sending logs or customer examples into test fixtures, sanitize them with TechBytes’ Data Masking Tool so cache experiments do not copy production secrets into prompts.
- Cacheable: system instructions, tool schemas, JSON response schemas, safety policy, long examples.
- Usually variable: user question, retrieved snippets, customer record IDs, locale, time-sensitive facts.
- Risky to share: tenant-specific policy, private knowledge-base excerpts, entitlements, compliance text.
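One way to make the stable/variable split above reviewable is to fingerprint the stable region and scan it for content that tends to change per request. The following is a minimal sketch, not part of any provider SDK: `prefixFingerprint`, `findVolatileBlocks`, and the regex patterns are hypothetical names and heuristics chosen for illustration, and the `"\n\n"` separator is assumed to match the one used during prompt assembly.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hash only the stable region of the prompt. Two requests
// with different fingerprints cannot share an exact-match prefix.
export function prefixFingerprint(staticBlocks: string[]): string {
  // Assumption: blocks are joined with the same separator the assembler uses.
  const prefix = staticBlocks.join("\n\n");
  return createHash("sha256").update(prefix, "utf8").digest("hex").slice(0, 16);
}

// Heuristic patterns (assumptions) for values that usually differ per request.
const VOLATILE_PATTERNS: RegExp[] = [
  /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}/, // ISO-8601 style timestamps
  /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i, // UUIDs
];

// Returns the indexes of static blocks that look like they contain volatile data.
export function findVolatileBlocks(staticBlocks: string[]): number[] {
  const flagged: number[] = [];
  staticBlocks.forEach((block, index) => {
    if (VOLATILE_PATTERNS.some((pattern) => pattern.test(block))) {
      flagged.push(index);
    }
  });
  return flagged;
}
```

Logging the fingerprint alongside each request gives you a cheap signal for prefix churn: if it changes between two requests that should share a prefix, a cache miss is expected. The pattern list is intentionally small and will not catch every source of drift.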
2. Build Tenant-Safe Keys
On OpenAI, prompt_cache_key can improve routing for requests with common long prefixes. For multi-tenant agents, build this key from stable dimensions that describe who may share cache locality. Default to per-tenant isolation, then selectively group tenants only when policy and prompt content are intentionally identical.
import crypto from "node:crypto";

export function promptCacheKey(input: {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
}) {
  // Every dimension that defines who may share a prefix goes into the key.
  const raw = [
    input.tenantId,
    input.agentId,
    input.templateVersion,
    input.dataPolicyVersion,
  ].join(":");
  // Hash so raw tenant identifiers are not sent as the cache key.
  return crypto.createHash("sha256").update(raw).digest("hex").slice(0, 32);
}
Then pass the key with the request. The following example calls client.responses.create from the OpenAI SDK and requests 24h cache retention where supported. Keep the model name and price data in config so you can update them without editing business logic.
import OpenAI from "openai";
import { assemblePrompt, promptCacheKey } from "./prompt-cache";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// The parameter type is derived from assemblePrompt so it tracks the manifest shape.
export async function runAgent(manifest: Parameters<typeof assemblePrompt>[0]) {
  const input = assemblePrompt(manifest);
  return client.responses.create({
    // Model name comes from config; the literal is only a fallback.
    model: process.env.AGENT_MODEL || "gpt-4.1",
    input,
    prompt_cache_key: promptCacheKey(manifest),
    prompt_cache_retention: "24h",
  });
}
- Use one key per tenant when prompts include tenant-specific policy or data-access rules.
- Use one key per tenant cohort only when legal, security, and prompt content are the same.
- Rotate keys when templateVersion or dataPolicyVersion changes.
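The cohort rule in the list above can be sketched as an explicit allowlist lookup that falls back to per-tenant isolation. This is an illustration under stated assumptions: `cacheScope`, `scopedPromptCacheKey`, and the cohort map are hypothetical names, and the map is assumed to contain only tenants that have already been reviewed as sharing identical policy and prompt content.

```typescript
import { createHash } from "node:crypto";

type KeyInput = {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
};

// Hypothetical allowlist: tenant ID -> cohort ID. Tenants not listed stay isolated.
type CohortMap = ReadonlyMap<string, string>;

// Default to per-tenant scope; share only when the tenant is explicitly in a cohort.
export function cacheScope(tenantId: string, cohorts: CohortMap): string {
  const cohort = cohorts.get(tenantId);
  return cohort !== undefined ? `cohort:${cohort}` : `tenant:${tenantId}`;
}

export function scopedPromptCacheKey(input: KeyInput, cohorts: CohortMap): string {
  // Serializing an array keeps field boundaries unambiguous even if an ID contains ":".
  const raw = JSON.stringify([
    cacheScope(input.tenantId, cohorts),
    input.agentId,
    input.templateVersion,
    input.dataPolicyVersion,
  ]);
  return createHash("sha256").update(raw).digest("hex").slice(0, 32);
}
```

Because the template and policy versions are part of the hashed input, bumping either one yields a new key without any separate rotation step. The key only influences cache routing; access control still has to be enforced elsewhere.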
3. Add Budget Guards
Caching reduces marginal repeated-input cost; it does not cap spend. Add a budget guard that estimates uncached input exposure before dispatch and records actual usage afterward. This example avoids hard-coded vendor prices by reading your current billing table from configuration.
// Covers both OpenAI usage shapes: Chat Completions (prompt_*) and Responses (input_*).
type Usage = {
  prompt_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  input_tokens?: number;
  input_tokens_details?: { cached_tokens?: number };
};

type PriceConfig = {
  inputPerMillion: number;
  cachedInputPerMillion: number;
};

export function estimateInputCost(usage: Usage, price: PriceConfig) {
  const promptTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
  const cachedTokens =
    usage.input_tokens_details?.cached_tokens ??
    usage.prompt_tokens_details?.cached_tokens ??
    0;
  // Clamp at zero in case the provider reports inconsistent counts.
  const uncachedTokens = Math.max(promptTokens - cachedTokens, 0);
  return {
    promptTokens,
    cachedTokens,
    cacheHitRatio: promptTokens === 0 ? 0 : cachedTokens / promptTokens,
    estimatedInputCost:
      (uncachedTokens / 1_000_000) * price.inputPerMillion +
      (cachedTokens / 1_000_000) * price.cachedInputPerMillion,
  };
}
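The function above covers the after-the-fact half of the guard. The pre-dispatch half could look roughly like the sketch below, which prices the whole prompt at the uncached rate as a worst case. Everything here is an assumption for illustration: `roughTokenCount`, `checkInputBudget`, and `GuardPrice` are hypothetical names, and the four-characters-per-token ratio is a coarse heuristic for English text rather than a real tokenizer.

```typescript
type GuardPrice = { inputPerMillion: number };

type BudgetDecision = {
  allowed: boolean;
  estimatedTokens: number;
  worstCaseCost: number;
};

// Rough heuristic (assumption): about 4 characters per token.
// Replace with a tokenizer for the target model when you need tighter estimates.
export function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

// Worst case assumes zero cache hits, so every token is billed at the uncached rate.
export function checkInputBudget(
  prompt: string,
  price: GuardPrice,
  remainingBudget: number,
): BudgetDecision {
  const estimatedTokens = roughTokenCount(prompt);
  const worstCaseCost = (estimatedTokens / 1_000_000) * price.inputPerMillion;
  return {
    allowed: worstCaseCost <= remainingBudget,
    estimatedTokens,
    worstCaseCost,
  };
}
```

A caller would reject or queue the request when `allowed` is false, and after the response arrives pass its usage object to estimateInputCost to record what was actually spent.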
Store the result with tenant, agent, model, template version, and request path. For production dashboards, make these your first-line metrics:
- cached_tokens / prompt_tokens: the direct hit-rate signal.
- uncached input cost per tenant: the spend number finance will care about.
- p95 latency by cache hit bucket: the user-visible performance effect.
- template version drift: the fastest way to catch accidental prefix churn.
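A dashboard query for the metrics above could be prototyped in memory before it moves to your metrics store. The sketch below groups stored request records by tenant and template version, then reports the token-weighted hit ratio and p95 latency for the hit and miss buckets. `RequestRecord`, `percentile`, and `rollupByTenantAndVersion` are hypothetical names, and the nearest-rank percentile method is one reasonable choice among several.

```typescript
type RequestRecord = {
  tenantId: string;
  templateVersion: string;
  promptTokens: number;
  cachedTokens: number;
  latencyMs: number;
};

type Rollup = {
  requests: number;
  hitRatio: number;
  p95HitMs: number;
  p95MissMs: number;
};

// Nearest-rank percentile over a sorted copy; returns 0 for an empty sample set.
export function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

export function rollupByTenantAndVersion(records: RequestRecord[]): Map<string, Rollup> {
  const groups = new Map<string, RequestRecord[]>();
  for (const record of records) {
    const key = `${record.tenantId}|${record.templateVersion}`;
    const group = groups.get(key) ?? [];
    group.push(record);
    groups.set(key, group);
  }

  const result = new Map<string, Rollup>();
  groups.forEach((group, key) => {
    const promptTokens = group.reduce((sum, r) => sum + r.promptTokens, 0);
    const cachedTokens = group.reduce((sum, r) => sum + r.cachedTokens, 0);
    // A request counts as a "hit" if any of its prompt tokens came from cache.
    const hits = group.filter((r) => r.cachedTokens > 0).map((r) => r.latencyMs);
    const misses = group.filter((r) => r.cachedTokens === 0).map((r) => r.latencyMs);
    result.set(key, {
      requests: group.length,
      hitRatio: promptTokens === 0 ? 0 : cachedTokens / promptTokens,
      p95HitMs: percentile(hits, 95),
      p95MissMs: percentile(misses, 95),
    });
  });
  return result;
}
```

Grouping by template version is what surfaces prefix churn: a new version that suddenly shows a much lower hit ratio than its predecessor is worth diffing before it reaches all tenants.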
4. Verify Expected Output
Run two identical requests with a prompt longer than 1,024 tokens. The first request may show zero cached tokens because the prefix has just been written. The second request should show a positive cached_tokens value if the prefix, key, model, and routing conditions match.
const first = await runAgent(manifest);
const second = await runAgent(manifest);

// The Responses API reports cached tokens under usage.input_tokens_details.
console.log({
  firstCached: first.usage?.input_tokens_details?.cached_tokens ?? 0,
  secondCached: second.usage?.input_tokens_details?.cached_tokens ?? 0,
});
Expected output:
{
  "firstCached": 0,
  "secondCached": 1920
}
The exact number will vary with tokenization and provider behavior. The important check is directional: repeated requests should move cached tokens above zero and reduce the uncached share. If the second value is still zero, compare the assembled prompt byte-for-byte before investigating provider behavior.
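For the comparison step, a small helper that reports where two assembled prompts stop matching is usually faster than reading them side by side. This is a hypothetical debugging sketch (`firstDivergence` and `describeDivergence` are illustrative names); it compares UTF-16 code units rather than raw bytes, which is sufficient for locating the first difference between two JavaScript strings.

```typescript
// Returns the index of the first differing character, or -1 if the strings are identical.
export function firstDivergence(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    if (a.charCodeAt(i) !== b.charCodeAt(i)) return i;
  }
  // No difference within the shared length: either identical or one is a prefix of the other.
  return a.length === b.length ? -1 : limit;
}

// Formats a short excerpt around the divergence point for log output.
export function describeDivergence(a: string, b: string, context = 30): string {
  const at = firstDivergence(a, b);
  if (at === -1) return "prompts are identical";
  const start = Math.max(at - context, 0);
  const left = JSON.stringify(a.slice(start, at + context));
  const right = JSON.stringify(b.slice(start, at + context));
  return `diverge at char ${at}: ${left} vs ${right}`;
}
```

A divergence index that falls inside the static region points at prompt assembly as the cause of the miss; one that falls after the dynamic-context separator is expected and harmless.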
5. Troubleshooting Top 3 and What’s Next
Troubleshooting top 3
- cached_tokens stays at zero: your prompt may be under 1,024 tokens, the prefix may differ between calls, or the cache may have expired before reuse.
- Hit rate drops after deployment: inspect generated tool schemas, timestamps, randomized examples, or feature flags that moved into the prefix.
- One tenant becomes expensive: check whether that tenant has a unique policy version, a noisy prompt template, or traffic too sparse for retention to help.
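Generated tool schemas are a common source of accidental prefix drift because object key order can vary between builds or code paths. One mitigation, sketched here under the assumption that your schemas are plain JSON values, is to serialize them with sorted keys before they enter the prompt. `stableStringify` is a hypothetical helper, not a library function.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serializes with object keys sorted at every level, so logically equal schemas
// produce the same string regardless of property insertion order.
// Array order is preserved on purpose, since element order can be meaningful.
export function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  const keys = Object.keys(value).sort();
  const entries = keys.map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
  return `{${entries.join(",")}}`;
}
```

For example, `stableStringify({ b: 1, a: { d: [1, 2], c: "x" } })` yields `{"a":{"c":"x","d":[1,2]},"b":1}`, the same string you get when the properties are declared in any other order.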
What’s next
Once basic caching works, move cost controls into release engineering. Treat prompt templates like code artifacts: diff them, format snippets consistently with the Code Formatter, tag versions, and run regression tests that compare prompt length, cached-token ratio, and estimated input spend. For larger platforms, add policy-aware tenant cohorts, cache-aware routing dashboards, and automatic alerts when an agent version increases uncached input cost by more than your margin threshold.
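The margin-threshold alert mentioned above could start as a simple release gate that compares per-request uncached input cost between a baseline and a candidate agent version. The sketch below is illustrative only: `VersionStats`, `uncachedCostGate`, and the relative-increase rule are assumptions, and a real gate would also want a minimum sample size before trusting the comparison.

```typescript
type VersionStats = {
  requests: number;
  uncachedInputCost: number; // total uncached input spend for this version
};

type GateResult = {
  pass: boolean;
  baselinePerRequest: number;
  candidatePerRequest: number;
  increase: number; // relative increase, e.g. 0.3 means 30% more per request
};

function perRequest(stats: VersionStats): number {
  return stats.requests === 0 ? 0 : stats.uncachedInputCost / stats.requests;
}

// Fails when the candidate's per-request uncached cost exceeds the baseline
// by more than maxIncrease (a fraction, so 0.2 allows up to a 20% increase).
export function uncachedCostGate(
  baseline: VersionStats,
  candidate: VersionStats,
  maxIncrease: number,
): GateResult {
  const baselinePerRequest = perRequest(baseline);
  const candidatePerRequest = perRequest(candidate);
  let increase: number;
  if (baselinePerRequest === 0) {
    // With no baseline spend, any candidate spend counts as an unbounded increase.
    increase = candidatePerRequest === 0 ? 0 : Infinity;
  } else {
    increase = (candidatePerRequest - baselinePerRequest) / baselinePerRequest;
  }
  return { pass: increase <= maxIncrease, baselinePerRequest, candidatePerRequest, increase };
}
```

Running this in CI against replayed traffic, or in production against the first hours of a canary, turns the margin threshold into a concrete pass or fail signal per agent version.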
Frequently Asked Questions
How does prompt caching reduce AI agent costs?
... usage.prompt_tokens_details.cached_tokens to confirm whether repeated tokens were served from cache.
Should multi-tenant agents share the same prompt cache key?
Why are cached tokens still zero after I enabled caching?
... cached_tokens as zero.
Does prompt caching change the model answer?