AI Engineering

Prompt Caching Cost Controls for Multi-Tenant Agents

Dillip Chowdary
Tech Entrepreneur & Innovator · June 25, 2026 · 6 min read

Bottom Line

Prompt caching is a cost control only when your tenant boundaries, prompt layout, and observability agree. Keep reusable context at the front, isolate cache routing by tenant policy, and alert on cached-token regressions before they become margin leaks.

Key Takeaways

  • OpenAI prompt caching begins at 1,024 prompt tokens and reports cached_tokens.
  • Put static tools, schemas, policies, and examples before tenant-specific request data.
  • Use tenant-aware prompt_cache_key values; never treat caching as an auth boundary.
  • Track cache hit ratio, uncached input spend, and p95 latency per tenant and agent version.

Prompt caching can cut repeated-input cost and latency for AI agents, but multi-tenant systems make it easy to lose the savings or blur operational boundaries. The fix is not just “make prompts longer.” You need deterministic prompt assembly, tenant-aware cache routing, budget checks, and verification that cached tokens are actually showing up in provider usage metadata.

Treat prompt caching as a production cost-control layer, not a model feature you hope will kick in. Measure cache hit rate per tenant, per agent, and per prompt template version.

Prerequisites

  • An agent service that sends repeated system prompts, tool definitions, schemas, examples, or policy text.
  • Access to provider usage metadata; for OpenAI, read usage.prompt_tokens_details.cached_tokens (Chat Completions) or usage.input_tokens_details.cached_tokens (Responses API).
  • A tenant identifier, agent identifier, and prompt template version available at request time.
  • A billing config that stores current model input prices outside application code.

OpenAI’s official guidance says Prompt Caching is automatic on recent models, starts when prompts contain 1,024 or more tokens, and depends on exact prefix matches. Anthropic exposes cache_control for automatic or explicit breakpoints, while Gemini’s Interactions API uses implicit caching for supported models. The common engineering rule is the same: stable prefix first, variable context last.

1. Map Cacheable Prefixes

Start by splitting each agent prompt into stable and variable regions. A cacheable prefix should change slowly and be identical across many requests. In a multi-tenant product, that does not always mean global reuse. The prefix may include tenant-specific policy, contractual constraints, retrieval rules, or tool availability.

Build a prompt manifest

type PromptManifest = {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
  staticBlocks: string[];
  dynamicBlocks: string[];
};

export function assemblePrompt(manifest: PromptManifest) {
  return [
    ...manifest.staticBlocks,
    "--- dynamic request context below ---",
    ...manifest.dynamicBlocks,
  ].join("\n\n");
}

The manifest gives you a controlled place to review cache behavior during releases. Before sending logs or customer examples into test fixtures, sanitize them with TechBytes’ Data Masking Tool so cache experiments do not copy production secrets into prompts.

  • Cacheable: system instructions, tool schemas, JSON response schemas, safety policy, long examples.
  • Usually variable: user question, retrieved snippets, customer record IDs, locale, time-sensitive facts.
  • Risky to share: tenant-specific policy, private knowledge-base excerpts, entitlements, compliance text.

Watch out: Prompt caching is not an authorization layer. Provider caches may be organization-scoped, but your application still owns tenant isolation, access checks, and redaction.
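
The classification above can be enforced with a small lint that runs before a manifest ships. The sketch below is illustrative only: findVolatileContent and its two patterns are hypothetical names and heuristics, not part of any provider SDK, and a real check would need patterns tuned to your own prompts.

```typescript
// Hypothetical lint: flags static blocks that contain values likely to change
// between requests. The patterns are illustrative, not exhaustive.
type Finding = { blockIndex: number; kind: string; match: string };

const VOLATILE_PATTERNS: Array<{ kind: string; pattern: RegExp }> = [
  // ISO 8601-style timestamps such as 2026-06-25T10:15:00
  { kind: "timestamp", pattern: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(:\d{2})?/ },
  // UUIDs, which often indicate request or record identifiers
  {
    kind: "uuid",
    pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i,
  },
];

function findVolatileContent(staticBlocks: string[]): Finding[] {
  const findings: Finding[] = [];
  staticBlocks.forEach((block, blockIndex) => {
    for (const { kind, pattern } of VOLATILE_PATTERNS) {
      const match = block.match(pattern);
      // Record only the first hit per pattern; one hit is enough to fail review.
      if (match) findings.push({ blockIndex, kind, match: match[0] });
    }
  });
  return findings;
}
```

An empty result does not prove a block is stable; it only means none of the listed patterns matched.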

2. Build Tenant-Safe Keys

On OpenAI, prompt_cache_key can improve routing for requests with common long prefixes. For multi-tenant agents, build this key from stable dimensions that describe who may share cache locality. Default to per-tenant isolation, then selectively group tenants only when policy and prompt content are intentionally identical.

import crypto from "node:crypto";

export function promptCacheKey(input: {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
}) {
  const raw = [
    input.tenantId,
    input.agentId,
    input.templateVersion,
    input.dataPolicyVersion,
  ].join(":");

  return crypto.createHash("sha256").update(raw).digest("hex").slice(0, 32);
}

Then pass the key with the request. The following example uses the client.responses.create call from the OpenAI Node SDK and requests 24h cache retention where supported. Keep the model name and price data in config so you can update them without editing business logic.

import OpenAI from "openai";
import { assemblePrompt, promptCacheKey } from "./prompt-cache";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

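// The parameter type is derived from assemblePrompt so the manifest shape stays in one place.
export async function runAgent(manifest: Parameters<typeof assemblePrompt>[0]) {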
  const input = assemblePrompt(manifest);

  return client.responses.create({
    model: process.env.AGENT_MODEL || "gpt-4.1",
    input,
    prompt_cache_key: promptCacheKey(manifest),
    prompt_cache_retention: "24h",
  });
}

  • Use one key per tenant when prompts include tenant-specific policy or data-access rules.
  • Use one key per tenant cohort only when legal, security, and prompt content are the same.
  • Rotate keys when templateVersion or dataPolicyVersion changes.
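
One way to encode the tenant-versus-cohort rule is to resolve a cache scope before hashing. This is a minimal sketch under assumptions: the approvedCohorts map, cacheScope, and rawCacheKey are hypothetical names, and the map is presumed to be maintained by whoever signs off on shared cache locality. The string it returns would be hashed the same way promptCacheKey hashes its raw input.

```typescript
// Hypothetical registry: tenantId -> cohortId. A tenant appears here only after
// review confirms its prompt prefix and data policy match the rest of the cohort.
type CohortRegistry = Map<string, string>;

type KeyInput = {
  tenantId: string;
  agentId: string;
  templateVersion: string;
  dataPolicyVersion: string;
};

function cacheScope(tenantId: string, approvedCohorts: CohortRegistry): string {
  const cohortId = approvedCohorts.get(tenantId);
  // Default to per-tenant isolation when no approved cohort exists.
  return cohortId ? `cohort:${cohortId}` : `tenant:${tenantId}`;
}

function rawCacheKey(input: KeyInput, approvedCohorts: CohortRegistry): string {
  // Version fields stay in the key, so a template or policy change rotates it.
  return [
    cacheScope(input.tenantId, approvedCohorts),
    input.agentId,
    input.templateVersion,
    input.dataPolicyVersion,
  ].join(":");
}
```

Prefixing the scope with "cohort:" or "tenant:" keeps a cohort ID from colliding with a tenant ID that happens to use the same string.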

3. Add Budget Guards

Caching reduces marginal repeated-input cost; it does not cap spend. Add a budget guard that estimates uncached input exposure before dispatch and records actual usage afterward. This example avoids hard-coded vendor prices by reading your current billing table from configuration.

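// Usage field names differ by endpoint: the Responses API reports input_tokens and
// input_tokens_details, while Chat Completions reports prompt_tokens and
// prompt_tokens_details. Accept both so the same function works for either.
type Usage = {
  input_tokens?: number;
  input_tokens_details?: { cached_tokens?: number };
  prompt_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
};

type PriceConfig = {
  inputPerMillion: number;
  cachedInputPerMillion: number;
};

export function estimateInputCost(usage: Usage, price: PriceConfig) {
  const promptTokens = usage.input_tokens ?? usage.prompt_tokens ?? 0;
  const cachedTokens =
    usage.input_tokens_details?.cached_tokens ??
    usage.prompt_tokens_details?.cached_tokens ??
    0;
  // Clamp at zero in case a provider ever reports more cached than total tokens.
  const uncachedTokens = Math.max(promptTokens - cachedTokens, 0);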

  return {
    promptTokens,
    cachedTokens,
    cacheHitRatio: promptTokens === 0 ? 0 : cachedTokens / promptTokens,
    estimatedInputCost:
      (uncachedTokens / 1_000_000) * price.inputPerMillion +
      (cachedTokens / 1_000_000) * price.cachedInputPerMillion,
  };
}
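
The function above prices a request after the provider has returned usage. The pre-dispatch half of the guard might look like the sketch below. It rests on two assumptions that are not from any provider: a rough four-characters-per-token heuristic (approxTokens) and a per-period budget tracked by your own billing store. checkBudget and its parameters are hypothetical names.

```typescript
type BudgetDecision = {
  allowed: boolean;
  worstCaseCost: number;
  remaining: number;
};

// Assumption: roughly 4 characters per token for English text. Swap in a real
// tokenizer if the guard needs to be more than a coarse safety net.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function checkBudget(
  prompt: string,
  inputPerMillion: number,
  spentThisPeriod: number,
  periodBudget: number,
): BudgetDecision {
  // Worst case: no cache hit, so every input token is billed at the full rate.
  const worstCaseCost = (approxTokens(prompt) / 1_000_000) * inputPerMillion;
  const remaining = periodBudget - spentThisPeriod;
  return { allowed: worstCaseCost <= remaining, worstCaseCost, remaining };
}
```

Pricing the worst case keeps the guard conservative: a cache hit can only make the actual charge lower than the estimate.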

Store the output of estimateInputCost with tenant, agent, model, template version, and request path. For production dashboards, make these your first-line metrics:

  • cached_tokens / prompt_tokens (input_tokens on the Responses API): the direct hit-rate signal.
  • uncached input cost per tenant: the spend number finance will care about.
  • p95 latency by cache hit bucket: the user-visible performance effect.
  • template version drift: the fastest way to catch accidental prefix churn.
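
As a rough illustration of the first two metrics, the sketch below rolls stored request records up by tenant. RequestRecord, TenantRollup, and rollupByTenant are hypothetical names; in practice this aggregation would more likely live in your metrics backend than in application code.

```typescript
type RequestRecord = {
  tenantId: string;
  promptTokens: number;
  cachedTokens: number;
};

type TenantRollup = {
  requests: number;
  promptTokens: number;
  cachedTokens: number;
  uncachedTokens: number;
  cacheHitRatio: number;
};

function rollupByTenant(records: RequestRecord[]): Map<string, TenantRollup> {
  const out = new Map<string, TenantRollup>();
  for (const r of records) {
    const cur = out.get(r.tenantId) ?? {
      requests: 0,
      promptTokens: 0,
      cachedTokens: 0,
      uncachedTokens: 0,
      cacheHitRatio: 0,
    };
    cur.requests += 1;
    cur.promptTokens += r.promptTokens;
    cur.cachedTokens += r.cachedTokens;
    cur.uncachedTokens += Math.max(r.promptTokens - r.cachedTokens, 0);
    // Token-weighted ratio: long prompts count for more than short ones.
    cur.cacheHitRatio =
      cur.promptTokens === 0 ? 0 : cur.cachedTokens / cur.promptTokens;
    out.set(r.tenantId, cur);
  }
  return out;
}
```

The ratio is weighted by tokens rather than averaged per request, so it tracks spend more closely. Multiply uncachedTokens by the configured input price to get the per-tenant cost figure.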

4. Verify Expected Output

Run two identical requests with a prompt longer than 1,024 tokens. The first request may show zero cached tokens because the prefix has just been written. The second request should show a positive cached_tokens value if the prefix, key, model, and routing conditions match.

const first = await runAgent(manifest);
const second = await runAgent(manifest);

console.log({
  // runAgent calls the Responses API, which reports cache hits under input_tokens_details.
  firstCached: first.usage?.input_tokens_details?.cached_tokens ?? 0,
  secondCached: second.usage?.input_tokens_details?.cached_tokens ?? 0,
});

Expected output:

{
  "firstCached": 0,
  "secondCached": 1920
}

The exact number will vary with tokenization and provider behavior. The important check is directional: repeated requests should move cached tokens above zero and reduce the uncached share. If the second value is still zero, compare the assembled prompt byte-for-byte before investigating provider behavior.
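
For that comparison, a small helper that reports where two assembled prompts first differ is usually enough. The sketch below uses a hypothetical firstDivergence function; it compares JavaScript string code units rather than encoded bytes, which is sufficient for spotting a stray timestamp or reordered block.

```typescript
type Divergence = { index: number; left: string; right: string } | null;

function firstDivergence(a: string, b: string, context = 20): Divergence {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  // Walk forward while both strings agree.
  while (i < limit && a[i] === b[i]) i++;
  // Reaching the end of both strings means they are identical.
  if (i === a.length && i === b.length) return null;
  return {
    index: i,
    left: a.slice(i, i + context),
    right: b.slice(i, i + context),
  };
}
```

A small index means the prompts diverge near the start, so almost none of the prefix can be reused; an index past the static blocks means the difference sits in the dynamic region, where it is expected.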

Pro tip: Add a canary test that sends a fixed long prompt every deploy and fails the rollout if cached-token ratio drops below your baseline.
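
The pass/fail decision for such a canary can be a few lines. The following is a sketch under assumptions: evaluateCanary, the baseline ratio, and the 10% default tolerance are hypothetical, and the token counts are expected to come from the second of two identical canary requests.

```typescript
type CanaryResult = { pass: boolean; ratio: number; floor: number };

function evaluateCanary(
  promptTokens: number,
  cachedTokens: number,
  baselineRatio: number,
  tolerance = 0.1,
): CanaryResult {
  const ratio = promptTokens === 0 ? 0 : cachedTokens / promptTokens;
  // Allow the ratio to dip a little below baseline before failing the rollout.
  const floor = baselineRatio * (1 - tolerance);
  return { pass: ratio >= floor, ratio, floor };
}
```

A deploy script would then exit non-zero when pass is false. The tolerance exists because cache hits are not guaranteed on every request, so a zero-tolerance canary is likely to flake.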

5. Troubleshooting Top 3 and What’s Next

Troubleshooting top 3

  1. cached_tokens stays at zero: your prompt may be under 1,024 tokens, the prefix may differ between calls, or the cache may have expired before reuse.
  2. Hit rate drops after deployment: inspect generated tool schemas, timestamps, randomized examples, or feature flags that moved into the prefix.
  3. One tenant becomes expensive: check whether that tenant has a unique policy version, a noisy prompt template, or traffic too sparse for retention to help.
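
Generated tool schemas are a common source of the second problem, because object key order can shift between builds even when the schema is logically unchanged. One possible mitigation, sketched here with a hypothetical stableStringify helper, is to serialize schemas with sorted keys before they enter a static block.

```typescript
type Json =
  | null
  | boolean
  | number
  | string
  | Json[]
  | { [key: string]: Json };

function stableStringify(value: Json): string {
  // Primitives and null serialize the same way as JSON.stringify.
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  // Arrays keep their order, since order is meaningful there.
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  // Objects are emitted with keys sorted so logically equal schemas match exactly.
  const keys = Object.keys(value).sort();
  const body = keys
    .map((k) => `${JSON.stringify(k)}:${stableStringify(value[k] as Json)}`)
    .join(",");
  return `{${body}}`;
}
```

Two schemas that differ only in key order now produce the same string, so they no longer break the exact prefix match.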

What’s next

Once basic caching works, move cost controls into release engineering. Treat prompt templates like code artifacts: diff them, format snippets consistently with the Code Formatter, tag versions, and run regression tests that compare prompt length, cached-token ratio, and estimated input spend. For larger platforms, add policy-aware tenant cohorts, cache-aware routing dashboards, and automatic alerts when an agent version increases uncached input cost by more than your margin threshold.

Frequently Asked Questions

How does prompt caching reduce AI agent costs?
Prompt caching reuses repeated prompt prefixes so the provider does less repeated input processing. In OpenAI API responses, check usage.prompt_tokens_details.cached_tokens (Chat Completions) or usage.input_tokens_details.cached_tokens (Responses API) to confirm whether repeated tokens were served from cache.
Should multi-tenant agents share the same prompt cache key?
Default to tenant-specific keys. Share a key only when tenants have identical prompt prefixes, data policies, entitlements, and security approval for shared cache locality.
Why are cached tokens still zero after I enabled caching?
The prompt may be shorter than the provider minimum, the prefix may not match exactly, or the cache may have expired. For OpenAI, prompts under 1,024 tokens report cached_tokens as zero.
Does prompt caching change the model answer?
No. Prompt caching should reduce repeated input processing and latency, but the model still generates a fresh output for the request. Your evaluation tests should still compare quality, latency, and cost after every prompt-template change.
