Automatic Prompt Caching
Prompt Caching makes repeated API calls more efficient by reusing context from recent prompts, reducing input token costs and response times. The Prompt Caching option is available for Claude, OpenAI, and Google Gemini models.
Challenges with Current AI Context Handling
Previously, when interacting with an AI model, the entire conversation history had to be sent to the LLM with each new query to maintain the conversation context. Reprocessing that history on every request led to higher latency and higher operational costs, especially in long conversations or complex tasks.
By using the Prompt Caching feature, you can pass some content to the model once, cache the input tokens, and then refer to the cached tokens for subsequent requests.
How Prompt Caching Works
Prompt Caching improves AI efficiency by allowing models like Claude, OpenAI, or Google Gemini to store and reuse stable contexts, such as system instructions or background information. When you send a request with Prompt Caching enabled:
The system checks if the start of your prompt is already cached from a recent query.
If it is, the cached version is used, speeding up responses and lowering costs.
If not, the full prompt is processed, and the prefix is cached for future use.
This is especially useful for recurring queries against large document sets, prompts with many examples, repetitive tasks, and long multi-turn conversations. By reusing the cached prefix, the model only has to process the new part of the request instead of the entire conversation history, cutting latency and cost.
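As a concrete illustration of the mechanism, here is roughly what a cached request looks like when calling Claude's API directly. This is a minimal sketch assuming the anthropic Python SDK; the model ID, file path, and document text are placeholders, and depending on the SDK version a prompt-caching beta flag may still be required.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative only: load a large, stable reference text to reuse across requests.
with open("reference_guide.txt") as f:
    reference_text = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You answer questions about the attached guide.",
        },
        {
            "type": "text",
            "text": reference_text,
            # Mark this block as a cache prefix; later requests that start with
            # the same prefix read it from the cache instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key points of the guide."}],
)

print(response.content[0].text)
```

Sending the same prefix again within the cache lifetime results in a cache hit, so only the new user message is processed at the full input rate.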
Time to Live (TTL) for Cache Storage
For OpenAI: Cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.
For Claude: The cache has a 5-minute lifetime, refreshed each time the cached content is used.
For Gemini: The default TTL is 1 hour.
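When calling the Gemini API directly (rather than through the in-app toggle described later), the cache is created explicitly and the TTL can be set on the cache object. A minimal sketch assuming the google-generativeai Python SDK; the API key, model version, display name, and contents are placeholders.

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

# Create an explicit cache for a large, stable context and keep it for 1 hour.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-flash-001",    # caching requires a pinned model version
    display_name="product-manual",          # illustrative name
    system_instruction="Answer questions using the manual below.",
    contents=["<very long manual text>"],   # placeholder content
    ttl=datetime.timedelta(hours=1),        # matches the default TTL of 1 hour
)

# Subsequent requests reference the cached content instead of resending it.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("What does chapter 3 cover?").text)
```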
Supported Models
With Claude, Prompt Caching is supported on:
Claude 3.5 Sonnet
Claude 3 Haiku
Claude 3 Opus
With OpenAI, Prompt Caching is supported on the latest version of:
GPT-4o
GPT-4o mini
o1-preview
o1-mini
Fine-tuned versions of the above models
With Google Gemini, Context/Prompt Caching is supported on:
Gemini 1.5 Pro
Gemini 1.5 Flash
Why Use Prompt Caching?
For Claude, using Prompt Caching allows you to get up to 85% faster response times for cached prompts and potentially reduce costs by up to 90%. Discounts are as follows:
Claude 3.5 Sonnet: cached input tokens are read at a 90% discount to the base input rate
Claude 3 Opus: cached input tokens are read at a 90% discount to the base input rate
Claude 3 Haiku: cached input tokens are read at an 88% discount to the base input rate
The discount applies to cached input tokens only; output tokens are billed at the normal rate.
💡 Note: While creating the initial cached prompt incurs a 25% higher cost than the standard API rate, subsequent requests using the cached prompt will be up to 90% cheaper than the usual API cost.
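To make these numbers concrete, here is a back-of-the-envelope comparison using the percentages above. The $3 per million input tokens base price for Claude 3.5 Sonnet and the 100,000-token prompt are assumptions for illustration only.

```python
# Illustrative cost comparison for a 100,000-token cached prefix on Claude 3.5 Sonnet.
BASE_INPUT_PRICE = 3.00 / 1_000_000   # assumed base price: $3 per million input tokens
PROMPT_TOKENS = 100_000

normal_cost = PROMPT_TOKENS * BASE_INPUT_PRICE          # $0.30 per uncached request
cache_write = normal_cost * 1.25                        # first request: 25% premium -> $0.375
cache_read = normal_cost * 0.10                         # later requests: 90% discount -> $0.03

print(f"without caching, 10 requests: ${10 * normal_cost:.2f}")
print(f"with caching, 10 requests:    ${cache_write + 9 * cache_read:.2f}")
```

The write premium is paid once, so the savings grow with every additional request that hits the cache within its lifetime.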
Prompt Caching Costs
For Claude:
The minimum cacheable prompt length is 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2048 tokens for Claude 3 Haiku. (Support for caching shorter prompts is coming soon.)
You can set up to 4 cache breakpoints within a prompt.
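For example, one request can mark several stable sections as separate breakpoints: the tool definitions, the system prompt, and the conversation so far. A sketch assuming the anthropic Python SDK; the tool, model ID, and message contents are illustrative placeholders.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model ID
    max_tokens=1024,
    tools=[
        {
            "name": "lookup_order",  # illustrative tool
            "description": "Look up an order by id.",
            "input_schema": {"type": "object", "properties": {"order_id": {"type": "string"}}},
            # Breakpoint 1: caches the prompt up to and including the tool definitions.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": "<long support playbook>",  # placeholder
            # Breakpoint 2: caches the system prompt as well.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<earlier conversation turns>",  # placeholder
                    # Breakpoint 3: caches the conversation so far (up to 4 breakpoints total).
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "Where is order 1234?"},
            ],
        }
    ],
)
```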
For OpenAI:
You can get a 50% discount on input tokens when using cached prompts, with up to 80% reduction in latency.
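Because OpenAI applies the discount automatically, there is nothing to mark in the request; you can confirm caching is taking effect by inspecting the usage details returned with each response. A sketch assuming the openai Python SDK; the model ID and system prompt are placeholders (note that prompt_tokens_details may be absent on older SDK versions).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # Placeholder: OpenAI only caches prompts of 1024+ tokens, so keep the
        # long, stable content at the start of the prompt.
        {"role": "system", "content": "<long, stable instructions>"},
        {"role": "user", "content": "Summarize today's tickets."},
    ],
)

usage = response.usage
# cached_tokens counts prompt tokens served from the cache (billed at the discount).
print("prompt tokens:", usage.prompt_tokens)
print("cached tokens:", usage.prompt_tokens_details.cached_tokens)
```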
For Gemini:
Gemini pricing has three components: regular input/output token costs when the cache is missed, a 75% discount on input tokens that are served from the cache, and an hourly charge for cache storage.
How Can Prompt Caching Be Used?
Prompt Caching is useful for scenarios where you want to send a large prompt context once and refer back to it in subsequent requests. This is especially useful for:
Analyzing long documents: Process and interact with entire books, legal documents, or other extensive texts without slowing down.
Helping in coding: Keep track of large codebases to provide more accurate suggestions, help with debugging, and ensure code consistency.
Setting up hyper-detailed instructions: Allow for the inclusion of numerous examples to improve AI output quality.
Solving complex issues: Address multi-step problems by maintaining a comprehensive understanding of the context throughout the process.
How to Enable Automatic Prompt Caching
For OpenAI models: Prompt Caching is automatically applied on the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini.
For Claude and Gemini models:
Go to Model Settings
Expand the Advanced Model Parameters section
Scroll down to enable the “Prompt Caching” option.
💡 Important Notes:
Avoid using Prompt Caching together with Dynamic Context via API: a system prompt that changes on every request cannot be served from the cache.
Best Practices for Using Prompt Caching
To get the most out of Prompt Caching, consider these best practices:
Place reusable content at the beginning of prompts for better cache efficiency.
Cached prompts that aren't reused within the cache lifetime are evicted automatically; keep usage consistent to avoid repeated cache misses.
Regularly track cache hit rates, latency, and the proportion of cached tokens. Use these insights to fine-tune your caching strategy and maximize performance.
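For Claude, the response's usage object breaks cached traffic out explicitly, which makes hit-rate tracking straightforward. A minimal sketch assuming the anthropic Python SDK's usage fields; the helper function itself is hypothetical.

```python
def cache_hit_rate(usage) -> float:
    """Fraction of prompt tokens served from the cache for one Claude response.

    `usage` is the `response.usage` object returned by the anthropic SDK, which
    exposes cache_creation_input_tokens, cache_read_input_tokens, and input_tokens.
    """
    written = usage.cache_creation_input_tokens   # tokens written to the cache (25% premium)
    read = usage.cache_read_input_tokens          # tokens served from the cache (discounted)
    uncached = usage.input_tokens                 # tokens billed at the normal input rate
    total = written + read + uncached
    return read / total if total else 0.0
```

A persistently low hit rate usually means the cached prefix is changing between requests or the cache is expiring before the next call.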