Understanding Prompt Caching: Speeding Up LLMs and Reducing Costs
Prompt caching can significantly improve the speed and cost-effectiveness of large language models (LLMs). But what exactly is prompt caching?
In this post, I’ll break down:
- What prompt caching is and what it’s not
- How it works under the hood
- What types of prompt content can be cached
- How cache matching works
- When it’s most beneficial
- Practical constraints you need to know
By the end, you’ll understand how prompt caching can push LLM performance and efficiency to the next level.
What Prompt Caching Is Not
Before we define prompt caching, it helps to clarify what it isn’t.
A common misunderstanding is that prompt caching is the same as output caching.
For example:
- A database query returns results that are cached
- A future request with the same query can retrieve that cached result
- The system avoids re-running the query
That’s output caching: storing the result of a call so it can be reused later.
While output caching can be applied to LLM responses, it’s not what prompt caching refers to.
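To make the contrast concrete, here is a minimal sketch of output caching using Python’s built-in lru_cache; the upper-cased string is just a stand-in for the result of a database query or an LLM call.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def answer(question: str) -> str:
    """Output caching: the result of the call is stored, keyed by the exact input."""
    print(f"computing answer for: {question!r}")  # only printed on a cache miss
    return question.upper()                       # stand-in for an expensive call

answer("what is prompt caching?")   # computed and stored
answer("what is prompt caching?")   # returned from the cache; the body never runs
```

Prompt caching stores something different entirely, as the next section explains.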
What Prompt Caching Is
Prompt caching focuses only on the input prompt (more precisely, on part of it) and caches how the model interprets that input.
Here’s the key idea:
When you send a prompt into an LLM, the model doesn’t begin generating output immediately. Instead, it performs an expensive internal computation called key/value (KV) pair generation.
- The model computes KV pairs for every token
- These pairs represent the model’s internal understanding of the text
- This phase often takes more compute than generating even the first output token
Prompt caching saves these computed KV pairs so that the model doesn’t have to recompute them for similar input.
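As a rough picture of what those KV pairs are, here is a toy, single-layer sketch in NumPy. The dimensions and random projection matrices are made up for illustration; real models repeat this computation in every attention head of every layer, at far larger sizes.

```python
import numpy as np

d_model, d_head = 8, 4                       # toy dimensions for illustration only
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d_model, d_head))     # key projection
W_v = rng.normal(size=(d_model, d_head))     # value projection

def compute_kv(token_embeddings):
    """Compute the key/value pairs one attention layer would produce
    for every prompt token (the expensive 'prefill' step)."""
    K = token_embeddings @ W_k               # shape: (num_tokens, d_head)
    V = token_embeddings @ W_v               # shape: (num_tokens, d_head)
    return K, V

# A 6-token "prompt", represented by random embeddings for the sketch.
prompt = rng.normal(size=(6, d_model))
K, V = compute_kv(prompt)                    # these tensors are what a prompt cache stores
```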
How Prompt Caching Works Under the Hood
When an LLM receives a prompt:
- It analyzes each token in every transformer layer
- It computes internal KV pairs for those tokens
- This analysis happens before output generation
With prompt caching:
- KV pairs from a previous prompt are stored
- When a new prompt shares a prefix with a cached one, the model skips recomputing that matching segment
- The model only processes the new portion of the prompt
This means developers can structure prompts so that large static content is cached once — and reused across multiple queries.
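Continuing the toy NumPy setup, here is a conceptual sketch of that skip-the-prefix behaviour: cached K/V rows for the shared prefix are reused, and projections are computed only for the tokens that differ. This is an illustration of the idea, not any provider’s actual implementation.

```python
import numpy as np

d_model, d_head = 8, 4                          # toy sizes for illustration
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d_model, d_head))        # key projection
W_v = rng.normal(size=(d_model, d_head))        # value projection

def embed(tokens):
    # Stand-in for a real embedding layer: one deterministic vector per token id.
    return np.stack([np.sin(np.arange(d_model) * (t + 1)) for t in tokens])

def prefill(tokens, cache=None):
    """Compute K/V for `tokens`, reusing cached rows for any matching prefix."""
    if cache is None:
        cached_tokens, K_c, V_c = [], np.empty((0, d_head)), np.empty((0, d_head))
    else:
        cached_tokens, K_c, V_c = cache

    # Token-by-token prefix match against the cached prompt.
    n = 0
    while n < min(len(tokens), len(cached_tokens)) and tokens[n] == cached_tokens[n]:
        n += 1

    if n < len(tokens):                         # only the unmatched suffix is computed
        new = embed(tokens[n:])
        K = np.vstack([K_c[:n], new @ W_k])
        V = np.vstack([V_c[:n], new @ W_v])
    else:
        K, V = K_c[:n], V_c[:n]

    print(f"{n} of {len(tokens)} tokens served from the cache")
    return list(tokens), K, V

# First request: nothing is cached yet, so every token is processed.
cache = prefill([101, 7, 7, 42, 9])
# Second request shares the first four tokens; only the new tail is processed.
cache = prefill([101, 7, 7, 42, 13, 55], cache=cache)
```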
What Can Be Cached?
Prompt caching typically applies to static or semi-static parts of prompts:
Common Examples
✅ System prompts: instructions that define agent behavior
✅ Large documents: manuals, research papers, contracts
✅ Few-shot examples: demonstration examples for output formatting
✅ Static context blocks: any repeated context that remains unchanged
In contrast, dynamic portions (like a user’s question) usually come after these static elements and are not cached.
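In practice, this ordering is expressed directly in the request. Below is a minimal sketch using Anthropic’s Messages API, which lets you mark the end of the cacheable prefix with cache_control (other providers, such as OpenAI, cache shared prefixes automatically). The file name and question are made up for the example.

```python
import anthropic

client = anthropic.Anthropic()                  # reads ANTHROPIC_API_KEY from the environment

manual = open("product_manual.txt").read()      # hypothetical large, static document

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "You are a support agent. Answer only from the manual below.\n\n" + manual,
            # Everything up to and including this block is eligible for caching.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # The dynamic part, the user's question, comes last and is not cached.
    messages=[{"role": "user", "content": "How do I reset the device?"}],
)
print(response.content[0].text)
```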
How the Cache Decides What to Reuse
Prompt caching systems use prefix matching.
- The system matches prompts from the beginning, token by token
- As long as new input matches the cached prompt, those KV pairs are reused
- Once there’s a difference, caching stops and standard processing resumes
This means prompt structure matters; static parts should come first.
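A quick way to see why ordering matters is to compare how long the shared prefix is between two consecutive requests. The token ids below are made up; the point is only that putting the changing part first destroys the match.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

manual = list(range(1000))     # stands in for ~1000 tokens of static document text

# Static document first, short question last: almost everything is reusable.
print(shared_prefix_len(manual + [7, 8, 9], manual + [4, 5, 6]))    # -> 1000

# Question first, document last: the prefix differs immediately, nothing is reused.
print(shared_prefix_len([7, 8, 9] + manual, [4, 5, 6] + manual))    # -> 0
```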
