LLM calls are expensive and slow, but here’s the thing - users ask the same questions in different ways all the time. “What’s your refund policy?” and “How do I get my money back?” are different strings but the same question. Without semantic caching, you’re paying full price to answer identical questions over and over again.
I spent a weekend building a semantic cache that matches queries by meaning, not exact text - using only AWS-native services. S3 Vectors for similarity search, Bedrock for embeddings and the LLM, Lambda for compute. Fully serverless, no external dependencies. The result? Cache hits come back roughly 10x faster and cost a fraction of a full LLM call.

The Problem
Every call to Amazon Bedrock costs money and takes 1-3 seconds. Yet a large share of queries are semantically identical to ones you’ve already answered, so you keep paying full price for answers you already have.
The Solution
Instead of matching exact strings, I used vector embeddings to match meaning. When a new query comes in:
- Convert the query into a vector embedding (using Titan V2)
- Search for similar queries in the cache (using S3 Vectors)
- If the top match’s similarity is at least 0.85, return the cached response
- Otherwise, call the LLM and cache the result for next time
Simple concept. The trick was making it work with AWS-native services only.
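Here’s a rough sketch of that flow as a Lambda handler. Treat the model IDs, bucket/index names, and the exact shape of the S3 Vectors requests as illustrative assumptions, not the repo’s actual code - the repo is the source of truth.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")
vectors = boto3.client("s3vectors")  # S3 Vectors client

VECTOR_BUCKET = "semantic-cache-bucket"            # hypothetical name
VECTOR_INDEX = "query-cache-index"                 # hypothetical name
EMBED_MODEL = "amazon.titan-embed-text-v2:0"
LLM_MODEL = "anthropic.claude-haiku-..."           # substitute the Claude Haiku model ID enabled in your account
SIMILARITY_THRESHOLD = 0.85


def embed(text: str) -> list[float]:
    # Titan V2 returns a 1024-dim embedding by default
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]


def handler(event, context):
    query = json.loads(event["body"])["query"]
    vector = embed(query)

    # 1. Look for a semantically similar query already in the cache
    result = vectors.query_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        queryVector={"float32": vector},
        topK=1,
        returnMetadata=True,
        returnDistance=True,
    )
    hits = result.get("vectors", [])

    if hits:
        # For a cosine index, similarity = 1 - distance
        similarity = 1.0 - hits[0]["distance"]
        if similarity >= SIMILARITY_THRESHOLD:
            return {"statusCode": 200,
                    "body": json.dumps({"answer": hits[0]["metadata"]["response"],
                                        "cached": True})}

    # 2. Cache miss: call the LLM, then store the query vector + response for next time
    llm = bedrock.converse(
        modelId=LLM_MODEL,
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    answer = llm["output"]["message"]["content"][0]["text"]

    vectors.put_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        vectors=[{"key": query[:256],
                  "data": {"float32": vector},
                  "metadata": {"response": answer}}],
    )

    return {"statusCode": 200,
            "body": json.dumps({"answer": answer, "cached": False})}
```

On a hit, the only cost is one Titan embedding call plus the S3 Vectors query - which is exactly where the speed and cost savings come from.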
The Tech Stack
| Component | Service |
|---|---|
| Vector Storage | Amazon S3 Vectors |
| Embeddings | Bedrock Titan V2 |
| LLM | Bedrock Claude Haiku 4.5 |
| Compute | Lambda |
| API | API Gateway HTTP API |
The best part? It’s fully serverless. No baseline costs. Pay only for what you use.
Results
After running some tests:
- Cache hits are ~10x faster than calling the LLM
- Semantic matching works - “capital of France” matches “France’s capital city”
- Graceful degradation - if the cache fails, it falls back to the LLM
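That fallback is essentially a try/except around the cache path. A minimal sketch, assuming helper functions that wrap the Bedrock and S3 Vectors calls shown earlier (`lookup_cache`, `call_llm`, and `store_in_cache` are my names, not the repo’s):

```python
def answer_query(query: str) -> dict:
    """Try the semantic cache first; fall back to the LLM if the cache path fails."""
    vector = None
    try:
        vector = embed(query)
        cached = lookup_cache(vector)       # hypothetical helper wrapping query_vectors
        if cached is not None:
            return {"answer": cached, "cached": True}
    except Exception as exc:
        print(f"cache lookup failed, falling back to LLM: {exc}")

    answer = call_llm(query)                # hypothetical helper wrapping converse()

    if vector is not None:
        try:
            store_in_cache(vector, answer)  # best-effort write; failures don't block the response
        except Exception as exc:
            print(f"cache write failed: {exc}")

    return {"answer": answer, "cached": False}
```

The key design choice: a cache failure never surfaces to the user - the worst case is paying for the LLM call you would have made anyway.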
What I Learned
- S3 Vectors is underrated - native similarity search without managing infrastructure
- Serverless means fast startup - requests start processing in ~300ms
- Similarity threshold matters - 0.85 worked well to avoid false matches while still catching rephrased questions
Try It Yourself
The complete code is available on GitHub. One-click deploy, one-click cleanup.
GitHub: github.com/sprider/semantic-cache-demo
The repo includes:
- Full infrastructure as code (ready to deploy)
- 71 unit tests
- One-click deploy and cleanup scripts
- Architecture diagrams
Fair warning: it creates AWS resources that cost money. But the scripts make cleanup easy, and a few hours of testing costs less than a dollar.
What’s Next?
This is a demo, not production-ready code. For real use, you’d want:
- API authentication
- Cache invalidation strategy
- Multi-region deployment
- Better observability
But as a proof of concept? It works. And it’s a pattern worth knowing.
Questions? Found a bug? Open an issue on the repo. Happy to chat about semantic caching, AWS architecture, or why vector databases are the future.