LLM calls are expensive and slow, but here’s the thing - users ask the same questions in different ways all the time. “What’s your refund policy?” and “How do I get my money back?” are different strings but the same question. Without semantic caching, you’re paying full price to answer identical questions over and over again.

I spent a weekend building a semantic cache that matches queries by meaning, not exact text - using only AWS-native services: S3 Vectors for similarity search, Bedrock for embeddings and the LLM, Lambda for compute. Fully serverless, no external dependencies. The result? Cache hits that are roughly 10x faster and cost almost nothing compared to calling the LLM.

[Architecture diagram: Semantic Cache Architecture]

The Problem

Every call to Amazon Bedrock costs money and takes 1-3 seconds. Yet a huge chunk of queries are semantically identical to ones you’ve already answered, so you end up paying full price to answer the same question over and over.

The Solution

Instead of matching exact strings, I used vector embeddings to match meaning. When a new query comes in:

  1. Convert the query into a vector embedding (using Titan V2)
  2. Search for similar queries in the cache (using S3 Vectors)
  3. If similarity is above 85%, return the cached response
  4. Otherwise, call the LLM and cache the result for next time

Simple concept. The trick was making it work with AWS-native services only.
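
To make that concrete, here’s a minimal sketch of the lookup path, assuming a Python Lambda with boto3 and a cosine-distance S3 Vectors index. The bucket, index, and model IDs are placeholders, and the s3vectors parameter names follow the boto3 client as I understand it - treat this as an illustration, not the repo’s exact code:

```python
import json
import boto3

# Placeholder names - swap in your own bucket, index, and model IDs.
VECTOR_BUCKET = "semantic-cache-vectors"   # hypothetical S3 Vectors bucket
VECTOR_INDEX = "query-cache"               # hypothetical vector index
EMBED_MODEL = "amazon.titan-embed-text-v2:0"
LLM_MODEL = "anthropic.claude-3-5-haiku-20241022-v1:0"  # stand-in Haiku model ID
SIMILARITY_THRESHOLD = 0.85

bedrock = boto3.client("bedrock-runtime")
vectors = boto3.client("s3vectors")  # assumes the S3 Vectors boto3 client is available

def embed(text: str) -> list[float]:
    """Turn a query into a Titan V2 embedding."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def answer(query: str) -> dict:
    embedding = embed(query)

    # Steps 1-2: look for a semantically similar query already in the cache.
    try:
        hits = vectors.query_vectors(
            vectorBucketName=VECTOR_BUCKET,
            indexName=VECTOR_INDEX,
            queryVector={"float32": embedding},
            topK=1,
            returnMetadata=True,
            returnDistance=True,
        )["vectors"]
    except Exception:
        hits = []  # cache failure: fall through to the LLM

    # Step 3: close enough? Return the cached response.
    if hits:
        similarity = 1.0 - hits[0]["distance"]  # assuming a cosine-distance index
        if similarity >= SIMILARITY_THRESHOLD:
            return {"answer": hits[0]["metadata"]["response"], "cached": True}

    # Step 4: cache miss - call the LLM and store the result for next time.
    llm = bedrock.converse(
        modelId=LLM_MODEL,
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    text = llm["output"]["message"]["content"][0]["text"]

    vectors.put_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        vectors=[{
            "key": query[:512],  # simple key; a hash of the query would be safer
            "data": {"float32": embedding},
            "metadata": {"query": query, "response": text},
        }],
    )
    return {"answer": text, "cached": False}
```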

The Tech Stack

  • Vector storage - Amazon S3 Vectors
  • Embeddings - Bedrock Titan V2
  • LLM - Bedrock Claude Haiku 4.5
  • Compute - Lambda
  • API - API Gateway HTTP API

The best part? It’s fully serverless. No baseline costs. Pay only for what you use.
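
There isn’t much glue, either. A hypothetical Lambda handler behind the HTTP API looks roughly like this, where answer() is the cache-lookup function sketched above and the event shape follows the API Gateway HTTP API payload format 2.0:

```python
import json

def handler(event, context):
    """Entry point invoked by the API Gateway HTTP API (payload format 2.0)."""
    body = json.loads(event.get("body") or "{}")
    query = body.get("query", "").strip()

    if not query:
        return {"statusCode": 400, "body": json.dumps({"error": "query is required"})}

    result = answer(query)  # semantic-cache lookup from the sketch above
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```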

Results

After running some tests:

  • Cache hits are ~10x faster than calling the LLM
  • Semantic matching works - “capital of France” matches “France’s capital city”
  • Graceful degradation - if the cache fails, it falls back to the LLM

What I Learned

  1. S3 Vectors is underrated - native similarity search without managing infrastructure
  2. Serverless means fast startup - requests start processing in ~300ms
  3. Similarity threshold matters - 0.85 worked well to avoid false matches while still catching rephrased questions
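
For intuition on where 0.85 sits, a quick way to eyeball it is to compare the cosine similarity of a paraphrase against an unrelated query, reusing the embed() helper from the earlier sketch (Titan V2 returns normalized vectors by default, so the explicit normalization here is mostly a safety net):

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Paraphrases should land above the threshold, unrelated queries well below it.
print(cosine_similarity(embed("What is the capital of France?"),
                        embed("France's capital city?")))        # expect above 0.85
print(cosine_similarity(embed("What is the capital of France?"),
                        embed("How do I reset my password?")))   # expect well below 0.85
```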

Try It Yourself

The complete code is available on GitHub. One-click deploy, one-click cleanup.

GitHub: github.com/sprider/semantic-cache-demo

The repo includes:

  • Full infrastructure as code (ready to deploy)
  • 71 unit tests
  • One-click deploy and cleanup scripts
  • Architecture diagrams

Fair warning: it creates AWS resources that cost money. But the scripts make cleanup easy, and a few hours of testing costs less than a dollar.

What’s Next?

This is a demo, not production-ready code. For real use, you’d want:

  • API authentication
  • Cache invalidation strategy (one simple option is sketched after this list)
  • Multi-region deployment
  • Better observability
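
On the invalidation point, one simple approach (not in the demo) would be to stamp each cached entry with an expiry time in its vector metadata and treat stale hits as misses. A rough sketch, reusing the assumed metadata layout from the earlier code:

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # e.g. expire cached answers after a day

def is_fresh(metadata: dict) -> bool:
    """Treat entries past their TTL as cache misses."""
    return time.time() < float(metadata.get("expires_at", 0))

# On write: add "expires_at": str(time.time() + CACHE_TTL_SECONDS) to the
# metadata dict passed to put_vectors.
# On read: require both similarity >= threshold and is_fresh(hit["metadata"])
# before returning the cached response.
```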

But as a proof of concept? It works. And it’s a pattern worth knowing.

Questions? Found a bug? Open an issue on the repo. Happy to chat about semantic caching, AWS architecture, or why vector databases are the future.