LLM calls are expensive and slow, but here’s the thing - users ask the same questions in different ways all the time. “What’s your refund policy?” and “How do I get my money back?” are different strings but the same question. Without semantic caching, you’re paying full price to answer identical questions over and over again.

I spent a weekend building a semantic cache that matches queries by meaning, not exact text - using only AWS-native services: S3 Vectors for similarity search, Bedrock for embeddings and the LLM, Lambda for compute. Fully serverless, no external dependencies. The result? Cache hits that are roughly 10x faster and cost almost nothing compared to calling the LLM.

[Architecture diagram: Semantic Cache Architecture]

The Problem

Every call to Amazon Bedrock costs money and takes 1-3 seconds. Yet a huge chunk of queries are semantically identical to ones you’ve already answered, so you end up paying full price to answer the same question over and over.

The Solution

Instead of matching exact strings, I used vector embeddings to match meaning. When a new query comes in:

  1. Convert the query into a vector embedding (using Titan V2)
  2. Search for similar queries in the cache (using S3 Vectors)
  3. If similarity is above 85%, return the cached response
  4. Otherwise, call the LLM and cache the result for next time

Simple concept. The trick was making it work with AWS-native services only.
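
To make that concrete, here’s a minimal sketch of the lookup path, assuming a Python Lambda with boto3 and a cosine-distance S3 Vectors index. The bucket, index, and model IDs are placeholders, and the s3vectors parameter names follow the boto3 client as I understand it - treat this as an illustration, not the repo’s exact code:

```python
import json
import boto3

# Placeholder names - swap in your own bucket, index, and model IDs.
VECTOR_BUCKET = "semantic-cache-vectors"   # hypothetical S3 Vectors bucket
VECTOR_INDEX = "query-cache"               # hypothetical vector index
EMBED_MODEL = "amazon.titan-embed-text-v2:0"
LLM_MODEL = "anthropic.claude-3-5-haiku-20241022-v1:0"  # stand-in Haiku model ID
SIMILARITY_THRESHOLD = 0.85

bedrock = boto3.client("bedrock-runtime")
vectors = boto3.client("s3vectors")  # assumes the S3 Vectors boto3 client is available

def embed(text: str) -> list[float]:
    """Turn a query into a Titan V2 embedding."""
    resp = bedrock.invoke_model(
        modelId=EMBED_MODEL,
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def answer(query: str) -> dict:
    embedding = embed(query)

    # Steps 1-2: look for a semantically similar query already in the cache.
    try:
        hits = vectors.query_vectors(
            vectorBucketName=VECTOR_BUCKET,
            indexName=VECTOR_INDEX,
            queryVector={"float32": embedding},
            topK=1,
            returnMetadata=True,
            returnDistance=True,
        )["vectors"]
    except Exception:
        hits = []  # cache failure: fall through to the LLM

    # Step 3: close enough? Return the cached response.
    if hits:
        similarity = 1.0 - hits[0]["distance"]  # assuming a cosine-distance index
        if similarity >= SIMILARITY_THRESHOLD:
            return {"answer": hits[0]["metadata"]["response"], "cached": True}

    # Step 4: cache miss - call the LLM and store the result for next time.
    llm = bedrock.converse(
        modelId=LLM_MODEL,
        messages=[{"role": "user", "content": [{"text": query}]}],
    )
    text = llm["output"]["message"]["content"][0]["text"]

    vectors.put_vectors(
        vectorBucketName=VECTOR_BUCKET,
        indexName=VECTOR_INDEX,
        vectors=[{
            "key": query[:512],  # simple key; a hash of the query would be safer
            "data": {"float32": embedding},
            "metadata": {"query": query, "response": text},
        }],
    )
    return {"answer": text, "cached": False}
```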

The Tech Stack

  • Vector storage - Amazon S3 Vectors
  • Embeddings - Bedrock Titan V2
  • LLM - Bedrock Claude Haiku 4.5
  • Compute - Lambda
  • API - API Gateway HTTP API

The best part? It’s fully serverless. No baseline costs. Pay only for what you use.
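
There isn’t much glue, either. A hypothetical Lambda handler behind the HTTP API looks roughly like this, where answer() is the cache-lookup function sketched above and the event shape follows the API Gateway HTTP API payload format 2.0:

```python
import json

def handler(event, context):
    """Entry point invoked by the API Gateway HTTP API (payload format 2.0)."""
    body = json.loads(event.get("body") or "{}")
    query = body.get("query", "").strip()

    if not query:
        return {"statusCode": 400, "body": json.dumps({"error": "query is required"})}

    result = answer(query)  # semantic-cache lookup from the sketch above
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(result),
    }
```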

Results

After running some tests:

  • Cache hits are ~10x faster than calling the LLM
  • Semantic matching works - “capital of France” matches “France’s capital city”
  • Graceful degradation - if the cache fails, it falls back to the LLM

What I Learned

  1. S3 Vectors is underrated - native similarity search without managing infrastructure
  2. Serverless means fast startup - requests start processing in ~300ms
  3. Similarity threshold matters - 0.85 worked well to avoid false matches while still catching rephrased questions
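
For intuition on where 0.85 sits, a quick way to eyeball it is to compare the cosine similarity of a paraphrase against an unrelated query, reusing the embed() helper from the earlier sketch (Titan V2 returns normalized vectors by default, so the explicit normalization here is mostly a safety net):

```python
def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Paraphrases should land above the threshold, unrelated queries well below it.
print(cosine_similarity(embed("What is the capital of France?"),
                        embed("France's capital city?")))        # expect above 0.85
print(cosine_similarity(embed("What is the capital of France?"),
                        embed("How do I reset my password?")))   # expect well below 0.85
```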

Try It Yourself

The complete code is available on GitHub. One-click deploy, one-click cleanup.

GitHub: github.com/sprider/semantic-cache-demo

The repo includes:

  • Full infrastructure as code (ready to deploy)
  • 71 unit tests
  • One-click deploy and cleanup scripts
  • Architecture diagrams

Fair warning: it creates AWS resources that cost money. But the scripts make cleanup easy, and a few hours of testing costs less than a dollar.

What’s Next?

This is a demo, not production-ready code. For real use, you’d want:

  • API authentication
  • Cache invalidation strategy (one simple option is sketched after this list)
  • Multi-region deployment
  • Better observability
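
On the invalidation point, one simple approach (not in the demo) would be to stamp each cached entry with an expiry time in its vector metadata and treat stale hits as misses. A rough sketch, reusing the assumed metadata layout from the earlier code:

```python
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # e.g. expire cached answers after a day

def is_fresh(metadata: dict) -> bool:
    """Treat entries past their TTL as cache misses."""
    return time.time() < float(metadata.get("expires_at", 0))

# On write: add "expires_at": str(time.time() + CACHE_TTL_SECONDS) to the
# metadata dict passed to put_vectors.
# On read: require both similarity >= threshold and is_fresh(hit["metadata"])
# before returning the cached response.
```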

But as a proof of concept? It works. And it’s a pattern worth knowing.

Questions? Found a bug? Open an issue on the repo. Happy to chat about semantic caching, AWS architecture, or why vector databases are the future.