Our LLM API bill hit $23,000 in January. Not because we had 10x more users. Not because we launched a new feature. Because our customer support bot was answering "how do I reset my password?" 4,200 times a month — and every single time, GPT-4o composed a brand-new answer from scratch. Same question. Same context. Same answer. $3.80 per thousand calls, multiplied by sheer repetition, multiplied by "nobody thought to cache this."
By March, we'd cut that bill to $8,600. Semantic caching, model routing, and provider-level prompt caching — three layers that, combined, slashed our spend by $14,400/month. This is how we did it, what broke along the way, and how you can do the same thing this week.
The LLM Cost Problem Nobody Prepared For
LLM API costs have become one of the fastest-growing line items for engineering teams. Worldwide spending on generative AI hit $644 billion in 2025, a 76.4% jump from 2024. The enterprise LLM market alone is projected to reach $71.1 billion by 2034. And unlike cloud compute, AI spend often stays invisible until the bill arrives.
Here's what the pricing looks like right now. GPT-5.2 costs $1.75 input / $14 output per million tokens. Claude Sonnet 4.6 runs $3 input / $15 output per million tokens. Even "cheap" models like GPT-5 mini at $0.25/$2 add up fast when you're processing hundreds of thousands of requests daily.
The math is brutal. A customer support bot handling 100,000 queries per month at an average of 500 input tokens and 300 output tokens costs roughly $2,100/month on GPT-4o. Scale that to 500,000 queries — which isn't unusual for a mid-size SaaS — and you're looking at $10,500/month on a single endpoint.
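If you want to sanity-check numbers like these against your own traffic, the arithmetic fits in a few lines. The rates below are placeholders, not current list prices; substitute whatever your provider charges today:

```python
def monthly_cost(queries: int, in_tokens: int, out_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Estimate monthly spend. Prices are $ per million tokens --
    plug in your provider's current rates, which change often."""
    input_cost = queries * in_tokens * price_in_per_m / 1_000_000
    output_cost = queries * out_tokens * price_out_per_m / 1_000_000
    return input_cost + output_cost

# 100k queries/month, 500 in / 300 out tokens, at illustrative $2.50/$10 rates
print(round(monthly_cost(100_000, 500, 300, 2.50, 10.00), 2))  # 425.0
```

Run it with your real query volume and token averages before and after each optimization layer; the deltas are your business case.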
But here's what most articles about LLM costs miss: 30-40% of LLM requests are semantically similar to previous ones. Your users aren't asking unique questions. They're asking the same 200 questions phrased 15,000 different ways. That's not a cost problem. That's a caching opportunity.
Layer 1: Provider-Level Prompt Caching (The Free Win)
Before you build anything, grab the savings your provider is already offering.
Both OpenAI and Anthropic now offer native prompt caching. The concept is simple: when your API calls share the same prefix (like a system prompt), the provider caches the computed attention key/value states for that prefix and reuses them on subsequent calls. You pay 10% of the normal input token price for cached tokens — a 90% discount on the cached portion.
OpenAI's approach is zero-code. It's automatic for prompts over 1,024 tokens, with cache hits in 128-token increments. You literally change nothing and get cheaper calls when your system prompt stays consistent. Latency drops up to 80% on the cached prefix.
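You can confirm you're actually getting cache hits by inspecting the usage block on each response, which reports how many prompt tokens were served from cache. The payload below is a hand-written example in the shape OpenAI documents, not a real API response:

```python
# Illustrative usage payload in the shape OpenAI returns on chat completions;
# the numbers are made up for this example.
usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "prompt_tokens_details": {"cached_tokens": 1920},
}

cached = usage["prompt_tokens_details"]["cached_tokens"]
hit_ratio = cached / usage["prompt_tokens"]
print(f"{cached} of {usage['prompt_tokens']} prompt tokens came from cache ({hit_ratio:.0%})")
```

If cached_tokens stays at zero, your prompt prefix is probably changing between calls (dynamic timestamps in the system prompt are a common culprit) or the prompt is under the 1,024-token minimum.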
Anthropic's approach gives you more control but requires code changes. You add a cache_control parameter to specific message blocks, with up to four breakpoints per prompt:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",  # 2,000-token system prompt
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}]
)
```
The first call caches the system prompt. Every subsequent call with the same prefix pays 10% of the input cost for those cached tokens. For a 2,000-token system prompt called 100,000 times per month, that's the difference between $350 and $35.
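The arithmetic behind that claim, assuming the $1.75-per-million input rate quoted earlier:

```python
tokens = 2_000 * 100_000     # 2,000-token prompt x 100k calls = 200M tokens
price_per_m = 1.75           # $ per 1M input tokens (rate quoted above)

full = tokens / 1_000_000 * price_per_m
cached = full * 0.10         # cached tokens bill at 10% of the input price
print(full, cached)          # 350.0 35.0
```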
What most articles get wrong: They treat prompt caching and semantic caching as either/or. They're not. They're complementary layers. Prompt caching saves you money on the shared prefix of every call. Semantic caching prevents the call entirely when the answer is already known. Use both.
Layer 2: Semantic Caching (The Big Win)
This is where the real savings live. Semantic caching works by converting queries into vector embeddings and comparing new queries against cached ones using cosine similarity. If a new query is semantically close enough to a cached one — "how do I change my password" matches "reset password steps" — you return the stored response without ever calling the LLM.
The numbers are compelling. Redis LangCache achieves up to 73% cost reduction in high-repetition workloads. At a 60% cache hit rate — typical for internal knowledge bases and FAQ systems — you're cutting 60% off your LLM bill. And response time drops from 1-5 seconds to 5-20 milliseconds on cache hits.
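Under the hood this is just a distance check between embedding vectors. A toy sketch with 3-dimensional vectors — real embeddings have hundreds of dimensions, and production systems use a vector store rather than raw numpy:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" for illustration only
cached_query = np.array([0.9, 0.1, 0.2])    # "how do I change my password"
new_query    = np.array([0.85, 0.15, 0.25]) # "reset password steps"
unrelated    = np.array([0.1, 0.9, 0.1])    # "what are your business hours"

THRESHOLD = 0.15  # below this distance, treat queries as the same

print(cosine_distance(cached_query, new_query) < THRESHOLD)   # True: serve from cache
print(cosine_distance(cached_query, unrelated) < THRESHOLD)   # False: call the LLM
```

Everything downstream — thresholds, false positives, namespacing — is about where you draw that THRESHOLD line.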
How We Built Ours
We started with GPTCache, the most mature open-source implementation. It's Python-native, integrates with LangChain, and supports multiple vector stores and embedding models. The basic setup takes about 30 minutes:
```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Initialize embedding model (runs locally, no API calls)
onnx = Onnx()

# SQLite for metadata, FAISS for vectors
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension)
)

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

# This call hits the LLM
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# This call hits the cache (semantically similar)
response = openai.ChatCompletion.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "I need to change my password"}],
)
```
For production, we moved to Redis with vector search. Redis gives you sub-millisecond p95 latency on vector lookups and handles the scale that SQLite + FAISS can't:
```python
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="support_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.15,  # lower = stricter matching
    ttl=86400,                # 24-hour expiry
)

def answer(prompt: str) -> str:
    # Check the cache before calling the LLM
    result = cache.check(
        prompt=prompt,
        return_fields=["response", "metadata"],
    )
    if result:
        return result[0]["response"]

    # Cache miss: call the LLM, then store the response for next time
    llm_response = call_llm(prompt)
    cache.store(
        prompt=prompt,
        response=llm_response,
        metadata={"model": "gpt-4o", "timestamp": "2026-04-01"},
    )
    return llm_response
```
The Similarity Threshold Problem
This is where most implementations go wrong. The similarity threshold — the cosine distance below which two queries are considered "the same" — is the single most important parameter in your system. And there's no universal right answer.
| Threshold | Hit Rate | False Positive Rate | Best For |
|---|---|---|---|
| 0.05 (very strict) | 15-25% | Under 1% | Legal, medical, financial |
| 0.10 (strict) | 30-45% | 2-5% | General customer support |
| 0.15 (moderate) | 45-65% | 5-10% | FAQ, documentation queries |
| 0.25 (loose) | 60-80% | 15-25% | Internal tools, low-risk |
We started at 0.15 and quickly discovered a nasty edge case. "How do I cancel my subscription?" and "How do I cancel my flight?" have high semantic similarity — both are "how do I cancel X" — but completely different answers. At a 0.15 threshold, we were serving subscription cancellation instructions to users asking about flights.
A banking case study on InfoQ documented the same problem: their financial services FAQ system initially gave users "confident but completely wrong answers" due to false positive cache hits.
The fix was threefold:
- Namespace your caches. Don't use one global cache. Separate by domain, endpoint, or product. Our support bot has separate caches for billing, technical, and account queries.
- Add metadata filters. Cache keys should include the model name, temperature, and user context. Without this, you'll serve wrong or leaked responses across users.
- Set aggressive TTLs. We use 24-hour TTLs for most caches, 1-hour for anything involving pricing or policy. Some responses should never be cached — real-time data, stock prices, anything where staleness causes harm.
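Here's a minimal sketch of what those three fixes look like as configuration. The helper names and TTL values are ours, chosen to match the setup described above, not part of any library:

```python
# Illustrative cache-config helper implementing the three fixes:
# per-domain namespaces, model/temperature in the cache name, aggressive TTLs.
TTL_BY_DOMAIN = {
    "billing": 3600,     # pricing/policy can go stale fast: 1-hour TTL
    "technical": 86400,  # stable how-tos: 24 hours
    "account": 86400,
}

def cache_config(domain: str, model: str, temperature: float) -> dict:
    """One namespace per domain + model + sampling config, so billing answers
    never leak into technical queries and model changes never collide."""
    return {
        "name": f"support:{domain}:{model}:t{temperature}",
        "ttl": TTL_BY_DOMAIN[domain],
    }

print(cache_config("billing", "gpt-4o", 0.0))
```

The resulting name and ttl can be fed straight into whatever cache constructor you use; the point is that no two domains, models, or temperature settings ever share a namespace.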
What to Expect: Hit Rates by Category
Not all workloads benefit equally. Research on category-aware caching found massive variation:
| Workload Type | Typical Hit Rate | Savings Potential |
|---|---|---|
| FAQ / documentation | 60-85% | Very high |
| Customer support | 40-60% | High |
| Code generation | 40-60% | High |
| Internal knowledge base | 50-70% | High |
| Creative writing | 5-15% | Low |
| Open-ended conversation | 5-15% | Low |
| Real-time data queries | Under 5% | Negligible |
If your LLM app is primarily conversational with unique queries every time, semantic caching won't save you much. If it's FAQ-heavy, support-heavy, or documentation-heavy — which most enterprise apps are — you're leaving money on the table without it.
Layer 3: Model Routing (The Smart Win)
Here's a question most teams never ask: does every query need your most expensive model?
The answer is almost always no. Research from UC Berkeley and Canva shows that intelligent routing delivers 85% cost reduction while maintaining 95% of GPT-4 performance. In 2026, 37% of enterprises use 5+ models in production.
The pattern is cascade routing: start with the cheapest model, check if the output meets your quality threshold, and escalate to more expensive models only when needed.
```python
from litellm import completion

MODELS = [
    # cost_per_1m = input $ per 1M tokens (check current pricing)
    {"model": "gpt-4o-mini", "cost_per_1m": 0.15, "max_complexity": "simple"},
    {"model": "gpt-4o", "cost_per_1m": 2.50, "max_complexity": "medium"},
    {"model": "claude-sonnet-4-20250514", "cost_per_1m": 3.00, "max_complexity": "complex"},
]

_RANKS = {"simple": 0, "medium": 1, "complex": 2}

def complexity_rank(level: str) -> int:
    return _RANKS[level]

def classify_complexity(query: str) -> str:
    """Lightweight classifier - runs locally, no API call."""
    word_count = len(query.split())
    has_code = any(kw in query.lower() for kw in ["code", "function", "debug", "error"])
    has_analysis = any(kw in query.lower() for kw in ["analyze", "compare", "explain why"])
    if word_count > 200 or has_analysis:
        return "complex"
    elif has_code or word_count > 50:
        return "medium"
    return "simple"

def route_query(query: str, system_prompt: str) -> str:
    complexity = classify_complexity(query)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]
    # Walk the list cheapest-first; use the first model rated for this complexity
    for model_config in MODELS:
        if complexity_rank(complexity) <= complexity_rank(model_config["max_complexity"]):
            response = completion(model=model_config["model"], messages=messages)
            return response.choices[0].message.content
    # Fallback to the most capable model
    response = completion(model="claude-sonnet-4-20250514", messages=messages)
    return response.choices[0].message.content
```
Our results: 68% of queries routed to GPT-4o-mini, 24% to GPT-4o, and 8% to Claude Sonnet. The blended cost per query dropped from $0.0038 to $0.0012 — a 68% reduction on top of the caching savings.
LiteLLM is the tool we use for this. It's an open-source proxy that provides a unified interface across 100+ LLM providers with built-in cost tracking and budgeting. For more robust observability, Portkey and Helicone add dashboards showing cost per model, latency percentiles, and usage patterns.
The Full Stack: How the Three Layers Combine
Here's how a single request flows through our system:
- Query arrives → Check semantic cache (5ms)
- Cache hit? → Return cached response. Done. Cost: $0.
- Cache miss? → Classify query complexity (1ms)
- Route to cheapest viable model → Provider-level prompt caching kicks in automatically on the system prompt (90% discount on cached tokens)
- Get response → Store in semantic cache with metadata and TTL
- Return response → Log cost to gateway for tracking
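The whole flow can be sketched as one function with the three layers injected as dependencies. Everything here is a stub to show the control flow, not our production code:

```python
from typing import Callable, Optional

def handle_request(
    query: str,
    cache_check: Callable[[str], Optional[str]],
    cache_store: Callable[[str, str], None],
    classify: Callable[[str], str],
    call_model: Callable[[str, str], str],
) -> str:
    """Sketch of the three-layer flow; all dependencies are injected stubs."""
    cached = cache_check(query)          # Layer 2: semantic cache first
    if cached is not None:
        return cached                    # hit: $0, ~5ms
    tier = classify(query)               # Layer 3: pick cheapest viable model
    response = call_model(tier, query)   # Layer 1: provider prompt caching
                                         # applies inside this call automatically
    cache_store(query, response)         # store for the next similar query
    return response

# Minimal in-memory stubs to demonstrate the flow
store: dict[str, str] = {}
result = handle_request(
    "How do I reset my password?",
    cache_check=store.get,
    cache_store=store.__setitem__,
    classify=lambda q: "simple",
    call_model=lambda tier, q: f"[{tier}] Go to Settings > Security.",
)
print(result)
```

The first call falls through to call_model and populates the store; an identical second call returns from the cache without touching the model at all.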
The combined impact on our $23,000/month bill:
| Layer | Mechanism | Monthly Savings |
|---|---|---|
| Prompt caching | 90% discount on system prompt tokens | $1,800 |
| Semantic caching | 52% cache hit rate | $7,200 |
| Model routing | 68% queries to cheaper models | $5,400 |
| Total | | $14,400 |
New monthly bill: $8,600. That's a 62.6% reduction.
The Gotchas That Will Bite You
I'm not going to pretend this was smooth. Here's what went wrong.
Gotcha 1: Embedding model choice matters more than you think. We started with OpenAI's text-embedding-3-small because it was convenient. Problem: every cache check required an API call to OpenAI, which added latency and cost. We switched to ONNX-based local embeddings (built into GPTCache). Slightly lower accuracy, but zero marginal cost and 3ms instead of 150ms per embedding.
Gotcha 2: Cache poisoning is real. A bug in our first week cached a response from a test query with a joke system prompt. "What's your return policy?" returned "I'm a teapot, I don't do returns" for 847 users over 6 hours before someone noticed. Always validate cached responses before storing them. And build an admin endpoint to purge specific cache entries.
Gotcha 3: Conversational context breaks caching. Our first implementation cached based only on the user message. But "yes" means completely different things depending on what came before it. We had to include the last 2-3 messages in the cache key, which tanks the hit rate for multi-turn conversations. For single-turn FAQ queries, hit rates are great. For multi-turn, expect much lower returns.
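A minimal sketch of the fix, assuming you build the cache lookup text yourself before embedding it (the function name is ours):

```python
def cache_lookup_text(messages: list[dict], window: int = 3) -> str:
    """Embed the last few turns, not just the final user message, so
    context-dependent replies like "yes" don't collide across conversations."""
    recent = messages[-window:]
    return "\n".join(f"{m['role']}: {m['content']}" for m in recent)

convo = [
    {"role": "user", "content": "Can I downgrade my plan?"},
    {"role": "assistant", "content": "Yes, want me to walk you through it?"},
    {"role": "user", "content": "yes"},
]
print(cache_lookup_text(convo))
```

The trade-off is exactly the one described above: two conversations rarely share the same last three turns, so multi-turn hit rates drop sharply compared with single-turn FAQ traffic.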
Gotcha 4: Cache misses are more expensive. When you add a caching layer, misses now incur the embedding + vector search overhead on top of the LLM call. A Catchpoint analysis found that a single semantic cache miss increased latency by more than 2.5x. You need a minimum hit rate of 15-20% just to break even on the infrastructure cost.
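A toy break-even model makes that threshold concrete. Both per-request costs below are assumptions for illustration, not measured values:

```python
# Caching pays off when saved LLM spend covers the added embedding +
# vector-search cost incurred on EVERY request, hit or miss.
cost_llm_call = 0.004      # $ per LLM call avoided (assumed)
cost_cache_infra = 0.0006  # $ per request for embedding + lookup (assumed)

break_even_hit_rate = cost_cache_infra / cost_llm_call
print(f"{break_even_hit_rate:.0%}")  # hit rate needed just to cover overhead
```

With these assumed costs the break-even point lands at 15%, consistent with the 15-20% floor above; plug in your own per-call numbers before committing to the infrastructure.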
Gotcha 5: NaN scores in RAGAS metrics. If you're evaluating your cached responses with RAGAS (you should be), watch out for NaN scores when the LLM judge returns invalid JSON during metric calculation. No graceful fallback — one malformed response tanks the whole eval run. Wrap everything in retry logic.
Gotcha 6: Embedding model drift after updates. We updated our embedding model from v1 to v2 mid-deployment. Every cached entry was suddenly in a different vector space than incoming queries. Hit rate dropped to near zero overnight and we didn't realize it for two days because the system "worked" — it just fell through to the LLM every time. When you change embedding models, you need to rebuild your entire cache. Plan for this. Version your embedding model in your cache metadata and invalidate everything when it changes.
Gotcha 7: Multi-tenant cache leakage. If you serve multiple customers from the same LLM endpoint, a shared cache can leak information between tenants. User A asks about their account balance, the response gets cached, and User B's semantically similar question returns User A's data. The fix is simple but easy to forget: include the tenant ID in your cache key. We namespace all cache entries by {tenant_id}:{domain}:{embedding_hash}.
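A sketch of that key scheme (the helper is illustrative, not from a library):

```python
import hashlib

def tenant_cache_key(tenant_id: str, domain: str, embedding: bytes) -> str:
    """Prefix every cache entry with the tenant so a lookup can never
    cross account boundaries, even for identical embeddings."""
    digest = hashlib.sha256(embedding).hexdigest()[:16]
    return f"{tenant_id}:{domain}:{digest}"

# Same query embedding, two tenants -> two distinct cache entries
k1 = tenant_cache_key("acme", "billing", b"\x01\x02\x03")
k2 = tenant_cache_key("globex", "billing", b"\x01\x02\x03")
print(k1 == k2)  # False
```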
Implementation Checklist: This Week
Stop planning. Start shipping. Here's the order of operations:
Day 1: Enable provider-level prompt caching. If you use Anthropic, add cache_control to your system prompts. If you use OpenAI, verify your system prompts are over 1,024 tokens (pad if needed — the savings are worth it). Zero infrastructure. Immediate savings.
Day 2-3: Deploy semantic caching for your highest-volume endpoint. Install GPTCache or Redis with vector search. Start with your FAQ or support endpoint — the one with the most repetitive queries. Set a strict threshold (0.10) to avoid false positives. Use 24-hour TTLs.
```shell
pip install gptcache redis redisvl
```
Day 4: Add monitoring. You need to track: cache hit rate (target over 30%), false positive rate (target under 5%), latency on hits vs misses, and cost savings per day. If your hit rate is below 15%, the workload isn't cacheable — stop and move to routing instead.
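Even a trivial in-process counter is enough to start; a real deployment would export these numbers to your gateway or a metrics system. A minimal sketch:

```python
class CacheStats:
    """Minimal hit/miss counter. In production, export these to your
    gateway dashboard or metrics backend instead of keeping them in memory."""
    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

stats = CacheStats()
for hit in [True, True, False, True, False]:
    stats.record(hit)
print(f"{stats.hit_rate:.0%}")  # 60%
```

Track this per namespace, not globally: a healthy FAQ cache can mask a support-endpoint cache that never breaks even.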
Day 5: Implement basic model routing. Set up LiteLLM as your proxy. Route simple queries to your cheapest model. Establish quality baselines on a sample set before routing production traffic.
```shell
pip install litellm
litellm --model gpt-4o-mini --port 8000
```
Week 2: Tune and expand. Adjust similarity thresholds based on false positive reports. Add namespace separation for different query domains. Expand caching to additional endpoints. Set up cost dashboards with Helicone or Portkey.
Expected timeline to full savings: 2-3 weeks for the basic stack. 4-6 weeks to tune thresholds and routing rules. Most teams see 40-50% cost reduction in the first month and 60%+ by month three as they tune the system.
What I Actually Think
The LLM cost optimization discourse is split into two camps. Camp one says "just wait, prices will keep dropping." Camp two says "build a complex optimization layer with 14 different strategies." Both are wrong.
Prices are dropping — roughly 80% reduction between early 2025 and early 2026. But usage is growing faster than prices are falling. Every team I've worked with that adopted LLMs saw their query volume 3-5x within six months as internal adoption spread. The bill goes up even as per-token costs go down.
And you don't need 14 strategies. You need three: prompt caching (free), semantic caching (one day of work for the highest-volume endpoint), and model routing (one day of work with LiteLLM). That's it. That covers 80% of the savings most teams will ever capture.
I think GPTCache is good enough for getting started but you'll outgrow it. Redis with vector search is the production answer for semantic caching — the sub-millisecond latency and operational maturity matter. I think Redis LangCache (now in private preview) will become the default choice within a year because it removes the build-vs-buy decision entirely.
Model routing is the most underrated optimization. Most teams are running GPT-4o on everything because they're scared of quality regressions. But 70% of queries in a typical support bot are "what are your hours" and "how do I reset my password" — queries that GPT-4o-mini handles perfectly. You're paying 10x more for zero quality difference on the majority of your traffic.
The teams I see wasting the most money aren't the ones with bad infrastructure. They're the ones who never measured which queries actually need expensive models and which don't. Start measuring. The savings will be obvious.
And one more thing: don't cache everything. I've seen teams get so excited about caching that they cache responses to queries that involve real-time data, user-specific context, or time-sensitive information. A cached stock price is worse than no answer at all. Cache the boring, repetitive stuff. Let the expensive model handle the genuinely novel queries. That's the whole point.
The biggest misconception in LLM cost optimization is that it requires a massive engineering investment. It doesn't. Provider prompt caching is literally free — zero code changes for OpenAI, a few lines for Anthropic. Semantic caching with GPTCache takes a single afternoon. Model routing with LiteLLM takes another afternoon. Three days of work for a 60%+ cost reduction. I don't know any other engineering investment with that kind of ROI.
If you're spending more than $1,000/month on LLM APIs and you haven't implemented at least one of these three layers, you're overpaying. Not by a little. By thousands of dollars every single month. Stop reading, go enable prompt caching, and measure your savings tomorrow morning. The numbers will do the rest of the convincing.
Sources
- Hostinger — LLM Statistics 2026
- Index.dev — LLM Enterprise Adoption Statistics
- CloudIDR — LLM API Pricing 2026
- PremAI — Semantic Caching for LLMs: Cut API Bills by 60%
- Redis — What Is Semantic Caching
- Redis — Introducing LangCache and Vector Sets
- Redis — LLM Token Optimization
- GPTCache — GitHub Repository
- PromptHub — Prompt Caching with OpenAI, Anthropic, and Google
- Swfte AI — Intelligent LLM Routing: Multi-Model AI Cuts Costs by 85%
- Mavik Labs — LLM Cost Optimization 2026
- PremAI — LLM Cost Optimization: 8 Strategies
- InfoQ — Reducing False Positives in RAG Semantic Caching: Banking Case Study
- Catchpoint — Semantic Caching: What We Measured
- arXiv — Category-Aware Semantic Caching for Heterogeneous LLM Workloads
- arXiv — GPT Semantic Cache: Reducing Costs and Latency
- Helicone — Top 5 LLM Gateways Comparison 2025
- RAGAS — Documentation
- VentureBeat — Why Your LLM Bill Is Exploding
- Anthropic — Claude Prompt Caching Docs