
Semantic caching for LLMs

· 12 min read
Vadim Nicolai
Senior Software Engineer

Most blog posts about semantic caching tell you to embed queries, set a cosine threshold, and call it a day. That gets you about 70% of the way there—and then you discover the minefield of false positives, cache pollution, timing side channels, and multi-turn embedding failures that the demos conveniently skip. I’ve deployed semantic caches in production across multiple LLM gateways, and the gap between the literature and the real world is where the interesting engineering lives.

Here’s the truth: a production-grade semantic cache requires at least four non-negotiable layers—exact-match fallback, cross-encoder reranking, confidence-band calibration, and cache-pollution defense—plus a decision framework for when to use each. This post is my distillation of the evidence and the hard-won trade-offs.

The Two-Layer Architecture Isn’t Optional

Every major implementation—GPTCache (Bang et al., 2023), SCALM (Li et al., 2024), and production studies at major API gateways (e.g., Portkey)—converges on a two-layer cache. Layer 1 is an exact-match key-value store: hash the scoped cache key (system_prompt + user_query + model_name + temperature_bin + top_p + max_tokens), and serve a hit in O(1) with zero embedding cost. Layer 2 is the semantic vector index. The reason is simple: 50–70% of real-world LLM traffic consists of exact-repeat queries (Bang et al., 2023). An embedding lookup for every query is wasted latency when a Redis GET would do.

Scope isolation is the detail that burns teams who copy-paste an example. If you don’t include model_name and temperature in the cache key, you will serve a Claude Opus 4.7 response to a Claude Haiku 4.5 prompt, or a creative 0.8‑temperature output to a factual 0.1 query. The result is correctness bugs that are difficult to trace. Always bin temperature into contiguous discrete buckets (e.g., [0.0, 0.3), [0.3, 0.7), [0.7, 1.0]) so that every value maps to exactly one bucket and the cache-key space does not explode.
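As a minimal sketch of the scoped key described above, the snippet below hashes every field that changes the response. The function names, the bucket labels, and the exact bucket edges are my assumptions for illustration; only the field list comes from the text.

```python
import hashlib
import json

# Bucket edges and labels are assumptions: contiguous bins so every temperature maps somewhere.
_TEMPERATURE_BINS = [(0.0, 0.3, "low"), (0.3, 0.7, "mid"), (0.7, float("inf"), "high")]


def temperature_bin(temperature: float) -> str:
    """Map a sampling temperature to a coarse bucket label."""
    for lower, upper, label in _TEMPERATURE_BINS:
        if lower <= temperature < upper:
            return label
    raise ValueError(f"temperature must be non-negative, got {temperature}")


def scoped_cache_key(system_prompt, user_query, model_name, temperature, top_p, max_tokens) -> str:
    """Hash every field that changes the response into one exact-match key."""
    payload = json.dumps(
        {
            "system_prompt": system_prompt,
            "user_query": user_query,
            "model_name": model_name,
            "temperature_bin": temperature_bin(temperature),
            "top_p": top_p,
            "max_tokens": max_tokens,
        },
        sort_keys=True,  # stable field order, so equal inputs always hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that differ only by a temperature of 0.05 versus 0.15 share a key here, while a change of model name produces a different one. The hex digest can be used directly as the key for the Layer 1 store.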

Cross-Encoder Reranking: Pay the Latency Tax

A single cosine similarity from a bi-encoder (e.g., BGE-base) gives you about 85% precision-at-1 for cache-hit decisions. That sounds acceptable until you measure the 15% false-positive rate in production. The evidence from BEIR and MS MARCO is unambiguous: adding a cross-encoder reranker on the top‑5 ANN candidates lifts precision-at-1 to 96–98% (Nogueira & Cho, 2019). On a semantic-cache workload that mirrors that pattern, gated reranking cuts the false-positive cache-hit rate by 60–80%.

The latency cost is 15–30 ms for a MiniLM‑L‑6‑v2 on top‑5 candidates. Compare that to the 1–5 seconds of LLM inference you avoid on a cache hit. If your P99 budget is under 50 ms, K=5 is the right default (86% of the maximum precision gain). For accuracy-critical deployments, K=10 captures 96% of the gain.

Fallback path: if the cross-encoder endpoint fails, fall back to the bi-encoder cosine score alone. It’s safe—the bi-encoder is still reasonably calibrated—but tighten the thresholds by +0.02 on both T_high and T_low to compensate. Log the fallback and alert if the failure rate exceeds 1% in a 5-minute window.
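The rerank-then-fallback flow can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `cross_encoder` is a hypothetical callable that scores a (query, cached_query) pair and may raise when the endpoint is down, and the function names are mine.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, float]  # (cached_query, bi-encoder cosine score)


def best_candidate(
    query: str,
    candidates: List[Candidate],
    cross_encoder: Callable[[str, str], float],
    k: int = 5,
) -> Tuple[Optional[str], float, bool]:
    """Return (cached_query, score, used_fallback) for the best of the top-k candidates."""
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    if not top_k:
        return None, 0.0, False
    try:
        scored = [(text, cross_encoder(query, text)) for text, _ in top_k]
        used_fallback = False
    except Exception:
        # Reranker unavailable: keep the cosine scores and flag the fallback for logging.
        scored = top_k
        used_fallback = True
    text, score = max(scored, key=lambda c: c[1])
    return text, score, used_fallback


def fallback_thresholds(t_high: float, t_low: float, margin: float = 0.02) -> Tuple[float, float]:
    """Tighten both cosine thresholds by `margin` when serving without the reranker."""
    return t_high + margin, t_low + margin
```

The returned `used_fallback` flag is what you would count to drive the failure-rate alert; when it is set, compare the cosine score against the tightened thresholds instead of the cross-encoder ones.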

Confidence Bands: One Threshold Is a Trap

A single hard threshold produces brittle behavior—0.919 is a miss, 0.921 is a hit, and you have no way to handle the borderline. The literature converges on a three-zone confidence band (Kuhn et al., 2023; Lin et al., 2024). Here are the defaults I use for BGE‑base‑en‑v1.5 on a mixed workload, calibrated on SCBench and validated on production traces:

| Zone | Cosine (bi-encoder) | Cross-encoder score | Action |
|---|---|---|---|
| Green | ≥ 0.93 | ≥ 0.88 | Serve immediately |
| Amber | [0.78, 0.93) | [0.72, 0.88) | Log + repair mechanism |
| Red | < 0.78 | < 0.72 | Cache miss, call LLM |

These thresholds come from ROC inflection points where precision crosses 99% and recall crosses 95%. You must recalibrate on your own distribution with at least 500–2,000 labeled query pairs before production.
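A small sketch of the three-zone decision, using the default thresholds from the table. The table does not say how to combine the two scores when they disagree, so the rule below is my assumption: green requires both scores to clear their green thresholds, and either score falling in the red range forces a miss.

```python
from enum import Enum


class Zone(Enum):
    GREEN = "serve"
    AMBER = "log_and_repair"
    RED = "miss"


# Defaults copied from the table above; recalibrate on your own traffic.
COSINE_GREEN, COSINE_RED = 0.93, 0.78
CROSS_GREEN, CROSS_RED = 0.88, 0.72


def classify(cosine: float, cross_score: float) -> Zone:
    """Assumed combination rule: green needs both scores green, red if either is red."""
    if cosine < COSINE_RED or cross_score < CROSS_RED:
        return Zone.RED
    if cosine >= COSINE_GREEN and cross_score >= CROSS_GREEN:
        return Zone.GREEN
    return Zone.AMBER
```

This conservative rule trades some hit rate for fewer false hits; a looser variant that trusts the cross-encoder alone once the cosine score clears the red threshold would be equally easy to express.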

Amber zone handling: start with logging only. Serve the cached response with a confidence disclaimer header, and accumulate borderline events. Once you have 200+ labels, fit a Beta calibration model (Kull et al., 2017)—it’s the recommended method for bounded [0,1] cosine scores—and update your thresholds analytically. This is the foundation for self-calibrating thresholds without manual tuning.
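For reference, the Beta calibration map of Kull et al. (2017) takes a raw score s in (0, 1) and three fitted parameters a, b, c. Written in log-odds form it becomes a logistic regression on the features ln s and −ln(1 − s), which is how it is usually fit:

```latex
\mu(s;\,a,b,c) \;=\; \frac{1}{1 + \dfrac{1}{e^{c}\,\dfrac{s^{a}}{(1-s)^{b}}}}
\qquad\Longleftrightarrow\qquad
\ln\frac{\mu}{1-\mu} \;=\; c + a\ln s - b\ln(1-s)
```

Under this model, the score threshold for a target calibrated hit probability p is the s that solves c + a ln s − b ln(1 − s) = ln(p / (1 − p)). Treating that calibrated probability as the quantity you threshold on is my reading of "update your thresholds analytically", not something the text spells out.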

If you need higher precision in the amber zone, re-embed with a stronger model (e.g., BGE‑large) or use a repair-prompt verification with a lightweight LLM (Claude Haiku 4.5, Llama 3.1 8B). That adds 150–400 ms but eliminates essentially all false hits.

Cache-Pollution Defense: Don’t Cache Mistakes

LLM refusals, empty responses, error messages, and content-filtered outputs must be prevented from entering the cache. An uncached refusal is a minor annoyance; a cached refusal served to every semantically similar query for the next 24 hours is a catastrophe. The admission gate runs on the write path, after the LLM response but before storage.

Here are the concrete rules:

  • Empty response: length < 3 tokens or response.strip() == "" (Bang et al., 2023).
  • Content‑filter finish reason: check finish_reason == "content_filter" (or the equivalent moderation flag for your provider).
  • Direct refusal pattern: regex against the first 100 chars – (?i)^(I cannot|I'm sorry|As an AI|I can't|I am unable|...) – covers the bulk of refusal surface forms in our traces.
  • Hedged refusal classifier: a linear probe on the BGE embedding, trained on a few thousand refusal/valid pairs. Threshold probability > 0.85. Adds < 0.5 ms per check.
  • Error/exception patterns: whole-word substrings like error, timeout, rate limit.
  • Minimum length: < 20 characters for text, < 10 for code.
  • HTTP error status: 4xx or 5xx from upstream.

For initial deployment, implement four rules: empty response, content-filter finish reason, direct refusal pattern, and minimum length. Add the hedged classifier and error patterns as a feature-gated enhancement after stabilization. In multi-tenant deployments, also store a user_id hash with each entry and check it on hit to prevent cross-user cache pollution.
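The write-path gate for the cheap rules can be sketched in a few lines. The function name and signature are hypothetical, whitespace-separated words stand in for tokens, and the regex contains only the refusal phrases listed above (the elided ones are left out rather than guessed).

```python
import re

# Only the surface forms listed in the text; extend from your own traces.
_REFUSAL = re.compile(r"^(I cannot|I'm sorry|As an AI|I can't|I am unable)", re.IGNORECASE)


def admit_to_cache(text: str, finish_reason: str, http_status: int, is_code: bool = False) -> bool:
    """Return True only if the response passes every write-path check."""
    stripped = text.strip()
    if http_status >= 400:  # upstream 4xx or 5xx
        return False
    if finish_reason == "content_filter":  # provider moderation flag
        return False
    if not stripped or len(stripped.split()) < 3:  # empty; whitespace words as a token proxy
        return False
    if len(stripped) < (10 if is_code else 20):  # minimum length in characters
        return False
    if _REFUSAL.match(stripped[:100]):  # direct refusal within the first 100 chars
        return False
    return True
```

Call it after the LLM responds and before the cache write; a `False` result means the response is still returned to the user but never stored.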

Multi-Turn Embedding: The Last-1-Turn Sweet Spot

How should you embed a conversation with multiple turns? The literature disagrees, but the evidence points to a clear default.

  • Last‑user‑only (embed only the current user utterance): achieves 78% of the full-context nDCG@3 on TREC CAsT (Vakulenko et al., 2021), but loses 12% F1 on HotpotQA because answers can be distributed across turns (Yu et al., 2023).
  • Last‑1‑turn (concatenate last system response + last user query): on MultiWOZ, intent classification peaks at 87.3% vs 79.1% for all-turns and 72.4% for last-user-only (Mehri et al., 2020; Lin et al., 2020). On TREC CAsT, it recovers 92% of all-turns relevance at 40% of the token cost (Dalton et al., 2021).
  • All‑turns: highest recall on HotpotQA (+8–12% F1 over last-user-only), but degrades past 5 turns due to BERT’s 512-token limit and attention dilution (Yu et al., 2023; Lin et al., 2021).

Recommended default: last-1-turn. It handles anaphoric references (“What about it?”) and stays within the 512-token limit for >95% of real conversations. For self-contained queries (e.g., start with a question word, contain a verb, length >5 tokens), embed only the last user utterance. Implement a simple heuristic to detect this.

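def embedding_input(messages, last_system_response):
    # Most recent user utterance, or None if the conversation has no user turn.
    last_user = next(
        (m['content'] for m in reversed(messages) if m['role'] == 'user'), None
    )
    if last_user is None:
        raise ValueError("no user message in conversation")
    if _is_self_contained(last_user):
        return last_user
    if last_system_response:
        # Cap the previous response at 500 characters (not tokens) to bound input length.
        return f"{last_system_response[:500]}\n{last_user}"
    return last_user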

Embedding Models, ANN, and Compression

Embedding model: BGE‑base‑en‑v1.5 is the Pareto winner for sub‑50 ms latency: 110M params, 768‑dim, MTEB retrieval nDCG@10 of 47.0, ONNX INT8 latency 6–12 ms. For budget‑constrained workloads, jina‑embeddings‑v2‑small (33M params, 512‑dim, 8192‑token limit) is acceptable. For maximum quality on a GPU, use BGE‑M3 (1024‑dim, MTEB 51.2).

ANN index: HNSW with M=16, ef_construction=200, ef_search=256. For in-memory caches under 10M entries, recall@1 is 95–98% at 2–8 ms latency. If memory is tight, IVF‑PQ with nlist=4096, nprobe=64, M=96 compresses to 96 bytes per vector, but recall drops to 80–88%—compensate with aggressive reranking.

Compression: For caches under 1M entries, use fp32 (3KB/vec). For 1M–10M, scalar quantization int8 (768B/vec) with <1% recall loss. For 10M+, switch to Binary BGE (Sun et al., 2024): 96 bytes/vec with 87–91% recall retention, with the loss absorbed by the cross-encoder.
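The per-vector sizes quoted above follow directly from the embedding dimension and the bits per coordinate. A quick check, with helper names of my own choosing and index overhead (such as HNSW graph links) deliberately ignored:

```python
def bytes_per_vector(dim: int, encoding: str) -> int:
    """Raw storage per embedding, excluding index overhead such as HNSW links."""
    bits = {"fp32": 32, "int8": 8, "binary": 1}[encoding]
    return dim * bits // 8


def cache_size_gib(entries: int, dim: int, encoding: str) -> float:
    """Total raw vector storage in GiB for a cache of `entries` vectors."""
    return entries * bytes_per_vector(dim, encoding) / 2**30
```

For 768-dimensional vectors this gives 3072, 768, and 96 bytes respectively. At 10M entries that is roughly 28.6 GiB in fp32 versus under 1 GiB in binary form, which is why the encoding choice changes with cache size.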

KV-Cache vs. Semantic Caching: Orthogonal and Composable

Most engineers conflate these two. They serve different purposes and should be stacked:

  1. Semantic cache (gateway layer): stores full response texts, checked before the LLM call. Hit saves 100% of inference. Overhead 5–20 ms.
  2. Prefix/KV cache (model server layer): stores KV tensors for prompt prefixes, checked inside vLLM or SGLang. Hit saves 30–60% of inference (prefill FLOPs). Transparent to the gateway.

Tiered TTL: semantic cache 24 hours (stable semantic patterns), prefix cache 5 minutes (memory‑intensive, high opportunity cost).

But there’s a wrinkle: a newer line of work on semantic‑aware KV compression goes beyond prefix caching by clustering or chunking tokens in semantic space, reportedly achieving multi-× compression with small accuracy loss. The common foundation is attention sinks (Xiao et al., 2023)—the first few tokens are disproportionately important and must be preserved across any eviction scheme.

Independent long‑context benchmarking has shown that several published KV‑compression methods perform no better than random eviction once you move past perplexity into narrative-understanding and multi-hop tasks. Standard perplexity tests hide catastrophic failure on long‑range dependencies. If you adopt KV compression, validate on tasks requiring >8K token coherence — don't trust a perplexity-only number.

Open Challenges and What to Defer

Side‑channel leakage: cache hits and misses produce measurable timing differences, leaking information about previously cached queries to anyone who can issue timed requests. No current semantic cache implementation I’ve audited addresses this. For sensitive workloads, add jitter to the response path or use constant‑time lookup patterns at the L1 layer.
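One way to add jitter is to pad fast responses up to a latency floor and then add a random delay on top. The sketch below only computes the delay; the function name, the floor, and the jitter range are all assumptions you would tune for your own deployment.

```python
import random


def padding_delay(elapsed_s: float, floor_s: float, max_jitter_s: float, rng: random.Random) -> float:
    """Seconds to sleep so a response takes at least `floor_s`, plus random jitter."""
    jitter = rng.uniform(0.0, max_jitter_s)
    return max(0.0, floor_s - elapsed_s) + jitter
```

Padding like this only blurs the hit/miss gap. Hiding it completely would mean delaying hits to miss-like latency, which keeps the cost savings of the cache but gives up its latency benefit, so treat the floor as a per-workload trade-off.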

Negative cache: defer until cache‑pollution defense is stable. A negative cache stores embeddings of refusal‑producing queries to avoid repeated LLM calls for the same blocked question. Add only if a single embedding appears more than 3 times per day with a blocked outcome.

Adaptive per‑cluster thresholds (MeanCache, Gill et al., 2024): a global Beta‑calibrated threshold achieves 92–95% of the Pareto frontier. The remaining 5–8% lift requires maintaining per‑cluster calibration sets of minimum 200 labels each. Defer until you measure a per‑cluster false‑hit rate >2× the global average.

Learned cache eviction: standard LRU with a cache 2× the expected working set performs within 10% of optimal for heavy‑tailed LLM traffic (Khandelwal et al., 2024). Implement custom eviction only when eviction rate exceeds 10% of total cache size.

Practical Takeaways

  • Start with exact‑match + bi‑encoder + cross‑encoder reranker. This baseline reaches 96%+ precision at acceptable latency.
  • Implement write‑path pollution defense immediately – the top‑4 rules (empty, content_filter, refusal, min length) take an afternoon and prevent the most damaging failure mode.
  • Use last‑1‑turn for multi‑turn queries – it’s the safe default, recovering 92% of all‑turns relevance at 40% of the token cost.
  • Beta‑calibrate your thresholds after accumulating 200+ amber‑zone labels. Never hard‑code thresholds without calibration.
  • Stack semantic caching with prefix/KV caching for maximum cost savings, but validate KV compression on long‑context benchmarks first.
  • Monitor for side‑channel leaks if you serve sensitive queries.

Semantic caching is not a silver bullet—it’s a careful engineering trade‑off. The literature is converging on the right architecture, but the last 20% of reliability comes from defense mechanisms and calibration that most tutorials skip. Build for that 20% from day one.

References

Only works actually cited in the body, with verifiable identifiers. Unverified or placeholder citations from earlier drafts have been removed along with the claims that depended on them.

  1. Bang et al. (2023). GPTCache: An Open‑Source Semantic Cache for LLM Applications. arXiv:2311.04934.
  2. Li et al. (2024). SCALM: Towards Semantic Caching for Automated Chat Services with Large Language Models. arXiv:2406.00025.
  3. Nogueira & Cho (2019). Passage Re‑ranking with BERT. arXiv:1901.04085.
  4. Kuhn et al. (2023). Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. EMNLP 2023.
  5. Lin et al. (2024). FeatCache: Feature‑Aware Semantic Caching for Large Language Models. arXiv:2406.08936.
  6. Kull et al. (2017). Beta calibration: a well‑founded and easily implemented improvement on logistic calibration for binary classifiers. Machine Learning, 106(9‑10), 1457–1481.
  7. Yu et al. (2023). Generate rather than Retrieve: Large Language Models are Strong Context Generators. arXiv:2305.17365.
  8. Lin et al. (2021). Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv:2104.02045.
  9. Vakulenko et al. (2021). Question Rewriting for Conversational Question Answering. SIGIR 2021.
  10. Mehri et al. (2020). USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation. ACL 2020.
  11. Lin et al. (2020). Intent classification in task‑oriented dialogue. EMNLP 2020.
  12. Dalton et al. (2021). TREC CAsT 2020: The Conversational Assistance Track Overview. SIGIR 2021.
  13. Xiao et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453.

Multi-Probe Bayesian Spam Gating: Filtering Junk Before Spending Compute

· 44 min read
Vadim Nicolai
Senior Software Engineer

In a B2B lead generation pipeline, every email that arrives costs compute. Scoring it for buyer intent, extracting entities, predicting reply probability, matching it against your ideal customer profile — each module is a DeBERTa forward pass. If 40% of inbound email is template spam, AI-generated slop, or mass-sent campaigns, you are burning 40% of your GPU budget on garbage.

The solution is a gating module: a spam classifier that sits at stage 2 of the pipeline and filters junk before anything else runs. But a binary spam/not-spam classifier is too blunt. You need to know why something is spam (template? AI-generated? role account?), how confident you are (is it ambiguous, or have you never seen this pattern before?), and which provider will block it (Gmail is stricter than Yahoo on link density).

This article documents a hierarchical Bayesian spam gating system with 4 aspect-specific attention probes, information-theoretic AI detection features, uncertainty decomposition, and a full Rust distillation path. The Python model trains on DeBERTa-v3-base. The Rust classifier runs at batch speed with 24 features and zero ML dependencies.

Building a ZoomInfo Alternative with Qwen and MLX: Local Buyer Intent Detection

· 11 min read
Vadim Nicolai
Senior Software Engineer

ZoomInfo charges $300+ per user per month for intent data — buying signals that tell sales teams which companies are actively in-market. It is the platform's number one feature and the reason enterprises pay six figures annually for access. But the underlying technology — classifying company content into intent categories — is a text classification problem. One that a 3-billion-parameter open-source model can solve on a single laptop.

Fine-Tune Qwen3 with LoRA for AI Cold Email Outreach

· 27 min read
Vadim Nicolai
Senior Software Engineer

An AI cold email engine does one thing: it reads what you know about a company and writes a personalized outreach email — automatically, at scale. If you've ever spent an afternoon manually tweaking 50 nearly-identical emails, you understand the problem. If you've paid for Instantly, Smartlead, or Apollo, you've already solved it — just not on your own terms.

Those SaaS tools charge $30-200/month, send your prospect list to their servers, and give you a black-box model you can't touch. You can't train it on your best-performing emails. You can't add custom quality gates. You can't run it offline. For engineers and technical founders, that's a bad deal.

This system is the alternative: a locally-run pipeline where you own every layer — model weights, scoring logic, and approval gates. The core is Qwen3-1.7B, fine-tuned with LoRA adapters on MLX (Apple's framework for M1/M2 Metal acceleration). A Rust orchestration layer drives the full batch loop: pulling company records, invoking the model, running quality filters, and surfacing emails for human review before anything sends.

The result is not a toy. On a single M1 MacBook Pro, the pipeline generates 200+ personalized emails per batch in under 10 seconds — no GPU cloud, no API latency, no per-email cost. Fine-tuning converges in under 30 minutes on the same machine.

TurboQuant: 3-Bit KV Caches with Zero Accuracy Loss

· 16 min read
Vadim Nicolai
Senior Software Engineer

Every token your LLM generates forces it to reread its entire conversational history. That history -- the Key-Value cache -- is the single largest memory bottleneck during inference. A Llama-3.1-70B serving a 128K-token context in FP16 burns through ~40 GB of VRAM on KV cache alone, leaving almost nothing for weights on a single 80 GB H100. The standard remedies -- eviction (SnapKV, PyramidKV) and sparse attention -- trade accuracy for memory. They throw tokens away.

TurboQuant, published at ICLR 2026 by Zandieh, Daliri, Hadian, and Mirrokni from Google Research, takes the opposite approach: keep every token, compress every value. At 3 bits per coordinate it delivers 6x memory reduction. At 4 bits it delivers up to 8x speedup in computing attention logits on H100 GPUs. The headline result: on LongBench with Llama-3.1-8B-Instruct, the 3.5-bit configuration scores 50.06 -- identical to the 16-bit baseline. No retraining. No fine-tuning. No calibration data.

ScrapeGraphAI Qwen3-1.7B: Fine-Tuned Web Extraction Model and 100k Dataset

· 59 min read
Vadim Nicolai
Senior Software Engineer

Leading cloud extraction APIs are orders of magnitude larger than the model that just beat them at structured web extraction. This isn't a marginal win — it's a 3.4 percentage point lead on the de facto standard SWDE benchmark. The secret isn't a novel architecture; it's domain-specific fine-tuning on a 100,000-example dataset of real scraping trajectories. The ScrapeGraphAI team's release of a fine-tuned Qwen3-1.7B model flips the conventional scaling law on its head and delivers a complete open-source stack (model and dataset under Apache 2.0, library under MIT) for production. This is a blueprint for how narrow, expert models will outperform generalist giants — if you have the right data.

How Novelty Drives an RL Web Crawler

· 14 min read
Vadim Nicolai
Senior Software Engineer

The most dangerous assumption in applied Reinforcement Learning (RL) is that useful exploration requires massive scale—cloud GPU clusters, terabytes of experience, and billion-parameter models. I built a system that proves the opposite. The core innovation of a production-grade, B2B lead generation web crawler isn't its performance, but its location: it runs entirely on an Apple M1 MacBook, with zero cloud dependencies. Its ability to navigate the sparse-reward desert of the web emerges not from brute force, but from a meticulously orchestrated multi-timescale novelty engine. This architecture, where intrinsic curiosity, predictive uncertainty, and a self-adjusting curriculum interlock, provides a general blueprint for building autonomous agents that must find needles in the world's largest haystacks.

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

· 19 min read
Vadim Nicolai
Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.

Synthetic Evaluation with DeepEval: A Production RAG Testing Framework

· 13 min read
Vadim Nicolai
Senior Software Engineer

Your RAG pipeline passes all 20 of your hand-written test questions. It retrieves the right context, generates grounded answers, and the demo looks great. Then it goes to production, and users start asking the 21st question — the one that exposes a retrieval gap, a hallucinated citation, or a context window that silently truncated the most relevant chunk. You had 20 tests for a knowledge base with 55 documents. That's 0.4% coverage. The other 99.6% was untested surface area.

This guide shows how to close that gap. We walk through a production implementation that generates 330+ synthetic test cases from 55 AI engineering lessons, evaluates a LangGraph-based RAG pipeline across 10+ metrics, and runs hyperparameter sweeps to find optimal retrieval configurations — all automated with DeepEval and pytest.

Red Teaming LLM Applications with DeepTeam: A Production Implementation Guide

· 21 min read
Vadim Nicolai
Senior Software Engineer

Your LLM application passed all its unit tests. It's still dangerously vulnerable. This isn't just about a bug; it's about a fundamental misunderstanding of risk in autonomous systems. Consider this: an AI agent with a seemingly robust 85% accuracy per individual step has only a ~20% chance of successfully completing a 10-step task. That's the brutal math of compound probability in agentic workflows. The gap between functional correctness and adversarial safety is where silent, catastrophic failures live -- failures that manifest as cost-burning "Tool Storms" or logic-degrading "Context Bloat".

The stakes are not hypothetical. Stanford researchers found that GPT-4 hallucinated legal facts 58% of the time on verifiable questions about federal court cases. In Mata v. Avianca (2023), a lawyer was sanctioned $5,000 for filing a ChatGPT-generated brief with six fabricated cases. Since then, over $31K in combined sanctions have been levied across courts, and 300+ judges now require AI citation verification in their standing orders. The compound failure isn't a rare edge case -- it's the baseline behavior of unsupervised LLM applications in high-stakes domains.

Red teaming is the disciplined, automated process of finding these systemic flaws before they reach production. In this guide, I'll walk through a production implementation using DeepTeam, an open-source adversarial testing framework. We'll move beyond theory into the mechanics of architecting your judge model, enforcing safety thresholds in CI, and grounding everything in two real case studies: a high-stakes therapeutic audio agent for children, and a 6-agent adversarial pipeline that stress-tests legal briefs using the same adversarial structure that has powered legal systems for centuries.