LLM Cost Architecture: Caching, Routing, Fallbacks
Architecture Patterns -- Part 28 of 30
A founder I know built a document Q&A feature for their SaaS product. It was clever: users uploaded PDFs, asked questions, and GPT-4 answered with citations. The demo was stunning. Three weeks after launch, the bills arrived. The LLM line item: $14,800 for a single month. With 600 paying customers.
Do the math. That is roughly $25 per customer per month on LLM costs alone, for a product priced at $49/month. They were losing money on every active user.
Here is the thing: most of those API calls were redundant. A significant chunk of questions were semantically identical -- "summarize this contract," "what are the key dates," "who are the parties" -- asked slightly differently by different users against the same documents. They were paying GPT-4 prices to answer the same question hundreds of times a day.
The fix cost two weekends of engineering. After implementing semantic caching, model routing, and fallback logic, their monthly LLM spend dropped to under $2,000. Same product. Same users. Same quality.
This is the architecture that got them there.
The Three Pillars of LLM Cost Control
LLM cost problems almost always fall into one of three buckets: you are answering the same questions repeatedly (a caching problem), you are using a $15-per-million-token model for tasks a $0.15-per-million-token model could handle (a routing problem), or you are not handling API failures gracefully and users see errors instead of degraded-but-functional responses (a reliability problem).
Let us work through each one with real implementation.
Pillar 1: Caching
Exact-Match Caching
Start simple. A Redis key-value store with the raw prompt as the key catches identical requests instantly. This is free to implement and pays dividends immediately in any application with repeated system prompts or templated queries.
```typescript
import { createClient } from 'redis';
import crypto from 'crypto';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect(); // the client does not connect lazily

function getCacheKey(model: string, messages: object[]): string {
  const payload = JSON.stringify({ model, messages });
  return `llm:exact:${crypto.createHash('sha256').update(payload).digest('hex')}`;
}

async function cachedCompletion(model: string, messages: object[]) {
  const key = getCacheKey(model, messages);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  const response = await callLLM(model, messages);
  // TTL of 24 hours; adjust based on your staleness tolerance
  await redis.setEx(key, 86400, JSON.stringify(response));
  return response;
}
```
Exact-match caching has one obvious limitation: it misses "What are the key terms?" and "Summarize the main clauses" even though both want the same thing from the same document. That is where semantic caching earns its keep.
Semantic Caching
Semantic caching works by embedding incoming queries and storing those embeddings alongside the cached responses. When a new query arrives, you embed it, run a vector similarity search, and return a cached response if the similarity score exceeds a threshold. Research from Pluralsight's engineering team found that 31% of enterprise LLM queries are semantically similar to previous requests (https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering), and that semantic caching delivers 40-70% cost reduction with a 7x latency improvement for cache hits.
Upstash Vector (https://pypi.org/project/upstash-semantic-cache/) offers a hosted vector store with a semantic cache library designed exactly for this pattern:
```typescript
import { SemanticCache } from '@upstash/semantic-cache';
import { Index } from '@upstash/vector';

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

const cache = new SemanticCache({ index, minProximity: 0.92 });

async function semanticCachedCompletion(prompt: string): Promise<string> {
  const hit = await cache.get(prompt);
  if (hit) {
    console.log('Semantic cache hit');
    return hit;
  }

  const response = await callLLM('gpt-4o', [{ role: 'user', content: prompt }]);
  const text = response.choices[0].message.content;
  await cache.set(prompt, text);
  return text;
}
```
The minProximity threshold is your precision dial. At 0.95+, you only hit the cache for near-identical phrasing. At 0.85, you start catching legitimate paraphrases but risk returning a subtly wrong answer to a subtly different question. For factual lookups, 0.90-0.92 is a reasonable starting point. For creative or context-sensitive tasks, bias higher.
Cache invalidation is the only real complexity here. For document-based Q&A, scope cache keys to the document version hash. If the document changes, the old embeddings are irrelevant. For general knowledge queries, a 24-hour TTL plus a manual invalidation hook on content updates is usually sufficient.
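Scoping keys to a document version can be as simple as prefixing the prompt with a content hash before it reaches the cache. A minimal sketch (the helper name and key scheme are illustrative, not from any library):

```typescript
import crypto from 'crypto';

// Hypothetical helper: prefix cache lookups with a hash of the document
// content, so edits to the document automatically orphan stale answers
// instead of requiring explicit invalidation.
function versionedCacheKey(docContent: string, prompt: string): string {
  const docHash = crypto
    .createHash('sha256')
    .update(docContent)
    .digest('hex')
    .slice(0, 16);
  // A changed document produces a different prefix, so old entries
  // simply stop matching and age out via TTL.
  return `doc:${docHash}:${prompt}`;
}
```

Because invalidation happens implicitly through key mismatch, there is no deletion code path to get wrong; stale entries just expire.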
Pillar 2: Routing
Not every query needs your most powerful model. This is the single biggest lever most teams leave unpulled.
The Real Cost Gap
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Llama 3 (self-hosted) | ~$0 (infrastructure cost only) | ~$0 (infrastructure cost only) |
GPT-4o-mini is 20x cheaper on input and 25x cheaper on output than Claude Sonnet 4 (https://pricepertoken.com/compare/anthropic-claude-sonnet-4-vs-openai-gpt-4o-mini). If 60% of your traffic consists of queries that mini handles just as well, you cut costs by more than half before touching anything else.
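That "more than half" claim is easy to sanity-check with the prices from the table above. The token volumes below are illustrative; your mix will differ:

```typescript
// Rough blended-cost model: what happens if 60% of traffic moves to mini.
// Prices are per 1M tokens, taken from the table above.
const PRICES = {
  'gpt-4o':      { input: 2.5,  output: 10.0 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
};

function monthlyCost(
  model: keyof typeof PRICES,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICES[model];
  return (inputTokens / 1e6) * p.input + (outputTokens / 1e6) * p.output;
}

// Hypothetical month: 100M input tokens, 20M output tokens.
const allPremium = monthlyCost('gpt-4o', 100e6, 20e6); // $450
// Same month with 60% of traffic routed to mini.
const blended =
  monthlyCost('gpt-4o', 40e6, 8e6) + monthlyCost('gpt-4o-mini', 60e6, 12e6); // ~$196
```

At this volume, routing alone takes the bill from $450 to roughly $196 -- a 56% cut -- before caching removes any calls at all.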
Research from UC Berkeley's RouteLLM project demonstrated that routing simple queries to cheaper models while reserving premium models for complex reasoning achieves up to 85% cost reduction while maintaining 95% of output quality (https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering). Production data from teams using OpenRouter and LiteLLM confirms similar savings: 30-85% depending on traffic patterns (https://www.linkedin.com/posts/alexandra-skidan_do-not-overspend-on-llms-theres-a-way-activity-7387466017830051840-heeV).
Model Cascading in Practice
The cascading pattern tries the cheap model first. If the response meets a quality bar (length, confidence signal, or a lightweight classifier check), it returns that. Only on failure does it escalate.
```typescript
interface RoutingResult {
  response: string;
  model: string;
  cost_tokens: number;
}

async function routedCompletion(
  prompt: string,
  requiresReasoning: boolean = false
): Promise<RoutingResult> {
  // Fast path: simple tasks go directly to mini
  if (!requiresReasoning) {
    try {
      const result = await callWithTimeout('gpt-4o-mini', prompt, 8000);
      if (isAdequateResponse(result)) {
        return { response: result.content, model: 'gpt-4o-mini', cost_tokens: result.usage.total_tokens };
      }
    } catch (e) {
      // Timeout or quality failure -- fall through to expensive model
    }
  }

  // Escalate to full model
  const result = await callWithTimeout('gpt-4o', prompt, 20000);
  return { response: result.content, model: 'gpt-4o', cost_tokens: result.usage.total_tokens };
}

function isAdequateResponse(result: { content: string }): boolean {
  // Tune these thresholds for your domain
  const content = result.content;
  if (content.length < 50) return false;
  if (content.includes("I'm not sure") && content.length < 200) return false;
  return true;
}
```
Unified Routing with LiteLLM and OpenRouter
For teams managing multiple providers, LiteLLM (https://www.truefoundry.com/blog/litellm-vs-openrouter) provides a unified proxy with built-in routing, spend tracking per virtual key, and 3ms of added overhead. OpenRouter -- which raised $40M in June 2025 and now provides access to 623+ models (https://www.mindstudio.ai/blog/best-ai-model-routers-multi-provider-llm-cost-011e6/) -- handles routing as a managed service with a single credit-based billing model, at the cost of approximately 40ms added latency per request.
The choice: LiteLLM if you want full control, self-hosted governance, and direct provider billing. OpenRouter if you want zero infrastructure overhead and are comfortable with the latency trade-off.
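Because OpenRouter exposes an OpenAI-compatible chat-completions endpoint, switching providers is a one-string model change. A hedged sketch (endpoint shape and model slugs reflect OpenRouter's API as of this writing; check their docs for your account):

```typescript
// Minimal OpenRouter call. The request/response schema matches OpenAI's
// chat completions API, so the same code serves GPT, Claude, or Llama
// by swapping the model slug.
async function openRouterCompletion(model: string, prompt: string): Promise<string> {
  const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model, // e.g. 'openai/gpt-4o-mini' or 'anthropic/claude-sonnet-4'
      messages: [{ role: 'user', content: prompt }],
    }),
  });
  if (!res.ok) throw new Error(`OpenRouter error: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```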
Pillar 3: Fallbacks
API outages happen. Rate limits happen. A feature that goes dark when OpenAI has a bad afternoon is a feature your users will stop trusting.
The Circuit Breaker Pattern
A circuit breaker tracks the failure rate for a given provider. Once it trips, traffic stops going there entirely for a cooldown window -- preventing your app from hammering a down endpoint and burning retry budget.
```typescript
class LLMCircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5,
    private cooldownMs: number = 60000
  ) {}

  isOpen(): boolean {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.cooldownMs) {
        // Cooldown elapsed -- allow one probe request through
        this.state = 'half-open';
        return false;
      }
      return true;
    }
    return false;
  }

  recordFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
      console.warn('Circuit breaker tripped for provider');
    }
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }
}
```
```typescript
const openAIBreaker = new LLMCircuitBreaker(5, 60000);
const anthropicBreaker = new LLMCircuitBreaker(5, 60000);

async function resilientCompletion(prompt: string): Promise<string> {
  // Try OpenAI first
  if (!openAIBreaker.isOpen()) {
    try {
      const result = await callWithTimeout('gpt-4o', prompt, 15000);
      openAIBreaker.recordSuccess();
      return result.content;
    } catch (e) {
      openAIBreaker.recordFailure();
    }
  }

  // Fall back to Anthropic
  if (!anthropicBreaker.isOpen()) {
    try {
      const result = await callWithTimeout('claude-sonnet-4', prompt, 15000);
      anthropicBreaker.recordSuccess();
      return result.content;
    } catch (e) {
      anthropicBreaker.recordFailure();
    }
  }

  // Both providers are down -- return cached or static response
  const cached = await getMostRecentCachedResponse(prompt);
  if (cached) return cached;
  throw new Error('All LLM providers unavailable. Please try again shortly.');
}
```
Timeout Strategy
LLM calls are slow relative to everything else in your stack. Set aggressive timeouts and fail fast:
- Streaming responses: 3s to first token, 30s total
- Synchronous completions: 15s hard limit, retry once before fallback
- Batch/background jobs: 60s, no retries, queue for later
Never let an LLM call block your entire request handler indefinitely. Users will close the tab, and you will still pay for the tokens.
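The examples above lean on a `callWithTimeout` helper without defining it. One way to build the underlying timeout machinery is a generic wrapper around `AbortController` (a sketch; the function name and error message are mine, and the provider call would need to honor the `AbortSignal` for the request to actually be cancelled):

```typescript
// Generic timeout wrapper: rejects after timeoutMs and aborts the
// underlying work via AbortSignal, so a hung LLM call cannot block
// the request handler or keep billing tokens in the background.
function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return new Promise<T>((resolve, reject) => {
    controller.signal.addEventListener('abort', () =>
      reject(new Error(`LLM call timed out after ${timeoutMs}ms`))
    );
    work(controller.signal)
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}
```

Passing the signal through to the HTTP layer is the important part: a bare `Promise.race` abandons the response but the provider still generates, and charges for, the tokens.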
Putting It Together: An LLM Gateway
The cleanest implementation wraps all three patterns into a single gateway module that every feature routes through:
```typescript
async function llmGateway(options: {
  prompt: string;
  requiresReasoning?: boolean;
  trackFeature?: string;
}): Promise<{ response: string; cached: boolean; model: string }> {
  const { prompt, requiresReasoning = false, trackFeature = 'unknown' } = options;

  // Layer 1: semantic cache check
  const semanticHit = await cache.get(prompt);
  if (semanticHit) {
    await logUsage({ feature: trackFeature, model: 'cache', tokens: 0 });
    return { response: semanticHit, cached: true, model: 'cache' };
  }

  // Layer 2: route to appropriate model with circuit breaker fallback
  const { response, model, cost_tokens } = await resilientRoutedCompletion(prompt, requiresReasoning);

  // Layer 3: store in semantic cache
  await cache.set(prompt, response);

  // Layer 4: track spend
  await logUsage({ feature: trackFeature, model, tokens: cost_tokens });

  return { response, cached: false, model };
}
```
Monitoring: You Cannot Optimize What You Do Not Measure
Track spend at three levels:
- Per model: which models are you actually using vs. what you intended
- Per feature: which product surfaces generate the most cost
- Per user tier: free users on expensive models is a unit economics disaster
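The `logUsage` call in the gateway can start as something very small. A sketch of the shape of the data (in-memory here for illustration; in production this would be a database table or a tool like Helicone, and the event type is my assumption):

```typescript
// Minimal usage tracker: aggregate token counts per feature and per model.
// These two maps stand in for whatever store you actually write to.
type UsageEvent = { feature: string; model: string; tokens: number };

const spendByFeature = new Map<string, number>();
const spendByModel = new Map<string, number>();

async function logUsage(event: UsageEvent): Promise<void> {
  spendByFeature.set(
    event.feature,
    (spendByFeature.get(event.feature) ?? 0) + event.tokens
  );
  spendByModel.set(
    event.model,
    (spendByModel.get(event.model) ?? 0) + event.tokens
  );
}
```

Add a user-tier dimension the same way once you have it; the point is that every gateway call emits one event with enough dimensions to answer the three questions above.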
The document Q&A feature from our opening story would have flagged itself in the first week if per-feature cost tracking had been in place.
Build vs. buy decision: Helicone, Portkey, and LangSmith all provide observability with spend tracking, prompt logging, and routing analytics. Helicone integrates with a single header change. For teams under $5K/month in LLM spend, a build-it-yourself logging table is fine. Over that threshold, the operational visibility from a dedicated tool pays for itself.
Decision Checklist
Before shipping any LLM-backed feature, work through these:
- Have I profiled the query distribution? What percentage of prompts are semantically similar?
- Does every code path have a timeout? Is the timeout actually enforced?
- Are simple tasks explicitly routed to a cheap model, or does everything hit GPT-4/Claude Sonnet?
- Is there a fallback provider if my primary API goes down?
- What does the user see when all fallbacks fail? Is it a clean error or a silent hang?
- Am I tracking spend per feature so I can see the cost of each new capability I ship?
- Have I set a spend alert so the $14,800 surprise cannot happen to me?
- Is my cache invalidation logic tied to the content lifecycle, or is stale data a risk?
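The spend-alert item in particular costs only a few lines. A minimal sketch (thresholds and the return contract are illustrative; wire the returned message into whatever alerting channel you already use):

```typescript
// Hypothetical daily spend check: warn at 80% of budget, alert at 100%.
// Returns a human-readable message, or null when spend is healthy.
function checkSpendAlert(dailySpendUsd: number, budgetUsd: number): string | null {
  if (dailySpendUsd >= budgetUsd) {
    return `LLM spend $${dailySpendUsd.toFixed(2)} exceeded daily budget $${budgetUsd}`;
  }
  if (dailySpendUsd >= budgetUsd * 0.8) {
    return `LLM spend at ${Math.round((dailySpendUsd / budgetUsd) * 100)}% of daily budget`;
  }
  return null;
}
```

Run it on a schedule against the per-feature usage log and the $14,800 month becomes a same-day Slack message instead of an end-of-month surprise.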
Ask The Guild
What is your current strategy for deciding when to route a query to a cheaper model versus your primary model? Are you using a classifier, a keyword heuristic, task type detection, or something else entirely? Share your routing logic -- good, bad, or half-baked -- in the community thread. The messier the real answer, the more useful it is to everyone else trying to make the same call.