Architecture Patterns — Part 28 of 30

LLM Cost Architecture: Caching, Routing, Fallbacks

Written by claude-sonnet-4 · Edited by claude-sonnet-4
llm · cost-optimization · caching · ai-architecture · model-routing · fallbacks



A founder I know built a document Q&A feature for their SaaS product. It was clever: users uploaded PDFs, asked questions, and GPT-4 answered with citations. The demo was stunning. Three weeks after launch, the bills arrived. Their LLM line item: $14,800 for a single month. With 600 paying customers.

Do the math. That is roughly $24 per customer per month on LLM costs alone, for a product priced at $49/month. They were losing money on every active user.

Here is the thing: most of those API calls were redundant. A significant chunk of questions were semantically identical -- "summarize this contract," "what are the key dates," "who are the parties" -- asked slightly differently by different users against the same documents. They were paying GPT-4 prices to answer the same question hundreds of times a day.

The fix cost two weekends of engineering. After implementing semantic caching, model routing, and fallback logic, their monthly LLM spend dropped to under $2,000. Same product. Same users. Same quality.

This is the architecture that got them there.


The Three Pillars of LLM Cost Control

LLM cost problems almost always fall into one of three buckets: you are answering the same questions repeatedly (a caching problem), you are using a $15-per-million-token model for tasks a $0.15-per-million-token model could handle (a routing problem), or you are not handling API failures gracefully and users see errors instead of degraded-but-functional responses (a reliability problem).

Let us work through each one with real implementation.


Pillar 1: Caching

Exact-Match Caching

Start simple. A Redis key-value store with the raw prompt as the key catches identical requests instantly. This is free to implement and pays dividends immediately in any application with repeated system prompts or templated queries.

import { createClient } from 'redis';
import crypto from 'crypto';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

function getCacheKey(model: string, messages: object[]): string {
  const payload = JSON.stringify({ model, messages });
  return `llm:exact:${crypto.createHash('sha256').update(payload).digest('hex')}`;
}

async function cachedCompletion(model: string, messages: object[]) {
  const key = getCacheKey(model, messages);
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // callLLM is your provider wrapper around the OpenAI/Anthropic SDK
  const response = await callLLM(model, messages);
  // TTL of 24 hours; adjust based on your staleness tolerance
  await redis.setEx(key, 86400, JSON.stringify(response));
  return response;
}

Exact-match caching has one obvious limitation: it misses "What are the key terms?" and "Summarize the main clauses" even though both want the same thing from the same document. That is where semantic caching earns its keep.

Semantic Caching

Semantic caching works by embedding incoming queries and storing those embeddings alongside the cached responses. When a new query arrives, you embed it, run a vector similarity search, and return a cached response if the similarity score exceeds a threshold. Research from Pluralsight's engineering team found that 31% of enterprise LLM queries are semantically similar to previous requests (https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering), and that semantic caching delivers 40-70% cost reduction with a 7x latency improvement for cache hits.

Upstash Vector (https://pypi.org/project/upstash-semantic-cache/) offers a hosted vector store with a semantic cache library designed exactly for this pattern:

import { SemanticCache } from '@upstash/semantic-cache';
import { Index } from '@upstash/vector';

const index = new Index({
  url: process.env.UPSTASH_VECTOR_REST_URL!,
  token: process.env.UPSTASH_VECTOR_REST_TOKEN!,
});

const cache = new SemanticCache({ index, minProximity: 0.92 });

async function semanticCachedCompletion(prompt: string): Promise<string> {
  const hit = await cache.get(prompt);
  if (hit) {
    console.log('Semantic cache hit');
    return hit;
  }

  const response = await callLLM('gpt-4o', [{ role: 'user', content: prompt }]);
  const text = response.choices[0].message.content;
  await cache.set(prompt, text);
  return text;
}

The minProximity threshold is your precision dial. At 0.95+, you only hit the cache for near-identical phrasing. At 0.85, you start catching legitimate paraphrases but risk returning a subtly wrong answer to a subtly different question. For factual lookups, 0.90-0.92 is a reasonable starting point. For creative or context-sensitive tasks, bias higher.
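To make the threshold concrete, here is a toy sketch of the gating decision. The `cosineSimilarity` and `isCacheHit` functions and the two-dimensional vectors are illustrative only, not part of the Upstash library, which handles this internally:

```typescript
// Illustrative sketch of how a minProximity threshold gates cache hits.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the cached response only when similarity clears the threshold
function isCacheHit(queryVec: number[], cachedVec: number[], minProximity: number): boolean {
  return cosineSimilarity(queryVec, cachedVec) >= minProximity;
}
```

Raising `minProximity` shrinks the set of vectors that clear the bar; that is the whole precision dial.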

Cache invalidation is the only real complexity here. For document-based Q&A, scope cache keys to the document version hash. If the document changes, the old embeddings are irrelevant. For general knowledge queries, a 24-hour TTL plus a manual invalidation hook on content updates is usually sufficient.
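One minimal way to implement that scoping is to prefix the cache key with a hash of the document content, so a new upload automatically misses every stale entry. The `scopedCacheKey` helper below is an illustrative sketch, not part of any library:

```typescript
import crypto from 'crypto';

// Sketch: scope cache keys to a document version hash. When the document
// changes, the hash changes, and old cached answers are never matched.
function scopedCacheKey(docContent: string, prompt: string): string {
  const docVersion = crypto
    .createHash('sha256')
    .update(docContent)
    .digest('hex')
    .slice(0, 16); // short prefix is plenty for versioning
  return `doc:${docVersion}:${prompt}`;
}
```

Pass this composed string to the cache instead of the raw prompt and invalidation becomes automatic: stale entries simply age out via TTL.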


Pillar 2: Routing

Not every query needs your most powerful model. This is the single biggest lever most teams leave unpulled.

The Real Cost Gap

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Llama 3 (self-hosted) | ~$0.00 (infra cost only) | ~$0.00 (infra cost only) |

GPT-4o-mini is 20x cheaper on input and 25x cheaper on output than Claude Sonnet 4 (https://pricepertoken.com/compare/anthropic-claude-sonnet-4-vs-openai-gpt-4o-mini). If 60% of your traffic consists of queries that mini handles just as well, you cut costs by more than half before touching anything else.
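The per-request arithmetic is worth internalizing. This sketch computes cost from the pricing table above; the 2,000 input / 500 output token volumes are illustrative assumptions for a typical document Q&A call:

```typescript
// USD per million tokens, from the pricing table above
const PRICES = {
  'gpt-4o':      { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
} as const;

function requestCost(
  model: keyof typeof PRICES,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICES[model];
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}

// For 2,000 input / 500 output tokens:
// gpt-4o      -> $0.0100 per request
// gpt-4o-mini -> $0.0006 per request (roughly 17x cheaper)
```

At a few thousand requests a day, that gap is the difference between a rounding error and a five-figure invoice.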

Research from UC Berkeley's RouteLLM project demonstrated that routing simple queries to cheaper models while reserving premium models for complex reasoning achieves up to 85% cost reduction while maintaining 95% of output quality (https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering). Production data from teams using OpenRouter and LiteLLM confirm similar savings: 30-85% depending on traffic patterns (https://www.linkedin.com/posts/alexandra-skidan_do-not-overspend-on-llms-theres-a-way-activity-7387466017830051840-heeV).

Model Cascading in Practice

The cascading pattern tries the cheap model first. If the response meets a quality bar (length, confidence signal, or a lightweight classifier check), it returns that. Only on failure does it escalate.

interface RoutingResult {
  response: string;
  model: string;
  cost_tokens: number;
}

async function routedCompletion(
  prompt: string,
  requiresReasoning: boolean = false
): Promise<RoutingResult> {
  // Fast path: simple tasks go directly to mini
  if (!requiresReasoning) {
    try {
      const result = await callWithTimeout('gpt-4o-mini', prompt, 8000);
      if (isAdequateResponse(result)) {
        return { response: result.content, model: 'gpt-4o-mini', cost_tokens: result.usage.total_tokens };
      }
    } catch (e) {
      // Timeout or quality failure -- fall through to expensive model
    }
  }

  // Escalate to full model
  const result = await callWithTimeout('gpt-4o', prompt, 20000);
  return { response: result.content, model: 'gpt-4o', cost_tokens: result.usage.total_tokens };
}

function isAdequateResponse(result: { content: string }): boolean {
  // Tune these thresholds for your domain
  const content = result.content;
  if (content.length < 50) return false;
  if (content.includes("I'm not sure") && content.length < 200) return false;
  return true;
}

Unified Routing with LiteLLM and OpenRouter

For teams managing multiple providers, LiteLLM (https://www.truefoundry.com/blog/litellm-vs-openrouter) provides a unified proxy with built-in routing, spend tracking per virtual key, and 3ms of added overhead. OpenRouter -- which raised $40M in June 2025 and now provides access to 623+ models (https://www.mindstudio.ai/blog/best-ai-model-routers-multi-provider-llm-cost-011e6/) -- handles routing as a managed service with a single credit-based billing model, at the cost of approximately 40ms added latency per request.

The choice: LiteLLM if you want full control, self-hosted governance, and direct provider billing. OpenRouter if you want zero infrastructure overhead and are comfortable with the latency trade-off.


Pillar 3: Fallbacks

API outages happen. Rate limits happen. A feature that goes dark when OpenAI has a bad afternoon is a feature your users will stop trusting.

The Circuit Breaker Pattern

A circuit breaker tracks the failure rate for a given provider. Once it trips, traffic stops going there entirely for a cooldown window -- preventing your app from hammering a down endpoint and burning retry budget.

class LLMCircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 5,
    private cooldownMs: number = 60000
  ) {}

  isOpen(): boolean {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailureTime > this.cooldownMs) {
        this.state = 'half-open';
        return false;
      }
      return true;
    }
    return false;
  }

  recordFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
      console.warn('Circuit breaker tripped for provider');
    }
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }
}

const openAIBreaker = new LLMCircuitBreaker(5, 60000);
const anthropicBreaker = new LLMCircuitBreaker(5, 60000);

async function resilientCompletion(prompt: string): Promise<string> {
  // Try OpenAI first
  if (!openAIBreaker.isOpen()) {
    try {
      const result = await callWithTimeout('gpt-4o', prompt, 15000);
      openAIBreaker.recordSuccess();
      return result.content;
    } catch (e) {
      openAIBreaker.recordFailure();
    }
  }

  // Fall back to Anthropic
  if (!anthropicBreaker.isOpen()) {
    try {
      const result = await callWithTimeout('claude-sonnet-4', prompt, 15000);
      anthropicBreaker.recordSuccess();
      return result.content;
    } catch (e) {
      anthropicBreaker.recordFailure();
    }
  }

  // Both providers are down -- return cached or static response
  const cached = await getMostRecentCachedResponse(prompt);
  if (cached) return cached;

  throw new Error('All LLM providers unavailable. Please try again shortly.');
}

Timeout Strategy

LLM calls are slow relative to everything else in your stack. Set aggressive timeouts and fail fast:

  • Streaming responses: 3s to first token, 30s total
  • Synchronous completions: 15s hard limit, retry once before fallback
  • Batch/background jobs: 60s, no retries, queue for later

Never let an LLM call block your entire request handler indefinitely. Users will close the tab, and you will still pay for the tokens.
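The callWithTimeout helper used throughout this article is assumed rather than shown. Its core can be sketched as a generic timeout wrapper; `withTimeout` is an illustrative name, and you would wrap your SDK call's promise with it:

```typescript
// Sketch: race the LLM call against a timer and fail fast.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`LLM call timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `Promise.race` abandons the losing promise but does not cancel the underlying HTTP request; if your SDK supports an `AbortSignal`, pass one and abort it in the timer callback so you stop paying for tokens nobody will read.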


Putting It Together: An LLM Gateway

The cleanest implementation wraps all three patterns into a single gateway module that every feature routes through:

async function llmGateway(options: {
  prompt: string;
  requiresReasoning?: boolean;
  trackFeature?: string;
}): Promise<{ response: string; cached: boolean; model: string }> {
  const { prompt, requiresReasoning = false, trackFeature = 'unknown' } = options;

  // Layer 1: semantic cache check
  const semanticHit = await cache.get(prompt);
  if (semanticHit) {
    await logUsage({ feature: trackFeature, model: 'cache', tokens: 0 });
    return { response: semanticHit, cached: true, model: 'cache' };
  }

  // Layer 2: route to appropriate model with circuit breaker fallback
  const { response, model, cost_tokens } = await resilientRoutedCompletion(prompt, requiresReasoning);

  // Layer 3: store in semantic cache
  await cache.set(prompt, response);

  // Layer 4: track spend
  await logUsage({ feature: trackFeature, model, tokens: cost_tokens });

  return { response, cached: false, model };
}

Monitoring: You Cannot Optimize What You Do Not Measure

Track spend at three levels:

  • Per model: which models are you actually using vs. what you intended
  • Per feature: which product surfaces generate the most cost
  • Per user tier: free users on expensive models is a unit economics disaster

The document Q&A feature from our opening story would have flagged itself in the first week if per-feature cost tracking had been in place.
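The logUsage helper referenced in the gateway is assumed rather than shown. In production it would write to a database or an observability tool; the in-memory sketch below (all names illustrative) just shows the per-feature, per-model rollup shape that makes those three views queryable:

```typescript
// Sketch: aggregate token usage per feature and model in memory.
type UsageEvent = { feature: string; model: string; tokens: number };

const tokensByFeature = new Map<string, number>();

async function logUsage(event: UsageEvent): Promise<void> {
  const key = `${event.feature}:${event.model}`;
  tokensByFeature.set(key, (tokensByFeature.get(key) ?? 0) + event.tokens);
}
```

Swap the map for an append-only table keyed on (feature, model, user_tier, day) and each of the three views above becomes a one-line GROUP BY.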

Build vs. buy decision: Helicone, Portkey, and LangSmith all provide observability with spend tracking, prompt logging, and routing analytics. Helicone integrates with a single header change. For teams under $5K/month in LLM spend, a build-it-yourself logging table is fine. Over that threshold, the operational visibility from a dedicated tool pays for itself.


Decision Checklist

Before shipping any LLM-backed feature, work through these:

  • Have I profiled the query distribution? What percentage of prompts are semantically similar?
  • Does every code path have a timeout? Is the timeout actually enforced?
  • Are simple tasks explicitly routed to a cheap model, or does everything hit GPT-4/Claude Sonnet?
  • Is there a fallback provider if my primary API goes down?
  • What does the user see when all fallbacks fail? Is it a clean error or a silent hang?
  • Am I tracking spend per feature so I can see the cost of each new capability I ship?
  • Have I set a spend alert so the $14,800 surprise cannot happen to me?
  • Is my cache invalidation logic tied to the content lifecycle, or is stale data a risk?

Ask The Guild

What is your current strategy for deciding when to route a query to a cheaper model versus your primary model? Are you using a classifier, a keyword heuristic, task type detection, or something else entirely? Share your routing logic -- good, bad, or half-baked -- in the community thread. The messier the real answer, the more useful it is to everyone else trying to make the same call.

About Tom Hundley

Tom Hundley writes for builders who need stronger technical judgment around AI-assisted software work. The Guild turns production experience into public articles, copy-paste prompts, and structured learning paths that help non-software developers supervise AI agents more safely.
