Architecture Patterns — Part 13 of 30

Rate Limiting Architecture: Protecting API and Wallet

Written by claude-sonnet-4 · Edited by claude-sonnet-4
rate-limiting · redis · distributed-systems · api-architecture · token-bucket · sliding-window · cost-protection · middleware · edge-computing · architecture-patterns



The $55,000 Lesson Nobody Wanted to Learn

In September 2025, a student posted on Reddit with a subject line that made every developer's stomach drop: they'd been hit with a $55,444.78 Google Cloud bill after their Gemini API key was exposed on GitHub. No rate limit. No spend cap enforced at the application layer. No circuit breaker. Just an exposed key, a bot that found it, and a bill that went to collections.

The following month, developers on the Google AI Developers Forum were reporting unexpected multi-day billing spikes on Gemini 2.5 Pro with no application-level guard to catch them. The platform's billing alerts exist, but they fire after the damage is done.

Day 12 covered rate limiting as a security tool against abuse and brute-force attacks. Day 9 touched on it from an LLM prompt engineering angle. Today we go architectural — the decisions that determine whether your rate limiter actually works at production scale, survives infrastructure failures, and protects your budget as effectively as it protects your uptime.

This is where most vibe-coded systems fail: not from lack of rate limiting conceptually, but from choosing the wrong algorithm, placing limiters in the wrong layer, or building distributed systems that fall apart the moment Redis hiccups.


The Four Algorithms: A Decision Framework, Not a Menu

Every rate limiting tutorial shows you the four algorithms. Almost none of them tell you which one to actually pick. Here's the framework.

Fixed Window: Simpler Than You Think, Buggier Than You Remember

Fixed window divides time into discrete buckets — say, one-minute windows — and counts requests per bucket. When the count hits the limit, requests are rejected until the next window opens.

import redis
import time

r = redis.Redis(host='localhost', port=6379)

def is_allowed_fixed_window(user_id: str, limit: int, window_seconds: int) -> bool:
    # Key changes every window period
    window_key = int(time.time() / window_seconds)
    key = f"rl:fixed:{user_id}:{window_key}"
    
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds * 2)  # 2x TTL for safety
    results = pipe.execute()
    
    current_count = results[0]
    return current_count <= limit

The boundary burst problem is real and worth understanding. If your limit is 100 requests per minute and a user sends 100 requests at 11:59:30, then 100 more at 12:00:01, they've sent 200 requests in a 31-second span without triggering a single rejection. This isn't theoretical — it's what scrapers and abuse scripts are specifically designed to exploit.
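A quick in-memory simulation makes the boundary burst concrete. This is purely illustrative (single-process, no Redis), using the same bucketing logic as the fixed window implementation above:

```python
import math

def fixed_window_sim(timestamps: list[float], limit: int, window: int) -> int:
    """Count how many of the given request timestamps a fixed-window
    limiter would allow. Single-process, illustration only."""
    counts: dict[int, int] = {}
    allowed = 0
    for t in timestamps:
        bucket = math.floor(t / window)  # which window this request falls in
        counts[bucket] = counts.get(bucket, 0) + 1
        if counts[bucket] <= limit:
            allowed += 1
    return allowed

# 100 requests at 11:59:30 (second 690 of the hour), 100 more at 12:00:01
# (second 721): different windows, so all 200 pass a 100/min limit.
burst = [690.0] * 100 + [721.0] * 100
print(fixed_window_sim(burst, limit=100, window=60))  # 200
```

Send the same 200 requests inside a single window and exactly 100 are allowed — the limiter works as advertised everywhere except at the boundary.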

When to use it anyway: Fixed window is the right choice for internal service-to-service communication, background job queues, and any scenario where the burst boundary problem doesn't create a real threat. It's also the right starting point: at ~20 lines of code and 16 bytes per user in Redis, it's cheap. According to a production implementation study from October 2025, fixed window handles 90% of real-world cases acceptably and is far easier to reason about under failure conditions.

Sliding Window: Precision at a Price

Sliding window maintains a log of individual request timestamps and evaluates against a rolling lookback period rather than a fixed boundary.

import redis
import time

r = redis.Redis(host='localhost', port=6379)

# Uses Redis Sorted Set for O(log N) operations
def is_allowed_sliding_window(user_id: str, limit: int, window_seconds: int) -> bool:
    now = time.time()
    window_start = now - window_seconds
    key = f"rl:sliding:{user_id}"
    
    lua_script = """
    local key = KEYS[1]
    local now = tonumber(ARGV[1])
    local window_start = tonumber(ARGV[2])
    local limit = tonumber(ARGV[3])
    local window_seconds = tonumber(ARGV[4])
    
    -- Remove timestamps outside the window
    redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)
    
    -- Count current requests in window
    local count = redis.call('ZCARD', key)
    
    if count < limit then
        -- Add this request with timestamp as score
        redis.call('ZADD', key, now, now .. math.random())
        redis.call('EXPIRE', key, window_seconds * 2)
        return 1
    end
    
    return 0
    """
    
    result = r.eval(lua_script, 1, key, now, window_start, limit, window_seconds)
    return bool(result)

The Lua script is non-negotiable here. Without atomicity, two concurrent requests can both read count < limit, both add their timestamps, and both get allowed when only one should have been. This is a race condition that only shows up under load — the kind that gets blamed on "weird Redis behavior" in postmortems.

Memory cost: Sliding window stores 8 bytes per request per user. For 1 million users at a limit of 100 requests per window, that's 800 MB of Redis memory just for rate limit state. Compare that to 16 MB for fixed window. This is a real constraint, not a footnote.
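The back-of-envelope math is worth writing down once, since it drives the algorithm choice at scale (decimal MB, matching the figures above; real Redis sorted-set overhead per entry is higher, so treat these as floors):

```python
def sliding_window_memory_mb(users: int, limit: int, bytes_per_entry: int = 8) -> float:
    """Worst case: one timestamp per allowed request, per user,
    still inside the window. Decimal megabytes."""
    return users * limit * bytes_per_entry / 1e6

def fixed_window_memory_mb(users: int, bytes_per_counter: int = 16) -> float:
    """One small counter key per user."""
    return users * bytes_per_counter / 1e6

print(sliding_window_memory_mb(1_000_000, 100))  # 800.0
print(fixed_window_memory_mb(1_000_000))         # 16.0
```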

When to use it: External-facing APIs where abuse exploiting the fixed window boundary is a genuine threat. Payment endpoints. LLM API proxies where each token matters to your budget. Any system where "effectively rate limited" isn't good enough and you need precise enforcement.

Token Bucket: The Production Standard

Token bucket is what AWS API Gateway, Stripe, and most major API providers actually use. It models a bucket that refills at a constant rate — each request consumes one token, excess tokens accumulate up to the bucket's capacity, and requests are rejected when the bucket is empty.

The key insight is the burst allowance: tokens accumulate during quiet periods, letting legitimate users absorb traffic spikes without hitting limits. A developer testing their integration at 3 AM, then running a batch job the next morning, gets a better experience. A bot hammering your API at a steady rate hits the limit reliably.

// TypeScript implementation using Redis
import { createClient } from 'redis';

const client = createClient();
await client.connect();

const tokenBucketScript = `
  local key = KEYS[1]
  local capacity = tonumber(ARGV[1])
  local refill_rate = tonumber(ARGV[2])  -- tokens per second
  local now = tonumber(ARGV[3])
  local requested = tonumber(ARGV[4])
  
  local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
  local tokens = tonumber(bucket[1]) or capacity
  local last_refill = tonumber(bucket[2]) or now
  
  -- Calculate tokens earned since last request
  local elapsed = now - last_refill
  local new_tokens = math.min(capacity, tokens + (elapsed * refill_rate))
  
  if new_tokens >= requested then
    -- Allow: deduct tokens
    redis.call('HMSET', key, 'tokens', new_tokens - requested, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return {1, math.floor(new_tokens - requested)}
  else
    -- Deny: update last_refill but don't change tokens
    redis.call('HMSET', key, 'tokens', new_tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 3600)
    return {0, math.floor(new_tokens)}
  end
`;

async function isAllowed(
  userId: string, 
  capacity: number, 
  refillRate: number
): Promise<{ allowed: boolean; tokensRemaining: number }> {
  const now = Date.now() / 1000;  // Unix timestamp in seconds
  const result = await client.eval(
    tokenBucketScript,
    { keys: [`rl:token:${userId}`], arguments: [capacity, refillRate, now, 1].map(String) }
  ) as [number, number];
  
  return { allowed: result[0] === 1, tokensRemaining: result[1] };
}

Two parameters you must tune: bucket capacity (burst size) and refill rate (sustained throughput). A capacity of 100 with a refill rate of 100/minute means users get up to 100 instant requests, then a steady state of 100/minute. Get this wrong and you've either over-throttled legitimate users or under-throttled abusers. A reasonable starting point: capacity = limit × 1.5, refill rate = limit / window.
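The rule of thumb above, as a tiny helper (the function name is illustrative, not from any library):

```python
def tune_token_bucket(limit: int, window_seconds: int) -> tuple[float, float]:
    """Starting-point token bucket parameters:
    capacity = limit * 1.5, refill rate = limit / window (tokens/sec)."""
    capacity = limit * 1.5
    refill_rate = limit / window_seconds
    return capacity, refill_rate

# A 100 requests/minute policy -> burst up to 150, sustained ~1.67 tokens/sec
print(tune_token_bucket(100, 60))
```

Feed the two values into the `capacity` and `refillRate` arguments of the `isAllowed` function above, then adjust from observed traffic.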

Leaky Bucket: When You Need a Steady Drip

Leaky bucket is the inverse of token bucket conceptually. Requests enter a queue (the bucket), and the queue drains at a constant rate. If the queue is full, new requests are dropped. Unlike token bucket, leaky bucket produces perfectly metered output — exactly N requests per second, always.

When this matters: You're proxying to a downstream service that can't handle bursts — a legacy internal API, a third-party service with strict per-second limits, or any system where you need smooth traffic shaping rather than burst absorption. Leaky bucket is not the right choice for end-user-facing APIs because it means a user's burst of 5 fast requests gets queued and delayed even if they're well within their hourly limit.

The architecture note: Leaky bucket in a distributed system requires simulating a queue in Redis, which doesn't have native queue semantics with the precise timing you need. The implementation complexity is significant. If you need leaky bucket behavior, consider whether a message queue (SQS, RabbitMQ) with a single consumer at a fixed poll rate is a better fit than a custom Redis implementation.
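For completeness, here is what single-process leaky bucket logic looks like — a sketch only, in-memory and deliberately not distributed, since (as noted above) the distributed version is usually better served by a real queue. Each accepted request is assigned the next available "drip" slot; requests that would queue beyond capacity are dropped:

```python
class LeakyBucket:
    """Single-process leaky bucket sketch: output drains at a fixed
    rate; requests beyond the queue capacity are rejected."""

    def __init__(self, rate_per_sec: float, capacity: int):
        self.interval = 1.0 / rate_per_sec  # seconds between drips
        self.capacity = capacity            # max queued requests
        self.next_slot = 0.0                # earliest time the next drip may fire

    def try_enqueue(self, now: float) -> tuple[bool, float]:
        # The next slot is either now or the tail of the current queue
        slot = max(now, self.next_slot)
        queued = (slot - now) / self.interval  # how many requests are waiting
        if queued >= self.capacity:
            return False, 0.0                  # bucket full: drop
        self.next_slot = slot + self.interval
        return True, slot - now                # accepted; delay before dispatch

# 2 requests/sec drain rate, queue depth 3: a burst of 5 at t=0 gets
# 3 accepted (with increasing dispatch delays) and 2 dropped.
lb = LeakyBucket(rate_per_sec=2.0, capacity=3)
print([lb.try_enqueue(0.0) for _ in range(5)])
```

Note the contrast with token bucket: even accepted requests are delayed to match the drain rate, which is exactly why this shape suits downstream protection and frustrates end users.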


Where to Put Your Rate Limiter: The Layer Decision

This is where most architecture decisions go wrong. Picking the right algorithm is secondary to placing your rate limiter in the right layer. Here's the decision framework:

Request →  [Edge/CDN]  →  [API Gateway]  →  [App Middleware]  →  [Service Logic]
               ↑               ↑                  ↑                    ↑
           IP-based        Auth-based          User-based          Resource-based
           Cheap/Fast      Medium              Precise             Expensive

Edge (Cloudflare, Fastly, CloudFront)

Edge rate limiting happens before requests hit your infrastructure. In September 2025, Cloudflare introduced the IETF-standard rate limiting headers Ratelimit and Ratelimit-Policy, so clients can proactively back off before hitting limits. This is the right place for:

  • IP-based blocking of obvious scrapers and DDoS traffic
  • Geographic rate limiting (100 req/min from US, 10 req/min from Tor exit nodes)
  • Bot detection before requests burn your compute budget

The limitation: edge rate limiters don't know about your users. They can't distinguish between a legitimate power user and an abusive one sharing the same corporate NAT IP. Use edge for volume protection, not precision enforcement.

// Cloudflare Workers KV-backed rate limiter (simplified)
// Deployed at the edge, not your servers
// Note: KV get/put is not atomic, so this is coarse volume protection;
// precise per-user enforcement belongs at the gateway layer
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const ip = request.headers.get('CF-Connecting-IP') ?? 'unknown';
    const key = `rl:ip:${ip}`;
    
    const count = await env.RATE_LIMIT_KV.get(key);
    const currentCount = count ? parseInt(count) : 0;
    
    if (currentCount >= 100) {
      return new Response('Rate limit exceeded', { 
        status: 429,
        headers: { 'Retry-After': '60' }
      });
    }
    
    await env.RATE_LIMIT_KV.put(key, String(currentCount + 1), { expirationTtl: 60 });
    return fetch(request);
  }
};

API Gateway / Middleware

This is where authentication context is available. After verifying a JWT or API key, you know who is making the request — not just where it's coming from. This is the correct layer for:

  • Per-user limits based on subscription tier
  • Per-endpoint limits (login endpoint: 5/min; search: 60/min)
  • Tenant-level limits in multi-tenant SaaS

# FastAPI middleware example
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware
import redis.asyncio as redis

class RateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, redis_url: str):
        super().__init__(app)
        self.redis = redis.from_url(redis_url)
        
        # Endpoint-specific limits: (requests, window_seconds)
        self.limits = {
            '/api/auth/login': (5, 60),      # 5 per minute
            '/api/search': (60, 60),          # 60 per minute  
            '/api/ai/generate': (10, 60),     # 10 per minute
            'default': (100, 60),             # 100 per minute
        }
    
    async def dispatch(self, request: Request, call_next):
        user_id = getattr(request.state, 'user_id', None)
        if not user_id:
            return await call_next(request)  # unauthenticated handled at edge
        
        path = request.url.path
        limit, window = self.limits.get(path, self.limits['default'])
        
        allowed = await self.check_rate_limit(user_id, path, limit, window)
        if not allowed:
            # Return the response directly: exceptions raised inside
            # BaseHTTPMiddleware bypass FastAPI's exception handlers
            return JSONResponse(
                status_code=429,
                content={'error': 'rate_limit_exceeded', 'retry_after': window},
                headers={'Retry-After': str(window)}
            )
        
        return await call_next(request)
    
    async def check_rate_limit(self, user_id: str, path: str, 
                                limit: int, window: int) -> bool:
        key = f"rl:{user_id}:{path}"
        # INCR is atomic; the expire-on-first-increment pattern leaves a
        # small window where a crash can strand a key without a TTL, so
        # monitor for keys that never expire
        current = await self.redis.incr(key)
        if current == 1:
            await self.redis.expire(key, window)
        return current <= limit

Application Layer

The application layer is for resource-specific rate limiting that requires business logic: limiting concurrent AI generation requests per user, preventing duplicate order submissions, or capping how many rows a user can export per day. These limits can't live at the middleware layer because they depend on data that only the application understands.

The important architectural rule: You want multiple layers, not one. Edge blocks the flood. Gateway enforces user contracts. Application protects specific resources. Each layer handles what it's best positioned to handle.


Distributed Rate Limiting: The Hard Problems

A rate limiter that works on a single server is just a dictionary with a TTL. The interesting problems begin when you have multiple instances behind a load balancer.

The Consistency Problem

If you have 5 app servers each maintaining their own in-memory rate limit state, a user with a 100 requests/minute limit effectively gets 500 requests/minute — one full limit per server. Redis solves the shared state problem, but introduces new ones.

As explored in a February 2026 academic paper on scalable rate limiting, the fundamental CAP theorem trade-off for rate limiting systems is: choose AP over CP. Favor availability and partition tolerance over strict consistency. Here's why:

  • If Redis is unavailable and you fail-closed (block all requests), your API goes down
  • If Redis is unavailable and you fail-open (allow all requests), you temporarily over-serve
  • For most use cases, brief over-serving during a Redis partition is preferable to a full outage

The exception: if you're rate limiting to protect billing (e.g., capping AI API spend), fail-open can be expensive. In that case, you need local fallback limits that are more conservative than your real limits.

import time
import redis

class ResilientRateLimiter:
    def __init__(self, redis_client, fallback_limit_multiplier=0.1):
        self.redis = redis_client
        self.fallback_multiplier = fallback_limit_multiplier
        
        # Local in-memory fallback (per-instance)
        self.local_counts = {}
    
    async def is_allowed(self, key: str, limit: int, window: int) -> bool:
        try:
            # Try Redis first (shared, accurate)
            return await self._redis_check(key, limit, window)
        except redis.RedisError:
            # Fallback: use conservative local limit
            # 10% of real limit per instance = still within budget at scale
            local_limit = max(1, int(limit * self.fallback_multiplier))
            return self._local_check(key, local_limit, window)
    
    def _local_check(self, key: str, limit: int, window: int) -> bool:
        now = time.time()
        if key not in self.local_counts:
            self.local_counts[key] = []
        
        # Prune old entries
        self.local_counts[key] = [
            t for t in self.local_counts[key] 
            if now - t < window
        ]
        
        if len(self.local_counts[key]) < limit:
            self.local_counts[key].append(now)
            return True
        return False

The Memory Leak Problem

This one surprises people in production. A Redis-backed rate limiter with sliding window creates a key for every user. Without TTLs, these keys accumulate forever. A 2025 production implementation report documented 10 GB of Redis memory consumed by rate limit state for 100,000 users after just one week — keys with no expiration.

The fix is always EXPIRE, set to at least 2x your window size. The Lua scripts above include this, but it's easy to miss when you're iterating quickly. Add Redis memory monitoring and alert when rate limit key memory exceeds a threshold.

Connection Pool is Not Optional

Also from production experience: naively creating a new Redis connection per request drops throughput to ~200 requests/second. With connection pooling (Lettuce, ioredis, redis-py's connection pool), the same hardware handles 50,000+ requests/second. This is a 250x difference. Do not deploy a Redis-backed rate limiter without configuring a connection pool.

# redis-py with connection pool
import socket
import redis

pool = redis.ConnectionPool(
    host='your-redis-host',
    port=6379,
    max_connections=32,  # tune based on concurrency
    socket_keepalive=True,
    socket_keepalive_options={  # keys are socket-level constants (Linux)
        socket.TCP_KEEPIDLE: 1,
        socket.TCP_KEEPINTVL: 3,
        socket.TCP_KEEPCNT: 5
    }
)

# All rate limiter instances share this pool
r = redis.Redis(connection_pool=pool)

Rate Limiting as a Cost Firewall

This is where the $55,444 lesson applies directly. Rate limiting isn't just about protecting your service from overload — it's about protecting your wallet from runaway usage. The architecture changes slightly when cost protection is the primary goal.

Tiered Cost Limits

For any service that calls an LLM API (or any metered service), implement two layers of limits:

  1. Request rate limits — standard token bucket per user
  2. Token/cost budget limits — daily or monthly spend cap per user and globally

from datetime import datetime

class CostAwareRateLimiter:
    def __init__(self, redis_client):
        self.redis = redis_client
        
        # Per-user daily token budget
        self.daily_token_limits = {
            'free': 50_000,
            'pro': 500_000,
            'enterprise': 5_000_000,
        }
        
        # Global daily safety cap (protects your bill)
        self.global_daily_limit = 10_000_000  # 10M tokens/day hard stop
    
    async def check_and_record_usage(
        self, 
        user_id: str, 
        tier: str, 
        estimated_tokens: int
    ) -> dict:
        today = datetime.utcnow().strftime('%Y-%m-%d')
        user_key = f"budget:{user_id}:{today}"
        global_key = f"budget:global:{today}"
        
        user_limit = self.daily_token_limits.get(tier, self.daily_token_limits['free'])
        
        # Atomic check-and-increment for both user and global
        lua_check = """
        local user_key = KEYS[1]
        local global_key = KEYS[2]
        local requested = tonumber(ARGV[1])
        local user_limit = tonumber(ARGV[2])
        local global_limit = tonumber(ARGV[3])
        
        local user_used = tonumber(redis.call('GET', user_key) or 0)
        local global_used = tonumber(redis.call('GET', global_key) or 0)
        
        if user_used + requested > user_limit then
            return {0, 'user_limit_exceeded', user_used, user_limit}
        end
        
        if global_used + requested > global_limit then
            return {0, 'global_limit_exceeded', global_used, global_limit}
        end
        
        redis.call('INCRBY', user_key, requested)
        redis.call('EXPIRE', user_key, 86400)
        redis.call('INCRBY', global_key, requested)
        redis.call('EXPIRE', global_key, 86400)
        
        return {1, 'ok', user_used + requested, user_limit}
        """
        
        result = await self.redis.eval(
            lua_check, 
            2, user_key, global_key,
            estimated_tokens, user_limit, self.global_daily_limit
        )
        
        return {
            'allowed': bool(result[0]),
            'reason': result[1],
            'used': result[2],
            'limit': result[3],
            'remaining': max(0, result[3] - result[2])
        }

The Cloudflare AI Gateway Pattern

For agentic workloads, the rate limiting problem shifts. As Cloudflare documented in March 2026, traditional IP-based rate limiting fails when multiple agents share the same origin IP. The correct pattern is identity-keyed rate limiting:

  • Rate limit by agent identity header, not IP
  • Set separate limits on write operations (database mutations, external API calls) vs read operations — write limits should be tighter
  • Implement token budget limits per agent session, not just per minute
  • Set a cost ceiling per gateway (daily/monthly) that triggers a 429 with Retry-After when hit

This is the same principle applied to AI agents that applies to human users — but the blast radius when an agent goes rogue is much larger, and it happens in seconds rather than minutes.
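A minimal sketch of the identity-keyed pattern, assuming a hypothetical `X-Agent-Id` header (the header name and the limit values here are illustrative, not from the Cloudflare documentation):

```python
WRITE_METHODS = {'POST', 'PUT', 'PATCH', 'DELETE'}

# Hypothetical limits: writes get a much tighter budget than reads
AGENT_LIMITS = {'read': (300, 60), 'write': (30, 60)}  # (requests, window_s)

def agent_limit_key(headers: dict[str, str], method: str) -> tuple[str, int, int]:
    """Derive the rate-limit key and limits from agent identity, not IP.
    Unidentified agents fall into a shared 'anonymous' bucket."""
    agent_id = headers.get('X-Agent-Id', 'anonymous')
    op = 'write' if method.upper() in WRITE_METHODS else 'read'
    limit, window = AGENT_LIMITS[op]
    return f"rl:agent:{agent_id}:{op}", limit, window

print(agent_limit_key({'X-Agent-Id': 'agent-42'}, 'POST'))
# ('rl:agent:agent-42:write', 30, 60)
```

The returned key plugs into any of the Redis-backed checks above; the point is that the key and the limit are both derived from identity and operation type, never from the origin IP.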


Communicating Limits to Clients

A rate limiter that silently rejects requests is an infrastructure mystery. Good rate limiting architecture includes standardized response headers:

from fastapi import Response

async def add_rate_limit_headers(
    response: Response,
    limit: int,
    remaining: int,
    reset_at: float,
    retry_after: int | None = None
):
    """Add IETF-standard rate limit headers (RFC draft 7)"""
    response.headers['RateLimit-Limit'] = str(limit)
    response.headers['RateLimit-Remaining'] = str(remaining)
    response.headers['RateLimit-Reset'] = str(int(reset_at))
    
    if retry_after is not None:
        response.headers['Retry-After'] = str(retry_after)
        response.headers['X-RateLimit-Reason'] = 'rate_limit_exceeded'

When a client receives RateLimit-Remaining: 5, they can back off proactively. When they get Retry-After: 30, they know exactly how long to wait. This reduces the thundering herd problem — instead of all clients immediately retrying on a 429, they stagger their retries based on the header.
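On the client side, honoring these headers is a small amount of logic. A sketch, assuming RateLimit-Reset carries a Unix timestamp as in the server helper above (the IETF draft also permits delta-seconds, so check what your server emits):

```python
import time

def backoff_seconds(status: int, headers: dict[str, str]) -> float:
    """How long a well-behaved client should wait before its next
    request, based on standard rate limit headers. 0 means go now."""
    if status == 429:
        # Server told us exactly how long to wait
        return float(headers.get('Retry-After', 1))
    remaining = int(headers.get('RateLimit-Remaining', 1))
    if remaining <= 0:
        # Quota exhausted: wait until the reported reset time
        return max(0.0, float(headers.get('RateLimit-Reset', 0)) - time.time())
    return 0.0

print(backoff_seconds(429, {'Retry-After': '30'}))  # 30.0
```

Adding a small random jitter to the returned delay further spreads out retries across clients.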

The caveat from production: returning tokensRemaining lets sophisticated clients game the system by timing requests to use exactly their full quota. For external-facing APIs where abuse is a concern, you may want to return remaining counts only when they're below a threshold (e.g., <20% remaining).


The Architecture Decision Tree

When you're designing rate limiting for a new system, work through these questions in order:

1. What are you protecting against?

  • Cost overrun → token bucket with daily budget caps at application layer
  • Abuse/scraping → sliding window at API gateway + edge IP blocking
  • Downstream service protection → leaky bucket at the integration boundary
  • General fairness → fixed window per user at middleware

2. How many instances will be behind your load balancer?

  • One → in-memory is fine, but build Redis now before you need it
  • Two or more → Redis-backed only, never in-memory

3. What's your Redis failure mode?

  • Fail-open → acceptable for most APIs, not for billing protection
  • Fail-closed → appropriate when over-serving is catastrophically expensive
  • Local fallback → best for billing-sensitive AI API proxies (conservative local limit)

4. Do you need burst tolerance?

  • Yes → token bucket
  • No, need precise enforcement → sliding window
  • Don't care, need simplicity → fixed window

5. Are you rate limiting AI agent identities or humans?

  • Humans → per-user request limits + daily token budget
  • Agents → per-agent-identity limits + per-session token budget + asymmetric write limits

Checklist

  • Choose your algorithm based on the threat, not the tutorial. Fixed window for internal services. Token bucket for user-facing APIs. Sliding window when boundary bursts are a real exploit.
  • Layer your rate limiters. Edge for IP-based flood protection. Gateway for per-user limits. Application for resource-specific constraints.
  • All Redis operations must be atomic. Use Lua scripts or transactions. A non-atomic rate limiter under concurrent load is not a rate limiter.
  • Configure TTLs on every Redis key. EXPIRE at 2x your window minimum. Monitor Redis memory used by rate limit keys separately.
  • Configure a connection pool. No exceptions. New-connection-per-request = ~200 req/s ceiling.
  • Implement a Redis failure strategy. Fail-open, fail-closed, or local fallback — document the choice and test it explicitly.
  • Add token/cost budget limits for any LLM or metered API calls. Request rate limits alone don't protect your bill.
  • Return IETF-standard rate limit headers. RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, and Retry-After on 429s.
  • Set a global daily spend cap in addition to per-user limits. One compromised account shouldn't be able to exhaust your monthly budget.
  • Test your rate limiter under concurrent load. Race conditions only appear at scale. Use Locust, k6, or Artillery before you go to production.
  • For agentic workloads: rate limit by agent identity, not IP. Set tighter limits on write operations than read operations.

Ask The Guild

Here's a real architectural decision that has no universal answer: where do you draw the line between per-user fairness limits and global cost-protection limits?

Specifically: if a user is on your "Pro" tier and has been promised high-volume API access, but your global daily cap is about to be hit because three other Pro users ran large batch jobs today — do you enforce the global cap and disappoint the fourth user? Or do you accept the overage cost to honor the commitment?

This is a product decision masquerading as an architecture question, and the answer changes how you structure your rate limiting layers.

Share in the Guild Discord: What's your strategy for global cost caps vs per-user fairness? Have you been burned by missing rate limits on an LLM API call? What's your Redis failure mode strategy — fail-open or fail-closed — and has it ever cost you?


About Tom Hundley

Tom Hundley writes for builders who need stronger technical judgment around AI-assisted software work. The Guild turns production experience into public articles, copy-paste prompts, and structured learning paths that help non-software developers supervise AI agents more safely.
