Error Handling Architecture: Boundaries, Retry, Fallback
Architecture Patterns — Part 19 of 30
The Night Retry Logic Made Everything Worse
October 19, 2025. 11:48 PM Pacific time. A latent race condition in AWS's DNS automation triggers. Two DNS Enactor processes conflict — one slow, one fast — and the fast one wins, then runs a cleanup routine that deletes all DNS records for the DynamoDB endpoint in us-east-1. In seconds, DynamoDB is unreachable for the entire region.
This is bad. But what happens next is worse.
AWS engineers identify the problem. They restore the correct DNS records by 2:25 AM. By 2:40 AM, external applications can reach DynamoDB again. Recovery should be nearly complete.
Instead, the outage continues for thirteen more hours.
Here's why: the moment DNS resolved, millions of EC2 instances and Lambda functions — every one that had been retrying failed DynamoDB connections — simultaneously flooded the database control plane. Connection requests that had been queuing for hours all fired at once. The retry storm overwhelmed the recovering infrastructure. DNS failed again. The cycle repeated. What should have been a 2-3 hour incident stretched to 15 hours and affected over 1,000 companies worldwide.
The engineers who built those retry loops weren't negligent. Every system that retried was doing exactly what it was designed to do: keep trying until it succeeds. The failure wasn't in the individual retry implementations. The failure was architectural — nobody had designed the collective behavior of all those retriers operating simultaneously under partial recovery.
This is the problem space we're working in today. Error handling architecture isn't about writing good try/catch blocks. It's about designing systems that fail safely, recover gracefully, and don't turn someone else's bad day into a catastrophe.
Four Layers, One Framework
Error handling architecture operates at four distinct layers, each with different tools and different failure modes:
- Boundaries — where errors are caught, contained, and isolated from the rest of the system
- Retry — when and how to re-attempt failed operations without creating new failures
- Circuit breakers — when to stop retrying and fast-fail, so a struggling dependency gets room to recover
- Fallback — what to serve or do when an operation cannot succeed
Most engineers think about these in isolation. The architecture challenge is designing how they interact — because a boundary without a good fallback is just a blank screen, and a retry without a circuit breaker is a weapon pointed at your own infrastructure.
Layer 1: Error Boundaries
An error boundary answers one question: how far does this failure propagate?
By default, unhandled errors propagate as far as the runtime allows — which is usually to the top, taking everything with them. Error boundaries are explicit contracts that say: the failure stops here, and here's what we serve instead.
In the Browser: React Error Boundaries
Before React 16, a JavaScript error in any component would corrupt the entire component tree and leave users staring at a blank screen. React Error Boundaries fix this by catching errors during rendering, in lifecycle methods, and in constructors — before they propagate up the tree.
The pattern is a class component implementing getDerivedStateFromError and componentDidCatch:
// ErrorBoundary.tsx
import React, { Component, ErrorInfo, ReactNode } from 'react';

interface Props {
  children: ReactNode;
  fallback?: ReactNode;
  boundaryName: string;
  onError?: (error: Error, info: ErrorInfo) => void;
}

interface State {
  hasError: boolean;
  error: Error | null;
}

export class ErrorBoundary extends Component<Props, State> {
  state: State = { hasError: false, error: null };

  static getDerivedStateFromError(error: Error): State {
    return { hasError: true, error };
  }

  componentDidCatch(error: Error, info: ErrorInfo): void {
    // Report to your observability stack
    console.error(`[${this.props.boundaryName}] caught:`, error, info.componentStack);
    this.props.onError?.(error, info);
  }

  handleReset = () => this.setState({ hasError: false, error: null });

  render(): ReactNode {
    if (this.state.hasError) {
      return this.props.fallback ?? (
        <div role="alert" style={{ padding: '1rem', textAlign: 'center' }}>
          <p>This section is temporarily unavailable.</p>
          <button onClick={this.handleReset}>Try Again</button>
        </div>
      );
    }
    return this.props.children;
  }
}
The critical architectural decision isn't how to write the boundary — it's where to place it. Three levels cover virtually every production scenario:
// App.tsx — Three-layer boundary placement
function App() {
  return (
    // Layer 1: App-level last resort — catches anything that escapes
    <ErrorBoundary boundaryName="app-root" fallback={<AppCrashFallback />}>
      <Navbar />
      <Routes>
        {/* Layer 2: Route-level — each page fails independently */}
        <Route
          path="/dashboard"
          element={
            <ErrorBoundary boundaryName="dashboard-page">
              <Dashboard />
            </ErrorBoundary>
          }
        />
      </Routes>
      <Footer />
    </ErrorBoundary>
  );
}

function Dashboard() {
  return (
    <div>
      {/* Layer 3: Widget-level — independent features fail independently */}
      <ErrorBoundary boundaryName="revenue-chart" fallback={<ChartUnavailable />}>
        <RevenueChart />
      </ErrorBoundary>
      <ErrorBoundary boundaryName="user-activity" fallback={<ActivityUnavailable />}>
        <UserActivityFeed />
      </ErrorBoundary>
    </div>
  );
}
A critical limitation: React Error Boundaries don't catch async errors — errors thrown in event handlers, setTimeout callbacks, or rejected promises (such as a failed fetch inside useEffect) escape the boundary. Bridge this gap with a useAsyncError hook:
// useAsyncError.ts — route async errors into the nearest boundary
import { useState, useCallback, useEffect } from 'react';

export function useAsyncError() {
  const [, setError] = useState<Error>();
  return useCallback((error: Error) => {
    setError(() => { throw error; });
  }, []);
}

// Usage in a data-fetching component
function ProductList() {
  const throwError = useAsyncError();
  useEffect(() => {
    fetch('/api/products')
      .then(res => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json();
      })
      .catch(throwError); // Routes error into nearest ErrorBoundary
  }, [throwError]);
  // ...
}
In Services: Python Exception Boundaries
The same principle applies server-side. The question is the same: how far does this failure propagate?
# service_boundaries.py
import logging
from functools import wraps
from typing import TypeVar, Callable, Optional

logger = logging.getLogger(__name__)
T = TypeVar('T')

def service_boundary(fallback_value=None, reraise: bool = False):
    """Decorator that defines an error boundary for a service call."""
    def decorator(func: Callable[..., T]) -> Callable[..., Optional[T]]:
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                logger.error(
                    f"Service boundary caught error in {func.__name__}",
                    exc_info=True,
                    extra={"function": func.__name__, "error_type": type(e).__name__}
                )
                if reraise:
                    raise
                return fallback_value
        return wrapper
    return decorator

# Usage: the boundary is explicit, the fallback is declared at the call site
@service_boundary(fallback_value=[])
def get_recommended_products(user_id: str) -> list:
    """Fetches recommendations — safe to degrade to empty list."""
    return recommendation_service.get(user_id)

@service_boundary(fallback_value=None, reraise=True)
def process_payment(order_id: str, amount: float):
    """Payment cannot silently fail — always reraise."""
    return payment_gateway.charge(order_id, amount)
The architecture insight here: the fallback value is a design decision, not an implementation detail. get_recommended_products returning [] is acceptable — the page renders without recommendations. process_payment returning None silently is catastrophic. The boundary makes that contract explicit in the code.
Layer 2: Retry — The Dangerous Kindness
Retry logic is probably the most misunderstood reliability pattern. Every engineer instinctively reaches for it. Most implementations create new problems.
The AWS October 2025 outage and the June 2025 GCP outage share the same amplifier: when partial recovery happens, all the waiting retriers synchronize and flood the recovering system simultaneously. This is the retry storm problem.
The three mechanisms that prevent retry storms are exponential backoff, jitter, and circuit breakers.
Exponential Backoff with Jitter
Naïve retry: wait 1 second, then try again. On every retry.
The problem: if 10,000 clients all failed at the same moment, they all retry at the same moment. You've replaced one spike with a series of synchronized spikes.
Exponential backoff: wait 1s, then 2s, then 4s, then 8s. The retries spread out over time.
Still the problem: if 10,000 clients all failed at the same moment, they still all wait exactly 1 second, then exactly 2 seconds. Synchronized.
Jitter: add randomness to the wait time. Now 10,000 clients spread their retries across the window instead of piling onto the exact same moment.
# retry.py — production-grade retry with exponential backoff and jitter
import logging
import random
import time
from typing import TypeVar, Callable, Type, Tuple

logger = logging.getLogger(__name__)
T = TypeVar('T')

def retry_with_backoff(
    func: Callable[..., T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,),
) -> T:
    """
    Retry with full jitter: random(0, min(cap, base * 2^attempt)).
    Full jitter is generally preferred over equal jitter for preventing
    synchronized retry storms across many clients.
    """
    last_exception = None
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable_exceptions as e:
            last_exception = e
            if attempt == max_attempts - 1:
                break  # No more retries
            # Exponential backoff with full jitter
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep_time = random.uniform(0, cap)
            logger.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {sleep_time:.2f}s..."
            )
            time.sleep(sleep_time)
    raise last_exception

# Usage
try:
    result = retry_with_backoff(
        lambda: database.query("SELECT ..."),
        max_attempts=3,
        base_delay=1.0,
        max_delay=30.0,
        retryable_exceptions=(ConnectionError, TimeoutError),
    )
except Exception as e:
    logger.error("All retry attempts exhausted", exc_info=True)
    # Fall through to fallback logic
What Not to Retry
This is the decision that distinguishes good retry architecture from bad:
| Error type | Retry? | Reason |
|---|---|---|
| Network timeout | Yes, with backoff | Transient — may succeed on retry |
| HTTP 503 (Service Unavailable) | Yes, with backoff | Server overloaded — give it time |
| HTTP 429 (Too Many Requests) | Yes — honor Retry-After header | Respect the rate limiter |
| HTTP 500 (Internal Server Error) | Sometimes — depends on idempotency | Server-side bug may be transient |
| HTTP 400 (Bad Request) | Never | Client sent invalid data — won't change |
| HTTP 401 / 403 (Auth error) | Never | Permission problem — won't resolve on retry |
| HTTP 404 (Not Found) | Never | Resource doesn't exist |
| Non-idempotent mutation (POST) | Only with idempotency keys | Risk of duplicate writes |
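The 429 row deserves a concrete sketch: honor the server's Retry-After header when it is present, and fall back to full-jitter backoff when it isn't. The helper below is a minimal illustration (the function name and defaults are mine, not a standard API), and it handles only the delay-in-seconds form of the header, not the HTTP-date form:

```python
import random
from typing import Optional


def next_delay(attempt: int, retry_after: Optional[str] = None,
               base_delay: float = 1.0, max_delay: float = 30.0) -> float:
    """Pick the wait before the next retry attempt (0-indexed).

    If the server sent Retry-After as a number of seconds, respect it
    (capped at max_delay); otherwise use full-jitter exponential backoff.
    """
    if retry_after is not None:
        try:
            return min(max_delay, float(retry_after))
        except ValueError:
            pass  # HTTP-date form not parsed in this sketch; fall through
    cap = min(max_delay, base_delay * (2 ** attempt))
    return random.uniform(0, cap)
```

The cap matters even when honoring the header: a misbehaving server can send an absurd Retry-After, and your client should not block for an hour because of it.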
The critical insight: retrying non-idempotent operations without idempotency keys creates duplicate data. A payment that retries can charge a customer twice. Always pass an idempotency key for mutations:
// Idempotent retry for payment mutations
async function chargeWithIdempotency(
  orderId: string,
  amount: number
): Promise<PaymentResult> {
  // Generate a stable key — the same order and amount always produce the same key
  const idempotencyKey = `charge-${orderId}-${amount}`;
  return retryWithBackoff(() =>
    fetch('/api/payments', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Idempotency-Key': idempotencyKey,
      },
      body: JSON.stringify({ orderId, amount }),
    }).then(res => res.json())
  );
}
Layer 3: Circuit Breakers — Teaching Systems to Give Up
Retry with backoff handles the client side of recovery. Circuit breakers handle the dependency side — they prevent a client from endlessly hammering a service that is clearly not going to recover.
The circuit breaker state machine has three states:
- Closed: Normal operation. Calls flow through. Failures are counted.
- Open: The failure threshold was crossed. Calls are rejected immediately (fast-fail) without touching the dependency. The dependency gets time to recover.
- Half-Open: After a timeout window, one probe request is allowed through. If it succeeds, the breaker closes. If it fails, the breaker reopens.
Research published in the International Journal of Scientific Research found that circuit breaker patterns reduce cascading failures by 83.5% in production environments. That number is intuitive once you understand the mechanism: the breaker caps the retry load that reaches a struggling dependency, giving it the headroom to actually recover.
Here's a production-grade Python implementation:
# circuit_breaker.py
import threading
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, TypeVar, Optional

T = TypeVar('T')

class CircuitOpenError(Exception):
    """Raised when the breaker fast-fails instead of calling the dependency."""
    pass

class BreakerState(Enum):
    CLOSED = "closed"        # Normal: calls pass through
    OPEN = "open"            # Tripped: fast-fail
    HALF_OPEN = "half_open"  # Testing: one probe allowed

@dataclass
class CircuitBreaker:
    name: str
    failure_threshold: int = 5       # Trip after 5 failures
    recovery_timeout: float = 30.0   # Try recovery after 30s
    success_threshold: int = 2       # 2 successes to close from half-open
    _state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _failure_count: int = field(default=0, init=False)
    _success_count: int = field(default=0, init=False)
    _last_failure_time: Optional[float] = field(default=None, init=False)
    _lock: threading.Lock = field(default_factory=threading.Lock, init=False)

    def call(self, func: Callable[..., T], *args, **kwargs) -> T:
        with self._lock:
            state = self._get_current_state()
            if state == BreakerState.OPEN:
                raise CircuitOpenError(
                    f"Circuit '{self.name}' is OPEN — "
                    f"dependency unavailable, fast-failing"
                )
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _get_current_state(self) -> BreakerState:
        # Caller must hold self._lock
        if (self._state == BreakerState.OPEN and
                self._last_failure_time and
                time.monotonic() - self._last_failure_time > self.recovery_timeout):
            self._state = BreakerState.HALF_OPEN
            self._success_count = 0
        return self._state

    def _on_success(self):
        with self._lock:
            if self._state == BreakerState.HALF_OPEN:
                self._success_count += 1
                if self._success_count >= self.success_threshold:
                    self._state = BreakerState.CLOSED
                    self._failure_count = 0
            elif self._state == BreakerState.CLOSED:
                self._failure_count = 0  # Reset on any success

    def _on_failure(self):
        with self._lock:
            self._last_failure_time = time.monotonic()
            if self._state == BreakerState.HALF_OPEN:
                self._state = BreakerState.OPEN  # Immediately reopen
            elif self._state == BreakerState.CLOSED:
                self._failure_count += 1
                if self._failure_count >= self.failure_threshold:
                    self._state = BreakerState.OPEN

# Usage — one breaker per dependency, not per request
payment_breaker = CircuitBreaker(
    name="payment-gateway",
    failure_threshold=5,
    recovery_timeout=60.0,
)

def charge_user(user_id: str, amount: float):
    try:
        return payment_breaker.call(
            payment_gateway.charge,
            user_id,
            amount
        )
    except CircuitOpenError:
        # The breaker is open — fall through to fallback
        return queue_for_retry(user_id, amount)
In Node.js, Opossum provides a production-tested circuit breaker; on the JVM, Resilience4j is the standard choice. For most teams, reach for a library before rolling your own.
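Retry and the breaker compose in a specific order: the retry loop wraps the breaker call, and CircuitOpenError is deliberately excluded from the retryable exceptions, so an open breaker aborts the whole loop instead of being retried against. A stripped-down sketch of that composition (the function and its parameters are illustrative, not a library API):

```python
import random
import time


class CircuitOpenError(Exception):
    """Fast-fail signal from the breaker, deliberately non-retryable."""


def call_with_retry_and_breaker(func, is_open, max_attempts=3,
                                base_delay=0.01, max_delay=0.05):
    """Retry func with full-jitter backoff, but abort immediately
    if is_open() reports the circuit breaker has tripped."""
    last_exception = None
    for attempt in range(max_attempts):
        if is_open():
            # Breaker open: do not burn retries against a known-bad dependency
            raise CircuitOpenError("dependency unavailable, fast-failing")
        try:
            return func()
        except Exception as e:
            last_exception = e
            if attempt < max_attempts - 1:
                cap = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0, cap))
    raise last_exception
```

In practice the open check lives inside the breaker's call method, as in the class above; the point is the ordering, retry outside and breaker inside, so that one open circuit silences all pending attempts at once.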
Layer 4: Fallbacks — Designing the Degraded Experience
Every error boundary needs to answer the same question: what do we actually serve when this fails?
Fallback design is a product decision disguised as an engineering problem. The options, roughly in order of degradation:
- Cached stale data — serve the last successful response. Works for read-heavy features where slight staleness is acceptable.
- Partial response — return what you can, omit what you can't. A product page loads even if recommendations fail.
- Default/empty state — return a sensible zero state. Unread count shows 0. Recommendations show nothing.
- Graceful service degradation — disable the feature entirely, tell the user clearly why.
- Queue for later — accept the request, process it asynchronously when the dependency recovers.
The June 2025 GCP outage is instructive here. A misconfigured quota policy triggered a null pointer exception that crashed Service Control globally, taking down Gmail, Drive, Meet, and most GCP APIs. Services that had local caches or async fallback queues recovered quickly for their users. Services that had no fallback strategy — hard dependency on GCP APIs with no local cache — were completely dark for the 3-6 hour duration.
The pattern in Python:
# fallback_strategy.py
import json
from datetime import timedelta

import redis

cache = redis.Redis(host='localhost', port=6379)

def get_user_recommendations(user_id: str) -> list:
    cache_key = f"recs:{user_id}"
    try:
        # 1. Try the recommendation service
        recs = recommendation_breaker.call(
            recommendation_service.get,
            user_id
        )
        # Cache success for 10 minutes
        cache.setex(cache_key, timedelta(minutes=10), json.dumps(recs))
        return recs
    except (CircuitOpenError, ConnectionError, TimeoutError):
        # 2. Fall back to cache (stale is better than nothing)
        cached = cache.get(cache_key)
        if cached:
            return json.loads(cached)
        # 3. Fall back to popular items (generic default)
        popular = cache.get('popular_items')
        if popular:
            return json.loads(popular)[:10]
        # 4. Final fallback: empty list
        # The page renders without recommendations — acceptable degradation
        return []
The staircase of fallbacks — live → cached → generic → empty — is a proven pattern. Each step is a deliberate product decision about what the minimum acceptable experience is.
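The staircase can also be expressed as a small generic helper: try each supplier in order and return the first that yields a usable result. This is a sketch of the shape, not a library API, and the name first_available is mine:

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar('T')


def first_available(suppliers: Iterable[Callable[[], Optional[T]]],
                    default: T) -> T:
    """Walk the fallback staircase: live, then cached, then generic.

    Each supplier either returns a usable value or signals failure by
    raising or returning None; the first success wins.
    """
    for supplier in suppliers:
        try:
            result = supplier()
            if result is not None:
                return result
        except Exception:
            continue  # This rung failed: step down to the next one
    return default
```

get_user_recommendations above is exactly this shape with four rungs; pulling the pattern into a helper makes each rung, and the final default, an explicit line in the call site.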
Putting It Together: The Error Handling Decision Tree
When you encounter a failure scenario in your design, run this:
1. BOUNDARY: Where should this error stop propagating?
└─ Can the feature fail without taking the page/service down? → YES: isolate it in a boundary
└─ Does failure here mean failure everywhere? → Design to change that if possible
2. RETRY: Should this operation retry?
└─ Is the error transient (network, timeout, 503)? → YES: retry with backoff + jitter
└─ Is the error deterministic (400, 401, 404)? → NO: never retry, fail immediately
└─ Is the operation non-idempotent? → Only retry if you have idempotency keys
└─ How many clients are retrying simultaneously? → Add circuit breaker if >1 client
3. CIRCUIT BREAKER: Do we need to protect the dependency?
└─ Is this a shared dependency (DB, external API, cache)? → YES: put a breaker on it
└─ One breaker per dependency endpoint, not per request
└─ Emit metrics on state changes — silent breakers are useless
4. FALLBACK: What do we serve when this fails?
└─ Can we serve stale cached data? → Yes: implement stale-on-error cache
└─ Can we return a partial response? → Yes: return what you have
└─ Can we default to empty/zero? → Yes: acceptable for non-critical features
└─ None of the above? → Queue for later, tell the user clearly
The Observability Requirement
None of this works without visibility. A circuit breaker that trips silently is worse than no circuit breaker — you're hiding the failure instead of recovering from it.
Minimum instrumentation for every error handling layer:
# What to log at each layer
import structlog

log = structlog.get_logger()

# Boundary: log every caught error with context
log.error("boundary.caught",
    boundary_name="payment-flow",
    error_type=type(e).__name__,
    user_id=user_id,
    request_id=request_id,
)

# Retry: log every attempt and final outcome
log.warning("retry.attempt",
    attempt=attempt_num,
    delay_seconds=sleep_time,
    function=func.__name__,
    error=str(e),
)

# Circuit breaker: emit state changes as events, not just logs
log.warning("circuit_breaker.state_change",
    breaker_name=self.name,
    from_state=old_state.value,
    to_state=new_state.value,
    failure_count=self._failure_count,
)

# Fallback: log which fallback level was used
log.info("fallback.used",
    service="recommendations",
    fallback_level="stale_cache",  # or "generic" or "empty"
    user_id=user_id,
)
Alert on: circuit breakers that open, breakers that stay open more than 5 minutes, fallback rates above baseline (means primary is degraded), and error boundary catches in your frontend monitoring.
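For the "fallback rate above baseline" alert, a sliding-window counter is enough to prototype the idea before wiring it into a metrics backend. The class name, window size, and threshold here are illustrative assumptions, not a standard API:

```python
from collections import deque


class FallbackRateMonitor:
    """Track what fraction of recent requests used a fallback path."""

    def __init__(self, window_size: int = 100, alert_threshold: float = 0.2):
        self._window = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record(self, used_fallback: bool) -> None:
        self._window.append(used_fallback)

    @property
    def rate(self) -> float:
        return sum(self._window) / len(self._window) if self._window else 0.0

    def should_alert(self) -> bool:
        # Only alert on a full window, to avoid noise right after startup
        return (len(self._window) == self._window.maxlen
                and self.rate > self.alert_threshold)
```

In production you would emit the rate as a gauge and let your alerting system compare it to baseline, but the logic is the same: a rising fallback rate means the primary path is degrading before users see hard errors.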
Architecture Checklist: Error Handling
Before shipping any feature that talks to an external dependency:
- Boundaries placed: Error isolation is explicit — failures cannot propagate unchecked to the top
- Fallback defined: Every boundary has a specific, intentional fallback — not just "show error"
- Retry logic present: Transient failures retry; deterministic failures fail immediately
- Backoff + jitter implemented: No fixed-interval retries; full jitter prevents synchronized retry storms
- Idempotency keys for mutations: Non-idempotent operations have idempotency protection before retry
- Circuit breaker on shared dependencies: Every database, cache, and external API connection is protected
- Breaker metrics emitted: State transitions (CLOSED → OPEN → HALF_OPEN) generate observable events
- What not to retry is documented: 4xx errors never retry; 5xx retries are conditional on idempotency
- Stale cache fallback considered: For read-heavy features, cached stale data beats a blank screen
- Async queue fallback considered: For critical mutations, queuing for later beats a hard failure
- React async errors bridged: useAsyncError hooks route async failures into the nearest Error Boundary
- Load tested under failure: Retry + circuit breaker behavior verified under simulated dependency failure
Ask The Guild
Community Prompt:
The AWS October 2025 outage lasted 13 extra hours because millions of clients retried simultaneously the moment DNS resolved. What does your retry logic look like right now — does it have jitter? Have you ever shipped a retry storm to production, or caught one in staging? Drop your war story in the Guild Discord. Bonus points if you've had to implement a circuit breaker while an incident was in progress — that's a story worth hearing.
Tom Hundley is a software architect with 25 years of experience. He coaches teams building production systems at scale through the AI Coding Guild.