Background Jobs: When Your API Route Takes Too Long
Architecture Patterns — Part 10 of 30
The Spinner That Lied
The feature seemed straightforward: a "Generate Report" button that pulled 90 days of transaction data, ran some aggregations, built a PDF, and emailed it to the user.
In development, it took about 8 seconds. A little slow, but it worked. The developer shipped it on a Friday.
Monday morning, a customer clicks the button. A spinner appears. They wait. And wait. Nothing happens. No error message. No email. They click it three more times.
Behind the scenes, Vercel's serverless function was being killed at the 10-second wall — the default timeout on the free Hobby tier. Each of the user's four clicks spawned a separate function invocation. Three of them hit the wall; the fourth squeaked through. The customer ended up with four identical emails, because each invocation had reached the email-sending step before being cut off.
The customer called support furious. The developer spent a day debugging something that had been working fine on localhost.
This failure mode is one of the most common in modern web apps — and almost entirely avoidable. It's not a bug. It's an architecture decision that was never made.
This is Part 10 of the Architecture Patterns series. We're going to build a decision framework for when and how to move work out of the request cycle — and what you need in place to do it safely.
Why API Routes Break Under Duration Pressure
An HTTP request is a fundamentally synchronous contract. The client opens a connection, the server does work, the server responds, the connection closes. The client is blocked the entire time.
This model has a hard constraint: time. And in modern deployment environments, that constraint is enforced aggressively.
| Platform | Default Timeout | Max Configurable |
|---|---|---|
| Vercel Hobby | 10 seconds | 60 seconds |
| Vercel Pro (Fluid Compute) | 60 seconds | 800 seconds |
| AWS Lambda | 3 seconds | 15 minutes |
| AWS API Gateway | — | 29 seconds (hard limit) |
| Self-hosted Node.js | None | As long as you want |
Notice the trap in that table: AWS API Gateway has a 29-second hard limit that cannot be overridden, even if your Lambda is configured for 15 minutes. This specific scenario burned a real team in 2025 — their API returned intermittent 504s even though Lambda showed no timeouts, because API Gateway was silently killing requests the Lambda was still processing. The fix was not a configuration change. It was a rearchitecture.
Beyond platform timeouts, there are two other failure modes that bite teams:
Silent job death. A background task crashes without surfacing an error. The user gets no feedback, the developer gets no alert. As described in a widely-circulated 2025 postmortem, this happens when teams treat job queues as an afterthought — basic implementations that work fine until production traffic exposes every weakness.
Celery/Redis reconnection instability. Teams running Celery as their task queue with Redis as the broker hit a class of bugs where workers silently stopped consuming jobs after days of uptime, without throwing errors. This was confirmed as a real bug in Celery's Redis broker handling, fixed in Celery 5.5.0 (released March 2025) with improved Redis reconnection logic. If you're on an older version, you may be losing jobs and not know it.
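If you want to fail fast instead of discovering an old Celery at 3 a.m., a startup version guard is cheap. This is a minimal sketch; the tuple comparison assumes plain `X.Y.Z` version strings (pre-release suffixes like `5.5.0rc1` would need extra parsing):

```python
def version_at_least(version: str, minimum: tuple[int, int, int]) -> bool:
    """Return True if a plain 'X.Y.Z' version string meets the minimum."""
    parts = tuple(int(p) for p in version.split('.')[:3])
    return parts >= minimum

# At worker startup, for example:
#   import celery
#   assert version_at_least(celery.__version__, (5, 5, 0)), \
#       'Upgrade Celery to 5.5.0+ for the Redis reconnection fixes'
```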
The Core Architecture Decision
Before you reach for any specific tool, you need to answer one question with precision:
Does this operation need to be complete before I respond to the user?
Most of the time, the honest answer is no. The user doesn't need the PDF in the HTTP response. They need to eventually receive the PDF. That's a completely different contract — and it unlocks an entirely different set of implementation options.
Here's the decision tree:
Does the user need the result synchronously?
│
├── YES → Can you make it fast enough? (< 5 seconds)
│ ├── YES → Optimize the operation, keep it synchronous
│ └── NO → Streaming response (if progressive output), or...
│ consider if "synchronous" is actually a UX requirement
│ or just an assumption
│
└── NO → Background job
│
├── Is the job stateless and short-lived? (< 30 sec)
│ └── Cloud functions triggered by queue message (SQS → Lambda)
│
├── Does the job have steps with dependencies?
│ └── Workflow orchestration (Step Functions, Temporal)
│
└── General-purpose jobs with retry/priority needs?
└── BullMQ (Node.js) or Celery (Python) with Redis
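The tree above can be encoded as a function, which is a handy way to make the decision explicit in a design doc. The labels and branch ordering are my own reading of the tree (a job with dependent steps is routed to orchestration even if it is also short-lived):

```python
def execution_model(needs_sync_result: bool, est_seconds: float,
                    has_dependent_steps: bool = False) -> str:
    """Map the decision tree onto a recommendation label."""
    if needs_sync_result:
        # Under 5 seconds: optimize and stay synchronous.
        return 'synchronous' if est_seconds < 5 else 'streaming-or-rethink'
    if has_dependent_steps:
        return 'workflow-orchestration'    # Step Functions, Temporal
    if est_seconds < 30:
        return 'queue-triggered-function'  # e.g. SQS -> Lambda
    return 'job-queue'                     # BullMQ / Celery
```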
The most common mistake I see among advanced builders: they stay synchronous because they don't want to deal with the complexity of async. So they increase timeouts. Then increase them again. Then they hit a hard platform limit and the whole thing falls over.
Background jobs have a setup cost. But that cost is paid once. The synchronous timeout treadmill has no exit.
The Pattern: Fire, Acknowledge, Poll
The standard pattern for background jobs in a web context has three parts:
- Fire: The API route receives the request, enqueues the job, and returns immediately with a job ID and status URL.
- Acknowledge: The client gets a 202 Accepted response — not 200, not 201. The HTTP spec has exactly the right status code for this.
- Poll (or Push): The client either polls a status endpoint, or the server pushes a notification via webhook or WebSocket when the job completes.
Here's the server side in Python with Celery:
# tasks.py
from celery import Celery
import time
app = Celery('tasks', broker='redis://localhost:6379/0',
backend='redis://localhost:6379/0')
@app.task(bind=True, max_retries=3)
def generate_report(self, user_id: str, date_range: dict):
try:
data = fetch_transactions(user_id, date_range) # slow DB call
pdf_path = build_pdf(data) # slow rendering
send_email(user_id, pdf_path) # slow SMTP
return {'status': 'complete', 'path': pdf_path}
except Exception as exc:
# Exponential backoff: 60s, 120s, 240s
raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))
# api/routes.py
from flask import Flask, jsonify, request
from tasks import generate_report
app = Flask(__name__)
@app.route('/reports', methods=['POST'])
def request_report():
user_id = request.json['user_id']
date_range = request.json['date_range']
# Enqueue and return immediately
job = generate_report.delay(user_id, date_range)
return jsonify({
'job_id': job.id,
'status': 'queued',
'status_url': f'/reports/status/{job.id}'
}), 202 # 202 Accepted — work is in progress
@app.route('/reports/status/<job_id>')
def report_status(job_id):
from celery.result import AsyncResult
result = AsyncResult(job_id)
return jsonify({
'job_id': job_id,
'status': result.status, # PENDING, STARTED, SUCCESS, FAILURE
'result': result.result if result.ready() else None
})
And the client side in TypeScript, polling with exponential backoff:
async function requestAndPollReport(
userId: string,
dateRange: { start: string; end: string }
): Promise<string> {
// Step 1: Fire
const response = await fetch('/reports', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ user_id: userId, date_range: dateRange }),
});
if (response.status !== 202) throw new Error('Unexpected status');
const { job_id, status_url } = await response.json();
// Step 2 & 3: Poll with backoff
let delay = 2000; // start at 2 seconds
const maxAttempts = 30;
for (let attempt = 0; attempt < maxAttempts; attempt++) {
await new Promise(resolve => setTimeout(resolve, delay));
const statusResponse = await fetch(status_url);
const { status, result } = await statusResponse.json();
if (status === 'SUCCESS') return result.path;
if (status === 'FAILURE') throw new Error('Report generation failed');
delay = Math.min(delay * 1.5, 30000); // cap at 30 seconds
}
throw new Error('Report timed out after polling');
}
Choosing Your Queue Infrastructure
The tool choice matters less than the pattern, but the wrong tool creates operational pain. Here's the relevant landscape in 2025:
BullMQ (Node.js + Redis): The most mature JavaScript job queue. Native TypeScript support, job dependencies via FlowProducer, built-in rate limiting, repeatable jobs, and progress tracking. BullMQ's architecture is pull-based — workers pull from Redis — which makes horizontal scaling clean. Best choice if you're already in a Node.js stack.
Celery (Python + Redis or RabbitMQ): The standard for Python shops. Mature, battle-tested, powerful. Just make sure you're on 5.5.0+ to get the Redis reconnection fixes. The silent-stop bug in older versions is a production landmine.
AWS SQS + Lambda: Zero infrastructure to manage. SQS at-least-once delivery, Lambda for processing. The natural choice if you're already on AWS. The gotcha: SQS delayed delivery caps at 15 minutes, and SQS has no native job priority (you need separate queues per priority tier).
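Both SQS gotchas can be absorbed by a thin routing layer in front of the send call. A sketch under stated assumptions: the queue URLs are placeholders, and actual sending would go through `boto3`'s `send_message`:

```python
# SQS has no native job priority: route to a separate queue per tier.
# These URLs are illustrative placeholders, not real endpoints.
QUEUE_URLS = {
    'high': 'https://sqs.us-east-1.amazonaws.com/123456789012/reports-high',
    'low':  'https://sqs.us-east-1.amazonaws.com/123456789012/reports-low',
}

MAX_SQS_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes

def queue_for(priority: str) -> str:
    # Unknown tiers fall back to the low-priority queue
    return QUEUE_URLS.get(priority, QUEUE_URLS['low'])

def clamped_delay(seconds: int) -> int:
    # Anything longer needs a scheduler in front of SQS (e.g. EventBridge)
    return max(0, min(seconds, MAX_SQS_DELAY_SECONDS))
```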
When NOT to build a queue: If the operation takes under 5 seconds and can be made reliable, stay synchronous. The complexity of async job infrastructure has real costs — monitoring, dead-letter queues, idempotency, status APIs. Don't pay that cost unless the synchronous path is actually broken.
The Three Things That Kill Background Job Systems
Once you've decided on background jobs, here are the failure modes that kill production systems:
1. No Idempotency
A user clicks "Generate Report" twice. Two jobs run. Two emails go out. Depending on your job, this could mean a customer gets charged twice, two database records get created, or two webhooks fire to a third party.
Every background job must be idempotent. Use a deterministic job key:
# A deterministic task ID makes duplicates identifiable. Note that Celery
# will still enqueue a second task with the same ID; pair this with a
# lock or a uniqueness check in your own datastore before calling it.
generate_report.apply_async(
    args=[user_id, date_range],
    task_id=f'report-{user_id}-{date_range["start"]}-{date_range["end"]}'
)
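The deterministic key only helps if something actually checks it. A worker-side sketch of the same idea: `completed` here stands in for a durable store (a Redis SET or a unique-keyed database row), which is an assumption of this example:

```python
# Worker-side idempotency: record completed job keys so a duplicate
# delivery becomes a no-op instead of a second email.
completed: set[str] = set()  # stand-in for a durable store

def run_once(job_key: str, work) -> bool:
    """Run `work` only if this key hasn't already completed.

    Returns True if the work ran, False if it was a duplicate.
    """
    if job_key in completed:
        return False
    work()
    completed.add(job_key)  # mark only AFTER success, so failures can retry
    return True
```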
2. No Dead Letter Queue
When a job hits its maximum retries, it needs to go somewhere visible. Without a dead letter queue, failed jobs just disappear. With one, you have a queue you can inspect, alert on, and replay.
// BullMQ — move permanently failed jobs to a DLQ
const worker = new Worker('reports', processReport, {
connection,
limiter: { max: 100, duration: 60000 },
});
worker.on('failed', async (job, err) => {
if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
await dlqQueue.add('failed-report', {
originalJob: job.data,
error: err.message,
failedAt: new Date().toISOString(),
});
}
});
3. No Visibility Into Queue Depth
The first sign of a background job problem is usually queue depth growing faster than it's being consumed. By the time jobs start failing, you're already behind. Set alerts on queue depth and processing latency, not just error rates.
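A depth alert can be as simple as a threshold check run on a schedule. The thresholds below are illustrative assumptions; the real numbers would come from your queue (BullMQ's `getWaitingCount()`, or `LLEN` on Celery's Redis broker list):

```python
def queue_healthy(depth: int, oldest_age_s: float,
                  max_depth: int = 1000, max_age_s: float = 300.0) -> bool:
    """Alert if the backlog is too deep OR the oldest job is too stale.

    Age catches the subtler failure: a short queue that isn't moving at all.
    """
    return depth <= max_depth and oldest_age_s <= max_age_s
```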
A Real AWS re:Invent Architecture Pivot
At AWS re:Invent 2025, a food delivery company described what happened when their synchronous serverless architecture hit scale. They had 42 Lambda functions with complex dependency chains. During the morning rush, order processing took 6+ seconds, their failure rate hit 3% at peak, and 23% of orders were abandoned before completion.
The fix wasn't better code. It was pulling Step Functions for orchestration, EventBridge for events, and SQS for reliable processing into what had been a synchronous chain. The result: dramatically reduced Lambda count, better timeout handling, and near-zero order abandonment.
The architecture decision wasn't made proactively. It was forced by a breaking point. That's the most expensive way to learn it.
Architecture Checklist: Background Jobs
Decision trigger — move to background jobs when:
- Operation duration exceeds 5 seconds in production (not dev)
- You're on a platform with hard timeout limits (Vercel, API Gateway)
- The user doesn't need the result in the same HTTP response
- The operation involves chained external API calls or file processing
- You need retries with backoff on failure
Before you ship:
- API route returns 202 + job ID immediately — never blocks
- Status endpoint exists and returns machine-readable state
- Every job has idempotency logic (deterministic job IDs or deduplication keys)
- Retry limits are set with exponential backoff — no infinite loops
- Dead letter queue captures permanently failed jobs
- Alerts are configured on queue depth, not just error rates
- Workers handle graceful shutdown (re-queue in-progress jobs on SIGTERM)
- Celery users: confirm you're on 5.5.0+ for Redis reconnection stability
For the client side:
- UI shows the user meaningful progress, not an infinite spinner
- Polling uses exponential backoff with a reasonable cap
- "Something went wrong" is surfaced to users when jobs fail — not silently swallowed
Ask The Guild
What's the longest-running operation you've had to move out of a synchronous API route — and what did you use to replace it? Did you build your own queue, use a managed service, or reach for something like Temporal for orchestration? Share your stack and the failure mode that forced the decision. The Guild learns best from war stories.
Part 10 of 30 in the Architecture Patterns series. Next up: Part 11 — Event-Driven Architecture: When Services Need to Talk Without Talking.