Production Ready — Part 4 of 30

Structured Logging: Making Your Logs Searchable

Written by claude-sonnet-4 · Edited by claude-sonnet-4
logging · structured-logging · observability · json · python · nodejs · typescript · production · debugging · devops


It was 2:47 AM on a Tuesday. I was on a bridge call with six engineers staring at a wall of logs that looked like this:

[INFO] 2025-09-14 02:47:12 - Processing request
[ERROR] 2025-09-14 02:47:13 - Something went wrong
[INFO] 2025-09-14 02:47:13 - Done
[ERROR] 2025-09-14 02:47:14 - Something went wrong

Payments were failing for a subset of users. Revenue was bleeding. The on-call engineer kept repeating the phrase I dread most: "I'm just grepping through logs."

We spent 90 minutes in that call before someone thought to check a different service. Turned out the issue was a card processor returning a new error code we'd never seen. That information was in the logs — buried in a formatted string, invisible to any search tool, impossible to aggregate.

If those logs had been structured, we'd have found the payment_processor_code: DECLINED_NEW_REASON field in under two minutes.

That night cost the company roughly $40,000 in lost transactions. Structured logging would have cost about four hours of developer time to implement.


What Structured Logging Actually Is

Plain-text logging is what most of us start with. It's natural — you're a human writing messages for humans to read:

[2025-09-14 14:23:01] ERROR Payment failed for user 12345, amount $99.99, reason: card declined

The problem is that machines can't reliably parse this. If you want to find all failures for users with amounts over $50, you're writing a regex. If the format changes slightly next sprint, your regex breaks. If you're running 50 microservices, every team has a slightly different format, and now your grep scripts are a mess of special cases.
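To see how brittle that gets, here is a minimal sketch of the regex you end up writing against the plain-text line above (field names are illustrative):

```python
import re

# A regex hand-crafted for the exact current format -- one wording
# change next sprint and every match silently fails.
LINE_RE = re.compile(
    r"ERROR Payment failed for user (?P<user_id>\d+), "
    r"amount \$(?P<amount>[\d.]+), reason: (?P<reason>.+)"
)

def parse_failure(line):
    m = LINE_RE.search(line)
    if not m:
        return None  # format drifted, or not a failure line
    return {
        "user_id": int(m.group("user_id")),
        "amount": float(m.group("amount")),
        "reason": m.group("reason"),
    }

line = "[2025-09-14 14:23:01] ERROR Payment failed for user 12345, amount $99.99, reason: card declined"
print(parse_failure(line))
# → {'user_id': 12345, 'amount': 99.99, 'reason': 'card declined'}

# A harmless rewording breaks the parser without any error:
print(parse_failure(line.replace("reason:", "cause:")))
# → None
```

The second call is the failure mode that bites at 3 AM: the log line still exists, but your tooling can no longer see it.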

Structured logging means your log entries are data, not sentences. Each field is a named key with a typed value:

{
  "timestamp": "2025-09-14T14:23:01.412Z",
  "level": "error",
  "service": "payment-service",
  "version": "1.4.2",
  "message": "Transaction failed",
  "user_id": 12345,
  "payment": {
    "amount": 99.99,
    "currency": "USD",
    "processor_code": "CARD_DECLINED",
    "reason": "insufficient_funds"
  },
  "request_id": "req_8f3b2e1a",
  "duration_ms": 234
}

Now you can query: "Show me all payment failures in the last hour where payment.amount > 50 and payment.processor_code is not CARD_DECLINED." That query runs in two seconds instead of ninety minutes.
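That kind of query needs no special tooling once each line is JSON. A minimal sketch, with illustrative log lines, of the same filter in plain Python:

```python
import json

# Three newline-delimited JSON log entries (illustrative)
log_lines = [
    '{"level": "error", "event": "payment_failure", "payment": {"amount": 99.99, "processor_code": "CARD_DECLINED"}}',
    '{"level": "error", "event": "payment_failure", "payment": {"amount": 120.00, "processor_code": "DECLINED_NEW_REASON"}}',
    '{"level": "info", "event": "payment_success", "payment": {"amount": 75.00, "processor_code": "OK"}}',
]

def failures_over(lines, min_amount):
    # Parse each line and filter on typed fields -- no regexes required
    for raw in lines:
        entry = json.loads(raw)
        p = entry.get("payment", {})
        if (entry["level"] == "error"
                and p.get("amount", 0) > min_amount
                and p.get("processor_code") != "CARD_DECLINED"):
            yield p["processor_code"]

print(list(failures_over(log_lines, 50)))  # → ['DECLINED_NEW_REASON']
```

The same query expressed in a log aggregator's query language runs across billions of lines; the point is that the fields are addressable at all.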


The Real Cost of Unstructured Logs

A 2025 case study on DEV Community documented a Next.js application that switched from console.log() strings to structured JSON. The before/after numbers are striking:

| Metric | Before | After |
| --- | --- | --- |
| Monthly observability cost | $10,400 | $3,800 |
| Query response time | 30–60 seconds | 2–5 seconds |
| Data ingest volume | 924 GB/month | 340 GB/month |
| Alert false positives | baseline | reduced 40% |
| Log retention window | 15 days | 30 days |

That's a 63% cost reduction — not from switching vendors, not from optimizing infrastructure, just from making logs machine-readable so they could be intelligently sampled and filtered.

The key insight: unstructured logs force you to keep everything because you can't easily tell what's noise. Structured logs let you keep 100% of errors and security events, sample routine API requests at 10%, and drop debug logs in production entirely. Same visibility, a fraction of the volume.
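In Python's logging stack, that keep/sample/drop policy can be expressed as a filter. A minimal sketch — the 10% rate and random sampling are illustrative choices, not a prescription:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARN-and-above records, sample INFO at a fixed rate, drop DEBUG."""

    def __init__(self, info_rate=0.10):
        super().__init__()
        self.info_rate = info_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                                # errors, warnings: keep 100%
        if record.levelno == logging.INFO:
            return random.random() < self.info_rate    # routine events: sample
        return False                                   # debug: drop in production
```

Attach it to your production handler with `handler.addFilter(SamplingFilter())`; because the decision happens per record, errors are never sampled away.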


Setting Up Structured Logging in Python

Python's built-in logging module can output JSON with the right handler. The most common library is python-json-logger.

Security note: In early 2025, python-json-logger had CVE-2025-27607 — a dependency confusion vulnerability in dev installs on Python 3.13. It's patched in version 3.3.0+. Always pin your dependencies and run pip install --upgrade "python-json-logger>=3.3.0" (the quotes stop your shell from treating > as a redirect).

import logging
import sys
from pythonjsonlogger import jsonlogger

class ContextFilter(logging.Filter):
    """Stamps base context fields onto every record that passes through."""

    def __init__(self, context):
        super().__init__()
        self.context = context

    def filter(self, record):
        for key, value in self.context.items():
            setattr(record, key, value)
        return True

def setup_logging(service_name: str, version: str, environment: str):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    handler = logging.StreamHandler(sys.stdout)

    formatter = jsonlogger.JsonFormatter(
        fmt='%(asctime)s %(levelname)s %(name)s %(message)s',
        datefmt='%Y-%m-%dT%H:%M:%S',
        rename_fields={'levelname': 'level', 'asctime': 'timestamp'}
    )
    handler.setFormatter(formatter)

    # Bind base context to every log entry. (A bare LoggerAdapter created
    # and discarded here would do nothing -- the filter actually runs.)
    handler.addFilter(ContextFilter({
        'service': service_name,
        'version': version,
        'environment': environment
    }))

    logger.addHandler(handler)
    return logger

logger = setup_logging('payment-service', '1.4.2', 'production')

# Usage — pass structured fields as keyword arguments
logger.info('Transaction completed', extra={
    'event': 'payment_success',
    'user_id': 12345,
    'amount': 99.99,
    'currency': 'USD',
    'duration_ms': 234,
    'request_id': 'req_8f3b2e1a'
})

logger.error('Transaction failed', extra={
    'event': 'payment_failure',
    'user_id': 12345,
    'amount': 99.99,
    'processor_code': 'CARD_DECLINED',
    'request_id': 'req_8f3b2e1a'
})

Output (one line per event, machine-readable):

{"timestamp": "2025-09-14T14:23:01", "level": "INFO", "message": "Transaction completed", "service": "payment-service", "version": "1.4.2", "environment": "production", "event": "payment_success", "user_id": 12345, "amount": 99.99, "currency": "USD", "duration_ms": 234, "request_id": "req_8f3b2e1a"}

Setting Up Structured Logging in Node.js / TypeScript

For Node.js, Pino is the go-to library — it's fast, outputs JSON by default, and has a great ecosystem.

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'payment-service',
    version: process.env.npm_package_version,
    environment: process.env.NODE_ENV,
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // In production, don't pretty-print — raw JSON for log aggregators
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' }
    : undefined,
});

// Create child loggers with shared context for a request
export function createRequestLogger(requestId: string, userId?: number) {
  return logger.child({ request_id: requestId, user_id: userId });
}

// Usage in an Express route
app.post('/payments', async (req, res) => {
  const log = createRequestLogger(req.headers['x-request-id'] as string, req.user?.id);

  log.info({ event: 'payment_initiated', amount: req.body.amount, currency: req.body.currency });

  try {
    const result = await processPayment(req.body);
    log.info({ event: 'payment_success', transaction_id: result.id, duration_ms: result.durationMs });
    res.json(result);
  } catch (err: any) { // typed as any so custom fields like processorCode are accessible
    log.error({
      event: 'payment_failure',
      error_type: err.constructor.name,
      processor_code: err.processorCode,
      err, // Pino serializes Error objects correctly
    }, 'Transaction failed');
    res.status(500).json({ error: 'Payment failed' });
  }
});

Critical detail: Pino serializes Error objects properly — message, stack trace, and any custom properties all appear as structured fields. With console.error(err), a stack trace becomes multi-line text that most log aggregators split into separate events, losing the connection to the original request.
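The Python setup gets the same property from exc_info: python-json-logger typically renders the traceback as a single escaped string field rather than raw multi-line text. A minimal stdlib-only sketch of the idea — the field names here are illustrative:

```python
import json
import traceback

def log_exception(event, err):
    # Serialize the error as structured fields; the stack trace becomes one
    # escaped string, so aggregators keep it attached to this single event.
    entry = {
        "level": "error",
        "event": event,
        "error_type": type(err).__name__,
        "error_message": str(err),
        "stack": "".join(traceback.format_exception(type(err), err, err.__traceback__)),
    }
    return json.dumps(entry)

try:
    raise ValueError("card declined")
except ValueError as e:
    line = log_exception("payment_failure", e)

print(json.loads(line)["error_type"])  # → ValueError
```

Because json.dumps escapes the newlines inside the stack string, the whole event stays on one physical line — exactly what NDJSON aggregators expect.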


The Three Fields You Cannot Skip

Every log entry needs these, no exceptions:

1. request_id (or trace_id)

This is what lets you follow a single request across 10 microservices. Generate it at the entry point (API gateway, load balancer, first service) and thread it through every downstream call via HTTP headers.

// Middleware: attach request ID to every log in this request's lifecycle
app.use((req, res, next) => {
  req.requestId = req.headers['x-request-id'] || crypto.randomUUID();
  res.setHeader('x-request-id', req.requestId);
  req.log = logger.child({ request_id: req.requestId });
  next();
});
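Python has no built-in child-logger idiom, but contextvars plus a filter achieves the same threading; a minimal sketch (the names and ID format are illustrative):

```python
import contextvars
import logging
import uuid

# Holds the current request's ID for whichever task/thread is running
request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIdFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drops records, only annotates them

def handle_request(incoming_id=None):
    # At the entry point: reuse the upstream ID or mint a new one
    request_id_var.set(incoming_id or f"req_{uuid.uuid4().hex[:8]}")
    # ... every log call made while handling this request now
    # carries request_id, with no parameter passing required ...
```

Attach `RequestIdFilter()` to your JSON handler once, and every log line in the request's lifecycle carries the ID, even across `await` boundaries.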

2. event (a machine-readable event name)

Don't just write message: "User logged in". Add event: "user_login_success". This is the field you query dashboards and alerts on. It never changes, even if the message text gets reworded.

3. duration_ms

Log how long things take. Always. A payment that normally takes 200ms taking 4000ms is a signal — even if it succeeds. You'll only notice the trend if you can aggregate duration_ms across thousands of requests.
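A small context manager makes duration_ms hard to forget; a minimal sketch (the field names follow the conventions used earlier in this article):

```python
import logging
import time
from contextlib import contextmanager

@contextmanager
def timed(logger, event, **fields):
    # Logs duration_ms for the wrapped block -- on success or failure
    start = time.monotonic()
    try:
        yield
    finally:
        duration_ms = round((time.monotonic() - start) * 1000)
        logger.info(event, extra={"event": event, "duration_ms": duration_ms, **fields})

logger = logging.getLogger("payment-service")

with timed(logger, "db_query", table="payments"):
    time.sleep(0.05)  # stand-in for the real database call
```

Because the log fires in `finally`, you get a duration even when the wrapped call raises — which is precisely when you most want to know how long it ran.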


Verifying Your Setup

Before you ship, test that your logs are actually valid JSON:

# Start your app and pipe logs through jq; --silent stops npm's own
# non-JSON banner lines from confusing the parse
npm start --silent 2>&1 | jq .

# If jq parses it cleanly, you're good. If you see parse errors,
# you have multi-line output or something that isn't JSON.

# Spot-check for the fields you care about
npm start --silent 2>&1 | jq 'select(.level == "error") | {event, user_id, error_type}'

# For Python
python app.py 2>&1 | python3 -m json.tool --json-lines

If jq returns errors, you almost certainly have a multi-line stack trace leaking through. Fix it before deploying — a broken log format is worse than no log format because it silently corrupts your aggregation pipeline.
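The same check can live in a unit test so a broken format fails CI instead of production. A minimal sketch — the required-field set and sample lines are illustrative:

```python
import json

REQUIRED_FIELDS = {"timestamp", "level", "service"}

def validate_log_lines(output):
    """Return a list of (line_number, problem) for anything that isn't clean NDJSON."""
    problems = []
    for i, line in enumerate(output.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append((i, "not valid JSON -- multi-line output leaking?"))
            continue
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            problems.append((i, f"missing fields: {sorted(missing)}"))
    return problems

good = '{"timestamp": "2025-09-14T14:23:01Z", "level": "info", "service": "payment-service"}'
bad = "Traceback (most recent call last):"
print(validate_log_lines(good + "\n" + bad))
```

Run your app's captured stdout through this in a smoke test and assert the returned list is empty.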


What NOT to Log

Structured logging makes it dangerously easy to log too much. JSON is flexible — it's tempting to serialize entire request bodies or user objects.

Don't.

  • Never log passwords, tokens, or API keys — obvious, but easy to miss in a hurried debug session
  • Never log full credit card numbers (GDPR and PCI DSS will ruin your day)
  • Redact PII at the source, not downstream — a pipeline filter is a second line of defense, not the first
  • Don't log entire request/response bodies unless you have a specific, scoped reason — that's where the leaks live

# Bad — logs the full user object, which might contain PII
logger.info('User updated', extra={'user': user.__dict__})

# Good — log only the fields you need
logger.info('User updated', extra={
    'event': 'user_profile_updated',
    'user_id': user.id,
    'fields_changed': list(changed_fields.keys())  # field names, not values
})
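Redaction at the source can be mechanical. A sketch of a recursive scrubber — the key list is an illustrative starting point, not a complete inventory; extend it for your domain:

```python
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "ssn"}

def redact(value):
    # Recursively replace sensitive values before they ever reach the logger
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

payload = {"user_id": 12345, "card_number": "4111111111111111", "meta": {"token": "abc"}}
print(redact(payload))
# → {'user_id': 12345, 'card_number': '[REDACTED]', 'meta': {'token': '[REDACTED]'}}
```

Wrap it around any dict you pass as extra, and the scrubbing happens at the source — the pipeline filter remains a second line of defense.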

Log Levels: Use Them Deliberately

Structured logs with wrong levels are still a mess. Here's the map I use:

| Level | When to use | Production volume |
| --- | --- | --- |
| DEBUG | Variable values, step tracing | Off by default |
| INFO | Normal business events (login, purchase, export) | Sampled (10–30%) |
| WARN | Expected problems, graceful degradation | 100% |
| ERROR | Failures requiring attention | 100% |
| FATAL / CRITICAL | Service is going down | 100% |

One more widely repeated piece of observability guidance: log levels must be configurable at runtime without a redeploy. When an incident happens, you need to temporarily crank up verbosity on one service without touching the others. If that requires a deployment, you've added 10+ minutes to every incident response.
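One common Python approach is a signal handler that flips the root level without a restart. A minimal, POSIX-only sketch — the choice of SIGUSR1/SIGUSR2 is an illustrative convention:

```python
import logging
import signal

def enable_debug(signum, frame):
    # Crank verbosity up during an incident, and log the change itself
    logging.getLogger().setLevel(logging.DEBUG)
    logging.getLogger().warning(
        "log level changed",
        extra={"event": "log_level_changed", "new_level": "DEBUG"},
    )

def restore_info(signum, frame):
    logging.getLogger().setLevel(logging.INFO)

signal.signal(signal.SIGUSR1, enable_debug)   # kill -USR1 <pid> during an incident
signal.signal(signal.SIGUSR2, restore_info)   # kill -USR2 <pid> when done
```

In containerized setups the same effect is often achieved by re-reading an environment variable or a config endpoint; the mechanism matters less than the ability to change levels without a deploy.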


The Structured Logging Checklist

Before you consider a service production-ready, verify every item:

  • All log output is valid JSON (test with jq .)
  • Every entry has: timestamp (ISO 8601, UTC), level, service, version, environment
  • Every request-scoped entry has a request_id or trace_id
  • Business events use a consistent event field with machine-readable names
  • duration_ms is logged for all external calls (DB, APIs, queues)
  • Errors include error_type and full stack trace as a single escaped string (not multi-line)
  • No passwords, tokens, or full PII in any log entry
  • Log levels are configurable at runtime without redeployment
  • DEBUG logs are off by default in production
  • Multi-line stack traces are serialized as single escaped strings
  • You've run the output through jq and confirmed it parses cleanly
  • python-json-logger is pinned to >= 3.3.0 if used in Python

Ask The Guild

Community prompt: What's the worst debugging experience you've had because of poor logging? Share your story — the chaos of the 3 AM incident, what the logs said (or didn't say), and how you'd write them differently now. Bonus points if you have a before/after of a log line that changed how you debug. Drop it in the comments and let's build a hall of shame (and glory) for log formats.


Tom Hundley is a software architect with 25 years of experience. He has been on more 3 AM bridge calls than he'd like to count, and believes that good observability is an act of kindness toward your future on-call self.
