Structured Logging: Making Your Logs Searchable
Production Ready — Part 4 of 30
It was 2:47 AM on a Tuesday. I was on a bridge call with six engineers staring at a wall of logs that looked like this:
[INFO] 2025-09-14 02:47:12 - Processing request
[ERROR] 2025-09-14 02:47:13 - Something went wrong
[INFO] 2025-09-14 02:47:13 - Done
[ERROR] 2025-09-14 02:47:14 - Something went wrong
Payments were failing for a subset of users. Revenue was bleeding. The on-call engineer kept repeating the phrase I dread most: "I'm just grepping through logs."
We spent 90 minutes in that call before someone thought to check a different service. Turned out the issue was a card processor returning a new error code we'd never seen. That information was in the logs — buried in a formatted string, invisible to any search tool, impossible to aggregate.
If those logs had been structured, we'd have found the payment_processor_code: DECLINED_NEW_REASON field in under two minutes.
That night cost the company roughly $40,000 in lost transactions. Structured logging would have cost about four hours of developer time to implement.
What Structured Logging Actually Is
Plain-text logging is what most of us start with. It's natural — you're a human writing messages for humans to read:
[2025-09-14 14:23:01] ERROR Payment failed for user 12345, amount $99.99, reason: card declined
The problem is that machines can't reliably parse this. If you want to find all failures for users with amounts over $50, you're writing a regex. If the format changes slightly next sprint, your regex breaks. If you're running 50 microservices, every team has a slightly different format, and now your grep scripts are a mess of special cases.
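To make that concrete, here is roughly the regex you'd end up writing to parse the line above (a hypothetical sketch; your format will differ):

```python
import re

# A brittle regex matching the plain-text line above. Every field needs
# its own capture group, and any wording change breaks the whole thing.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] ERROR Payment failed for user (?P<user_id>\d+), "
    r"amount \$(?P<amount>[\d.]+), reason: (?P<reason>.+)"
)

line = ("[2025-09-14 14:23:01] ERROR Payment failed for user 12345, "
        "amount $99.99, reason: card declined")
match = LINE_RE.match(line)
# Works today. But if next sprint the message becomes "Payment declined",
# the regex silently matches nothing and your alerting goes dark.
```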
Structured logging means your log entries are data, not sentences. Each field is a named key with a typed value:
{
  "timestamp": "2025-09-14T14:23:01.412Z",
  "level": "error",
  "service": "payment-service",
  "version": "1.4.2",
  "message": "Transaction failed",
  "user_id": 12345,
  "payment": {
    "amount": 99.99,
    "currency": "USD",
    "processor_code": "CARD_DECLINED",
    "reason": "insufficient_funds"
  },
  "request_id": "req_8f3b2e1a",
  "duration_ms": 234
}
Now you can query: "Show me all payment failures in the last hour where payment.amount > 50 and payment.processor_code is not CARD_DECLINED." That query runs in two seconds instead of ninety minutes.
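That query is trivial once every entry is JSON. A minimal Python sketch of the same filter over newline-delimited log lines (field names taken from the example entry above):

```python
import json

def payment_failures_over(lines, min_amount=50.0):
    """Yield error entries where payment.amount exceeds min_amount and the
    processor code is something other than the routine CARD_DECLINED."""
    for line in lines:
        entry = json.loads(line)
        payment = entry.get("payment", {})
        if (entry.get("level") == "error"
                and payment.get("amount", 0) > min_amount
                and payment.get("processor_code") != "CARD_DECLINED"):
            yield entry
```

A log aggregator runs the equivalent of this over an index instead of a scan, which is where the speedup comes from.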
The Real Cost of Unstructured Logs
A 2025 case study on DEV Community documented a Next.js application that switched from console.log() strings to structured JSON. The before/after numbers are striking:
| Metric | Before | After |
|---|---|---|
| Monthly observability cost | $10,400 | $3,800 |
| Query response time | 30–60 seconds | 2–5 seconds |
| Data ingest volume | 924 GB/month | 340 GB/month |
| Alert false positives | baseline | reduced 40% |
| Log retention window | 15 days | 30 days |
That's a 63% cost reduction — not from switching vendors, not from optimizing infrastructure, just from making logs machine-readable so they could be intelligently sampled and filtered.
The key insight: unstructured logs force you to keep everything because you can't easily tell what's noise. Structured logs let you keep 100% of errors and security events, sample routine API requests at 10%, and drop debug logs in production entirely. Same visibility, a fraction of the volume.
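As a sketch of what that policy looks like in code: a stdlib `logging.Filter` that keeps everything at WARNING and above, samples INFO, and drops DEBUG. The thresholds are illustrative; a real pipeline might also whitelist security events.

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep 100% of WARNING and above, sample INFO, drop DEBUG."""

    def __init__(self, info_sample_rate: float = 0.10):
        super().__init__()
        self.info_sample_rate = info_sample_rate

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # errors and warnings are always kept
        if record.levelno == logging.INFO:
            return random.random() < self.info_sample_rate
        return False  # DEBUG never reaches production sinks
```

Attach it to a handler with `handler.addFilter(SamplingFilter())`. Sampling at the edge like this is cruder than what a vendor pipeline can do, but it is free and it happens before you pay for ingest.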
Setting Up Structured Logging in Python
Python's built-in logging module can output JSON with the right handler. The most common library is python-json-logger.
Security note: In early 2025, `python-json-logger` had CVE-2025-27607, a dependency confusion vulnerability affecting dev installs on Python 3.13. It's patched in version 3.3.0+. Always pin your dependencies, and run `pip install --upgrade 'python-json-logger>=3.3.0'` (the quotes keep your shell from treating `>=` as a redirect).
import logging
import sys

from pythonjsonlogger import jsonlogger

class ContextFilter(logging.Filter):
    """Attach base context fields to every record passing through."""

    def __init__(self, context: dict):
        super().__init__()
        self.context = context

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in self.context.items():
            setattr(record, key, value)
        return True

def setup_logging(service_name: str, version: str, environment: str):
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)

    handler = logging.StreamHandler(sys.stdout)
    formatter = jsonlogger.JsonFormatter(
        fmt='%(asctime)s %(levelname)s %(name)s %(message)s',
        datefmt='%Y-%m-%dT%H:%M:%S',
        rename_fields={'levelname': 'level', 'asctime': 'timestamp'}
    )
    handler.setFormatter(formatter)

    # Bind base context to every log entry. A filter (rather than a bare
    # LoggerAdapter) keeps per-call `extra` fields intact.
    handler.addFilter(ContextFilter({
        'service': service_name,
        'version': version,
        'environment': environment,
    }))
    logger.addHandler(handler)
    return logger
logger = setup_logging('payment-service', '1.4.2', 'production')
# Usage — pass structured fields via the `extra` dict
logger.info('Transaction completed', extra={
    'event': 'payment_success',
    'user_id': 12345,
    'amount': 99.99,
    'currency': 'USD',
    'duration_ms': 234,
    'request_id': 'req_8f3b2e1a'
})

logger.error('Transaction failed', extra={
    'event': 'payment_failure',
    'user_id': 12345,
    'amount': 99.99,
    'processor_code': 'CARD_DECLINED',
    'request_id': 'req_8f3b2e1a'
})
Output (one line per event, machine-readable):
{"timestamp": "2025-09-14T14:23:01", "level": "INFO", "name": "root", "message": "Transaction completed", "service": "payment-service", "version": "1.4.2", "environment": "production", "event": "payment_success", "user_id": 12345, "amount": 99.99, "currency": "USD", "duration_ms": 234, "request_id": "req_8f3b2e1a"}
Setting Up Structured Logging in Node.js / TypeScript
For Node.js, Pino is the go-to library — it's fast, outputs JSON by default, and has a great ecosystem.
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'payment-service',
    version: process.env.npm_package_version,
    environment: process.env.NODE_ENV,
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  // Emit the level as a label ("error") instead of Pino's numeric
  // default (50), so queries like `.level == "error"` work everywhere
  formatters: {
    level: (label) => ({ level: label }),
  },
  // In production, don't pretty-print — raw JSON for log aggregators
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' }
    : undefined,
});

// Create child loggers with shared context for a request
export function createRequestLogger(requestId: string, userId?: number) {
  return logger.child({ request_id: requestId, user_id: userId });
}

// Usage in an Express route
app.post('/payments', async (req, res) => {
  const log = createRequestLogger(req.headers['x-request-id'] as string, req.user?.id);
  log.info({ event: 'payment_initiated', amount: req.body.amount, currency: req.body.currency });
  try {
    const result = await processPayment(req.body);
    log.info({ event: 'payment_success', transaction_id: result.id, duration_ms: result.durationMs });
    res.json(result);
  } catch (err: any) {
    log.error({
      event: 'payment_failure',
      error_type: err.constructor.name,
      processor_code: err.processorCode,
      err, // Pino serializes Error objects correctly
    }, 'Transaction failed');
    res.status(500).json({ error: 'Payment failed' });
  }
});
Critical detail: Pino serializes Error objects properly — message, stack trace, and any custom properties all appear as structured fields. With console.error(err), a stack trace becomes multi-line text that most log aggregators split into separate events, losing the connection to the original request.
The Three Fields You Cannot Skip
Every log entry needs these, no exceptions:
1. request_id (or trace_id)
This is what lets you follow a single request across 10 microservices. Generate it at the entry point (API gateway, load balancer, first service) and thread it through every downstream call via HTTP headers.
// Middleware: attach request ID to every log in this request's lifecycle
app.use((req, res, next) => {
  req.requestId = req.headers['x-request-id'] || crypto.randomUUID();
  res.setHeader('x-request-id', req.requestId);
  req.log = logger.child({ request_id: req.requestId });
  next();
});
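The middleware above is Express-specific; in Python the same request-scoped binding is usually done with `contextvars`. A rough sketch, with names that are illustrative rather than from any particular framework:

```python
import contextvars
import logging

# Holds the current request's ID for the duration of that request,
# correctly isolated across threads and async tasks
request_id_var = contextvars.ContextVar("request_id", default=None)

class RequestIdFilter(logging.Filter):
    """Copy the contextvar onto every record so a JSON formatter emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

def outgoing_headers() -> dict:
    """Headers to attach to downstream HTTP calls, propagating the ID."""
    rid = request_id_var.get()
    return {"x-request-id": rid} if rid else {}
```

Set the var in your request middleware (`request_id_var.set(incoming_or_new_id)`), add the filter to your handler, and every log line plus every downstream call carries the same ID.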
2. event (a machine-readable event name)
Don't just write message: "User logged in". Add event: "user_login_success". This is the field you build dashboards and alerts on, and it never changes even when the message text is reworded.
3. duration_ms
Log how long things take. Always. A payment that normally takes 200ms taking 4000ms is a signal — even if it succeeds. You'll only notice the trend if you can aggregate duration_ms across thousands of requests.
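One low-effort way to make this automatic in Python is a context manager that logs `duration_ms` on exit, whether the block succeeds or raises. A sketch (the logger name is illustrative):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("payment-service")

@contextmanager
def timed(event: str, **fields):
    """Log `event` with a duration_ms field when the block exits,
    on success or failure alike."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = round((time.monotonic() - start) * 1000, 1)
        logger.info(event, extra={"event": event, "duration_ms": elapsed_ms, **fields})

# Usage:
# with timed("payment_processed", user_id=12345):
#     process_payment(order)
```

Wrap every external call (DB, payment processor, queue) and the trend data accumulates for free.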
Verifying Your Setup
Before you ship, test that your logs are actually valid JSON:
# Start your app and pipe logs through jq
npm start 2>&1 | jq .

# If jq parses it cleanly, you're good. If you see parse errors,
# you have multi-line output or something that isn't JSON.

# Spot-check for the fields you care about
npm start 2>&1 | jq 'select(.level == "error") | {event, user_id, error_type}'

# For Python: --json-lines treats each line as its own JSON document
python app.py 2>&1 | python3 -m json.tool --json-lines
If jq returns errors, you almost certainly have a multi-line stack trace leaking through. Fix it before deploying — a broken log format is worse than no log format because it silently corrupts your aggregation pipeline.
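On the Python side, the usual fix for a leaking multi-line traceback is to pass the exception through `exc_info` instead of printing it: the traceback then travels with the record as structured data, which a JSON formatter like python-json-logger serializes as a single escaped string field. A sketch:

```python
import logging

log = logging.getLogger("payment-service")

def safe_divide(a: float, b: float):
    try:
        return a / b
    except ZeroDivisionError:
        # exc_info=True attaches the traceback to the record itself;
        # a JSON formatter emits it as one escaped field, not raw lines
        log.error("Division failed", exc_info=True,
                  extra={"event": "calc_failure"})
        return None
```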
What NOT to Log
Structured logging makes it dangerously easy to log too much. JSON is flexible — it's tempting to serialize entire request bodies or user objects.
Don't.
- Never log passwords, tokens, or API keys — obvious, but easy to miss in a hurried debug session
- Never log full credit card numbers (GDPR and PCI DSS will ruin your day)
- Redact PII at the source, not downstream — a pipeline filter is a second line of defense, not the first
- Don't log entire request/response bodies unless you have a specific, scoped reason — that's where the leaks live
# Bad — logs the full user object, which might contain PII
logger.info('User updated', extra={'user': user.__dict__})

# Good — log only the fields you need
logger.info('User updated', extra={
    'event': 'user_profile_updated',
    'user_id': user.id,
    'fields_changed': list(changed_fields.keys())  # field names, not values
})
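If you do want that second line of defense at the logging boundary, a small recursive redaction helper is cheap insurance. The key list here is illustrative; extend it for your domain:

```python
# Keys whose values must never reach a log sink (illustrative list)
SENSITIVE_KEYS = {"password", "token", "api_key", "card_number", "ssn"}

def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive values replaced,
    recursing into nested dicts."""
    out = {}
    for key, value in payload.items():
        if key.lower() in SENSITIVE_KEYS:
            out[key] = "[REDACTED]"
        elif isinstance(value, dict):
            out[key] = redact(value)
        else:
            out[key] = value
    return out
```

Call it on anything you pass to `extra` that originated outside your code. It is a backstop, not a substitute for logging only what you need.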
Log Levels: Use Them Deliberately
Structured logs with wrong levels are still a mess. Here's the map I use:
| Level | When to use | Production volume |
|---|---|---|
| `DEBUG` | Variable values, step tracing | Off by default |
| `INFO` | Normal business events (login, purchase, export) | Sampled (10–30%) |
| `WARN` | Expected problems, graceful degradation | 100% |
| `ERROR` | Failures requiring attention | 100% |
| `FATAL` / `CRITICAL` | Service is going down | 100% |
One point current observability guidance keeps stressing: log levels must be configurable at runtime, without a redeploy. When an incident happens, you need to temporarily crank up verbosity on one service without touching the others. If that requires a deployment, you've added 10+ minutes to every incident response.
The Structured Logging Checklist
Before you consider a service production-ready, verify every item:
- All log output is valid JSON (test with `jq .`)
- Every entry has: `timestamp` (ISO 8601, UTC), `level`, `service`, `version`, `environment`
- Every request-scoped entry has a `request_id` or `trace_id`
- Business events use a consistent `event` field with machine-readable names
- `duration_ms` is logged for all external calls (DB, APIs, queues)
- Errors include `error_type` and the full stack trace as a single escaped string (not multi-line)
- No passwords, tokens, or full PII in any log entry
- Log levels are configurable at runtime without redeployment
- `DEBUG` logs are off by default in production
- Multi-line stack traces are serialized as single escaped strings
- You've run the output through `jq` and confirmed it parses cleanly
- `python-json-logger` is pinned to >= 3.3.0 if used in Python
Ask The Guild
Community prompt: What's the worst debugging experience you've had because of poor logging? Share your story — the chaos of the 3 AM incident, what the logs said (or didn't say), and how you'd write them differently now. Bonus points if you have a before/after of a log line that changed how you debug. Drop it in the comments and let's build a hall of shame (and glory) for log formats.
Tom Hundley is a software architect with 25 years of experience. He has been on more 3 AM bridge calls than he'd like to count, and believes that good observability is an act of kindness toward your future on-call self.