What to Log, What to Skip, What to Never Record
Series: Production Ready — Part 3 of 30
The 4.7 Million Record Mistake That Wasn't a Hack
In April 2025, Blue Shield of California notified 4.7 million members of a data breach. No criminal hacker was involved. No zero-day exploit. No ransomware gang.
A developer had installed Google Analytics — the most common website tracking tool on the internet — and the default configuration silently forwarded member health data to Google Ads for nearly three years. Insurance plan types, doctor search queries, patient names, health claim dates. All of it flowing into an advertising platform, completely undetected, because someone didn't think carefully about what their analytics tool was recording.
The fix, once they found it, was a configuration change. The damage — legal exposure, a federal HIPAA breach report, and a reputational gut punch affecting millions of people — was enormous.
This is the logging trap that kills production apps: it's almost never an attacker who finds the sensitive data in your logs first. It's an audit. A misconfiguration. A junior dev who runs a debug command in the wrong environment. A log aggregation tool that forwards everything to a third-party SaaS dashboard with weaker security than your production systems.
You built logging to help yourself. Used carelessly, it becomes a liability that helps everyone else.
Let's fix that.
Why Logging Is Harder Than It Looks
When you're deep in a debugging session and something is behaving strangely, the instinct is to log everything. Dump the entire request. Print the full response object. Log the user object so you can see exactly what's happening.
That instinct is correct in development. It is dangerous in production.
The problem is threefold:
Logs outlive the code that generated them. That debug line you added during a fire drill two years ago? Still shipping to your log aggregator. Still searchable. Still potentially visible to anyone with log access.
Log access is often broader than data access. The database has row-level security, column masking, role-based access. Logs frequently have none of that. Anyone with access to your logging dashboard sees everything.
Third-party log destinations inherit all your data. If you're using Datadog, Splunk, Papertrail, or Elastic Cloud, your logs are being transmitted to and stored on external infrastructure. That's fine — until you log a password.
The Three Categories
Let me give you a practical framework. Every piece of data you consider logging falls into one of three buckets.
Category 1: Log This — Operational Signal
These are the events that make your on-call rotation survivable. They answer the question: What is the system doing?
Log these:
- HTTP request method, path, and status code (not full URL with query params if they carry tokens)
- Response times and latency percentiles
- Database query execution times (without the bound parameter values if they're user data)
- Cache hit/miss rates
- Job queue depths and processing times
- External API call outcomes: success, failure, timeout
- Application startup and shutdown events
- Configuration values that were loaded (not secrets)
- Authentication events: login succeeded/failed, token issued, session expired
- Business events: order placed, payment processed, user registered (with IDs, not the full objects)
```python
import logging
import time

logger = logging.getLogger(__name__)

def process_order(order_id: str, user_id: str):
    start = time.time()
    try:
        result = _run_order_logic(order_id)
        duration_ms = (time.time() - start) * 1000
        logger.info(
            "order.processed",
            extra={
                "order_id": order_id,
                "user_id": user_id,
                "duration_ms": round(duration_ms, 2),
                "status": "success"
            }
        )
        return result
    except Exception as e:
        logger.error(
            "order.failed",
            extra={
                "order_id": order_id,
                "user_id": user_id,
                "error_type": type(e).__name__
            },
            exc_info=True
        )
        raise
```
Notice what's in that log: IDs, duration, outcome, error type. Not the order contents. Not the user's name or email. IDs are references — they let you look up the real data in the database, under the access controls that belong there.
Category 2: Skip This — Low Signal, High Noise
These events are tempting to log but add cost and clutter without meaningful operational value.
Skip these in production:
- Individual function entry/exit unless profiling a specific bottleneck
- Loop iterations ("Processing item 1 of 10,000…")
- Successful health check pings (these will be your most common log line and carry zero information)
- Every SQL query in an ORM (enable slow query logging instead, with a threshold)
- Verbose framework internals you didn't write and can't act on
```javascript
// BAD: This will spam 50,000 lines a day in production
app.use((req, res, next) => {
  console.log(`[DEBUG] Middleware entered for ${req.method} ${req.url}`);
  console.log(`[DEBUG] Headers: ${JSON.stringify(req.headers)}`);
  next();
});

// GOOD: Log at the right level, only what you can act on
app.use((req, res, next) => {
  if (process.env.LOG_LEVEL === 'debug') {
    logger.debug('request.received', { method: req.method, path: req.path });
  }
  next();
});
```
High log volume is not just a storage cost problem. Noisy logs bury the signal. When your alerting system is drowning in `DEBUG: health check ok` entries, the actual error that matters is harder to find. Logging is not free — Datadog, Splunk, and every other major SaaS log platform charges by ingestion volume. Logging everything is an expensive way to make your logs less useful.
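One practical way to keep the health-check noise out entirely is a logging filter that drops those records before they reach any handler. The sketch below assumes your request middleware attaches the request path as a `path` field via `extra`; the attribute name and the `/healthz` route are illustrative, not part of any standard:

```python
import logging

class HealthCheckFilter(logging.Filter):
    """Drop successful health-check request logs before they ship."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record. `path` is whatever attribute
        # your request middleware attaches via `extra` (an assumption here).
        return getattr(record, "path", None) != "/healthz"

logger = logging.getLogger("app.request")
logger.addFilter(HealthCheckFilter())
```

Attach the filter to the request logger (or a specific handler) and the health-check lines never hit your aggregator, so they cost nothing and hide nothing.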
Category 3: Never Record This
This is the hard line. These items must never appear in any log, under any circumstances, in any environment including development and staging.
Never log:
- Passwords (raw, hashed, or partial)
- API keys, tokens, and secrets
- Credit card numbers, CVVs, or bank account numbers
- Social Security numbers, government IDs, passport numbers
- Full authentication tokens (JWT, OAuth, session tokens)
- Medical or health information (PHI under HIPAA)
- Private encryption keys or certificates
- Full addresses combined with names (together they become PII)
- Race, religion, biometric data, or any GDPR Special Category data
This is not theoretical. In March 2025, a CVE was filed against Ansible Automation Platform (CVE-2025-2877) because the debug logging mode was writing inventory passwords to log output in plaintext. A CVSS 6.5 severity vulnerability — not from a clever exploit, but from a log statement that should never have shipped. The fix is trivial. The exposure window, for anyone running that configuration in production, was not.
```python
# NEVER do this — a classic "let me debug this login issue" mistake
def authenticate_user(username: str, password: str):
    logger.debug(f"Attempting login for {username} with password {password}")  # FATAL
    ...

# NEVER do this — the token is the secret
def issue_token(user_id: str) -> str:
    token = generate_jwt(user_id)
    logger.info(f"Issued token for user {user_id}: {token}")  # FATAL
    return token

# Do this instead
def authenticate_user(username: str, password: str):
    logger.info("auth.attempt", extra={"username": username})  # log the attempt
    ...

def issue_token(user_id: str) -> str:
    token = generate_jwt(user_id)
    logger.info("token.issued", extra={"user_id": user_id})  # log the event, not the token
    return token
```
Structured Logging: Stop Using Strings
If you're still building log messages by string concatenation, this is the single biggest improvement you can make today.
```python
# Old way — unqueryable, hard to parse, easy to accidentally include bad data
logger.info(f"User {user.email} placed order {order.id} for ${order.total}")

# Structured way — queryable, parseable, and you control exactly what ships
logger.info("order.placed", extra={
    "user_id": user.id,                # ID only, not the email
    "order_id": order.id,
    "total_cents": order.total_cents,  # integers, not formatted currency strings
    "item_count": len(order.items)
})
```
In TypeScript/Node.js, use a structured logger like pino or winston:
```typescript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  redact: ['req.headers.authorization', 'req.body.password', 'req.body.token'],
});

// pino's redact option automatically strips sensitive paths from logs
// even if they slip through from other parts of your code
```
Pino's redact option is a safety net — you define paths that should never appear in output, and pino strips them automatically. Use it. It won't catch everything, but it catches the patterns you know about, consistently.
Sanitizing Before You Ship
Sometimes you need to log data that might contain sensitive fields, and you can't always control what comes in. Build a sanitizer:
```python
from typing import Any

SENSITIVE_KEYS = {
    'password', 'passwd', 'secret', 'token', 'api_key', 'apikey',
    'authorization', 'credit_card', 'card_number', 'cvv', 'ssn',
    'social_security', 'private_key', 'access_token', 'refresh_token'
}

def sanitize_for_logging(data: Any, depth: int = 0) -> Any:
    """Recursively sanitize a dict/list before logging it."""
    if depth > 5:  # avoid infinite recursion on deeply nested structures
        return "[truncated]"
    if isinstance(data, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else sanitize_for_logging(v, depth + 1)
            for k, v in data.items()
        }
    if isinstance(data, list):
        return [sanitize_for_logging(item, depth + 1) for item in data]
    if isinstance(data, str) and len(data) > 1000:
        return data[:200] + "...[truncated]"
    return data

# Usage
logger.info("webhook.received", extra={"payload": sanitize_for_logging(raw_payload)})
```
This is not a guarantee — a determined developer can still log secrets directly. But it removes the accidental case, which is 90% of incidents.
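Note that a key-based sanitizer misses secrets hiding inside values, such as a `Bearer` token embedded in a free-text field. A value-level pass can catch secret-shaped strings too. The patterns below are illustrative and deliberately conservative, not an exhaustive set:

```python
import re

# Secret-shaped value patterns; illustrative, not exhaustive
VALUE_PATTERNS = [
    re.compile(r"Bearer\s+[\w\-.~+/]+=*"),  # Authorization bearer tokens
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWT-shaped strings
    re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),  # card-number-shaped digit runs
]

def redact_values(text: str) -> str:
    """Scrub secret-shaped substrings out of a string value."""
    for pattern in VALUE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Run this over string values inside your sanitizer and the "the secret was in the value, not the key" case gets caught as well. Expect some false positives (any 13-to-16-digit run looks like a card number); for logs, redacting too much is the safe failure mode.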
Log Levels Are Not Decoration
One of the most common mistakes I see in production codebases is logger.info() everywhere. Levels exist for a reason:
| Level | When to use |
|---|---|
| DEBUG | Only in development. Never enable in production by default. |
| INFO | Normal operations. Events you want to see in the happy path. |
| WARNING | Something unexpected happened but the system recovered. Needs review. |
| ERROR | A request or operation failed. Needs investigation. |
| CRITICAL | The service is down or severely degraded. Page someone now. |
In production, run at INFO or WARNING. If you have an incident and need DEBUG, turn it on dynamically for a specific service, investigate, and turn it back off. Spotify does exactly this — DEBUG in dev, INFO in staging, WARN in production, with dynamic level adjustment via feature flags.
Make it configurable without a redeploy:
```shell
# Docker / Kubernetes environment variable
LOG_LEVEL=warning

# Or via your config management — flip this without touching code
```
A Note on Analytics Tools and Telemetry
The Blue Shield incident is a template for a class of logging mistake that's easy to overlook: third-party analytics, tracking pixels, and telemetry SDKs that you install to monitor your users, which end up monitoring your users a little too thoroughly.
Before you add any analytics or monitoring SDK to a page or service that handles authenticated users or sensitive data:
- Read exactly what data the SDK collects by default
- Check whether it phones home to a third-party server
- Disable automatic data collection and explicitly whitelist what you allow
- Audit the integration again six months later — SDK defaults change
This applies to Google Analytics, Mixpanel, Segment, Amplitude, Sentry user context, Datadog RUM, and every other tool in this category. They're all useful. They can all become a pipeline for leaking user data if misconfigured.
Checklist: Logging for Production
Before your next deployment, verify each of these:
- No secrets in logs — run a grep across your codebase:
  `grep -rn "password\|api_key\|secret\|token" --include="*.py" --include="*.ts" | grep -i "log\|print\|console"`
- Log level is set to `INFO` or `WARNING` in production — confirm via environment variable, not hardcoded
- Structured logging in use — no more f-strings or string concatenation in log calls
- Sensitive keys are redacted — either via a sanitizer function or your logger's built-in redact config
- Third-party analytics tools audited — confirmed they're not collecting PII or health data
- Log access is role-restricted — not everyone on the team needs full log access
- Retention policy set — debug/verbose logs purged after 7–14 days; security/audit logs retained per compliance requirements
- Slow query logging enabled with a threshold instead of logging all queries
- Log volume monitored — set an alert if log ingestion spikes unexpectedly (data exfiltration can look like logging noise)
- No full request/response body logging on authenticated endpoints
Ask The Guild
Here's a scenario worth thinking about and discussing with the community:
Your team is debugging a production issue. Someone suggests temporarily enabling DEBUG level logging on the authentication service to trace a token validation bug. The change takes 30 seconds to deploy. What's your process for doing this safely — and how do you make sure you turn it off?
Drop your team's process in the Guild Discord. Bonus points if you've been burned by a debug log that stayed on longer than intended — those stories are the best teachers.
Tom Hundley is a software architect who has spent 25 years helping developers avoid the mistakes that take down production systems at 2 AM. He writes the Production Ready series for the AI Coding Guild.