Monitoring Architecture: Metrics, Logs, Traces, Alerts
Architecture Patterns -- Part 26 of 30
It is 2:47 AM on a Tuesday. A fintech startup's checkout flow has been silently returning 503s for eleven minutes. No one knows. The on-call engineer is asleep. The monitoring system is firing alerts -- but it has been firing alerts every four hours for three months about CPU spikes, disk usage, and a staging environment that no one cleaned up. The on-call engineer trained himself to dismiss them before he was fully awake.
By morning, the team discovers they lost 40 minutes of transaction processing. The post-mortem is brutal. "We had metrics," someone says. "We had logs." They had data everywhere and visibility nowhere.
That is not a monitoring problem. That is a monitoring architecture problem.
The Three Pillars, and Why You Need All of Them
The three pillars of observability -- metrics, logs, and traces -- are not interchangeable. Each answers a different question:
- Metrics: Is something wrong? (rate, latency, error count)
- Logs: What exactly happened? (the narrative of an event)
- Traces: Where did it go wrong? (the path of a single request through your system)
The failure mode for most teams is treating these as separate tools rather than a unified story. A metric tells you checkout error rate jumped at 2:36 AM. A trace shows you the request hit the payment service and hung at the Stripe API call. A log tells you the Stripe client threw a timeout with a specific request ID.
You need all three to close the loop from "something is broken" to "here is exactly what broke and why."
As of 2025, the industry has converged on OpenTelemetry as the instrumentation standard that collects all three. Adoption for new cloud-native instrumentation has crossed 95%, and 81% of users consider it production-ready. The debate about whether to adopt it is over.
Metrics: What to Actually Measure
Most teams measure the wrong things. System-level CPU, memory, and disk are seductive because they are easy to collect. But they rarely tell you what your users are experiencing.
Use two frameworks to decide what to instrument.
The RED Method (for services):
- Rate: requests per second
- Errors: percentage of requests that fail
- Duration: distribution of request latency (p50, p95, p99)
The USE Method (for resources -- databases, queues, caches):
- Utilization: how busy the resource is (percentage of time)
- Saturation: how much work is queued waiting
- Errors: error events
The p99 latency number deserves particular attention. Your p50 (median) might be fine while your p99 is degraded -- meaning your slowest users (often on mobile, in bad network conditions, or hitting cold-start serverless functions) are having a terrible experience that averages out in your dashboard.
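To see why the median hides tail pain, here is a toy, self-contained percentile computation over a sample distribution with a slow tail (real metric backends derive percentiles from histogram buckets, not raw samples; the numbers are illustrative):

```typescript
// Nearest-rank percentile over raw latency samples (illustrative only).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// 95 fast requests, 5 very slow ones: the median looks healthy.
const latencies = [...Array(95).fill(40), ...Array(5).fill(3000)];

console.log(percentile(latencies, 50)); // 40 -- dashboard looks fine
console.log(percentile(latencies, 99)); // 3000 -- 1 in 100 users waits 3 seconds
```

The p50 and p99 describe the same service, yet one says "healthy" and the other says "broken for your unluckiest users."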
```typescript
// Custom metric with the OpenTelemetry SDK
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout-service");

const checkoutDuration = meter.createHistogram("checkout.duration", {
  description: "Duration of checkout requests in milliseconds",
  unit: "ms",
});

export async function processCheckout(cart: Cart) {
  const start = Date.now();
  try {
    const result = await stripe.paymentIntents.create({ ... });
    checkoutDuration.record(Date.now() - start, { status: "success" });
    return result;
  } catch (err) {
    checkoutDuration.record(Date.now() - start, { status: "error" });
    throw err;
  }
}
```
Logs: Structured or Useless
We covered what to log in Day 24. The principle here is simpler: if your logs are not structured JSON, they are nearly impossible to query at scale.
```typescript
// Bad: unstructured log
console.log(`User ${userId} failed to pay: ${err.message}`);

// Good: structured JSON log
logger.error("payment_failed", {
  userId,
  orderId,
  errorCode: err.code,
  errorMessage: err.message,
  traceId: span.spanContext().traceId,
  durationMs: Date.now() - start,
});
```
That traceId field is the key. It is what connects a log entry to a distributed trace, letting you jump from "this error happened" to "here is the full request journey that produced it."
Log level discipline matters more than most teams admit:
- ERROR: something broke that needs human attention
- WARN: something unexpected happened but the request succeeded
- INFO: normal business events (user signed up, order placed)
- DEBUG: development-only, never in production by default
The discipline is not adding levels -- it is not logging everything at ERROR. When everything is urgent, nothing is.
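The levels above can be enforced mechanically. A minimal sketch of level gating for a structured logger (illustrative only; in practice use pino, winston, or your platform's logger, and note that the "info" production default here is an assumption):

```typescript
type Level = "debug" | "info" | "warn" | "error";
const LEVELS: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };
const MIN_LEVEL: Level = "info"; // assumed production default: DEBUG is dropped

// Returns the JSON line to emit, or null if the level is below the threshold.
function serializeLog(
  level: Level,
  event: string,
  fields: Record<string, unknown> = {}
): string | null {
  if (LEVELS[level] < LEVELS[MIN_LEVEL]) return null;
  return JSON.stringify({ level, event, ...fields });
}

console.log(serializeLog("debug", "cache_miss", { key: "user:42" })); // null -- suppressed
console.log(serializeLog("error", "payment_failed", { orderId: "o_123" }));
```

The point of the gate is that DEBUG costs nothing in production and ERROR is reserved for lines a human must act on.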
Traces: Following a Request Through the Dark
Distributed tracing solves a specific problem: in a system with multiple services, a single user request might touch your Next.js API route, a Node.js background worker, a PostgreSQL database, a Redis cache, and a third-party API. When that request is slow, which hop is slow?
A trace is a tree of spans. Each span represents a unit of work -- an HTTP request, a database query, a cache lookup -- with a start time, duration, and metadata. All spans in a single request share a traceId.
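The shape of a trace can be sketched as plain data. The field names below mirror, but deliberately simplify, the real OpenTelemetry span model, and the span contents are invented for illustration:

```typescript
// Toy model: a trace is spans sharing one traceId, linked by parentSpanId.
interface Span {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  durationMs: number;
}

const traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // W3C-style 128-bit trace id

const spans: Span[] = [
  { traceId, spanId: "a1", name: "POST /api/checkout", startMs: 0, durationMs: 1900 },
  { traceId, spanId: "b2", parentSpanId: "a1", name: "SELECT cart items", startMs: 5, durationMs: 12 },
  { traceId, spanId: "c3", parentSpanId: "a1", name: "stripe.paymentIntents.create", startMs: 20, durationMs: 1850 },
];

// The slowest child span explains the slow parent:
const slowest = spans
  .filter((s) => s.parentSpanId)
  .sort((a, b) => b.durationMs - a.durationMs)[0];
console.log(slowest.name); // stripe.paymentIntents.create
```

That last query is what a trace viewer does for you visually: the 1,900 ms request is slow because 1,850 ms of it sits inside one external call.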
OpenTelemetry makes this automatic for most of your HTTP and database calls once you instrument it:
```typescript
// instrumentation.ts (Next.js 14+ App Router)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

```javascript
// next.config.js -- enable the instrumentation hook
module.exports = {
  experimental: {
    instrumentationHook: true,
  },
};
```
Sentry's Next.js SDK now uses OpenTelemetry under the hood, meaning any OTel spans you create will automatically appear in Sentry traces. You get distributed tracing without running a separate trace backend.
Alerts: The Art of Not Crying Wolf
The data on alert fatigue is damning. According to incident.io's 2025 research, 67% of alerts are ignored daily, with an 85% false positive rate. Runframe's State of Incident Management 2026 found that 73% of organizations experienced outages caused by ignored or suppressed alerts. The fintech team in the opening story is not unusual -- they are the majority.
Two principles fix most alert fatigue problems:
Alert on symptoms, not causes.
"Error rate above 5%" is a symptom. "CPU above 80%" is a cause -- and often not a problem at all. Customers do not care that your CPU is high; they care if their requests are failing or slow. Alert on what users experience.
Every alert must have a clear action.
If an engineer cannot answer "what do I do when this fires?" within 30 seconds, the alert is not ready to go to production. If the answer is "look at the dashboard and decide," the alert is noise. Write the runbook before you write the alert.
Severity tiers that actually work:
| Severity | Criteria | Response |
|---|---|---|
| P1 - Critical | Error rate >5% or p99 latency >5s | Page on-call immediately, 24/7 |
| P2 - High | Error rate >1% or p99 >2s for 10+ min | Notify on-call during business hours |
| P3 - Medium | Degraded but not user-impacting | Ticket for next sprint |
| P4 - Low | Trend worth watching | Weekly review |
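The thresholds in the table translate directly into an alert-classification rule. This sketch hardcodes the table's numbers (the ServiceHealth shape and the thresholds themselves are this article's example, not a standard):

```typescript
type Severity = "P1" | "P2" | null;

interface ServiceHealth {
  errorRate: number;        // fraction of failed requests, e.g. 0.02 = 2%
  p99LatencyMs: number;
  sustainedMinutes: number; // how long the condition has held
}

function classify(h: ServiceHealth): Severity {
  // P1: page immediately, 24/7
  if (h.errorRate > 0.05 || h.p99LatencyMs > 5000) return "P1";
  // P2: notify on-call during business hours, only if sustained 10+ minutes
  if ((h.errorRate > 0.01 || h.p99LatencyMs > 2000) && h.sustainedMinutes >= 10) return "P2";
  // P3/P4 are trend reviews and tickets, not threshold alerts
  return null;
}

console.log(classify({ errorRate: 0.06, p99LatencyMs: 400, sustainedMinutes: 1 }));  // P1
console.log(classify({ errorRate: 0.02, p99LatencyMs: 800, sustainedMinutes: 12 })); // P2
```

Note that P2 requires the condition to be sustained: a single slow minute never wakes anyone.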
Monitoring Stack by Scale
| Scale | Stack | Approximate Cost | Tradeoffs |
|---|---|---|---|
| Solo / small team | Vercel Analytics + Sentry + Axiom | $0-50/mo | Fast setup, minimal ops overhead, Vercel-native |
| Growing team (5-20 eng) | Grafana Cloud + OpenTelemetry Collector | $19-200/mo | Generous free tier, OTel-native, flexible backends |
| Enterprise / cost-sensitive | Self-hosted Grafana + Prometheus + Jaeger | Infra cost only | Full control, engineering overhead, no managed SLAs |
| Full-featured SaaS | Datadog | $23+/host/month | Best integrations, highest cost -- $40/host for APM |
For most teams building on Vercel, the pragmatic starting point is:
- Sentry for error tracking and performance monitoring -- one
npm install, connected to Vercel in minutes - Axiom for structured log ingestion from Next.js (wrap your config with
withAxiom(), done) - Vercel Web Analytics for user-facing metrics without a separate SDK
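The Axiom wiring really is one wrapper, assuming the next-axiom package:

```javascript
// next.config.js -- ship structured Next.js logs to Axiom via next-axiom
const { withAxiom } = require("next-axiom");

module.exports = withAxiom({
  // ...your existing Next.js config
});
```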
This stack covers errors, logs, and basic performance at near-zero operational cost. When you outgrow it, you have OpenTelemetry instrumentation in place and can export to Grafana Cloud or Datadog without rewriting your application code.
Sentry in a Next.js App: The Minimum Viable Setup
```shell
npx @sentry/wizard@latest -i nextjs
```
That single command configures sentry.client.config.ts, sentry.server.config.ts, and sentry.edge.config.ts, and adds the Sentry source maps upload to your build.
The two things most teams skip:
```typescript
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.1, // 10% of transactions -- start here, not 100%
  profilesSampleRate: 0.1,
  // Filter known-noise events before they are sent
  beforeSend(event) {
    if (event.exception?.values?.[0]?.type === "ChunkLoadError") {
      return null; // Don't send known browser cache issues
    }
    return event;
  },
});
```
Set tracesSampleRate to something less than 1.0 immediately. At 100%, you will hit Sentry's quota limits fast and disable tracing entirely -- the worst possible outcome.
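A flat sample rate throws away the transactions you care most about at the same rate as the ones you don't. Sentry also accepts a tracesSampler function for per-transaction decisions; this sketch uses a simplified context shape (the real samplingContext object differs), with routes invented for illustration:

```typescript
// Dynamic sampling decision, in the spirit of Sentry's tracesSampler option.
function tracesSampler(ctx: { name: string }): number {
  if (ctx.name.includes("/api/health")) return 0;   // never trace health checks
  if (ctx.name.includes("/api/checkout")) return 1; // always trace the money path
  return 0.1;                                       // 10% baseline everywhere else
}

console.log(tracesSampler({ name: "GET /api/health" }));    // 0
console.log(tracesSampler({ name: "POST /api/checkout" })); // 1
```

The effect: health-check noise costs zero quota, checkout failures are always fully traced, and everything else stays within budget.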
The Monitoring Maturity Model
Most teams do not need to build all of this at once. Here is a practical progression:
Level 1 -- You know when it's broken: Sentry for errors, basic uptime check. You find out from Sentry, not from users.
Level 2 -- You know how broken it is: Add structured logging (Axiom or Logtail), add error rate and latency metrics. You can answer "how many users were affected?"
Level 3 -- You know why it's broken: Add distributed traces via OpenTelemetry. You can answer "which service, which query, which external call caused the slowdown?"
Level 4 -- You know before users do: Add anomaly detection, SLO-based alerting, and capacity trend analysis. You catch degradation before it crosses error thresholds.
Most small teams should get to Level 2 before deploying to production, Level 3 before they have paying customers depending on the system.
Dashboard Design: Focus Over Coverage
The "single pane of glass" -- one dashboard that shows everything -- is a trap. A dashboard that shows 40 metrics simultaneously is a dashboard no one looks at.
Better principle: one dashboard per role and scenario.
- Service health dashboard: RED metrics for each service, error rate prominently at top
- Incident dashboard: linked from alerts, shows only the metrics relevant to diagnosing that specific alert
- Business metrics dashboard: orders, signups, revenue -- for leadership, no technical jargon
The test for a good dashboard: if you are woken up at 2:47 AM and need to decide in 60 seconds whether to page your whole team, does this dashboard give you that answer?
Decision Checklist
Before shipping any new service or feature to production, verify:
- Error rate alert configured with a defined runbook
- p99 latency alert with a threshold based on user experience (not an arbitrary number)
- Structured logging in place with traceId on every log entry
- OpenTelemetry tracing configured -- at least for HTTP and database calls
- Alert severity tiers defined -- not everything is P1
- Dashboard exists that answers "is this working right now?" in under 60 seconds
- On-call rotation documented -- who gets paged and when
- Sentry (or equivalent) tracesSampleRate set below 1.0 in production
Ask The Guild
What is your current monitoring stack, and what is the gap you have not filled yet? Are you on a solo/small-team setup that has outgrown itself, or still missing distributed tracing after years of trying to get there?
Drop your stack in the community thread -- specifically, what the last incident taught you about what you were missing.