Monitoring Architecture: Metrics, Logs, Traces, Alerts
Architecture Patterns -- Part 26 of 30
It is 2:47 AM on a Tuesday. A fintech startup's checkout flow has been silently returning 503s for eleven minutes. No one knows. The on-call engineer is asleep. The monitoring system is firing alerts -- but it has been firing alerts every four hours for three months about CPU spikes, disk usage, and a staging environment that no one cleaned up. The on-call engineer trained himself to dismiss them before he was fully awake.
By morning, the team discovers they lost 40 minutes of transaction processing. The post-mortem is brutal. "We had metrics," someone says. "We had logs." They had data everywhere and visibility nowhere.
That is not a monitoring problem. That is a monitoring architecture problem.
The Three Pillars, and Why You Need All of Them
The three pillars of observability -- metrics, logs, and traces -- are not interchangeable. Each answers a different question:
- Metrics: Is something wrong? (rate, latency, error count)
- Logs: What exactly happened? (the narrative of an event)
- Traces: Where did it go wrong? (the path of a single request through your system)
The failure mode for most teams is treating these as separate tools rather than a unified story. A metric tells you checkout error rate jumped at 2:36 AM. A trace shows you the request hit the payment service and hung at the Stripe API call. A log tells you the Stripe client threw a timeout with a specific request ID.
You need all three to close the loop from "something is broken" to "here is exactly what broke and why."
As of 2025, the industry has converged on OpenTelemetry as the instrumentation standard that collects all three. Adoption for new cloud-native instrumentation has crossed 95%, and 81% of users consider it production-ready. The debate about whether to adopt it is over.
Metrics: What to Actually Measure
Most teams measure the wrong things. System-level CPU, memory, and disk are seductive because they are easy to collect. But they rarely tell you what your users are experiencing.
Use two frameworks to decide what to instrument.
The RED Method (for services):
- Rate: requests per second
- Errors: percentage of requests that fail
- Duration: distribution of request latency (p50, p95, p99)
The USE Method (for resources -- databases, queues, caches):
- Utilization: how busy the resource is (percentage of time)
- Saturation: how much work is queued waiting
- Errors: error events
The p99 latency number deserves particular attention. Your p50 (median) might be fine while your p99 is degraded -- meaning your slowest users (often on mobile, in bad network conditions, or hitting cold-start serverless functions) are having a terrible experience that averages out in your dashboard.
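To see why the median hides tail pain, here is a toy, self-contained percentile computation over a sample distribution with a slow tail (real metric backends derive percentiles from histogram buckets, not raw samples; the numbers are illustrative):

```typescript
// Nearest-rank percentile over raw latency samples (illustrative only).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

// 95 fast requests, 5 very slow ones: the median looks healthy.
const latencies = [...Array(95).fill(40), ...Array(5).fill(3000)];

console.log(percentile(latencies, 50)); // 40 -- dashboard looks fine
console.log(percentile(latencies, 99)); // 3000 -- 1 in 100 users waits 3 seconds
```

The p50 and p99 describe the same service, yet one says "healthy" and the other says "broken for your unluckiest users."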
```typescript
// Custom metric with the OpenTelemetry SDK
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("checkout-service");

const checkoutDuration = meter.createHistogram("checkout.duration", {
  description: "Duration of checkout requests in milliseconds",
  unit: "ms",
});

export async function processCheckout(cart: Cart) {
  const start = Date.now();
  try {
    const result = await stripe.paymentIntents.create({ ... });
    checkoutDuration.record(Date.now() - start, { status: "success" });
    return result;
  } catch (err) {
    checkoutDuration.record(Date.now() - start, { status: "error" });
    throw err;
  }
}
```
Logs: Structured or Useless
We covered what to log in Day 24. The principle here is simpler: if your logs are not structured JSON, they are nearly impossible to query at scale.
```typescript
// Bad: unstructured log
console.log(`User ${userId} failed to pay: ${err.message}`);

// Good: structured JSON log
logger.error("payment_failed", {
  userId,
  orderId,
  errorCode: err.code,
  errorMessage: err.message,
  traceId: span.spanContext().traceId,
  durationMs: Date.now() - start,
});
```
That traceId field is the key. It is what connects a log entry to a distributed trace, letting you jump from "this error happened" to "here is the full request journey that produced it."
Log level discipline matters more than most teams admit:
- ERROR: something broke that needs human attention
- WARN: something unexpected happened but the request succeeded
- INFO: normal business events (user signed up, order placed)
- DEBUG: development-only, never in production by default
The discipline is not adding levels -- it is not logging everything at ERROR. When everything is urgent, nothing is.
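The levels above can be enforced mechanically. A minimal sketch of level gating for a structured logger (illustrative only; in practice use pino, winston, or your platform's logger, and note that the "info" production default here is an assumption):

```typescript
type Level = "debug" | "info" | "warn" | "error";
const LEVELS: Record<Level, number> = { debug: 10, info: 20, warn: 30, error: 40 };
const MIN_LEVEL: Level = "info"; // assumed production default: DEBUG is dropped

// Returns the JSON line to emit, or null if the level is below the threshold.
function serializeLog(
  level: Level,
  event: string,
  fields: Record<string, unknown> = {}
): string | null {
  if (LEVELS[level] < LEVELS[MIN_LEVEL]) return null;
  return JSON.stringify({ level, event, ...fields });
}

console.log(serializeLog("debug", "cache_miss", { key: "user:42" })); // null -- suppressed
console.log(serializeLog("error", "payment_failed", { orderId: "o_123" }));
```

The point of the gate is that DEBUG costs nothing in production and ERROR is reserved for lines a human must act on.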
Traces: Following a Request Through the Dark
Distributed tracing solves a specific problem: in a system with multiple services, a single user request might touch your Next.js API route, a Node.js background worker, a PostgreSQL database, a Redis cache, and a third-party API. When that request is slow, which hop is slow?
A trace is a tree of spans. Each span represents a unit of work -- an HTTP request, a database query, a cache lookup -- with a start time, duration, and metadata. All spans in a single request share a traceId.
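The shape of a trace can be sketched as plain data. The field names below mirror, but deliberately simplify, the real OpenTelemetry span model, and the span contents are invented for illustration:

```typescript
// Toy model: a trace is spans sharing one traceId, linked by parentSpanId.
interface Span {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  durationMs: number;
}

const traceId = "4bf92f3577b34da6a3ce929d0e0e4736"; // W3C-style 128-bit trace id

const spans: Span[] = [
  { traceId, spanId: "a1", name: "POST /api/checkout", startMs: 0, durationMs: 1900 },
  { traceId, spanId: "b2", parentSpanId: "a1", name: "SELECT cart items", startMs: 5, durationMs: 12 },
  { traceId, spanId: "c3", parentSpanId: "a1", name: "stripe.paymentIntents.create", startMs: 20, durationMs: 1850 },
];

// The slowest child span explains the slow parent:
const slowest = spans
  .filter((s) => s.parentSpanId)
  .sort((a, b) => b.durationMs - a.durationMs)[0];
console.log(slowest.name); // stripe.paymentIntents.create
```

That last query is what a trace viewer does for you visually: the 1,900 ms request is slow because 1,850 ms of it sits inside one external call.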
OpenTelemetry makes this automatic for most of your HTTP and database calls once you instrument it:
```typescript
// instrumentation.ts (Next.js 14+ App Router)
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

```javascript
// next.config.js -- enable the instrumentation hook
module.exports = {
  experimental: {
    instrumentationHook: true,
  },
};
```
Sentry's Next.js SDK now uses OpenTelemetry under the hood, meaning any OTel spans you create will automatically appear in Sentry traces. You get distributed tracing without running a separate trace backend.
Alerts: The Art of Not Crying Wolf
The data on alert fatigue is damning. According to incident.io's 2025 research, 67% of alerts are ignored daily, with an 85% false positive rate. Runframe's State of Incident Management 2026 found that 73% of organizations experienced outages caused by ignored or suppressed alerts. The fintech team in the opening story is not unusual -- they are the majority.
Two principles fix most alert fatigue problems:
Alert on symptoms, not causes.
"Error rate above 5%" is a symptom. "CPU above 80%" is a cause -- and often not a problem at all. Customers do not care that your CPU is high; they care if their requests are failing or slow. Alert on what users experience.
Every alert must have a clear action.
If an engineer cannot answer "what do I do when this fires?" within 30 seconds, the alert is not ready to go to production. If the answer is "look at the dashboard and decide," the alert is noise. Write the runbook before you write the alert.
Severity tiers that actually work:
| Severity | Criteria | Response |
|---|---|---|
| P1 - Critical | Error rate >5% or p99 latency >5s | Page on-call immediately, 24/7 |
| P2 - High | Error rate >1% or p99 >2s for 10+ min | Notify on-call during business hours |
| P3 - Medium | Degraded but not user-impacting | Ticket for next sprint |
| P4 - Low | Trend worth watching | Weekly review |
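The thresholds in the table translate directly into an alert-classification rule. This sketch hardcodes the table's numbers (the ServiceHealth shape and the thresholds themselves are this article's example, not a standard):

```typescript
type Severity = "P1" | "P2" | null;

interface ServiceHealth {
  errorRate: number;        // fraction of failed requests, e.g. 0.02 = 2%
  p99LatencyMs: number;
  sustainedMinutes: number; // how long the condition has held
}

function classify(h: ServiceHealth): Severity {
  // P1: page immediately, 24/7
  if (h.errorRate > 0.05 || h.p99LatencyMs > 5000) return "P1";
  // P2: notify on-call during business hours, only if sustained 10+ minutes
  if ((h.errorRate > 0.01 || h.p99LatencyMs > 2000) && h.sustainedMinutes >= 10) return "P2";
  // P3/P4 are trend reviews and tickets, not threshold alerts
  return null;
}

console.log(classify({ errorRate: 0.06, p99LatencyMs: 400, sustainedMinutes: 1 }));  // P1
console.log(classify({ errorRate: 0.02, p99LatencyMs: 800, sustainedMinutes: 12 })); // P2
```

Note that P2 requires the condition to be sustained: a single slow minute never wakes anyone.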
Monitoring Stack by Scale
| Scale | Stack | Approximate Cost | Tradeoffs |
|---|---|---|---|
| Solo / small team | Vercel Analytics + Sentry + Axiom | $0-50/mo | Fast setup, minimal ops overhead, Vercel-native |
| Growing team (5-20 eng) | Grafana Cloud + OpenTelemetry Collector | $19-200/mo | Generous free tier, OTel-native, flexible backends |
| Enterprise / cost-sensitive | Self-hosted Grafana + Prometheus + Jaeger | Infra cost only | Full control, engineering overhead, no managed SLAs |
| Full-featured SaaS | Datadog | $23+/host/month | Best integrations, highest cost -- $40/host for APM |
For most teams building on Vercel, the pragmatic starting point is:
- Sentry for error tracking and performance monitoring -- one
npm install, connected to Vercel in minutes - Axiom for structured log ingestion from Next.js (wrap your config with
withAxiom(), done) - Vercel Web Analytics for user-facing metrics without a separate SDK
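The Axiom wiring really is one wrapper, assuming the next-axiom package:

```javascript
// next.config.js -- ship structured Next.js logs to Axiom via next-axiom
const { withAxiom } = require("next-axiom");

module.exports = withAxiom({
  // ...your existing Next.js config
});
```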
This stack covers errors, logs, and basic performance at near-zero operational cost. When you outgrow it, you have OpenTelemetry instrumentation in place and can export to Grafana Cloud or Datadog without rewriting your application code.
Sentry in a Next.js App: The Minimum Viable Setup
```shell
npx @sentry/wizard@latest -i nextjs
```
That single command configures sentry.client.config.ts, sentry.server.config.ts, and sentry.edge.config.ts, and adds the Sentry source maps upload to your build.
The two things most teams skip:
```typescript
// sentry.server.config.ts
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  tracesSampleRate: 0.1, // 10% of transactions -- start here, not 100%
  profilesSampleRate: 0.1,
  // Filter known-noise events before they are sent
  beforeSend(event) {
    if (event.exception?.values?.[0]?.type === "ChunkLoadError") {
      return null; // Don't send known browser cache issues
    }
    return event;
  },
});
```
Set tracesSampleRate to something less than 1.0 immediately. At 100%, you will hit Sentry's quota limits fast and disable tracing entirely -- the worst possible outcome.
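A flat sample rate throws away the transactions you care most about at the same rate as the ones you don't. Sentry also accepts a tracesSampler function for per-transaction decisions; this sketch uses a simplified context shape (the real samplingContext object differs), with routes invented for illustration:

```typescript
// Dynamic sampling decision, in the spirit of Sentry's tracesSampler option.
function tracesSampler(ctx: { name: string }): number {
  if (ctx.name.includes("/api/health")) return 0;   // never trace health checks
  if (ctx.name.includes("/api/checkout")) return 1; // always trace the money path
  return 0.1;                                       // 10% baseline everywhere else
}

console.log(tracesSampler({ name: "GET /api/health" }));    // 0
console.log(tracesSampler({ name: "POST /api/checkout" })); // 1
```

The effect: health-check noise costs zero quota, checkout failures are always fully traced, and everything else stays within budget.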
The Monitoring Maturity Model
Most teams do not need to build all of this at once. Here is a practical progression:
Level 1 -- You know when it's broken: Sentry for errors, basic uptime check. You find out from Sentry, not from users.
Level 2 -- You know how broken it is: Add structured logging (Axiom or Logtail), add error rate and latency metrics. You can answer "how many users were affected?"
Level 3 -- You know why it's broken: Add distributed traces via OpenTelemetry. You can answer "which service, which query, which external call caused the slowdown?"
Level 4 -- You know before users do: Add anomaly detection, SLO-based alerting, and capacity trend analysis. You catch degradation before it crosses error thresholds.
Most small teams should get to Level 2 before deploying to production, Level 3 before they have paying customers depending on the system.
Dashboard Design: Focus Over Coverage
The "single pane of glass" -- one dashboard that shows everything -- is a trap. A dashboard that shows 40 metrics simultaneously is a dashboard no one looks at.
Better principle: one dashboard per role and scenario.
- Service health dashboard: RED metrics for each service, error rate prominently at top
- Incident dashboard: linked from alerts, shows only the metrics relevant to diagnosing that specific alert
- Business metrics dashboard: orders, signups, revenue -- for leadership, no technical jargon
The test for a good dashboard: if you are woken up at 2:47 AM and need to decide in 60 seconds whether to page your whole team, does this dashboard give you that answer?
Decision Checklist
Before shipping any new service or feature to production, verify:
- Error rate alert configured with a defined runbook
- p99 latency alert with a threshold based on user experience (not an arbitrary number)
- Structured logging in place with traceId on every log entry
- OpenTelemetry tracing configured -- at least for HTTP and database calls
- Alert severity tiers defined -- not everything is P1
- Dashboard exists that answers "is this working right now?" in under 60 seconds
- On-call rotation documented -- who gets paged and when
- Sentry (or equivalent) tracesSampleRate set below 1.0 in production
Ask The Guild
What is your current monitoring stack, and what is the gap you have not filled yet? Are you on a solo/small-team setup that has outgrown itself, or still missing distributed tracing after years of trying to get there?
Drop your stack in the community thread -- specifically, what the last incident taught you about what you were missing.