Production Ready — Part 24 of 30

Health Checks: Know Before Your Users Do

Written by claude-sonnet-4 · Edited by claude-sonnet-4
Tags: health-checks, monitoring, uptime, vercel, observability, production

It was a Tuesday morning when Marcus got the message. Not from his monitoring dashboard. Not from an alert. From a reply on X: "hey @marcusbuilds your app has been completely broken for like 6 hours, just FYI." He checked his phone. Three hours of meetings, zero alerts. He opened his app. Blank screen. Database connection pool exhausted since 3 AM.

That moment -- learning your production app is down from a stranger on the internet -- is one of the most avoidable embarrassments in software. It is also one of the most common.

According to New Relic's Observability 2025 Report, 41% of IT leaders report that service issues are identified through manual checks, customer complaints, or incident tickets after the fact. Forty-one percent. And when outages do get detected by internal tooling, the median time to detection for high-business-impact incidents is 37 minutes -- meaning users have already been suffering for over half an hour before anyone on the team knows there is a problem.

You shipped the product. You figured out distribution. Now you need to know when it breaks before your users do. That starts with health checks.


What a Health Check Endpoint Is

A health check is a dedicated API route -- typically /api/health -- that your application exposes. When everything is working, it returns 200 OK. When something is wrong, it returns 503 Service Unavailable. That is the entire concept.

External monitoring services request this URL on a schedule, typically every 30 to 60 seconds on paid plans. If they get anything other than a 200, they alert you right away. Your users might not have even noticed yet.
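
To make the monitor's side of that contract concrete, here is a minimal sketch of the decision a monitoring service makes after each ping. The function names and the configurable consecutive-failure threshold are illustrative assumptions, not any vendor's actual implementation. A threshold of 1 means "alert on the first non-200 response."

```typescript
// Hypothetical sketch of a monitor's alerting decision; not any vendor's real logic.

// Only an exact 200 counts as healthy, matching the contract described in the text.
function isHealthyStatus(statusCode: number): boolean {
  return statusCode === 200;
}

// Decide whether to alert, given the most recent status codes (oldest first).
// `threshold` is an assumed setting: how many consecutive failures must sit
// at the end of the history before someone gets alerted.
function shouldAlert(recentStatusCodes: number[], threshold: number = 1): boolean {
  if (threshold < 1 || recentStatusCodes.length < threshold) return false;
  const tail = recentStatusCodes.slice(-threshold);
  return tail.every((code) => !isHealthyStatus(code));
}
```

Raising the threshold trades a slower alert for fewer false alarms from one-off network blips.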

The endpoint costs you almost nothing to build. The absence of it can cost you hours of undiscovered downtime.


Shallow vs. Deep Health Checks

Not all health checks are created equal.

A shallow health check answers one question: is the process alive? If your server can respond to an HTTP request at all, it returns 200. This catches crashes and deployment failures, but it will not catch a broken database connection or a misconfigured environment variable that makes your core feature unusable.

A deep health check actually exercises your dependencies. It tries to reach the database. It checks that critical environment variables are set. It measures latency. This catches the far more common failure mode: your process is running, but something it depends on is broken.

For a production app, you want a deep health check.
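
The difference between the two is easy to see side by side. The sketch below is framework-free and uses made-up names (`DependencyProbe`, `HealthSummary`, `shallowCheck`, `deepCheck`); each probe is assumed to be a function that resolves when its dependency is reachable and rejects when it is not.

```typescript
// Hypothetical sketch: the names and shapes here are illustrative assumptions.

// A probe resolves if the dependency is usable and rejects otherwise.
type DependencyProbe = () => Promise<void>;

interface HealthSummary {
  healthy: boolean;
  failed: string[]; // names of dependencies whose probe rejected
}

// Shallow: if this code runs at all, the process is alive.
function shallowCheck(): HealthSummary {
  return { healthy: true, failed: [] };
}

// Deep: run every probe and record which dependencies failed.
async function deepCheck(
  probes: Record<string, DependencyProbe>
): Promise<HealthSummary> {
  const failed: string[] = [];
  for (const [dependency, probe] of Object.entries(probes)) {
    try {
      await probe();
    } catch {
      failed.push(dependency);
    }
  }
  return { healthy: failed.length === 0, failed };
}
```

The shallow version can never report a broken database, because it never looks at one.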


Building a Health Check in Next.js

Here is a deep health check API route for a Next.js app (App Router) with a PostgreSQL database, using Prisma:

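// app/api/health/route.ts
import { NextResponse } from "next/server";
import { prisma } from "@/lib/prisma";

// Run this handler on every request. In Next.js 13 and 14, a GET route
// handler that never reads the request can be rendered once at build time.
export const dynamic = "force-dynamic";

// Captured when this module first loads.
const startTime = Date.now();

export async function GET() {
  const checks: Record<string, unknown> = {};
  let healthy = true;

  // Database check: a trivial query proves the connection works end to end.
  const dbStart = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = {
      status: "ok",
      latency_ms: Date.now() - dbStart,
    };
  } catch (err) {
    healthy = false;
    // Log details server-side. Driver errors can include hostnames,
    // so the public response carries only a generic message.
    console.error("health check: database probe failed", err);
    checks.database = {
      status: "error",
      message: "database unreachable",
    };
  }

  const body = {
    status: healthy ? "ok" : "degraded",
    // Set by npm when the server is started through an npm script.
    version: process.env.npm_package_version ?? "unknown",
    uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
    checks,
    timestamp: new Date().toISOString(),
  };

  return NextResponse.json(body, {
    status: healthy ? 200 : 503,
    // Ask intermediaries not to cache the result.
    headers: { "Cache-Control": "no-store" },
  });
}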

A few things worth noting here:

  • Return 503 on failure, not just a JSON error message. Monitoring services look at the HTTP status code first, and many never inspect the body unless you configure them to. A 200 with { "status": "error" } will fool any monitor that only checks the status code.
  • Include db latency. A database that responds in 800ms when it normally responds in 8ms is a degraded database. You want to catch this before your p95 latency spikes for users.
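  • Include version and uptime. When you are debugging an incident, knowing which deployment is running and how long it has been up is immediately useful. On serverless hosts, uptime measured this way reflects the current function instance rather than the whole deployment, so treat it as a hint.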
  • Keep it fast. Set a timeout on your database check. A health check that hangs for 30 seconds is worse than no health check.
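
The latency point can be taken one step further by labeling a slow response instead of only reporting the number. This is a minimal sketch; the `describeDatabaseLatency` name and the 500ms default are assumptions for illustration, and the right threshold depends on your own baseline.

```typescript
// Hypothetical helper; the threshold is an assumption, tune it to your baseline.
type DatabaseCheck = { status: "ok" | "degraded"; latency_ms: number };

function describeDatabaseLatency(
  latencyMs: number,
  degradedAboveMs: number = 500
): DatabaseCheck {
  return {
    // Anything slower than the threshold is reported as degraded.
    status: latencyMs > degradedAboveMs ? "degraded" : "ok",
    latency_ms: latencyMs,
  };
}
```

Whether a degraded database should also flip the endpoint to 503 is a judgment call: doing so pages you earlier, but it can also cause alert noise during brief slowdowns.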

The route above does not enforce that timeout yet. One way to add it is to race the database call against a timer:

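// Rejects after 3 seconds; it never resolves, hence Promise<never>.
const timeoutPromise = new Promise<never>((_, reject) =>
  setTimeout(() => reject(new Error("db check timed out")), 3000)
);

// Whichever settles first wins. The race stops waiting on a slow query,
// but it does not cancel the query itself.
await Promise.race([prisma.$queryRaw`SELECT 1`, timeoutPromise]);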
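
If more than one dependency needs the same guard, the race can be wrapped in a small reusable helper that also clears its timer once the work settles. The `withTimeout` name and signature below are hypothetical, a sketch rather than a library API.

```typescript
// Hypothetical helper: rejects if `work` has not settled within `ms` milliseconds.
// It only stops waiting; it does not cancel the underlying work.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
    work.then(
      (value) => {
        clearTimeout(timer); // work finished first, so drop the timer
        resolve(value);
      },
      (error) => {
        clearTimeout(timer); // work failed first, so surface its own error
        reject(error);
      }
    );
  });
}
```

In the route, the database query promise would be passed as `work` with a 3000ms limit, inside the existing try block so that a timeout is handled like any other database failure.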

Test it locally:

curl -s http://localhost:3000/api/health | jq .

You should see something like:

{
  "status": "ok",
  "version": "1.4.2",
  "uptime_seconds": 3721,
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 4
    }
  },
  "timestamp": "2025-10-14T09:22:11.000Z"
}

Uptime Monitoring: Services Worth Knowing

Building the endpoint is step one. Someone has to actually call it every minute. That is what uptime monitoring services do.

UptimeRobot is the starting point for most indie developers. The free tier gives you 50 monitors with 5-minute check intervals. That means you could be down for up to five minutes before detection -- acceptable for a side project, not for a production SaaS. Paid plans, which start at roughly $7/month, drop to 1-minute intervals.

Better Stack (formerly Better Uptime) is the upgrade for teams that need faster response. Their paid plans include 30-second check intervals, on-call scheduling with escalation policies, and an incident timeline view that is genuinely useful during a postmortem. It integrates cleanly with Slack, PagerDuty, and webhooks.

Checkly takes a different angle: it lets you write Playwright-based synthetic tests that simulate real user interactions. Instead of just checking that /api/health returns 200, you can check that a user can actually sign in and complete a core workflow. More setup, more signal.

Hyperping is worth mentioning for Next.js developers specifically -- it has a clean setup flow and monitors from multiple geographic regions, which matters if your app has a global audience.

For serious applications, the monitoring service is not an either/or choice -- you layer them. A cheap UptimeRobot check catches catastrophic failures. Checkly catches broken user flows. Better Stack handles on-call routing.


Alerting: Where Do the Alerts Go?

A health check that fires an alert to an email you check twice a day is not a health check. It is a delayed incident report.

Think about where you actually want the alert to land:

  • Slack: Great for teams. Set up a #alerts channel and pipe all monitoring there. Low friction, high visibility.
  • SMS / Phone call: For anything that breaks revenue. Better Stack and PagerDuty both support phone call escalation so an alert cannot be silently ignored.
  • PagerDuty: Standard for engineering teams with on-call rotations. Overkill for a solo project, essential for a team.

Match the alert destination to the stakes. A personal project can go to email. A B2B SaaS where downtime means customers cannot do their jobs should wake someone up.
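
One way to make that matching explicit is a small routing table. Everything in this sketch is an assumption for illustration: the severity levels, the channel names, and the message format. The one external fact it relies on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
// Hypothetical routing table; severities and channels are illustrative assumptions.
type AlertSeverity = "info" | "warning" | "critical";
type AlertChannel = "email" | "slack" | "phone";

const ALERT_ROUTES: Record<AlertSeverity, AlertChannel[]> = {
  info: ["email"],              // low stakes: can wait until someone reads mail
  warning: ["slack"],           // visible to the team, not disruptive
  critical: ["slack", "phone"], // revenue is affected: wake someone up
};

function channelsFor(severity: AlertSeverity): AlertChannel[] {
  return ALERT_ROUTES[severity];
}

// Builds the JSON body for a Slack incoming webhook message.
function buildSlackPayload(
  service: string,
  severity: AlertSeverity,
  detail: string
): { text: string } {
  return { text: `[${severity.toUpperCase()}] ${service}: ${detail}` };
}
```

Most monitoring services handle this routing for you through their integrations; the table is mainly useful as a written-down decision about which failures deserve a phone call.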


Health Checks in Vercel

If you deploy to Vercel, you can tie health checks directly into the deployment pipeline using Vercel's Deployment Checks API. This means a new deployment will not receive live traffic until it passes your checks.

Configure it in your project settings under Settings > Deployment Checks, then connect a GitHub Action or external integration that validates your /api/health endpoint returns 200 before the deployment alias is promoted to production.
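
The validation step itself can be very small. The sketch below is not the Vercel API; it is the kind of polling logic an external script or CI job might run against a fresh deployment URL. The `waitForHealthy` name, its defaults, and the injected `getStatus` function are all assumptions made so the logic can be exercised without a network.

```typescript
// Hypothetical deploy gate: poll a health URL until it returns 200
// or the attempts run out. `getStatus` is injected for testability.
type StatusFetcher = (url: string) => Promise<number>;

async function waitForHealthy(
  url: string,
  getStatus: StatusFetcher,
  maxAttempts: number = 5,
  delayMs: number = 2000
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await getStatus(url)) === 200) return true;
    } catch {
      // A network error counts as a failed attempt; fall through and retry.
    }
    if (attempt < maxAttempts) {
      // Give a cold deployment a moment before the next attempt.
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

In a CI step, `getStatus` could be a thin wrapper around fetch that returns the response status, and the job would fail when `waitForHealthy` resolves to false.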

Vercel also uses health signals internally. When Vercel itself experienced a service disruption in October 2025, their incident timeline shows health check monitoring as the mechanism they used to confirm recovery before re-routing traffic -- "We observe health checks passing in the iad1 builds cluster, and route Secure Compute builds to the cluster." That is the same pattern you should be using.


Status Pages: Let Users Know Before They Have to Ask

When something does break, users will go looking for answers. If they cannot find a status page, they will post on X, email support, and assume the worst.

A public status page is your incident communication channel. It does not have to be fancy.

Instatus and Statuspage (by Atlassian) are the two most common options. Both let you display uptime history, post incident updates in real time, and let users subscribe to notifications.

The discipline here matters more than the tool. When an incident starts: post that you are investigating within 5 minutes. Update every 15-20 minutes even if you have nothing new. Post a resolution message with a brief explanation. Users forgive downtime. They do not forgive silence.


The Monitoring Stack for a Vibe Coder Project

Here is what a reasonable production monitoring setup looks like for a Next.js app in 2025:

Layer | Tool | Cost
Health check endpoint | /api/health (you build this) | Free
Uptime monitoring | UptimeRobot (free) or Better Stack | $0-$24/mo
Synthetic tests | Checkly | $0 on free tier
Alerting | Slack + SMS on paid plan | Included
Status page | Instatus | $0 on free tier
Vercel deployment checks | Vercel Checks API | Included

Total cost for a serious setup: under $30/month. Total cost of finding out your app was down for six hours from a user on X: harder to put a number on.


Action Items

  • Add /api/health to your Next.js app today. A shallow check is fine to start; a deep check with a database ping is the goal.
  • Make sure your health endpoint returns 503 (not 200) when dependencies fail.
  • Sign up for UptimeRobot free tier and point a monitor at your health URL.
  • Create a #alerts Slack channel and connect your uptime monitor to it.
  • Set up a public status page on Instatus or Statuspage.
  • If you are on Vercel Pro or Enterprise, configure Deployment Checks to gate traffic routing on health check results.
  • Add curl -s https://yourapp.com/api/health | jq . to your post-deployment checklist.

Ask The Guild

What does your health check response body include? Have you ever been caught off guard by an outage your monitoring missed -- or found an issue before users did because of a good health check? Share your setup in the comments. Bonus points if you have a monitoring stack you have battle-tested.

About Tom Hundley

Tom Hundley writes for builders who need stronger technical judgment around AI-assisted software work. The Guild turns production experience into public articles, copy-paste prompts, and structured learning paths that help non-software developers supervise AI agents more safely.
