Health Checks: Know Before Your Users Do
Production Ready -- Part 24 of 30
It was a Tuesday morning when Marcus got the message. Not from his monitoring dashboard. Not from an alert. From a reply on X: "hey @marcusbuilds your app has been completely broken for like 6 hours, just FYI." He checked his phone. Three hours of meetings, zero alerts. He opened his app. Blank screen. Database connection pool exhausted since 3 AM.
That moment -- learning your production app is down from a stranger on the internet -- is one of the most avoidable embarrassments in software. It is also one of the most common.
According to New Relic's Observability 2025 Report, 41% of IT leaders report that service issues are identified through manual checks, customer complaints, or incident tickets after the fact. Forty-one percent. And when outages do get detected by internal tooling, the median time to detection for high-business-impact incidents is 37 minutes -- meaning users have already been suffering for over half an hour before anyone on the team knows there is a problem.
You shipped the product. You figured out distribution. Now you need to know when it breaks before your users do. That starts with health checks.
What a Health Check Endpoint Is
A health check is a dedicated API route -- typically /api/health -- that your application exposes. When everything is working, it returns 200 OK. When something is wrong, it returns 503 Service Unavailable. That is the entire concept.
External monitoring services ping this URL every 30 to 60 seconds. If they get anything other than a 200, they page you immediately. Your users might not have even noticed yet.
The endpoint costs you almost nothing to build. The absence of it can cost you hours of undiscovered downtime.
Shallow vs. Deep Health Checks
Not all health checks are created equal.
A shallow health check answers one question: is the process alive? If your server can respond to an HTTP request at all, it returns 200. This catches crashes and deployment failures, but it will not catch a broken database connection or a misconfigured environment variable that makes your core feature unusable.
A deep health check actually exercises your dependencies. It tries to reach the database. It checks that critical environment variables are set. It measures latency. This catches the far more common failure mode: your process is running, but something it depends on is broken.
For a production app, you want a deep health check.
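For contrast, a shallow check is only a few lines. A sketch using a plain `Response` (which Next.js route handlers accept) — it proves the process is alive and nothing more:

```typescript
// app/api/health/route.ts -- shallow variant (sketch)
// Returns 200 whenever the process can serve an HTTP request at all.
// It says nothing about the database or any other dependency.
export async function GET(): Promise<Response> {
  return Response.json({ status: "ok" });
}
```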
Building a Health Check in Next.js
Here is a production-ready health check API route for a Next.js app with a PostgreSQL database (using Prisma):
```typescript
// app/api/health/route.ts
import { NextResponse } from "next/server";
import { prisma } from "@/lib/prisma";

// Opt out of static caching -- a health check must run on every request.
export const dynamic = "force-dynamic";

// Time since this server process (or serverless instance) started.
const startTime = Date.now();

export async function GET() {
  const checks: Record<string, unknown> = {};
  let healthy = true;

  // Database check
  const dbStart = Date.now();
  try {
    await prisma.$queryRaw`SELECT 1`;
    checks.database = {
      status: "ok",
      latency_ms: Date.now() - dbStart,
    };
  } catch (err) {
    healthy = false;
    checks.database = {
      status: "error",
      message: err instanceof Error ? err.message : "unknown error",
    };
  }

  const body = {
    status: healthy ? "ok" : "degraded",
    version: process.env.npm_package_version ?? "unknown",
    uptime_seconds: Math.floor((Date.now() - startTime) / 1000),
    checks,
    timestamp: new Date().toISOString(),
  };

  return NextResponse.json(body, {
    status: healthy ? 200 : 503,
  });
}
```
A few things worth noting here:
- Return `503` on failure, not just a JSON error message. Monitoring services look at the HTTP status code, not the body. A `200` with `{ "status": "error" }` will fool every uptime monitor on the market.
- Include DB latency. A database that responds in 800ms when it normally responds in 8ms is a degraded database. You want to catch this before your p95 latency spikes for users.
- Include version and uptime. When you are debugging an incident, knowing which deployment is running and how long it has been up is immediately useful.
- Keep it fast. Set a timeout on your database check. A health check that hangs for 30 seconds is worse than no health check.
Add a timeout wrapper around the database call if you want to be rigorous:
```typescript
const timeoutPromise = new Promise((_, reject) =>
  setTimeout(() => reject(new Error("db check timed out")), 3000)
);

await Promise.race([prisma.$queryRaw`SELECT 1`, timeoutPromise]);
```
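If you end up with several dependency checks, a small generic wrapper keeps the pattern tidy. A sketch — `withTimeout` is not a library function, just a helper you would define yourself:

```typescript
// Sketch of a reusable timeout wrapper. Rejects if `promise` has not settled
// within `ms` milliseconds; otherwise passes the result through unchanged.
function withTimeout<T>(promise: Promise<T>, ms: number, label = "check"): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms)
  );
  return Promise.race([promise, timeout]);
}

// Usage inside the health check:
// await withTimeout(prisma.$queryRaw`SELECT 1`, 3000, "database");
```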
Test it locally:
```bash
curl -s http://localhost:3000/api/health | jq .
```
You should see something like:
```json
{
  "status": "ok",
  "version": "1.4.2",
  "uptime_seconds": 3721,
  "checks": {
    "database": {
      "status": "ok",
      "latency_ms": 4
    }
  },
  "timestamp": "2025-10-14T09:22:11.000Z"
}
```
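For comparison, here is roughly what the same endpoint returns, alongside a 503 status, when the database check fails — the `message` value here is illustrative:

```json
{
  "status": "degraded",
  "version": "1.4.2",
  "uptime_seconds": 3721,
  "checks": {
    "database": {
      "status": "error",
      "message": "db check timed out"
    }
  },
  "timestamp": "2025-10-14T09:22:11.000Z"
}
```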
Uptime Monitoring: Services Worth Knowing
Building the endpoint is step one. Someone has to actually call it every minute. That is what uptime monitoring services do.
UptimeRobot is the starting point for most indie developers. The free tier gives you 50 monitors with 5-minute check intervals. That means you could be down for up to five minutes before detection -- acceptable for a side project, not for a production SaaS. The $7/month Pro plan drops to 1-minute intervals.
Better Stack (formerly Better Uptime) is the upgrade for teams that need faster response. Their paid plans include 30-second check intervals, on-call scheduling with escalation policies, and an incident timeline view that is genuinely useful during a postmortem. It integrates cleanly with Slack, PagerDuty, and webhooks.
Checkly takes a different angle: it lets you write Playwright-based synthetic tests that simulate real user interactions. Instead of just checking that /api/health returns 200, you can check that a user can actually sign in and complete a core workflow. More setup, more signal.
Hyperping is worth mentioning for Next.js developers specifically -- it has a clean setup flow and monitors from multiple geographic regions, which matters if your app has a global audience.
For serious applications, the monitoring service is not an either/or choice -- you layer them. A cheap UptimeRobot check catches catastrophic failures. Checkly catches broken user flows. Better Stack handles on-call routing.
Alerting: Where Do the Alerts Go?
A health check that fires an alert to an email you check twice a day is not a health check. It is a delayed incident report.
Think about where you actually want the alert to land:
- Slack: Great for teams. Set up a `#alerts` channel and pipe all monitoring there. Low friction, high visibility.
- SMS / phone call: For anything that breaks revenue. Better Stack and PagerDuty both support phone-call escalation so an alert cannot be silently ignored.
- PagerDuty: Standard for engineering teams with on-call rotations. Overkill for a solo project, essential for a team.
Match the alert destination to the stakes. A personal project can go to email. A B2B SaaS where downtime means customers cannot do their jobs should wake someone up.
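As a concrete example, piping a failed check into Slack takes one incoming-webhook call. A sketch — the `SLACK_WEBHOOK_URL` environment variable is an assumption, pointing at a webhook you create in your Slack workspace's app settings:

```typescript
// Sketch: forward a failed health check to a Slack #alerts channel via an
// incoming webhook (Slack's webhook API accepts a JSON body with a `text` field).
function buildAlertMessage(service: string, detail: string): { text: string } {
  return { text: `:rotating_light: ${service} health check failed: ${detail}` };
}

async function sendSlackAlert(service: string, detail: string): Promise<void> {
  const url = process.env.SLACK_WEBHOOK_URL; // assumed env var
  if (!url) return; // no webhook configured -- skip silently
  await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildAlertMessage(service, detail)),
  });
}
```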
Health Checks in Vercel
If you deploy to Vercel, you can tie health checks directly into the deployment pipeline using Vercel's Deployment Checks API. This means a new deployment will not receive live traffic until it passes your checks.
Configure it in your project settings under Settings > Deployment Checks, then connect a GitHub Action or external integration that validates your /api/health endpoint returns 200 before the deployment alias is promoted to production.
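A minimal external validation along those lines just polls the health URL until it passes or a retry budget runs out. A sketch — the attempt count and delay are assumptions you would tune for your pipeline:

```typescript
// Sketch of a deployment gate: poll a health endpoint until it returns 200,
// or give up after `attempts` tries. A CI step can call this and fail the
// pipeline (blocking promotion) when the new deployment never comes healthy.
async function waitForHealthy(
  url: string,
  attempts = 12,
  delayMs = 5000
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.status === 200) return true;
    } catch {
      // network error -- deployment may not be reachable yet
    }
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return false;
}
```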
Vercel also uses health signals internally. When Vercel itself experienced a service disruption in October 2025, their incident timeline shows health check monitoring as the mechanism they used to confirm recovery before re-routing traffic -- "We observe health checks passing in the iad1 builds cluster, and route Secure Compute builds to the cluster." That is the same pattern you should be using.
Status Pages: Let Users Know Before They Have to Ask
When something does break, users will go looking for answers. If they cannot find a status page, they will post on X, email support, and assume the worst.
A public status page is your incident communication channel. It does not have to be fancy.
Instatus and Statuspage (by Atlassian) are the two most common options. Both let you display uptime history, post incident updates in real time, and let users subscribe to notifications.
The discipline here matters more than the tool. When an incident starts: post that you are investigating within 5 minutes. Update every 15-20 minutes even if you have nothing new. Post a resolution message with a brief explanation. Users forgive downtime. They do not forgive silence.
The Monitoring Stack for a Vibe Coder Project
Here is what a reasonable production monitoring setup looks like for a Next.js app in 2025:
| Layer | Tool | Cost |
|---|---|---|
| Health check endpoint | `/api/health` (you build this) | Free |
| Uptime monitoring | UptimeRobot (free) or Better Stack | $0-$24/mo |
| Synthetic tests | Checkly | $0 on free tier |
| Alerting | Slack + SMS on paid plan | Included |
| Status page | Instatus | $0 on free tier |
| Vercel deployment checks | Vercel Checks API | Included |
Total cost for a serious setup: under $30/month. Total cost of finding out your app was down for six hours from a user on X: harder to put a number on.
Action Items
- Add `/api/health` to your Next.js app today. A shallow check is fine to start; a deep check with a database ping is the goal.
- Make sure your health endpoint returns `503` (not `200`) when dependencies fail.
- Sign up for the UptimeRobot free tier and point a monitor at your health URL.
- Create a `#alerts` Slack channel and connect your uptime monitor to it.
- Set up a public status page on Instatus or Statuspage.
- If you are on Vercel Pro or Enterprise, configure Deployment Checks to gate traffic routing on health check results.
- Add `curl -s https://yourapp.com/api/health | jq .` to your post-deployment checklist.
Ask The Guild
What does your health check response body include? Have you ever been caught off guard by an outage your monitoring missed -- or found an issue before users did because of a good health check? Share your setup in the comments. Bonus points if you have a monitoring stack you have battle-tested.