Backup Everything: Your Pre-Disaster Checklist
Production Ready — Part 16 of 30
In January 2025, security researchers publicly documented a ransomware campaign called Codefinger. The attackers didn't exploit a zero-day in AWS. They didn't need to. They found exposed AWS credentials — leaked in a GitHub commit, hardcoded in a Docker image, left in build logs — and used them to do something brutally clever: re-encrypt every object in the victim's S3 bucket using Amazon's own Server-Side Encryption with Customer-Provided Keys (SSE-C) feature, with an encryption key only the attackers held. Then they set lifecycle policies to delete every object within seven days.
AWS doesn't store customer-provided keys. That's by design, for security. It also meant the data was completely inaccessible to its owners. The ransom note was simple: pay, or watch your data delete itself on a timer.
Organizations with independent, versioned backups stored in separate accounts had an escape route. They could restore from before the attack. Organizations that treated their S3 bucket as their backup? They were staring at a countdown clock.
This is the thing about backups: you never need them until you desperately, urgently, business-critically need them. And by then, it's too late to set them up.
Let's fix that today.
The Three Categories of Disaster
Before we talk about what to back up, let's be clear about what we're defending against. Disasters come in three flavors:
1. Accidental deletion / human error — You delete the wrong database. You run DROP TABLE against production instead of staging. A developer deletes a shared folder in Notion and doesn't realize it contained the only copy of six months of customer research. According to Verizon's 2025 DBIR, human error remains a leading cause of data loss incidents.
2. Ransomware and malicious actors — Attackers encrypt or destroy your data and demand payment. In 2025, ransomware was present in 44% of all data breaches, up from 32% the year before, according to CrashPlan's 2026 data loss statistics. In July 2025, Ingram Micro — one of the world's largest IT distributors — was knocked entirely offline by a ransomware attack that shuttered their ordering and logistics systems for nearly a week.
3. Infrastructure failure — AWS, your database host, your storage provider — they all go down. In October 2025, a DNS failure in AWS US-East-1 cascaded through DynamoDB, Lambda, EC2, and IAM over 15 hours, affecting Snapchat, Roblox, Fortnite, airline reservation systems, and thousands of other services. Healthcare organizations alone faced estimated losses of $62,500 per hour during that window.
Your backup strategy needs to account for all three. A backup that lives in the same AWS account as your production data doesn't help you against scenario two. A backup that only runs weekly doesn't help you in scenario one if the deletion happened eight days ago.
The 3-2-1 Rule (And Why You Need 3-2-1-1)
The classic rule is simple:
- 3 copies of your data
- 2 different storage media
- 1 offsite copy
In 2025, that's the baseline, not the ceiling. Add a fourth requirement:
- 1 immutable (air-gapped or WORM-locked) copy that cannot be modified or deleted by anyone, including you
This last piece is what defeats the Codefinger attack. Even if an attacker compromises your AWS credentials, they cannot delete a backup protected by Object Lock in a separate AWS account, or stored in a completely separate cloud provider, or on physical media in a location they can't reach.
# Enable S3 Object Lock on a new bucket (must be enabled at creation time)
aws s3api create-bucket \
  --bucket my-app-immutable-backups \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2 \
  --object-lock-enabled-for-bucket

aws s3api put-object-lock-configuration \
  --bucket my-app-immutable-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {
      "DefaultRetention": {
        "Mode": "COMPLIANCE",
        "Days": 30
      }
    }
  }'
In COMPLIANCE mode, no one — not even the root account — can delete or overwrite objects during the retention period. That's the point.
What Actually Needs to Be Backed Up
Vibe coders often back up their database and call it done. That's necessary but not sufficient. Here's the full inventory:
1. The Database
This is obvious. What's less obvious is how to do it right.
PostgreSQL — automated daily dump:
import subprocess
import boto3
import os
from datetime import datetime

def backup_postgres_to_s3():
    timestamp = datetime.utcnow().strftime('%Y%m%d_%H%M%S')
    filename = f"db_backup_{timestamp}.sql.gz"
    dump_path = f"/tmp/{filename}"
    sql_path = dump_path.removesuffix('.gz')

    # Dump and compress
    with open(sql_path, 'w') as out:
        subprocess.run([
            "pg_dump",
            "--host", os.environ["DB_HOST"],
            "--username", os.environ["DB_USER"],
            "--dbname", os.environ["DB_NAME"],
            "--format=plain",
            "--no-password"
        ], stdout=out, check=True,
           env={**os.environ, "PGPASSWORD": os.environ["DB_PASSWORD"]})
    subprocess.run(["gzip", sql_path], check=True)

    # Upload to S3
    s3 = boto3.client('s3')
    s3.upload_file(
        dump_path,
        os.environ["BACKUP_BUCKET"],
        f"postgres/{datetime.utcnow().strftime('%Y/%m/%d')}/{filename}"
    )
    os.remove(dump_path)
    print(f"Backup complete: {filename}")

if __name__ == "__main__":
    backup_postgres_to_s3()
Run this as a cron job — or better, as a scheduled task in your infrastructure:
# Add to crontab — run at 2 AM UTC daily
0 2 * * * /usr/bin/python3 /app/scripts/backup_postgres.py >> /var/log/backup.log 2>&1
For high-stakes databases, also enable continuous WAL archiving (PostgreSQL's write-ahead log) so you can restore to any point in time, not just the last daily snapshot.
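WAL archiving is configured in postgresql.conf. A minimal sketch of the relevant settings — the S3 destination is a placeholder, and this assumes the aws CLI is installed and authenticated on the database server:

```ini
# postgresql.conf — minimal continuous-archiving sketch (S3 path is a placeholder)
wal_level = replica            # WAL carries enough detail for point-in-time recovery
archive_mode = on              # Changing this requires a server restart
archive_command = 'aws s3 cp %p s3://my-app-backups/wal/%f'  # %p = segment path, %f = segment name
archive_timeout = 300          # Force a segment switch at least every 5 minutes
```

With this in place, a restore replays the base backup plus archived WAL segments up to any timestamp you choose, rather than snapping back to last night's dump.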
2. User-Uploaded Files and Media
If users upload files — images, documents, videos — those live outside your database. They need their own backup strategy.
# Sync production S3 bucket to a backup bucket in a different region
# Run this daily via cron or CI job
aws s3 sync \
s3://my-app-production-uploads \
s3://my-app-backup-uploads-us-west \
--source-region us-east-1 \
--region us-west-2
If you're on a different provider, rclone handles cross-provider syncs beautifully:
# Sync from Cloudflare R2 to a Backblaze B2 bucket
rclone sync r2:my-production-bucket b2:my-backup-bucket --progress
3. Environment Variables and Secrets
This one surprises people. Your .env file, your Doppler config, your AWS Secrets Manager entries — losing these after a disaster means you can restore the database but can't run the application because nobody knows the API keys anymore.
Store encrypted exports of your secrets in a password manager (like 1Password or Bitwarden) that at least two trusted team members can access. Document which secrets exist and what they're for.
# Export secrets from AWS Secrets Manager (store this output encrypted)
# Each line pairs the secret's name with its value — values alone are useless at restore time
aws secretsmanager list-secrets --query 'SecretList[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r name; do
    value=$(aws secretsmanager get-secret-value --secret-id "$name" \
      --query 'SecretString' --output text)
    printf '%s\t%s\n' "$name" "$value"
  done > secrets_export.txt
# Immediately encrypt before storing anywhere
gpg --symmetric --cipher-algo AES256 secrets_export.txt
rm secrets_export.txt
4. Application Configuration and Infrastructure-as-Code
Your Terraform files, your Kubernetes manifests, your Docker Compose configs — these live in Git, right? Good. But does the Git repo itself have a backup? GitHub going down (or banning your account, or experiencing data loss) is rare but not impossible.
# Mirror your GitHub repos to a self-hosted Gitea or GitLab instance
# Or simply clone and push to a second remote
git clone --mirror https://github.com/yourorg/yourapp.git
cd yourapp.git
git remote add backup https://gitlab.com/yourorg/yourapp-backup.git
git push backup --mirror
5. Logs (Especially Audit Logs)
This is the one that gets forgotten until a compliance audit or a forensic investigation. Your application logs are evidence. They tell you what happened, when, and who did it. Losing them during or after an incident is catastrophic for understanding what went wrong.
Export logs to cold storage regularly:
// ship-logs.ts — Run via scheduled job
import { CloudWatchLogsClient, GetLogEventsCommand } from "@aws-sdk/client-cloudwatch-logs";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

async function archiveLogsToS3(logGroupName: string, logStreamName: string) {
  const cwClient = new CloudWatchLogsClient({ region: "us-east-1" });
  const s3Client = new S3Client({ region: "us-west-2" }); // Different region!

  const yesterday = new Date();
  yesterday.setDate(yesterday.getDate() - 1);
  const dateStr = yesterday.toISOString().split('T')[0]; // Capture before the mutations below

  const response = await cwClient.send(new GetLogEventsCommand({
    logGroupName,
    logStreamName,
    startTime: yesterday.setHours(0, 0, 0, 0),    // setHours mutates and returns epoch ms
    endTime: yesterday.setHours(23, 59, 59, 999),
  }));

  const logContent = response.events
    ?.map(e => JSON.stringify(e))
    .join('\n') ?? '';

  const key = `logs/${logGroupName}/${dateStr}.json`;
  await s3Client.send(new PutObjectCommand({
    Bucket: process.env.LOG_ARCHIVE_BUCKET!,
    Key: key,
    Body: logContent,
    ContentType: "application/json",
  }));

  console.log(`Archived ${response.events?.length ?? 0} log events to s3://${process.env.LOG_ARCHIVE_BUCKET}/${key}`);
}
The Test Nobody Runs (Until It's Too Late)
Here's the statistic that should keep you up at night: according to CrashPlan's 2026 data loss report, only 57% of backups complete successfully, and only 61% of restores succeed. You read that right: four out of ten attempted restores fail.
A backup you've never tested is not a backup. It's a comfort blanket.
Do a restore drill. Right now, on a staging environment:
# Restore test procedure — run monthly
# 1. Download last night's backup
aws s3 cp s3://my-app-backups/postgres/2025/07/15/db_backup_20250715_020000.sql.gz /tmp/
# 2. Decompress
gzip -d /tmp/db_backup_20250715_020000.sql.gz
# 3. Restore to a TEST database (never production during a drill)
psql \
--host staging-db.example.com \
--username admin \
--dbname restore_test \
< /tmp/db_backup_20250715_020000.sql
# 4. Verify row counts match expectations
psql --host staging-db.example.com --username admin --dbname restore_test \
-c "SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;"
# 5. Run your smoke tests against the restored database
npm run test:smoke -- --db=restore_test
Calendar a restore drill every month. Make it someone's job. Rotate who does it so knowledge spreads across the team.
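Scheduling the drill in CI keeps it from quietly falling off the calendar. A hypothetical GitHub Actions workflow — the script path and secret names are placeholders for whatever your drill steps actually live in:

```yaml
# .github/workflows/restore-drill.yml — hypothetical monthly drill trigger
name: monthly-restore-drill
on:
  schedule:
    - cron: "0 6 1 * *"    # 06:00 UTC on the 1st of every month
  workflow_dispatch: {}    # Allow manual runs between drills
jobs:
  drill:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run restore drill against staging
        run: ./scripts/restore_drill.sh   # Placeholder: wraps the steps above
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.DRILL_AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.DRILL_AWS_SECRET }}
```

A failing workflow run is a visible, assignable artifact — which is exactly what a failed drill should be.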
Automated Backup Monitoring
Backups fail silently. The job runs, the output says "done," but the file is corrupted or zero bytes. You find out six months later when you need it.
Verify your backups automatically:
# verify_backup.py — run after every backup job
import boto3
import sys
from datetime import datetime

def verify_latest_backup(bucket: str, prefix: str, min_size_bytes: int = 1024):
    s3 = boto3.client('s3')
    today = datetime.utcnow().strftime('%Y/%m/%d')

    response = s3.list_objects_v2(
        Bucket=bucket,
        Prefix=f"{prefix}/{today}/"
    )
    objects = response.get('Contents', [])

    if not objects:
        print(f"ALERT: No backup found for {today} in s3://{bucket}/{prefix}")
        sys.exit(1)

    latest = max(objects, key=lambda x: x['LastModified'])
    if latest['Size'] < min_size_bytes:
        print(f"ALERT: Backup too small ({latest['Size']} bytes): {latest['Key']}")
        sys.exit(1)

    print(f"OK: Backup verified — {latest['Key']} ({latest['Size']:,} bytes)")
    return True

if __name__ == "__main__":
    verify_latest_backup(
        bucket="my-app-backups",
        prefix="postgres",
        min_size_bytes=50_000  # Adjust to your database's expected size
    )
Wire this to your alerting system. If the backup verification fails, PagerDuty fires, Slack gets a message, someone gets woken up. A failed backup is a production incident — it just hasn't hurt you yet.
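The simplest wiring is a webhook call when verification fails. A sketch using Slack's incoming-webhook payload format — the webhook URL, function names, and alert wording here are placeholders, not part of the verification script above:

```python
import json
from urllib import request

def format_backup_alert(key: str, size_bytes: int) -> str:
    # Alert text is a placeholder — match your team's conventions
    return f"Backup verification FAILED: {key} ({size_bytes:,} bytes)"

def send_slack_alert(webhook_url: str, message: str) -> None:
    # Slack incoming webhooks accept a JSON body with a "text" field
    payload = json.dumps({"text": message}).encode("utf-8")
    req = request.Request(webhook_url, data=payload,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # Raises on non-2xx, so a failed alert is itself loud

if __name__ == "__main__":
    print(format_backup_alert("postgres/2025/07/15/db_backup.sql.gz", 312))
```

Call `send_slack_alert` from the `sys.exit(1)` branches of the verification script, and route the same message to PagerDuty or email as a second channel.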
Recovery Time Objective vs. Recovery Point Objective
Two numbers every production engineer needs to know:
RTO (Recovery Time Objective) — How long can your system be down before it materially harms the business? For a SaaS product, this might be 4 hours. For a hospital, it might be 15 minutes.
RPO (Recovery Point Objective) — How much data can you afford to lose? If your RPO is 24 hours and your database backup runs nightly, you're at your limit. If a disaster happens at 11 PM, you might lose almost a full day of data.
Write these numbers down. Make them visible. Then build your backup frequency around your RPO and your infrastructure redundancy around your RTO.
| System | RTO | RPO | Backup Strategy |
|---|---|---|---|
| User DB | 2 hours | 1 hour | Hourly snapshots + WAL streaming |
| File uploads | 4 hours | 24 hours | Daily cross-region sync |
| Logs | 8 hours | 24 hours | Daily archive to cold storage |
| Secrets | 24 hours | N/A (static) | Encrypted export, quarterly review |
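The relationship between backup cadence and RPO is simple enough to check mechanically: a disaster just before the next backup run loses up to one full interval of data. A tiny sanity check, with illustrative numbers:

```python
def meets_rpo(rpo_hours: float, backup_interval_hours: float) -> bool:
    # Worst case: the disaster hits just before the next backup,
    # losing a full interval of data. Cadence must be at least as tight as the RPO.
    return backup_interval_hours <= rpo_hours

if __name__ == "__main__":
    # Nightly dumps against a 24-hour RPO: exactly at the limit
    print(meets_rpo(rpo_hours=24, backup_interval_hours=24))  # True
    # Nightly dumps against a 1-hour RPO (the User DB row above): not enough
    print(meets_rpo(rpo_hours=1, backup_interval_hours=24))   # False
```

If the check fails, either tighten the backup schedule or renegotiate the RPO — but do one of them consciously, not by accident.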
Your Pre-Disaster Checklist
Do every item on this list before you ship anything to production. Not after. Before.
Databases
- Automated daily backups configured and running
- Backups stored in a separate account or provider from production
- At least one backup copy is immutable (Object Lock, WORM, or physical media)
- Backup retention policy set (90 days minimum for most apps)
- Point-in-time recovery enabled (WAL archiving for Postgres, binlog for MySQL)
- Restore test completed successfully in the last 30 days
- Backup success/failure monitored with alerting
Application Files and Media
- User uploads backed up to a geographically separate location
- Cross-provider backup in place (don't rely on a single cloud vendor)
- File backup verified with spot-check restores
Secrets and Configuration
- All environment variables documented and stored in a team password manager
- At least two people have access to the backup secrets
- Infrastructure-as-code (Terraform, CDK, etc.) mirrored to a second Git remote
- Deployment runbooks stored somewhere accessible if GitHub is down
Logs and Audit Trails
- Application logs archived to cold storage daily
- Log retention meets compliance requirements (often 1–7 years)
- Log archive is in a separate location from primary logging infrastructure
Monitoring and Testing
- Backup monitoring alerts configured
- Monthly restore drill scheduled on the calendar
- RTO and RPO defined and documented for each critical system
- Disaster recovery runbook written and accessible offline
The Human Side
- Every team member knows where the backup documentation lives
- Backup access credentials stored securely (not only in the production environment)
- Incident response plan documented: who calls whom, in what order
The Ingram Micro ransomware attack in July 2025 knocked a Fortune 500 IT company offline for nearly a week. Not a startup. Not a side project. A company with thousands of employees and decades of operational experience. Recovery took six days of coordinated effort from third-party cybersecurity experts. Six days.
Your backup strategy is the difference between six days of chaos and six hours of inconvenience. The checklist above isn't theoretical — it's the minimum viable posture for any production system that people depend on.
Set it up before you need it. Test it before you trust it.
Ask The Guild
This week's community prompt:
What's your backup setup look like right now? Pick the one item from the checklist above that you know you're missing — and share it in the thread. Bonus points if you commit to fixing it this week and report back. Let's hold each other accountable.
Have you ever had to use a backup in anger — an actual production disaster where the restore saved you? Tell the story. The Guild learns more from real incidents than from any tutorial.