The Restore That Took 9 Hours — Why You Need to Know Your RTO Before the Incident

By Sanjeev Sharma (@webcoderspeed1)

Introduction
Recovery Time Objective (RTO) — the maximum time your service is allowed to be down — is usually defined in a business document somewhere. What's rarely defined is whether your actual restore procedure can meet that objective. A 9-hour restore of an 800GB database isn't a failure of the backup system. It's a failure to test and optimize the restore path before the incident, when there's time to fix it.
- Why Restores Take So Long
- Fix 1: Parallel pg_restore
- Fix 2: Stream Download and Restore Simultaneously
- Fix 3: Pre-Create the Database Before Restore
- Fix 4: Point-in-Time Recovery (PITR) — Faster Than Full Restore
- Fix 5: Document and Automate the Restore Runbook
- Measuring Your Actual RTO
- Restore Checklist
- Conclusion
Why Restores Take So Long
800GB database backup restore — default approach:
Step 1: Download from S3 → 45 minutes (10 Gbps link, but S3 throttled)
Step 2: pg_restore (single-threaded) → 6 hours (indexes rebuilt one at a time)
Step 3: ANALYZE (vacuum/statistics) → 45 minutes
Step 4: Verify data integrity → 30 minutes
Step 5: Update DNS / connection strings → 15 minutes
Step 6: Warm up caches → 30 minutes
Total: ~9 hours
Same restore — optimized:
Step 1: Stream from S3 while restoring → Parallel, saves 45 min
Step 2: pg_restore --jobs=8 (parallel) → 90 minutes (not 6 hours)
Step 3: ANALYZE VERBOSE → 20 minutes (focused)
Step 4+5+6: Automated runbook → 15 minutes
Total: ~2 hours, 15 minutes
Fix 1: Parallel pg_restore
The single biggest improvement: restore with multiple jobs in parallel:
```bash
# ❌ Default single-threaded restore — rebuilds one index at a time
pg_restore \
  --dbname=postgres \
  --format=custom \
  backup.pgdump

# ✅ Parallel restore — uses multiple CPU cores
# --jobs=8: use 8 parallel workers (match your CPU count)
pg_restore \
  --dbname=postgres \
  --format=custom \
  --jobs=8 \
  --no-owner \
  --no-privileges \
  backup.pgdump

# Note: --jobs requires the custom or directory format (not plain SQL),
# and the input must be a regular file, not a pipe.
# If your backups are plain SQL, switch to custom format:
pg_dump --format=custom --file=backup.pgdump ...
```
Parallel restore is 4-8x faster for large databases. The bottleneck shifts from index rebuilding (sequential) to I/O and CPU (parallelizable).
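Rather than hard-coding 8, the worker count can be derived from the machine itself. A small sketch (the final pg_restore invocation is illustrative, so it's left commented out):

```shell
# Derive parallel worker count from available cores:
# nproc on Linux, sysctl fallback for macOS.
JOBS=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "Using $JOBS parallel restore jobs"

# Then pass it through:
# pg_restore --jobs="$JOBS" --format=custom --dbname=mydb backup.pgdump
```

On a dedicated restore instance, matching `--jobs` to the core count is a reasonable default; on a shared box, leave headroom.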
Fix 2: Stream Download and Restore Simultaneously
Don't wait for the full download before starting the restore:
```bash
# ❌ Sequential: download (45 min) THEN restore (6 hours)
aws s3 cp s3://backups/prod.pgdump /tmp/prod.pgdump
pg_restore --jobs=8 -d mydb /tmp/prod.pgdump

# ✅ Stream directly into pg_restore — no temp file, no download wait
# (pg_restore reads standard input when no filename is given)
aws s3 cp s3://backups/prod.pgdump - --no-progress \
  | pg_restore \
      --format=custom \
      --dbname="postgresql://user:pass@localhost/mydb"

# Caveat: --jobs requires a seekable regular file, so a streamed restore
# runs single-threaded. Stream when the download dominates your RTO;
# otherwise download to fast local disk and restore with --jobs (Fix 1).
```
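A back-of-the-envelope calculation helps choose between the two paths — streaming (which runs pg_restore single-threaded, since --jobs needs a seekable file) versus downloading first and restoring in parallel. Every figure below is an assumption; substitute your own measurements:

```shell
SIZE_GB=800              # backup size (assumed)
LINK_GBPS=2.5            # effective S3 throughput, not the NIC's rated speed (assumed)
PARALLEL_RESTORE_MIN=90  # measured pg_restore --jobs time (assumed)
SERIAL_RESTORE_MIN=360   # measured single-job restore time (assumed)

# Download time in minutes: GB * 8 bits / Gbps / 60 seconds
DOWNLOAD_MIN=$(awk -v s="$SIZE_GB" -v g="$LINK_GBPS" 'BEGIN { printf "%d", s*8/g/60 }')

echo "download then parallel restore: $((DOWNLOAD_MIN + PARALLEL_RESTORE_MIN)) min"
echo "streamed single-job restore:    $SERIAL_RESTORE_MIN min"
```

With these numbers the download-then-parallel path wins comfortably; streaming only pulls ahead when the link is slow relative to single-threaded restore speed.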
Fix 3: Pre-Create the Database Before Restore
Wrap the whole procedure in one script: create a fresh target database, loosen durability settings for the bulk load, restore, then swap names:

```bash
#!/bin/bash
# fast-restore.sh
set -euo pipefail

DB_NAME="myapp_production"
BACKUP_S3_PATH="s3://my-backups/prod/latest.pgdump"
DB_URL="postgresql://postgres@localhost"
LOCAL_BACKUP="/var/tmp/restore.pgdump"

echo "Starting fast restore..."

# 1. Create empty database (optimized for restore)
psql "$DB_URL" -c "DROP DATABASE IF EXISTS ${DB_NAME}_new"
psql "$DB_URL" -c "CREATE DATABASE ${DB_NAME}_new"

# 2. Tune PostgreSQL for fast bulk loading (reset after restore completes).
# Note: wal_level = minimal would cut WAL overhead further, but it needs a
# full server restart (pg_reload_conf() can't apply it), so it's omitted here.
psql "$DB_URL/${DB_NAME}_new" <<EOF
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET checkpoint_completion_target = 0.9;
ALTER SYSTEM SET synchronous_commit = off; -- Don't wait for WAL flush on each commit
SELECT pg_reload_conf();
EOF

# 3. Download, then restore in parallel.
# (--jobs needs a seekable regular file, so we can't pipe straight from S3.)
echo "Downloading backup..."
aws s3 cp "$BACKUP_S3_PATH" "$LOCAL_BACKUP"
echo "Restoring with 8 parallel jobs..."
pg_restore \
  --jobs=8 \
  --format=custom \
  --no-owner \
  --no-privileges \
  --dbname="${DB_URL}/${DB_NAME}_new" \
  "$LOCAL_BACKUP"

# 4. Restore normal settings
psql "$DB_URL/${DB_NAME}_new" <<EOF
ALTER SYSTEM SET max_wal_size = '1GB';
ALTER SYSTEM SET synchronous_commit = on;
SELECT pg_reload_conf();
EOF

# 5. Run ANALYZE to update query planner statistics
echo "Running ANALYZE..."
psql "$DB_URL/${DB_NAME}_new" -c "ANALYZE"

# 6. Atomic swap: rename old → _old, new → active.
# RENAME fails if anyone is connected, so terminate active sessions first.
psql "$DB_URL" -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '${DB_NAME}' AND pid <> pg_backend_pid()"
psql "$DB_URL" -c "ALTER DATABASE ${DB_NAME} RENAME TO ${DB_NAME}_old"
psql "$DB_URL" -c "ALTER DATABASE ${DB_NAME}_new RENAME TO ${DB_NAME}"

echo "✅ Restore complete. Old database available as ${DB_NAME}_old for verification."
```
Fix 4: Point-in-Time Recovery (PITR) — Faster Than Full Restore
For PostgreSQL, PITR + WAL archiving lets you restore to any point in time much faster than a full dump restore:
```bash
# postgresql.conf — enable WAL archiving
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'
wal_level = replica
```

With PITR + a base backup, the restore becomes: copy the base backup into place (indexes arrive as data files, so there's no rebuild step), then apply WAL segments up to the recovery point. For large databases this is dramatically faster than reloading a dump.

```bash
# Restore with PITR (PostgreSQL 12+ — recovery.conf no longer exists;
# recovery settings go in postgresql.conf plus an empty recovery.signal file):

# 1. Restore base backup
aws s3 sync s3://my-backups/base-backup/ /var/lib/postgresql/data/

# 2. Configure recovery target
cat >> /var/lib/postgresql/data/postgresql.conf << EOF
restore_command = 'aws s3 cp s3://my-wal-archive/%f %p'
recovery_target_time = '2026-03-15 02:30:00 UTC'  # restore to just before the incident
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal

# 3. Start PostgreSQL — it replays WAL automatically
pg_ctl start
```
PITR also gives you surgical recovery: restore to exactly 1 minute before the accidental DELETE, not to last night's backup.
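Computing that target is plain date arithmetic — for example, one minute before a known bad statement. A sketch using GNU date (the incident timestamp is made up):

```shell
# When the accidental DELETE ran (hypothetical timestamp)
INCIDENT_AT='2026-03-15 02:31:00 UTC'

# Recovery target: one minute earlier (GNU date relative-item syntax)
TARGET=$(date -u -d "$INCIDENT_AT - 1 minute" '+%Y-%m-%d %H:%M:%S UTC')
echo "recovery_target_time = '$TARGET'"
```

The printed line can be pasted directly into the recovery configuration shown above.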
Fix 5: Document and Automate the Restore Runbook
The worst time to figure out your restore procedure is during an incident. Document it in advance:
````markdown
# Database Restore Runbook

## Pre-conditions
- [ ] Access to AWS console and S3 backup bucket
- [ ] PostgreSQL client installed on restore target
- [ ] Database credentials in 1Password under "DB Restore"

## Steps

### 1. Spin up restore instance (5 min)
- Launch RDS snapshot, OR
- Launch EC2 with PostgreSQL 15 matching production version

### 2. Download latest backup (streaming)
```bash
aws s3 ls s3://my-backups/prod/ | sort | tail -5   # find latest
export BACKUP=prod-2026-03-15-020000.pgdump
```

### 3. Run restore (90 min)
```bash
./scripts/fast-restore.sh $BACKUP
```

### 4. Verify
```bash
psql $NEW_DB_URL -f scripts/restore-verify.sql
```

### 5. Cutover
- Update DATABASE_URL in AWS Secrets Manager
- Trigger rolling restart of all app instances
- Monitor error rate for 10 minutes

## Expected Times
| Step | Duration |
|---|---|
| Spin up instance | 5 min |
| Restore (800GB) | 90 min |
| ANALYZE | 20 min |
| Verify + cutover | 15 min |
| Total RTO | ~2 hours |

## Escalation
- First 15 min: On-call engineer
- After 30 min: Incident commander
- After 1 hour: CTO notification
````
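The runbook references scripts/restore-verify.sql without showing it. A minimal version might check row counts, data freshness, and index validity — everything here, table names included, is a placeholder to adapt to your schema:

```shell
# Generate a minimal verification script (placeholder checks — adapt to your schema)
cat > restore-verify.sql <<'SQL'
-- Row counts for the largest tables: compare against production
SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;

-- Most recent write: confirms the backup is as fresh as expected
-- ("orders" / "created_at" are hypothetical names)
SELECT max(created_at) AS latest_write FROM orders;

-- No invalid indexes left behind by the restore
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;
SQL
echo "wrote restore-verify.sql"
```

Keeping this file in the repo next to fast-restore.sh means the on-call engineer runs known-good checks instead of improvising queries mid-incident.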
Measuring Your Actual RTO
```ts
// restore-drill-timer.ts — run quarterly
import { exec } from 'node:child_process'
import { promisify } from 'node:util'

const execAsync = promisify(exec)

// Assumed app-specific helpers (wire these to your own stack):
// downloadBackup pulls the latest backup from S3, runVerificationChecks
// runs restore-verify.sql, db is your query client, alerting is your pager.
import { downloadBackup, runVerificationChecks, db, alerting } from './restore-helpers'

async function timedRestoreDrill() {
  const timeline: Record<string, number> = {}
  timeline.start = Date.now()

  // Step 1: Download
  await downloadBackup('/tmp/restore-drill.pgdump')
  timeline.download = Date.now()

  // Step 2: Restore
  await execAsync('pg_restore --jobs=8 -d restore_drill /tmp/restore-drill.pgdump')
  timeline.restore = Date.now()

  // Step 3: Analyze
  await execAsync('psql -d restore_drill -c "ANALYZE"')
  timeline.analyze = Date.now()

  // Step 4: Verify
  await runVerificationChecks('restore_drill')
  timeline.verify = Date.now()

  // Log results
  const totalMinutes = (timeline.verify - timeline.start) / 60000
  console.log(`Restore Drill Results:`)
  console.log(`  Download: ${((timeline.download - timeline.start) / 60000).toFixed(0)} min`)
  console.log(`  Restore:  ${((timeline.restore - timeline.download) / 60000).toFixed(0)} min`)
  console.log(`  ANALYZE:  ${((timeline.analyze - timeline.restore) / 60000).toFixed(0)} min`)
  console.log(`  Verify:   ${((timeline.verify - timeline.analyze) / 60000).toFixed(0)} min`)
  console.log(`  TOTAL RTO: ${totalMinutes.toFixed(0)} min`)

  // Store in database for trending
  await db.query(`
    INSERT INTO restore_drills (duration_minutes, steps, drilled_at)
    VALUES ($1, $2, NOW())
  `, [totalMinutes, JSON.stringify(timeline)])

  // Alert if RTO exceeds target
  if (totalMinutes > 120) { // 2 hour target
    await alerting.warn(`Restore drill exceeded RTO target: ${totalMinutes.toFixed(0)} min (target: 120 min)`)
  }
}
```
Restore Checklist
- ✅ Use pg_restore --jobs=N to parallelize (N = CPU core count)
- ✅ Stream from S3 directly into pg_restore when download time dominates — skip the temp file
- ✅ Tune PostgreSQL write settings during restore (disable synchronous_commit, increase max_wal_size)
- ✅ Consider WAL archiving + PITR for faster, more precise recovery
- ✅ Document the restore runbook — exact commands, expected times, escalation path
- ✅ Run a timed restore drill quarterly — measure actual RTO vs target
- ✅ Keep restore tools (pg_restore, aws CLI) installed on a ready standby instance
- ✅ Verify the restored database before cutting over — don't find errors mid-cutover
Conclusion
A 9-hour restore is almost always avoidable. The biggest wins are parallel restore (--jobs=8 reduces 6 hours to 90 minutes), streaming from S3 to save the download wait, and WAL archiving for PITR (lets you restore to a surgical point in time, not just last night's backup). But the most important fix is running a timed restore drill before you have an incident — so you know your actual RTO, discover the bottlenecks with time to fix them, and have a written runbook the on-call engineer can follow at 2 AM without guessing. If you've never tested your restore, you don't have a recovery plan. You have a backup file and a hope.