Dealing With Silent System Failure — The Bug That's Been Running for Three Months

By Sanjeev Sharma (@webcoderspeed1)
Introduction
Silent failures are qualitatively different from loud ones. Loud failures get paged, investigated, and fixed. Silent failures accumulate — they erode correctness quietly, often for months, until a user complaint or an audit surfaces them. Email jobs that log "success" but don't send. Background workers that swallow errors and continue processing. Integrations that retry three times and then silently drop the record. The defense against silent failures is systematic: treat every background operation as suspect until proven observable, instrument everything, and regularly audit expected outcomes against actual ones.
- Types of Silent Failures
- Fix 1: Never Swallow Errors — Log and Alert
- Fix 2: Dead-Letter Queues With Monitoring
- Fix 3: Outcome Auditing — Verify Expected vs Actual
- Fix 4: Health Check Endpoints for Background Jobs
- Fix 5: Structured Logging That Makes Silent Failures Visible
- Silent Failure Prevention Checklist
- Conclusion
Types of Silent Failures
Silent failure patterns:
1. Error swallowing
→ try { await doWork() } catch { /* silently continue */ }
→ The operation failed but nothing indicates it
→ 50,000 calls, 30% failure rate, 0 alerts
2. Job success with failed subtasks
→ Job reports "completed successfully"
→ 30% of items in the batch failed
→ No per-item failure tracking
3. Queue processing with no dead-letter handling
→ Message fails processing 3 times, goes to dead-letter queue
→ Nobody monitors DLQ
→ 10,000 messages accumulate over 3 months
4. Backup job that uploads 0 bytes
→ pg_dump failed silently
→ upload command succeeded on a 0-byte file
→ All checks pass, backup is empty
5. Rate limiting that drops instead of queues
→ API client hits rate limit
→ Library returns null instead of throwing
→ Callers don't check for null
→ Data silently lost
6. Integration sync that skips on errors
→ Sync job: "if error, log and continue"
→ 10% of records silently skipped every run
→ After 6 months: database diverged from source of truth
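Pattern 5 is worth a concrete sketch, because it is the easiest one to ship by accident. The names below (`fetchRecord`, `ApiRecord`) are hypothetical stand-ins for any third-party client that returns `null` instead of throwing; the wrapper turns that silent `null` into an error the caller cannot ignore:

```typescript
interface ApiRecord {
  id: string
  payload: string
}

// Simulated third-party client: returns null when rate-limited.
// ❌ Callers that forget the null check silently lose data.
function fetchRecord(id: string, rateLimited: boolean): ApiRecord | null {
  if (rateLimited) return null
  return { id, payload: 'data' }
}

// ✅ Wrapper: the null becomes a thrown error, so the failure is loud.
function fetchRecordOrThrow(id: string, rateLimited: boolean): ApiRecord {
  const record = fetchRecord(id, rateLimited)
  if (record === null) {
    throw new Error(`fetchRecord returned null for ${id} — likely rate limited`)
  }
  return record
}
```

Wrapping at the boundary means every downstream caller gets the loud behavior for free, instead of each call site having to remember the null check.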
Fix 1: Never Swallow Errors — Log and Alert
```typescript
// ❌ Error swallowing — the root cause of most silent failures
async function processEmailQueue() {
  const messages = await queue.receive()
  for (const msg of messages) {
    try {
      await sendEmail(msg.email, msg.subject, msg.body)
      await queue.delete(msg.id)
    } catch (err) {
      // ❌ Silently continue — email dropped
    }
  }
}

// ✅ Error is observable, alertable, and retryable
async function processEmailQueue() {
  const messages = await queue.receive()
  for (const msg of messages) {
    try {
      await sendEmail(msg.email, msg.subject, msg.body)
      await queue.delete(msg.id)
      metrics.increment('email.sent')
    } catch (err) {
      logger.error({ messageId: msg.id, error: err.message }, 'Email send failed')
      metrics.increment('email.failed')
      // Don't delete — move to dead-letter or retry
      if (msg.retryCount < 3) {
        await queue.retry(msg.id, { delay: exponentialBackoff(msg.retryCount) })
      } else {
        await queue.moveToDeadLetter(msg.id, err.message)
        await alerting.warn(`Email permanently failed after 3 retries: ${msg.id}`)
      }
    }
  }
}
```
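The retry branch calls an `exponentialBackoff` helper that the snippet leaves undefined. A minimal sketch might look like this — the base delay and cap are illustrative choices, not prescriptions, and the random jitter keeps retrying consumers from hammering the downstream in lockstep:

```typescript
// Sketch of an exponentialBackoff helper: base * 2^retryCount milliseconds,
// plus up to 25% random jitter, capped at maxMs. Defaults are illustrative.
function exponentialBackoff(retryCount: number, baseMs = 1000, maxMs = 60_000): number {
  const exp = Math.min(baseMs * 2 ** retryCount, maxMs)
  const jitter = Math.random() * exp * 0.25
  return Math.round(exp + jitter)
}
```

Retry 0 waits roughly a second, retry 1 roughly two, retry 2 roughly four — long enough for transient downstream problems to clear without stalling the whole queue.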
Fix 2: Dead-Letter Queues With Monitoring
Every queue must have a dead-letter queue, and every DLQ must have an alert that fires when messages accumulate.

```hcl
# SQS DLQ configuration (Terraform)
resource "aws_sqs_queue" "email_dlq" {
  name                      = "email-dlq"
  message_retention_seconds = 1209600 # 14 days
}

resource "aws_sqs_queue" "email" {
  name = "email"
  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.email_dlq.arn
    maxReceiveCount     = 3 # After 3 failures, move to DLQ
  })
}

# Alert when the DLQ has any messages:
resource "aws_cloudwatch_metric_alarm" "email_dlq_messages" {
  alarm_name  = "email-dlq-has-messages"
  namespace   = "AWS/SQS"
  metric_name = "ApproximateNumberOfMessagesVisible"
  dimensions = {
    QueueName = aws_sqs_queue.email_dlq.name
  }
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0 # Alert on the first message in the DLQ
  evaluation_periods  = 1
  period              = 300
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
```
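To make the `maxReceiveCount = 3` semantics concrete: SQS itself makes this decision on every receive, but modeling it as a small function (with a hypothetical message shape) spells out exactly when a message stops being redelivered:

```typescript
// Model of the redrive decision implied by maxReceiveCount. SQS applies this
// itself; this hypothetical helper just makes the rule explicit: a message
// received more than maxReceiveCount times without being deleted is moved
// to the DLQ instead of being redelivered.
interface QueueMessage {
  id: string
  receiveCount: number // ApproximateReceiveCount in SQS terms
}

function routeMessage(msg: QueueMessage, maxReceiveCount: number): 'redeliver' | 'dead-letter' {
  return msg.receiveCount > maxReceiveCount ? 'dead-letter' : 'redeliver'
}
```

So with `maxReceiveCount = 3`, a message gets three processing attempts; the fourth receive sends it to the DLQ, where the CloudWatch alarm makes it visible.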
Fix 3: Outcome Auditing — Verify Expected vs Actual
```typescript
// Don't just trust that the job "ran" — verify it produced the expected outcomes
// Run this daily as a sanity check
async function auditEmailOutcomes(): Promise<void> {
  // For the last 24 hours:
  // How many emails should have been sent (orders completed, signups, etc.)?
  // How many were actually sent according to SendGrid?
  const [ordersYesterday, emailsSent] = await Promise.all([
    db.query(`
      SELECT COUNT(*) AS count
      FROM orders
      WHERE status = 'paid'
        AND paid_at > NOW() - INTERVAL '24 hours'
    `),
    sendgrid.request({
      method: 'GET',
      url: '/v3/stats',
      qs: {
        start_date: yesterday(),
        end_date: today(),
        aggregated_by: 'day',
      },
    }),
  ])

  // COUNT(*) comes back from the driver as a string — coerce before doing math
  const expectedEmails = Number(ordersYesterday.rows[0].count) // One confirmation per order
  const actualSent = emailsSent.body[0]?.stats[0]?.metrics.delivered ?? 0

  // Allow 5% variance for timing differences
  const variance = Math.abs(expectedEmails - actualSent) / expectedEmails
  if (variance > 0.05) {
    await alerting.warn(
      `Email audit: expected ${expectedEmails} order confirmations, SendGrid shows ${actualSent} delivered (${(variance * 100).toFixed(1)}% variance)`
    )
  }
}

// Run daily at 8 AM:
cron.schedule('0 8 * * *', auditEmailOutcomes)
```
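One edge the audit above doesn't cover: on a day with zero orders, the variance calculation divides by zero. A guarded helper (a sketch, not part of the original audit) handles that case explicitly:

```typescript
// Guarded version of the variance check: returns true when actual deviates
// from expected by more than `tolerance`. The zero-expected case is handled
// explicitly so a quiet day (0 orders, 0 emails) neither divides by zero
// nor pages anyone — but emails sent when none were expected still flag.
function outcomeMismatch(expected: number, actual: number, tolerance = 0.05): boolean {
  if (expected === 0) return actual > 0
  return Math.abs(expected - actual) / expected > tolerance
}
```

Factoring the threshold logic out also makes it trivial to unit-test, which matters for code whose whole job is catching problems nothing else will catch.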
Fix 4: Health Check Endpoints for Background Jobs
```typescript
// Background jobs should expose health status that can be polled
// Don't rely on "the job runs" as evidence it's working

// job-health.ts — each job reports its health
const jobHealth: Map<string, JobHealthStatus> = new Map()

interface JobHealthStatus {
  lastRunAt: Date | null
  lastSuccessAt: Date | null
  lastFailureAt: Date | null
  lastError: string | null
  processedCount: number
  failedCount: number
}

// Jobs register their status after each run:
async function runEmailJob() {
  let processed = 0
  let failed = 0
  let lastError: string | null = null
  // ... job logic: increment processed/failed, capture lastError on exceptions ...
  const previous = jobHealth.get('email-job')
  jobHealth.set('email-job', {
    lastRunAt: new Date(),
    lastSuccessAt: failed === 0 ? new Date() : previous?.lastSuccessAt ?? null,
    lastFailureAt: failed > 0 ? new Date() : previous?.lastFailureAt ?? null,
    lastError, // null on a clean run, the last error message otherwise
    processedCount: processed,
    failedCount: failed,
  })
}

// Health endpoint for monitoring systems:
app.get('/internal/jobs/health', (req, res) => {
  const jobs = Object.fromEntries(jobHealth)
  const unhealthyJobs = Object.entries(jobs).filter(([, status]) => {
    const lastRunHoursAgo = status.lastRunAt
      ? (Date.now() - status.lastRunAt.getTime()) / 3_600_000
      : Infinity
    return lastRunHoursAgo > 2 // Each job should run at least every 2 hours
  })
  res.json({
    healthy: unhealthyJobs.length === 0,
    jobs,
    unhealthy: unhealthyJobs.map(([name]) => name),
  })
})
```
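The staleness rule inside the health endpoint can be factored into a small, testable function. The 2-hour window is the same assumption as above; a job with no recorded run at all should always count as stale:

```typescript
// The staleness rule from the health endpoint, factored out for reuse and
// unit testing. A job that has never recorded a run is treated as stale,
// which catches the case where the scheduler itself silently stopped.
function isJobStale(lastRunAt: Date | null, maxAgeHours: number, now: Date = new Date()): boolean {
  if (lastRunAt === null) return true
  const hoursAgo = (now.getTime() - lastRunAt.getTime()) / 3_600_000
  return hoursAgo > maxAgeHours
}
```

Passing `now` as a parameter instead of reading the clock inside the function is what makes the rule deterministic enough to test.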
Fix 5: Structured Logging That Makes Silent Failures Visible
```typescript
// Every background operation should log: what was attempted, what succeeded,
// what failed. This creates a queryable audit trail.
async function syncRecords(sourceRecords: SourceRecord[]): Promise<SyncResult> {
  const results: SyncResult = {
    total: sourceRecords.length,
    succeeded: 0,
    failed: 0,
    skipped: 0,
    errors: [],
  }

  for (const record of sourceRecords) {
    try {
      await upsertRecord(record)
      results.succeeded++
    } catch (err) {
      results.failed++
      results.errors.push({ recordId: record.id, error: err.message })
      logger.error({
        operation: 'sync_record',
        recordId: record.id,
        error: err.message,
      }, 'Record sync failed')
    }
  }

  // Log summary at the end — makes it easy to query in log aggregation
  logger.info({
    operation: 'sync_batch',
    total: results.total,
    succeeded: results.succeeded,
    failed: results.failed,
    skipped: results.skipped,
    successRate: results.succeeded / results.total,
  }, 'Sync batch completed')

  // Alert if failure rate is too high
  if (results.failed / results.total > 0.05) {
    await alerting.warn(`Sync batch: ${results.failed}/${results.total} records failed (${(results.failed / results.total * 100).toFixed(0)}%)`)
  }

  return results
}
```
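One subtlety in the alert check: on an empty batch, `0 / 0` is `NaN`, and `NaN > 0.05` is `false`, so the alert is skipped by accident rather than by decision. A guarded helper (a sketch) makes that behavior explicit:

```typescript
// Guarded failure-rate check for the batch alert. An empty batch has no
// failure rate (0/0 is NaN, and NaN > threshold is false — the alert would
// be skipped for the wrong reason), so the empty case is handled explicitly.
function failureRateExceeds(failed: number, total: number, threshold = 0.05): boolean {
  if (total === 0) return false // empty batch: nothing to alert on
  return failed / total > threshold
}
```

Whether an empty batch should itself alert (a sync that suddenly processes zero records can be its own silent failure) is a separate policy decision — the outcome audit in Fix 3 is the place that catches it.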
Silent Failure Prevention Checklist
- ✅ No bare `catch {}` blocks — every exception logged and tracked
- ✅ Every queue has a dead-letter queue with a CloudWatch alarm
- ✅ Outcome auditing: daily check of expected vs actual results (emails, records, etc.)
- ✅ Background jobs expose health endpoints with last run time and error counts
- ✅ Job summary logs: total processed, succeeded, failed per batch
- ✅ Alert fires when failure rate in any batch exceeds 5%
- ✅ DLQ messages reviewed and redriven or investigated within 24 hours
Conclusion
Silent failures are the reliability problems that don't show up in error rates or latency dashboards — they show up in wrong data, missing emails, and diverged systems discovered weeks later. The defense is systematic observability of background operations: log what was attempted and what failed, monitor dead-letter queues, audit expected outcomes against actuals, and alert when batch failure rates exceed thresholds. No background operation should be able to fail quietly — every failure should leave a trace, queue for retry, and eventually alert if it can't be resolved automatically.