Handling a Production Incident Live — What Good Incident Command Looks Like

Introduction

Incident response under pressure is a practiced skill, not a natural talent. The engineers who handle incidents well don't do it because they're calmer or smarter — they do it because they've internalized a process that works even when the pressure is high. The most common failure in incidents is everyone trying to fix things simultaneously with no coordination: multiple people making changes at once, no audit trail of what was tried, and no one responsible for communicating status to stakeholders. Good incident command separates diagnosis from execution, maintains a timeline, and delegates clearly.

What Bad Incident Response Looks Like

Common incident response failure patterns:

T+0:  Alert fires
T+1:  Three engineers independently start investigating
T+3:  Engineer A tries a fix without telling anyone
T+4:  Engineer B tries a different fix — now two changes in flight
T+5:  Error rate drops — was it A or B? Or neither?
T+7:  Third fix deployed — now undoing the changes is unclear
T+10: Stakeholders asking "what's happening" — no one has time to answer
T+15: All three engineers have changed things; nobody knows the current state
T+20: Error rate goes back up — nobody knows why

After: Can't write a postmortem because nobody tracked what happened

Better structure:
- One incident commander (coordinates, doesn't execute)
- One or two responders (investigate and fix)
- One comms person (stakeholder updates)
- Shared incident channel with timestamped updates
- No changes without announcing them
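The role split above can be sketched as a small record, handy as a template when opening an incident channel. This is a sketch; the type name, field names, and people are hypothetical:

```typescript
// A minimal sketch of the role assignments above; all names are hypothetical.
interface IncidentRoles {
  commander: string    // coordinates, does not execute changes
  responders: string[] // one or two engineers who investigate and fix
  comms: string        // owns stakeholder updates
  channel: string      // shared channel for timestamped updates
}

const roles: IncidentRoles = {
  commander: 'alice',
  responders: ['bob'],
  comms: 'carol',
  channel: '#incident-2026-03-15-checkout-errors',
}

// Guard against the failure mode above: the commander never appears in the
// responder list, so coordination and execution stay separated.
function rolesAreValid(r: IncidentRoles): boolean {
  return r.responders.length >= 1 &&
    r.responders.length <= 2 &&
    !r.responders.includes(r.commander)
}
```

Checking the record at declaration time is cheap insurance: if the IC is also listed as a responder, the structure has already collapsed.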

Fix 1: The First 10 Minutes Framework

// Incident Commander mental model for the first 10 minutes

const firstTenMinutes = {
  minute0_2: {
    role: 'Assess and declare',
    actions: [
      'Acknowledge the alert — tell others you have it',
      'Open the incident channel: "#incident-YYYY-MM-DD-[description]"',
      'Declare severity: P1 (service down) / P2 (degraded) / P3 (single user)',
      'Post first status: "Investigating: [symptom]. [your name] is IC."',
    ],
  },

  minute2_5: {
    role: 'Orient',
    actions: [
      'Check: what changed in the last hour? (recent deploys, config changes)',
      'Check: what does the error look like? (error type, affected endpoints)',
      'Check: what is the scope? (% of users, which regions, which services)',
      'DO NOT start fixing yet — understand first',
    ],
  },

  minute5_10: {
    role: 'Delegate',
    actions: [
      'Assign: one engineer to investigate root cause',
      'Assign: one engineer to check rollback feasibility',
      'If P1: notify stakeholders now (not after you know everything)',
      'Post status update: "Root cause investigation in progress. Likely [hypothesis]. ETA for update: 10 min."',
    ],
  },

  minute10_plus: {
    role: 'Coordinate',
    actions: [
      'Every change announced before execution: "I\'m going to try X"',
      'Every change result logged: "X didn\'t help / X reduced error rate"',
      'Status updates every 10-15 minutes regardless of progress',
      'One fix at a time — never two simultaneous changes',
    ],
  },
}

Fix 2: The Incident Timeline (Real-Time Log)

Maintain this in the incident channel — timestamped, continuous

Format: [HH:MM UTC] [WHO] [WHAT HAPPENED]

Example:
14:03 @alice IC: Error rate at 8%, affecting checkout. Investigating.
14:04 @alice Checked deploys: payment-service v2.4.1 deployed 14:00 UTC
14:05 @bob Investigating: payment service logs show "connection refused to redis"
14:07 @alice Hypothesis: Redis connection issue. @bob check Redis health
14:08 @bob Redis CPU at 95%, max connections reached (100/100)
14:09 @alice Fix option 1: increase Redis max connections. Rollback option: redeploy payment-service v2.4.0
14:10 @alice Stakeholder update sent: "Payment service degraded, investigating Redis"
14:11 @bob Attempting: redis-cli config set maxclients 200
14:12 @bob Redis config updated. Error rate dropping.
14:14 @alice Error rate at 1.2% and falling. Still monitoring.
14:18 @alice Error rate back to baseline 0.1%. Incident resolved.
14:19 @alice Postmortem scheduled for tomorrow 10am. Action items: [link]

This log:
- Tells the story of what happened
- Shows what was tried and what worked
- Provides the timeline for the postmortem
- Lets latecomers catch up instantly
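Consistency is what makes the log scannable, so a tiny formatter for the [HH:MM UTC] [WHO] [WHAT HAPPENED] format above can help — a sketch, with a helper name of my own choosing:

```typescript
// Sketch of a helper producing the timeline format described above:
// "HH:MM @who what happened", with the time taken in UTC.
function timelineEntry(at: Date, who: string, what: string): string {
  const hh = String(at.getUTCHours()).padStart(2, '0')
  const mm = String(at.getUTCMinutes()).padStart(2, '0')
  return `${hh}:${mm} @${who} ${what}`
}

// Matches the first line of the example log:
// timelineEntry(new Date(Date.UTC(2026, 2, 15, 14, 3)), 'alice',
//   'IC: Error rate at 8%, affecting checkout. Investigating.')
```

Whether entries come from a bot or are typed by hand matters less than that every entry follows the same shape.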

Fix 3: Stakeholder Communication Template

// Communicate early, often, and with specificity
// Silence is the worst stakeholder communication during an incident

// P1 Initial notification (within 5 minutes):
const p1Initial = `
🚨 *INCIDENT P1 - Payment Service Degraded*

Status: Investigating
Impact: ~8% of checkout attempts failing
Affected: Users attempting to complete purchases
Started: 14:00 UTC (approximately)

Next update: 14:15 UTC or sooner if resolved.
IC: @alice
`

// P1 Update (every 10-15 minutes):
const p1Update = `
📊 *INCIDENT UPDATE - 14:15 UTC*

Progress: Root cause identified — Redis reached max connections
Fix in progress: Increasing Redis connection limit
Error rate: Down from 8% to 2%, still declining

Next update: 14:25 UTC or on resolution.
`

// P1 Resolution:
const p1Resolution = `
✅ *INCIDENT RESOLVED - 14:18 UTC*

Duration: 18 minutes (14:00 - 14:18 UTC)
Root cause: Redis max_connections limit (100) hit by traffic spike
Fix applied: Increased to 200 connections, error rate returned to baseline

Customer impact: Approximately 2,400 failed checkout attempts (8% of attempts over 18 min)
Postmortem scheduled: 2026-03-16 10:00 UTC

Action items:
- Increase Redis connection limit permanently (tonight)
- Add CloudWatch alert for Redis connection count > 80% capacity
- Review connection pooling in payment service
`
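Rather than hand-editing these strings under pressure, the update can be rendered from structured fields so every message has the same shape. A sketch — the interface and field names are assumptions, not an established format:

```typescript
// Sketch: render a mid-incident update from structured fields so no
// section (progress, error rate, next-update time) gets forgotten.
interface IncidentUpdate {
  timeUtc: string
  progress: string
  fixInProgress: string
  errorRate: string
  nextUpdate: string
}

function renderUpdate(u: IncidentUpdate): string {
  return [
    `📊 *INCIDENT UPDATE - ${u.timeUtc}*`,
    '',
    `Progress: ${u.progress}`,
    `Fix in progress: ${u.fixInProgress}`,
    `Error rate: ${u.errorRate}`,
    '',
    `Next update: ${u.nextUpdate}.`,
  ].join('\n')
}
```

The point of the structure is the forcing function: a required `nextUpdate` field means no update goes out without committing to the next one.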

Fix 4: The Rollback-First Mental Model

// During an incident: rollback is usually faster than finding root cause

const rollbackFirstPrinciple = {
  question: 'Before investigating root cause: can we rollback to restore service?',

  ifRecentDeploy: [
    '1. Check: was there a deploy in the last hour?',
    '2. Can we rollback to the previous version in < 5 minutes?',
    '3. If yes: rollback first, investigate after service is restored',
    '4. Users care about service restoration, not root cause understanding',
  ],

  ifNoRecentDeploy: [
    '1. Look for infrastructure changes (autoscaling event, config change)',
    '2. Look for external dependency issues (third-party API, database)',
    '3. Look for traffic pattern changes (spike, unusual request type)',
    '4. Mitigation (rate limiting, circuit breaker) before fix',
  ],
}

// Common mistake: spending 20 minutes finding the root cause
// when a rollback would have restored service in 5 minutes.
// Restore first. Learn after.
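The decision above reduces to a small function, using the thresholds already stated (deploy within the last hour, rollback achievable in under 5 minutes). A sketch; the type and function names are my own:

```typescript
// Sketch of the rollback-first decision described above.
interface IncidentContext {
  minutesSinceLastDeploy: number | null  // null = no recent deploy found
  rollbackMinutesEstimate: number | null // null = rollback not feasible
}

type FirstAction = 'rollback' | 'mitigate-and-investigate'

function firstAction(ctx: IncidentContext): FirstAction {
  const recentDeploy =
    ctx.minutesSinceLastDeploy !== null && ctx.minutesSinceLastDeploy <= 60
  const fastRollback =
    ctx.rollbackMinutesEstimate !== null && ctx.rollbackMinutesEstimate < 5
  // Deploy in the last hour + rollback in under 5 minutes: restore first,
  // investigate after. Otherwise mitigate (rate limit, circuit break) first.
  return recentDeploy && fastRollback ? 'rollback' : 'mitigate-and-investigate'
}
```

Writing the rule down matters more than the exact thresholds — the reflex being encoded is "ask about rollback before asking about root cause."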

Fix 5: Post-Incident Immediately Actionable Items

// The most important thing to do right after an incident is resolved:
// write down the immediate next steps while memory is fresh

interface PostIncidentActions {
  within1Hour: string[]    // While the team is still assembled
  within24Hours: string[]  // Before the postmortem
  within1Week: string[]    // Postmortem actions
}

const immediateActions: PostIncidentActions = {
  within1Hour: [
    'Write the timeline while it\'s fresh (you\'ll forget details by tomorrow)',
    'Check if the fix needs a permanent change (config set manually → automate)',
    'Send final stakeholder update with resolution summary',
    'Thank the team and release from incident duty',
  ],
  within24Hours: [
    'Draft postmortem document',
    'Collect any monitoring data before it ages out',
    'Identify who was affected and how (for customer notification)',
  ],
  within1Week: [
    'Complete postmortem with action items',
    'Implement the P1 action items (monitoring, alerts, runbook)',
    'Schedule one-month follow-up on all action items',
  ],
}

Incident Command Checklist

  • ✅ Single incident commander — one person coordinates, others execute
  • ✅ Incident channel opened within 2 minutes with real-time timeline
  • ✅ First stakeholder update sent within 5 minutes (even if "investigating")
  • ✅ Rollback considered before root cause investigation
  • ✅ Every change announced before execution — no uncoordinated changes
  • ✅ Status updates every 10-15 minutes regardless of progress
  • ✅ Timeline maintained throughout — postmortem writes itself
  • ✅ Immediate action items captured while team is still assembled

Conclusion

Incident response quality is the product of practiced process, not talent. The engineers who handle incidents well have internalized a simple structure: one commander, a shared timeline, early stakeholder communication, and a rollback-before-investigation reflex. The most expensive minutes in an incident are spent with multiple people making uncoordinated changes — the timeline becomes unreadable, the blast radius expands, and the post-incident recovery is harder. A 30-minute "how we run incidents" training session and a shared template do more for incident quality than any additional tooling.