Deploying Without Canary — How One Bad Deploy Hits All Your Users at Once

Author: Sanjeev Sharma (@webcoderspeed1)

Introduction
A standard deploy replaces every running instance with the new version simultaneously. If the new version has a bug that affects 5% of requests, 100% of your users are immediately exposed to it. By the time your error rate alert fires and you start a rollback, the incident has been running for minutes across your entire user base. Canary deployments invert this: expose 1–5% of traffic to the new version first, watch the metrics, and only promote if everything looks good.
- The Cost of All-at-Once Deploys
- Fix 1: Kubernetes Canary With Traffic Splitting
- Fix 2: Automated Canary Analysis
- Fix 3: Nginx / Ingress Traffic Splitting
- Fix 4: User-Segment Canary (Employees First)
- Canary Metrics Dashboard
- Canary Deployment Checklist
- Conclusion
The Cost of All-at-Once Deploys
All-at-once deploy incident timeline:
T+0: Deploy ships to all 20 pods simultaneously
T+2min: New version is live on 100% of traffic
T+4min: Error rate climbs from 0.1% → 3%
T+5min: Alert fires (5-minute evaluation window)
T+7min: Engineer acknowledges alert, starts investigation
T+12min: Root cause identified: new code has null-pointer bug
T+15min: Rollback initiated
T+18min: Old version restored
→ 18 minutes of elevated errors for ALL users
Canary deploy — same bug:
T+0: Deploy ships to 1 of 20 pods (5% traffic)
T+2min: 5% of traffic hits new version
T+4min: Error rate on canary pod: 3% (vs 0.1% baseline)
T+5min: Canary health check detects divergence, auto-pauses
T+6min: 95% of users never saw the bug
T+7min: Engineer investigates with full prod data, zero urgency
→ Only 5% of users exposed for 5 minutes
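The two timelines differ less in detection speed than in blast radius. A rough way to compare them is user-minutes of exposure (traffic share times minutes the broken version served traffic). The sketch below uses the illustrative figures from the timelines above; the helper is hypothetical:

```typescript
// blast-radius.ts — compare exposure between the two timelines above
interface Incident {
  trafficShare: number   // fraction of users on the broken version
  exposedMinutes: number // minutes the broken version served traffic
}

// "User-minutes of exposure": traffic share times duration
function exposure(incident: Incident): number {
  return incident.trafficShare * incident.exposedMinutes
}

const allAtOnce: Incident = { trafficShare: 1.0, exposedMinutes: 18 }
const canary: Incident = { trafficShare: 0.05, exposedMinutes: 5 }

console.log(exposure(allAtOnce) / exposure(canary)) // 72
```

On these numbers the all-at-once deploy produces 72× the exposure, even though the bug and the detection tooling are identical.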
Fix 1: Kubernetes Canary With Traffic Splitting
# Deploy the canary alongside the stable version
# Stable: 19 replicas | Canary: 1 replica = ~5% traffic split

# deployment-stable.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
  labels:
    app: myapp
    track: stable
spec:
  replicas: 19
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
        version: "v1.4.2"
    spec:
      containers:
        - name: myapp
          image: myapp:v1.4.2
---
# deployment-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
  labels:
    app: myapp
    track: canary
spec:
  replicas: 1 # 1 out of 20 total = 5% traffic
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
        version: "v1.4.3" # New version
    spec:
      containers:
        - name: myapp
          image: myapp:v1.4.3
---
# service.yaml — routes to BOTH deployments by app label
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp # Matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 3000
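With plain Deployments the split is controlled entirely by the replica ratio, because the Service load-balances evenly across every pod matching app: myapp. A small sketch (a hypothetical helper, not part of kubectl or any operator) for choosing replica counts for a target percentage:

```typescript
// replica-split.ts — pick replica counts for a target canary percentage
function replicaSplit(totalReplicas: number, canaryPercent: number) {
  // Always run at least one canary pod; otherwise round to whole pods
  const canary = Math.max(1, Math.round((canaryPercent / 100) * totalReplicas))
  const stable = totalReplicas - canary
  // With few pods, the achievable split is coarse-grained
  const actualPercent = (canary / totalReplicas) * 100
  return { canary, stable, actualPercent }
}

console.log(replicaSplit(20, 5)) // { canary: 1, stable: 19, actualPercent: 5 }
```

The granularity caveat matters: with only 4 pods the smallest possible canary is 25% of traffic, which is why replica-based splits are usually paired with a proxy-level mechanism (see Fix 3) for finer control.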
Fix 2: Automated Canary Analysis
// canary-analyzer.ts — decide to promote or rollback based on metrics
interface CanaryMetrics {
  errorRate: number
  p99Latency: number
  p50Latency: number
  requestCount: number
}

async function analyzeCanary(
  canaryMetrics: CanaryMetrics,
  baselineMetrics: CanaryMetrics
): Promise<'promote' | 'rollback' | 'hold'> {
  // Need enough traffic to be statistically significant
  if (canaryMetrics.requestCount < 100) {
    return 'hold'
  }

  // Error rate check: canary must not be significantly worse
  const errorRateDelta = canaryMetrics.errorRate - baselineMetrics.errorRate
  if (errorRateDelta > 0.01) { // More than 1% higher error rate
    console.error(
      `Canary error rate ${(canaryMetrics.errorRate * 100).toFixed(2)}% vs baseline ${(baselineMetrics.errorRate * 100).toFixed(2)}%`
    )
    return 'rollback'
  }

  // Latency check: canary p99 must not regress by more than 20%
  const latencyRegression =
    (canaryMetrics.p99Latency - baselineMetrics.p99Latency) / baselineMetrics.p99Latency
  if (latencyRegression > 0.20) {
    console.error(
      `Canary p99 latency ${canaryMetrics.p99Latency}ms vs baseline ${baselineMetrics.p99Latency}ms (${(latencyRegression * 100).toFixed(0)}% regression)`
    )
    return 'rollback'
  }

  // All checks passed
  return 'promote'
}

// Progressive rollout controller
async function progressiveRollout(newVersion: string) {
  const stages = [
    { canaryPercent: 5, durationMinutes: 10 },
    { canaryPercent: 25, durationMinutes: 15 },
    { canaryPercent: 50, durationMinutes: 15 },
    { canaryPercent: 100, durationMinutes: 0 }, // Full rollout
  ]

  for (const stage of stages) {
    console.log(`Setting canary to ${stage.canaryPercent}%...`)
    await setCanaryWeight(newVersion, stage.canaryPercent)

    if (stage.durationMinutes > 0) {
      await sleep(stage.durationMinutes * 60 * 1000)

      const [canaryMetrics, baselineMetrics] = await Promise.all([
        fetchMetrics({ version: newVersion }),
        fetchMetrics({ version: 'stable' }),
      ])

      const decision = await analyzeCanary(canaryMetrics, baselineMetrics)
      if (decision === 'rollback') {
        await setCanaryWeight(newVersion, 0)
        await alerting.critical(`Canary rollback at ${stage.canaryPercent}%: metrics degraded`)
        return
      }
    }
  }

  console.log('✅ Full rollout complete')
}
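The controller above leans on four helpers (sleep, setCanaryWeight, fetchMetrics, and alerting) supplied by your own platform. sleep is plain stdlib; the rest depend on your stack. A minimal sketch of the assumed contract, with a hypothetical setCanaryWeight that just logs the scaling action it would perform:

```typescript
// Hypothetical helpers assumed by progressiveRollout; adapt to your stack.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

// Sketch: shift traffic by rescaling the canary Deployment.
// A real implementation would call the Kubernetes API or your CD tool.
async function setCanaryWeight(version: string, percent: number): Promise<void> {
  const totalPods = 20 // assumed cluster size, matching Fix 1
  const canaryPods = Math.round((percent / 100) * totalPods)
  console.log(`scale deployment/myapp-canary to ${canaryPods} pods for ${version}`)
}
```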
Fix 3: Nginx / Ingress Traffic Splitting
# nginx canary config — weight-based upstream routing
upstream myapp_stable {
    server stable-pod-1:3000 weight=19;
    server stable-pod-2:3000 weight=19;
    # ... 19 stable pods
}

upstream myapp_canary {
    server canary-pod-1:3000 weight=1;
}

# Split: 95% stable, 5% canary (split_clients belongs in the http context)
split_clients "${remote_addr}AAA" $upstream {
    95% myapp_stable;
    *   myapp_canary;
}

server {
    location / {
        proxy_pass http://$upstream;
        add_header X-Upstream $upstream; # For debugging
    }
}

# Kubernetes nginx-ingress canary annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5" # 5% to canary
spec:
  rules:
    - host: api.myapp.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
Fix 4: User-Segment Canary (Employees First)
// Route specific users to canary: internal employees, beta users, etc.
// Before random traffic splitting, validate with a controlled group
function shouldUseCanary(userId: string, request: Request): boolean {
  // Internal employees always get canary
  if (isInternalEmail(request.headers['x-user-email'])) {
    return true
  }

  // Beta users opted in
  if (userFlags.has(userId, 'canary-access')) {
    return true
  }

  // Percentage rollout based on user ID hash (consistent per user)
  const hash = murmurHash(userId) % 100
  return hash < CANARY_PERCENT // e.g., 5 for 5%
}

// In your load balancer / API gateway
app.use((req, res, next) => {
  const userId = req.user?.id ?? req.ip
  if (shouldUseCanary(userId, req)) {
    req.headers['x-route-to'] = 'canary'
  }
  next()
})
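isInternalEmail, userFlags, and murmurHash above are assumed helpers. The property that actually matters is consistency: hashing the userId means the same user always lands in the same bucket, so nobody flips between versions mid-session. Any stable string hash will do; a minimal sketch using FNV-1a as a stand-in for MurmurHash:

```typescript
// Consistent per-user bucketing; FNV-1a standing in for murmurHash
function fnv1a(input: string): number {
  let hash = 0x811c9dc5 // FNV offset basis (32-bit)
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // FNV prime, kept unsigned
  }
  return hash
}

function inCanary(userId: string, canaryPercent: number): boolean {
  return fnv1a(userId) % 100 < canaryPercent
}
```

A useful side effect: because the bucket test is hash % 100 < percent, raising the percentage from 5 to 25 keeps the original 5% in the canary group and only adds users, so nobody is bounced back to stable during a ramp-up.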
Canary Metrics Dashboard
// Track canary vs stable side-by-side in Prometheus
// These labels let you split metrics by version in Grafana
app.use((req, res, next) => {
  const start = Date.now()
  const version = process.env.APP_VERSION ?? 'unknown'
  const track = process.env.DEPLOY_TRACK ?? 'stable' // 'stable' or 'canary'

  res.on('finish', () => {
    const duration = Date.now() - start
    const status = res.statusCode >= 500 ? 'error' : 'success'

    // Prometheus metrics with version labels
    httpRequestDuration.observe(
      { method: req.method, route: req.route?.path ?? 'unknown', status, version, track },
      duration / 1000
    )
    httpRequestTotal.inc({ status, version, track })
  })

  next()
})

// Grafana query:
// Error rate canary vs stable:
//   rate(http_requests_total{status="error",track="canary"}[5m])
//     / rate(http_requests_total{track="canary"}[5m])
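The PromQL divides the error-request rate by the total-request rate for the canary track; swap in track="stable" to get the baseline. The same arithmetic on raw counter snapshots, as a sketch (the sample numbers are illustrative, not real data):

```typescript
// Error-rate comparison from counter snapshots over the same time window
interface TrackCounts {
  errors: number // http_requests_total{status="error", track=...}
  total: number  // http_requests_total{track=...}
}

function errorRate(c: TrackCounts): number {
  return c.total === 0 ? 0 : c.errors / c.total
}

const canarySample: TrackCounts = { errors: 30, total: 1000 }  // 3%
const stableSample: TrackCounts = { errors: 19, total: 19000 } // 0.1%

const delta = errorRate(canarySample) - errorRate(stableSample)
console.log(delta > 0.01 ? 'rollback' : 'healthy') // rollback
```

This is the same 1% error-rate-delta threshold used by the analyzer in Fix 2, just fed from the dashboard's counters.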
Canary Deployment Checklist
- ✅ New version deploys to a small slice (1–5%) of traffic first
- ✅ Canary metrics are compared against stable baseline automatically
- ✅ Rollback happens automatically if error rate or latency exceeds threshold
- ✅ Internal employees and beta users receive canary before general traffic
- ✅ Canary duration is long enough to cover all traffic patterns (15–30 minutes minimum)
- ✅ Full rollout only happens after all canary stages pass
- ✅ Canary vs stable metrics are visible in dashboards in real time
Conclusion
All-at-once deploys are a bet that your testing caught everything. Canary deployments are an acknowledgment that testing never catches everything — so you expose 5% of real traffic first and watch what happens. The mechanics are straightforward: run one pod on the new version while the rest stay on the current version, compare error rates and latency, and automate the promote/rollback decision. The extra 15 minutes of canary observation is cheap insurance: it catches the class of bugs that only surface under real production traffic, exactly the ones your staging environment misses.