Deployment Strategies — Blue/Green, Canary, Rolling, and Shadow Traffic Compared
By Sanjeev Sharma (@webcoderspeed1)

Introduction
Shipping code is easy; shipping it without breaking production is an art. Blue/green deployments swap entire environments instantly. Canary releases shift traffic gradually to the new version. Rolling updates replace pods a few at a time. Shadow traffic mirrors requests to the new version to detect logic bugs before users ever see them. Each strategy trades off speed, observability, cost, and rollback time. This post dissects each strategy with real Kubernetes configurations and guidance on choosing based on your SLA requirements.
- Blue/Green Deployments with AWS ALB
- Canary Deployments with Argo Rollouts
- Rolling Updates with maxSurge and maxUnavailable
- Shadow Traffic Deployment
- Strategy Comparison Matrix
- Checklist
- Conclusion
Blue/Green Deployments with AWS ALB
Deploy to a completely separate environment (green), then switch traffic instantly.
```yaml
# infra/blue-green-deployment.yaml
# Blue environment (current production)
apiVersion: v1
kind: Service
metadata:
  name: api-blue
  labels:
    app: api
    color: blue
spec:
  selector:
    app: api
    color: blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      color: blue
  template:
    metadata:
      labels:
        app: api
        color: blue
    spec:
      containers:
        - name: api
          image: api:v1.2.3
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.2.3'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Green environment (new release, not receiving traffic yet)
apiVersion: v1
kind: Service
metadata:
  name: api-green
  labels:
    app: api
    color: green
spec:
  selector:
    app: api
    color: green
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      color: green
  template:
    metadata:
      labels:
        app: api
        color: green
    spec:
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.3.0'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
```
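If you are not fronting the environments with an ALB, the same cut-over can be done inside the cluster by repointing a single shared Service's selector between colors. A minimal sketch, assuming a shared Service named `api` selecting on the `color` label used above (the Service name is an assumption, not from the manifests):

```typescript
// Build the strategic-merge patch that repoints a shared `api` Service
// from one color to the other. Apply it with:
//   kubectl patch svc api -p '<patch>'
type Color = 'blue' | 'green';

export function buildSelectorPatch(color: Color): string {
  return JSON.stringify({
    spec: { selector: { app: 'api', color } },
  });
}

// The rollback target is always the other color.
export function oppositeColor(color: Color): Color {
  return color === 'blue' ? 'green' : 'blue';
}
```

Because the patch only touches the selector, rollback is the same operation with the opposite color — the old pods never went away.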
AWS ALB target group switching script:
```typescript
// deploy/blue-green-switch.ts
import {
  ELBv2Client,
  ModifyListenerCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

const elbClient = new ELBv2Client({ region: 'us-east-1' });

export async function switchTrafficBlueGreen(
  listenerArn: string,
  targetGroupArn: string,
  color: 'blue' | 'green'
) {
  // Run smoke tests against the target environment before switching
  const smokeTestsPassed = await runSmokeTests(`http://api-${color}:80`);
  if (!smokeTestsPassed) {
    throw new Error(`Smoke tests failed for ${color} environment`);
  }

  const command = new ModifyListenerCommand({
    ListenerArn: listenerArn,
    DefaultActions: [
      {
        Type: 'forward',
        TargetGroupArn: targetGroupArn, // target group for the new color
      },
    ],
  });

  const result = await elbClient.send(command);
  console.log(`Traffic switched to ${color} environment`);

  // Monitor the new environment for 5 minutes before declaring success
  await monitorEnvironment(color, 5 * 60 * 1000);
  return result;
}

async function runSmokeTests(baseUrl: string) {
  const tests = [
    { path: '/health', expectedStatus: 200 },
    { path: '/api/users', expectedStatus: 200 },
    { path: '/api/posts?limit=10', expectedStatus: 200 },
  ];

  for (const test of tests) {
    try {
      // fetch has no `timeout` option; use AbortSignal.timeout instead
      const response = await fetch(`${baseUrl}${test.path}`, {
        signal: AbortSignal.timeout(5000),
      });
      if (response.status !== test.expectedStatus) {
        console.error(`Test failed: ${test.path} returned ${response.status}`);
        return false;
      }
    } catch (error) {
      console.error(`Test failed: ${test.path} - ${error}`);
      return false;
    }
  }
  return true;
}

async function monitorEnvironment(color: string, duration: number) {
  const startTime = Date.now();
  while (Date.now() - startTime < duration) {
    // Check error rate from your metrics backend
    const errorRate = await getErrorRate(`api-${color}`);
    if (errorRate > 0.05) {
      // errorRate is a fraction: 0.05 = 5%
      throw new Error(
        `High error rate detected (${(errorRate * 100).toFixed(1)}%) on ${color}`
      );
    }
    await new Promise((resolve) => setTimeout(resolve, 10000)); // check every 10s
  }
}

async function getErrorRate(serviceName: string): Promise<number> {
  // Query Prometheus/CloudWatch for the service's 5xx rate
  return 0.01; // placeholder
}

export async function rollbackBlueGreen(
  listenerArn: string,
  previousTargetGroupArn: string,
  previousColor: 'blue' | 'green'
) {
  console.log(`Rolling back to ${previousColor}`);
  const command = new ModifyListenerCommand({
    ListenerArn: listenerArn,
    DefaultActions: [
      {
        Type: 'forward',
        TargetGroupArn: previousTargetGroupArn,
      },
    ],
  });
  await elbClient.send(command);
}
```
Canary Deployments with Argo Rollouts
Gradually shift traffic from stable to new version using weighted percentages:
```yaml
# deploy/canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.3.0'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
  strategy:
    canary:
      canaryService: api-canary # Service pointing at the canary pods
      stableService: api-stable # Service pointing at the stable pods
      canaryMetadata:
        labels:
          canary: 'true'
      # Background analysis: runs for the whole rollout and aborts it
      # (triggering automatic rollback) if error rate or latency degrade
      analysis:
        templates:
          - templateName: error-rate
      steps:
        # Step 1: deploy canary pods, 0% traffic
        - setWeight: 0
        - pause:
            duration: 5m # wait 5 minutes for metrics to stabilize
        # Step 2: shift 5% of traffic
        - setWeight: 5
        - pause:
            duration: 5m
        # Step 3: increase to 10%
        - setWeight: 10
        - pause:
            duration: 5m
        # Step 4: increase to 25%
        - setWeight: 25
        - pause:
            duration: 5m
        # Step 5: increase to 50%
        - setWeight: 50
        - pause:
            duration: 5m
        # Step 6: full traffic
        - setWeight: 100
```
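To reason about how long a clean promotion takes, the step schedule above can be modeled as data — a sketch mirroring the weights and pauses in the Rollout manifest:

```typescript
// Model of the canary step schedule from the Rollout above:
// traffic weight after each step and the pause before the next one.
export interface CanaryStep {
  weight: number;       // traffic percentage after this step
  pauseMinutes: number; // hold time before moving to the next step
}

export const STEPS: CanaryStep[] = [
  { weight: 0, pauseMinutes: 5 },
  { weight: 5, pauseMinutes: 5 },
  { weight: 10, pauseMinutes: 5 },
  { weight: 25, pauseMinutes: 5 },
  { weight: 50, pauseMinutes: 5 },
  { weight: 100, pauseMinutes: 0 },
];

// Total wall-clock time for a promotion with no analysis failures.
export function promotionMinutes(steps: CanaryStep[]): number {
  return steps.reduce((sum, s) => sum + s.pauseMinutes, 0);
}
```

With the schedule above, a fully healthy rollout takes 25 minutes end to end — worth knowing when you set on-call expectations or a deployment window.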
AnalysisTemplate for canary validation:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
      value: api
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] <= 0.05 # max 5% errors
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            scalar(
              sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
            )
    - name: latency-p99
      interval: 1m
      successCondition: result[0] <= 500 # max 500ms p99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            scalar(
              histogram_quantile(
                0.99,
                sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le)
              ) * 1000
            )
```
Nginx-based canary with weight shifting:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/canary: 'true'
    nginx.ingress.kubernetes.io/canary-weight: '5' # Start at 5% traffic
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```
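Promoting an nginx canary is just a matter of rewriting the `canary-weight` annotation. A small sketch that builds the merge patch for each weight bump (to be applied with `kubectl patch ingress api --type=merge -p '<patch>'`):

```typescript
// Build the JSON merge patch that sets the nginx canary weight annotation.
// Weights outside 0-100 are rejected before they ever reach the cluster.
export function canaryWeightPatch(weight: number): string {
  if (!Number.isInteger(weight) || weight < 0 || weight > 100) {
    throw new Error(`canary weight must be an integer 0-100, got ${weight}`);
  }
  return JSON.stringify({
    metadata: {
      annotations: {
        'nginx.ingress.kubernetes.io/canary-weight': String(weight),
      },
    },
  });
}
```

Driving this from a script (5 → 10 → 25 → 50 → 100, with a metrics check between bumps) gives you a poor man's Argo Rollouts when you can't install a new controller.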
Rolling Updates with maxSurge and maxUnavailable
Update pods gradually in-place:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-rolling
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2 # Allow 2 extra pods during update (total: 12)
      maxUnavailable: 1 # Keep at least 9 pods available during update
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Graceful shutdown: wait for in-flight requests to complete
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Give the load balancer time to deregister the pod
                # before the container receives SIGTERM
                command: ['/bin/sh', '-c', 'sleep 15']
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
  # Minimal downtime; fast rollback still available via kubectl rollout undo
  progressDeadlineSeconds: 600 # Abort if rollout takes > 10 min
  revisionHistoryLimit: 10
```
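The `maxSurge`/`maxUnavailable` pair bounds how many pods can exist and how many must stay serving at any instant. A sketch of that arithmetic, including the percentage forms Kubernetes also accepts (percent surge rounds up, percent unavailability rounds down):

```typescript
// Effective pod-count bounds during a rolling update.
// Kubernetes rounds maxSurge *up* and maxUnavailable *down*
// when they are given as percentages.
export function rolloutBounds(
  replicas: number,
  maxSurge: number | string,
  maxUnavailable: number | string
) {
  const surge =
    typeof maxSurge === 'string'
      ? Math.ceil((parseInt(maxSurge, 10) / 100) * replicas)
      : maxSurge;
  const unavailable =
    typeof maxUnavailable === 'string'
      ? Math.floor((parseInt(maxUnavailable, 10) / 100) * replicas)
      : maxUnavailable;
  return {
    maxTotalPods: replicas + surge,           // upper bound on pods that may exist
    minAvailablePods: replicas - unavailable, // lower bound on serving pods
  };
}
```

For the Deployment above (`replicas: 10`, `maxSurge: 2`, `maxUnavailable: 1`) this gives at most 12 pods and at least 9 available — which is why the inline comments say "total: 12" and "at least 9".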
Rolling update monitoring:
```typescript
// deploy/rolling-update-monitor.ts
import { AppsV1Api, KubeConfig } from '@kubernetes/client-node';

const kc = new KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(AppsV1Api);

export async function monitorRollingUpdate(
  namespace: string,
  deploymentName: string,
  timeoutSeconds: number = 600
) {
  const startTime = Date.now();
  const maxDuration = timeoutSeconds * 1000;

  while (Date.now() - startTime < maxDuration) {
    const deployment = await k8sApi.readNamespacedDeployment(
      deploymentName,
      namespace
    );
    const spec = deployment.body.spec!;
    const status = deployment.body.status!;

    const replicas = spec.replicas || 0;
    const updatedReplicas = status.updatedReplicas || 0;
    const readyReplicas = status.readyReplicas || 0;
    const availableReplicas = status.availableReplicas || 0;

    console.log(
      `[${new Date().toISOString()}] Replicas: ${availableReplicas}/${replicas} available, ${updatedReplicas} updated, ${readyReplicas} ready`
    );

    // Check whether all replicas are updated, ready, and available
    if (
      updatedReplicas === replicas &&
      readyReplicas === replicas &&
      availableReplicas === replicas
    ) {
      console.log('Rolling update complete');
      return { success: true, duration: Date.now() - startTime };
    }

    // Check for a stalled rollout
    if (status.conditions) {
      const progressingCondition = status.conditions.find(
        (c) => c.type === 'Progressing'
      );
      if (
        progressingCondition?.status === 'False' &&
        progressingCondition?.reason === 'ProgressDeadlineExceeded'
      ) {
        throw new Error('Rolling update exceeded progress deadline');
      }
    }

    await new Promise((resolve) => setTimeout(resolve, 5000)); // poll every 5s
  }

  throw new Error(`Rolling update timed out after ${timeoutSeconds}s`);
}
```
Shadow Traffic Deployment
Mirror requests to new version without affecting users:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.example.com
  http:
    - match:
        - uri:
            prefix: '/'
      route:
        # All user traffic goes to the stable subset
        - destination:
            host: api
            subset: stable
          weight: 100
      # Mirror a copy of every request to the canary subset.
      # Mirrored responses are discarded, so users are never affected.
      mirror:
        host: api
        subset: canary
      mirrorPercentage:
        value: 100.0
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
  subsets:
    - name: stable
      labels:
        version: v1.2.3
    - name: canary
      labels:
        version: v1.3.0
```
Envoy Lua filter for inspecting responses from the mirrored (canary) workload:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: api-mirror
spec:
  workloadSelector:
    labels:
      app: api
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: 'envoy.filters.network.http_connection_manager'
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.lua
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_response(response_handle)
                -- Log response status from the canary version
                local headers = response_handle:headers()
                print("Canary status: " .. headers:get(":status"))
              end
```
Shadow traffic analyzer:
```typescript
// deploy/shadow-analyzer.ts
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({
  node: 'http://elasticsearch:9200',
});

export async function analyzeShadowTraffic(
  stableVersion: string,
  canaryVersion: string,
  durationMinutes: number
) {
  const now = Date.now();
  const startTime = now - durationMinutes * 60 * 1000;

  const query = {
    bool: {
      must: [
        { range: { '@timestamp': { gte: startTime, lte: now } } },
        {
          terms: {
            'version.keyword': [stableVersion, canaryVersion],
          },
        },
      ],
    },
  };

  // Compare p99 latency and 5xx counts per version
  const latencyComparison = await esClient.search({
    index: 'api-logs',
    size: 0,
    query,
    aggs: {
      by_version: {
        terms: { field: 'version.keyword', size: 2 },
        aggs: {
          latency_p99: {
            percentiles: { field: 'response_time_ms', percents: [99] },
          },
          errors: {
            filter: { range: { status_code: { gte: 500 } } },
          },
        },
      },
    },
  });

  // Compare average response sizes per version
  const outputComparison = await esClient.search({
    index: 'api-logs',
    size: 0,
    query,
    aggs: {
      by_version: {
        terms: { field: 'version.keyword', size: 2 },
        aggs: {
          avg_response_size: { avg: { field: 'response_size_bytes' } },
        },
      },
    },
  });

  console.log('Shadow traffic analysis:');
  console.log('Latency:', latencyComparison.aggregations);
  console.log('Output size:', outputComparison.aggregations);

  return {
    latencyComparison,
    outputComparison,
    // Placeholder: in a real pipeline, derive this verdict from the
    // aggregations above instead of hardcoding it
    recommendation: 'Canary version is safe for gradual traffic shift',
  };
}
```
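The hardcoded `recommendation` above glosses over the actual decision. A sketch of that comparison logic — the tolerance values are illustrative assumptions, not recommendations:

```typescript
// Compare stable vs canary metrics extracted from the aggregations above
// and flag the canary if it regresses past a tolerance.
export interface VersionMetrics {
  p99Ms: number;     // p99 latency in milliseconds
  errorRate: number; // fraction, e.g. 0.01 = 1%
}

export function shadowVerdict(
  stable: VersionMetrics,
  canary: VersionMetrics,
  latencyTolerance = 1.2, // allow up to 20% p99 regression
  errorRateDelta = 0.01   // allow up to +1 percentage point of errors
): 'promote' | 'investigate' {
  const latencyOk = canary.p99Ms <= stable.p99Ms * latencyTolerance;
  const errorsOk = canary.errorRate <= stable.errorRate + errorRateDelta;
  return latencyOk && errorsOk ? 'promote' : 'investigate';
}
```

Relative thresholds matter here: comparing the canary against the stable baseline from the *same* time window cancels out diurnal traffic patterns that absolute thresholds would trip over.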
Strategy Comparison Matrix
```typescript
export const DEPLOYMENT_STRATEGIES = {
  blueGreen: {
    rollbackTime: '< 1 minute',
    riskProfile: 'Medium - full switch instantly',
    dbMigrationStrategy: 'Apply before switch, rollback migrations available',
    trafficShift: 'Instant 0% to 100%',
    observability: 'Complete environment available for testing',
    cost: 'Highest - maintain 2x infrastructure',
    bestFor: 'Critical services needing fast rollback',
  },
  canary: {
    rollbackTime: '5-30 minutes',
    riskProfile: 'Low - 5% of users initially',
    dbMigrationStrategy: 'Deploy backwards-compatible migrations first',
    trafficShift: 'Gradual - 0% → 5% → 10% → 25% → 50% → 100%',
    observability: 'Real user monitoring, metrics-driven progression',
    cost: 'Medium - 2-3 extra pods per service',
    bestFor: 'Services with unknown impact, continuous deployment',
  },
  rolling: {
    rollbackTime: '10-60 minutes',
    riskProfile: 'Low to Medium - old and new coexist',
    dbMigrationStrategy: 'Requires backwards compatibility; no rollback',
    trafficShift: 'Pod-by-pod replacement',
    observability: 'Mixed old/new versions complicate debugging',
    cost: 'Low - no extra resources during update',
    bestFor: 'Straightforward updates, cost-sensitive environments',
  },
  shadowTraffic: {
    rollbackTime: 'N/A - read-only mirror',
    riskProfile: 'Lowest - zero impact on users',
    dbMigrationStrategy: 'Shadow uses read-replicas, no mutation',
    trafficShift: '100% mirrored to shadow, separate from user traffic',
    observability: 'Compare behavior in parallel before deploy',
    cost: 'High - shadow infrastructure overhead',
    bestFor: 'High-risk changes, logic bugs, output format changes',
  },
};
```
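One way to operationalize the matrix is a simple decision helper — a rough heuristic sketch derived from the trade-offs above, not a policy engine:

```typescript
// Heuristic strategy picker based on the comparison matrix above.
// The ServiceProfile fields are illustrative inputs, not a standard API.
export interface ServiceProfile {
  needsInstantRollback: boolean; // sub-minute rollback SLA?
  canAfford2xInfra: boolean;     // budget for a duplicate environment?
  highRiskLogicChange: boolean;  // correctness risk worth pre-validating?
}

export function suggestStrategy(p: ServiceProfile): string {
  // Validate risky logic changes offline first, then redeploy normally
  if (p.highRiskLogicChange) return 'shadowTraffic';
  // Instant rollback is only cheap if you can run two environments
  if (p.needsInstantRollback && p.canAfford2xInfra) return 'blueGreen';
  // Otherwise canary gives controlled risk without doubling infra
  if (p.needsInstantRollback) return 'canary';
  return 'rolling';
}
```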
Checklist
- Define success metrics before deployment (latency, error rate)
- Test rollback procedure in staging
- Set up canary analysis rules
- Configure graceful shutdown (terminationGracePeriodSeconds)
- Implement readiness and liveness probes
- Monitor resource usage during deployment
- Have explicit rollback triggers
- Test database migration compatibility first
- Set deployment timeout/deadline
- Document rollback procedure per strategy
Conclusion
Blue/green trades cost for instant rollback. Canary trades time for confidence. Rolling updates trade observability for simplicity. Shadow traffic trades cost for zero-risk validation. Pick the strategy that matches your service criticality, team experience, and operational overhead tolerance. Most teams benefit from starting with rolling updates, graduating to canary for critical services, and shadowing before major logic changes.