Deployment Strategies — Blue/Green, Canary, Rolling, and Shadow Traffic Compared

Introduction

Shipping code is easy. Shipping code without breaking production is the art. Blue/green deployments swap entire environments instantly. Canary releases shift traffic gradually to new versions. Rolling updates replace pods one at a time. Shadow traffic mirrors requests to detect logic bugs before users see them. Each strategy trades off speed, observability, and rollback time. This post dissects each strategy with real Kubernetes configurations and guidance on choosing based on your SLA requirements.

Blue/Green Deployments with AWS ALB

Deploy to a completely separate environment (green), then switch traffic instantly.

# infra/blue-green-deployment.yaml
# Blue environment (current production)
apiVersion: v1
kind: Service
metadata:
  name: api-blue
  labels:
    app: api
    color: blue
spec:
  selector:
    app: api
    color: blue
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      color: blue
  template:
    metadata:
      labels:
        app: api
        color: blue
    spec:
      containers:
        - name: api
          image: api:v1.2.3
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.2.3'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5

---
# Green environment (new release, not receiving traffic yet)
apiVersion: v1
kind: Service
metadata:
  name: api-green
  labels:
    app: api
    color: green
spec:
  selector:
    app: api
    color: green
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
      color: green
  template:
    metadata:
      labels:
        app: api
        color: green
    spec:
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.3.0'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
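
With both environments running, the cutover itself is just repointing a routing layer. In a pure-Kubernetes setup (no ALB), a single production Service whose selector includes the color label can be flipped with one patch. A minimal sketch — the production Service name `api` is an assumption, not defined in the manifests above:

```typescript
// deploy/service-switch.ts — sketch for selector-based blue/green switching
type Color = 'blue' | 'green';

// JSON merge patch that repoints the production Service at the target color.
// Apply via the Kubernetes API or:
//   kubectl patch service api --type merge -p '<json>'
export function buildSelectorPatch(color: Color) {
  return { spec: { selector: { app: 'api', color } } };
}

console.log(JSON.stringify(buildSelectorPatch('green')));
// → {"spec":{"selector":{"app":"api","color":"green"}}}
```

Because the switch is a single selector change, rollback is the same patch with the previous color.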

AWS ALB target group switching script:

// deploy/blue-green-switch.ts
import {
  ELBv2Client,
  ModifyListenerCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

const elbClient = new ELBv2Client({ region: 'us-east-1' });

export async function switchTrafficBlueGreen(
  listenerArn: string,
  targetGroupArn: string,
  color: 'blue' | 'green'
) {
  // Run smoke tests against the target environment before switching
  const smokeTestsPassed = await runSmokeTests(`http://api-${color}:80`);

  if (!smokeTestsPassed) {
    throw new Error(`Smoke tests failed for ${color} environment`);
  }

  // Point the listener's default action at the target group for this color
  const command = new ModifyListenerCommand({
    ListenerArn: listenerArn,
    DefaultActions: [
      {
        Type: 'forward',
        TargetGroupArn: targetGroupArn,
      },
    ],
  });

  const result = await elbClient.send(command);

  console.log(`Traffic switched to ${color} environment`);

  // Monitor new environment for 5 minutes
  await monitorEnvironment(color, 5 * 60 * 1000);

  return result;
}

async function runSmokeTests(baseUrl: string) {
  const tests = [
    { path: '/health', expectedStatus: 200 },
    { path: '/api/users', expectedStatus: 200 },
    { path: '/api/posts?limit=10', expectedStatus: 200 },
  ];

  for (const test of tests) {
    try {
      const response = await fetch(`${baseUrl}${test.path}`, {
        signal: AbortSignal.timeout(5000), // fetch has no `timeout` option
      });
      if (response.status !== test.expectedStatus) {
        console.error(
          `Test failed: ${test.path} returned ${response.status}`
        );
        return false;
      }
    } catch (error) {
      console.error(`Test failed: ${test.path} - ${error}`);
      return false;
    }
  }

  return true;
}

async function monitorEnvironment(
  color: string,
  duration: number
) {
  const startTime = Date.now();

  while (Date.now() - startTime < duration) {
    // Check error rate from CloudWatch
    const errorRate = await getErrorRate(`api-${color}`);

    if (errorRate > 0.05) {
      // errorRate is a fraction; abort above 5%
      throw new Error(
        `High error rate detected (${(errorRate * 100).toFixed(1)}%) on ${color}`
      );
    }

    await new Promise((resolve) => setTimeout(resolve, 10000)); // Check every 10s
  }
}

async function getErrorRate(serviceName: string): Promise<number> {
  // Query Prometheus/CloudWatch for error rate
  return 0.01; // Placeholder
}

export async function rollbackBlueGreen(
  listenerArn: string,
  previousTargetGroupArn: string,
  previousColor: 'blue' | 'green'
) {
  console.log(`Rolling back to ${previousColor}`);

  const command = new ModifyListenerCommand({
    ListenerArn: listenerArn,
    DefaultActions: [
      {
        Type: 'forward',
        TargetGroupArn: previousTargetGroupArn,
      },
    ],
  });

  await elbClient.send(command);
}

Canary Deployments with Argo Rollouts

Gradually shift traffic from stable to new version using weighted percentages:

# deploy/canary-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          env:
            - name: VERSION
              value: 'v1.3.0'
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

  strategy:
    canary:
      canaryService: api-canary # Service to test canary before shifting traffic
      stableService: api-stable # Service for stable version
      canaryMetadata:
        labels:
          canary: 'true'
      steps:
        # Step 1: Deploy 1 canary pod, 0% traffic
        - setWeight: 0
        - pause:
            duration: 5m # Wait 5 minutes for metrics to stabilize

        # Step 2: Shift 5% traffic
        - setWeight: 5
        - pause:
            duration: 5m

        # Step 3: Automatic analysis - check error rate and latency
        # (a canary step is either an analysis or a weight change, not both)
        - analysis:
            templates:
              # the error-rate template below also carries the latency-p99 metric
              - templateName: error-rate
        - setWeight: 10
        - pause:
            duration: 5m

        # Step 4: Increase to 25%
        - setWeight: 25
        - pause:
            duration: 5m

        # Step 5: Increase to 50%
        - setWeight: 50
        - pause:
            duration: 5m

        # Step 6: Full traffic
        - setWeight: 100

      # Background analysis: runs for the whole rollout and aborts it
      # automatically if the referenced metrics start failing
      analysis:
        templates:
          - templateName: error-rate
        startingStep: 1 # Begin after the first weight change
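
The staged weights also fix a lower bound on promotion time: the pauses alone sum to 25 minutes. A toy sketch of that arithmetic (the step shape is simplified for illustration, not the Argo API):

```typescript
// deploy/rollout-duration.ts — minimum time for the canary steps above
// to reach 100%, assuming analysis adds no extra delay
type Step = { setWeight?: number; pauseMinutes?: number };

export function minPromotionMinutes(steps: Step[]): number {
  return steps.reduce((total, s) => total + (s.pauseMinutes ?? 0), 0);
}

// Five 5-minute pauses between weight changes, as in the Rollout above
const steps: Step[] = [
  { setWeight: 0 },
  { pauseMinutes: 5 },
  { setWeight: 5 },
  { pauseMinutes: 5 },
  { setWeight: 10 },
  { pauseMinutes: 5 },
  { setWeight: 25 },
  { pauseMinutes: 5 },
  { setWeight: 50 },
  { pauseMinutes: 5 },
  { setWeight: 100 },
];

console.log(minPromotionMinutes(steps)); // 25
```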

AnalysisTemplate for canary validation:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
      value: api
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result <= 0.05 # Max 5% errors
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            scalar(
              sum(rate(http_requests_total{service="{{args.service-name}}", status=~"5.."}[5m]))
              /
              sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
            )

    - name: latency-p99
      interval: 1m
      successCondition: result <= 500 # Max 500ms p99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            scalar(
              histogram_quantile(
                0.99,
                sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le)
              ) * 1000
            )
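
The Prometheus query above is just the 5xx fraction of all requests. The same ratio in plain code, with hypothetical counter values:

```typescript
// Error rate as a fraction, matching the Prometheus ratio above:
// sum(rate(5xx requests)) / sum(rate(all requests))
export function errorRate(errors5xx: number, totalRequests: number): number {
  return totalRequests === 0 ? 0 : errors5xx / totalRequests;
}

console.log(errorRate(3, 100)); // 0.03 — under the 0.05 threshold, so the canary proceeds
```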

Nginx-based canary with weight shifting (the controller requires a matching primary, non-canary Ingress for the same host that routes to the stable service):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/canary: 'true'
    nginx.ingress.kubernetes.io/canary-weight: '5' # Start at 5% traffic
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
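
Promotion with the NGINX approach means bumping the `canary-weight` annotation in steps (5 → 25 → 50 → 100). A hypothetical helper that builds the patch:

```typescript
// deploy/canary-weight.ts — sketch; apply the returned object with
//   kubectl patch ingress api --type merge -p '<json>'
export function canaryWeightPatch(weight: number) {
  if (!Number.isInteger(weight) || weight < 0 || weight > 100) {
    throw new Error('canary-weight must be an integer between 0 and 100');
  }
  return {
    metadata: {
      annotations: {
        'nginx.ingress.kubernetes.io/canary-weight': String(weight),
      },
    },
  };
}

console.log(
  canaryWeightPatch(25).metadata.annotations[
    'nginx.ingress.kubernetes.io/canary-weight'
  ]
); // 25
```

Unlike Argo Rollouts, nothing automates the progression here — a CI job or operator must apply each weight bump and watch the metrics between steps.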

Rolling Updates with maxSurge and maxUnavailable

Update pods gradually in-place:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-rolling
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2 # Allow 2 extra pods during update (total: 12)
      maxUnavailable: 1 # Keep at least 9 available during update
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Graceful shutdown: wait for in-flight requests to complete
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: api:v1.3.0
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Sleep before SIGTERM so the endpoint is removed from
                # load balancers while in-flight requests drain
                command: ['/bin/sh', '-c', 'sleep 15']
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

  # Minimal downtime; still fast rollback available
  progressDeadlineSeconds: 600 # Abort if rollout takes > 10 min
  revisionHistoryLimit: 10
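
With `replicas: 10`, `maxSurge: 2`, and `maxUnavailable: 1`, the pod count stays between 9 ready and 12 total throughout the update. A sketch of that arithmetic (absolute values; Kubernetes also accepts percentages, which round differently):

```typescript
// Bounds implied by a RollingUpdate strategy with absolute surge values
export function rollingBounds(
  replicas: number,
  maxSurge: number,
  maxUnavailable: number
) {
  return {
    maxPods: replicas + maxSurge, // peak pod count during the update
    minAvailable: replicas - maxUnavailable, // floor on ready pods
  };
}

console.log(rollingBounds(10, 2, 1)); // { maxPods: 12, minAvailable: 9 }
```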

Rolling update monitoring:

// deploy/rolling-update-monitor.ts
import { AppsV1Api, KubeConfig } from '@kubernetes/client-node';

const kc = new KubeConfig();
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(AppsV1Api);

export async function monitorRollingUpdate(
  namespace: string,
  deploymentName: string,
  timeoutSeconds: number = 600
) {
  const startTime = Date.now();
  const maxDuration = timeoutSeconds * 1000;

  while (Date.now() - startTime < maxDuration) {
    const deployment = await k8sApi.readNamespacedDeployment(
      deploymentName,
      namespace
    );

    const spec = deployment.body.spec!;
    const status = deployment.body.status!;

    const replicas = spec.replicas || 0;
    const updatedReplicas = status.updatedReplicas || 0;
    const readyReplicas = status.readyReplicas || 0;
    const availableReplicas = status.availableReplicas || 0;
    const generation = deployment.body.metadata?.generation || 0;
    const observedGeneration = status.observedGeneration || 0;

    console.log(
      `[${new Date().toISOString()}] Replicas: ${availableReplicas}/${replicas} available, ${updatedReplicas} updated, ${readyReplicas} ready`
    );

    // Complete only once the controller has observed the latest spec
    // and every replica is updated, ready, and available
    if (
      observedGeneration >= generation &&
      updatedReplicas === replicas &&
      readyReplicas === replicas &&
      availableReplicas === replicas
    ) {
      console.log('Rolling update complete');
      return { success: true, duration: Date.now() - startTime };
    }

    // Check for rollout error
    if (status.conditions) {
      const progressingCondition = status.conditions.find(
        (c) => c.type === 'Progressing'
      );
      if (
        progressingCondition?.status === 'False' &&
        progressingCondition?.reason === 'ProgressDeadlineExceeded'
      ) {
        throw new Error('Rolling update exceeded progress deadline');
      }
    }

    await new Promise((resolve) => setTimeout(resolve, 5000)); // Poll every 5s
  }

  throw new Error(`Rolling update timed out after ${timeoutSeconds}s`);
}

Shadow Traffic Deployment

Mirror requests to new version without affecting users:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - api.example.com
  http:
    # All user traffic goes to the stable subset. Istio also sends a
    # fire-and-forget copy of each request to the canary subset; the
    # mirrored responses are discarded, so users never see canary output.
    - route:
        - destination:
            host: api
            subset: stable
          weight: 100
      mirror:
        host: api
        subset: canary
      mirrorPercentage:
        value: 100.0

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api
spec:
  host: api
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
  subsets:
    - name: stable
      labels:
        version: v1.2.3
    - name: canary
      labels:
        version: v1.3.0

Envoy filter for request mirroring:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: api-mirror
spec:
  workloadSelector:
    labels:
      app: api
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_OUTBOUND
        listener:
          filterChain:
            filter:
              name: 'envoy.filters.network.http_connection_manager'
      patch:
        operation: INSERT_AFTER
        value:
          name: envoy.filters.http.lua
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            inline_code: |
              function envoy_on_response(response_handle)
                -- Log the response status from the canary version
                local status = response_handle:headers():get(":status")
                response_handle:logInfo("Canary response status: " .. tostring(status))
              end

Shadow traffic analyzer:

// deploy/shadow-analyzer.ts
import { Client } from '@elastic/elasticsearch';

const esClient = new Client({
  node: 'http://elasticsearch:9200',
});

export async function analyzeShadowTraffic(
  stableVersion: string,
  canaryVersion: string,
  durationMinutes: number
) {
  const now = Date.now();
  const startTime = now - durationMinutes * 60 * 1000;

  const query = {
    bool: {
      must: [
        { range: { '@timestamp': { gte: startTime, lte: now } } },
        {
          bool: {
            should: [
              { match: { 'version.keyword': stableVersion } },
              { match: { 'version.keyword': canaryVersion } },
            ],
          },
        },
      ],
    },
  };

  // Compare response times
  const latencyComparison = await esClient.search({
    index: 'api-logs',
    body: {
      query,
      aggs: {
        by_version: {
          terms: {
            field: 'version.keyword',
            size: 2,
          },
          aggs: {
            latency_p99: {
              percentiles: {
                field: 'response_time_ms',
                percents: [99],
              },
            },
            error_rate: {
              filter: {
                range: {
                  status_code: { gte: 500 },
                },
              },
            },
          },
        },
      },
    },
  });

  // Compare output sizes
  const outputComparison = await esClient.search({
    index: 'api-logs',
    body: {
      query,
      aggs: {
        by_version: {
          terms: {
            field: 'version.keyword',
          },
          aggs: {
            avg_response_size: {
              avg: {
                field: 'response_size_bytes',
              },
            },
          },
        },
      },
    },
  });

  console.log('Shadow traffic analysis:');
  console.log('Latency:', latencyComparison.aggregations);
  console.log('Output size:', outputComparison.aggregations);

  // A real pipeline should derive a verdict by comparing the per-version
  // aggregations (p99 latency delta, error-rate delta) against explicit
  // thresholds, rather than hardcoding a recommendation
  return {
    latencyComparison,
    outputComparison,
  };
}
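
A hypothetical verdict helper for turning stable-vs-canary stats into a go/no-go — the 20% latency and 1-point error-rate thresholds are illustrative assumptions, not recommendations:

```typescript
type VersionStats = { p99Ms: number; errorRate: number };

// Go/no-go based on relative regression of canary vs stable.
// Thresholds are assumptions; tune them to your SLOs.
export function shadowVerdict(
  stable: VersionStats,
  canary: VersionStats
): string {
  const latencyRegressed = canary.p99Ms > stable.p99Ms * 1.2; // >20% slower
  const errorsRegressed = canary.errorRate > stable.errorRate + 0.01; // +1 point
  return latencyRegressed || errorsRegressed
    ? 'hold: canary regresses vs stable'
    : 'proceed: canary comparable to stable';
}

console.log(
  shadowVerdict(
    { p99Ms: 180, errorRate: 0.004 },
    { p99Ms: 190, errorRate: 0.005 }
  )
); // proceed: canary comparable to stable
```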

Strategy Comparison Matrix

export const DEPLOYMENT_STRATEGIES = {
  blueGreen: {
    rollbackTime: '< 1 minute',
    riskProfile: 'Medium - full switch instantly',
    dbMigrationStrategy: 'Apply before switch, rollback migrations available',
    trafficShift: 'Instant 0% to 100%',
    observability: 'Complete environment available for testing',
    cost: 'Highest - maintain 2x infrastructure',
    bestFor: 'Critical services needing fast rollback',
  },
  canary: {
    rollbackTime: '5-30 minutes',
    riskProfile: 'Low - 5% of users initially',
    dbMigrationStrategy:
      'Deploy backwards-compatible migrations first',
    trafficShift: 'Gradual - 0% → 5% → 10% → 25% → 50% → 100%',
    observability: 'Real user monitoring, metrics-driven progression',
    cost: 'Medium - 2-3 extra pods per service',
    bestFor:
      'Services with unknown impact, continuous deployment',
  },
  rolling: {
    rollbackTime: '10-60 minutes',
    riskProfile: 'Low to Medium - old and new coexist',
    dbMigrationStrategy:
      'Requires backwards compatibility; no rollback',
    trafficShift: 'Pod-by-pod replacement',
    observability: 'Mixed old/new versions complicate debugging',
    cost: 'Low - no extra resources during update',
    bestFor: 'Straightforward updates, cost-sensitive environments',
  },
  shadowTraffic: {
    rollbackTime: 'N/A - read-only mirror',
    riskProfile: 'Lowest - zero impact on users',
    dbMigrationStrategy: 'Shadow uses read-replicas, no mutation',
    trafficShift: '100% to shadow, separate from user traffic',
    observability: 'Compare behavior in parallel before deploy',
    cost: 'High - shadow infrastructure overhead',
    bestFor: 'High-risk changes, logic bugs, output format changes',
  },
};
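
The matrix reduces to a handful of decision rules. A toy selector — the rules below are illustrative, not a prescription:

```typescript
type Strategy = 'blueGreen' | 'canary' | 'rolling' | 'shadowTraffic';

// Illustrative mapping of the trade-offs above onto a single choice
export function pickStrategy(opts: {
  instantRollbackRequired: boolean;
  canAffordExtraInfra: boolean;
  changeIsHighRisk: boolean;
}): Strategy {
  if (opts.changeIsHighRisk && opts.canAffordExtraInfra) return 'shadowTraffic';
  if (opts.instantRollbackRequired && opts.canAffordExtraInfra) return 'blueGreen';
  if (!opts.canAffordExtraInfra) return 'rolling';
  return 'canary';
}

console.log(
  pickStrategy({
    instantRollbackRequired: false,
    canAffordExtraInfra: true,
    changeIsHighRisk: false,
  })
); // canary
```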

Checklist

  • Define success metrics before deployment (latency, error rate)
  • Test rollback procedure in staging
  • Set up canary analysis rules
  • Configure graceful shutdown (terminationGracePeriodSeconds)
  • Implement readiness and liveness probes
  • Monitor resource usage during deployment
  • Have explicit rollback triggers
  • Test database migration compatibility first
  • Set deployment timeout/deadline
  • Document rollback procedure per strategy

Conclusion

Blue/green trades cost for instant rollback. Canary trades time for confidence. Rolling updates trade observability for simplicity. Shadow traffic trades cost for zero-risk validation. Pick the strategy that matches your service criticality, team experience, and operational overhead tolerance. Most teams benefit from starting with rolling updates, graduating to canary for critical services, and shadowing before major logic changes.