Kubernetes Resource Management — Requests, Limits, and Why Your Pods Keep Getting OOMKilled

Introduction

Resource management is one of the most misunderstood aspects of Kubernetes. You set requests and limits, assume your containers won't consume more than allocated, and then wake up at 2 AM to pages about pods being OOMKilled on nodes that appeared to have free memory. Understanding the difference between requests and limits, CPU throttling versus memory OOM, and how Kubernetes assigns Quality of Service classes is fundamental to running reliable workloads at scale.

Requests vs Limits: The Core Distinction

Requests tell the scheduler the minimum resources a pod needs; limits act as a hard ceiling. The scheduler uses requests to bin-pack pods onto nodes, while the kubelet translates limits into kernel cgroup settings that enforce the boundary at runtime.

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: my-app:v1.2.3
    resources:
      requests:
        cpu: "500m"          # Scheduler reserves this on the node
        memory: "256Mi"      # Scheduler reserves this on the node
      limits:
        cpu: "1000m"         # Kernel CFS quota throttles beyond this
        memory: "512Mi"      # Exceeding this triggers the OOM killer
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - my-app
          topologyKey: kubernetes.io/hostname

If your pod has a 512Mi memory limit but tries to allocate 513Mi, the kernel's OOM killer sends SIGKILL. If it has a 1000m CPU limit but wants more, the kernel throttles it: your app slows down but doesn't crash.

CPU Throttling vs Memory OOM

CPU is a compressible resource; memory is not. When a pod exceeds its CPU limit, the kernel's CFS scheduler throttles the process. Latency increases, throughput drops, but the pod stays running.

---
apiVersion: v1
kind: Pod
metadata:
  name: throttled-cpu
spec:
  containers:
  - name: worker
    image: busybox:1.35
    resources:
      requests:
        cpu: "100m"
      limits:
        cpu: "200m"
    command:
    - /bin/sh
    - -c
    - |
      while true; do
        dd if=/dev/zero of=/dev/null bs=1M count=1024
      done
  restartPolicy: Never

This pod will be CPU-throttled but survive. Memory violations, by contrast, are fatal: the kernel's OOM killer selects a process inside the offending container's cgroup and terminates it.

---
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
  - name: memory-hog
    image: python:3.11-slim
    resources:
      requests:
        memory: "100Mi"
      limits:
        memory: "100Mi"
    command:
    - python3
    - -c
    - |
      # Appending 1MiB strings forever blows past the 100Mi limit -> OOMKilled
      data = []
      while True:
          data.append('x' * (1024 * 1024))
  restartPolicy: Never

LimitRange and Namespace Defaults

Rather than rely on developers to always set requests and limits, use LimitRange to enforce defaults per namespace.

apiVersion: v1
kind: LimitRange
metadata:
  name: app-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "4"
      memory: "4Gi"
    min:
      cpu: "10m"
      memory: "32Mi"
  - type: Pod
    max:
      cpu: "8"
      memory: "8Gi"
    min:
      cpu: "20m"
      memory: "64Mi"

Containers created in the production namespace without explicit resources inherit these defaults. The max field prevents over-allocation.
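
To sanity-check the defaults, create a pod with no resources block and read back what the admission controller injected (the pod and image names here are just for illustration):

kubectl run defaults-check --image=nginx:1.25 -n production --restart=Never
kubectl get pod defaults-check -n production \
  -o jsonpath='{.spec.containers[0].resources}'
# => {"limits":{"cpu":"500m","memory":"512Mi"},"requests":{"cpu":"100m","memory":"128Mi"}}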

Vertical Pod Autoscaler (VPA) Recommendation Mode

VPA analyzes actual resource usage and recommends appropriate requests and limits. In recommendation mode, it doesn't auto-update pods—you review and apply recommendations.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation mode; valid values: Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: "50m"
        memory: "64Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
      controlledResources:
      - cpu
      - memory
      controlledValues: RequestsAndLimits

VPA examines metrics over days and suggests values. Extract recommendations with:

kubectl describe vpa app-vpa -n production
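
For scripting, the same numbers live in the object's status; a jsonpath query pulls the recommended target directly (the output shown is illustrative):

kubectl get vpa app-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
# => {"cpu":"412m","memory":"351Mi"}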

HPA with Custom Metrics (KEDA)

Horizontal Pod Autoscaler (HPA) scales replicas based on metrics. For simple CPU/memory, use Kubernetes HPA. For SQS queue depth, Kafka lag, or custom metrics, use KEDA.
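
For the CPU case, a standard autoscaling/v2 HPA is all you need. A minimal sketch, reusing the api-server Deployment from the VPA example above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75

For queue-driven scaling, a KEDA ScaledObject does the equivalent against SQS depth: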

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-scaler
  namespace: processing
spec:
  scaleTargetRef:
    name: job-processor
    kind: Deployment
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: "https://sqs.us-east-1.amazonaws.com/123456789/jobs"
      queueLength: "5"
      awsRegion: "us-east-1"
      identityOwner: "operator"
  fallback:
    failureThreshold: 3
    replicas: 10

This scales the Deployment between 1 and 50 replicas based on SQS queue depth, targeting roughly 5 messages per replica. If KEDA fails to read the queue for 3 consecutive polls, the fallback pins the workload at 10 replicas.

Quality of Service Classes

Kubernetes assigns QoS classes based on requests and limits. Understanding these classes is critical for reliability.

Guaranteed: Every container in the pod has CPU and memory requests equal to its limits. Under node-pressure eviction, these pods are killed last.

spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"

Burstable: At least one container has a request or limit set, but the pod doesn't meet the Guaranteed criteria. Evicted after BestEffort pods but before Guaranteed ones.

spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "100m"
        memory: "128Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"

BestEffort: No requests or limits. Killed first under memory pressure.
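
You can read the class Kubernetes assigned straight from pod status rather than inferring it from the spec:

kubectl get pod resource-demo -o jsonpath='{.status.qosClass}'
# => Burstable (requests != limits in the first example above)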

ResourceQuota for Namespace Control

ResourceQuota enforces aggregate resource consumption per namespace, preventing one team from monopolizing the cluster.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "100"
    services: "10"
    persistentvolumeclaims: "5"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["high", "medium"]

Right-Sizing Workflow

  1. Baseline: Set conservative requests based on app documentation (e.g., JVM needs 512Mi minimum).
  2. Monitor: Deploy with LimitRange defaults. Collect metrics over 1-2 weeks.
  3. Analyze: Use VPA recommendations or Prometheus queries to identify actual usage patterns (example queries after this list).
  4. Adjust: Update requests to P95 usage; set limits to P99 + 20% headroom.
  5. Test: Verify latency and throughput with load tests under the new limits.
  6. Review: Quarterly re-evaluate as traffic patterns change.
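
A pair of PromQL queries against cAdvisor metrics yields the percentiles step 4 calls for; the namespace and container label values are illustrative:

# P95 memory working set over a two-week window
quantile_over_time(0.95,
  container_memory_working_set_bytes{namespace="production", container="api"}[14d])

# P99 CPU usage (cores), computed from 5m rate samples over two weeks
quantile_over_time(0.99,
  rate(container_cpu_usage_seconds_total{namespace="production", container="api"}[5m])[14d:5m])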

Checklist

  • All production pods have explicit CPU and memory requests
  • CPU limit ≥ 2x request; memory limit ≥ 1.25x request (at minimum)
  • LimitRange defined for all critical namespaces
  • ResourceQuota prevents single team/app from consuming entire cluster
  • VPA deployed in recommendation mode; reviewed monthly
  • HPA configured for stateless workloads with appropriate target utilization (70-80%)
  • QoS class verified for critical pods (aim for Guaranteed)
  • Monitoring alerts on CPU throttling (container_cpu_cfs_throttled_seconds_total); see the example rule after this checklist
  • Runbooks document how to respond to OOMKills and evictions
  • Quarterly right-sizing review based on actual usage trends
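
A minimal Prometheus alerting rule for the throttling item, using the throttled-periods ratio rather than raw throttled seconds (the 25% threshold is an assumption; tune it to your latency SLOs):

groups:
- name: resource-management
  rules:
  - alert: HighCPUThrottling
    expr: |
      sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total[5m]))
        /
      sum by (namespace, pod) (rate(container_cpu_cfs_periods_total[5m])) > 0.25
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is throttled in more than 25% of CFS periods"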

Conclusion

Kubernetes resource management is not a one-time configuration task—it's an ongoing practice of observation, measurement, and adjustment. Start with conservative requests, monitor actual usage, and use VPA to guide refinements. Combine QoS classes, LimitRange, and ResourceQuota to prevent noisy neighbors and cascading failures. In production, this discipline transforms a flaky platform into one where pods behave predictably and your on-call rotations aren't eaten alive by resource-related incidents.