Kubernetes Autoscaling: HPA vs VPA vs Cluster Autoscaler

Kubernetes autoscaling is not one feature: it's three independent controllers that scale different things. The Horizontal Pod Autoscaler adds and removes pod replicas, the Vertical Pod Autoscaler tunes each pod's CPU and memory requests, and the Cluster Autoscaler adds and removes nodes. Confuse them and you get the classic failure: an HPA asks for ten more pods, and they sit in Pending forever because nothing grows the cluster underneath them. This guide explains what each autoscaler actually does, how they compose into one elastic system, and how to configure them so your workloads scale up under load and scale back down when traffic subsides.

This is a supporting page for the Devgains Kubernetes architecture guide, which explains the scheduler and reconciliation loop that every autoscaler plugs into. It builds directly on requests and limits: those numbers are the input signal that both the HPA and VPA reason about, so get them right first.

Quick answer: the three Kubernetes autoscalers

Horizontal Pod Autoscaler (HPA) — scales the number of replicas in a Deployment or StatefulSet based on observed metrics (usually CPU or memory utilization against the pod's request). More load → more pods.
Vertical Pod Autoscaler (VPA) — adjusts each pod's CPU and memory requests to match real usage. It right-sizes a single pod rather than adding more of them.
Cluster Autoscaler (CA) — scales the number of nodes in the cluster. When pods can't be scheduled for lack of capacity, it adds nodes; when nodes sit underused, it drains and removes them.

The one-line rule: HPA scales out, VPA scales up, and the Cluster Autoscaler grows the cluster so there's somewhere for the new pods to land.

Why it matters

Autoscaling is how you match capacity to demand without a human in the loop. Done well, it delivers three things at once:

Reliability under load. When traffic spikes, the HPA adds replicas before latency degrades, and the Cluster Autoscaler supplies the nodes to run them.
Cost control. When traffic drops, the same controllers scale back down so you're not paying for idle pods and half-empty nodes overnight.
Right-sized requests. The VPA stops you from guessing CPU and memory forever — it observes actual usage and recommends values that keep the scheduler's bin-packing honest.

Miss any one layer and the system stalls. An HPA with no room to grow just queues Pending pods; a Cluster Autoscaler with no pod-level scaling has nothing to react to, because it adds nodes only when pods go Pending. The layers only work together.

How each autoscaler works

Horizontal Pod Autoscaler

The HPA is a control loop that runs every 15 seconds by default. It reads a metric — most commonly CPU utilization as a percentage of the pod's request — averages it across the current replicas, and computes a desired replica count with a simple ratio:

desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)

If four pods average 80% CPU and your target is 50%, the HPA wants ceil(4 × 80 / 50) = 7 replicas. Because it divides by the request, an HPA is only as good as the requests you set — which is why requests come first. The HPA can also scale on memory, custom application metrics (requests per second, queue depth), or external metrics via an adapter.
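The same ratio drives scale-in. As a worked example with illustrative numbers (not taken from a real cluster), seven pods averaging 30% CPU against a 50% target give:

```latex
\text{desiredReplicas} = \left\lceil 7 \times \frac{30}{50} \right\rceil = \lceil 4.2 \rceil = 5
```

So the HPA would remove two replicas once any scale-down stabilization window has passed. The result is always clamped to the range between minReplicas and maxReplicas, and by default the controller skips scaling entirely when currentMetric / targetMetric is within 10% of 1.0, so 53% against a 50% target (ratio 1.06) changes nothing.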

Vertical Pod Autoscaler

The VPA watches a workload's historical CPU and memory usage and produces a recommendation for requests (and optionally limits). It runs in one of three modes:

Off (recommendation only) — it publishes suggested values you can read and apply by hand. The safest starting point.
Initial — it sets requests when a pod is first created, then leaves it alone.
Auto / Recreate — it evicts and recreates pods to apply new requests. Powerful, but eviction causes a restart, so it's disruptive for stateful or singleton workloads.

The critical caveat: do not run the VPA in Auto mode on the same metric (CPU or memory) that an HPA scales on. They fight — the VPA changes the request, which moves the HPA's utilization denominator, which changes the replica count, and neither settles.
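As a minimal sketch of the recommendation-only mode, a VPA object for the api Deployment used later in this guide might look like the following. It assumes the VPA components and their CRDs are already installed in the cluster (they are not part of a default Kubernetes install), and the object name api-vpa is illustrative.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api              # assumes a Deployment named "api"
  updatePolicy:
    updateMode: "Off"      # publish recommendations only; never evict pods
```

With this applied, kubectl describe vpa api-vpa should list the recommended requests in the object's status, which you can then copy into the Deployment by hand.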

Cluster Autoscaler

The Cluster Autoscaler operates at the infrastructure layer. It watches for pods stuck in Pending because no node has room, and it asks the cloud provider's node group (an AWS Auto Scaling Group, an Azure VMSS, a GKE node pool) to add a node. On the way down, it finds nodes that have been underused for a configured period (10 minutes by default), checks that their pods can be rescheduled elsewhere, cordons and drains them, and removes them. It scales on schedulability, not on CPU graphs: a node counts as underused when the resource requests of its pods are low, regardless of what those pods actually consume.
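The scale-down behavior described above is governed by command-line flags on the cluster-autoscaler container. The fragment below is a sketch of that container's spec, not a complete Deployment: the image tag, the node group name my-node-group, the 2:10 size bounds, and the choice of AWS are all placeholders, and the two scale-down flags are shown at the values the Cluster Autoscaler FAQ documents as defaults.

```yaml
# Fragment of the cluster-autoscaler pod spec (sketch only, not a full manifest).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:<version>   # placeholder tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws                     # assumption: an AWS cluster
      - --nodes=2:10:my-node-group               # min:max:name, placeholder values
      - --scale-down-unneeded-time=10m           # how long a node must be unneeded before removal
      - --scale-down-utilization-threshold=0.5   # requests/allocatable ratio below which a node is a candidate
```

Managed offerings often expose these settings through their own node pool configuration instead, so treat the flags as a way to read the behavior rather than something every cluster sets directly.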

The three axes, side by side

Attribute	HPA	VPA	Cluster Autoscaler
Scales	Replica count (pods)	CPU/memory requests per pod	Node count
Trigger	Metric vs target (CPU, custom)	Historical usage	Pending pods / idle nodes
Direction	Out and in	Up and down	Cluster grows / shrinks
Disruptive?	No (adds/removes pods)	Yes in Auto (evicts pods)	Yes (drains nodes)
Good for	Bursty, stateless traffic	Right-sizing requests	Elastic infrastructure cost
Conflicts with	VPA on the same metric	HPA on the same metric	— (complements both)

Step-by-step: add an HPA to a Deployment

Start from a Deployment that already sets a CPU request — without it the HPA has no denominator and does nothing. (See requests vs limits for why the request is the reference point.)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api:1a2b3c4
          resources:
            requests:
              cpu: "250m"      # HPA measures utilization against THIS
              memory: "256Mi"
            limits:
              memory: "512Mi"

Now attach an HPA that keeps average CPU near 60% of the request, between 3 and 20 replicas:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:                     # tune reaction speed to avoid thrashing
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5m of low load before scaling in

Apply it and watch it react. The HPA needs metrics-server installed to read pod CPU:

kubectl apply -f api-hpa.yaml

# Watch the current vs target utilization and the replica count move:
kubectl get hpa api --watch
# Illustrative output; the exact column layout varies by kubectl version:
# NAME  REFERENCE       TARGETS   MINPODS  MAXPODS  REPLICAS
# api   Deployment/api  82%/60%   3        20       3
# api   Deployment/api  82%/60%   3        20       5

# If TARGETS shows <unknown>, metrics-server isn't returning metrics:
kubectl top pod -l app=api    # must return numbers, or the HPA is blind

A TARGETS column showing <unknown>/60% is the number-one HPA support ticket: it almost always means metrics-server is missing or the container has no CPU request. Fix the input signal and the loop starts moving.

How the three combine

A well-tuned cluster runs all three, each on its own axis:

VPA (recommendation mode) tells you what CPU and memory each pod actually needs, so your requests are honest.
HPA scales replicas out and in on live load, using those honest requests as its denominator.
Cluster Autoscaler adds nodes when the HPA's new pods can't be scheduled, and reclaims them when load falls.

The safe pattern is VPA for requests, HPA for replicas, CA for nodes — never HPA and VPA fighting over the same metric. Many teams run the VPA in recommendation mode only, apply its numbers during deploys, and let the HPA and Cluster Autoscaler handle the live elasticity.
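One way to split the signals, sketched here under assumptions rather than offered as a recommended configuration, is to let the VPA manage memory only while the HPA keeps scaling on CPU. It is an alternative to running the VPA in recommendation mode, not something to run alongside a second VPA on the same Deployment. The object name and the memory bounds are placeholders.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-memory             # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Auto"         # applies changes by evicting pods, so expect restarts
  resourcePolicy:
    containerPolicies:
      - containerName: "*"     # apply to every container in the pod
        controlledResources: ["memory"]   # leave CPU alone; the HPA scales on CPU
        minAllowed:
          memory: "128Mi"      # placeholder lower bound
        maxAllowed:
          memory: "1Gi"        # placeholder upper bound
```

Because applying a new memory request still means evicting the pod, this suits stateless replicas behind a disruption budget better than singletons.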

Best practices

Set requests before you set an HPA. The HPA computes utilization as a fraction of the CPU request. No request, no denominator, no scaling. This is the single most common cause of an HPA that "does nothing."
Install metrics-server. The default resource-based HPA reads pod metrics from it. Without it, kubectl top and the HPA both go blind.
Use a stabilization window for scale-down. A scaleDown.stabilizationWindowSeconds of a few minutes stops the HPA from flapping replicas up and down on noisy metrics.
Never point HPA and VPA at the same metric in Auto mode. They form a feedback loop that never converges. Run the VPA in recommendation mode, or scale the HPA on a different signal (e.g. a custom RPS metric) than the one the VPA manages.
Give the Cluster Autoscaler room and PodDisruptionBudgets. Set each node group's maximum size high enough to hold the HPA's maxReplicas, and set a PodDisruptionBudget so draining a node during scale-down doesn't take your service below its minimum available replicas.
Pair autoscaling with honest health checks. New replicas only receive traffic once their readiness probe passes, so a slow-starting pod won't absorb load until it's actually ready — critical during a scale-up spike.
Scale on the metric that reflects user pain. CPU is a fine default, but a queue-worker should scale on queue depth and an API on requests-per-second or p95 latency via custom metrics.
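Scaling on a per-pod application metric can be sketched with the Pods metric type in autoscaling/v2. The metric name http_requests_per_second and the target of 100 are hypothetical, and the example assumes a custom metrics adapter is installed and exposing that metric for the api pods; metrics-server alone does not serve it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                    # same HPA as before, with a different metric
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical metric served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "100"              # aim for roughly 100 requests/s per pod
```

This replaces the CPU metric on the existing HPA rather than adding a second HPA: two HPAs targeting the same Deployment would fight over the replica count. Listing both metrics in one HPA is also possible, in which case the controller uses whichever metric asks for more replicas.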

Common mistakes

HPA with no CPU request. Utilization is undefined, so the HPA cannot compute a desired replica count and leaves the replicas where they are. Always set requests first.
HPA and VPA on the same resource. The two controllers chase each other — the VPA raises the request, the HPA's utilization drops, replicas fall, load per pod climbs, and round it goes.
No Cluster Autoscaler behind the HPA. The HPA happily requests 20 replicas, but half of them stay Pending because no node has room. Horizontal scaling needs node scaling underneath it.
maxReplicas too low. The HPA is capped below real demand and silently under-provisions at peak. Watch for an HPA pinned at its ceiling.
Scaling on lagging metrics. CPU often rises only after latency has already degraded. For latency-sensitive services, scale on a leading signal (RPS, concurrency, queue depth) instead.
Aggressive scale-down with no PDB. The Cluster Autoscaler drains a node and briefly drops you below your replica floor because nothing declared a disruption budget.
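A disruption budget for the api Deployment could be sketched as follows. The name api-pdb and the floor of two available pods are assumptions chosen to sit below the HPA's minReplicas of 3, so that a node drain can still evict one pod at a time.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb          # illustrative name
spec:
  minAvailable: 2        # assumption: the service tolerates dropping to 2 ready pods
  selector:
    matchLabels:
      app: api           # matches the pod labels on the api Deployment
```

Setting minAvailable equal to the replica count would block every voluntary eviction, so the Cluster Autoscaler could never drain a node running these pods.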

Takeaways

Three autoscalers, three axes. HPA scales replicas, VPA scales requests, Cluster Autoscaler scales nodes. They are complementary, not alternatives.
Requests are the linchpin. The HPA measures against the CPU request and the VPA sets it — wrong requests break both.
Never fight HPA against VPA on the same metric; run the VPA in recommendation mode or split the signals.
HPA needs the Cluster Autoscaler or its extra replicas have nowhere to schedule.

Keep building your mental model of the Kubernetes cluster with the related DevOps guides: how the control plane schedules pods, how workload controllers shape what gets scaled, and how Services spread traffic across the replicas the HPA adds.

FAQ

What is autoscaling in Kubernetes? Kubernetes autoscaling is the automatic adjustment of capacity to match demand. It has three independent controllers: the Horizontal Pod Autoscaler (more replicas), the Vertical Pod Autoscaler (bigger CPU/memory requests per pod), and the Cluster Autoscaler (more nodes). Each scales a different dimension.

What is the difference between HPA and VPA? The HPA changes the number of pod replicas based on live metrics like CPU utilization, so it scales out and in. The VPA changes the CPU and memory requests of each pod based on historical usage, so it scales up and down. HPA handles bursty load; VPA right-sizes a workload. Don't run both on the same metric in automatic mode.

Do I need the Cluster Autoscaler if I use the HPA? Usually yes. The HPA can request more replicas than the current nodes have room for, leaving pods Pending. The Cluster Autoscaler adds nodes so those pods can schedule, and removes idle nodes when load falls. HPA scales pods; the Cluster Autoscaler scales the infrastructure under them.

Can HPA and VPA run together? Yes, but not on the same resource metric in Auto mode, or they form a feedback loop that never settles. The common safe pattern is to run the VPA in recommendation mode to size requests and let the HPA scale replicas, or to scale the HPA on a custom metric the VPA doesn't touch.

Why is my HPA not scaling? The usual causes are: no CPU request on the container (so utilization is undefined), metrics-server not installed (so the HPA can't read metrics, showing <unknown>/60%), or maxReplicas set too low. Check kubectl get hpa and kubectl top pod first.

Conclusion

Kubernetes autoscaling is elastic only when all three layers cooperate. The HPA reacts to live load by adding and removing replicas; the VPA keeps each pod's requests honest so that reaction is accurate; and the Cluster Autoscaler grows and shrinks the cluster so the replicas always have somewhere to run. Set requests first, keep the HPA and VPA off the same metric, put a Cluster Autoscaler behind every HPA, and protect scale-down with disruption budgets. From here, revisit the architecture guide to see how the scheduler turns autoscaling decisions into placed pods, and requests vs limits to make sure the numbers your autoscalers depend on are right.
