Kubernetes Autoscaling: HPA vs VPA vs Cluster Autoscaler

Kubernetes autoscaling is not one feature — it's three independent controllers that scale
different things. The Horizontal Pod Autoscaler adds and removes pod replicas, the Vertical Pod
Autoscaler tunes each pod's CPU and memory requests, and the Cluster Autoscaler adds and removes
nodes. Confuse them and you get the classic failure: an HPA that wants ten more pods sitting in
Pending forever because nothing grows the cluster underneath it. This guide explains what each
autoscaler actually does, how they compose into one elastic system, and how to configure them so
your workloads scale up under load and scale down when the traffic goes home.
This is a supporting page for the Devgains Kubernetes architecture guide, which explains the scheduler and reconciliation loop that every autoscaler plugs into. It builds directly on requests and limits: those numbers are the input signal that both the HPA and VPA reason about, so get them right first.
Quick answer: the three Kubernetes autoscalers
- Horizontal Pod Autoscaler (HPA) — scales the number of replicas in a Deployment or StatefulSet based on observed metrics (usually CPU or memory utilization against the pod's request). More load → more pods.
- Vertical Pod Autoscaler (VPA) — adjusts each pod's CPU and memory requests to match real usage. It right-sizes a single pod rather than adding more of them.
- Cluster Autoscaler (CA) — scales the number of nodes in the cluster. When pods can't be scheduled for lack of capacity, it adds nodes; when nodes sit underused, it drains and removes them.
The one-line rule: HPA scales out, VPA scales up, and the Cluster Autoscaler grows the cluster so there's somewhere for the new pods to land.
Why it matters
Autoscaling is how you match capacity to demand without a human in the loop. Done well, it delivers three things at once:
- Reliability under load. When traffic spikes, the HPA adds replicas before latency degrades, and the Cluster Autoscaler supplies the nodes to run them.
- Cost control. When traffic drops, the same controllers scale back down so you're not paying for idle pods and half-empty nodes overnight.
- Right-sized requests. The VPA stops you from guessing CPU and memory forever — it observes actual usage and recommends values that keep the scheduler's bin-packing honest.
Miss any one layer and the system stalls. An HPA with no room to grow just queues Pending pods; a
Cluster Autoscaler with no pod-level scaling adds nodes that never fill. The layers only work
together.
How each autoscaler works
Horizontal Pod Autoscaler
The HPA is a control loop that runs every 15 seconds by default. It reads a metric — most commonly CPU utilization as a percentage of the pod's request — averages it across the current replicas, and computes a desired replica count with a simple ratio:
```
desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric)
```

If four pods average 80% CPU and your target is 50%, the HPA wants ceil(4 × 80 / 50) = 7 replicas.
Because it divides by the request, an HPA is only as good as the requests you set — which is why
requests come first. The HPA can also scale on memory, custom application metrics (requests per
second, queue depth), or external metrics via an adapter.
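As a sketch of the custom-metric case, the manifest below scales the api Deployment on a per-pod requests-per-second metric instead of CPU. The metric name http_requests_per_second and the target of 100 are assumptions for illustration. The metric only exists if a custom metrics adapter (for example prometheus-adapter) is installed and configured to expose it.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-rps                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods                 # per-pod custom metric, averaged across replicas
      pods:
        metric:
          name: http_requests_per_second   # assumed metric; must be served by your adapter
        target:
          type: AverageValue
          averageValue: "100"    # illustrative: aim for ~100 requests/s per pod
```

With an AverageValue target the same ratio applies, but against an absolute per-pod value rather than a percentage of the request, so this HPA does not depend on the CPU request at all.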
Vertical Pod Autoscaler
The VPA watches a workload's historical CPU and memory usage and produces a recommendation for
requests (and optionally limits). It runs in one of three modes:
- Off (recommendation only) — it publishes suggested values you can read and apply by hand. The safest starting point.
- Initial — it sets requests when a pod is first created, then leaves it alone.
- Auto / Recreate — it evicts and recreates pods to apply new requests. Powerful, but eviction causes a restart, so it's disruptive for stateful or singleton workloads.
The critical caveat: do not run the VPA in Auto mode on the same metric (CPU or memory) that an HPA scales on. They fight — the VPA changes the request, which moves the HPA's utilization denominator, which changes the replica count, and neither settles.
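A minimal sketch of the recommendation-only mode might look like the manifest below. It assumes the VPA components and CRDs are already installed, since they are not part of core Kubernetes, and the minAllowed and maxAllowed bounds are illustrative values rather than recommendations.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"            # publish recommendations only; never evict pods
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:              # illustrative floor for recommendations
          cpu: 100m
          memory: 128Mi
        maxAllowed:              # illustrative ceiling for recommendations
          cpu: "1"
          memory: 1Gi
```

Once the recommender has gathered some usage history, the suggested values appear in the object's status (for example via kubectl describe vpa api), and you can copy them into the Deployment's requests by hand.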
Cluster Autoscaler
The Cluster Autoscaler operates at the infrastructure layer. It watches for pods stuck in Pending
because no node has room, and it asks the cloud provider's node group (an AWS Auto Scaling Group, an
Azure VMSS, a GKE node pool) to add a node. On the way down, it finds nodes that have been underused
for a configured period, checks that their pods can be rescheduled elsewhere, cordons and drains
them, and removes them. It scales on schedulability, not on CPU graphs — a subtle but important
distinction.
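For a self-managed Cluster Autoscaler, the "configured period" and the node-count bounds are typically set with command-line flags. The fragment below is a sketch for the AWS provider only; the node group name api-workers-asg and the bounds are hypothetical, and managed offerings such as GKE or AKS expose equivalent settings through their own configuration instead.

```yaml
# Fragment of the cluster-autoscaler container spec (not a complete Deployment).
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=2:10:api-workers-asg            # min:max:node-group-name (hypothetical group)
  - --scale-down-unneeded-time=10m          # how long a node must be underused before removal
  - --scale-down-utilization-threshold=0.5  # below 50% of requested capacity counts as underused
  - --scale-down-delay-after-add=10m        # pause scale-down after a recent scale-up
```

Note that utilization here is computed from pod requests on the node, not from live CPU graphs, which is consistent with scaling on schedulability.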
The three axes, side by side
| | HPA | VPA | Cluster Autoscaler |
|---|---|---|---|
| Scales | Replica count (pods) | CPU/memory requests per pod | Node count |
| Trigger | Metric vs target (CPU, custom) | Historical usage | Pending pods / idle nodes |
| Direction | Out and in | Up and down | Cluster grows / shrinks |
| Disruptive? | No (adds/removes pods) | Yes in Auto (evicts pods) | Yes (drains nodes) |
| Good for | Bursty, stateless traffic | Right-sizing requests | Elastic infrastructure cost |
| Conflicts with | VPA on the same metric | HPA on the same metric | — (complements both) |
Step-by-step: add an HPA to a Deployment
Start from a Deployment that already sets a CPU request — without it the HPA has no denominator and does nothing. (See requests vs limits for why the request is the reference point.)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/api:1a2b3c4
          resources:
            requests:
              cpu: "250m"        # HPA measures utilization against THIS
              memory: "256Mi"
            limits:
              memory: "512Mi"
```

Now attach an HPA that keeps average CPU near 60% of the request, between 3 and 20 replicas:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:                              # tune reaction speed to avoid thrashing
    scaleDown:
      stabilizationWindowSeconds: 300    # wait 5m of low load before scaling in
```

Apply it and watch it react. The HPA needs metrics-server installed to read pod CPU:
```shell
kubectl apply -f api-hpa.yaml
# Watch the current vs target utilization and the replica count move:
kubectl get hpa api --watch
# NAME   REFERENCE        TARGETS   MINPODS   MAXPODS   REPLICAS
# api    Deployment/api   82%/60%   3         20        3 -> 5
# If TARGETS shows <unknown>, metrics-server isn't returning metrics:
kubectl top pod -l app=api   # must return numbers, or the HPA is blind
```

A TARGETS column showing <unknown>/60% is the number-one HPA support ticket: it almost always
means metrics-server is missing or the container has no CPU request. Fix the input signal and the
loop starts moving.
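If the loop moves but reacts too abruptly, the behavior section of the HPA can also rate-limit each direction. The fragment below is one possible tuning, not a recommended default: every number in it is an illustrative assumption that you would adjust to your own traffic.

```yaml
# Fragment: goes under spec: of the HorizontalPodAutoscaler.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to spikes immediately
    policies:
      - type: Percent
        value: 100                     # at most double the replica count...
        periodSeconds: 60              # ...per 60-second window
  scaleDown:
    stabilizationWindowSeconds: 300    # require 5m of low load before scaling in
    policies:
      - type: Pods
        value: 2                       # remove at most 2 pods...
        periodSeconds: 60              # ...per 60-second window
```

The asymmetry is deliberate in this sketch: scaling out quickly protects latency, while scaling in slowly avoids dropping capacity on a brief lull.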
How the three combine
A well-tuned cluster runs all three, each on its own axis:
- VPA (recommendation mode) tells you what CPU and memory each pod actually needs, so your requests are honest.
- HPA scales replicas out and in on live load, using those honest requests as its denominator.
- Cluster Autoscaler adds nodes when the HPA's new pods can't be scheduled, and reclaims them when load falls.
The safe pattern is VPA for requests, HPA for replicas, CA for nodes — never HPA and VPA fighting over the same metric. Many teams run the VPA in recommendation mode only, apply its numbers during deploys, and let the HPA and Cluster Autoscaler handle the live elasticity.
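For teams that do want the VPA to act automatically alongside a CPU-based HPA, one way to split the signals is to restrict the VPA to memory only. The manifest below is a sketch under that assumption; it relies on the VPA CRDs being installed, and because it evicts pods to apply changes it is best tried on a workload that tolerates restarts.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-memory               # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Recreate"       # evicts pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: ["memory"]   # leave the CPU request alone for the HPA
```

Because the CPU request is never touched, the HPA's utilization denominator stays fixed and the two controllers have nothing to fight over.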
Best practices
- Set requests before you set an HPA. The HPA computes utilization as a fraction of the CPU request. No request, no denominator, no scaling. This is the single most common cause of an HPA that "does nothing."
- Install metrics-server. The default resource-based HPA reads pod metrics from it. Without it, kubectl top and the HPA both go blind.
- Use a stabilization window for scale-down. A scaleDown.stabilizationWindowSeconds of a few minutes stops the HPA from flapping replicas up and down on noisy metrics.
- Never point HPA and VPA at the same metric in Auto mode. They form a feedback loop that never converges. Run the VPA in recommendation mode, or scale the HPA on a different signal (e.g. a custom RPS metric) than the one the VPA manages.
- Give the Cluster Autoscaler room and PodDisruptionBudgets. Set a PodDisruptionBudget so draining a node during scale-down doesn't take your service below its minimum available replicas.
- Pair autoscaling with honest health checks. New replicas only receive traffic once their readiness probe passes, so a slow-starting pod won't absorb load until it's actually ready — critical during a scale-up spike.
- Scale on the metric that reflects user pain. CPU is a fine default, but a queue-worker should scale on queue depth and an API on requests-per-second or p95 latency via custom metrics.
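The disruption-budget practice above can be sketched as a PodDisruptionBudget for the api Deployment. The minAvailable value of 2 is an assumption chosen to sit just below the HPA floor of 3 replicas, so a node drain can evict one pod at a time without being blocked entirely.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
spec:
  minAvailable: 2                # illustrative: one below the HPA's minReplicas of 3
  selector:
    matchLabels:
      app: api                   # must match the Deployment's pod labels
```

Setting minAvailable equal to the current replica count would prevent any voluntary eviction, which in turn would stop the Cluster Autoscaler from ever draining the node.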
Common mistakes
- HPA with no CPU request. Utilization is undefined, so the HPA sits at minReplicas and never reacts. Always set requests first.
- HPA and VPA on the same resource. The two controllers chase each other: the VPA raises the request, the HPA's utilization drops, replicas fall, load per pod climbs, and round it goes.
- No Cluster Autoscaler behind the HPA. The HPA happily requests 20 replicas, but half of them stay Pending because no node has room. Horizontal scaling needs node scaling underneath it.
- maxReplicas too low. The HPA is capped below real demand and silently under-provisions at peak. Watch for an HPA pinned at its ceiling.
- Scaling on lagging metrics. CPU rises after latency already degraded. For latency-sensitive services, scale on a leading signal (RPS, concurrency, queue depth) instead.
- Aggressive scale-down with no PDB. The Cluster Autoscaler drains a node and briefly drops you below your replica floor because nothing declared a disruption budget.
Takeaways
- Three autoscalers, three axes. HPA scales replicas, VPA scales requests, Cluster Autoscaler scales nodes. They are complementary, not alternatives.
- Requests are the linchpin. The HPA measures against the CPU request and the VPA sets it — wrong requests break both.
- Never fight HPA against VPA on the same metric; run the VPA in recommendation mode or split the signals.
- HPA needs the Cluster Autoscaler or its extra replicas have nowhere to schedule.
Keep building your mental model with the Kubernetes cluster and the related DevOps guides: how the control plane schedules pods, how workload controllers shape what gets scaled, and how Services spread traffic across the replicas the HPA adds.
FAQ
What is autoscaling in Kubernetes? Kubernetes autoscaling is the automatic adjustment of capacity to match demand. It has three independent controllers: the Horizontal Pod Autoscaler (more replicas), the Vertical Pod Autoscaler (bigger CPU/memory requests per pod), and the Cluster Autoscaler (more nodes). Each scales a different dimension.
What is the difference between HPA and VPA? The HPA changes the number of pod replicas based on live metrics like CPU utilization, so it scales out and in. The VPA changes the CPU and memory requests of each pod based on historical usage, so it scales up and down. HPA handles bursty load; VPA right-sizes a workload. Don't run both on the same metric in automatic mode.
Do I need the Cluster Autoscaler if I use the HPA? Usually yes. The HPA can request more replicas
than the current nodes have room for, leaving pods Pending. The Cluster Autoscaler adds nodes so
those pods can schedule, and removes idle nodes when load falls. HPA scales pods; the Cluster
Autoscaler scales the infrastructure under them.
Can HPA and VPA run together? Yes, but not on the same resource metric in Auto mode, or they form a feedback loop that never settles. The common safe pattern is to run the VPA in recommendation mode to size requests and let the HPA scale replicas, or to scale the HPA on a custom metric the VPA doesn't touch.
Why is my HPA not scaling? The usual causes are: no CPU request on the container (so utilization
is undefined), metrics-server not installed (so the HPA can't read metrics, showing
<unknown>/60%), or maxReplicas set too low. Check kubectl get hpa and kubectl top pod first.
Conclusion
Kubernetes autoscaling is elastic only when all three layers cooperate. The HPA reacts to live load by adding and removing replicas; the VPA keeps each pod's requests honest so that reaction is accurate; and the Cluster Autoscaler grows and shrinks the cluster so the replicas always have somewhere to run. Set requests first, keep the HPA and VPA off the same metric, put a Cluster Autoscaler behind every HPA, and protect scale-down with disruption budgets. From here, revisit the architecture guide to see how the scheduler turns autoscaling decisions into placed pods, and requests vs limits to make sure the numbers your autoscalers depend on are right.



