Devgains

Liveness vs Readiness Probes: The Difference That Takes Down Deploys


You ship a deploy, and within a minute pods are restarting in a loop, traffic is being dropped, and the dashboard looks like a heart attack. The application code is fine. The problem is a single misunderstood field in your pod spec: a probe that is doing the opposite of what you assumed.

Liveness and readiness probes look almost identical in YAML, which is exactly why teams get them wrong. They answer different questions, and confusing the two is one of the most common ways a healthy application gets taken offline by its own orchestration. Let us be precise about what each one does and when it fires.

Two questions, two probes

A liveness probe answers: is this container so broken that it should be killed and restarted? When a liveness probe fails failureThreshold times in a row, the kubelet kills the container and restarts it according to the pod's restartPolicy. This is for deadlocks and unrecoverable states, situations where the only fix is a fresh process.

A readiness probe answers: should this container receive traffic right now? When a readiness probe fails, the pod is removed from the Service's endpoints, so no new requests are routed to it, but the container is not restarted. This is for temporary unavailability: warming a cache, waiting on a dependency, or draining during shutdown.

The distinction is the whole game. Liveness failures restart. Readiness failures remove from rotation. Mix them up and you either kill pods that just needed a moment, or you keep sending traffic to pods that cannot serve it. The Kubernetes documentation on configuring liveness, readiness, and startup probes is the canonical reference and worth reading in full.

The classic outage: liveness checking a dependency

Here is the mistake that takes down deploys. Someone points the liveness probe at an endpoint that checks the database connection:

livenessProbe:
  httpGet:
    path: /health/db   # checks the database!
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

Now imagine the database has a brief hiccup: a failover, a slow query storm, anything that makes that endpoint fail for thirty seconds. With the default failureThreshold of 3, three failed checks ten seconds apart are enough, so the kubelet restarts the container in every pod at about the same time. They all reconnect to the same struggling database at once, and the thundering herd makes the outage worse. A transient dependency blip just became a full restart cascade.

Liveness probes should test only the process itself, never external dependencies. If your liveness check can fail because of something outside the container, a downstream blip will turn into a cluster-wide restart storm. Put the dependency check in the readiness probe instead, where failure removes the pod from traffic without killing it.

The correct split puts the dependency check where it belongs:

livenessProbe:
  httpGet:
    path: /healthz    # process-only: am I alive?
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz     # can I serve? checks DB, caches, etc.
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

When the database hiccups now, /readyz fails, pods leave the Service endpoints, and traffic pauses. The processes stay alive, keep their warm connections, and rejoin rotation on the next successful readiness check, within a few seconds of the dependency recovering. No restarts, no herd.

Slow starts need a startup probe

The other deploy-killer is an application that takes a while to boot. A JVM service loading a large model, or a framework doing heavy initialization, might need 60 seconds before it can answer anything. If your liveness probe starts checking at 5 seconds with a 10-second period and 3 failures allowed, the container gets restarted somewhere around 25 to 35 seconds in, before it ever finished starting. It restarts, fails to boot again, and you have a crash loop that looks like a bug but is pure misconfiguration.
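To make the arithmetic concrete, here is a sketch of the problematic configuration with an approximate timeline in the comments. The 60-second boot time is the example figure from above, and the timestamps are rough, since the exact moment of each attempt depends on when the kubelet's probe timer ticks.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5   # no probing during the first 5s after container start
  periodSeconds: 10        # then one attempt every 10s
  failureThreshold: 3      # the third consecutive failure triggers a restart
# Approximate timeline for an app that needs 60s to boot:
#   ~5-15s   attempt 1 fails (app still initializing)
#   ~15-25s  attempt 2 fails
#   ~25-35s  attempt 3 fails -> container restarted, boot starts over
# The app never reaches the 60s mark, so every restart repeats the cycle.
```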

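The temptation is to crank initialDelaySeconds to 90, but that is a fixed guess. It blinds the liveness probe for the first 90 seconds after every container start, even when the app is up in 20, and it still crash-loops the pod whenever a boot runs longer than the guess. The right tool is the startup probe, which gates the other two: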

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10    # allows up to 300s to start
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3  # fast detection once started

Until the startup probe succeeds, the liveness and readiness probes are disabled. This gives slow apps up to 300 seconds to come up while keeping aggressive liveness checks afterward. You get patience at boot and fast failure detection in steady state, without compromising either.
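Putting the pieces together, a container that uses all three probes might look like the sketch below. The pod name, container name, and image are placeholders invented for illustration; the paths, port, and thresholds are the ones used in the snippets above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-starter                         # hypothetical name
spec:
  containers:
    - name: app                              # hypothetical name
      image: registry.example.com/app:1.0.0  # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # 30 x 10s = up to 300s to boot
      livenessProbe:
        httpGet:
          path: /healthz       # process-only check
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /readyz        # dependency-aware check
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```

In a real workload the same container block would usually sit inside a Deployment's pod template rather than a bare Pod.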

Readiness and graceful shutdown

Readiness probes also matter at the end of a pod's life. When Kubernetes terminates a pod, endpoint removal and container shutdown start in parallel, so traffic can keep arriving for a moment while endpoint updates propagate and in-flight requests finish. A preStop hook, which runs before the container receives SIGTERM, combined with a readiness probe that flips to failing on shutdown, gives load balancers time to stop sending new requests:

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]

The sleep holds termination open long enough for the endpoint removal to propagate, so the pod drains cleanly instead of dropping requests. This is the difference between a rolling update that users never notice and one that sprinkles 502s across the deploy window.
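One detail worth keeping in mind is that the time spent in the preStop hook counts against the pod's terminationGracePeriodSeconds, which defaults to 30 seconds. Once that budget runs out, the container is killed whether or not it has finished draining. The sketch below shows where both settings sit in a pod spec; the container name, image, and the 45-second value are assumptions chosen for illustration, not recommendations.

```yaml
spec:
  terminationGracePeriodSeconds: 45          # assumed value; the default is 30
  containers:
    - name: app                              # hypothetical name
      image: registry.example.com/app:1.0.0  # placeholder image
      lifecycle:
        preStop:
          exec:
            # Runs before SIGTERM is sent; requires a shell in the image.
            command: ["sh", "-c", "sleep 10"]
```

With these numbers, roughly 10 seconds go to the sleep and the remainder is left for the application to handle SIGTERM and finish in-flight work.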

Tuning the thresholds

The numeric fields are where good intentions go wrong. periodSeconds sets how often the probe runs, failureThreshold sets how many consecutive failures trigger action, and timeoutSeconds sets how long a single check may take before it counts as failed. A common mistake is a 1-second timeout on an endpoint that occasionally takes 1.5 seconds under load, which produces flapping that looks random. Set timeouts with real latency data, not optimism, and give liveness a high enough failureThreshold that a single slow check does not nuke a healthy pod.
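As a sketch of what explicit tuning can look like, the probe below spells out every timing field instead of relying on defaults. The 3-second timeout is an assumed value that leaves headroom over the 1.5-second slow response mentioned above; your own number should come from measured latency.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0   # default 0
  periodSeconds: 10        # default 10
  timeoutSeconds: 3        # default 1; assumed headroom over a ~1.5s slow response
  failureThreshold: 3      # default 3; one slow check alone does not restart the pod
  successThreshold: 1      # default 1; must be 1 for liveness and startup probes
```

With these values a truly hung process is restarted after roughly periodSeconds times failureThreshold, about 30 seconds, which is the trade-off to weigh against tolerance for slow checks.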

# Watch restart counts climb to spot a bad liveness probe
kubectl get pods -w
kubectl describe pod <name>   # Events show probe failures

The Events section of kubectl describe pod spells out exactly which probe failed and why, which is the fastest way to diagnose a crash loop that the application logs cannot explain.

Takeaways

  • Liveness restarts the container; readiness removes it from traffic. They answer different questions, so never conflate them.
  • Never check external dependencies in a liveness probe, or a downstream blip becomes a cluster-wide restart storm.
  • Put dependency and warmup checks in the readiness probe, where failure pauses traffic without killing the process.
  • Use a startup probe for slow-booting apps so liveness does not crash-loop them before they finish initializing.
  • Pair a failing-on-shutdown readiness state with a preStop hook to drain connections for zero-downtime rollouts.
  • Tune timeoutSeconds and failureThreshold with real latency data to avoid flapping under load.