How do you debug Kubernetes probe issues?

Learn how to diagnose and fix Kubernetes probe failures. This guide covers the differences between liveness and readiness probes, timeouts caused by CPU throttling, and how to stop "Unhealthy" restart loops.

What is a Kubernetes Probe?

A Kubernetes probe is a periodic diagnostic performed by the kubelet on a container to determine its operational health. The kubelet calls a handler (a command executed inside the container, a TCP socket check, or an HTTP request) and acts on the response. Probes are the primary mechanism Kubernetes uses to maintain high availability and self-healing.

What are the 3 types of Kubernetes Probes?

In a production cluster, you will encounter three distinct probe types, each triggering a different lifecycle action (a combined example manifest follows this list):

  1. Liveness Probe: Determines if the container is running. If the liveness probe fails, the kubelet kills the container, and the container is then subject to the pod's restartPolicy.
  2. Readiness Probe: Determines if the container is prepared to respond to requests. If it fails, the pod's IP address is removed from the endpoints of all Services that match the pod.
  3. Startup Probe: Indicates whether the application within the container is initialized. All other probes are disabled until the startup probe succeeds, preventing premature restarts during slow boot sequences.
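
As a sketch, here is a pod manifest with all three probe types configured on one container; the pod name, image, port, and paths are placeholders rather than values from this guide.

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                  # hypothetical name
spec:
  restartPolicy: Always             # governs what happens after a liveness failure kills the container
  containers:
  - name: app
    image: registry.example.com/my-app:1.0   # placeholder image
    ports:
    - containerPort: 8080
    startupProbe:                   # gates liveness and readiness until it succeeds
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    livenessProbe:                  # failure: container is killed, then restarted per restartPolicy
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:                 # failure: pod is removed from matching Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5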

What do these probes actually do?

The kubelet runs probes at intervals defined in your pod spec. Each probe type serves a distinct purpose:

Liveness probes answer "is this container stuck?" If a liveness probe fails failureThreshold times in a row (default: 3), the kubelet kills the container. Whether it restarts depends on the pod's restartPolicy. Liveness probes are designed to catch deadlocks, infinite loops, or states where the application is running but can't make progress.

Readiness probes answer "can this container handle requests?" If a readiness probe fails, the kubelet removes the pod from all Service endpoints. The container keeps running. Readiness probes run continuously throughout the container's lifecycle, so a pod can transition between ready and not-ready as conditions change.

Startup probes (when configured) answer "has this container finished starting?" Liveness and readiness probes don't run until the startup probe succeeds. This prevents the kubelet from killing slow-starting containers before they're initialized.

Each probe can use one of three mechanisms (illustrated in the sketch after this list):

  • HTTP GET: The kubelet sends a request to a specified endpoint. Status codes 200-399 are success; anything else is failure.
  • TCP Socket: The kubelet attempts to open a connection to a port. Success if the connection opens; failure otherwise.
  • Exec: The kubelet runs a command inside the container. Exit code 0 is success; non-zero is failure.
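
A minimal sketch of each mechanism in a container spec; the paths, ports, and command are illustrative placeholders, and any mechanism can be paired with any probe type.

livenessProbe:              # HTTP GET: 200-399 counts as success
  httpGet:
    path: /healthz
    port: 8080
readinessProbe:             # TCP socket: success if the connection opens
  tcpSocket:
    port: 5432
startupProbe:               # Exec: exit code 0 is success
  exec:
    command: ["cat", "/tmp/initialized"]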

A liveness probe failure tells Kubernetes that a container is no longer running correctly and needs to be restarted. A readiness probe failure tells Kubernetes that a container isn't ready to receive traffic and should be removed from service endpoints. Both appear as Unhealthy events in kubectl describe pod, but they trigger different responses and often have different root causes.

Warning  Unhealthy  Liveness probe failed: Get "http://10.1.2.3:8080/healthz": context deadline exceeded
Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 503

Understanding why these probes fail requires knowing what each probe type is checking, how the kubelet executes them, and what conditions cause a running application to stop responding to health checks.

What are the common causes of probe failures?

Probe failures fall into several categories: configuration issues where the probe itself is misconfigured, resource contention where the application can't respond in time, startup timing issues where probes run before the application is ready, and application issues where the health endpoint itself has problems.

1/ The Application Isn't Ready When Probes Start

If your container needs 30 seconds to load configuration, connect to databases, and warm caches, but your liveness probe has an initialDelaySeconds of 5, the probe will start failing immediately. The kubelet will restart the container, which will start the same slow initialization, which will fail again.

This is one of the most common causes of restart loops. You'll see events like:

Warning  Unhealthy  Liveness probe failed: connection refused
Normal   Killing    Container failed liveness probe, will be restarted

The solution is either increasing initialDelaySeconds to exceed your worst-case startup time, or using a startup probe. A startup probe with failureThreshold: 30 and periodSeconds: 10 gives the container 300 seconds to start before liveness probes begin.
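
A sketch of the first option, assuming a worst-case startup of roughly 30 seconds (the path, port, and delay are placeholder values); the startup-probe alternative is shown in the tuning section below.

livenessProbe:
  httpGet:
    path: /healthz            # placeholder path
    port: 8080
  initialDelaySeconds: 45     # assumed worst-case startup (~30s) plus a margin
  periodSeconds: 10
  failureThreshold: 3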

2/ CPU Pressure Causes Timeouts

When a container is under CPU pressure, it may not respond to probes within the configured timeoutSeconds (default: 1 second). This happens either because it's doing real work or because it's being throttled at its resource limit.

The application is running fine. It would respond to the probe in 2 seconds. But the probe has a 1-second timeout, so the kubelet marks it as failed. After three failures, the container restarts, which only makes things worse: now the remaining pods have more traffic, more CPU pressure, and more probe failures.

You'll see this as:

Warning  Unhealthy  Liveness probe failed: Get "http://10.1.2.3:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Check CPU utilization with kubectl top pods and look at the pod's resource limits. If CPU usage is consistently at or near the limit, the container is being throttled. Either increase timeoutSeconds to something the application can reliably meet under load, or increase the container's CPU limits.
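
Both fixes live in the container spec. The sketch below shows them together; the CPU and timeout values are assumptions to tune against your own utilization data.

containers:
- name: app
  resources:
    requests:
      cpu: "500m"
    limits:
      cpu: "1"                # raise this if the container sits at the limit under load
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    timeoutSeconds: 5         # up from the 1-second default so the app can respond under load
    periodSeconds: 10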

3/ Probes Are Too Aggressive for the Workload

A periodSeconds of 1 with a timeoutSeconds of 1 means the kubelet is sending a probe every second and expecting a response within that same second. For applications with variable response times, garbage collection pauses, or any kind of latency jitter, this configuration will produce intermittent failures.

Intermittent readiness failures cause pods to flap in and out of service endpoints, which can look like random request failures to clients. Intermittent liveness failures cause unnecessary restarts.

Consider what the probe is actually protecting against. A liveness probe catches deadlocks, which are conditions that don't self-resolve. A container that's slow for 5 seconds but then recovers doesn't need to be restarted. Setting failureThreshold: 5 with periodSeconds: 10 means the container has 50 seconds of continuous failures before restart. That's usually appropriate for catching real deadlocks while tolerating transient slowness.
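
Expressed as configuration, that looks like the following sketch (the timeoutSeconds value is an added assumption; tune it to your endpoint's latency under load):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5           # example value; gives slow-but-recovering responses room
  failureThreshold: 5         # 5 failures x 10s period = 50 seconds before restart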

4/ The Health Endpoint Has Dependencies

A readiness probe that checks /health, which in turn queries the database, checks Redis connectivity, and validates certificates, is doing too much work. If any of those dependencies has a momentary issue, all pods fail readiness simultaneously, and the service becomes unavailable.

Liveness probes should be especially lightweight. They should answer "is this process fundamentally broken?" not "are all my dependencies working?" A liveness probe that checks external dependencies can restart containers that are perfectly healthy when the real problem is elsewhere.

For readiness probes, consider whether you want the pod to stop receiving traffic when a dependency is down. Sometimes yes (database connection pool exhausted), sometimes no (downstream service temporarily slow).

5/ Network or Firewall Issues

The kubelet runs on the node, not inside the container. Probe requests go from the kubelet to the pod IP on the probe port. Network policies, firewall rules, or CNI issues that interfere with this path will cause probe failures even when the application is healthy.

If you can kubectl exec into the pod and successfully curl the health endpoint, but probes still fail, look at network-level issues. Check that no NetworkPolicy is blocking traffic from the node to the pod on the probe port.
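
If a NetworkPolicy turns out to be the culprit, a policy along these lines permits ingress to the probe port from any source. Treat it as a sketch under assumptions: the name and label selector are hypothetical, and whether kubelet probe traffic is subject to NetworkPolicy at all varies by CNI.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-probe-port        # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: my-app               # assumed pod label
  policyTypes:
  - Ingress
  ingress:
  - ports:
    - protocol: TCP
      port: 8080                # the probe port; omitting a "from" clause allows any source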

6/ Wrong Port or Path

A probe configured for port 8080 when the application listens on 8081 will fail immediately with "connection refused." A probe pointing to /healthz when the application exposes /health will fail with 404.

These mismatches often happen when application code changes without corresponding updates to Kubernetes manifests, or when copying manifests between applications with different configurations.

livenessProbe:
  httpGet:
    path: /healthz    # Does this path actually exist?
    port: 8080        # Is this the right port?

7/ Container Still Starting When Probes Begin

This is distinct from the "application isn't ready" case. Here, the application process might not even be listening on the port yet. The kubelet sends the probe, gets "connection refused" because nothing is listening, and marks it as failed.

For applications with initialization steps before the HTTP server starts, a startup probe is usually the right solution. It lets the container take whatever time it needs to initialize, then hands off to liveness and readiness probes for ongoing monitoring.

How do you diagnose probe failures?

Start with kubectl describe pod <pod-name> and look at the Events section. Probe failures appear as Warning Unhealthy events with the specific error message.

kubectl describe pod my-app-xyz123 | grep -A 20 Events

The error message tells you a lot:

  • "connection refused": Nothing is listening on the probe port. Either wrong port, or application hasn't started listening yet.
  • "context deadline exceeded": Application didn't respond within timeoutSeconds. Either the timeout is too short, or the application is too slow.
  • "HTTP probe failed with statuscode: 503": Application responded, but with an error status. Check what the health endpoint is actually checking.
  • "no such file or directory" (exec probe): The command in the probe doesn't exist in the container.

If the pod is running but probes are failing, exec into the container and test the health check manually:

kubectl exec -it my-app-xyz123 -- curl -v http://localhost:8080/healthz

If this succeeds but probes fail, the issue is likely timing (probe runs before application is ready) or timeout (manual curl can wait longer than the 1-second default).

Check resource utilization:

kubectl top pod my-app-xyz123

If CPU is at or near the limit, throttling is likely contributing to probe timeouts.

Look at the probe configuration in the pod spec:

kubectl get pod my-app-xyz123 -o yaml | grep -A 15 livenessProbe

Compare initialDelaySeconds against actual application startup time. Compare timeoutSeconds against actual health endpoint response time under load.

How do you tune probe configuration?

The defaults (timeoutSeconds: 1, periodSeconds: 10, failureThreshold: 3) are reasonable starting points but often need adjustment.

For slow-starting applications, use a startup probe:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10

This gives the container 300 seconds to start. Once it passes, liveness and readiness probes take over.

For applications with variable response times, increase timeouts and failure thresholds:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

This tolerates occasional slow responses (up to 5 seconds) and requires 30 seconds of continuous failures before restart.

For applications that should stop receiving traffic during dependency outages, configure readiness probes accordingly but keep liveness probes independent:

readinessProbe:
  httpGet:
    path: /ready      # Checks dependencies
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
livenessProbe:
  httpGet:
    path: /alive      # Only checks if process is responsive
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5

The key insight is that liveness and readiness probes can (and often should) check different things. Readiness is about "should this pod receive traffic?" Liveness is about "is this container fundamentally broken?"

How Resolve AI Approaches Probe Failures

Probe failures often look like application problems but originate elsewhere. A container that times out on health checks might be CPU-throttled because node-level resource pressure increased after a recent deployment to a different service. Or readiness probes might be failing because a dependency three hops away started returning errors.

Diagnosing this manually means correlating events across multiple systems: pod events, node metrics, deployment history, and dependency health. Each piece lives in a different tool and requires different queries.

Resolve AI connects across these domains during investigation. When a probe failure pattern emerges, it can trace backward from the failing pod to node resource utilization, recent deployments that might have changed resource allocation, and the health of dependencies that the probe endpoint checks. This cross-domain visibility surfaces root causes that aren't apparent from any single vantage point.

The pattern recognition also helps identify whether a probe configuration needs tuning versus whether there's a deeper issue. An application that reliably starts in 20 seconds but occasionally hits 45 seconds during high cluster load needs a different solution than an application that started taking longer to initialize after a specific code change. Distinguishing these requires context about what changed and when. This is exactly the kind of cross-system correlation that's tedious to do manually but straightforward when you have visibility across code, infrastructure, and telemetry together.