Koping

No self-healing without awareness of health.

We expect Kubernetes to ensure and maintain application availability. To meet this expectation, the cluster must be able to determine whether an application is healthy so that it can detect and handle failures.

The cluster has adequate mechanisms to deal with failures, but unless we make sure it knows when an application is healthy, application-layer failures may go undetected and cause all sorts of problems. Configuring the "probes" Kubernetes uses to check the status of our containers is not strictly necessary - we can forego it, but by doing so we'd opt out of one of the central pillars of resilience, should an application prove unreliable or exhibit inconsistent behaviour.

A modern application will assist by exposing the means to perform health checks. These are often dedicated API endpoints, and querying them yields responses that may reveal even the in-depth operability of a complex application based on its own internal checks, optionally with client tooling of its own. In the case of the microservices that cloud-native applications normally rely on, it may suffice to simply check whether they respond. Some situations prove much less trivial.
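
As a rough illustration of such a dedicated endpoint, here is a minimal Python sketch using only the standard library. The `/healthz` path, the check names and the `probe_once` helper are assumptions made up for this example, not anything Kubernetes prescribes; the internal checks are stand-ins for whatever the application would really verify.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Serves a hypothetical /healthz endpoint for probes to query."""

    # Stand-ins for the application's own internal checks (hypothetical).
    checks = {"database": lambda: True, "cache": lambda: True}

    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        results = {name: bool(check()) for name, check in self.checks.items()}
        body = json.dumps(results).encode()
        # Healthy only if every internal check passes.
        self.send_response(200 if all(results.values()) else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def probe_once(checks):
    """Start the server on a free port, query it once, return the HTTP status."""
    handler = type("Handler", (HealthHandler,), {"checks": checks})
    server = HTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/healthz"
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code
    finally:
        server.shutdown()
        server.server_close()
```

Calling `probe_once({"database": lambda: False})` should yield a 503, which an HTTP-based probe would count as a failure.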

A pod has a fairly complicated lifecycle, and the phase it is found in depends - among other things - on the states of its containers, with corresponding probes to keep track of the latter. The probes are similar and have a few things in common: they are executed by the cluster component named "kubelet" - more on this later - and they can make use of one of three handlers: check if a TCP port is open by opening a socket to connect, send an HTTP request and process its response (any status code from 200 to 399 counts as success), or execute a command and process its exit code. (The first two are executed on the node, the last one in the container. Recent Kubernetes versions also offer a fourth, gRPC-based handler.) They also have a number of configurable timing parameters. Logically, probes really only apply to long-running processes and can't be used, for example, on init containers, which are short-lived and provide no services.
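
To make the three handlers more tangible, the sketch below mimics each of them in plain Python. It is only an approximation of what the kubelet does, not its actual implementation: the function names are invented for this example, and the command handler here runs locally rather than inside a container.

```python
import socket
import subprocess
import sys
import urllib.error
import urllib.request


def tcp_check(host, port, timeout=1.0):
    """Succeeds if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(url, timeout=1.0):
    """Succeeds if the response status is at least 200 and below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def exec_check(command, timeout=1.0):
    """Succeeds if the command exits with status 0 within the timeout."""
    try:
        done = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return done.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```

For instance, `exec_check([sys.executable, "-c", "raise SystemExit(3)"], timeout=10)` reports a failure because the command exits with a non-zero code.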

If a container - its main process, in fact - terminates, Kubernetes will subject the container to the pod's restart policy, which governs all its containers. Unless overridden, this means the container will be restarted regardless of its exit code. (This can be changed to only happen on failures signaled by non-zero exit codes - or never. Not all policies apply to all workloads, e.g. Jobs can't have "Always".) It's not always this straightforward, though: some processes will not or cannot terminate upon encountering problems and become inoperable instead.
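
The three restart policies boil down to a small decision, sketched here in Python. The function name is made up for illustration; the policy names `Always`, `OnFailure` and `Never` are the values Kubernetes accepts for `restartPolicy`.

```python
def should_restart(restart_policy, exit_code):
    """Decide whether a terminated container is restarted under a policy."""
    if restart_policy == "Always":
        return True  # the default: restart regardless of exit code
    if restart_policy == "OnFailure":
        return exit_code != 0  # only non-zero exit codes count as failures
    if restart_policy == "Never":
        return False
    raise ValueError(f"unknown restart policy: {restart_policy!r}")
```

Note that this only covers processes that actually exit; a hung process never produces an exit code for the policy to act on.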

"Liveness probes" are used to determine if applications are operable. Success implies the application started and is not hung or locked up. Failure terminates the container and subjects it to the pod's restart policy. You don't need this if the application is the main process (i.e. not started by init) and reliably capable of termination upon failure.

"Readiness probes" are used to determine if applications are ready to serve requests. A pod is considered ready when all its containers are ready. Success implies that incoming requests can be routed to the pod. This can be complemented and further refined by readiness gates on conditions adjusted dynamically by clients. These need to be part of the pod's spec and corresponding fields need to be maintained accordingly.

"Startup probes" were added last, supporting applications that take some time to start up - these are practically temporary liveness probes with higher tolerances that will defer execution of any actual liveness probes until their own first success, after which they no longer execute. They are used in conjunction with liveness probes and both perform the same checks, detecting application failures during and after their initial startup phases equally effectively.

These probes are not mutually exclusive: technically you can have any combination of startup, liveness and readiness probes - one each per container - including all or, as hinted at earlier, none of them, with the same or different configurations, bearing common sense in mind. The handlers employed will probably vary depending on the application in question and the tooling available in its container. Liveness and readiness probes are executed at recurring intervals. When a probe is omitted (i.e. not configured), it is treated as if it always succeeded.
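
To show how the three probes sit side by side in a container's spec, the sketch below builds a pod manifest as a Python dictionary and serialises it to JSON. The pod name, image, port and paths are placeholders invented for this example, and `http_probe` is a hypothetical helper; the field names (`startupProbe`, `livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, `failureThreshold`, `timeoutSeconds`) are the ones the pod API uses.

```python
import json


def http_probe(path, port, **timing):
    """Build an httpGet probe; keyword arguments become timing fields."""
    return {"httpGet": {"path": path, "port": port}, **timing}


pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "probe-demo"},  # placeholder name
    "spec": {
        "restartPolicy": "Always",
        "containers": [{
            "name": "app",
            "image": "example.com/app:1.0",  # placeholder image
            "ports": [{"containerPort": 8080}],
            # Tolerant probe for the startup phase only.
            "startupProbe": http_probe(
                "/healthz", 8080, periodSeconds=10, failureThreshold=30),
            # Same check, stricter tolerance once the application is up.
            "livenessProbe": http_probe(
                "/healthz", 8080, periodSeconds=10, failureThreshold=3),
            # Separate check deciding whether traffic is routed to the pod.
            "readinessProbe": http_probe(
                "/ready", 8080, periodSeconds=5, timeoutSeconds=1),
        }],
    },
}

manifest = json.dumps(pod, indent=2)
```

Written to a file, the resulting JSON could be handed to `kubectl apply -f`, which accepts JSON as well as YAML. Dropping any of the three probe keys is equally valid and simply leaves that probe unconfigured.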

That's all forks - stay tuned for more "ek8splanations"!
