Matters of state

Previously I mentioned that some applications keep state and therefore need somewhere to store it. I also said that pods contain volumes and that storage is actually an object, abstracted by Kubernetes. Let's take a closer look at how storage is put to use!

Putting the orchestration layer briefly aside: we remember that containers are started from images and store everything written within the boundaries of their root filesystem in a writable layer on top of whatever the image initially provided. That layer is physically present on the host's filesystem and is typically presented to the container through a union (overlay) filesystem rather than a bind mount. This data will survive stopping or restarting the container, but not its deletion, regardless of whether another container is later created from the same image - nor is it ever written or merged back into the image. Furthermore, security policies might require making the root filesystem of a container read-only - in such a scenario there is no way to write anything there at all.

Data produced by applications usually represents either state or cache - the former, as such, is to be made persistent, while the latter is normally ephemeral, but needs to meet higher requirements concerning throughput, i.e. speed is of the essence. Memory-backed volumes (as opposed to traditional filesystems on actual media) lend themselves to caching purposes, as long as you factor in the memory they take away from the host's reserves. We therefore also differentiate between volumes based on their lifecycle: not all volumes are meant to outlive their pods and thus be persistent. Even ephemeral storage will, however, survive individual containers, while still being bound to the lifecycle of the pod that defines it. At any rate: should we produce data in a container, we usually need to externalize it to volumes outside its own root filesystem - writable or not - for good reasons. (Why else would they have been around and widely used alongside virtual machines as well? The exact same train of thought applies, involving operational and legal security, backups, retention policies as well as performance characteristics and trivial, lifecycle-related concerns. It's as old as the "cattle, not pets" paradigm shift introduced by the emergence of cloud computing. In the world of distributed systems, instances of any kind should be disposable. Uptime should solely mean application availability. The business requires the application and its data, nothing else. No idle system has any value. Scale takes this to the next level.)

Back to Kubernetes: since containers reside in pods, volumes need to be defined in pods and mounted by the containers therein, regardless of their characteristics. However, only persistent volumes are represented by objects of their own. On the other hand, not everything that looks like a volume is one - there are specific types of objects (namely ConfigMaps and Secrets) that hold data in textual form and can back volume definitions in pods, ending up as read-only files in any container that mounts them, normally named after their keys and populated with the respective values. Many types of volumes are supported, with all vendors doing their best to provide appropriate drivers. Their definitions generally differ in configuration syntax, with options that pertain to their implementation as exposed by the drivers making them available. Ephemeral volume definitions reference these types directly, as implemented by the available drivers. Persistent volumes are different in this regard, with their own set of abstractions, but you'll naturally have to define the dirty details somewhere. We'll see.
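To make this a bit more tangible, here is a minimal sketch of a pod that defines two such volumes and mounts them in its only container: a memory-backed emptyDir for caching and a ConfigMap-backed one for configuration. I build the manifest as a plain Python dictionary and print it as JSON, which Kubernetes accepts just like YAML. The pod name, the image, the mount paths and the ConfigMap called "app-config" are placeholders I made up for illustration.

```python
import json

# All names below (pod, image, ConfigMap, paths) are hypothetical placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "cache-demo"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "registry.example.com/app:1.0",
                "volumeMounts": [
                    # Containers mount volumes by the name the pod gave them.
                    {"name": "cache", "mountPath": "/var/cache/app"},
                    {"name": "config", "mountPath": "/etc/app", "readOnly": True},
                ],
            }
        ],
        "volumes": [
            # Ephemeral and memory-backed: gone when the pod is gone.
            {"name": "cache", "emptyDir": {"medium": "Memory", "sizeLimit": "256Mi"}},
            # Not really a volume: every key of the ConfigMap shows up as a file.
            {"name": "config", "configMap": {"name": "app-config"}},
        ],
    },
}

print(json.dumps(pod, indent=2))
```

Note how the implementation details (emptyDir, configMap) sit right there in the pod specification - that is exactly what persistent volumes do differently.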

Persistence raises a new set of questions, prominent among which is the access mode of the volume in question. As you've no doubt seen elsewhere, certain implementations allow simultaneous write access, while most do not. Block storage - unlike file or object stores - generally falls into the latter category. From the storage perspective, this is simply understood as RWO (for ReadWriteOnce) or RWX (for ReadWriteMany) - and there's also ROX (for ReadOnlyMany). (There's no ReadOnlyOnce, but nothing stops you from mounting RWO volumes as read-only.) There's a catch, though: access modes apply per node rather than per pod, owing to the fact that the underlying storage is usually physically accessed and exposed by the host running the containers. Pods mounting the same RWO volume are therefore implicitly colocated on a single node. Remember: once inside a pod, you're free to share volumes between its containers - they're just directories on that level. You can, of course, use read-only mounts and different mount points each time you define volume mounts in containers, even for the same volume. Volumes used by pods that can be re-scheduled need to be on shared storage accessible by all suitable nodes.
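Sharing a volume inside a pod might look like the following sketch: two containers mount the same volume at different paths, one of them read-only. The claim named "shared-data" is assumed to exist already, and the container names, images and paths are made up. The small helper mounts_of is my own, just to show who sees the volume where.

```python
import json

# Hypothetical names throughout; "shared-data" is assumed to be an existing claim.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "writer-and-reader"},
    "spec": {
        "containers": [
            {
                "name": "writer",
                "image": "registry.example.com/writer:1.0",
                "volumeMounts": [{"name": "data", "mountPath": "/out"}],
            },
            {
                "name": "reader",
                "image": "registry.example.com/reader:1.0",
                # Same volume, different mount point, no write access.
                "volumeMounts": [
                    {"name": "data", "mountPath": "/in", "readOnly": True}
                ],
            },
        ],
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "shared-data"}}
        ],
    },
}


def mounts_of(pod, volume_name):
    """List (container, mount path, read-only?) for every mount of one volume."""
    return [
        (c["name"], m["mountPath"], m.get("readOnly", False))
        for c in pod["spec"]["containers"]
        for m in c.get("volumeMounts", [])
        if m["name"] == volume_name
    ]


print(json.dumps(pod, indent=2))
print(mounts_of(pod, "data"))
```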

Persistent volumes are crucial for StatefulSets, but you can use them in conjunction with other workload types like Deployments or even stand-alone pods. They are usually dynamically provisioned by the cluster, but they can also be pre-provisioned manually by storage administrators. Specific to them, however, is the fact that they are not defined directly in pods. Pods consume persistent storage via so-called claims, technically meaning that the volumes they list refer to PersistentVolumeClaim objects, which will, in turn, refer to StorageClass objects if dynamic provisioning is at play. A StorageClass object defines the actual implementation by specifying a provisioner, which ultimately means a piece of software running in the cluster - yes, a container in a pod - that can interface with the storage backend and take care of operations relating to volume lifecycle (creation, deletion, expansion etc.) as required by the actual implementation.
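The chain of references described above (pod, claim, storage class, provisioner) can be sketched roughly like this. Again these are Python dictionaries dumped as JSON; the provisioner "csi.example.com" and every object name are placeholders of mine, since the real values depend on the driver installed in your cluster.

```python
import json

# The provisioner name is hypothetical; a real one comes from your storage driver.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "fast"},
    "provisioner": "csi.example.com",
    "reclaimPolicy": "Delete",
    "allowVolumeExpansion": True,
}

# The claim states what is needed and names the class, not the implementation.
claim = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "app-data"},
    "spec": {
        "storageClassName": "fast",
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "2Gi"}},
    },
}

# The pod only knows the claim by name.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "app"},
    "spec": {
        "containers": [
            {
                "name": "app",
                "image": "registry.example.com/app:1.0",
                "volumeMounts": [{"name": "data", "mountPath": "/var/lib/app"}],
            }
        ],
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "app-data"}}
        ],
    },
}

# A "List" wraps several objects into a single JSON document.
manifests = {"apiVersion": "v1", "kind": "List", "items": [storage_class, claim, pod]}
print(json.dumps(manifests, indent=2))
```

The pod specification carries no storage details at all: swapping the backend means touching the storage class, not the workload.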

To complicate matters further: just as a container may have multiple volumes mounted and pods multiple volumes defined, there will probably be multiple storage classes defined in a cluster, seeing to the needs of developers consuming them as required by their use-cases. So one can say their application requires 2 gigabytes of RWO storage without a specific requirement for bandwidth - or, on the contrary, that it had better be fast, or that it needs RWX owing to its replicated nature, or maybe that encryption is a must - and they won't need to care about the implementation, as the cluster abstracts that. (They do need the list and details of available storage classes in order to be able to pick one, but that's very easy to obtain in a self-service manner. Please note the cluster may have a default storage class, which is implied whenever a claim is created with no specific reference to a storage class. This is a case when the actual details don't matter - the developers might not know what storage classes their target cluster implements and perhaps it is of no consequence either.)
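How does one tell which class is the default? It is marked with the annotation storageclass.kubernetes.io/is-default-class set to "true". The sketch below picks it out of a list of storage class objects shaped the way the API returns them; the two sample classes and the helper default_storage_class are invented for the example.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_storage_class(storage_classes):
    """Return the name of the class marked as default, or None if there is none."""
    for sc in storage_classes:
        annotations = sc.get("metadata", {}).get("annotations") or {}
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            return sc["metadata"]["name"]
    return None


# Made-up sample data, trimmed down to the fields that matter here.
sample = [
    {"metadata": {"name": "fast"}, "provisioner": "csi.example.com"},
    {
        "metadata": {
            "name": "standard",
            "annotations": {DEFAULT_ANNOTATION: "true"},
        },
        "provisioner": "csi.example.com",
    },
]

print(default_storage_class(sample))
```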

A persistent volume claim will bind a suitably large volume. It might not be precisely as large as requested, because the cluster understands the size specification as a minimum requirement, which is also satisfied by larger volumes that happen to be available for binding from the storage represented by the class named in the claim. (The provisioner will see to the creation of a volume should there be none around that fit the bill and are unbound.) In line with the intention of persistence transcending pod lifecycles, a persistent volume claim is NOT a part of the pod specification - it's an object in its own right and merely referenced by pods. Consequently, the deletion of a pod will not imply the removal of any persistent volumes, nor of any persistent volume claims that bind them. What happens to the volume when a claim is deleted, though, depends on the reclaim policy the volume inherited from the storage class it was provisioned from - it may be deleted (which is the default behaviour for dynamically provisioned volumes), releasing its capacity back to the storage pool, or it may be kept around indefinitely, being managed no further. A third option, re-purposing the volume after erasing its contents, exists as well, but has been deprecated.
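The "size is a minimum" rule is easy to get a feel for with a toy model. The function below is a deliberately simplified sketch of my own and not the cluster's actual code: among the unbound volumes of the right class and access mode it picks the smallest one that is at least as large as the claim, and returns None where a provisioner would have to step in. Sizes are plain integers in GiB and all volume names are invented.

```python
def pick_volume(claim, volumes):
    """Toy binding model: smallest unbound volume that satisfies the claim."""
    candidates = [
        v
        for v in volumes
        if not v["bound"]
        and v["storage_class"] == claim["storage_class"]
        and claim["access_mode"] in v["access_modes"]
        and v["size_gib"] >= claim["size_gib"]  # the request is only a minimum
    ]
    return min(candidates, key=lambda v: v["size_gib"], default=None)


# Made-up volumes that happen to be lying around in the cluster.
volumes = [
    {"name": "pv-1", "size_gib": 1, "storage_class": "fast",
     "access_modes": ["ReadWriteOnce"], "bound": False},
    {"name": "pv-10", "size_gib": 10, "storage_class": "fast",
     "access_modes": ["ReadWriteOnce"], "bound": False},
    {"name": "pv-5", "size_gib": 5, "storage_class": "fast",
     "access_modes": ["ReadWriteOnce"], "bound": False},
    {"name": "pv-3", "size_gib": 3, "storage_class": "fast",
     "access_modes": ["ReadWriteOnce"], "bound": True},
]

claim = {"size_gib": 2, "storage_class": "fast", "access_mode": "ReadWriteOnce"}
chosen = pick_volume(claim, volumes)
print(chosen["name"] if chosen else "nothing fits, provisioning needed")
```

In this model a claim for 2 GiB ends up with the 5 GiB volume: too small and already bound ones are skipped, and the 10 GiB one would be wasteful.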

I mentioned volume claim templates when I spoke of StatefulSets - they are there in order to implement stable storage: a separate PersistentVolumeClaim object is created for each replica based on the template and bound to the identity of its pod. Claims created this way will survive their StatefulSet by default, i.e. its deletion does not cascade down to them, so that application data is retained even with the default reclaim policy of the storage class.
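Here is a rough sketch of such a StatefulSet, once more as a Python dictionary with made-up names ("db", "data", the image and the "fast" storage class). The claims it gives rise to are named after the template and the pod, in the form template-statefulset-ordinal, which the little helper claim_names (my own, for illustration only) spells out.

```python
import json

# All names are hypothetical; the storage class "fast" is assumed to exist.
statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "db"},
    "spec": {
        "serviceName": "db",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "db"}},
        "template": {
            "metadata": {"labels": {"app": "db"}},
            "spec": {
                "containers": [
                    {
                        "name": "db",
                        "image": "registry.example.com/db:1.0",
                        # Refers to the claim template by name.
                        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/db"}],
                    }
                ]
            },
        },
        # One claim per replica is stamped out from this template.
        "volumeClaimTemplates": [
            {
                "metadata": {"name": "data"},
                "spec": {
                    "storageClassName": "fast",
                    "accessModes": ["ReadWriteOnce"],
                    "resources": {"requests": {"storage": "2Gi"}},
                },
            }
        ],
    },
}


def claim_names(statefulset):
    """Names of the claims created for each replica: <template>-<set>-<ordinal>."""
    name = statefulset["metadata"]["name"]
    replicas = statefulset["spec"].get("replicas", 1)
    templates = statefulset["spec"].get("volumeClaimTemplates", [])
    return [
        f'{t["metadata"]["name"]}-{name}-{i}'
        for i in range(replicas)
        for t in templates
    ]


print(json.dumps(statefulset, indent=2))
print(claim_names(statefulset))
```

Because the claim name is derived from the pod's identity, a re-created db-1 finds its way back to data-db-1 and thus to its old data.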

I believe you need a breather now and so do I - I'll return, though, as I'm far from finished.
