Storage / Caching

With "Large" in the name, caching is a critical part of serving LLMs.

The best caching technique may vary depending on your environment:

  • What cloud features are available?
  • Is your cluster deployed in an air-gapped environment?

A. Model built into container

Status: Supported

Building a model into a container image provides a simple way to take advantage of image-related optimizations built into Kubernetes (a minimal sketch follows this list):

  • A model server relaunched on the same Node it ran on before can likely reuse the previously pulled image.

  • Secondary boot disks on GKE can be used to avoid needing to pull images.

  • Image streaming on GKE can allow containers to start up before the entire image is present on the Node.

  • Container images can be pre-installed on Nodes in air-gapped environments (example: k3s airgap installation).
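As a rough illustration, here is a minimal Pod sketch (the image name and registry are hypothetical) that leans on the kubelet's image cache: with `imagePullPolicy: IfNotPresent`, a relaunch on the same Node reuses the already-pulled image instead of pulling it again.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    # Hypothetical image: the model weights are baked into the
    # image at build time alongside the server binary.
    image: registry.example.com/model-server-with-weights:v1
    # IfNotPresent lets a relaunch on the same Node reuse the
    # previously pulled image instead of pulling it again.
    imagePullPolicy: IfNotPresent
```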

Guides:

B. Model on shared filesystem (read-write-many)

Status: Planned

Examples: AWS EFS
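Although this approach is still planned, a minimal sketch of its general shape might look like the following (the PVC name, StorageClass name, and image are hypothetical; the StorageClass would typically be provided by a shared-filesystem driver such as the AWS EFS CSI driver):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  # ReadWriteMany lets multiple Pods, potentially on different
  # Nodes, read and write the same model files concurrently.
  accessModes:
  - ReadWriteMany
  # Hypothetical StorageClass backed by a shared filesystem.
  storageClassName: shared-fs
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:v1  # hypothetical image
    volumeMounts:
    - name: models
      mountPath: /models
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-cache
```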

C. Model on read-only-many disk

Status: Planned

Examples: GCP Hyperdisk ML
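As with the previous approach, this is still planned; a minimal sketch (PVC name, StorageClass name, and image are hypothetical) would mount a disk pre-populated with model weights as read-only across many Nodes:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  # ReadOnlyMany lets a pre-populated disk be attached to many
  # Nodes at once for read-only access.
  accessModes:
  - ReadOnlyMany
  # Hypothetical StorageClass backed by a read-optimized disk
  # that was pre-populated with model weights.
  storageClassName: read-only-models
  resources:
    requests:
      storage: 100Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:v1  # hypothetical image
    volumeMounts:
    - name: models
      mountPath: /models
      readOnly: true
  volumes:
  - name: models
    persistentVolumeClaim:
      claimName: model-weights
      readOnly: true
```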