
Storage / Caching

With "Large" in the name, caching is a critical part of serving LLMs.

The best caching technique may vary depending on your environment:

  • What cloud features are available?
  • Is your cluster deployed in an air-gapped environment?

A. Model built into container

Status: Supported

Building a model into a container image can provide a simple way to take advantage of image-related optimizations built into Kubernetes:

  • A model server relaunched on the same Node it previously ran on will likely reuse the already-pulled image (see the sketch after this list).

  • Secondary boot disks on GKE can be used to avoid needing to pull images.

  • Image streaming on GKE allows containers to start up before the entire image is present on the Node.

  • Container images can be pre-installed on Nodes in air-gapped environments (example: k3s airgap installation).
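
For illustration, here is a minimal sketch of a plain Kubernetes Deployment serving weights that were baked into its image at build time. The image name, command, and paths are placeholders, not KubeAI-provided artifacts. With imagePullPolicy: IfNotPresent, a replacement Pod scheduled onto a Node that already holds the image reuses it instead of re-downloading the weights:

```yaml
# Hypothetical Deployment whose image contains both the server and the model weights.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-server-baked-in
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-server-baked-in
  template:
    metadata:
      labels:
        app: llama-server-baked-in
    spec:
      containers:
        - name: server
          # Placeholder image built with the weights copied in at build time.
          image: registry.example.com/models/llama-3-8b-vllm:v1
          # IfNotPresent lets a Pod rescheduled onto the same Node reuse the
          # previously pulled image instead of downloading the weights again.
          imagePullPolicy: IfNotPresent
          args: ["--model", "/models/llama-3-8b"]  # path baked into the image
          resources:
            limits:
              nvidia.com/gpu: 1
```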


B. Model on shared filesystem (read-write-many)

Status: Supported

KubeAI can manage model caches on a shared filesystem (e.g. AWS EFS, GCP Filestore, NFS). It manages the full lifecycle of a cached model: loading, serving, and cache eviction (on deletion of the Model).
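
A minimal sketch of what this can look like is shown below. The field names, in particular cacheProfile and the profile name efs-dynamic, are assumptions based on how KubeAI exposes cache configuration; check the KubeAI Model API reference for the authoritative schema.

```yaml
# Sketch of a KubeAI Model cached on a shared filesystem (field values assumed).
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM
  # Assumed field: points at a cache profile (backed by EFS, Filestore, NFS, ...)
  # defined in the KubeAI installation; KubeAI loads the model into that cache
  # and evicts it when this Model resource is deleted.
  cacheProfile: efs-dynamic
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 1
```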


C. Model on read-only-many disk

Status: Planned.

Examples: GCP Hyperdisk ML
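
Since this mode is only planned in KubeAI, the sketch below illustrates just the underlying Kubernetes primitive: a ReadOnlyMany PersistentVolumeClaim backed by a pre-populated disk (such as a GCP Hyperdisk ML volume) that many serving Pods mount read-only. The StorageClass name, image, and paths are placeholders.

```yaml
# A claim on a read-only-many disk that already contains the model weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama-weights-rom
spec:
  # Placeholder class; it must map to a disk type that supports many read-only
  # attachments (e.g. GCP Hyperdisk ML), and the volume must be pre-populated
  # with the model weights before serving Pods mount it.
  storageClassName: hyperdisk-ml
  accessModes: ["ReadOnlyMany"]  # many Pods on many Nodes may read, none may write
  resources:
    requests:
      storage: 100Gi
---
# A serving Pod that mounts the shared weights volume read-only.
apiVersion: v1
kind: Pod
metadata:
  name: llama-server
spec:
  containers:
    - name: server
      image: registry.example.com/vllm:latest  # placeholder image
      args: ["--model", "/models/llama-3-8b"]
      volumeMounts:
        - name: weights
          mountPath: /models
          readOnly: true
  volumes:
    - name: weights
      persistentVolumeClaim:
        claimName: llama-weights-rom
        readOnly: true
```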