Storage / Caching
With "Large" in the name, caching is a critical part of serving LLMs.
The best caching technique may vary depending on your environment:
- What cloud features are available?
- Is your cluster deployed in an air-gapped environment?
A. Model built into container
Status: Supported
Building a model into a container image can provide a simple way to take advantage of image-related optimizations built into Kubernetes:
- Relaunching a model server on the same Node it ran on before will likely reuse the previously pulled image.
- Secondary boot disks on GKE can be used to avoid needing to pull images at all.
- Image streaming on GKE can allow containers to start before the entire image is present on the Node.
- Container images can be pre-installed on Nodes in air-gapped environments (example: k3s airgap installation).
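Baking weights into the image can be sketched with a Dockerfile like the one below. The base image, local weights directory, and serve command are illustrative assumptions, not a prescribed setup:

```dockerfile
# Sketch: bake model weights into the serving image at build time so the
# Node's image cache (or a secondary boot disk) serves them, and no
# download is needed at Pod startup.
FROM vllm/vllm-openai:latest

# "./models/my-model" is a hypothetical local directory of pre-downloaded
# weights (e.g. safetensors files).
COPY ./models/my-model /models/my-model

# Point the server at the baked-in path instead of a remote repo ID.
ENTRYPOINT ["vllm", "serve", "/models/my-model"]
```

The trade-off is image size: multi-gigabyte layers slow down pushes and cold pulls, which is why the image-streaming and pre-installed-image optimizations above matter.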
Guides:
B. Model on shared filesystem (read-write-many)
Status: Planned.
Examples: AWS EFS
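A minimal sketch of what this pattern would look like in Kubernetes: a `ReadWriteMany` PersistentVolumeClaim that all model-server Pods mount. The storage class name `efs-sc` follows the AWS EFS CSI driver examples and may differ in your cluster; the claim name and size are assumptions:

```yaml
# Sketch: one shared claim, mounted read-write by many Pods across Nodes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights
spec:
  accessModes:
    - ReadWriteMany      # every Pod mounts the same shared filesystem
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
```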
C. Model on read-only-many disk
Status: Planned.
Examples: GCP Hyperdisk ML
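A minimal sketch of this pattern: a disk pre-populated with weights, attached read-only to many Nodes at once. The `hyperdisk-ml` storage class name mirrors the GKE Hyperdisk ML documentation; the source claim name is hypothetical:

```yaml
# Sketch: clone a pre-populated claim into a ReadOnlyMany volume so many
# Pods on many Nodes attach the same read-only copy of the weights.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-weights-ro
spec:
  accessModes:
    - ReadOnlyMany       # attached read-only to multiple Nodes
  storageClassName: hyperdisk-ml
  dataSource:
    kind: PersistentVolumeClaim
    name: model-weights-source   # hypothetical pre-populated claim
  resources:
    requests:
      storage: 100Gi
```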