Install models¶
This guide explains how to install and configure KubeAI models.
Installing models with Helm¶
KubeAI provides a chart that contains preconfigured models.
Preconfigured models with Helm¶
When you are defining Helm values for the kubeai/models chart, you can install a preconfigured Model by setting enabled: true. You can view a list of all preconfigured models in the chart's default values file.
# helm-values.yaml
catalog:
  llama-3.1-8b-instruct-fp8-l4:
    enabled: true
You can optionally override preconfigured settings, for example, resourceProfile:
# helm-values.yaml
catalog:
  llama-3.1-8b-instruct-fp8-l4:
    enabled: true
    resourceProfile: nvidia-gpu-l4:2 # Require 2 NVIDIA L4 GPUs
Custom models with Helm¶
If you prefer to add a custom model via the same Helm chart you used to install KubeAI models, add an entry for it to the .catalog map in your existing values file for the kubeai/models Helm chart:
# helm-values.yaml
catalog:
  my-custom-model-name:
    enabled: true
    features: ["TextEmbedding"]
    owner: me
    url: "hf://me/my-custom-model"
    resourceProfile: cpu:1
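Once your values file is ready, installing or upgrading the kubeai/models chart applies the catalog entries. A minimal sketch, assuming the release name kubeai-models (an arbitrary choice) and the values file shown above:

helm upgrade --install kubeai-models kubeai/models -f helm-values.yaml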
Installing models with kubectl¶
You can add your own model by defining a Model YAML file and applying it using kubectl apply -f model.yaml.
Take a look at the KubeAI API docs to view Model schema documentation.
If you have a running cluster with KubeAI installed, you can inspect the schema for a Model using kubectl explain:
kubectl explain models
kubectl explain models.spec
kubectl explain models.spec.engine
You can view all example manifests in the GitHub repository.
Below are a few examples using various engines and resource profiles.
Example Gemma 2 2B using Ollama on CPU¶
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: gemma2-2b-cpu
spec:
  features: [TextGeneration]
  url: ollama://gemma2:2b
  engine: OLlama
  resourceProfile: cpu:2
Example Llama 3.1 8B using vLLM on NVIDIA L4 GPU¶
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-l4
spec:
  features: [TextGeneration]
  owner: neuralmagic
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
  engine: VLLM
  args:
    - --max-model-len=16384
    - --max-num-batched-tokens=16384
    - --gpu-memory-utilization=0.9
    - --disable-log-requests
  resourceProfile: nvidia-gpu-l4:1
Load Models from PVC¶
You can store your models in a Persistent Volume Claim (PVC) and load them into KubeAI for serving.
Currently, only vLLM supports loading models from PVCs.
The following URL formats are supported for loading models from a PVC:

- url: pvc://$PVC_NAME loads the model from the root of the PVC named $PVC_NAME.
- url: pvc://$PVC_NAME/$PATH loads the model from the PVC named $PVC_NAME, mounting the subpath $PATH within the PVC.
Make sure the model is preloaded into the PVC before you reference it in KubeAI.
The Access Mode of the PVC should be ReadOnlyMany or ReadWriteMany; otherwise, KubeAI won't be able to spin up more than 1 replica of the model.
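For illustration, a minimal Model manifest that serves weights from a PVC could look like the following. The PVC name model-weights and the subpath llama-3.1-8b are assumptions; the rest of the spec mirrors the vLLM example above.

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-from-pvc
spec:
  features: [TextGeneration]
  # PVC name and subpath below are placeholders for your own PVC layout.
  url: pvc://model-weights/llama-3.1-8b
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1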
Programmatically installing models¶
See the examples.
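As a rough illustration (not taken from the linked examples), a script or CI pipeline could create a Model by piping a manifest to kubectl; the model name and URL below are placeholders:

# Hypothetical script snippet; adjust the manifest to your model.
kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: my-scripted-model
spec:
  features: [TextGeneration]
  url: ollama://gemma2:2b
  engine: OLlama
  resourceProfile: cpu:2
EOF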
Calling a model¶
You can run inference against a model by calling the KubeAI OpenAI-compatible API. The model name in the request should match the KubeAI Model name.
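A minimal sketch of such a request, assuming a default install where the kubeai Service serves the OpenAI-compatible API under /openai/v1, and using the gemma2-2b-cpu Model defined above:

# Forward the KubeAI Service to your machine (service name and port assume a default install).
kubectl port-forward svc/kubeai 8000:80

# Call the OpenAI-compatible chat completions endpoint for the model named gemma2-2b-cpu.
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma2-2b-cpu",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'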
Feedback welcome: A model management UI¶
We are considering adding a UI for managing models in a running KubeAI instance. Give the GitHub Issue a thumbs up if you would be interested in this feature.