Serve LoRA adapters
In this guide you will configure KubeAI to serve LoRA adapters.
Configuring adapters
LoRA adapters are configured on Model objects. For example:
```yaml
# model.yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: tinyllama-chat
spec:
  features: [TextGeneration]
  owner: meta-llama
  url: hf://TinyLlama/TinyLlama-1.1B-Chat-v0.3
  adapters: # <--
  - name: colorist
    url: hf://jashing/tinyllama-colorist-lora
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 1
```
Limitation: currently, LoRA adapters are only supported with `engine: VLLM` and `hf://` or `s3://` URLs.
You can install this Model using kubectl:
```bash
kubectl apply -f ./model.yaml
```
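To confirm that the Model and its adapter were registered, you can inspect the object with kubectl. A minimal check, assuming the Model was applied to the namespace KubeAI watches (the exact columns shown depend on your KubeAI version):

```bash
# List Model objects known to KubeAI.
kubectl get models

# Print the full spec, including the configured adapters.
kubectl get model tinyllama-chat -o yaml
```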
Or, if you are managing models with the KubeAI models Helm chart, you can add adapters to a given model via your Helm values:
```yaml
# helm-values.yaml
catalog:
  llama-3.1-8b-instruct-fp8-l4:
    enabled: true
    adapters:
    - name: example
      url: hf://some-huggingface-user/some-huggingface-repo
    # ...
```
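After adding the adapter to your values, re-apply them with a Helm upgrade. A minimal sketch, assuming the models chart was installed from the kubeai Helm repository under the release name kubeai-models; adjust the release and chart names to match your installation:

```bash
# Re-apply the models chart so the catalog entry picks up the new adapter.
helm upgrade --install kubeai-models kubeai/models \
  -f ./helm-values.yaml
```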
Requesting an adapter
When using the OpenAI-compatible REST API, model adapters are referenced using the `<base-model>_<adapter>` convention. Once a Model is installed with an adapter, you can request that adapter by appending `_<adapter-name>` to the `model` field. This works with any OpenAI client library.
If you installed a Model with `name: llama-3.2` and configured `.spec.adapters[]` to contain an adapter with `name: sql`, you could issue a completion request to that adapter using:
```bash
curl http://$KUBEAI_ENDPOINT/openai/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Hi", "model": "llama-3.2_sql"}'
```
Listing adapters
Adapters will be returned by the `/models` endpoint:
```bash
curl http://$KUBEAI_ENDPOINT/openai/v1/models
```
Each adapter will be listed as a separate model object with the adapter name appended to the base Model name.
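If you only want the model IDs, you can filter the response. A small sketch assuming jq is available; adapters appear with IDs such as tinyllama-chat_colorist alongside their base models:

```bash
# Print just the model IDs from the OpenAI-style list response.
curl -s http://$KUBEAI_ENDPOINT/openai/v1/models | jq -r '.data[].id'
```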