Configure Text Generation Models

KubeAI supports the following engines for text generation models (LLMs, VLMs, etc.):

  • vLLM (Recommended for GPU)
  • Ollama (Recommended for CPU)
  • Need something else? Please file an issue on GitHub.

There are two ways to install a text generation model in KubeAI:

  • Use Helm with the kubeai/models chart.
  • Use kubectl apply -f model.yaml to install a Model Custom Resource.

KubeAI comes with pre-validated and optimized Model configurations for popular text generation models. These models are available in the kubeai/models Helm chart and are also published as raw manifests in the manifests/model directory.

You can also define your own models, either by writing a Model Custom Resource directly or through the kubeai/models Helm chart.

Install a Text Generation Model using Helm

All of the pre-configured models are listed in the chart's default values file, which you can view with the following command:

helm show values kubeai/models

Install a Text Generation Model using an L4 GPU

Enable the Llama 3.1 8B model using the Helm chart:

helm upgrade --install --reuse-values kubeai-models kubeai/models -f - <<EOF
catalog:
  llama-3.1-8b-instruct-fp8-l4:
    enabled: true
    engine: VLLM
    resourceProfile: nvidia-gpu-l4:1
    minReplicas: 1 # by default this is 0
EOF
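
Once the Helm release is applied, KubeAI creates the corresponding Model resource and, since minReplicas is 1, starts a vLLM server pod. As a quick sanity check (a sketch; pod names and readiness times will vary in your cluster), you can list the Model resources and watch the pods come up:

kubectl get models
kubectl get pods -w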

Install a Text Generation Model using kubectl

You can use the Model Custom Resource directly to install a model using kubectl apply -f model.yaml.

Install a Text Generation Model using an L4 GPU

Apply the following Model Custom Resource to install the Llama 3.1 8B model using vLLM on an L4 GPU:

apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-l4
spec:
  features: [TextGeneration]
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
  engine: VLLM
  args:
    - --max-model-len=16384
    - --max-num-batched-tokens=16384
    - --gpu-memory-utilization=0.9
    - --disable-log-requests
  resourceProfile: nvidia-gpu-l4:1
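
To install it, save the manifest above to a file (model.yaml is just an example name), apply it, and confirm the Model resource exists:

kubectl apply -f model.yaml
kubectl get models llama-3.1-8b-instruct-fp8-l4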

Interact with the Text Generation Model

The KubeAI service exposes an OpenAI compatible API that you can use to query the available models and interact with them.

The KubeAI service is available at http://kubeai/openai/v1 within the Kubernetes cluster.

You can also port-forward the KubeAI service to your local machine to interact with the models:

kubectl port-forward svc/kubeai 8000:80

You can now query the available models using curl:

curl http://localhost:8000/openai/v1/models
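
The response is a standard OpenAI-style model list. If you have jq installed, you can extract just the model IDs (an optional convenience, not required):

curl -s http://localhost:8000/openai/v1/models | jq -r '.data[].id'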

Using curl to interact with the model

Run the following curl command to interact with the model named llama-3.1-8b-instruct-fp8-l4:

curl "http://localhost:8000/openai/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b-instruct-fp8-l4",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Write a haiku about recursion in programming."
            }
        ]
    }'
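
The response follows the OpenAI chat completion schema. To print only the assistant's reply, you can pipe the same request through jq (assuming jq is available locally):

curl -s "http://localhost:8000/openai/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama-3.1-8b-instruct-fp8-l4",
        "messages": [{"role": "user", "content": "Write a haiku about recursion in programming."}]
    }' | jq -r '.choices[0].message.content'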

Using the OpenAI Python SDK to interact with the model

Once the model pod is ready, you can use the OpenAI Python SDK to interact with it. Since the KubeAI service is OpenAI API compatible, any OpenAI client SDK will work. For example:

import os
from openai import OpenAI
# Assumes port-forward of kubeai service to localhost:8000.
kubeai_endpoint = "http://localhost:8000/openai/v1"
model_name = "llama-3.1-8b-instruct-fp8-l4"

# If you are running in a Kubernetes cluster, you can use the kubeai service endpoint.
if os.getenv("KUBERNETES_SERVICE_HOST"):
    kubeai_endpoint = "http://kubeai/openai/v1"

client = OpenAI(api_key="ignored", base_url=kubeai_endpoint)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model=model_name,
)
print(chat_completion.choices[0].message.content)
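
The same client can also stream tokens as they are generated. Below is a minimal streaming sketch that continues from the example above (it reuses client and model_name, and assumes the engine supports streaming, which vLLM does through its OpenAI-compatible API):

# Continues from the example above; client and model_name are already defined.
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Say this is a test"}],
    model=model_name,
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; content can be None on some chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()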