Benchmarking Prefix Aware Load Balancing¶
Prefix Aware Load Balancing improves throughput, inter-token latency (ITL), and Time To First Token (TTFT). Even under heavy load, TTFT stays stable.
The benchmarks demonstrate the following improvements from enabling Prefix Aware Load Balancing under heavy load (8000 concurrent requests), compared to Least Load:
- 164x improvement in Mean TTFT, from 39163.80 ms down to 237.06 ms.
- 41% higher throughput (total tokens/second), from 47609.83 to 67333.71.
- 143% improvement in inter-token latency, from 194.44 ms down to 79.90 ms.
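These headline figures are derived from the 8000-concurrency results at the bottom of this page; a quick sanity check of the arithmetic:

# Headline numbers, taken from the 8000-concurrency results below
# (KubeAI Least Load vs. KubeAI Prefix Hash).
least_load_ttft_ms, prefix_hash_ttft_ms = 39163.80, 237.06
least_load_tput, prefix_hash_tput = 47609.83, 67333.71
least_load_itl_ms, prefix_hash_itl_ms = 194.44, 79.90

print(f"{(least_load_ttft_ms - prefix_hash_ttft_ms) / prefix_hash_ttft_ms:.0f}x")  # 164x
print(f"{(prefix_hash_tput - least_load_tput) / least_load_tput:.0%}")             # 41%
print(f"{(least_load_itl_ms - prefix_hash_itl_ms) / prefix_hash_itl_ms:.0%}")      # 143%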
The benchmarks compare the following load balancing strategies:
- Kubernetes Native Service - Round Robin. Distribute requests across all instances without regard to load.
- KubeAI - Least Load. Send the request to the instance that is handling the fewest in-flight requests.
- KubeAI - Prefix Hash. Send the request to an instance that has already handled the same prefix or a partial prefix.
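To make the three policies concrete, here is a minimal, illustrative sketch in Python. This is not KubeAI's actual implementation; the hash function, the 128-character prefix length, and the in-flight counter are all assumptions for illustration.

import hashlib
from itertools import count

class Balancer:
    def __init__(self, instances):
        self.instances = instances                   # e.g. ["vllm-0", ..., "vllm-7"]
        self.in_flight = {i: 0 for i in instances}   # maintained by the caller
        self._rr = count()

    def round_robin(self):
        # K8s-Service style: rotate through instances regardless of state.
        return self.instances[next(self._rr) % len(self.instances)]

    def least_load(self):
        # Pick the instance currently handling the fewest requests.
        return min(self.instances, key=self.in_flight.__getitem__)

    def prefix_hash(self, prompt, prefix_chars=128):
        # Hash a fixed-length prefix so prompts that share it land on the
        # same instance, whose vLLM prefix cache already holds that KV state.
        h = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
        return self.instances[int.from_bytes(h[:8], "big") % len(self.instances)]

A production implementation would typically layer consistent hashing and a load bound on top of this, so that one hot prefix cannot overload a single replica.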
The benchmarks were run on 8 instances of vLLM serving Llama 3.1 8B, each using a single L4 GPU. vLLM was configured with prefix caching enabled.
The ShareGPT dataset was purposely crafted so that prompts share partial prefixes; see the Dataset and Benchmarking script section below for details. Performance gains will be smaller when there is less partial prefix re-use.
Comparing the Mean TTFT for each load balancing strategy:
We can see that as the engines become overloaded, the Mean TTFT increases significantly for both the K8s native Service and KubeAI Least Load. KubeAI Prefix Aware Load Balancing, however, maintains a stable TTFT even under heavy load.
Comparing the throughput in tokens per second for each load balancing strategy:
The graph shows that even at low load, enabling Prefix Aware Load Balancing yields a significant improvement in throughput.
Conclusion: Prefix Aware Load Balancing is a must-have for large-scale inference workloads.
Dataset and Benchmarking script¶
Dataset: ShareGPT, filtered to only include conversations with 16 or more messages. This simulates a scenario where people ask follow-up questions and increases the amount of partial prefix re-use.
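The filtering itself is straightforward. A minimal sketch, assuming the common ShareGPT JSON layout where each record holds its messages under a "conversations" key (the input file name is illustrative):

import json

with open("sharegpt_unfiltered.json") as f:   # hypothetical input file
    conversations = json.load(f)

# Keep only conversations with 16 or more messages.
filtered = [c for c in conversations if len(c.get("conversations", [])) >= 16]

with open("sharegpt_16_messages_or_more.json", "w") as f:
    json.dump(filtered, f)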
The vLLM benchmark_serving.py script was used, with a few modifications:
* Removed the limit that included only the first 2 messages of each conversation.
* Create multiple prompts from a single conversation: e.g. prompt 1 would include message (1) of conversation x, and prompt 2 would include messages (1, 2, 3) of conversation x. This resembles the multi-round conversations of ChatGPT; see the sketch after this list.
* Added a --max-conversations parameter, which limits the number of unique conversations to use.
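A minimal sketch of that expansion (simplified; the actual modified script may differ in details such as tokenization and prompt formatting):

def expand_conversation(messages):
    # Turn one multi-turn conversation into several prompts that share a
    # growing common prefix: [m1], [m1, m2, m3], [m1, ..., m5], ...
    # Odd counts only, so each prompt ends on a user message.
    return [messages[:n] for n in range(1, len(messages) + 1, 2)]

conv = ["user1", "assistant1", "user2", "assistant2", "user3", "assistant3"]
for prompt in expand_conversation(conv):
    print(prompt)
# ['user1']
# ['user1', 'assistant1', 'user2']
# ['user1', 'assistant1', 'user2', 'assistant2', 'user3']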
The script can be found under kubeai/benchmarks/chat-py/benchmark_serving.py.
The container image that was used: substratusai/benchmark_serving:v0.0.1
Benchmarking Setup¶
- Scale: 8 instances of vLLM
- GPU: L4 GPU, 1 per instance
- Model: Llama 3.1 8B Instruct FP8
The following model was deployed in KubeAI:
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-fp8-l4
spec:
  features: [TextGeneration]
  url: hf://neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
  engine: VLLM
  env:
    VLLM_USE_V1: "1"
  args:
    - --enable-prefix-caching
    - --max-model-len=16384
    - --max-num-batched-tokens=16384
    - --gpu-memory-utilization=0.95
    - --disable-log-requests
    - --kv-cache-dtype=fp8
  resourceProfile: nvidia-gpu-l4:1
  minReplicas: 8
  maxReplicas: 8
To test Prefix Aware Load Balancing, we modify the load balancing strategy on the Model object itself:
kubectl patch model llama-3.1-8b-instruct-fp8-l4 --type='merge' \
-p '{"spec": {"loadBalancing": {"strategy": "PrefixHash"}}}'
The K8s native Service was tested by sending requests directly to the K8s Service instead of the KubeAI proxy/load balancer. This allows us to test the default K8s Service round-robin load balancing.
This was the K8s Service used:
apiVersion: v1
kind: Service
metadata:
  name: vllm-direct
  labels:
    app: vllm-direct
spec:
  selector:
    app.kubernetes.io/name: vllm
  ports:
    - name: http
      protocol: TCP
      port: 80
      targetPort: 8000
  type: ClusterIP
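For reference, a request sent straight to this Service (bypassing KubeAI) would look roughly like the sketch below, run from inside the cluster. The served model name is an assumption here; it must match whatever name vLLM was started with.

import requests

resp = requests.post(
    "http://vllm-direct/v1/completions",          # the Service above, in-cluster DNS
    json={
        "model": "llama-3.1-8b-instruct-fp8-l4",  # assumed served model name
        "prompt": "Hello",
        "max_tokens": 16,
    },
)
print(resp.json()["choices"][0]["text"])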
800 concurrent requests¶
The benchmark Job used the following container configuration:

containers:
  - name: benchmark-serving
    image: substratusai/benchmark_serving:v0.0.1
    args:
      - --base-url=http://kubeai/openai
      - --dataset-name=sharegpt
      - --dataset-path=/app/sharegpt_16_messages_or_more.json
      - --model=llama-3.1-8b-instruct-fp8-l4
      - --seed=12345
      - --tokenizer=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
      - --request-rate=800
      - --max-concurrency=800
      - --num-prompts=8000
      - --max-conversations=800
restartPolicy: Never
K8s Service - Round Robin (No KubeAI proxy)¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 159.98
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 50.01
Output token throughput (tok/s): 3803.20
Total Token throughput (tok/s): 45409.81
---------------Time to First Token----------------
Mean TTFT (ms): 1319.77
Median TTFT (ms): 601.29
P99 TTFT (ms): 7438.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 189.29
Median TPOT (ms): 184.76
P99 TPOT (ms): 486.16
---------------Inter-token Latency----------------
Mean ITL (ms): 173.06
Median ITL (ms): 94.60
P99 ITL (ms): 715.66
==================================================
KubeAI - Least Load¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 158.39
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 50.51
Output token throughput (tok/s): 3841.42
Total Token throughput (tok/s): 45866.16
---------------Time to First Token----------------
Mean TTFT (ms): 817.18
Median TTFT (ms): 494.28
P99 TTFT (ms): 5551.81
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 191.44
Median TPOT (ms): 183.18
P99 TPOT (ms): 520.48
---------------Inter-token Latency----------------
Mean ITL (ms): 176.03
Median ITL (ms): 124.55
P99 ITL (ms): 691.97
==================================================
KubeAI - Prefix Hash¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 104.67
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 76.43
Output token throughput (tok/s): 5813.11
Total Token throughput (tok/s): 69407.79
---------------Time to First Token----------------
Mean TTFT (ms): 280.20
Median TTFT (ms): 239.80
P99 TTFT (ms): 1260.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 86.55
Median TPOT (ms): 91.13
P99 TPOT (ms): 139.47
---------------Inter-token Latency----------------
Mean ITL (ms): 85.78
Median ITL (ms): 77.35
P99 ITL (ms): 272.04
==================================================
1600 concurrent requests¶
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-serving
spec:
  template:
    spec:
      containers:
        - name: benchmark-serving
          image: substratusai/benchmark_serving:v0.0.1
          args:
            - --base-url=http://kubeai/openai
            - --dataset-name=sharegpt
            - --dataset-path=/app/sharegpt_16_messages_or_more.json
            - --model=llama-3.1-8b-instruct-fp8-l4
            - --seed=12345
            - --tokenizer=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
            - --request-rate=200
            - --max-concurrency=1600
            - --num-prompts=8000
            - --max-conversations=800
      restartPolicy: Never
K8s Service - Round Robin¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 157.07
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 50.93
Output token throughput (tok/s): 3873.62
Total Token throughput (tok/s): 46250.51
---------------Time to First Token----------------
Mean TTFT (ms): 10365.29
Median TTFT (ms): 10068.73
P99 TTFT (ms): 22283.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 216.53
Median TPOT (ms): 207.58
P99 TPOT (ms): 607.73
---------------Inter-token Latency----------------
Mean ITL (ms): 197.37
Median ITL (ms): 90.35
P99 ITL (ms): 749.96
==================================================
KubeAI - Least Load¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 153.02
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 52.28
Output token throughput (tok/s): 3976.28
Total Token throughput (tok/s): 47476.29
---------------Time to First Token----------------
Mean TTFT (ms): 10579.01
Median TTFT (ms): 11501.96
P99 TTFT (ms): 15514.10
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 212.39
Median TPOT (ms): 202.98
P99 TPOT (ms): 613.06
---------------Inter-token Latency----------------
Mean ITL (ms): 193.34
Median ITL (ms): 92.65
P99 ITL (ms): 747.65
==================================================
KubeAI - Prefix Hash¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 110.00
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 72.73
Output token throughput (tok/s): 5531.31
Total Token throughput (tok/s): 66043.15
---------------Time to First Token----------------
Mean TTFT (ms): 196.13
Median TTFT (ms): 184.29
P99 TTFT (ms): 492.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 78.51
Median TPOT (ms): 81.50
P99 TPOT (ms): 117.36
---------------Inter-token Latency----------------
Mean ITL (ms): 79.20
Median ITL (ms): 70.36
P99 ITL (ms): 249.71
==================================================
3200 concurrent requests¶
The same Job was used, with --max-concurrency raised to 3200:

containers:
  - name: benchmark-serving
    image: substratusai/benchmark_serving:v0.0.1
    args:
      - --base-url=http://kubeai/openai
      - --dataset-name=sharegpt
      - --dataset-path=/app/sharegpt_16_messages_or_more.json
      - --model=llama-3.1-8b-instruct-fp8-l4
      - --seed=12345
      - --tokenizer=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
      - --request-rate=200
      - --max-concurrency=3200
      - --num-prompts=8000
      - --max-conversations=800
K8s Native - Round Robin¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 156.36
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 51.16
Output token throughput (tok/s): 3891.22
Total Token throughput (tok/s): 46460.74
---------------Time to First Token----------------
Mean TTFT (ms): 27183.41
Median TTFT (ms): 31260.66
P99 TTFT (ms): 51797.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 214.63
Median TPOT (ms): 205.61
P99 TPOT (ms): 629.95
---------------Inter-token Latency----------------
Mean ITL (ms): 195.30
Median ITL (ms): 88.07
P99 ITL (ms): 742.53
==================================================
KubeAI - Least Load¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 152.43
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 52.48
Output token throughput (tok/s): 3991.56
Total Token throughput (tok/s): 47658.74
---------------Time to First Token----------------
Mean TTFT (ms): 24147.86
Median TTFT (ms): 25580.61
P99 TTFT (ms): 46021.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 211.98
Median TPOT (ms): 201.97
P99 TPOT (ms): 598.14
---------------Inter-token Latency----------------
Mean ITL (ms): 192.94
Median ITL (ms): 93.29
P99 ITL (ms): 721.71
==================================================
KubeAI - Prefix Hash¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 111.37
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 71.84
Output token throughput (tok/s): 5463.50
Total Token throughput (tok/s): 65233.60
---------------Time to First Token----------------
Mean TTFT (ms): 213.92
Median TTFT (ms): 188.53
P99 TTFT (ms): 838.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 78.73
Median TPOT (ms): 82.17
P99 TPOT (ms): 122.60
---------------Inter-token Latency----------------
Mean ITL (ms): 78.49
Median ITL (ms): 70.32
P99 ITL (ms): 242.44
==================================================
8000 concurrent requests¶
containers:
  - name: benchmark-serving
    image: substratusai/benchmark_serving:v0.0.1
    args:
      - --base-url=http://kubeai/openai
      - --dataset-name=sharegpt
      - --dataset-path=/app/sharegpt_16_messages_or_more.json
      - --model=llama-3.1-8b-instruct-fp8-l4
      - --seed=12345
      - --tokenizer=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
      - --request-rate=800
      - --max-concurrency=8000
      - --num-prompts=8000
      - --max-conversations=800
K8s Native - Round Robin¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 156.20
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 51.22
Output token throughput (tok/s): 3895.38
Total Token throughput (tok/s): 46510.40
---------------Time to First Token----------------
Mean TTFT (ms): 48587.55
Median TTFT (ms): 48682.53
P99 TTFT (ms): 101940.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 215.24
Median TPOT (ms): 206.65
P99 TPOT (ms): 566.10
---------------Inter-token Latency----------------
Mean ITL (ms): 196.77
Median ITL (ms): 87.08
P99 ITL (ms): 751.68
==================================================
KubeAI - Least Load¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 152.59
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 52.43
Output token throughput (tok/s): 3987.46
Total Token throughput (tok/s): 47609.83
---------------Time to First Token----------------
Mean TTFT (ms): 39163.80
Median TTFT (ms): 40140.70
P99 TTFT (ms): 78489.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 214.09
Median TPOT (ms): 205.62
P99 TPOT (ms): 623.61
---------------Inter-token Latency----------------
Mean ITL (ms): 194.44
Median ITL (ms): 90.36
P99 ITL (ms): 725.95
==================================================
KubeAI - Prefix Hash¶
============ Serving Benchmark Result ============
Successful requests: 8000
Benchmark duration (s): 107.89
Total input tokens: 6656338
Total generated tokens: 608447
Request throughput (req/s): 74.15
Output token throughput (tok/s): 5639.40
Total Token throughput (tok/s): 67333.71
---------------Time to First Token----------------
Mean TTFT (ms): 237.06
Median TTFT (ms): 219.27
P99 TTFT (ms): 619.65
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 79.99
Median TPOT (ms): 81.76
P99 TPOT (ms): 124.28
---------------Inter-token Latency----------------
Mean ITL (ms): 79.90
Median ITL (ms): 71.31
P99 ITL (ms): 303.14
==================================================