Llama 3.2 11B Vision Instruct vLLM Benchmarks
Single L4 GPU, vLLM 0.6.2
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model meta-llama-3.2-11b-vision-instruct \
--seed 12345 --tokenizer neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 681.93
Total input tokens: 230969
Total generated tokens: 194523
Request throughput (req/s): 1.47
Output token throughput (tok/s): 285.25
Total Token throughput (tok/s): 623.95
---------------Time to First Token----------------
Mean TTFT (ms): 319146.12
Median TTFT (ms): 322707.98
P99 TTFT (ms): 642512.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 54.84
Median TPOT (ms): 53.66
P99 TPOT (ms): 83.75
---------------Inter-token Latency----------------
Mean ITL (ms): 54.09
Median ITL (ms): 47.44
P99 ITL (ms): 216.77
==================================================
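As a sanity check, the aggregate throughput figures above can be recomputed from the reported totals: each one is a count divided by the benchmark duration. The sketch below does that. It also shows one way to derive per-request latency metrics (TTFT, TPOT, ITL) from token arrival timestamps, and how to take a linear-interpolation percentile. This is a minimal illustration, not the code in benchmark_serving.py. The helper names are hypothetical, and the latency part assumes each streamed chunk carries exactly one token.

```python
# Recompute the aggregate throughput figures from the totals in the report above.
BENCHMARK_DURATION_S = 681.93
SUCCESSFUL_REQUESTS = 1000
TOTAL_INPUT_TOKENS = 230969
TOTAL_GENERATED_TOKENS = 194523


def throughput_summary(requests, input_tokens, output_tokens, duration_s):
    """Hypothetical helper: request, output-token and total-token throughput, rounded to 2 places."""
    return {
        "request_throughput": round(requests / duration_s, 2),
        "output_token_throughput": round(output_tokens / duration_s, 2),
        "total_token_throughput": round((input_tokens + output_tokens) / duration_s, 2),
    }


def percentile(values, pct):
    """Hypothetical helper: linear-interpolation percentile (same convention as numpy's default)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def request_latency_metrics(send_time, token_times):
    """Hypothetical helper: per-request TTFT, TPOT and inter-token gaps.

    Assumes token_times holds the arrival time (seconds) of each output token,
    one token per streamed chunk.
    """
    ttft = token_times[0] - send_time
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    latency = token_times[-1] - send_time
    # TPOT excludes the first token: time after the first token spread over the remaining tokens.
    tpot = (latency - ttft) / (len(token_times) - 1) if len(token_times) > 1 else None
    return ttft, tpot, itl


if __name__ == "__main__":
    print(throughput_summary(SUCCESSFUL_REQUESTS, TOTAL_INPUT_TOKENS,
                             TOTAL_GENERATED_TOKENS, BENCHMARK_DURATION_S))
    # prints {'request_throughput': 1.47, 'output_token_throughput': 285.25, 'total_token_throughput': 623.95}
```

In this sketch, TPOT yields one value per request while ITL yields one value per gap, so averaging TPOT over requests and averaging ITL over all pooled gaps need not agree. That may be why Mean TPOT and Mean ITL differ slightly in the report above.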