LLM Inference Engines Performance Comparison: vLLM vs sglang
Benchmark Comparison Report
This report compares three inference engine configurations (vLLM V0, vLLM V1, and sglang V0.4.9) on the Llama-3.1-8B-Instruct model. Every run generated 634,000 tokens in total, so the results are directly comparable across engines.
Testing Environment
- Model: Llama-3.1-8B-Instruct
- Hardware: 1 NVIDIA H100 80GB GPU
- Command:
evalscope perf --parallel 5 10 20 30 40 50 60 70 80 90 100 --number 5 10 20 30 40 50 60 70 80 90 100 --model meta-llama/Llama-3.1-8B-Instruct --url <URL> --api openai --dataset longalpaca --max-tokens 2000 --min-tokens 2000 --max-prompt-length 10000 --min-prompt-length 5000
- vLLM version: v0.9.0 (used for both the V0 and V1 engines)
- sglang version: v0.4.9
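For repeated runs against several engines, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in the command; the helper name `build_perf_command` is hypothetical, and `<URL>` stays a placeholder exactly as in the report.

```python
import shlex

# Concurrency levels from the report. The report passes the same list to
# both --parallel and --number.
LEVELS = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]


def build_perf_command(url: str) -> list[str]:
    """Return the argv list for the evalscope perf run described above.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    levels = [str(n) for n in LEVELS]
    return [
        "evalscope", "perf",
        "--parallel", *levels,
        "--number", *levels,
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--url", url,
        "--api", "openai",
        "--dataset", "longalpaca",
        "--max-tokens", "2000",
        "--min-tokens", "2000",
        "--max-prompt-length", "10000",
        "--min-prompt-length", "5000",
    ]


# "<URL>" is kept as a placeholder, as in the report.
command = build_perf_command("<URL>")
print(shlex.join(command))
```

Passing the returned list to `subprocess.run` would start the benchmark, provided evalscope is installed and an engine is serving at the given URL.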
I. Basic Performance Comparison
Inference Engine | Total Time (sec) | Avg Output Rate (tokens/sec) | Performance Improvement (vs vLLM V0) |
---|---|---|---|
vLLM V0 | 268.79 | 2358.71 | Baseline |
vLLM V1 | 236.00 | 2686.49 | +14% (rate improvement) |
sglang V0.4.9 | 236.95 | 2675.66 | +13% (rate improvement) |
- vLLM V1 and sglang V0.4.9 perform similarly, both outperforming vLLM V0.
- vLLM V1 is marginally faster than sglang (2686.49 vs 2675.66 tokens/sec, about 0.4%), which is a minimal difference.
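The percentages in the table can be reproduced from the reported figures. The sketch below recomputes the change in output rate and total time relative to vLLM V0; the dictionary layout and the `pct_change` helper are illustrative choices, not part of the benchmark tooling.

```python
# Figures copied from the Basic Performance Comparison table.
results = {
    "vLLM V0": {"total_time_s": 268.79, "rate_tok_s": 2358.71},
    "vLLM V1": {"total_time_s": 236.00, "rate_tok_s": 2686.49},
    "sglang V0.4.9": {"total_time_s": 236.95, "rate_tok_s": 2675.66},
}


def pct_change(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new / base - 1) * 100


baseline = results["vLLM V0"]
improvement = {}
for name, r in results.items():
    rate_gain = pct_change(r["rate_tok_s"], baseline["rate_tok_s"])
    time_change = pct_change(r["total_time_s"], baseline["total_time_s"])
    improvement[name] = (rate_gain, time_change)
    print(f"{name:14s} rate {rate_gain:+.1f}%  total time {time_change:+.1f}%")
```

Because every run produced the same number of tokens, the rate gain and the reduction in total time describe the same effect from two directions.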
II. Detailed Performance Metrics
1. Average Latency
Concurrency | vLLM V0 Avg Lat(s) | vLLM V1 Avg Lat(s) | sglang Avg Lat(s) |
---|---|---|---|
5 | 16.98 | 15.75 | 15.00 |
50 | 26.18 | 22.80 | 23.38 |
100 | 26.16 | 23.59 | 23.29 |
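- Low concurrency (5): sglang has the lowest average latency (15.00 s), providing the fastest response.
- High concurrency: vLLM V1 and sglang are close. vLLM V1 is lower at concurrency 50 (22.80 s vs 23.38 s), while sglang is slightly lower at concurrency 100 (23.29 s vs 23.59 s).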
2. Average Generation Rate (Gen. toks/sec)
Concurrency | vLLM V0 Gen. toks/s | vLLM V1 Gen. toks/s | sglang Gen. toks/s |
---|---|---|---|
5 | 588.62 | 634.87 | 666.54 |
50 | 2742.96 | 3141.16 | 3077.68 |
100 | 2744.10 | 3036.62 | 3088.08 |
- High concurrency: Both vLLM V1 and sglang exceed 3000 tokens/s at concurrency 50 and 100, roughly 11% to 15% higher than vLLM V0.
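Since the test command fixes the output length at 2000 tokens per request (`--max-tokens 2000 --min-tokens 2000`), the generation rate should be close to RPS multiplied by 2000. The sketch below checks this against a few rows of the original benchmark data; the small residual is assumed to come from RPS being rounded to two decimals in the tables.

```python
TOKENS_PER_REQUEST = 2000  # --max-tokens and --min-tokens are both 2000

# (engine, concurrency, RPS, reported Gen. toks/s) from the original benchmark data.
rows = [
    ("vLLM V0", 5, 0.29, 588.62),
    ("vLLM V0", 50, 1.37, 2742.96),
    ("vLLM V0", 100, 1.37, 2744.10),
    ("vLLM V1", 5, 0.32, 634.87),
    ("vLLM V1", 50, 1.57, 3141.16),
    ("vLLM V1", 100, 1.52, 3036.62),
    ("sglang", 5, 0.33, 666.54),
    ("sglang", 50, 1.54, 3077.68),
    ("sglang", 100, 1.54, 3088.08),
]


def estimated_rate(rps: float) -> float:
    """Approximate generated tokens/sec from requests/sec."""
    return rps * TOKENS_PER_REQUEST


relative_errors = []
for engine, concurrency, rps, reported in rows:
    estimate = estimated_rate(rps)
    error = abs(estimate - reported) / reported
    relative_errors.append(error)
    print(f"{engine:8s} conc={concurrency:3d} "
          f"estimate={estimate:7.1f} reported={reported:7.2f} error={error:.2%}")
```

This means the RPS and Gen. toks/s columns carry the same information in this benchmark; they would diverge only if output lengths varied between requests.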
3. Time to First Token (TTFT)
Concurrency | vLLM V0 TTFT(s) | vLLM V1 TTFT(s) | sglang TTFT(s) |
---|---|---|---|
5 | 0.318 | 0.276 | 0.136 |
50 | 0.357 | 0.348 | 0.258 |
100 | 0.415 | 0.373 | 0.254 |
- sglang has the lowest TTFT at every concurrency level shown: about 0.14 s to 0.26 s, compared with 0.28 s to 0.37 s for vLLM V1 and 0.32 s to 0.42 s for vLLM V0.
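To put the TTFT gap in context, it helps to compare TTFT with the end-to-end latency of a 2000-token response. The sketch below uses the values from the latency and TTFT tables above; the helper names `ttft_share` and `ttft_ratio` are illustrative.

```python
# {engine: {concurrency: (avg latency in s, avg TTFT in s)}} from the tables above.
data = {
    "vLLM V0": {5: (16.98, 0.318), 50: (26.18, 0.357), 100: (26.16, 0.415)},
    "vLLM V1": {5: (15.75, 0.276), 50: (22.80, 0.348), 100: (23.59, 0.373)},
    "sglang": {5: (15.00, 0.136), 50: (23.38, 0.258), 100: (23.29, 0.254)},
}


def ttft_share(engine: str, concurrency: int) -> float:
    """Fraction of average end-to-end latency spent before the first token."""
    latency, ttft = data[engine][concurrency]
    return ttft / latency


def ttft_ratio(engine: str, concurrency: int, reference: str = "sglang") -> float:
    """How many times larger an engine's TTFT is than the reference engine's."""
    return data[engine][concurrency][1] / data[reference][concurrency][1]


for engine, by_concurrency in data.items():
    for concurrency in by_concurrency:
        print(f"{engine:8s} conc={concurrency:3d} "
              f"TTFT share={ttft_share(engine, concurrency):.2%} "
              f"TTFT vs sglang={ttft_ratio(engine, concurrency):.2f}x")
```

In these runs TTFT stays below 2% of total latency, so the sglang advantage matters mainly for perceived responsiveness in streaming use rather than for total completion time.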
4. Throughput (RPS & Concurrency Capability)
Inference Engine | Max Throughput (RPS) | Optimal Concurrency | Recommended Range |
---|---|---|---|
vLLM V0 | 1.37 req/sec | 50 | ~50 |
vLLM V1 | 1.57 req/sec | 40 | ~40 |
sglang | 1.55 req/sec | 60 | ~60 |
- vLLM V1 and sglang lead in peak throughput (1.57 and 1.55 req/sec), reached at concurrency levels of 40 and 60 respectively; vLLM V0 tops out at 1.37 req/sec.
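The optimal concurrency values in the table correspond to the first concurrency level at which each engine reaches its highest RPS. A minimal sketch of that selection over the RPS columns of the original benchmark data follows; breaking ties toward the lowest concurrency is an assumption that happens to match the reported values.

```python
CONCURRENCY = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# RPS column for each engine, copied from the original benchmark data.
RPS = {
    "vLLM V0": [0.29, 0.53, 0.92, 1.19, 1.32, 1.37, 1.34, 1.37, 1.35, 1.35, 1.37],
    "vLLM V1": [0.32, 0.58, 1.03, 1.36, 1.57, 1.57, 1.57, 1.56, 1.55, 1.57, 1.52],
    "sglang": [0.33, 0.59, 1.03, 1.36, 1.54, 1.54, 1.55, 1.53, 1.53, 1.55, 1.54],
}


def peak(rps_values: list[float]) -> tuple[int, float]:
    """Return (concurrency, RPS) at the highest RPS.

    max() keeps the first maximal element, so ties resolve to the
    lowest concurrency level.
    """
    best_index = max(range(len(rps_values)), key=rps_values.__getitem__)
    return CONCURRENCY[best_index], rps_values[best_index]


peaks = {engine: peak(values) for engine, values in RPS.items()}
for engine, (concurrency, rps) in peaks.items():
    print(f"{engine:8s} peak {rps:.2f} req/sec at concurrency {concurrency}")
```

All three engines plateau after their peak, so any concurrency at or above the reported optimum gives nearly the same throughput, at the cost of slightly higher latency.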
III. Best Performance Configurations
Inference Engine | Lowest Latency (s) | Optimal Concurrency | Generation Rate (tokens/sec) |
---|---|---|---|
vLLM V0 | 16.98 | 5 | 588.62 |
vLLM V1 | 15.75 | 5 | 634.87 |
sglang | 15.00 | 5 | 666.54 |
- For low-latency scenarios, sglang performs best.
IV. Summary and Recommendations
- Performance Improvement:
  - vLLM V1 and sglang improve on vLLM V0 in both generation rate and latency, with average output rate rising by approximately 13% to 14%.
  - At high concurrency the two are close: vLLM V1 reaches a slightly higher peak throughput (1.57 vs 1.55 req/sec), while sglang has the lowest TTFT at every tested concurrency level.
- Application Scenarios:
  - Low-latency scenarios: sglang is recommended for its lower latency at low concurrency and its faster TTFT.
  - High-throughput scenarios: Both vLLM V1 and sglang perform well; the choice depends on specific requirements.
- Recommended Configurations:
  - For high-concurrency requirements, recommended concurrency settings are:
    - vLLM V1: around 40
    - sglang: around 60
- Optimization Directions:
  - Test additional hardware configurations to measure performance under even higher loads.
  - Explore the adaptability of different engines for specific tasks and model scenarios.
Note: This report is based on performance testing of the Llama-3.1-8B-Instruct model; performance with other models requires separate testing and validation.
V. Original Benchmark Data
vLLM V0 v0.9.0
Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model │ Llama-3.1-8B-Instruct │
│ Total Generated │ 634,000.0 tokens │
│ Total Test Time │ 268.79 seconds │
│ Avg Output Rate │ 2358.71 tokens/sec │
└───────────────────────┴──────────────────────────────────┘
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 5 │ 0.29 │ 16.981 │ 16.988 │ 588.62 │ 0.318 │ 0.375 │ 0.008 │ 0.008 │ 100.0%│
│ 10 │ 0.53 │ 18.707 │ 18.719 │ 1068.23 │ 0.306 │ 0.332 │ 0.009 │ 0.009 │ 100.0%│
│ 20 │ 0.92 │ 21.738 │ 21.784 │ 1835.82 │ 0.401 │ 0.589 │ 0.011 │ 0.011 │ 100.0%│
│ 30 │ 1.19 │ 25.037 │ 25.119 │ 2388.07 │ 0.391 │ 0.643 │ 0.012 │ 0.012 │ 100.0%│
│ 40 │ 1.32 │ 27.254 │ 27.355 │ 2631.36 │ 0.434 │ 0.621 │ 0.013 │ 0.013 │ 100.0%│
│ 50 │ 1.37 │ 26.176 │ 26.241 │ 2742.96 │ 0.357 │ 0.426 │ 0.013 │ 0.013 │ 100.0%│
│ 60 │ 1.34 │ 26.708 │ 26.784 │ 2687.64 │ 0.393 │ 0.454 │ 0.013 │ 0.013 │ 100.0%│
│ 70 │ 1.37 │ 26.148 │ 26.223 │ 2744.60 │ 0.405 │ 0.478 │ 0.013 │ 0.013 │ 100.0%│
│ 80 │ 1.35 │ 26.531 │ 26.604 │ 2705.46 │ 0.425 │ 0.502 │ 0.013 │ 0.013 │ 100.0%│
│ 90 │ 1.35 │ 26.600 │ 26.672 │ 2698.57 │ 0.518 │ 0.582 │ 0.013 │ 0.013 │ 100.0%│
│ 100 │ 1.37 │ 26.157 │ 26.234 │ 2744.10 │ 0.415 │ 0.488 │ 0.013 │ 0.013 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Best Performance Configuration
Highest RPS Concurrency 50 (1.37 req/sec)
Lowest Latency Concurrency 5 (16.981 seconds)
Performance Recommendations:
• Optimal concurrency range is around 50
vLLM V1 v0.9.0
Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model │ Llama-3.1-8B-Instruct │
│ Total Generated │ 634,000.0 tokens │
│ Total Test Time │ 236.00 seconds │
│ Avg Output Rate │ 2686.49 tokens/sec │
└───────────────────────┴──────────────────────────────────┘
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 5 │ 0.32 │ 15.747 │ 15.752 │ 634.87 │ 0.276 │ 0.292 │ 0.008 │ 0.008 │ 100.0%│
│ 10 │ 0.58 │ 17.082 │ 17.096 │ 1169.81 │ 0.181 │ 0.206 │ 0.009 │ 0.009 │ 100.0%│
│ 20 │ 1.03 │ 19.303 │ 19.342 │ 2067.71 │ 0.366 │ 0.474 │ 0.009 │ 0.009 │ 100.0%│
│ 30 │ 1.36 │ 21.902 │ 22.004 │ 2726.02 │ 0.459 │ 0.673 │ 0.011 │ 0.011 │ 100.0%│
│ 40 │ 1.57 │ 22.797 │ 22.898 │ 3143.68 │ 0.397 │ 0.549 │ 0.011 │ 0.011 │ 100.0%│
│ 50 │ 1.57 │ 22.803 │ 22.911 │ 3141.16 │ 0.348 │ 0.442 │ 0.011 │ 0.011 │ 100.0%│
│ 60 │ 1.57 │ 22.872 │ 22.979 │ 3132.07 │ 0.375 │ 0.459 │ 0.011 │ 0.011 │ 100.0%│
│ 70 │ 1.56 │ 22.962 │ 23.070 │ 3120.01 │ 0.354 │ 0.437 │ 0.011 │ 0.011 │ 100.0%│
│ 80 │ 1.55 │ 23.071 │ 23.185 │ 3104.36 │ 0.384 │ 0.478 │ 0.011 │ 0.011 │ 100.0%│
│ 90 │ 1.57 │ 22.866 │ 22.984 │ 3130.58 │ 0.434 │ 0.524 │ 0.011 │ 0.011 │ 100.0%│
│ 100 │ 1.52 │ 23.590 │ 23.702 │ 3036.62 │ 0.373 │ 0.466 │ 0.012 │ 0.012 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Best Performance Configuration
Highest RPS Concurrency 40 (1.57 req/sec)
Lowest Latency Concurrency 5 (15.747 seconds)
Performance Recommendations:
• Optimal concurrency range is around 40
sglang V0.4.9
Basic Information:
┌───────────────────────┬──────────────────────────────────┐
│ Model │ Llama-3.1-8B-Instruct │
│ Total Generated │ 634,000.0 tokens │
│ Total Test Time │ 236.95 seconds │
│ Avg Output Rate │ 2675.66 tokens/sec │
└───────────────────────┴──────────────────────────────────┘
Detailed Performance Metrics
┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ ┃ ┃ Avg ┃ P99 ┃ Gen. ┃ Avg ┃ P99 ┃ Avg ┃ P99 ┃ Success┃
┃Conc. ┃ RPS ┃ Lat.(s) ┃ Lat.(s) ┃ toks/s ┃ TTFT(s) ┃ TTFT(s) ┃ TPOT(s) ┃ TPOT(s) ┃ Rate┃
┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ 5 │ 0.33 │ 15.000 │ 15.003 │ 666.54 │ 0.136 │ 0.143 │ 0.007 │ 0.007 │ 100.0%│
│ 10 │ 0.59 │ 16.829 │ 16.833 │ 1188.16 │ 0.154 │ 0.164 │ 0.008 │ 0.008 │ 100.0%│
│ 20 │ 1.03 │ 19.443 │ 19.450 │ 2056.25 │ 0.213 │ 0.236 │ 0.010 │ 0.010 │ 100.0%│
│ 30 │ 1.36 │ 22.031 │ 22.042 │ 2721.65 │ 0.236 │ 0.273 │ 0.011 │ 0.011 │ 100.0%│
│ 40 │ 1.54 │ 23.429 │ 23.442 │ 3070.85 │ 0.303 │ 0.506 │ 0.012 │ 0.012 │ 100.0%│
│ 50 │ 1.54 │ 23.375 │ 23.386 │ 3077.68 │ 0.258 │ 0.322 │ 0.012 │ 0.012 │ 100.0%│
│ 60 │ 1.55 │ 23.256 │ 23.267 │ 3093.28 │ 0.263 │ 0.328 │ 0.011 │ 0.011 │ 100.0%│
│ 70 │ 1.53 │ 23.439 │ 23.450 │ 3068.73 │ 0.273 │ 0.404 │ 0.012 │ 0.012 │ 100.0%│
│ 80 │ 1.53 │ 23.496 │ 23.510 │ 3062.05 │ 0.305 │ 0.522 │ 0.012 │ 0.012 │ 100.0%│
│ 90 │ 1.55 │ 23.186 │ 23.201 │ 3102.28 │ 0.266 │ 0.325 │ 0.011 │ 0.011 │ 100.0%│
│ 100 │ 1.54 │ 23.293 │ 23.307 │ 3088.08 │ 0.254 │ 0.302 │ 0.011 │ 0.012 │ 100.0%│
└──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
Best Performance Configuration
Highest RPS Concurrency 60 (1.55 req/sec)
Lowest Latency Concurrency 5 (15.000 seconds)
Performance Recommendations:
• Optimal concurrency range is around 60