Select your GPU and model below. Then enter your desired max context length and max output tokens, and we’ll show you the results described below.
5. Results
- Model Parameter Size
- Estimated KV-Cache Size
- Total Memory Required
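As a rough illustration of where these numbers come from (not necessarily the exact formula this calculator uses): weight memory is approximately the parameter count times the bytes per parameter, and the KV cache grows with the number of layers, KV heads, head dimension, and total token count. The model shape in the sketch below (70B parameters, 80 layers, 8 KV heads, head dimension 128, FP16) is an assumed Llama-style example, not an output of the tool.

```python
# Illustrative memory estimate for serving an LLM (single sequence, FP16).
# The calculator's exact formula may differ; the model-shape values in the
# example call are assumptions.

def estimate_memory_gb(n_params_b: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, total_tokens: int,
                       bytes_per_elem: int = 2) -> dict:
    gib = 1024 ** 3
    # Weights: parameter count x bytes per parameter
    weights = n_params_b * 1e9 * bytes_per_elem
    # KV cache: 2 (keys + values) x layers x KV heads x head dim x tokens x bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * total_tokens * bytes_per_elem
    return {
        "model_parameter_size_gb": weights / gib,
        "kv_cache_size_gb": kv_cache / gib,
        "total_memory_gb": (weights + kv_cache) / gib,
    }

# Assumed Llama-3.1-70B-like shape, 8192 tokens of context + output
print(estimate_memory_gb(70, 80, 8, 128, 8192))
```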
Multi-GPU Configurations
Understanding Parallelism Strategies:
- Tensor Parallelism (TP): Divides each layer across GPUs. Lower latency, higher throughput, requires fast GPU interconnect.
- Pipeline Parallelism (PP): Divides layers sequentially across GPUs. Higher latency, more memory efficient for KV cache.
- Optimal Config: marked with a ✓ and highlighted in green; it balances memory efficiency and performance.
The following configurations distribute the model across multiple
GPUs:
GPUs | Tensor Parallel | Pipeline Parallel | Memory Per GPU (GB) | Remaining RAM (GB) | Choose
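The per-GPU figures in this table can be approximated by splitting the weight and KV-cache totals across all GPUs in the TP × PP group and adding a fixed per-GPU overhead for activations and the CUDA context. Below is a minimal sketch under those simplifying assumptions; the function, overhead value, and example numbers are illustrative, not the calculator's actual logic.

```python
# Sketch of how a multi-GPU configuration table can be derived.
# Assumption: weights and KV cache split evenly across all TP x PP GPUs,
# plus a fixed per-GPU overhead for activations and the CUDA context.

from itertools import product

def configs(model_gb: float, kv_cache_gb: float, gpu_mem_gb: float,
            max_gpus: int = 8, overhead_gb: float = 2.0):
    rows = []
    for tp, pp in product((1, 2, 4, 8), repeat=2):
        n_gpus = tp * pp
        if n_gpus > max_gpus:
            continue
        per_gpu = (model_gb + kv_cache_gb) / n_gpus + overhead_gb
        remaining = gpu_mem_gb - per_gpu
        if remaining >= 0:  # configuration fits on this GPU
            rows.append((n_gpus, tp, pp, round(per_gpu, 1), round(remaining, 1)))
    return sorted(rows)

# Example: ~140 GB of weights, ~3 GB of KV cache, 80 GB GPUs
for row in configs(140.0, 3.0, 80.0):
    print(row)
```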
Suggested vLLM Command Line:
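For example, for a hypothetical 4-GPU tensor-parallel configuration with an 8K context, the suggested command would look roughly like the following; the model name, parallel sizes, context length, and memory utilization are placeholders for your own selections, assuming a recent vLLM release that provides the `vllm serve` entrypoint:

```
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```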