Select your GPU and model below. Then enter your desired max context length and max output tokens, and we’ll show you the results described below.
5. Results
- Model Parameter Size
- Estimated KV-Cache Size
- Total Memory Required
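As a rough illustration of where these numbers come from (not necessarily the exact formula this calculator uses): weight memory is approximately the parameter count times the bytes per parameter, and the KV cache grows with the number of layers, KV heads, head dimension, and total token count. The model shape in the sketch below (70B parameters, 80 layers, 8 KV heads, head dimension 128, FP16) is an assumed Llama-style example, not an output of the tool.

```python
# Illustrative memory estimate for serving an LLM (single sequence, FP16).
# The calculator's exact formula may differ; the model-shape values in the
# example call are assumptions.

def estimate_memory_gb(n_params_b: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, total_tokens: int,
                       bytes_per_elem: int = 2) -> dict:
    gib = 1024 ** 3
    # Weights: parameter count x bytes per parameter
    weights = n_params_b * 1e9 * bytes_per_elem
    # KV cache: 2 (keys + values) x layers x KV heads x head dim x tokens x bytes
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * total_tokens * bytes_per_elem
    return {
        "model_parameter_size_gb": weights / gib,
        "kv_cache_size_gb": kv_cache / gib,
        "total_memory_gb": (weights + kv_cache) / gib,
    }

# Assumed Llama-3.1-70B-like shape, 8192 tokens of context + output
print(estimate_memory_gb(70, 80, 8, 128, 8192))
```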
Multi-GPU Configurations
Understanding Parallelism Strategies:
- Tensor Parallelism (TP): Divides each layer across GPUs. Lower latency, higher throughput, requires fast GPU interconnect.
- Pipeline Parallelism (PP): Divides layers sequentially across GPUs. Higher latency, more memory efficient for KV cache.
- Optimal Config: marked with a ✓ and highlighted in green; it balances memory efficiency and performance.
The following configurations distribute the model across multiple
GPUs:
GPUs | Tensor Parallel | Pipeline Parallel | Memory Per GPU (GB) | Remaining RAM (GB) | Choose
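The per-GPU figures in this table can be approximated by splitting the weight and KV-cache totals across all GPUs in the TP × PP group and adding a fixed per-GPU overhead for activations and the CUDA context. Below is a minimal sketch under those simplifying assumptions; the function, overhead value, and example numbers are illustrative, not the calculator's actual logic.

```python
# Sketch of how a multi-GPU configuration table can be derived.
# Assumption: weights and KV cache split evenly across all TP x PP GPUs,
# plus a fixed per-GPU overhead for activations and the CUDA context.

from itertools import product

def configs(model_gb: float, kv_cache_gb: float, gpu_mem_gb: float,
            max_gpus: int = 8, overhead_gb: float = 2.0):
    rows = []
    for tp, pp in product((1, 2, 4, 8), repeat=2):
        n_gpus = tp * pp
        if n_gpus > max_gpus:
            continue
        per_gpu = (model_gb + kv_cache_gb) / n_gpus + overhead_gb
        remaining = gpu_mem_gb - per_gpu
        if remaining >= 0:  # configuration fits on this GPU
            rows.append((n_gpus, tp, pp, round(per_gpu, 1), round(remaining, 1)))
    return sorted(rows)

# Example: ~140 GB of weights, ~3 GB of KV cache, 80 GB GPUs
for row in configs(140.0, 3.0, 80.0):
    print(row)
```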
Suggested vLLM Command Line:
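For example, for a hypothetical 4-GPU tensor-parallel configuration with an 8K context, the suggested command would look roughly like the following; the model name, parallel sizes, context length, and memory utilization are placeholders for your own selections, assuming a recent vLLM release that provides the `vllm serve` entrypoint:

```
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```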