LLM VRAM Calculator
Will that model fit on your GPU? Estimate weights + KV cache + overhead before you download.
Quick Summary · TL;DR
VRAM needed = model weights + KV cache + compute buffer + ~1-2GB CUDA/OS overhead. Weights = params × bits-per-weight ÷ 8. The KV cache grows with context length, so a long context can cost more than the weights themselves.
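As a rough illustration, the two formulas above can be written as a minimal Python sketch. The function names, the decimal-GB convention (1 GB = 1e9 bytes), and the 1.5GB default overhead (the midpoint of the 1-2GB range) are assumptions made for this example, not the calculator's actual code.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights = params x bits-per-weight / 8, reported in GB (1e9 bytes)."""
    # params are given in billions, so the result is already in GB.
    return params_billion * bits_per_weight / 8


def total_vram_gb(weights: float, kv_cache: float, compute_buffer: float,
                  overhead: float = 1.5) -> float:
    """Sum of the four components; overhead default is an assumed midpoint."""
    return weights + kv_cache + compute_buffer + overhead


# Hypothetical 7B-parameter model stored at 16 bits per weight (FP16).
w = weights_gb(7, 16)
# The KV cache and compute buffer sizes here are placeholder inputs.
print(w, total_vram_gb(w, kv_cache=1.0, compute_buffer=0.5))
```

Because every term is additive, shrinking any one component (fewer bits per weight, a shorter context) lowers the total by the same amount.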
What eats your VRAM
- Model weights: the biggest chunk; shrinks in direct proportion to the bits per weight when you quantize.
- KV cache: scales linearly with context length and becomes substantial at 32K+ tokens.
- Compute buffer: transient working memory for attention.
- CUDA + desktop overhead: ~0.6GB for the CUDA context plus ~1.2GB for the desktop on Windows.
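To see why the KV cache tracks context length, here is a minimal sketch of the standard sizing rule for a transformer that stores one key and one value vector per layer, per KV head, per token. The model shape used below (32 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical example, and the function name is an assumption; real runtimes may allocate differently or quantize the cache.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB (1e9 bytes) for a single sequence."""
    # Leading factor 2: one K tensor and one V tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_elem)
    return total_bytes / 1e9


# Hypothetical model shape; only the context length changes between rows.
for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(32, 8, 128, ctx):.2f} GB")
```

Quadrupling the context quadruples the cache, which is how a long context can end up costing as much as the weights of a heavily quantized model.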
Frequently asked questions
How much VRAM do I need to run an LLM locally?
It depends on the model size, the quantization, and your context length. Weights usually dominate, and the KV cache grows on top of them as the context gets longer. This calculator also adds CUDA and OS overhead, so the estimate reflects what actually loads.
What quantization should I use to fit a bigger model?
Q4_K_M is the popular sweet spot, cutting weights to a bit under a third of FP16 size (roughly 4.8 bits per weight versus 16) with minimal quality loss. Dropping to Q3 or Q2 fits even larger models, but quality degrades noticeably.
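The trade-off can be sketched as a small fit check. The bits-per-weight figures below are approximate effective values assumed for illustration (actual file sizes vary by model and quantizer version), and `reserved_gb` is a hypothetical allowance for the KV cache, compute buffer, and overhead.

```python
# Approximate effective bits per weight; assumed values for illustration.
APPROX_BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def quantizations_that_fit(params_billion: float, vram_gb: float,
                           reserved_gb: float) -> list[str]:
    """Return the formats whose weights fit in VRAM after the reservation."""
    budget = vram_gb - reserved_gb
    return [name for name, bpw in APPROX_BITS_PER_WEIGHT.items()
            if params_billion * bpw / 8 <= budget]


# Hypothetical 13B model on a 12GB card, keeping 3GB for everything else.
print(quantizations_that_fit(13, vram_gb=12, reserved_gb=3))
```

Under these assumed numbers only the 4-bit format fits, which matches the usual advice to quantize rather than shorten the context first.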
Why does my model need more VRAM than its file size?
The file holds only the weights. At run time you also pay for the KV cache (which scales with context length), a compute buffer, and around 1-2GB of CUDA and desktop overhead on Windows.