Question 1

How much VRAM do I need to run an LLM locally?

Accepted Answer

It depends on three things: model size, quantization, and context length. The model weights dominate, and the KV cache grows on top of them as the context gets longer. This calculator also adds CUDA and OS overhead, so the number reflects what actually has to fit in VRAM.
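To make the context-length dependence concrete, here is a minimal sketch of the usual KV cache estimate for a transformer with grouped-query attention. The function name and the example shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) are illustrative assumptions, not values taken from this calculator.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate bytes held by the KV cache at a given context length."""
    # The leading 2 accounts for one tensor of keys plus one of values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Assumed example shape; swap in the numbers from your model's config.
for ctx in (2048, 8192, 32768):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=ctx)
    print(f"context {ctx:>6}: {size / GIB:.2f} GiB")
```

With these assumed dimensions the cache grows linearly: 0.25 GiB at 2048 tokens, 1.00 GiB at 8192, and 4.00 GiB at 32768.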

Question 2

What quantization should I use to fit a bigger model?

Accepted Answer

Q4_K_M is the popular sweet spot, cutting weights to a bit over a quarter of FP16 size (roughly 4.5 to 5 bits per weight instead of 16) with minimal quality loss. Dropping to Q3 or Q2 fits even larger models, but quality degrades noticeably.
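As a rough illustration of how quantization changes weight size, the sketch below multiplies parameter count by bits per weight. The bits-per-weight figures are approximate assumptions for common GGUF quantization types (actual files vary by model and tooling version), and the helper name is hypothetical.

```python
# Approximate effective bits per weight; treat these as ballpark assumptions.
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}

def weight_size_gb(params_billion, bits_per_weight):
    """Weight size in GB (10^9 bytes) for a model with the given parameter count."""
    # billions of params * bits / 8 bits-per-byte gives billions of bytes
    return params_billion * bits_per_weight / 8

# Example: compare quantizations for an assumed 70B-parameter model.
for name, bpw in APPROX_BITS_PER_WEIGHT.items():
    print(f"{name:>7}: {weight_size_gb(70, bpw):6.1f} GB")
```

Under these assumptions a 70B model is about 140 GB at FP16 and about 42 GB at Q4_K_M, which is why quantization decides whether a model fits at all.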

Question 3

Why does my model need more VRAM than its file size?

Accepted Answer

Because the weights are only one part of the footprint. You also pay for the KV cache (which scales with context length), a compute buffer, and around 1-2 GB of CUDA and desktop overhead on Windows.
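The breakdown above can be sketched as a simple sum. The function name is hypothetical, the 0.5 GB compute buffer is a placeholder assumption (it varies with backend and batch size), and the 1.5 GB overhead is just the midpoint of the 1-2 GB range mentioned in the answer.

```python
def total_vram_gb(weights_gb, kv_cache_gb, compute_buffer_gb=0.5, overhead_gb=1.5):
    """Sum the main VRAM consumers; every default here is an assumed placeholder."""
    return weights_gb + kv_cache_gb + compute_buffer_gb + overhead_gb

# Example: an assumed 4.9 GB model file with a 1.0 GB KV cache.
file_size_gb = 4.9
needed = total_vram_gb(weights_gb=file_size_gb, kv_cache_gb=1.0)
print(f"file size: {file_size_gb:.1f} GB, estimated VRAM: {needed:.1f} GB")
```

In this example the model needs roughly 3 GB more than its file size, so a file that looks like it fits an 8 GB card leaves very little headroom.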

LLM VRAM Calculator

What eats your VRAM
