Running LLMs locally? The single most important question is: how much VRAM do you actually need? This calculator gives you the exact answer for every popular open-source model, at every quantization level, with context length scaling.
๐ Part of the Local LLMs in 2026 guide
How Much VRAM Do You Need?
Select a model โ see exact VRAM requirements at every quantization level โ find the cheapest GPU that runs it.
VRAM estimates are for Ollama / llama.cpp GGUF format. Actual usage varies by backend, batch size, and KV cache. Context length adds ~0.5-2 GB above base model weight. Amazon links are affiliate โ we may earn a commission at no extra cost.
Understanding VRAM and Quantization
VRAM (Video RAM) is the GPU memory that holds the model weights during inference. Unlike system RAM, VRAM is fast enough for the matrix multiplications that LLMs need. The amount of VRAM you need depends on two things: model size (parameter count) and quantization level (how compressed the weights are).
Quantization Levels Explained
Q4_K_M is the sweet spot for most users โ it reduces VRAM by ~75% compared to FP16 with only ~5% quality loss. Q8 is near-lossless but uses double the VRAM. FP16 (full precision) is for fine-tuning and research only.
Context Length Impact
Longer context windows (more tokens in the conversation) require additional VRAM for the KV cache. A 4K context adds minimal overhead. At 32K+ tokens, expect 1-3 GB of additional VRAM usage depending on the model architecture.
VRAM Requirements Quick Reference
| Model | Q4 VRAM | Minimum GPU | Recommended GPU |
|---|---|---|---|
| Llama 3.1 8B | 5 GB | RTX 3060 12GB | RTX 3060 12GB |
| Qwen 2.5 14B | 8.5 GB | RTX 3060 12GB | RTX 4060 Ti 16GB |
| Gemma 2 27B | 16 GB | RTX 4060 Ti 16GB | RTX 3090 24GB |
| Llama 3.1 70B | 40 GB | 2x RTX 3090 | Mac Studio 64GB |
| Llama 3.1 405B | 230 GB | Cloud GPU | Cloud GPU |
Frequently Asked Questions
How much VRAM do I need to run Llama 3 8B locally?
Llama 3.1 8B requires approximately 5 GB of VRAM at Q4_K_M quantization. An NVIDIA RTX 3060 12GB is the minimum recommended GPU.
Can I run a 70B model with 24 GB VRAM?
Not fully in VRAM. A 70B model at Q4 needs ~40 GB. With 24 GB (RTX 3090/4090), you can run it with partial CPU offloading, but expect 3-5x slower inference.
What is Q4 vs Q8 vs FP16 quantization?
Q4_K_M uses 4-bit precision, reducing VRAM by ~75% with minimal quality loss. Q8 uses 8-bit precision for near-lossless quality. FP16 is full precision for fine-tuning.
Does context length affect VRAM usage?
Yes. The KV cache grows with context length. At 4K tokens the overhead is minimal. At 32K+ tokens, expect 1-3 GB additional VRAM.
Is Apple Silicon good for running LLMs?
Yes. Apple Silicon uses unified memory shared between CPU and GPU. The Mac Studio M3 Max with 64 GB can run 70B models entirely in memory.
What is the cheapest GPU for local AI in 2026?
The NVIDIA RTX 3060 12GB (~$300 used) is the best entry point. It runs all 7-8B models comfortably at Q4 with 35+ tokens/sec.