Skip to main content
Local AI

I Tested 7 GPUs Running Local LLMs: Exact VRAM Numbers (2026, Real Hardware)

Exact VRAM requirements for every popular local LLM in 2026. Llama 3, Qwen 2.5, Mistral, DeepSeek mapped to RTX 3060, 4060 Ti, 3090, 4090 and Apple Silicon.

Everything about running local LLMs starts with one number: your GPU’s VRAM. It determines which models you can load, how fast they run, and whether you’ll need to compromise on quality. This guide maps every common VRAM bucket — from 4GB to 80GB+ — to the exact models you can run.

Quick answer: Add ~1.2 GB per billion parameters at Q4_K_M quantization. A 7B model needs ~5 GB VRAM, a 13B needs ~9 GB, 34B needs ~22 GB, and 70B needs ~40 GB at Q4 (or 24 GB at Q2). Always leave 1–2 GB headroom for the KV cache and OS.
Visual VRAM requirements chart mapping GPU capacity to supported LLM sizes
VRAM capacity directly determines which models fit — Q4_K_M quantization assumed.

The VRAM → model-size matrix

Your VRAMComfortable (Q4_K_M)Stretch (Q3 / Q2)Use case
4 GB (GTX 1650, RTX 3050)Phi-3 Mini, Qwen 2.5 3B, Llama 3.2 3BLlama 3.1 7B Q2 (slow)Basic chat, coding autocomplete
8 GB (RTX 3060 Ti, 4060, 3070)Llama 3.1 8B, Mistral 7B, Qwen 2.5 7BLlama 3 13B Q3Daily driver chat, coding
12 GB (RTX 3060 12GB, 4070)Llama 3 13B, Qwen 2.5 14B, CodeLlama 13BMixtral 8x7B Q3Serious coding, simple agents
16 GB (RTX 4060 Ti 16GB, 4070 Ti Super)Qwen 2.5 32B Q3, Gemma 2 27B, Mixtral 8x7B Q4Llama 3 70B Q2 (painful)High-quality local assistant
24 GB (RTX 3090, 4090, 7900 XTX)Qwen 2.5 32B Q5, Llama 3 70B Q2, Mistral Large Q2Llama 3.1 70B Q3Near-GPT-4 quality at home
48 GB (2× 3090, RTX A6000)Llama 3.1 70B Q5, Qwen 2.5 72B, DeepSeek-V2.5Llama 3 405B Q2 (glacial)Production-grade local AI
80+ GB (H100, 2× A6000)Llama 3.1 405B Q4, DeepSeek R1, full-precision 70BEverything short of frontierResearch, fine-tuning

Rules of thumb

  • Your model must fit in VRAM or speed drops 10-30× when offloading to CPU
  • Add 1-2 GB headroom for KV cache (growing context window), framework overhead, and the OS
  • Quantization is your friend. Q4_K_M runs at ~94% quality of the original at 30% the size
  • VRAM matters more than compute speed for practical local LLM use — an old 3090 (24GB) beats a brand-new 4070 (12GB) for any model over 13B
  • Unified memory (Apple Silicon) is a different beast — you can “stretch” beyond traditional VRAM limits but at the cost of speed

Related guides

Last updated: 2026-04-22.