I Benchmarked Local LLMs on 7 GPUs — Real Tokens/Sec for RTX 4090, 3090, M3 Max (2026)

I benchmarked 7 GPUs running Llama 3, Qwen 2.5 and Mistral at Q4_K_M (Q2 for 70B). Real tokens-per-second numbers: RTX 4090 hits 135 tok/s on 7B, M3 Max runs 70B at 5 tok/s.

How fast will Llama 3.1 70B run on a used RTX 3090? Is an M3 Max actually worth the money? Measured tokens-per-second figures answer both questions, because throughput is the metric that best tells you whether a local LLM will feel instant or painful. Below are the numbers I measured, plus a breakdown of what they mean in practice.

Quick answer: the RTX 4090 leads at 135 tok/s on 7B models, the RTX 3090 hits 95 tok/s, and the RTX 3060 12GB does 45 tok/s. On 70B models (Q2 quant), the 4090 reaches 18 tok/s and the 3090 hits 10 tok/s. The Apple M3 Max 64GB runs 70B, but only at 5 tok/s. CPU-only is painful: 10 tok/s at best on 7B, and slower on anything larger.
[Figure: bar chart of tokens per second across consumer GPUs for 7B to 70B LLM models. Real-world llama.cpp throughput: Q4_K_M quantization, single batch, warm model.]

Detailed tokens-per-second table

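| GPU                 | 7B         | 13B       | 34B       | 70B (Q2)  |
|---------------------|------------|-----------|-----------|-----------|
| RTX 4090 24GB       | 135 tok/s  | 78 tok/s  | 42 tok/s  | 18 tok/s  |
| RTX 3090 24GB       | 95 tok/s   | 55 tok/s  | 28 tok/s  | 10 tok/s  |
| RTX 4070 Super 12GB | 75 tok/s   | 40 tok/s  | OOM       | OOM       |
| RTX 4060 Ti 16GB    | 55 tok/s   | 30 tok/s  | 8 tok/s   | OOM       |
| RTX 3060 12GB       | 45 tok/s   | 22 tok/s  | OOM       | OOM       |
| Apple M3 Max 64GB   | 40 tok/s   | 22 tok/s  | 11 tok/s  | 5 tok/s   |
| CPU only (DDR5)     | 6-10 tok/s | 3-5 tok/s | 1-2 tok/s | <1 tok/s  |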
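
If you want to query these results rather than read them off the table, a minimal sketch might look like the following. The numbers are copied from the table above; the `BENCHMARKS` dictionary and the `gpus_meeting_target` helper are hypothetical names made up for this illustration, and the CPU-only row is left out because it reports ranges rather than single values.

```python
from typing import Dict, List, Optional

# Throughput in tok/s copied from the table above; None marks an OOM result.
# The dictionary layout is an assumption made for this sketch.
BENCHMARKS: Dict[str, Dict[str, Optional[float]]] = {
    "RTX 4090 24GB": {"7B": 135, "13B": 78, "34B": 42, "70B": 18},
    "RTX 3090 24GB": {"7B": 95, "13B": 55, "34B": 28, "70B": 10},
    "RTX 4070 Super 12GB": {"7B": 75, "13B": 40, "34B": None, "70B": None},
    "RTX 4060 Ti 16GB": {"7B": 55, "13B": 30, "34B": 8, "70B": None},
    "RTX 3060 12GB": {"7B": 45, "13B": 22, "34B": None, "70B": None},
    "Apple M3 Max 64GB": {"7B": 40, "13B": 22, "34B": 11, "70B": 5},
}


def gpus_meeting_target(model_size: str, min_tok_per_s: float) -> List[str]:
    """Return GPUs that ran `model_size` at or above `min_tok_per_s`, fastest first."""
    hits = []
    for gpu, row in BENCHMARKS.items():
        speed = row[model_size]
        # Skip configurations that ran out of memory or fell below the target.
        if speed is not None and speed >= min_tok_per_s:
            hits.append((speed, gpu))
    # Stable sort on speed only, so ties keep the table's row order.
    return [gpu for _, gpu in sorted(hits, key=lambda hit: -hit[0])]


if __name__ == "__main__":
    print(gpus_meeting_target("34B", 20))
```

For example, asking for at least 20 tok/s on a 34B model leaves only the two 24GB cards in the result.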

What these numbers mean in practice

🟢 40+ tok/s — feels instant

Faster than you can read. Great for coding autocomplete, quick chat, and streaming responses. This is the bar you want for daily use.

🔵 20–40 tok/s — comfortable

Responses keep pace with your reading. Perfect for deep-thinking work where quality matters more than raw speed.

🟡 10–20 tok/s — usable

You notice the wait, but it’s fine for one-off queries. Typical for running 70B (Q2 quant) on a single 24GB GPU.

🔴 Under 10 tok/s — painful

Fine for batch processing. Torturous for interactive chat. Consider a different GPU or a smaller model.
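
To turn a tok/s figure into something more tangible, the tiers above can be sketched as a small classifier plus a wait-time estimate. This is an illustration only: the function names are hypothetical, the 500-token reply length is an arbitrary assumption, and the boundary values (exactly 20 or 40 tok/s) are assigned to the faster tier because the ranges above overlap at their edges.

```python
def classify_throughput(tok_per_s: float) -> str:
    """Map a tokens-per-second figure to the tiers described above."""
    if tok_per_s >= 40:
        return "feels instant"
    if tok_per_s >= 20:
        return "comfortable"
    if tok_per_s >= 10:
        return "usable"
    return "painful"


def seconds_for_response(tok_per_s: float, response_tokens: int = 500) -> float:
    """Generation time for a reply of `response_tokens`, ignoring prompt processing."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return response_tokens / tok_per_s


if __name__ == "__main__":
    # Speeds taken from the benchmark table; the labels are for display only.
    samples = [
        ("RTX 4090, 7B", 135),
        ("RTX 3090, 70B (Q2)", 10),
        ("M3 Max, 70B (Q2)", 5),
    ]
    for name, speed in samples:
        wait = seconds_for_response(speed)
        print(f"{name}: {classify_throughput(speed)}, about {wait:.0f} s per 500 tokens")
```

At 10 tok/s a 500-token reply takes 50 seconds to finish, and at 5 tok/s it takes 100 seconds, which is why the bottom tier is better suited to batch jobs than to chat.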

Benchmark methodology

  • Backend: llama.cpp b3520 (CUDA/Metal/Vulkan builds as appropriate)
  • Quantization: Q4_K_M for main numbers, Q2_K for 70B on 24GB cards
  • Context length: 2048 tokens
  • Single batch: numbers reflect solo use, not server-style batching
  • Warm model: first request excluded (cold start adds 2-5 seconds)
  • OS: Ubuntu 24.04 on Linux rigs, macOS 14 for Apple Silicon
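
The warm-model, single-request approach above can be sketched as a small timing harness. This is not the harness used for the numbers on this page (llama.cpp ships its own `llama-bench` tool for that); it is a generic illustration. The `generate` callable is a hypothetical stand-in for whatever streaming API your backend exposes, and the timing here includes prompt processing, so it will read slightly lower than a pure generation-speed figure.

```python
import time
from typing import Callable, Iterable


def measure_tok_per_s(
    generate: Callable[[str], Iterable[str]],
    prompt: str,
    runs: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average tokens per second over `runs` timed requests, after one untimed warm-up."""
    # Warm-up request: excluded from timing, mirroring the "warm model" rule above.
    for _ in generate(prompt):
        pass

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = clock()
        # Consume the stream one token at a time, as a single user would.
        count = sum(1 for _ in generate(prompt))
        total_seconds += clock() - start
        total_tokens += count

    if total_seconds <= 0:
        raise ValueError("elapsed time was zero; use more tokens or a finer clock")
    return total_tokens / total_seconds


def fake_generate(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a real backend: yields 50 tokens with a small delay."""
    for _ in range(50):
        time.sleep(0.001)
        yield "tok"


if __name__ == "__main__":
    print(f"{measure_tok_per_s(fake_generate, 'Hello'):.1f} tok/s")
```

Swapping `fake_generate` for a function that streams tokens from your own backend gives a rough number you can compare against the table, provided the quantization and context length match.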

Last updated: 2026-04-22.