Question 1

How many tokens per second does an RTX 4090 do for a 7B LLM?

Accepted Answer

In our llama.cpp tests with Q4_K_M quantization, the RTX 4090 24GB hits ~135 tok/s on 7B models, ~78 tok/s on 13B, and ~42 tok/s on 34B. The 70B figure, ~18 tok/s, was measured at Q2 rather than Q4_K_M. For chat-style use, anything above 20 tok/s feels instant.

Question 2

Is an RTX 3090 fast enough for 70B Llama 3?

Accepted Answer

Yes, but with caveats. A single 24GB 3090 hits ~10 tok/s on 70B Q2, which is usable but not snappy. For comfortable 70B inference at higher quants, you want 2x 3090 (NVLink optional) or a step up to 48GB+ cards. VRAM capacity is not the only limit: the 3090's memory bandwidth (936 GB/s) also caps how fast tokens can be generated.
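The bandwidth point can be made concrete with a rough rule of thumb: during single-stream decoding, the GPU reads approximately all model weights once per generated token, so memory bandwidth divided by model size gives an upper bound on tok/s. The sketch below is only that, a ceiling estimate. The function name and the 22 GB file size for a 70B Q2-class model are illustrative assumptions, not measured values, and real throughput lands well below the ceiling.

```python
def bandwidth_bound_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Rough upper bound on decode speed for a memory-bandwidth-bound model.

    Assumes every generated token requires reading all weights once,
    and ignores KV cache reads, dequantization cost, and kernel overhead.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# RTX 3090 bandwidth from the answer above; 22 GB is an ASSUMED file size
# for a 70B Q2-class quant, chosen only for illustration.
ceiling = bandwidth_bound_tok_s(bandwidth_gb_s=936.0, model_size_gb=22.0)
print(f"theoretical ceiling: {ceiling:.1f} tok/s")
```

Comparing this ceiling with a measured figure shows how much headroom is lost to overhead; a faster card with the same VRAM raises the ceiling only in proportion to its bandwidth.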

Question 3

What's faster for local LLMs: RTX 4090 or Apple M3 Max?

Accepted Answer

For 7B–34B models the 4090 wins on raw speed by 1.5–2x. For 70B+ at Q4 or higher quants, the M3 Max with 64GB+ unified memory wins because the 4090 can't fit the model in 24GB and falls back to slow CPU offload. The M3 Max is also far quieter and ~5x more power-efficient. Pick the 4090 for speed, the M3 Max for big models in a quiet home.
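Whether a model fits can be estimated before downloading anything. The sketch below multiplies parameter count by bits per weight and adds a flat allowance for KV cache and runtime buffers. The ~4.85 bits per weight for Q4_K_M and the 2 GB overhead are approximate assumptions for illustration, and both helper names are hypothetical.

```python
def weights_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    # (params * 1e9) * (bits / 8) bytes, divided by 1e9 to get GB
    return n_params_billion * bits_per_weight / 8.0


def fits_in_memory(n_params_billion: float, bits_per_weight: float,
                   memory_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an ASSUMED flat overhead fit in memory_gb."""
    return weights_size_gb(n_params_billion, bits_per_weight) + overhead_gb <= memory_gb


Q4_K_M_BPW = 4.85  # approximate, assumed for illustration

for name, memory_gb in [("RTX 4090 24GB", 24.0), ("M3 Max 64GB", 64.0)]:
    ok = fits_in_memory(70, Q4_K_M_BPW, memory_gb)
    print(f"70B Q4_K_M on {name}: {'fits' if ok else 'does not fit'}")
```

Under these assumptions a 70B model at Q4_K_M needs a little over 42 GB for weights alone, which is why it overflows a 24GB card but sits comfortably in 64GB of unified memory.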

Question 4

How many tokens per second is fast enough for chat?

Accepted Answer

Above ~10 tok/s it reads at human speed; above 20 tok/s feels instant; below 5 tok/s feels painfully slow. For coding assistants where you read carefully, 8–10 tok/s is fine. For long-form generation or agents calling the model in a loop, you want 30+ tok/s.

Question 5

Why is my LLM slow even with a good GPU?

Accepted Answer

Most common causes: the model is partially offloaded to CPU/RAM (check VRAM usage during inference), the quantization is heavier than you need (Q8 is ~2x slower than Q4 for the same model), the context window is too large (the KV cache grows linearly with context length), or the runtime is not putting layers on the GPU (--n-gpu-layers in llama.cpp). LM Studio's GPU offload slider is a common culprit when set too low.
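To see how much VRAM the context window takes, the KV cache size can be computed directly: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes an fp16 cache and a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128); those shape values are assumptions for illustration, and models using grouped-query attention have fewer KV heads and a smaller cache.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache in bytes for a full context of n_ctx tokens.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# ASSUMED Llama-2-7B-style shape, used only as an example.
for n_ctx in (4096, 32768):
    size_gib = kv_cache_bytes(32, 32, 128, n_ctx) / 2**30
    print(f"n_ctx={n_ctx}: {size_gib:.1f} GiB")
```

Because the growth is linear, an 8x larger context costs 8x the cache. On a 12GB or 16GB card that can be enough to push layers off the GPU, which then shows up as a sudden drop in tok/s.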

Question 6

Can I run LLMs on CPU only? How slow is it really?

Accepted Answer

Yes, llama.cpp runs on any modern CPU. Expect roughly 5–10 tok/s on 7B Q4 with a fast desktop CPU (e.g. Ryzen 7 7700X), 1–3 tok/s on 13B, and under 1 tok/s for 34B+. Apple Silicon CPU inference is faster than x86 because of its higher memory bandwidth. CPU-only is workable for occasional queries but painful for daily chat.

GPU	7B	13B	34B	70B (Q2)
RTX 4090 24GB	135 tok/s	78 tok/s	42 tok/s	18 tok/s
RTX 3090 24GB	95 tok/s	55 tok/s	28 tok/s	10 tok/s
RTX 4070 Super 12GB	75 tok/s	40 tok/s	OOM	OOM
RTX 4060 Ti 16GB	55 tok/s	30 tok/s	8 tok/s	OOM
RTX 3060 12GB	45 tok/s	22 tok/s	OOM	OOM
Apple M3 Max 64GB	40 tok/s	22 tok/s	11 tok/s	5 tok/s
CPU only (DDR5)	6-10 tok/s	3-5 tok/s	1-2 tok/s	<1 tok/s

LLM Tokens/Sec Benchmarks 2026: RTX 4090 vs 3090, 7B-70B Q4 (llama.cpp)

Detailed tokens-per-second table

What these numbers mean in practice

🟢 40+ tok/s — feels instant

🔵 20–40 tok/s — comfortable

🟡 10–20 tok/s — usable

🔴 Under 10 tok/s — painful
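The scale above is easy to apply to your own benchmark output. The following is a minimal sketch of it as a lookup function; the function name is hypothetical, and treating the boundary values (exactly 10, 20, or 40 tok/s) as belonging to the higher tier is an assumption, since the scale does not say.

```python
def chat_feel(tok_s: float) -> str:
    """Map a tokens-per-second figure onto the four tiers listed above."""
    if tok_s >= 40:
        return "feels instant"
    if tok_s >= 20:
        return "comfortable"
    if tok_s >= 10:
        return "usable"
    return "painful"


# Figures taken from the RTX 4090 row of the table.
for label, tok_s in [("7B", 135), ("13B", 78), ("34B", 42), ("70B (Q2)", 18)]:
    print(f"{label}: {tok_s} tok/s -> {chat_feel(tok_s)}")
```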
