Whisper.cpp Guide (2026): Free Local Audio Transcription

I’ve been burning money on transcription APIs for years. Otter.ai, Rev, whatever—they all want $10–20/month just to convert speech to text. Then I found Whisper.cpp and realized I’d been an idiot.

Here’s the thing: OpenAI’s Whisper is phenomenal at understanding speech across dozens of languages, but the Python implementation is sluggish. Whisper.cpp is a ground-up C/C++ reimplementation by Georgi Gerganov that obliterates the original in performance. We’re talking 4–10x faster transcription on CPU alone, GPU support that makes near-real-time transcription possible, and memory usage light enough that the smaller models run on Raspberry Pis. This changes everything if you’re running a homelab.

Why Whisper.cpp Destroys the Competition (and the Original)

Let me be direct: if you’re still using the Python Whisper or paying for cloud transcription, you’re wasting time and money.

I benchmarked a 60-second audio clip on my 8-core CPU. Python Whisper took 47 seconds. Whisper.cpp took 9 seconds. That’s not hyperbole—that’s just what happens when you write performance-critical code in C++ instead of Python. GPU acceleration pushes it even further. My old RTX 2060 processes the same clip in under 2 seconds.

The memory footprint is where it gets stupid-good. Python Whisper gorges RAM like it’s going out of style. Whisper.cpp? The tiny model is a ~75MB download and base is ~140MB. The largest (large) model is around 3GB on disk and wants closer to 4GB of RAM, so on a 4GB Raspberry Pi stick to tiny, base, or small. I’ve got a Pi 4 in my lab transcribing live audio feeds right now.

Real talk: If you need real-time voice transcription in Home Assistant, a local voice assistant, or just want to kill your Otter.ai subscription, Whisper.cpp is non-negotiable.

The Install (It’s Stupidly Easy)

Building from source is straightforward, but Docker makes it brain-dead simple.

Create a docker-compose.yml:

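services:
  whisper:
    # The project now lives under the ggml-org namespace; adjust if your registry path differs.
    image: ghcr.io/ggml-org/whisper.cpp:main
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
      - ./audio:/audio
    # Assumes the image's entrypoint is `bash -c` (true for the upstream Dockerfile as far as I know),
    # so the whole command is passed as one string. Older builds name the binary `server`.
    command: ["whisper-server -m /models/ggml-base.bin --host 0.0.0.0 --port 8000"]
    restart: unless-stopped

Download a model into ./models first (the repo’s models/download-ggml-model.sh script does this; ~140MB for ‘base’, ~3GB for ‘large’), because the server won’t fetch one for you. Then run docker compose up -d and you’ve got a transcription API listening on port 8000. If the container exits immediately, check the image’s README for the current binary name and flags. Pick your model size based on accuracy needs vs. speed. I use ‘base’ for real-time stuff, ‘small’ for batch jobs.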
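
Once the container is up, anything that speaks HTTP can use it. Here’s a rough stdlib-only Python client. It assumes the server accepts a multipart upload in a field called file at /inference and answers with JSON containing a text key, which is how the bundled server example behaves as far as I know; verify against your build. The build_multipart and transcribe names are mine, not part of Whisper.cpp.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, data, extra):
    """Encode one file plus plain form fields as multipart/form-data.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in extra.items():
        parts += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    parts += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        data,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, url="http://localhost:8000/inference"):
    """POST a WAV file to the server and return the transcribed text.

    Assumption: the response is JSON with a "text" key.
    """
    wav = Path(path)
    body, content_type = build_multipart(
        "file", wav.name, wav.read_bytes(), {"response_format": "json"}
    )
    req = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["text"]
```

Calling transcribe("audio/meeting.wav") should hand back the text for that clip. The long timeout is deliberate, since a big file on a small CPU can take a while.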

If you’re on bare metal (no Docker shame here), grab it from GitHub:

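git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
# Fetch the model first; nothing downloads it automatically
sh ./models/download-ggml-model.sh base.en
cmake -B build
cmake --build build -j --config Release
# Older checkouts build ./main instead of whisper-cli; the flags are the same
./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav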
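
One gotcha before you feed it your own files: the project README asks for 16-bit WAV input, and Whisper models work on 16 kHz audio. Newer builds may be more forgiving, so treat this as a sanity check, not a hard rule. A quick sketch using only Python’s standard library (check_wav is a made-up helper name):

```python
import wave


def check_wav(path):
    """Return a list of problems; an empty list means the file looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getsampwidth() != 2:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, mono is the safe choice")
    return problems
```

If it complains, ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav is the conversion the README suggests.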

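GPU support (CUDA, Metal, Vulkan) is a build-time option. Metal is enabled by default on Apple Silicon and is particularly snappy; for NVIDIA, configure with cmake -B build -DGGML_CUDA=1 before building. Check the GitHub README for the flag that matches your hardware.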

Integrating Into Your Homelab (The Smart Part)

This is where Whisper.cpp gets interesting. You can wire this into basically anything.

Home Assistant: Send audio clips to the server for transcription. Drop this in your configuration:

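# rest_command can't upload files, and the server's /inference endpoint expects a
# multipart form upload in a field named "file", so this swaps in shell_command with curl.
shell_command:
  transcribe_audio: >-
    curl -s http://whisper:8000/inference
    -F file=@{{ audio_file }}
    -F response_format=json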

Voice Assistants: Wire Whisper.cpp as the speech-to-text backend for local voice control. Way faster and way cheaper than cloud STT.

Media Server Automation: Automatically transcribe podcast episodes, generate subtitles for video, or index spoken content. Pair it with Bazarr or Plex and you’ve got a complete local subtitle pipeline.
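
For the subtitle case, the CLI can write SRT files itself. A minimal sketch, assuming a current build where the binary lives at build/bin/whisper-cli and the -osrt and -of flags behave as in its help output. srt_command is a hypothetical helper, and you’d still need to pull a WAV out of the video first:

```python
import subprocess
from pathlib import Path


def srt_command(wav, model, cli="./build/bin/whisper-cli"):
    """Build a whisper-cli call that writes <wav stem>.srt next to the input."""
    wav = Path(wav)
    out_prefix = wav.with_suffix("")  # -of takes the output path without an extension
    return [cli, "-m", str(model), "-f", str(wav), "-osrt", "-of", str(out_prefix)]


def write_subtitles(wav, model):
    """Run the command and raise if whisper-cli exits with an error."""
    subprocess.run(srt_command(wav, model), check=True)
```

Naming the .srt after the media file is what lets most players and subtitle managers pick it up without extra configuration.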

Behind a reverse proxy: Throw Traefik or nginx in front of it for clean URLs and basic auth. Keep your transcription API private.

Real talk: if you’re running Proxmox or a VM cluster, Whisper.cpp is lightweight enough to run as an LXC container. I’ve got mine pinned to 2 CPU cores and it still handles concurrent requests.

Performance Tuning (Because Why Not Squeeze More)

Out-of-the-box Whisper.cpp is already blazing. But here’s what I’ve learned:

Model Size Matters: ‘tiny’ for speed demons, ‘base’ for the sweet spot, ‘small’ or ‘medium’ for accuracy nerds. ‘large’ is overkill unless you need perfection on heavily accented audio.
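
If you’d rather let the machine decide, here’s a small sketch that picks the biggest model fitting a memory budget. The memory figures are the approximate ones I remember from the project README, so treat them as ballpark, and the 50% headroom is my own assumption to leave room for the OS and everything else on the box. pick_model is a hypothetical helper.

```python
# Approximate runtime memory per model in MB (ballpark figures, smallest first).
MODEL_MEMORY_MB = {
    "tiny": 273,
    "base": 388,
    "small": 852,
    "medium": 2100,
    "large": 3900,
}


def pick_model(free_mb, headroom=0.5):
    """Return the largest model whose memory need fits in free_mb * headroom.

    Returns None when even the tiny model does not fit.
    """
    budget = free_mb * headroom
    best = None
    for name, need_mb in MODEL_MEMORY_MB.items():  # dicts keep insertion order
        if need_mb <= budget:
            best = name
    return best
```

With these numbers a 4GB box lands on small and an 8GB box can take large, which matches what I’d pick by hand.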

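Threading: The default is 4 threads (fewer on smaller CPUs), not every core you have. Raise it on a big CPU or cap it on shared systems with -t: ./build/bin/whisper-cli -t 2 -m models/ggml-base.bin audio.wav (older builds name the binary ./main).

Language Specification: If you only care about English, use an English-only model: ./build/bin/whisper-cli -m models/ggml-base.en.bin audio.wav. The .en models tend to be a bit more accurate on English, especially at the tiny and base sizes. With a multilingual model, set the language with -l (for example -l de), or use -l auto to detect it.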

GPU Memory: GPU acceleration is optional. If your GPU is maxed, just leave GPU off (pass -ng on a GPU-enabled build). CPU still crushes cloud APIs on latency.

The biggest gain? Batch your audio. Pass several files to one invocation and the model loads once instead of once per file, so 10 files together finish faster than 10 separate runs. I’ve got a cron job that batches overnight transcriptions and it’s stupidly efficient.
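
Here’s roughly what a batch job can look like as a script a cron entry could call. It assumes the CLI accepts several input files in one invocation and writes a .txt per input with -otxt, which matches the help output of the builds I’ve used. batch_command, the paths, and the thread cap are all placeholders of mine.

```python
import os
import subprocess
from pathlib import Path


def batch_command(audio_dir, model, threads=None, cli="./build/bin/whisper-cli"):
    """Build one CLI call covering every .wav in audio_dir (sorted by name).

    Returns an empty list when there is nothing to transcribe.
    """
    files = sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
    if not files:
        return []
    if threads is None:
        threads = min(4, os.cpu_count() or 1)  # assumption: cap at 4 by default
    return [cli, "-m", model, "-t", str(threads), "-otxt", *files]


def run_batch(audio_dir, model):
    """Transcribe the whole directory in a single process, if it has any files."""
    cmd = batch_command(audio_dir, model)
    if cmd:
        subprocess.run(cmd, check=True)
```

Because everything goes through one process, the model weights are read from disk once for the whole directory.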

The Real Cost Savings (Math Time)

Otter.ai charges $9.99/month for their base plan. Rev is $1.25 per minute. Google Cloud Speech-to-Text is $0.006 per 15 seconds ($0.024 per minute) after the free tier.

A Whisper.cpp container running on your homelab? CPU costs are basically electricity. My setup uses about 15W when idle, maybe 35W under load. That’s less than a buck a month.

If you’re transcribing more than 100 minutes per month, Whisper.cpp pays for itself in hardware costs in a year. If you’re already running a homelab, it’s literally free money in your pocket.
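
If you want to run the numbers for your own setup, the arithmetic is short. A sketch with hypothetical inputs: the electricity price, hours of use, and hardware cost are placeholders, not figures from my bills.

```python
def monthly_power_cost(watts, hours_per_day, price_per_kwh, days=30):
    """Electricity cost for one month of running at a given draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * price_per_kwh


def breakeven_months(hardware_cost, monthly_cloud_cost, monthly_power):
    """Months until the hardware is paid off by the cancelled subscription.

    Returns None when local running costs are not cheaper than the cloud bill.
    """
    saving = monthly_cloud_cost - monthly_power
    if saving <= 0:
        return None
    return hardware_cost / saving
```

With my 35W under load for two hours a day and a hypothetical $0.15/kWh, monthly_power_cost(35, 2, 0.15) comes to roughly $0.32. Feed that and your hardware price into breakeven_months to see how long the payoff takes against whatever you pay today.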

Plus: no API rate limits, no account lockouts, complete privacy—your audio never leaves your network.

The Honest Bits (and When Not to Use It)

I love Whisper.cpp, but let’s be real: it’s not magic.

Heavy accents or poor audio? The ‘large’ model helps, but you might still get garbage. Cloud APIs sometimes squeeze out better results on junk audio.

Multiple speakers or overlapping dialogue? Whisper struggles here. It’ll transcribe something, but don’t expect speaker diarization out-of-the-box.

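Real-time transcription at ultra-low latency? Whisper.cpp is fast, but it isn’t a true streaming recognizer. The model works on chunks of audio, and even the bundled stream example just re-runs it on short sliding windows. If you need sub-100ms latency on live audio, you need something else.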
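
The usual workaround for near-real-time use is to re-run the model on short overlapping windows of the incoming audio. This sketch shows only the windowing idea. It is not how the bundled stream example is implemented, and the window and step sizes are arbitrary assumptions. It also shows why the latency floor is the step size plus inference time, not milliseconds.

```python
def sliding_windows(n_samples, sample_rate=16000, window_s=5.0, step_s=1.0):
    """Yield (start, end) sample indices for overlapping windows over a buffer."""
    window = int(window_s * sample_rate)
    step = int(step_s * sample_rate)
    start = 0
    while start < n_samples:
        end = min(start + window, n_samples)
        yield start, end
        if end == n_samples:  # the last window reached the end of the buffer
            break
        start += step
```

Each window would be handed to the model in turn, so a 1-second step means new text can show up at best about once a second.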

For 95% of self-hosted use cases though, this thing is perfect.

Bottom line: Whisper.cpp is the transcription solution I recommend to every homelab person I meet. It’s fast, it’s free, it’s self-hosted, and it works. Stop paying subscription fees for something this good.
