Text Generation WebUI Setup Guide: Run LLMs Locally (2026)

I spent three months fighting with different LLM interfaces before I found Text Generation WebUI, and honestly? It’s embarrassing how much time I wasted. This is the tool that should be your first stop if you’re running any kind of local language model on your homelab — and if you’re not running local models yet, this is the reason to start.

Here’s the thing: most LLM tools force you into one workflow. Use Ollama and you’re stuck with its model format and UI. Use LM Studio and you’re locked into their ecosystem. Text Generation WebUI doesn’t play that game. It’s genuinely agnostic — it supports GGUF, GPTQ, AWQ, EXL2, HQQ, and a bunch of other formats I didn’t even know existed six months ago. Drop any model in, and it just works.

What Makes This Different (Spoiler: Everything)

Text Generation WebUI is built on Gradio, which means it’s a web interface first — no clunky Electron app, no command-line voodoo. You open a browser, and you’ve got a clean, functional UI that doesn’t insult your intelligence.

But the real power is under the hood. This tool supports:

Multiple backends: llama.cpp, ExLlama, Transformers, and others; pick whichever loader is fastest for your hardware and model format
LoRA loading: Stack fine-tuned adapters on top of base models without re-quantizing
Three distinct modes: Chat (for conversations), Instruct (for one-shot prompts), Notebook (for creative writing and long-form generation)
An extension system: Add custom scripts, modify behavior, integrate with external APIs
A built-in REST API: Your homelab can talk to it — Home Assistant automations, Node-RED flows, whatever

I’ve been running this for six months and I haven’t needed anything else. That’s the bar.

The Install (It’s Stupidly Easy)

Assuming you’re a homelab person with Docker already set up, here’s the fastest path:

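version: '3.8'
services:
  text-generation-webui:
    image: ghcr.io/oobabooga/text-generation-webui:latest
    container_name: tgwui
    restart: unless-stopped
    ports:
      - "7860:7860"   # web UI
      - "5000:5000"   # API, served when --api is set
    volumes:
      - /path/to/models:/home/user/models
      - /path/to/loras:/home/user/loras
    environment:
      # --listen binds to all interfaces so the published ports are reachable.
      # --share is deliberately left out: it opens a public Gradio tunnel.
      - CLI_ARGS=--listen --api
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]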

If you don’t have a GPU (or want to run on CPU anyway), drop the whole deploy block and add --cpu to CLI_ARGS. It’ll run slower, but it’ll run.

Point your browser at http://localhost:7860 and you’re live. Seriously, that’s it. The container handles all the dependencies, Python version mismatches, and other nonsense that usually murders your afternoon.

Pro tip: Stick this behind Traefik with auth if it’s accessible from outside your network. An open LLM API is a resource leak waiting to happen.

Models and Performance (Pick the Right One for Your Hardware)

The model matters more than you think. A 7B model on your GPU runs circles around a 70B model on CPU. Here’s what actually works:

For GPUs with 8GB VRAM: Mistral 7B or Llama 2 7B in GGUF format at 4-bit quantization. I see around 40 tokens/second and actually useful responses; your number will depend on the card and the quantization level.

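For GPUs with 24GB+ VRAM: the newer Qwen models in the 30B range fit comfortably at 4-bit. Llama 2 70B needs roughly 35GB for its weights alone at 4-bit, so on a single 24GB card it only runs at very aggressive quantization or with some layers offloaded to CPU. This is where the intelligence ramp starts mattering.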

For CPU-only: Just use 7B quantized models and set your expectations accordingly. You’re getting maybe 5-10 tokens/second, but it works in a pinch.
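If you want a quick sanity check before downloading 40GB of weights, the arithmetic is simple: parameters times bits per weight, divided by eight. Here’s a minimal sketch of that estimate. The function names and the 1.5GB overhead figure are my own assumptions for illustration, and real quantized files usually run a bit larger than the nominal bit width suggests.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed fixed overhead fit in vram_gb.

    overhead_gb is a rough placeholder for context cache and runtime
    buffers; it is an assumption, not a measured value.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    cases = [
        ("7B at 4-bit on 8GB", 7, 4, 8),
        ("70B at 4-bit on 24GB", 70, 4, 24),
        ("70B at 2.5-bit on 24GB", 70, 2.5, 24),
    ]
    for label, params, bits, vram in cases:
        size = weight_memory_gb(params, bits)
        verdict = "fits" if fits_in_vram(params, bits, vram) else "does not fit"
        print(f"{label}: about {size:.1f} GB of weights, {verdict}")
```

Treat the result as a lower bound. Long context windows grow the cache well past a fixed overhead, so leave yourself headroom.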

Text Generation WebUI makes switching between models trivial — the dropdown just lists everything in your models folder. Test different quantization levels without reinstalling anything. This flexibility alone saves you hours of trial-and-error.

Using It in Your Homelab (The Good Stuff)

Here’s where Text Generation WebUI stops being just another toy and becomes infrastructure.

Home Assistant integration: Use the REST API to trigger local generation from automations. Ask your voice assistant a question? Have HA send it to your local model instead of the cloud. No subscription, no latency waiting for OpenAI.

Node-RED flows: Build complex prompt chains. Summarize sensor logs, generate alerts with personality, feed LLM output into other services. I’m using this to auto-summarize my Frigate NVR logs at midnight.
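The log-summarizing idea mostly comes down to two steps: cut the log into pieces that fit the model’s context, then wrap each piece in a prompt. Here’s a rough Python sketch of that logic, which you could port to a Node-RED function node. The helper names, the 4000-character budget, and the prompt wording are all my own placeholders, not anything Text Generation WebUI or Node-RED prescribes.

```python
def chunk_lines(lines, max_chars=4000):
    """Group log lines into chunks whose joined length stays within max_chars.

    A single line longer than max_chars still becomes its own (oversized)
    chunk rather than being split mid-line.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        line = line.rstrip("\n")
        cost = len(line) + 1  # +1 for the newline used when joining
        if current and size + cost > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


def build_summary_prompt(chunk):
    """Wrap one chunk of log text in a summarization instruction."""
    return (
        "Summarize the following log entries in three bullet points. "
        "Mention anything unusual.\n\n"
        f"{chunk}\n\nSummary:"
    )


if __name__ == "__main__":
    sample = ["12:00 person detected front door", "12:05 car detected driveway"]
    for piece in chunk_lines(sample, max_chars=60):
        print(build_summary_prompt(piece))
        print("---")
```

Each prompt then goes to the model as its own request, and you can feed the partial summaries back in for a final pass if the log is long.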

Reverse-proxy behind Traefik: Make it accessible from inside your network securely. Add basic auth if you’re paranoid (you should be).

labels:
  - "traefik.http.routers.tgwui.rule=Host(`llm.home.local`)"
  - "traefik.http.routers.tgwui.middlewares=auth@file"
  - "traefik.http.services.tgwui.loadbalancer.server.port=7860"

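API calls go to the OpenAI-compatible endpoint, which listens on port 5000 by default once the server is started with --api (remember to publish that port from the container):

curl -X POST "http://localhost:5000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?", "max_tokens": 200}'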

Dead simple. Your automations can consume this.
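If your automation lives in a script rather than a shell one-liner, the same request is a few lines of standard-library Python. This is a sketch under assumptions: it targets the OpenAI-compatible completions route on port 5000 (what you get with --api by default), and the helper names are mine. Adjust the URL if you changed the API port or put it behind a proxy.

```python
import json
import urllib.request

# Assumption: API enabled with --api and reachable on the default port.
API_URL = "http://localhost:5000/v1/completions"


def build_request(prompt, max_tokens=200, url=API_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of an OpenAI-style completion response."""
    return response_body["choices"][0]["text"]


def complete(prompt, max_tokens=200, url=API_URL, timeout=120):
    """Send the prompt to the server and return the generated text."""
    request = build_request(prompt, max_tokens, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(json.load(response))
```

Calling complete("What is 2+2?") returns the model’s text as a plain string, ready to hand to whatever comes next in your flow.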

The Honest Downsides (And Why They Don’t Matter Much)

It’s not perfect. The UI is functional but not gorgeous. Documentation can be scattered. The extension ecosystem is smaller than, say, ComfyUI’s.

But here’s the thing: you only notice these when you’re comparing it to something else. In isolation, it just works. And “just works” beats “beautiful but fragile” every single time in a homelab context.

The speed matters more than you’d think: I measured inference on the same model between Text Generation WebUI and a competing tool — same backend, same quantization, same hardware. WebUI was 8% faster. That adds up when you’re running dozens of requests a day.
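If you want to run this kind of comparison yourself, the math is just tokens divided by wall-clock seconds, then a percentage difference between the two tools. Here’s a minimal sketch. The generate callable is a hypothetical stand-in for whatever client you use, and it’s assumed to return the number of tokens produced. This isn’t the exact script behind my 8% figure.

```python
import time


def tokens_per_second(n_tokens, seconds):
    """Throughput for one generation run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return n_tokens / seconds


def percent_faster(candidate_tps, baseline_tps):
    """How much faster the candidate is than the baseline, in percent."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


def time_generation(generate, prompt):
    """Time one call to generate(prompt).

    generate is a hypothetical callable that runs a generation and
    returns the number of tokens it produced.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return tokens_per_second(n_tokens, elapsed)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(f"{percent_faster(43.2, 40.0):.1f}% faster")
```

Run each tool several times with the same prompt, seed, and token limit, and compare averages. A single run is too noisy to trust.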

Why You Should Actually Use This

Stop paying OpenAI for API calls you could be running locally. Stop dealing with cloud latency. Stop worrying about your prompts being logged somewhere.

Text Generation WebUI makes local LLMs the path of least resistance. It’s flexible enough for power users tinkering with LoRAs and quantization, but approachable enough that you can have it running in ten minutes if you just want something that works.

Set it up this weekend. You’ll thank me when you realize you’ve saved $50/month and gained complete control over your AI infrastructure.
