What Tabby is really doing when you hit autocomplete

I spent the afternoon setting up Tabby on my homelab NAS and immediately started reading through the request flow to understand how it’s different from just running Ollama in the corner. Turns out the architecture is more interesting than I expected, and also more opinionated about context than I’d assumed going in.

The basic shape of it

Tabby runs as a server—just a single binary, or a Docker container if you go that route—that sits on your network and listens for completion requests. When you’re typing in VS Code or JetBrains, the editor extension sends a request every few hundred milliseconds with your current file state, cursor position, and some surrounding context. The server processes that, runs it through a code-tuned language model, and sends back token-by-token completions.

The mental model is straightforward. What’s not immediately obvious is the context pipeline.

How context actually gets built

Here’s where I got surprised. When you trigger a completion in Tabby, the extension doesn’t just send your current file and call it done. It sends:

The file you’re editing (full text, up to a reasonable limit)
A chunk of context before and after your cursor
Metadata about the file type, language, and project root
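
To make the shape of such a request concrete, here is a minimal Python sketch that splits a buffer at the cursor and packs the two halves with a language tag. The field names (`language`, `segments`, `prefix`, `suffix`) are modeled on Tabby's `/v1/completions` request body and should be checked against the API docs for your version; `build_completion_payload` is a hypothetical helper, not something the extension ships.

```python
import json


def build_completion_payload(text: str, line: int, column: int, language: str) -> dict:
    """Split `text` at a 0-based (line, column) cursor into prefix and suffix."""
    lines = text.splitlines(keepends=True)
    # Character offset of the cursor: every full line before it, plus the column.
    offset = sum(len(l) for l in lines[:line]) + column
    offset = min(offset, len(text))  # clamp a cursor that points past the buffer
    return {
        "language": language,
        # Assumed field names; verify against your Tabby version's API docs.
        "segments": {"prefix": text[:offset], "suffix": text[offset:]},
    }


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
payload = build_completion_payload(source, line=1, column=11, language="python")
print(json.dumps(payload, indent=2))
```

The cursor here sits right after `return `, so everything up to that point becomes the prefix and the rest of the file becomes the suffix.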

The server receives this request and immediately starts a context retrieval phase. If you’ve indexed your repository—and Tabby has a built-in indexer that watches your git root—it searches that index for semantically similar chunks from other files in your project. Not full files. Chunks. Token-aware chunks.
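
A rough way to picture "token-aware chunks" is a splitter that keeps whole lines together and starts a new chunk once a token budget is used up. The sketch below is only an illustration under that assumption: it counts whitespace-separated words instead of using a real tokenizer, and `chunk_by_token_budget` is a hypothetical name, not Tabby's indexer.

```python
def chunk_by_token_budget(text: str, max_tokens: int = 64) -> list[str]:
    """Group whole lines into chunks that stay within `max_tokens` tokens.

    A single line longer than the budget still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for line in text.splitlines():
        cost = len(line.split())  # crude stand-in for a real tokenizer
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "\n".join(f"value_{i} = {i}" for i in range(10))  # 3 tokens per line
chunks = chunk_by_token_budget(sample, max_tokens=7)
print(len(chunks), "chunks;", repr(chunks[0]))
```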

This is the part that makes Tabby different from a raw language model. It’s doing approximate nearest-neighbor search over embeddings of your code. When I first read this I thought, okay, so it’s pulling in similar functions or patterns from elsewhere in the repo. That seemed smart. Then I started wondering whether that’s actually helping or just adding latency, which is probably a question I’ll come back to in a month.
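
For intuition about the retrieval step, here is a toy version of similarity search over code chunks. It swaps in two simplifications that are my assumptions, not Tabby's implementation: the "embedding" is just a count of identifier-like tokens rather than a learned vector, and the search is exact brute force rather than approximate nearest-neighbor. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(code: str) -> Counter:
    """Toy 'embedding': counts of identifier-like tokens, lowercased."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (exact, brute force)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


repo_chunks = [
    "def load_config(path): return json.load(open(path))",
    "def save_config(path, cfg): json.dump(cfg, open(path, 'w'))",
    "class Vector: def dot(self, other): ...",
]
best = top_k("def read_config(path):", repo_chunks, k=1)
print(best)
```

A real index would precompute the vectors once and use an approximate structure so that lookups stay fast as the repository grows.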

The indexing happens asynchronously. You can point Tabby at your repo, it spins up a worker thread, and it processes the codebase in the background. I set it to index on startup and let it run while I made coffee.
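
The "index in the background" pattern can be sketched in a few lines of Python. This is an illustration of the general idea only, assuming a single worker thread and a trivial "index" that records line counts per file; `BackgroundIndexer` is a hypothetical class and says nothing about how Tabby's own indexer is built.

```python
import tempfile
import threading
from pathlib import Path


class BackgroundIndexer:
    """Walks a directory on a worker thread and records line counts per file."""

    def __init__(self, root: str, suffixes=(".py",)):
        self.root = Path(root)
        self.suffixes = suffixes
        self.index: dict[str, int] = {}
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()  # returns immediately; work continues on the thread

    def wait(self, timeout=None) -> None:
        self._thread.join(timeout)

    def _run(self) -> None:
        for path in self.root.rglob("*"):
            if path.is_file() and path.suffix in self.suffixes:
                count = len(path.read_text(errors="ignore").splitlines())
                with self._lock:
                    self.index[str(path.relative_to(self.root))] = count


with tempfile.TemporaryDirectory() as repo:
    Path(repo, "a.py").write_text("x = 1\ny = 2\n")
    Path(repo, "notes.txt").write_text("ignored\n")
    indexer = BackgroundIndexer(repo)
    indexer.start()
    indexer.wait()  # a real server would keep answering requests instead
    snapshot = dict(indexer.index)
print(snapshot)
```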

The model layer and where it gets loaded

Tabby is model-agnostic in theory but opinionated in practice. The default is a CodeLlama variant or similar, something trained on code rather than general text. You can swap in different models, but the pipeline assumes you’re working with something between 6B and 34B parameters. I’m running it on an RTX 3070, so I stayed with the 7B default.

The model doesn’t live in your repo. It lives in a ~/.tabby/models directory that Tabby manages. When you start the server, it downloads the model from Hugging Face if you don’t have it cached, loads it into GPU memory, and keeps it resident. If your GPU doesn’t have enough VRAM (a 7B model needs roughly 14-16GB at 16-bit precision, less if it is quantized), it’ll fall back to CPU inference, which is brutal. I tested this by accident when I set the wrong model size.
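
The VRAM figure comes from simple arithmetic: parameter count times bytes per parameter. The sketch below covers the weights only and uses idealized sizes (real 4-bit formats carry some overhead, and the KV cache and activations come on top), so treat the numbers as a lower bound. The precision labels and `weight_memory_gb` are names made up for this example.

```python
# Idealized bytes per parameter; real quantized formats are slightly larger.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[precision]


for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: ~{weight_memory_gb(7, precision):.1f} GB")
```

At 16-bit precision a 7B model lands at about 14 GB before any runtime overhead, which is where the 14-16GB range comes from.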

The actual inference happens in a request queue. Tabby doesn’t spawn a new model instance per request. It maintains a single loaded model and queues incoming completion requests. Each request gets a reasonable timeout—I think it defaults to 20 seconds or so—and if the model hasn’t returned tokens by then, Tabby sends back what it has or an empty response.
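
The single-model-plus-queue pattern looks roughly like this in Python. It is a sketch of the general technique, not Tabby's code: one worker thread owns the model, callers enqueue prompts and wait on a future, and a caller that runs out of time gets an empty string (the "return what it has so far" variant is left out). `SingleModelServer` and `fake_model` are hypothetical names.

```python
import queue
import threading
import time
from concurrent.futures import Future, TimeoutError


class SingleModelServer:
    """One worker thread owns the 'model'; callers enqueue prompts and wait."""

    def __init__(self, model_fn):
        self._model_fn = model_fn
        self._jobs: queue.Queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self) -> None:
        while True:
            prompt, future = self._jobs.get()  # requests are served in FIFO order
            try:
                future.set_result(self._model_fn(prompt))
            except Exception as exc:
                future.set_exception(exc)

    def complete(self, prompt: str, timeout: float) -> str:
        future: Future = Future()
        self._jobs.put((prompt, future))
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return ""  # caller stops waiting; the worker still finishes the job


def fake_model(prompt: str) -> str:
    time.sleep(0.05)  # pretend inference takes 50 ms
    return prompt + " pass"


server = SingleModelServer(fake_model)
fast = server.complete("def f():", timeout=1.0)
slow = server.complete("def g():", timeout=0.01)
print(repr(fast), repr(slow))
```

One consequence of this design is that a timed-out request still occupies the worker until it finishes, so a burst of slow requests delays everything queued behind it.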

The wire protocol and why it matters

The extension talks to Tabby over HTTP. That seemed obvious until I realized the implications. Tabby is just a standard HTTP server. You hit it with a POST request containing your context, it returns completions. The extension handles streaming the response back into your editor as it arrives.
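
Because it is plain HTTP, you can talk to the server without the extension at all. The sketch below uses only the Python standard library. The `/v1/completions` path, the bearer-token header, and the `choices[0].text` response field reflect Tabby's documented API as far as I know, but verify them for your version; the hostname `tabby.lan` and the helper names are made up, and this version reads the whole response body rather than streaming it.

```python
import json
import urllib.request


def make_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for a completion."""
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
    )


def first_completion(response_body: bytes) -> str:
    """Pull the first suggestion out of a JSON response, or '' if there is none."""
    choices = json.loads(response_body).get("choices", [])
    return choices[0]["text"] if choices else ""


def request_completion(base_url: str, token: str, payload: dict, timeout: float = 20.0) -> str:
    """Send the request and return the first suggestion (needs a running server)."""
    with urllib.request.urlopen(make_request(base_url, token, payload), timeout=timeout) as resp:
        return first_completion(resp.read())


payload = {"language": "python", "segments": {"prefix": "def add(a, b):\n    return ", "suffix": "\n"}}
req = make_request("http://tabby.lan:8080", "your_secret_token_here", payload)
print(req.get_method(), req.full_url)

# Parsing demonstrated on a canned response so the example runs offline.
canned = b'{"id": "cmpl-1", "choices": [{"index": 0, "text": "a + b"}]}'
print(first_completion(canned))
```

With a server running, `request_completion("http://tabby.lan:8080", token, payload)` would perform the actual round trip.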

This is worth noting because it means Tabby behaves like any other web service: it listens on port 8080 by default, it can sit behind a reverse proxy like Traefik, and if you’re paranoid about network isolation (which I am) you can stick it on a separate VLAN and still reach it from your dev machine. I put it on my homelab network with an internal hostname, fired up the VS Code extension, pointed it at the Tabby server, and after maybe three seconds of waiting I got my first completion. A function stub in Python. Not wrong, not particularly inspired, but working.

The requests are not cached intelligently on the client side. Every keystroke that triggers autocomplete sends a full request. That means latency matters. If your network is flaky or Tabby is slow, you’ll notice. I set up a small monitoring script with Prometheus just to watch response times. First day was around 800ms average, which felt sluggish. Once I noticed I’d left max_batch_size at 1 and bumped it to 4, things improved.
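
If you want a quick number before wiring up real monitoring, a timing loop is enough. This is a standalone sketch, not the Prometheus setup described above: `measure_latency_ms` is a hypothetical helper that times any callable, and the `time.sleep` call stands in for a real completion request.

```python
import math
import statistics
import time


def measure_latency_ms(call, runs: int = 20) -> dict:
    """Time `call()` repeatedly and summarize wall-clock latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {"mean": statistics.fmean(samples), "p95": samples[p95_index], "max": samples[-1]}


# Stand-in for a completion request that takes about 10 ms.
stats = measure_latency_ms(lambda: time.sleep(0.01), runs=5)
print({name: round(value, 1) for name, value in stats.items()})
```

Watching the p95 alongside the mean matters here, because an autocomplete that is usually fast but occasionally stalls feels worse than the average suggests.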

The piece that still feels off

One thing I haven’t fully wrapped my head around yet is how Tabby handles very large files. The indexer processes your repo, but if you open a file with 5000 lines and position your cursor deep in the middle, Tabby has to make a decision about what context to send. Does it send the whole file? A sliding window? Just the surrounding 200 lines?

Looking at the request logs, it seems to be a configurable window that defaults to something like 50 lines before and after your cursor. That’s probably sensible for latency, but it means Tabby might miss relevant context that’s further up in the file. I haven’t hit this as a real problem yet, but I suspect for large monolithic files it might matter.
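
The windowing behavior I'm inferring from the logs would look something like this. It is a sketch of a fixed line window around the cursor, using the 50-line figure as an assumed default; `context_window` is a hypothetical function, not a Tabby setting.

```python
def context_window(lines: list[str], cursor_line: int, before: int = 50, after: int = 50) -> list[str]:
    """Return the lines within `before`/`after` of a 0-based cursor line, clamped to the file."""
    start = max(0, cursor_line - before)
    end = min(len(lines), cursor_line + after + 1)  # +1 to include the last line
    return lines[start:end]


big_file = [f"line {i}" for i in range(5000)]
window = context_window(big_file, cursor_line=2500)
print(len(window), window[0], window[-1])
```

With the cursor at line 2500 of a 5000-line file, anything defined before line 2450 never reaches the model through this path, which is exactly the blind spot for large monolithic files.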

The docker-compose setup I’m using is straightforward. It pulls the Tabby image from Docker Hub, mounts a volume for the model cache, exposes port 8080, and that’s really it:

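version: '3.8'
services:
  tabby:
    image: tabbyml/tabby:0.12.0
    # Assumption: the image's entrypoint is the tabby binary, so it needs a
    # subcommand. The model name is a placeholder for whichever model you run.
    command: serve --model CodeLlama-7B --device cuda
    ports:
      - "8080:8080"
    volumes:
      # Named volume for the model cache. The image may keep its data under
      # /data instead, so check that downloads survive a restart.
      - tabby_models:/root/.tabby/models
      # Host repositories for the indexer to read.
      - /mnt/data/repos:/root/repository
    environment:
      - TABBY_TOKEN=your_secret_token_here
    deploy:
      resources:
        reservations:
          devices:
            # Requires the NVIDIA container toolkit on the host.
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  tabby_models: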

That GPU stanza assumes you have nvidia-docker working, which… I did, barely, after some fiddling with the NVIDIA container toolkit.

The thing that’s sitting in the back of my mind is that I still don’t know the exact memory footprint of the embedding search. Tabby loads the model, maintains the index, and handles concurrent requests all at once. On a tight homelab box, that could get ugly. For now though, my setup has plenty of headroom. I’m curious whether six months in I’ll feel differently about whether this was worth the complexity versus just using Ollama as a code completion backend with simpler logic.
