I spent the afternoon setting up Tabby on my homelab NAS and immediately started reading through the request flow to understand how it’s different from just running Ollama in the corner. Turns out the architecture is more interesting than I expected, and also more opinionated about context than I’d assumed going in.
The basic shape of it
Tabby runs as a server—just a single binary, or a Docker container if you go that route—that sits on your network and listens for completion requests. When you’re typing in VS Code or JetBrains, the editor extension sends a request every few hundred milliseconds with your current file state, cursor position, and some surrounding context. The server processes that, runs it through a code-tuned language model, and sends back token-by-token completions.
The mental model is straightforward. What’s not immediately obvious is the context pipeline.
How context actually gets built
Here’s where I got surprised. When you trigger a completion in Tabby, the extension doesn’t just send your current file and call it done. It sends:
- The file you’re editing (full text, up to a reasonable limit)
- A chunk of context before and after your cursor
- Metadata about the file type, language, and project root
The server receives this request and immediately starts a context retrieval phase. If you’ve indexed your repository—and Tabby has a built-in indexer that watches your git root—it searches that index for semantically similar chunks from other files in your project. Not full files. Chunks. Token-aware chunks.
This is the part that makes Tabby different from a raw language model. It’s doing approximate nearest-neighbor search over embeddings of your code. When I first read this I thought, okay, so it’s pulling in similar functions or patterns from elsewhere in the repo. That seemed smart. Then I started wondering whether that’s actually helping or just adding latency, which is probably a question I’ll come back to in a month.
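To make the retrieval idea concrete, here is a minimal sketch of similarity search over chunk embeddings. It is not Tabby's implementation: it does an exact brute-force scan rather than a true approximate nearest-neighbor lookup, and the chunk ids, the 3-dimensional vectors, and the function names are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, index, k=2):
    """Return the ids of the k chunks most similar to the query.

    `index` maps a chunk id to its embedding vector. This is an exact scan;
    a real ANN index trades a little accuracy for much better speed.
    """
    scored = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Toy embeddings, invented for this example.
toy_index = {
    "utils.py:10-30": [1.0, 0.0, 0.0],
    "db.py:5-25": [0.0, 1.0, 0.0],
    "api.py:40-60": [0.9, 0.1, 0.0],
}

nearest = top_k_chunks([1.0, 0.0, 0.0], toy_index, k=2)
```

The latency question comes down to how this scan scales: brute force touches every chunk on every request, which is the cost an approximate index is meant to avoid.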
The indexing happens asynchronously. You can point Tabby at your repo, it spins up a worker thread, and it processes the codebase in the background. I set it to index on startup and let it run while I made coffee.
The model layer and where it gets loaded
Tabby is model-agnostic in theory but opinionated in practice. The default is a CodeLlama variant or similar, something trained on code rather than general text. You can swap in different models, but the pipeline assumes you’re working with something between 6B and 34B parameters. I’m running it on an RTX 3070, so I stayed with the 7B default.
The model doesn’t live in your repo. It lives in a ~/.tabby/models directory that Tabby manages. When you start the server, it downloads the model from Hugging Face if you don’t have it cached, loads it into GPU memory, and keeps it resident. If your GPU doesn’t have enough VRAM (a 7B model needs roughly 14-16GB at 16-bit precision, and less once quantized), it’ll fall back to CPU inference, which is brutal. I tested this by accident when I set the wrong model size.
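The VRAM figure is easy to sanity-check with back-of-envelope arithmetic: parameter count times bytes per parameter. The sketch below covers weights only and uses a function name of my own. Runtime overhead such as the KV cache and activations comes on top, which is roughly where the upper end of the 14-16GB range comes from.

```python
def weight_memory_gb(params_billions, bytes_per_param):
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return params_billions * bytes_per_param


fp16_gb = weight_memory_gb(7, 2)    # 16-bit weights: 2 bytes per parameter
int4_gb = weight_memory_gb(7, 0.5)  # 4-bit quantized weights: half a byte each
```

At 16 bits a 7B model wants about 14GB for weights alone, while a 4-bit quantized copy is closer to 3.5GB, which would be one way such a model fits on a smaller card.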
The actual inference happens in a request queue. Tabby doesn’t spawn a new model instance per request. It maintains a single loaded model and queues incoming completion requests. Each request gets a reasonable timeout—I think it defaults to 20 seconds or so—and if the model hasn’t returned tokens by then, Tabby sends back what it has or an empty response.
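The pattern described here, one resident model plus a queue plus a per-request deadline, can be sketched in a few lines. This is an illustration of the idea, not Tabby's code: `fake_model` is a hypothetical stand-in for inference, and a single-worker thread pool plays the role of the request queue.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# One worker thread stands in for the single loaded model.
# Extra requests wait in the executor's internal queue.
_executor = ThreadPoolExecutor(max_workers=1)


def fake_model(prompt, delay=0.0):
    """Hypothetical stand-in for inference; sleeps to simulate a slow model."""
    time.sleep(delay)
    return prompt + " -> completion"


def complete(prompt, timeout_s, delay=0.0):
    """Queue a request and return its result, or "" if it misses the deadline."""
    future = _executor.submit(fake_model, prompt, delay)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the job has not started yet
        return ""


fast = complete("def f", timeout_s=1.0)
slow = complete("def g", timeout_s=0.05, delay=0.3)
```

Note that the deadline in this sketch covers time spent waiting in the queue as well as inference time, so a backlog of requests can cause timeouts even when the model itself is fast.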
The wire protocol and why it matters
The extension talks to Tabby over HTTP. That seemed obvious until I realized the implications. Tabby is just a standard HTTP server. You hit it with a POST request containing your context, it returns completions. The extension handles streaming the response back into your editor as it arrives.
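Because it is plain HTTP, you can build a request by hand. The sketch below constructs a POST without sending it. Treat the details as assumptions to check against your server's API docs: the `/v1/completions` path, the `language`/`segments` body shape, and the bearer-token header reflect my understanding of Tabby's API, and `tabby.lan` is a hypothetical hostname.

```python
import json
import urllib.request


def build_completion_request(base_url, language, prefix, suffix, token=None):
    """Build (but do not send) a POST request for a code completion."""
    # Body shape is an assumption: text before and after the cursor.
    body = {"language": language, "segments": {"prefix": prefix, "suffix": suffix}}
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",  # assumed endpoint path
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_completion_request(
    "http://tabby.lan:8080", "python", "def add(a, b):\n    ", "\n"
)
# To actually send it: urllib.request.urlopen(req, timeout=20)
```

Building the request separately from sending it makes it easy to inspect what would go over the wire before pointing it at a live server.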
This is worth noting because it means Tabby runs on port 8080 by default, it can sit behind a reverse proxy like Traefik, and if you’re paranoid about network isolation (which I am) you can stick it on a separate VLAN and still reach it from your dev machine. I put it on my homelab network with an internal hostname, fired up the VS Code extension, pointed it at the Tabby server, and after maybe three seconds of waiting I got my first completion. A function stub in Python. Not wrong, not particularly inspired, but working.
The requests are not cached intelligently on the client side. Every keystroke that triggers autocomplete sends a full request. That means latency matters. If your network is flaky or Tabby is slow, you’ll notice. I set up a small monitoring script feeding Prometheus just to watch response times. First day was around 800ms average, which felt sluggish. After I realized I’d left max_batch_size at 1 instead of bumping it to 4, things improved.
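An average hides the slow outliers that make completions feel laggy, so it helps to look at a tail percentile too. Here is a small sketch of that summary. The sample values are invented (picked to average 800ms), and the function name is my own.

```python
import math
import statistics


def summarize_latency(samples_ms):
    """Return (mean, p95) for a non-empty list of response times in ms."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: the smallest sample with at least 95% of
    # all samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return statistics.mean(ordered), ordered[rank - 1]


# Invented response times in milliseconds.
samples = [600, 650, 700, 720, 750, 780, 800, 850, 900, 1250]
mean_ms, p95_ms = summarize_latency(samples)
```

With these numbers the mean is 800ms but the 95th percentile is 1250ms, which is the figure that better matches how sluggish the editor feels.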
The piece that still feels off
One thing I haven’t fully wrapped my head around yet is how Tabby handles very large files. The indexer processes your repo, but if you open a file with 5000 lines and position your cursor deep in the middle, Tabby has to make a decision about what context to send. Does it send the whole file? A sliding window? Just the surrounding 200 lines?
Looking at the request logs, it seems to be a configurable window that defaults to something like 50 lines before and after your cursor. That’s probably sensible for latency, but it means Tabby might miss relevant context that’s further up in the file. I haven’t hit this as a real problem yet, but I suspect for large monolithic files it might matter.
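A fixed window like that is simple to picture in code. The sketch below slices the lines around the cursor into a prefix and a suffix. The 50-line defaults mirror my reading of the logs rather than a documented setting, and the function is hypothetical.

```python
def context_window(lines, cursor_line, before=50, after=50):
    """Split the lines around a 0-based cursor line into (prefix, suffix).

    The prefix ends just above the cursor line; the suffix starts at it.
    Both are clamped to the bounds of the file.
    """
    start = max(0, cursor_line - before)
    end = min(len(lines), cursor_line + after)
    return lines[start:cursor_line], lines[cursor_line:end]


big_file = [f"line {i}" for i in range(5000)]
prefix, suffix = context_window(big_file, cursor_line=2500)
```

With the cursor at line 2500 of a 5000-line file, only lines 2450 through 2549 make it into the request, so a helper defined at line 100 is invisible unless retrieval pulls it back in.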
The docker-compose setup I’m using is straightforward. Tabby publishes an image to Docker Hub; the compose file mounts one volume for the model cache and another for my repos, exposes port 8080, and that’s really it:
version: '3.8'
services:
  tabby:
    image: tabbyml/tabby:0.12.0
    ports:
      - "8080:8080"
    volumes:
      - tabby_models:/root/.tabby/models
      - /mnt/data/repos:/root/repository
    environment:
      - TABBY_TOKEN=your_secret_token_here
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  tabby_models:
That GPU stanza assumes you have nvidia-docker working, which… I did, barely, after some fiddling with the NVIDIA container toolkit.
The thing that’s sitting in the back of my mind is that I still don’t know the exact memory footprint of the embedding search. Tabby loads the model, maintains the index, and handles concurrent requests all at once. On a tight homelab box, that could get ugly. For now though, my setup has plenty of headroom. I’m curious whether six months in I’ll feel differently about whether this was worth the complexity versus just using Ollama as a code completion backend with simpler logic.