
LocalAI on my homelab: running an OpenAI-compatible API locally

I’ve been leaning on OpenAI’s API for a few projects—Home Assistant automations that need to understand intent, some basic image generation, stuff like that. Nothing fancy. But the API calls add up, and having my own data flow through Anthropic or OpenAI’s servers always felt slightly wrong, even if I trust them. Yesterday I finally set up LocalAI, and I’m writing this down while it’s still fresh because I know I’ll forget the one thing that actually bit me.

Why I’m doing this

The premise is straightforward: LocalAI is a self-hosted API server that mimics OpenAI’s endpoints. Drop it in, point your applications at localhost instead of api.openai.com, and everything works the same way. No monthly bills. No data leaving the homelab. Models stay private. For my use case—a few Home Assistant automations that need basic language understanding, plus occasional Stable Diffusion image generation—it’s overkill in some ways and exactly right in others.

I’m not trying to replace my entire GPT-4 workflow. This is for the low-stakes stuff, the requests that don’t need frontier intelligence. Local models are fine for that.

Getting it running

I went with Docker because I already have Dockge managing containers on this machine. LocalAI provides an official image, which makes this straightforward. Here’s the compose file I used:

version: '3.8'
services:
  localai:
    image: localai/localai:latest
    ports:
      - "8080:8080"
    environment:
      - MODELS_PATH=/models
      - CONTEXT_SIZE=2048
      - THREADS=4
    volumes:
      - /mnt/storage/localai/models:/models
      - /mnt/storage/localai/config:/etc/localai
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G
        reservations:
          memory: 6G

Notes: I set CONTEXT_SIZE conservatively because my hardware is limited. THREADS should match your physical core count, or half of it if you want breathing room for the rest of the system (I have eight cores, hence 4). The models path matters: this is where LocalAI downloads and stores the model files, and they get large fast.

Run docker compose up -d and wait. The first startup takes longer because it's pulling the image and setting up directories. Once it's running, hit http://localhost:8080/v1/models and you should get back a JSON list, probably empty.
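If you'd rather do that check from a script than a browser, here is a minimal Python sketch using only the standard library. It assumes the response follows OpenAI's list shape (a "data" array of objects with an "id" field) and that the server is reachable on localhost:8080 as in the compose file. The function names are mine, not part of LocalAI.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: matches the port mapping above


def parse_model_ids(body: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(body)
    # OpenAI-style list responses keep the entries under "data"; a missing or
    # null "data" is treated as "no models installed yet".
    return [entry["id"] for entry in (payload.get("data") or [])]


def list_models(base_url: str = BASE_URL, timeout: float = 10.0) -> list[str]:
    """Fetch /models from the server and return the model ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

Calling list_models() right after the first startup will most likely give you an empty list, and the configured model name once one is loaded.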

Loading a model

This is where I spent the longest. LocalAI can work with GGUF files, the model format llama.cpp uses, which is how most quantized LLMs are distributed. The API has an endpoint to trigger downloads, but I couldn't get it to work reliably from curl. Turns out the config file is the better path.

I created a config file at /mnt/storage/localai/config/neural-chat.yaml:

name: neural-chat
backend: llama
context_size: 2048
threads: 4
parameters:
  model: huggingface://TheBloke/neural-chat-7B-v3-2-GGUF/neural-chat-7b-v3-2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9

Restart the container and LocalAI will download that GGUF file from Hugging Face on startup. It’s maybe 5GB for this one. I made coffee and checked in 10 minutes later.
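To sanity-check a download size before pulling a model, a back-of-the-envelope estimate is parameters times average bits per weight. The sketch below is rough arithmetic under two assumptions of mine, not measurements: about 7.24 billion parameters for a Mistral-7B-class model, and about 4.85 bits per weight on average for Q4_K_M. It ignores file metadata.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in decimal gigabytes, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values (approximate): ~7.24e9 parameters, ~4.85 bits/weight for Q4_K_M.
estimate = gguf_size_gb(7.24e9, 4.85)
print(f"{estimate:.1f} GB")  # prints "4.4 GB"
```

That lands in the same ballpark as the "maybe 5GB" I saw on disk, which is close enough for planning storage.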

Hit the models endpoint again and now I had neural-chat in the list. Good.

Testing the API

Here’s a basic curl to check it’s working:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer anything" \
  -d '{
    "model": "neural-chat",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "temperature": 0.7
  }'

The Authorization header doesn't matter here: LocalAI doesn't validate it by default, which is fine for a local instance. You get back the same response structure as OpenAI, so existing OpenAI clients generally work once you point them at the new base URL.
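The same request from Python, as a minimal standard-library sketch. It mirrors the curl call above and assumes the reply has OpenAI's chat completion shape (choices[0].message.content). The helper names are hypothetical, and the long timeout is just my guess at what CPU inference needs.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: same host/port as the curl example


def build_chat_request(prompt: str, model: str = "neural-chat",
                       temperature: float = 0.7,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder token, mirroring the curl example.
            "Authorization": "Bearer anything",
        },
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_reply(resp.read().decode("utf-8"))
```

With the container running, ask("What is 2+2?") returns whatever text the model produces. Keeping the request builder separate from the network call makes it easy to inspect the payload without a server.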

The gotcha

I spent 20 minutes here, and I want to spare you that. The models directory needs to exist and be writable by the container user. If LocalAI can't write to /models, the download fails silently and you get a vague error about the model not being found. I had set the volume mount correctly, but the host directory itself didn't have the right permissions. I fixed it with chmod 777 on the host directory, which isn't elegant but works for a homelab.
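A quick way to catch this before starting the container is to check the host directory's mode bits against the uid and gid the container runs as. The sketch below is a simplified check that I wrote for illustration: it only looks at classic Unix permissions and ignores ACLs, supplementary groups, and user-namespace remapping. I don't know which uid the LocalAI image uses, so treat that as something to look up (for example by running id -u inside the container).

```python
import os
import stat


def writable_by(path: str, uid: int, gid: int) -> bool:
    """Return True if (uid, gid) can create files in directory `path`.

    Only classic owner/group/other mode bits are considered.
    """
    st = os.stat(path)
    if not stat.S_ISDIR(st.st_mode):
        return False
    if uid == 0:
        return True  # root bypasses mode bits
    # Creating a file in a directory needs both write and execute (search).
    if st.st_uid == uid:
        needed = stat.S_IWUSR | stat.S_IXUSR
    elif st.st_gid == gid:
        needed = stat.S_IWGRP | stat.S_IXGRP
    else:
        needed = stat.S_IWOTH | stat.S_IXOTH
    return (st.st_mode & needed) == needed
```

If this returns False for the container's uid, changing the directory's owner to that uid is a narrower fix than opening it up to everyone.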

Also: context size matters. I set 2048 above because I’m running on eight cores with 16GB RAM. If you push it to 4096 or higher on modest hardware, response times get slow. I learned this by trying to be ambitious and then backing off.
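Part of the cost of a bigger context is memory for the attention key/value cache, which grows linearly with context length. Here is that arithmetic as a sketch, assuming a Mistral-7B-style architecture (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache values. Those numbers are my assumptions about the model, not something LocalAI reports.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory needed to cache keys and values for every layer at a given context."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


for ctx in (2048, 4096, 8192):
    print(ctx, kv_cache_bytes(ctx) // 2**20, "MiB")
```

Under these assumptions the cache is 256 MiB at 2048 tokens and doubles with each doubling of context. Memory is only half the story on CPU: the time to process a prompt also grows with the number of tokens you actually send, which is probably what I felt as slow responses.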

Using it from Home Assistant

I wanted to replace an OpenAI automation with LocalAI. My Home Assistant instance was calling out to OpenAI’s API; I changed the base_url parameter:

conversation:
  - platform: openai
    api_key: local
    base_url: http://192.168.1.50:8080/v1
    model: neural-chat

That’s the whole thing. Restarted Home Assistant’s conversation component and started sending requests. Works. Response time is slower than cloud (CPU-bound instead of whatever edge infrastructure OpenAI uses), but it’s consistent and I’m not waiting for the network.

What surprised me

I expected the neural-chat model to be pretty basic. It’s actually reasonable at understanding intent. It hallucinated a couple of times when I asked it to summarize a long prompt, but for the Home Assistant use case—just parsing what a user said into an action—it’s solid. I won’t be replacing GPT-4 for anything that needs reliability, but for the utility-level stuff, this is fine and I feel better about the data not leaving the network.

Next thing I want to try is adding Whisper for speech-to-text. LocalAI supports it natively, and I could chain it with the text API for a full voice-understanding pipeline. But that’s another evening of tinkering.
