Skip to main content
Local LLMs

What Open WebUI is actually doing when you click send

· · 5 min read

Open WebUI sits between you and a language model like it’s translating a conversation. Click send on a message, and something more complex than a direct API call happens. Understanding that flow changed how I debug issues and tune performance in my setup.

🎯 Not sure if this will run on your hardware?Use our free Local LLM Hardware Checker — pick your GPU and RAM, see which models will run with real tokens/sec estimates.
Check my hardware →
Open WebUI screenshot
Open WebUI u2014 from the official site

The basic routing problem

Open WebUI is a frontend first. It doesn’t run models itself. It’s a Docker container with a Node.js backend and a Vue.js frontend that needs to know where to find a model to talk to.

When you start Open WebUI, it connects to either Ollama (usually running on localhost:11434) or any OpenAI-compatible endpoint. You configure this in settings or via environment variables at startup. The backend stores this config in SQLite or PostgreSQL depending on your setup.

Here’s the thing that surprised me: it doesn’t validate the connection at startup. If your Ollama instance is down when Open WebUI boots, it won’t complain. It’ll only fail when you actually try to send a message. That meant my first week involved a lot of docker logs digging.

The request journey from UI to inference

You type a message in the browser. Hit send. The frontend JavaScript bundles your text, conversation history, selected model name, and any parameters (temperature, top_p, system prompt) into a JSON payload and POSTs it to /api/chat/completions.

The backend receives this. First, it checks authentication if you’ve enabled multi-user mode—comparing your session token against the user database. Then it looks up which model you selected and retrieves the model’s configuration from its database.

If you’ve set a custom system prompt or temperature for that model, it merges those settings with the user’s runtime parameters. Then it constructs the actual request to send downstream—either to Ollama’s REST API or to whatever OpenAI-compatible server you configured.

The backend doesn’t wait for the full response before starting to stream. It opens a Server-Sent Events (SSE) connection back to the frontend and pipes chunks as they arrive from the model. This is why you see text appearing word by word instead of a blank screen for thirty seconds.

# docker-compose example showing typical setup
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=your_random_key_here
    depends_on:
      - ollama

volumes:
  ollama_data:

Where documents and search fit in

Upload a PDF to Open WebUI, and it doesn’t send that file to your LLM. Instead, the backend processes it.

If RAG (retrieval-augmented generation) is enabled, the file gets chunked—split into overlapping segments typically 512-1024 tokens long. Open WebUI uses a local embedding model to convert each chunk into a vector. These vectors get stored in a vector database (Chroma by default, but Milvus and others are supported).

When you ask a question that references uploaded documents, the system embeds your question the same way, searches the vector store for similar chunks, and appends those chunks to your message before sending it to the LLM. The model never sees your raw PDF—it sees the relevant excerpts.

Web search works differently. If you enable it, Open WebUI calls a search API (SearXNG if you self-host it, or external providers) and injects those results into the prompt context. The model gets recent information without needing to be retrained.

I noticed the embedding step adds latency. On my hardware, embedding a 20-page document takes 15-30 seconds. Chunking happens fast. Vector storage is nearly instant. The bottleneck is the embedding model itself running on CPU.

Multi-user and model management

Open WebUI’s SQLite database (or Postgres if you configure it) stores everything: user credentials, conversation history, documents, model metadata, settings. By default, SQLite is in a Docker volume, which means your data persists across container restarts but lives in one place.

When multiple users are enabled, each user gets their own conversation namespace and document space. The backend enforces this in queries—when user A fetches conversations, the SQL WHERE clause filters by their user_id. There’s role-based access control: admin, user, and in some versions, fine-grained permissions.

Model management is just configuration. Open WebUI lists available models by querying Ollama’s /api/tags endpoint or asking the OpenAI-compatible server what it has. When you pull a new model in Ollama, Open WebUI automatically detects it. No restart needed. The model’s metadata (name, description, parameters) gets cached in Open WebUI’s database for faster access.

Where the seams show

Open WebUI is stateless by design. The container can restart without losing data because everything important lives in the database. But this also means if your database connection drops, the API returns 500 errors even though the model is fine. I’ve had PostgreSQL full disks cause the whole interface to feel broken.

Error handling is inconsistent. Sometimes you get a clear message when Ollama is unreachable. Sometimes the request just hangs for thirty seconds and times out silently. The frontend doesn’t always surface what happened to the backend.

Memory usage can surprise you. The embedding model loads into RAM when you first upload a document. If you’re running Open WebUI and Ollama on the same 8GB machine and you load a 7B parameter model plus an embedding model, you’re going to swap. It won’t crash, but it’ll be slow.

What happens when inference takes a while

If the LLM is slow (and it will be on smaller hardware), the SSE stream is still live. Your browser shows the text arriving in real time. Cancel the request mid-stream, and the backend receives the signal and stops waiting for the model to finish. Ollama handles the cleanup on its end.

Long conversations can cause issues. Open WebUI includes the full conversation history in each request to the model—context matters. But if you’re 100 exchanges deep, you’re sending gigabytes of tokens upstream. Most models have a context limit. Once you hit it, the backend truncates the conversation, usually keeping the system prompt and dropping old messages. This happens silently unless you watch the logs.

The architecture is straightforward enough that debugging it doesn’t require reading much source code. Request goes in, config lookup, downstream call, response streams back. The friction points aren’t hidden—they’re just the normal tradeoffs of being a thin wrapper around something heavier. Run it for a few weeks and you learn where your own setup’s limits are.

Explore Open WebUI in our AI Homelab Toolkit.

Share this article