Paperless-ngx: Actually Getting Documents Searchable

I have a filing cabinet full of documents I will never organize by hand. Bank statements, medical records, old invoices, tax stuff. They sit there. So I bought a decent scanner, looked at commercial solutions, and decided to run Paperless-ngx in Docker instead. It’s open source, indexes everything into a searchable database, and doesn’t phone home. Four months in, I’ve scanned about 800 documents and actually found one. That’s already a win.

Why this instead of just scanning to a folder

You could throw PDFs into a folder and use desktop search. I did that for years. It mostly works until you have 2000 documents and you’re trying to remember which bank statement contained that one transaction. Paperless-ngx does OCR on everything, so you search the text inside the scans, not just the filename. It auto-tags documents by correspondent (the company that sent it) and type (invoice, statement, receipt). That alone saves time.

What surprised me: the AI tagging is actually useful. It learns from documents you tag manually and starts suggesting tags. I was skeptical it would work in a homelab environment, but it does. Not perfect—it occasionally suggests tags that make no sense—but it works well enough that I don’t have to manually categorize everything.

The catch is setup. Docker, PostgreSQL, Redis, Tesseract for OCR, a reverse proxy if you want it accessible outside your network. It’s not hard, but it’s not zero-effort either.

Getting it running on Docker

I’m running this in a Proxmox LXC with Ubuntu 22.04, 4 cores, 4GB RAM. That’s overkill for light use but fine if you plan to scan regularly. Here’s a working docker-compose:

version: '3.4'
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: your_secure_password_here
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    restart: unless-stopped
    volumes:
      - redisdata:/data

  paperless-ngx:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    depends_on:
      - db
      - redis
    environment:
      PAPERLESS_REDIS: redis://redis:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: your_secure_password_here
      PAPERLESS_SECRET_KEY: generate_a_random_string_here_min_32_chars
      PAPERLESS_TIME_ZONE: America/New_York
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: your_admin_password
      PAPERLESS_ENABLE_COMPRESSION: 'true'
    ports:
      - "8000:8000"
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
    restart: unless-stopped

volumes:
  pgdata:
  redisdata:

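Save that as docker-compose.yml, make sure the three volume directories exist, then run docker-compose up -d (docker compose up -d on newer Docker installs). Use the same database password in POSTGRES_PASSWORD and PAPERLESS_DBPASS, or the app won’t be able to connect. It’ll take a minute or two to initialize the database. Once it’s running, open http://your-host:8000 and log in with the admin credentials you set.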

The environment variables are where the magic happens. PAPERLESS_OCR_LANGUAGE defaults to English, but you can add multiple languages if you need them: PAPERLESS_OCR_LANGUAGE: eng+deu for English and German, for example. I stuck with English since that’s 99% of what I scan.
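
For reference, a minimal sketch of how the multi-language setting would sit in the compose file above. As far as I know the official image bundles only a handful of language packs (English and German among them), and anything else has to be requested through PAPERLESS_OCR_LANGUAGES. The commented-out line is an assumption to check against the docs for your version, and the Dutch and Polish codes are just examples.

```yaml
services:
  paperless-ngx:
    environment:
      # Languages Tesseract should try on each document, joined with "+"
      PAPERLESS_OCR_LANGUAGE: eng+deu
      # Assumption: only needed for packs the image does not already ship;
      # space-separated Tesseract codes, installed when the container starts
      # PAPERLESS_OCR_LANGUAGES: nld pol
```

Every extra language slows OCR down a bit, so it’s worth listing only the ones you actually scan.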

First run: import and configure

Once the container is up, go to the web interface and poke around. Some options live in the UI under the gear icon; others are environment variables in the compose file. The important ones:

  • Consumption settings: Where Paperless watches for new documents. You can have it watch a folder on your server, or upload through the web UI. In the official image the watched folder inside the container is /usr/src/paperless/consume, and it needs its own volume mount. I have mine mapped to my scanner’s network share, so I drop scans there and they get picked up automatically within about a minute.
  • Processing settings: Tweak OCR parallelism with the PAPERLESS_TASK_WORKERS and PAPERLESS_THREADS_PER_WORKER environment variables. If you only have 2 CPU cores, leave threads at 1. With more cores, you can bump it up. I’m at 2 on a 4-core system and OCR takes about 2-3 seconds per page.
  • Email settings: Optional, but nice. Paperless can check an IMAP mailbox and save incoming mail and attachments as documents. I use this for bills I get emailed.
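
The consumption and processing settings above can be sketched as compose additions. This is my reading of the Paperless-ngx configuration docs rather than a verified config: the container path /usr/src/paperless/consume is what the official image uses as far as I know, ./consume is a hypothetical host folder, and the polling variable may be named differently in newer releases.

```yaml
services:
  paperless-ngx:
    environment:
      # Check the consume folder every 60 seconds instead of relying on
      # inotify, which often misses files written over a network share
      PAPERLESS_CONSUMER_POLLING: 60
      # How many documents are processed in parallel
      PAPERLESS_TASK_WORKERS: 1
      # How many threads each worker may use for OCR
      PAPERLESS_THREADS_PER_WORKER: 2
    volumes:
      # Hypothetical host folder the scanner writes to -> container consume dir
      - ./consume:/usr/src/paperless/consume
```

Workers times threads should stay at or below your core count, otherwise OCR jobs just fight each other for CPU.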

For the first document you import, manually set a correspondent and document type, then hit save. After five or six documents, the AI starts making suggestions. By document twenty, it’s usually right. Don’t overthink the tagging structure at the beginning; you can retag everything later if needed.

The bit that caught me off guard

Paperless needs a lot of disk space if you keep the originals. A 500-page book scanned at 300 DPI is about 1.5GB. I have about 100GB allocated to storage and I’m at 35GB after 800 documents, which works out to roughly 44MB per document. If you’re scanning decades of paperwork, budget for it. I also set PAPERLESS_ENABLE_COMPRESSION, but as far as I can tell from the docs that compresses the web server’s responses rather than the stored PDFs, so don’t count on it to save disk space.
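
A hedged sketch of the settings that, as I read the configuration docs, bear on storage. PAPERLESS_OCR_SKIP_ARCHIVE_FILE is not something I’ve benchmarked; the value shown is one of its documented modes as far as I know, so verify it against your version before relying on it.

```yaml
services:
  paperless-ngx:
    environment:
      # Skip the separate archive PDF when the original already has a text
      # layer, so born-digital PDFs are not stored twice
      PAPERLESS_OCR_SKIP_ARCHIVE_FILE: with_text
      # Compresses web server responses; affects bandwidth, not files on disk
      PAPERLESS_ENABLE_COMPRESSION: 'true'
```

For scanned paper this changes little, since image-only scans still get an OCR’d archive copy. The bigger lever there is the resolution and color mode you pick on the scanner.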

Also: the search is powerful but not instant. If you’re searching for something specific, use the filter bar on the left—select correspondent, document type, date range—and narrow it down. Full-text search across thousands of documents takes a few seconds. It’s fine, just don’t expect Ctrl+F speed.

What to do next

Set up a routine. Once a week or twice a month, scan your mail and receipts in batch. Don’t let it pile up. I learned this the hard way—I left a stack of bank statements unopened for two months, then tried to backfill them all at once. It’s tedious either way, but at least scanning as you go means you catch problems faster.

If you want it accessible outside your network, put it behind a reverse proxy (I use Nginx Proxy Manager) and set PAPERLESS_ALLOWED_HOSTS to match your domain. Don’t expose port 8000 directly to the internet.
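
A minimal sketch of the proxy-related settings, assuming the proxy terminates TLS for a placeholder domain, paperless.example.com. My understanding of the docs is that PAPERLESS_URL also covers the CSRF and CORS settings, and that localhost should stay in the allowed hosts so the container’s health check keeps passing. Treat both as assumptions to confirm.

```yaml
services:
  paperless-ngx:
    environment:
      # Full public URL served by the reverse proxy (placeholder domain)
      PAPERLESS_URL: https://paperless.example.com
      # Hostnames accepted in the Host header; localhost kept for the health check
      PAPERLESS_ALLOWED_HOSTS: paperless.example.com,localhost
    ports:
      # Bind to loopback only, so only a proxy on the same host can reach it
      - "127.0.0.1:8000:8000"
```

The loopback binding only works if the proxy runs directly on the same host. If the proxy lives in its own container, put both on a shared Docker network and drop the ports mapping instead.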

The export feature is good for peace of mind. Every few months I export everything as PDF + metadata so I’m not locked into the database. It takes a few minutes. It’s not a backup strategy on its own, but it exists.

I’m still building a habit around this. Some days I remember to scan things immediately. Other days I find a pile of documents in the inbox folder I forgot about. It’s better than paper in a cabinet, but it’s not magic—you still have to actually use it.
