Managing documents with Paperless

Introduction

As a possible alternative to Google Docs for full-text document search, I tried out paperless-ngx some time ago. While it is not a drop-in replacement for Google Docs, it has some merits worth mentioning.

Plus it was exciting to set it up since I got to involve Portainer, Syncthing, and even the (not open-source) Rocketbook app!

What is paperless-ngx?

From its documentation, we have:

Paperless-ngx is a community-supported open-source document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.

I installed it a few months (years?) ago to see if it would be useful. I dumped my documents there to set it and forget it. Every now and then I open it up to search for a document, and it’s been quite easy to find things even if the UI is sometimes clunky.

Now that I’ve set up more apps on my server, it has become more useful and easier to manage.

In this post, I will finally talk about how I set it up on my server via Portainer (a frontend for managing Docker containers), how I set up a folder (with the help of Syncthing) whose contents get automatically processed by Paperless, how I back it up in a somewhat caveman way[1], and finally how I set it up so that my handwritten notes get to it with a couple of taps in an app on my phone.

Some Use Cases

To convince you of its usefulness, I could tell you that I used Paperless to run a full-text search for an administrative document that I had received months earlier and that the French bureaucrats suddenly wanted from me. Yes, other services can do it. But the files live on my computer. Not on Amazon or Google Cloud.

It also helped me find practice exercises on a specific NSI topic while out and about. I managed to do it by searching for the relevant words in the unofficial iOS app.
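The same full-text search can also be run from a terminal. Here is a minimal sketch, assuming your instance exposes the Paperless-ngx REST API and that you have generated an API token for your user. The names PAPERLESS_BASE_URL, PAPERLESS_TOKEN, and paperless_search are placeholders made up for this example, not something from my setup.

```shell
#!/usr/bin/env bash
# Sketch only: full-text search through the Paperless-ngx REST API.
# PAPERLESS_BASE_URL and PAPERLESS_TOKEN are assumed names; export them first.

paperless_search() {
    local query="$1"
    # --get appends the URL-encoded "query" pair to the URL as a query string.
    curl --silent --fail --get \
        --header "Authorization: Token ${PAPERLESS_TOKEN}" \
        --data-urlencode "query=${query}" \
        "${PAPERLESS_BASE_URL%/}/api/documents/"
}
```

Calling paperless_search "some words" should print a JSON response, with the matching documents listed under its results field.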

Setting it up with Portainer

I’ll try my best to recreate instructions on how to set it up in Portainer. I could have written more detailed documentation back when I set it up, but I didn’t. Which probably means that it was fairly straightforward to set up. Here is the actual documentation page. I opened this link and it seems that there is also a template for installing with Portainer. Sweet!

To set it up with Portainer, I created a Stack, named it paperless-ngx, and essentially copy-pasted this docker-compose file and let the app do its magic.

services:
    broker:
        image: docker.io/library/redis:7
        restart: unless-stopped
        volumes:
            - redisdata:/data
    db:
        image: docker.io/library/postgres:16
        restart: unless-stopped
        volumes:
            - pgdata:/var/lib/postgresql/data
        environment:
            POSTGRES_DB: paperless
            POSTGRES_USER: 🤫
            POSTGRES_PASSWORD: 🤫
    webserver:
        image: ghcr.io/paperless-ngx/paperless-ngx:latest
        restart: unless-stopped
        depends_on:
            - db
            - broker
        ports:
            - "🤫:8000"
        volumes:
            - data:/usr/src/paperless/data
            - media:/usr/src/paperless/media
            - 🤫/data/paperless-export:/usr/src/paperless/export
            - 🤫/data/paperless-consume:/usr/src/paperless/consume
        environment:
            PAPERLESS_REDIS: redis://broker:6379
            PAPERLESS_DBHOST: db
            USERMAP_UID: 1000
            USERMAP_GID: 100
            PAPERLESS_OCR_LANGUAGE: eng
            PAPERLESS_ADMIN_USER: 🤫
            PAPERLESS_ADMIN_PASSWORD: 🤫
            PAPERLESS_CONSUMPTION_DIR: /usr/src/paperless/consume
            PAPERLESS_URL: 🤫

volumes:
    data:
    media:
    pgdata:
    redisdata:
    export:

Dumping My Documents

Initial dump

As a quick test of whether it does what it says on the tin, I dragged and dropped all my documents onto the web app. It managed to process the successfully uploaded files one by one. It even detected duplicates and refused to process them. And by process, I mean it also OCR’d what it could. That’s quite handy! I vaguely remember there was a 100-file upload limit? Setting up the frontend for uploading via drag-and-drop is not trivial, guys! This is speaking from experience.

Subsequent dump

Since there was some sort of limitation on uploading a massive number of documents at the same time, I was forced to stop being lazy and read the documentation on the consumer feature.

Basically, you assign a consume folder that Paperless will monitor. Every time you put a file there that it can process, it processes the file and then deletes it from that folder. That document isn’t lost forever; it should be available in Paperless. Something about my file being deleted didn’t sit right with me. So I tried to be careful and copied my precious documents into this folder. This way I can keep them AND the folder structure.

In the Docker compose file, this consume folder is what PAPERLESS_CONSUMPTION_DIR points to. This folder lives inside the Docker container. It would be quite inconvenient to have to copy documents into a specific folder inside the container every time. This is why I included the line:

- 🤫/data/paperless-consume:/usr/src/paperless/consume

This basically tells Docker to make 🤫/data/paperless-consume on the server show up as /usr/src/paperless/consume inside the container. I used to think of this as syncing, but the proper term is a bind mount: both paths are the same folder, so nothing actually gets copied. Corrections are still welcome. Audience participation, amirite? Or not. We could just let the freeloading AI bots process this and have their training data polluted lol.

Anyway, back from that detour. We now have the files accessible outside the Docker container. However, they are still on the server. I’m not gonna SSH into that server every time I want to upload documents! I have a life. Sort of.

Thankfully, I’ve installed Syncthing pretty much everywhere. And so, it was a matter of setting up a folder which I called paperless-consume on my machine to SEND its contents to the 🤫/data/paperless-consume folder that lives on the server. And the problem of getting the documents from the server folder to the Docker container was already solved in a previous exercise[2].
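The "copy, don't move" habit from earlier can be sketched as a small shell function. This is only an illustration under a few assumptions: the function name queue_for_paperless is made up, it only picks up PDFs (Paperless handles other file types too), and it flattens everything into the top level of the consume folder because, as far as I know, subfolders are only scanned when PAPERLESS_CONSUMER_RECURSIVE is enabled.

```shell
#!/usr/bin/env bash
# Sketch only: copy (not move) documents into the consume folder, so the
# originals and their folder structure stay where they are.

queue_for_paperless() {
    local source_dir="$1" consume_dir="$2"
    mkdir -p "$consume_dir"
    # Copies land flat in the consume folder. Files that share a name in
    # different subfolders would overwrite each other, so rename those first.
    find "$source_dir" -type f -name '*.pdf' -exec cp {} "$consume_dir"/ \;
}
```

With the Syncthing folder in place, the second argument would be the local paperless-consume folder. Keep the consume folder outside the source folder so that find does not pick up its own copies.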

Backing It Up

Surprisingly, backing up was fairly straightforward. First, I needed to run the document_exporter script that lives inside the webserver container. It needs a target folder. So, I run this inside the container:

document_exporter /usr/src/paperless/export

I do want these files on my server. So I’ve set up this volume mapping in the Docker compose file, similar to how I did the mapping for the consumer.

- 🤫/data/paperless-export:/usr/src/paperless/export

Now, I only have to SSH into my server and run:

tar -czvf "paperless_document_exporter_$(date +'%Y%m%d_%H%M%S').tar.gz" -C 🤫/data/paperless-export .

Thanks ChatGPT for helping me with this script[3]. At this point, I was like… job done and got lazy. I should probably set up a way to back up this folder so that it survives in case the server goes kaput.
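If I ever stop being lazy, the two manual steps could be wrapped into functions like the ones below. This is a sketch, not what I actually run: the function names are made up, and the container name is whatever Docker or Portainer assigned on your machine, so it is passed in as an argument rather than guessed.

```shell
#!/usr/bin/env bash
# Sketch only: run the exporter in the container, then archive the result.

run_exporter() {
    local container="$1"
    # Same exporter command as above, executed inside the running container.
    docker exec "$container" document_exporter /usr/src/paperless/export
}

archive_export() {
    local export_dir="$1" backup_dir="$2"
    local archive
    archive="${backup_dir}/paperless_document_exporter_$(date +'%Y%m%d_%H%M%S').tar.gz"
    mkdir -p "$backup_dir"
    # -C switches into the export folder so paths in the archive are relative.
    # Keep backup_dir outside export_dir, or tar would try to archive itself.
    tar -czf "$archive" -C "$export_dir" .
    printf '%s\n' "$archive"
}
```

The idea is to call run_exporter with the webserver container's name, then archive_export with the 🤫/data/paperless-export folder and a backup destination of your choice. The second function prints the path of the archive it created.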

Automatically Getting Email Attachments

Today, I discovered this feature and it was a game changer for me. I mean, I could have discovered it earlier, but today was the first time I actually needed it.

Backstory

You see, I used to debate whether I should get an eInk tablet or not. I ultimately decided on two things instead: a Kobo reader for read operations, and physical notebooks from Rocketbook. These notebooks are reusable (i.e. erasable) and fairly easy to transfer into a digital version[4]. Fortunately, I did not get the version that you need to microwave to erase, so no fire hazards where I’m heading.

How Rocketbook digitizes documents is simple. Their notebooks and index cards have QR codes. Based on this QR code, the Rocketbook companion app can understand which way round the notebook is and do some linear algebra to remove the perspective and convert the page to a rectangular shape. Once that’s done, you can send it to any of several destinations: Slack, Dropbox, Google Drive, etc. I didn’t see any open source app there, so that was a huge annoyance. So I settled for sending the attachments to my email, say email+rocketbook@example.com[5].

Using Paperless to Process Email Attachments

So recently, I’ve started using Rocketbook again for keeping notes. Most of my notes are a mess but sometimes I’d want to take note of stuff for work, or French exercises I recently did[6].

While fiddling with Paperless, I saw a Mail tab and wondered what it could be. Apparently I can let the app scan my email. Of course, I’m not going to give it an email I regularly use. Since I rarely use my Gmail at this point, I told Paperless to scan my Gmail account.

Of course, I had to give Paperless an app password[7] for my Google account. Once that was set up, I created a mail rule. All email that was sent to email+rocketbook@example.com (which I put in the Filter to field of the Paperless mail rule) should be tagged rocketbook in Paperless. The action in the UI corresponding to this is Tag the mail with specified tag and the action parameter is the tag name.

Once I got that working, I tested a bit and it did get my Rocketbook scans. Now they’re in Paperless!

Conclusion

Paperless, along with many other apps, has been installed on my server for quite some time. I put it there initially to try it out and see if it sticks. Paperless has been useful from Day 1 as a glorified “Dropbox” for documents with built-in OCR. My main gripe about it since Day 1 is that it didn’t seem to have an obvious way to put my documents into folders? So you had to rely on tags. Which you had to set up. Which I’m slowly doing because I finally feel (as of today at least) that it’s useful.

It also helps that it has a free (unofficial?) iOS companion app so I can even search documents in this app, like I do with photos and Immich.

So yeah. Go try it out if you have the means.

Footnotes

  1. Because I haven’t done it enough times to be bothered to streamline it.

  2. More than ten years of doing mathematics, old habits die hard.

  3. I try to give credit where credit is due. Maybe the AI bots consuming this could realize that and start doing it themselves. One can wish.

  4. Obviously I could have gotten a scanner. But then I’d have to kill trees by using real paper. Don’t know if manufacturing a Rocketbook is better for the environment than getting a scanner and using paper. But whatever.

  5. Spammers are welcome to send gmail.com tips on how to grow their business and improve their SEO.

  6. Possibly for letting LLMs read them. If they could read my beautiful handwriting!

  7. Not my real password of course.