Backing up

User database (PostgreSQL) backup

# Backup the database (users, etc.)
podman exec plausible-db pg_dumpall -p 5432 -U plausible -l plausible > postgres12dump.sql

Analytics database (Clickhouse) backup

Half-arsed backup

How to do a quick and dirty back up of the Clickhouse database by backing up the underlying Persistent Volume.

This assumes that your Persistent Volume is actually somewhere on the host:

LOCAL_PV_PATH=$(kubectl get pvc -n plausible data-plausible-events-db-0 -o template='{{.spec.volumeName}}' | xargs kubectl get pv -o template='{{.spec.hostPath.path}}')

tar -cf clickhouse-pv-$(date +%F).tar -C $LOCAL_PV_PATH .

(How much use would this be in a real DR situation? Anyone’s guess…)

Disk backup, shipped manually to AWS

NB: I think that this needs Clickhouse 22 or later because it uses the BACKUP command.

This hasn’t been tested so don’t use this for anything important ;-)

First, configure a backup destination in Clickhouse.

Restart Clickhouse to pick up the changes.

Next set up a user in AWS:

# Create a policy that allows read/write to the backup bucket
aws iam create-policy --policy-name backup-robot-policy --policy-document file://backup-robot-policy.json

# Create a user group
aws iam create-group --group-name backup-agents

# Create a custom policy that allows write access to the backup bucket
aws iam create-policy --policy-name backup-agent-write \
    --description "Allows write access to the backup bucket" \
    --policy-document \
'{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RegistryWrite",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketAcl",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::backups.monodot",
                "arn:aws:s3:::backups.monodot/*"
            ]
        }
    ]
}'

# Attach the policy to the group
aws iam attach-group-policy --group-name backup-agents --policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/backup-agent-write


aws iam create-user --user-name vinson-backup-robot

# Add the user to the group
aws iam add-user-to-group --user-name vinson-backup-robot --group-name backup-agents

# Create an access key for the user
aws iam create-access-key --user-name vinson-backup-robot

# Create a bucket for backups
aws s3 mb s3://backups.xxx

Then, drop to a shell with access to the Kubernetes cluster where Plausible and Clickhouse are running:

kubectl -n plausible exec plausible-events-db-0 -- clickhouse-client --query "BACKUP DATABASE plausible_dev TO Disk('backups', 'clickhouse-backup.zip')"

# Copy the zip file to local disk
kubectl -n plausible cp plausible-events-db-0:/tmp/backups/clickhouse-backup.zip ./clickhouse-backup.zip

# Ship it off to AWS
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

aws s3 cp ./clickhouse-backup.zip s3://backups.xxxxxx/plausible/clickhouse-backup-$(date +%F).zip

# Remove the backup file from the Pod
kubectl -n plausible exec plausible-events-db-0 -- rm /tmp/backups/clickhouse-backup.zip

Native S3 backup in Clickhouse

Clickhouse now has a native S3 backup target. But this feature doesn’t seem to be available in my version of Clickhouse so this is my future plan for backing up:

BACKUP DATABASE plausible_dev TO S3('https://backups.xxxx.s3.amazonaws.com/backup-S3/clickhouse_backup', 'xxxxxxxxx', 'xxxxxxx')

Operations

Viewing the latest analytics events in the DB

clickhouse-client

USE plausible_dev;
SELECT COUNT(*) FROM events;

SELECT name, user_id, session_id, hostname, pathname FROM events ORDER BY timestamp DESC LIMIT 100;

Migrating/updating

# Start a container running the old DB
podman run --name plausible-db -e POSTGRES_PASSWORD=xxxxxxx -e POSTGRES_USER=plausible -e POSTGRES_DATABASE=plausible -v /opt/plausible/db-data:/var/lib/postgresql/data postgres:12

# Dump all of the DB contents
podman exec plausible-db pg_dumpall -p 5432 -U plausible -l plausible > postgres12dump.sql

# Launch a postgres 14 instance
# (apply some kube yaml here)

# Import the data from 12
kubectl exec -i plausible-db-0 -n plausible -- psql -d plausible -U postgres < postgres12dump.sql

Troubleshooting

Empty analytics data

  • Check that there is data in Clickhouse - go to http://clickhouse:8123/play (or expose it somehow). Use the top-right boxes to provide the username and password for Clickhouse. Then try a query like SHOW DATABASES or select count(*) from SCHEMA_NAME.events

Visitor numbers are tiny compared to page visits

  • Plausible calculates visitors by hashing IP addresses. If it can’t determine the visitor’s IP address correctly, or if it’s obfuscated by another network component (e.g. a load balancer), then it will be unable to correctly identify the visitor.
  • I run Plausible on k3s which includes a load balancer, Traefik. In normal configuration it obfuscates the IP address of the client. To change this, set hostNetwork: true in the spec for the traefik deployment. This will expose the client’s IP address to Plausible. (This is a bit of a hack, but it works.)