12. Drop Longhorn in Favour of Static Local PVs + NFS Backups#

Status: Accepted

Supersedes: 0009 (workstation exclusion from Longhorn — no longer relevant with Longhorn removed).

Context#

The cluster previously used Longhorn as its block-storage CSI provider for six stateful PVCs (Supabase db/storage/minio, Grafana, Prometheus, Open WebUI), totalling about 225Gi of live data. Two compounding problems forced a rethink:

/rebuild-cluster destroys all Longhorn data. The decommission playbook wipes /var/lib/longhorn and /var/lib/rancher, and Longhorn’s volume metadata lives in Kubernetes CRDs that are gone once the cluster is torn down. Every rebuild was therefore a total loss of Supabase, Grafana history, and Open WebUI chat state.
No backup system. Longhorn’s built-in volume snapshots are node-local and vanish with the node. Longhorn-to-NFS backup targets exist but were never wired up, because the recovery story still required a Longhorn cluster on the other side.

Additionally, the RK1 nodes each have a 1TB NVMe sitting mostly unused (OS + free space) and nuc2 has a 931GB /home disk mounted at /home, so local high-quality storage is already paid for.

The NAS (gknas, 192.168.1.3) hosts existing NFS shares for LLM models (rkllama/llamacpp) and Supabase DB dumps, plus unrelated personal shares (JellyFin library, files, etc.) that must not be touched. Any NAS change has to respect that trust boundary.

Decision#

Drop Longhorn entirely. Replace it with two orthogonal layers:

Per-node static local PVs (additions/local-storage/). A local-nvme StorageClass with provisioner: kubernetes.io/no-provisioner, volumeBindingMode: WaitForFirstConsumer, and reclaimPolicy: Retain. Six static PersistentVolume objects — one per live Longhorn PVC — each pre-bound via spec.claimRef: {namespace, name} to the exact chart-generated PVC name, and pinned via spec.nodeAffinity to a specific node:

PVC	Node	On-disk path
`supabase-db`	nuc2	`/home/k8s-data/supabase-db`
`supabase-storage`	nuc2	`/home/k8s-data/supabase-storage`
`supabase-minio`	nuc2	`/home/k8s-data/supabase-minio`
`storage-grafana-prometheus-0`	node03	`/var/lib/k8s-data/grafana`
`prometheus-grafana-prometheus-kube-pr-prometheus-...`	node02	`/var/lib/k8s-data/prometheus`
`open-webui`	node04	`/var/lib/k8s-data/open-webui`

The k8s_data_dirs Ansible role creates these directories idempotently (part of the servers play), with owner/mode chosen to match each workload’s pod securityContext. pb_decommission.yml preserves /home/k8s-data and /var/lib/k8s-data by default; a new opt-in flag -e wipe_local_data=true removes them for a genuine clean-slate rebuild.

Per-app backup CronJobs writing to one NFS share on the NAS (additions/backups/). A single cluster-owned subtree /bigdisk/k8s-cluster/ hosts all cluster data the cluster reads or writes over NFS — LLM models, Supabase DB dumps, and the new backup targets (backups/<app>/{,weekly}). One static NFS PV/PVC covers the whole subtree; each CronJob mounts it with a workload-specific subPath. Daily + weekly schedules per app; retention is enforced in-job with find -mtime. Prometheus is deliberately excluded — metrics are reconstructible from re-scrape.

Rejected alternatives#

Keep Longhorn, add a backup pipeline#

Possible, but doesn’t fix the fundamental problem: a rebuild destroys the primary data store and the restore path would need a working Longhorn cluster on the other side. Longhorn is also the single biggest source of rebuild pain (iSCSI cleanup, stuck finalizers, multiple retry loops in pb_decommission.yml), and removing it simplifies the teardown significantly.

`local-path-provisioner` instead of static PVs#

local-path-provisioner is already the default StorageClass (bundled with k3s), so in principle we could just point charts at it. Rejected because its PVC→hostPath bindings live in etcd — they are lost when the cluster is rebuilt. The new cluster’s local-path-provisioner would allocate a brand new hostPath for each PVC, not re-bind to the existing on-disk data. Static PVs with claimRef pinning are the only way to guarantee a PVC re-binds to the same directory after a rebuild.

Ansible-managed NFS setup on the NAS#

The NAS is a QNAP (QTS) hosting unrelated personal data. Giving Ansible access to it would require a trust boundary we don’t want, and QTS regenerates /etc/exports from its web UI on every change, so any Ansible-written configuration would be fragile. Decision: NAS setup is a documented manual runbook (docs/how-to/nas-setup.md) the user runs by hand on the NAS. The runbook creates /bigdisk/k8s-cluster/ as a subdirectory of the existing /bigdisk NFS export (which is already rw to the cluster subnet), so no QTS config changes are needed at all. Rollback = leave the repo on the old paths (which still exist untouched on the NAS).

Tar-out / copy-in migration#

Rejected because the Longhorn data being left behind is recreatable: Supabase runs its init migrations fresh, Grafana starts with empty dashboards (we had no saved dashboards), Prometheus starts empty (fine), Open WebUI starts empty (chat history was disposable). A one-time fresh-start cost was deemed acceptable versus the effort of tar-dumping Longhorn volumes and restoring them into raw hostPath directories with the right ownership.

Consequences#

Positive#

Stateful app data now survives /rebuild-cluster by default. A rebuild re-binds the existing local-nvme PVs to fresh chart-generated PVCs; Supabase Studio shows the same thoughts, Grafana shows the same dashboards, Open WebUI shows the same chat history.
Actual backup system exists. Nightly CronJobs write to the NAS, retention is enforced automatically, restore recipes are documented in Backup and Restore.
Decommission is simpler. All Longhorn-specific teardown (volume detachment waits, finalizer stripping, CRD cleanup, iSCSI logout, /var/lib/longhorn wipe) is gone. pb_decommission.yml is shorter and faster.
open-iscsi no longer required on cluster nodes.

Negative#

No replication. Losing a node loses that node’s live data until the NFS backup is restored. Mitigation: per-app pinning spreads blast radius across four nodes (prometheus→node02, grafana→node03, open-webui→node04, supabase trio→nuc2), and RPO is one day (matching the daily CronJob schedule).
New RWO local-nvme workloads must choose a host explicitly. The StorageClass is WaitForFirstConsumer + static PVs, so a new PVC with no matching PV stays Pending until someone adds one. This is captured as a hard rule in CLAUDE.md.
NAS remains a single point of failure for LLM models, Supabase DB dumps, and backup targets — unchanged from the previous design.
Manual NAS runbook must be run once before the first rebuild on this plan. Documented in docs/how-to/nas-setup.md; the cluster cannot come up without it (rkllama/llamacpp/supabase-db-data PVs would fail to mount their new paths).
Supabase db-data NFS path changed from /bigdisk/OpenBrain to /bigdisk/k8s-cluster/supabase-dumps — ADR 0006 is still accepted (NFS for the dump store), but its path is superseded by this ADR.

References#

PR #321 — scaffolds (this ADR’s infrastructure, no cutover)
PR #NNN — cutover (this ADR, chart StorageClass flips + Longhorn removal)
ADR 0006 — Supabase DB dump storage on NFS (path updated by this ADR)
ADR 0009 — Longhorn workstation exclusion (superseded by this ADR)
docs/how-to/nas-setup.md — manual NAS runbook
docs/how-to/backup-restore.md — CronJob schedule + restore recipes