# 12. Drop Longhorn in Favour of Static Local PVs + NFS Backups

**Status:** Accepted

**Supersedes:** 0009 (workstation exclusion from Longhorn — no longer
relevant with Longhorn removed).

## Context

The cluster previously used Longhorn as its block-storage CSI provider for
six stateful PVCs (Supabase `db`/`storage`/`minio`, Grafana, Prometheus,
Open WebUI), totalling about 225Gi of live data. Two compounding problems
forced a rethink:

1. **`/rebuild-cluster` destroys all Longhorn data.** The decommission
   playbook wipes `/var/lib/longhorn` and `/var/lib/rancher`, and Longhorn's
   volume metadata lives in Kubernetes CRDs that are gone once the cluster
   is torn down. Every rebuild was therefore a total loss of Supabase,
   Grafana history, and Open WebUI chat state.
2. **No backup system.** Longhorn's built-in volume snapshots are
   node-local and vanish with the node. Longhorn-to-NFS backup targets
   exist but were never wired up, because the recovery story still
   required a Longhorn cluster on the other side.

Additionally, each RK1 node has a 1TB NVMe that sits mostly unused
(OS + free space), and nuc2 has a 931GB disk mounted at `/home`,
so high-quality local storage is already paid for.

The NAS (`gknas`, `192.168.1.3`) hosts existing NFS shares for LLM models
(rkllama/llamacpp) and Supabase DB dumps, *plus unrelated personal shares*
(JellyFin library, files, etc.) that must not be touched. Any NAS change
has to respect that trust boundary.

## Decision

**Drop Longhorn entirely. Replace it with two orthogonal layers:**

1. **Per-node static local PVs** (`additions/local-storage/`). A
   `local-nvme` `StorageClass` with `provisioner: kubernetes.io/no-provisioner`,
   `volumeBindingMode: WaitForFirstConsumer`, and `reclaimPolicy: Retain`.
   Six static `PersistentVolume` objects — one per live Longhorn PVC —
   each pre-bound via `spec.claimRef: {namespace, name}` to the exact
   chart-generated PVC name, and pinned via `spec.nodeAffinity` to a
   specific node:

   | PVC | Node | On-disk path |
   |-----|------|---|
   | `supabase-db` | nuc2 | `/home/k8s-data/supabase-db` |
   | `supabase-storage` | nuc2 | `/home/k8s-data/supabase-storage` |
   | `supabase-minio` | nuc2 | `/home/k8s-data/supabase-minio` |
   | `storage-grafana-prometheus-0` | node03 | `/var/lib/k8s-data/grafana` |
   | `prometheus-grafana-prometheus-kube-pr-prometheus-...` | node02 | `/var/lib/k8s-data/prometheus` |
   | `open-webui` | node04 | `/var/lib/k8s-data/open-webui` |

   The `k8s_data_dirs` Ansible role creates these directories
   idempotently (part of the `servers` play), with owner/mode chosen to
   match each workload's pod `securityContext`. `pb_decommission.yml`
   **preserves** `/home/k8s-data` and `/var/lib/k8s-data` by default; a
   new opt-in flag `-e wipe_local_data=true` removes them for a genuine
   clean-slate rebuild.

2. **Per-app backup CronJobs writing to one NFS share on the NAS**
   (`additions/backups/`). A single cluster-owned subtree
   `/bigdisk/k8s-cluster/` hosts everything the cluster reads or writes
   over NFS: LLM models, Supabase DB dumps, and the new backup
   targets (`backups/<app>/{,weekly}`). One static NFS `PV`/`PVC` covers
   the whole subtree; each CronJob mounts it with a workload-specific
   `subPath`. Daily + weekly schedules per app; retention is enforced
   in-job with `find -mtime`. Prometheus is deliberately excluded —
   metrics are reconstructible from re-scrape.
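
As a rough illustration of the second layer, a daily backup job for Open
WebUI might look like the sketch below. Only the PVC name `open-webui`,
the `backups/<app>` layout, and the use of `find -mtime` come from this
ADR. The CronJob name, namespace, schedule, image, NFS claim name
(`k8s-cluster-nfs`), archive format, and 7-day retention are assumptions,
and the sketch assumes the job runs in a namespace where both claims are
available.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: open-webui-backup-daily        # assumed name
  namespace: open-webui                # assumed namespace
spec:
  schedule: "15 2 * * *"               # assumed time of day
  concurrencyPolicy: Forbid            # never run two backups at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: alpine:3.20       # assumed image
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -eu
                  # Archive the live data directory into today's file.
                  tar -czf "/backup/open-webui-$(date +%F).tar.gz" -C /data .
                  # In-job retention: drop daily archives older than 7 days
                  # (assumed window); -maxdepth 1 leaves weekly/ alone.
                  find /backup -maxdepth 1 -name '*.tar.gz' -mtime +7 -delete
              volumeMounts:
                - name: data
                  mountPath: /data
                  readOnly: true
                - name: nas
                  mountPath: /backup
                  subPath: backups/open-webui   # workload-specific subPath
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: open-webui
            - name: nas
              persistentVolumeClaim:
                claimName: k8s-cluster-nfs      # assumed claim name
```

Because the data volume is node-local, the scheduler places the job's pod
on the node that holds the PV. A plain `tar` of a live directory is only
a stand-in here; the real jobs may use app-aware exports.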

## Rejected alternatives

### Keep Longhorn, add a backup pipeline

Possible, but doesn't fix the fundamental problem: a rebuild destroys the
primary data store and the restore path would need a working Longhorn
cluster on the other side. Longhorn is also the single biggest source of
rebuild pain (iSCSI cleanup, stuck finalizers, multiple retry loops in
`pb_decommission.yml`), and removing it simplifies the teardown
significantly.

### `local-path-provisioner` instead of static PVs

`local-path-provisioner` is already the default StorageClass (bundled
with k3s), so in principle we could just point charts at it. Rejected
because its PVC→`hostPath` bindings live in **etcd** — they are lost
when the cluster is rebuilt. The new cluster's `local-path-provisioner`
would allocate a brand new `hostPath` for each PVC, not re-bind to the
existing on-disk data. Static PVs with `claimRef` pinning are the only
way to guarantee a PVC re-binds to the *same* directory after a rebuild.
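
The pinning described above can be sketched as follows, using the
`supabase-db` row of the table in the Decision section. The
`StorageClass` fields, PVC name, node, and path come from this ADR; the
PV name, capacity, namespace, and the choice of a `local` volume source
(rather than `hostPath`) are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner   # static PVs only, nothing dynamic
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: supabase-db-local          # assumed PV name
spec:
  capacity:
    storage: 50Gi                  # assumed size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /home/k8s-data/supabase-db
  claimRef:                        # reserves this PV for exactly one PVC
    namespace: supabase            # assumed namespace
    name: supabase-db
  nodeAffinity:                    # pins the volume to the node holding the data
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - nuc2
```

The `claimRef` carries no `uid`, so after a rebuild the freshly applied
PV can bind to whichever new PVC has that namespace and name, and the
path on disk is the same directory as before.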

### Ansible-managed NFS setup on the NAS

The NAS is a QNAP (QTS) hosting unrelated personal data. Giving Ansible
access to it would cross a trust boundary we want to keep intact, and QTS
regenerates `/etc/exports` from its web UI on every change, so any
Ansible-written configuration would be fragile. Decision: **NAS setup
is a documented manual runbook** (`docs/how-to/nas-setup.md`) the user
runs by hand on the NAS. The runbook creates `/bigdisk/k8s-cluster/` as
a subdirectory of the existing `/bigdisk` NFS export (which is already
`rw` to the cluster subnet), so no QTS config changes are needed at
all. To roll back, leave the repo on the old paths, which still exist
untouched on the NAS.
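
On the cluster side, the subtree created by the runbook could be
consumed through a static NFS volume along these lines. The server
address and export path come from this ADR; the object names, namespace,
and nominal capacity are assumptions.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: k8s-cluster-nfs            # assumed name
spec:
  capacity:
    storage: 1Ti                   # assumed; informational for NFS
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""             # static volume, no StorageClass
  nfs:
    server: 192.168.1.3            # gknas
    path: /bigdisk/k8s-cluster     # subdirectory of the existing /bigdisk export
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: k8s-cluster-nfs            # assumed name
  namespace: open-webui            # assumed; must match the pods that mount it
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # prevents the default class from provisioning
  volumeName: k8s-cluster-nfs      # bind to the static PV above
  resources:
    requests:
      storage: 1Ti
```

Nothing in this sketch requires a change on the NAS beyond the directory
itself, which is the property the manual runbook relies on.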

### Tar-out / copy-in migration

Rejected because the Longhorn data being left behind is recreatable:
Supabase runs its init migrations fresh, Grafana starts with empty
dashboards (we had no saved dashboards), Prometheus starts empty (fine),
Open WebUI starts empty (chat history was disposable). A one-time
fresh-start cost was deemed acceptable versus the effort of tar-dumping
Longhorn volumes and restoring them into raw hostPath directories with
the right ownership.

## Consequences

### Positive

- Stateful app data now **survives `/rebuild-cluster` by default.** A
  rebuild re-binds the existing local-nvme PVs to fresh chart-generated
  PVCs; Supabase Studio shows the same thoughts, Grafana shows the
  same dashboards, Open WebUI shows the same chat history.
- **An actual backup system exists.** Nightly CronJobs write to the NAS,
  retention is enforced automatically, and restore recipes are documented
  in {doc}`../../how-to/backup-restore`.
- **Decommission is simpler.** All Longhorn-specific teardown (volume
  detachment waits, finalizer stripping, CRD cleanup, iSCSI logout,
  `/var/lib/longhorn` wipe) is gone. `pb_decommission.yml` is shorter
  and faster.
- **`open-iscsi` no longer required** on cluster nodes.
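
The preserve-by-default behaviour behind the first bullet might be
expressed in Ansible roughly as below. The directory paths and the
`wipe_local_data` flag come from this ADR; the task names, the loop
layout, and the Grafana owner, group, and mode are assumptions, and a
real role would restrict each directory to the node that hosts it.

```yaml
# Hypothetical excerpt in the spirit of the k8s_data_dirs role.
- name: Create per-workload data directories
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: directory               # idempotent: no change if already present
    owner: "{{ item.owner }}"
    group: "{{ item.group }}"
    mode: "{{ item.mode }}"
  loop:
    # Assumed uid/gid/mode; should match the pod securityContext.
    - { path: /var/lib/k8s-data/grafana, owner: "472", group: "472", mode: "0750" }

# Hypothetical excerpt in the spirit of pb_decommission.yml.
- name: Remove local PV data only when explicitly requested
  ansible.builtin.file:
    path: "{{ item }}"
    state: absent
  loop:
    - /home/k8s-data
    - /var/lib/k8s-data
  when: wipe_local_data | default(false) | bool
```

With this shape, a plain decommission run skips the removal task, and
passing `-e wipe_local_data=true` opts in to a clean slate.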

### Negative

- **No replication.** Losing a node loses that node's live data until
  the NFS backup is restored (Prometheus is not backed up and simply
  starts empty). Mitigation: per-app pinning spreads the blast radius
  across four nodes (prometheus→node02, grafana→node03,
  open-webui→node04, supabase trio→nuc2), and the RPO is one day
  (matching the daily CronJob schedule).
- **New RWO `local-nvme` workloads must choose a host explicitly.** The
  StorageClass uses `WaitForFirstConsumer` and has no dynamic
  provisioner, so a new PVC with no matching static PV stays Pending
  until someone adds one. This is captured as a hard rule in `CLAUDE.md`.
- **NAS remains a single point of failure** for LLM models, Supabase
  DB dumps, and backup targets — unchanged from the previous design.
- **Manual NAS runbook must be run once** before the first rebuild on
  this plan. Documented in `docs/how-to/nas-setup.md`; the cluster
  cannot come up without it (rkllama/llamacpp/supabase-db-data PVs
  would fail to mount their new paths).
- **Supabase `db-data` NFS path changed** from `/bigdisk/OpenBrain` to
  `/bigdisk/k8s-cluster/supabase-dumps` — ADR 0006 is still accepted
  (NFS for the dump store), but its path is superseded by this ADR.

## References

- PR #321 — scaffolds (this ADR's infrastructure, no cutover)
- PR #NNN — cutover (this ADR, chart StorageClass flips + Longhorn removal)
- ADR 0006 — Supabase DB dump storage on NFS (path updated by this ADR)
- ADR 0009 — Longhorn workstation exclusion (superseded by this ADR)
- `docs/how-to/nas-setup.md` — manual NAS runbook
- `docs/how-to/backup-restore.md` — CronJob schedule + restore recipes