Production Readiness Checklist#

Use this checklist when setting up a new cluster or auditing an existing one.

Infrastructure#

  • [ ] All nodes flashed with the target Ubuntu version

  • [ ] NVMe root filesystem migration completed (if applicable)

  • [ ] K3s installed and all nodes in Ready state

  • [ ] Control plane taint applied (multi-node clusters)

DNS and TLS#

  • [ ] Domain delegated to Cloudflare nameservers

  • [ ] Cloudflare API token created with Zone:DNS:Edit permission

  • [ ] API token SealedSecret committed to Git

  • [ ] cert-manager deployed and ClusterIssuer configured

  • [ ] All certificates showing READY: True (kubectl get certificate -A)

  • [ ] Grey-cloud A records created for LAN-only services

Cloudflare tunnel#

  • [ ] Tunnel created in Cloudflare dashboard

  • [ ] Tunnel token SealedSecret committed to Git

  • [ ] cloudflared deployment running (2 replicas)

  • [ ] Public hostnames configured for tunnel-exposed services

  • [ ] WAF skip rule for SSH hostname (if using SSH tunnel)

Sealed Secrets#

  • [ ] Sealed-secrets controller deployed

  • [ ] Private key backed up securely (see Backup and Restore)

  • [ ] All sensitive values stored as SealedSecrets (not plain Secrets in Git)

Authentication#

  • [ ] admin-auth secret created for Grafana/basic-auth services

  • [ ] oauth2-proxy deployed with GitHub OAuth credentials

  • [ ] OAuth enabled on Grafana, Longhorn, Headlamp, Open WebUI

  • [ ] ArgoCD admin password retrieved and changed from default

Resource limits#

  • [ ] All services have CPU/memory requests and limits set

  • [ ] LLM services have appropriate memory limits for loaded models

  • [ ] ingress-nginx has resource requests

Storage#

  • [ ] Longhorn deployed with 3 replicas per volume

  • [ ] VolumeSnapshotClass deployed

  • [ ] NFS server configured for LLM model storage (if applicable)

  • [ ] Backup target configured in Longhorn (NFS or S3)

Monitoring#

  • [ ] kube-prometheus-stack deployed (Prometheus + Grafana + Alertmanager)

  • [ ] Grafana accessible and showing dashboards

  • [ ] Longhorn ServiceMonitor enabled

  • [ ] Alert rules reviewed and customised

Security#

  • [ ] All container images pinned to specific versions

  • [ ] Security contexts applied to all custom deployments

  • [ ] Headlamp RBAC reviewed (currently cluster-admin; scope down when ready)

  • [ ] ArgoCD project sourceRepos restricted to known repositories

  • [ ] No plaintext secrets in Ansible output or Git

GitOps#

  • [ ] ArgoCD deployed and tracking the correct Git branch

  • [ ] All applications showing Synced and Healthy

  • [ ] Renovate bot configured for automated dependency updates

  • [ ] values.yaml repo_branch matches the active branch

Backups#

  • [ ] Longhorn volume snapshots configured (recurring)

  • [ ] Sealed-secrets key exported and stored securely

  • [ ] Disaster recovery procedure documented and tested