Production Readiness Checklist
Use this checklist when setting up a new cluster or auditing an existing one.
Infrastructure
[ ] All nodes flashed with the target Ubuntu version
[ ] NVMe root filesystem migration completed (if applicable)
[ ] K3s installed and all nodes in `Ready` state
[ ] Control plane taint applied (multi-node clusters)
DNS and TLS
[ ] Domain delegated to Cloudflare nameservers
[ ] Cloudflare API token created with `Zone:DNS:Edit` permission
[ ] API token SealedSecret committed to Git
[ ] cert-manager deployed and ClusterIssuer configured
[ ] All certificates showing `READY: True` (`kubectl get certificate -A`)
[ ] Grey-cloud A records created for LAN-only services
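The certificate check above can be scripted instead of read by eye. The following is a minimal sketch, assuming the input is the parsed JSON from `kubectl get certificate -A -o json`; the function name and the sample certificate names are hypothetical and only illustrate the shape of the data.

```python
import json


def unready_certificates(cert_list: dict) -> list[str]:
    """Return "namespace/name" for every Certificate without Ready=True."""
    failing = []
    for cert in cert_list.get("items", []):
        meta = cert["metadata"]
        conditions = cert.get("status", {}).get("conditions", [])
        # cert-manager reports readiness as a condition of type "Ready".
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            failing.append(f"{meta.get('namespace', 'default')}/{meta['name']}")
    return failing


# Hypothetical sample in the shape of `kubectl get certificate -A -o json`.
sample_certs = json.loads("""
{"items": [
  {"metadata": {"namespace": "monitoring", "name": "grafana-tls"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"namespace": "apps", "name": "webui-tls"},
   "status": {"conditions": [{"type": "Ready", "status": "False"}]}}
]}
""")

print(unready_certificates(sample_certs))
```

An empty result means this item can be ticked. A certificate with no `status` yet is treated as not ready.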
Cloudflare tunnel
[ ] Tunnel created in Cloudflare dashboard
[ ] Tunnel token SealedSecret committed to Git
[ ] cloudflared deployment running (2 replicas)
[ ] Public hostnames configured for tunnel-exposed services
[ ] WAF skip rule for SSH hostname (if using SSH tunnel)
Sealed Secrets
[ ] Sealed-secrets controller deployed
[ ] Private key backed up securely (see Backup and Restore)
[ ] All sensitive values stored as SealedSecrets (not plain Secrets in Git)
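One way to audit the last item is to scan the Git checkout for manifests that declare a plain `Secret`. This is a rough sketch rather than a full YAML parser: it only matches an unindented `kind: Secret` line, so quoted values or Secrets nested inside a list will be missed. The function names are hypothetical.

```python
import re
from pathlib import Path

# Matches an unindented `kind: Secret` line; `kind: SealedSecret` does not match.
PLAIN_SECRET = re.compile(r"^kind:[ \t]*Secret[ \t]*\r?$", re.MULTILINE)


def has_plain_secret(manifest_text: str) -> bool:
    """True if the text declares a plain Kubernetes Secret at the top level."""
    return PLAIN_SECRET.search(manifest_text) is not None


def find_plain_secrets(repo_root: str) -> list[Path]:
    """Return YAML files under repo_root that contain a plain Secret."""
    hits = []
    for path in sorted(Path(repo_root).rglob("*")):
        if not path.is_file() or path.suffix not in {".yaml", ".yml"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if has_plain_secret(text):
            hits.append(path)
    return hits
```

Running `find_plain_secrets(".")` from the repository root should return an empty list if every sensitive value is sealed. Templated manifests (for example Helm charts) would need rendering first, which this sketch does not attempt.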
Authentication
[ ] `admin-auth` secret created for Grafana/basic-auth services
[ ] oauth2-proxy deployed with GitHub OAuth credentials
[ ] OAuth enabled on Grafana, Longhorn, Headlamp, Open WebUI
[ ] ArgoCD admin password retrieved and changed from default
Resource limits
[ ] All services have CPU/memory requests and limits set
[ ] LLM services have appropriate memory limits for loaded models
[ ] ingress-nginx has resource requests
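The first item in this list can be checked mechanically. The sketch below assumes the input is the parsed JSON from `kubectl get pods -A -o json` and reports every container that lacks a CPU or memory request or limit. The function name and sample pods are hypothetical, and init containers are not inspected.

```python
def containers_missing_resources(pod_list: dict) -> list[str]:
    """Return "namespace/pod/container" for containers without full
    CPU and memory requests and limits."""
    missing = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        for container in pod.get("spec", {}).get("containers", []):
            resources = container.get("resources") or {}
            requests = resources.get("requests") or {}
            limits = resources.get("limits") or {}
            # Require both cpu and memory in both requests and limits.
            complete = all(
                key in section
                for section in (requests, limits)
                for key in ("cpu", "memory")
            )
            if not complete:
                missing.append(
                    f"{meta.get('namespace', 'default')}/{meta['name']}/{container['name']}"
                )
    return missing


# Hypothetical sample in the shape of `kubectl get pods -A -o json`.
sample_pods = {
    "items": [
        {
            "metadata": {"namespace": "apps", "name": "web-0"},
            "spec": {"containers": [
                {"name": "web", "resources": {
                    "requests": {"cpu": "100m", "memory": "128Mi"},
                    "limits": {"cpu": "500m", "memory": "256Mi"}}},
                {"name": "sidecar", "resources": {
                    "requests": {"cpu": "50m", "memory": "64Mi"}}},
            ]},
        }
    ]
}

print(containers_missing_resources(sample_pods))
```

If CPU limits are deliberately left unset for some workloads, drop `"cpu"` from the limits check rather than ignoring the report.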
Storage
[ ] Longhorn deployed with 3 replicas per volume
[ ] VolumeSnapshotClass deployed
[ ] NFS server configured for LLM model storage (if applicable)
[ ] Backup target configured in Longhorn (NFS or S3)
Monitoring
[ ] kube-prometheus-stack deployed (Prometheus + Grafana + Alertmanager)
[ ] Grafana accessible and showing dashboards
[ ] Longhorn ServiceMonitor enabled
[ ] Alert rules reviewed and customised
Security
[ ] All container images pinned to specific versions
[ ] Security contexts applied to all custom deployments
[ ] Headlamp RBAC reviewed (currently cluster-admin; scope down when ready)
[ ] ArgoCD project `sourceRepos` restricted to known repositories
[ ] No plaintext secrets in Ansible output or Git
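The image-pinning item can be approximated with a small helper. This sketch assumes that "pinned" means the reference carries either a digest or an explicit tag other than `latest`. That definition is an assumption: floating tags such as `stable` would still pass. The function name and example image references are hypothetical.

```python
def is_pinned(image: str) -> bool:
    """True if the image reference has a digest or a tag other than "latest"."""
    if "@sha256:" in image:
        return True  # digest references are always pinned
    # Only the last path segment can carry a tag; an earlier colon
    # belongs to a registry port such as registry.local:5000.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag, so the runtime defaults to "latest"
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


for ref in ("nginx", "nginx:latest", "nginx:1.27.0",
            "registry.local:5000/app", "registry.local:5000/app:v1"):
    print(ref, is_pinned(ref))
```

Feeding it the `image` field of every container in the cluster, or in the Git manifests, gives a quick list of references to fix.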
GitOps
[ ] ArgoCD deployed and tracking the correct Git branch
[ ] All applications showing `Synced` and `Healthy`
[ ] Renovate bot configured for automated dependency updates
[ ] `values.yaml` `repo_branch` matches the active branch
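The `Synced` and `Healthy` check can also be scripted. The sketch below assumes the input is the parsed JSON from `kubectl get applications.argoproj.io -A -o json` and reads the `status.sync.status` and `status.health.status` fields of each Application. The function name and sample application names are hypothetical.

```python
def unhealthy_applications(app_list: dict) -> dict[str, tuple[str, str]]:
    """Map application name to (sync, health) for every Application
    that is not both Synced and Healthy."""
    problems = {}
    for app in app_list.get("items", []):
        status = app.get("status", {})
        # Missing fields are reported as "Unknown" rather than assumed good.
        sync = status.get("sync", {}).get("status", "Unknown")
        health = status.get("health", {}).get("status", "Unknown")
        if sync != "Synced" or health != "Healthy":
            problems[app["metadata"]["name"]] = (sync, health)
    return problems


# Hypothetical sample in the shape of the kubectl JSON output.
sample_apps = {
    "items": [
        {"metadata": {"name": "monitoring"},
         "status": {"sync": {"status": "Synced"},
                    "health": {"status": "Healthy"}}},
        {"metadata": {"name": "longhorn"},
         "status": {"sync": {"status": "OutOfSync"},
                    "health": {"status": "Healthy"}}},
    ]
}

print(unhealthy_applications(sample_apps))
```

An empty dictionary means every application passes. Anything else names the application and shows which of the two states is off.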
Backups
[ ] Longhorn volume snapshots configured (recurring)
[ ] Sealed-secrets key exported and stored securely
[ ] Disaster recovery procedure documented and tested