Troubleshooting#
Common issues and their solutions.
Flashing#
USB error during tpi flash#
Symptom: Error occured during flashing: "USB"
Cause: BMC firmware USB enumeration bug.
Fix: Power-cycle the entire Turing Pi board (not just the individual node). After the BMC reboots, retry the flash:
ansible-playbook pb_all.yml --tags flash -e do_flash=true
Tip
You can verify the USB device is visible before flashing:
# SSH to the BMC
ssh root@turingpi
tpi advanced msd --node <slot>
lsusb # should show device ID 2207:350b for RK1
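For scripted runs, the manual check in this tip can be wrapped in a small helper. This is a minimal sketch that assumes a POSIX shell with grep on the BMC; the function name rk1_in_msd is hypothetical and not part of tpi.

```shell
# rk1_in_msd: read `lsusb` output on stdin and succeed only if a device
# with the RK1 ID quoted in the tip (2207:350b) is listed.
# The function name is illustrative, not part of the tpi CLI.
rk1_in_msd() {
  grep -qi '2207:350b'
}
```

Possible usage on the BMC, after tpi advanced msd --node <slot>: `lsusb | rk1_in_msd && echo "RK1 visible" || echo "RK1 not visible"`.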
Node not reachable after flash#
Symptom: SSH connection refused or timeout after tpi flash completes.
Possible causes:
Node hasn’t finished booting: wait 60–90 seconds
IP address changed: check that hosts.yml matches the DHCP lease or static IP
known_hosts entry stale: delete the old entry and re-run:
ssh-keygen -R <node-ip>
ansible-playbook pb_all.yml --tags known_hosts
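Rather than waiting a fixed 60–90 seconds, you can poll the SSH port until the node answers. The sketch below assumes a netcat build that supports -z and -w (OpenBSD netcat and recent ncat do); wait_for_ssh is a hypothetical helper name, and the default of 30 attempts at 3-second intervals is chosen only to match the 90-second upper bound above.

```shell
# wait_for_ssh HOST [TRIES] [PORT]
# Try a TCP connection to the SSH port up to TRIES times (default 30),
# pausing 3 seconds between attempts. Returns 0 as soon as the port
# accepts a connection, 1 if it never does.
wait_for_ssh() {
  host=$1
  tries=${2:-30}
  port=${3:-22}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -z: connect without sending data; -w 3: give up after 3 seconds
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$tries" ]; then sleep 3; fi
  done
  return 1
}
```

For example: `wait_for_ssh <node-ip> && ansible-playbook pb_all.yml --tags known_hosts`. An open port only shows that sshd is listening; a stale known_hosts entry still needs the ssh-keygen -R step.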
SSH / known_hosts#
Race condition writing known_hosts#
Symptom: known_hosts task fails with concurrent write errors.
Cause: The known_hosts play must run with serial: 1. Parallel writes to
~/.ssh/known_hosts corrupt the file.
Fix: This is already handled in the playbook. If you see this error, check
that you haven’t overridden serial in a custom run.
Ansible#
kubernetes.core deprecation warning#
Symptom:
[DEPRECATION WARNING]: The default value for option validate_certs will be changed
Fix: These are warnings, not errors. They come from the kubernetes.core
collection and don’t affect functionality. Suppress with:
export ANSIBLE_DEPRECATION_WARNINGS=false
Helm plugin breakage after upgrade#
Symptom: Ansible’s helm module fails with “Error: plugin X not found”.
Fix: Clear the Helm cache and reinstall:
rm -rf ~/.cache/helm ~/.local/share/helm/plugins
ansible-playbook pb_all.yml --tags tools
ArgoCD#
Application stuck in “Syncing”#
Symptom: An application shows Syncing indefinitely in the ArgoCD UI.
Possible causes:
Invalid manifest: check the Events tab for validation errors
Namespace doesn’t exist: ArgoCD creates namespaces only if CreateNamespace=true is set in sync options (this project sets it for all apps)
Resource hooks timing out: check hook pod logs
Fix:
# Force a new sync operation
kubectl -n argo-cd patch app <app-name> --type merge -p '{"operation":{"sync":{"force":true}}}'
# Or delete and let ArgoCD recreate from git
kubectl -n argo-cd delete app <app-name>
# ArgoCD will re-create it from the parent all-cluster-services app
Application stuck in “Running” operation (admission webhooks)#
Symptom: An ArgoCD application shows a perpetual Running operation and
never reaches Synced, even though all resources are healthy.
Cause: Charts like kube-prometheus-stack and ingress-nginx include
admission webhook jobs with helm.sh/hook-delete-policy: hook-succeeded. The
job deletes itself before ArgoCD records completion, leaving the operation stuck.
Fix: Disable admission webhooks in the chart values:
# In kubernetes-services/values.yaml (under the affected app)
admissionWebhooks:
  enabled: false
If the operation is already stuck, clear it manually:
kubectl patch app <app-name> -n argo-cd --type json \
-p '[{"op": "remove", "path": "/operation"}]'
OAuth / Authentication#
Viewer emails can access oauth2-proxy-gated services#
Symptom: Users with viewer emails can access admin-only services (Headlamp, Supabase Studio) — oauth2-proxy returns 202 instead of 403.
Cause: The oauth2-proxy Helm chart generates email_domains = ["*"] in
its default ConfigMap. This acts as an OR with authenticatedEmailsFile —
any email from any domain passes the domain check, so the restrictive email
file is silently ignored.
Fix: Set email_domains = [] explicitly via config.configFile in the
Helm values so that only emails in the authenticatedEmailsFile (admin list)
are accepted. See kubernetes-services/templates/oauth2-proxy.yaml.
Diagnosis: Check the live ConfigMap:
kubectl get configmap oauth2-proxy -n oauth2-proxy -o yaml
# Look for: email_domains = ["*"] ← this is the bug
Watch auth decisions in real time:
kubectl logs -n oauth2-proxy deploy/oauth2-proxy -f
# 202 = allowed, 401 = no session, 403 = denied
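The ConfigMap check can be automated so the wildcard is caught before anyone logs in. This is a minimal sketch; has_wildcard_email_domains is a hypothetical helper, and it assumes the setting appears on a single line in the form shown above.

```shell
# has_wildcard_email_domains: read oauth2-proxy config text on stdin and
# succeed if email_domains contains the "*" wildcard.
# Assumes the setting sits on one line, e.g. email_domains = ["*"].
has_wildcard_email_domains() {
  grep -Eq 'email_domains[[:space:]]*=[[:space:]]*\[[^]]*"\*"'
}
```

Possible usage: `kubectl get configmap oauth2-proxy -n oauth2-proxy -o yaml | has_wildcard_email_domains && echo "wildcard present: email file is being bypassed"`.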
Browser#
Service shows blank page or stale UI after config change#
Symptom: A web service (typically Headlamp) loads the page chrome but shows no content, or displays an outdated version of the UI. Works correctly in incognito/private mode.
Cause: Browsers aggressively cache JavaScript bundles, service workers, and API responses. After a cluster reconfiguration or branch switch, the cached assets may not match the current backend state.
Fix (per-site):
Open DevTools (F12) → Application tab → Storage → click Clear site data (tick all boxes including “Unregister service workers”)
Hard-reload: Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (macOS)
Fix (nuclear — reset all Chrome state for one site):
Navigate to the affected URL
Click the padlock/tune icon in the address bar → Site settings
Click Clear data to remove cookies, cache, and local storage for that origin
Reload the page
Fix (Chrome profile reset — if the above doesn’t help):
Chrome can cache redirect state in places that “Clear site data” doesn’t reach. A profile reset clears this without deleting bookmarks or saved passwords:
Navigate to chrome://settings/reset
Click Restore settings to their original defaults
Reload the affected page
Fix (other browsers):
Firefox: Settings → Privacy → Clear Data → Cached Web Content
Try an incognito/private window first to confirm it’s a caching issue
Tip
When testing cluster changes that affect web UIs, use an incognito window first. This avoids polluting your browser cache with intermediate states.
Local-nvme PVs and NFS backups#
Static PV not bound#
Symptom: A pod is stuck in Pending with events like no persistent volumes available for this claim.
Check: kubectl get pv -l type=local-nvme — every PV listed in
additions/local-storage/ should be Bound.
Fix: Verify the data directory exists on the target node
(/home/k8s-data/<app> on nuc2, /var/lib/k8s-data/<app> on the RK1
nodes); re-run ansible-playbook pb_all.yml --tags cluster to let the
k8s_data_dirs role recreate any missing directory.
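To pick out only the volumes that need attention, the PV listing can be filtered. This is a minimal sketch; unbound_pvs is a hypothetical helper that expects one "NAME STATUS" pair per line, which the custom-columns output in the usage note produces.

```shell
# unbound_pvs: read "NAME STATUS" pairs on stdin and print the names of
# volumes whose status is anything other than Bound.
unbound_pvs() {
  awk 'NF >= 2 && $2 != "Bound" { print $1 }'
}
```

Possible usage: `kubectl get pv -l type=local-nvme --no-headers -o custom-columns=NAME:.metadata.name,STATUS:.status.phase | unbound_pvs`. Empty output means every local-nvme PV is Bound.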
Restoring from an NFS backup#
The daily/weekly backup CronJobs in the backups namespace write
compressed dumps to /bigdisk/k8s-cluster/backups/ on the NAS. To find
the latest, run ls -lt /bigdisk/k8s-cluster/backups/supabase-db/ on the
NAS host and pick the newest *.sql.gz.
Networking#
Ingress returning 404 or 503#
Symptom: https://<service>.<domain> returns 404 Not Found or 503 Service
Unavailable.
Checklist:
Service exists?
kubectl get svc -n <namespace>
Endpoints populated?
kubectl get endpoints -n <namespace> <svc-name>
Ingress resource correct?
kubectl get ingress -n <namespace> -o yaml
TLS certificate ready?
kubectl get cert -n <namespace>
DNS resolving?
dig <service>.<domain> # should return worker node IPs
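The status code itself usually points at which checklist item to start with. The sketch below encodes the common ingress-nginx behaviour (404 from the default backend when no rule matches, 503 when the matched Service has no ready endpoints); explain_ingress_status is a hypothetical helper, and an application can also return these codes itself, so treat the output as a hint only.

```shell
# explain_ingress_status CODE
# Map an HTTP status code to the most likely checklist item.
# Reflects typical ingress-nginx behaviour; the backend application
# can also produce these codes on its own.
explain_ingress_status() {
  case "$1" in
    404) echo "no Ingress rule matched the host or path: check the Ingress resource" ;;
    503) echo "Ingress matched but the Service has no ready endpoints: check Service and Endpoints" ;;
    2??|3??) echo "ingress is routing: look at the application itself" ;;
    *) echo "unexpected status $1: check the ingress-nginx controller logs" ;;
  esac
}
```

Possible usage: `explain_ingress_status "$(curl -sk -o /dev/null -w '%{http_code}' https://<service>.<domain>)"`.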
Certificate not issuing#
Symptom: kubectl get cert shows False for Ready.
Checklist:
Check cert-manager logs:
kubectl logs -n cert-manager deploy/cert-manager -f
Check the CertificateRequest and Order:
kubectl get certificaterequest -A
kubectl get order -A
kubectl get challenge -A
Cloudflare API token valid? The SealedSecret in
additions/cert-manager/templates/cloudflare-api-token-secret.yaml must decrypt to a valid token with Zone:DNS:Edit permission.
Cloudflare Tunnel#
Redirect loop through tunnel#
Symptom: Browser shows ERR_TOO_MANY_REDIRECTS when accessing a
tunnelled service.
Cause: The tunnel service URL uses HTTPS, and ingress-nginx forces an SSL redirect — creating an infinite loop.
Fix: Use http:// (not https://) for the tunnel service URL in
the Cloudflare dashboard. The echo ingress has ssl-redirect: false as
a reference.
WAF blocks access to SSH tunnel#
Symptom: cloudflared access login returns
failed to find Access application.
Fix: Add a WAF skip rule for the SSH hostname. See Set Up a Cloudflare SSH Tunnel for Remote Cluster Access Part 3 for details.
Tunnel not connecting#
Symptom: cloudflared pods are running but the Cloudflare dashboard shows the tunnel as inactive.
Checklist:
Tunnel token valid? The SealedSecret must decrypt to a valid token:
kubectl get secret cloudflared-credentials -n cloudflared -o jsonpath='{.data.TUNNEL_TOKEN}' | base64 -d | head -c 20
Pods running?
kubectl get pods -n cloudflared
Logs show errors?
kubectl logs -n cloudflared deployment/cloudflared | tail -30
Outbound connectivity? The pod needs to reach region1.v2.argotunnel.com and region2.v2.argotunnel.com on port 7844.
Connection refused for tunnelled service#
Symptom: Cloudflare returns 502 Bad Gateway.
Checklist:
Service URL correct? The hostname in the tunnel config must match the Kubernetes service DNS name (e.g. ingress-ingress-nginx-controller.ingress-nginx.svc.cluster.local).
Service port correct? Use port 80 (not 443) for HTTP backends.
Ingress resource exists? Check that the target namespace has an ingress for the hostname.
New tunnel hostname won’t resolve on LAN#
Symptom: You add a new public hostname in the Cloudflare tunnel
dashboard, Cloudflare auto-creates the proxied CNAME, but LAN clients
keep getting NXDOMAIN for the new name. Public resolvers (dig @1.1.1.1) return the record correctly.
Cause: Negative DNS caching. Before the record existed, your
router / systemd-resolved / browser queried the name, got
NXDOMAIN, and cached that result for its TTL. The cache doesn’t
clear when Cloudflare publishes the new record — you have to flush it.
Fix: Flush the LAN-side caches in order of how hard they are to reach:
# 1. Local browser — use incognito or clear browser DNS cache
# (chrome://net-internals/#dns → "Clear host cache")
# 2. systemd-resolved on the client
sudo resolvectl flush-caches
# 3. The router's DNS cache — usually via Reboot or a "Flush DNS"
# button in the admin UI. This is the one that tends to persist.
To confirm the negative cache is the cause, bypass the local resolver:
dig @1.1.1.1 new-hostname.example.com # public resolver — should work
dig new-hostname.example.com # local resolver — still NXDOMAIN
If only the local-resolver lookup fails, you are hitting cached
NXDOMAIN, not a Cloudflare problem.
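The comparison can be scripted so the verdict is explicit. This is a minimal sketch; classify_dns is a hypothetical helper that takes the two answers as arguments, relying on dig +short printing nothing when a name does not resolve.

```shell
# classify_dns PUBLIC_ANSWER LOCAL_ANSWER
# Compare the answer from a public resolver with the answer from the
# local resolver (both as returned by `dig +short`, empty on NXDOMAIN).
classify_dns() {
  public=$1
  local_answer=$2
  if [ -n "$public" ] && [ -z "$local_answer" ]; then
    echo "stale-local-cache"   # record exists publicly, LAN still caches NXDOMAIN
  elif [ -z "$public" ]; then
    echo "not-published"       # the record is missing at the public resolver too
  else
    echo "ok"                  # both resolvers return an answer
  fi
}
```

Possible usage: `classify_dns "$(dig +short @1.1.1.1 new-hostname.example.com)" "$(dig +short new-hostname.example.com)"`. A result of stale-local-cache means the flush steps above apply.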
NFS Mount Issues#
PVC stuck in Pending#
Symptom: A PersistentVolumeClaim for LLM models stays in Pending.
Checklist:
NFS server reachable?
kubectl run nfs-test --rm -it --image=busybox:1.37 -- ping -c 3 <nfs-ip>
Export path correct? Check that kubernetes-services/values.yaml matches the NFS server’s /etc/exports.
PV exists and is Available?
kubectl get pv
StorageClass mismatch? NFS PVs in this project do not use a StorageClass; the PVC binds directly by name.
K3s Control Plane#
API server unreachable#
Symptom: kubectl commands fail with connection refused on port 6443.
Checklist:
K3s service running?
ssh node01 sudo systemctl status k3s
Certificates valid? Check /var/lib/rancher/k3s/server/tls/ on the control plane node.
Disk full? etcd can fail if the node runs out of disk space:
ssh node01 df -h /
etcd database too large#
Symptom: K3s logs show mvcc: database space exceeded.
Fix: Take a snapshot as a backup, then restart K3s so that etcd compacts immediately:
ssh node01
sudo k3s etcd-snapshot save --name manual-backup
sudo systemctl restart k3s
K3s’s embedded etcd auto-compacts, but a restart forces immediate compaction.
Supabase#
Postgres fails on NFS storage#
Symptom: Supabase Postgres pod crashes with chown or permission errors.
Cause: NFS with root_squash prevents the Postgres container (UID 105,
GID 106) from changing file ownership. Unlike most Postgres images that use
UID/GID 999, the Supabase image uses non-standard IDs.
Fix: The Supabase database runs on a static local-nvme PV pinned to
nuc2 (backed by /home/k8s-data/supabase-db) — plain filesystem ownership
works correctly. NFS is used only by the backup CronJobs in the backups
namespace to write compressed pg_dump output to the NAS; the live
database never touches NFS. See
6. Supabase Database Storage on NAS via NFS for the original
context and 12. Drop Longhorn in Favour of Static Local PVs + NFS Backups for the
current architecture.
Supabase clients CrashLoop with “password authentication failed” after rebuild#
Symptom: After a cluster rebuild with preserved local-nvme volumes,
supabase-auth, supabase-rest, supabase-storage, supabase-realtime,
and open-brain-mcp all CrashLoop with password authentication failed for user "supabase_admin".
Cause: Postgres init scripts only run when PG_VERSION is absent from
PGDATA. On a preserved volume they are skipped, so service roles keep the
old passwords. If SUPABASE_PASSWORD was not set in .env before the
rebuild, generate-secrets rolled a fresh random password that does not
match what Postgres has.
Prevention: Run just export-external-creds before decommission. This
extracts SUPABASE_PASSWORD and SUPABASE_JWT_SECRET into .env so the
rebuild reuses the existing values.
Fix (if prevention was missed):
# 1) Read the new password from the live secret
NEW_PW=$(kubectl get secret -n supabase supabase-credentials \
-o jsonpath='{.data.password}' | base64 -d)
# 2) ALTER all service roles to match (must use supabase_admin, not postgres)
# Must use -h 127.0.0.1 (trust auth) not local socket (scram)
kubectl exec -i -n supabase supabase-supabase-db-0 -c supabase-db -- \
psql -U supabase_admin -h 127.0.0.1 <<SQL
ALTER USER supabase_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_auth_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_storage_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_functions_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_realtime_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_replication_admin WITH PASSWORD '$NEW_PW';
ALTER USER supabase_read_only_user WITH PASSWORD '$NEW_PW';
ALTER USER authenticator WITH PASSWORD '$NEW_PW';
ALTER USER pgbouncer WITH PASSWORD '$NEW_PW';
ALTER USER dashboard_user WITH PASSWORD '$NEW_PW';
ALTER USER postgres WITH PASSWORD '$NEW_PW';
SQL
# 3) Restart the crashing clients
kubectl rollout restart -n supabase \
deploy/supabase-supabase-auth deploy/supabase-supabase-rest \
deploy/supabase-supabase-storage deploy/supabase-supabase-realtime
kubectl rollout restart -n open-brain-mcp deploy/open-brain-mcp
Important
kubectl exec needs -i for the heredoc to reach psql. Without it the
ALTER statements silently do nothing.
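The ALTER statements in step 2 can also be generated from a role list instead of being typed out. This is a minimal sketch; gen_alter_sql is a hypothetical helper, and it doubles any single quote in the password so the SQL string literal stays valid.

```shell
# gen_alter_sql PASSWORD ROLE...
# Print one ALTER USER statement per role. Single quotes in the password
# are doubled, which is how SQL escapes them inside a string literal.
gen_alter_sql() {
  pw=$(printf '%s' "$1" | sed "s/'/''/g")
  shift
  for role in "$@"; do
    printf "ALTER USER %s WITH PASSWORD '%s';\n" "$role" "$pw"
  done
}
```

Possible usage, with the same role names and psql flags as the fix above: `gen_alter_sql "$NEW_PW" supabase_admin supabase_auth_admin authenticator | kubectl exec -i -n supabase supabase-supabase-db-0 -c supabase-db -- psql -U supabase_admin -h 127.0.0.1`. Extend the role list to cover every role in step 2.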
Kong OOMKilled#
Symptom: Supabase Kong pod restarts repeatedly with OOMKilled.
Fix: Set Kong memory limit to at least 2Gi. Lower values (512Mi–1Gi) cause consistent OOM kills under normal load.
Edge Function not updating after ConfigMap change#
Symptom: Supabase Edge Function serves stale code after updating the ConfigMap.
Cause: subPath ConfigMap mounts do not receive automatic updates from Kubernetes. The pod must be restarted to pick up changes.
Fix: Delete the Edge Function pod to force a restart:
kubectl delete pod -n supabase -l app.kubernetes.io/name=supabase-functions
Edge Function returns 404#
Symptom: Requests to the Edge Function return 404 Not Found.
Cause: The Supabase Edge Runtime requires the basePath in the Hono
application to match the function directory name (the subPath mount point).
Fix: Ensure the basePath in the function code matches the directory name.
For example, if the function is mounted at /open-brain-mcp, the Hono app must
use basePath: '/open-brain-mcp'.
MinIO / Supabase Storage#
MinIO CrashLoopBackOff — file access denied#
Symptom: MinIO pod crashes with unable to rename /data/.minio.sys/tmp — file access denied, drive may be faulty.
Cause: The Chainguard MinIO image (cgr.dev/chainguard/minio) runs as
UID 65532 (nonroot). Static local-nvme PVs are backed by hostPath
directories created with root ownership, so MinIO cannot write to the
volume.
Fix: Add podSecurityContext.fsGroup: 65532 to the MinIO deployment
config in kubernetes-services/templates/supabase.yaml:
deployment:
  minio:
    podSecurityContext:
      fsGroup: 65532
Storage bucket not found after SQL migration#
Symptom: Supabase Storage returns 404 Bucket not found even though
the bucket exists in the storage.buckets table.
Cause: SQL migrations create the bucket row in PostgreSQL but MinIO needs to be notified separately. Supabase Storage syncs bucket state on startup.
Fix: Restart the storage pod after creating buckets via SQL:
kubectl rollout restart deployment supabase-supabase-storage -n supabase
Alternatively, create buckets via the Supabase Storage REST API (POST /storage/v1/bucket) which handles both PostgreSQL and MinIO in one call.
MinIO PVC created with wrong size#
Symptom: kubectl get pvc shows MinIO PVC at 1Gi instead of the
configured 50Gi.
Cause: PVC size is set at creation time. If the Helm values were incorrect when the PVC was first created, fixing the values won’t resize the existing PVC.
Fix: Delete the PVC (safe if MinIO has no data yet) and let ArgoCD recreate it:
kubectl delete pod -n supabase -l app.kubernetes.io/name=supabase-minio
kubectl delete pvc supabase-minio -n supabase
# ArgoCD recreates both with correct size
Hardware#
RK1 module not detected in slot#
Symptom: tpi info shows a slot as empty despite a module being seated.
Fix:
Power off the board
Reseat the compute module firmly
Power on and check again:
ssh root@turingpi
tpi info
NPU not available for RKLlama#
Symptom: RKLlama pod can’t access /dev/rknpu.
Checklist:
Node must be an RK1 (not CM4) — NPU is only on RK3588
Pod must run privileged (already set in the DaemonSet)
Node must have label node-type: rk1:
kubectl label node <node> node-type=rk1