Troubleshooting#
Common issues and their solutions.
Flashing#
USB error during tpi flash#
Symptom: Error occurred during flashing: "USB"
Cause: BMC firmware USB enumeration bug.
Fix: Power-cycle the entire Turing Pi board (not just the individual node). After the BMC reboots, retry the flash:
ansible-playbook pb_all.yml --tags flash -e do_flash=true
Tip
You can verify the USB device is visible before flashing:
# SSH to the BMC
ssh root@turingpi
tpi advanced msd --node <slot>
lsusb # should show device ID 2207:350b for RK1
Node not reachable after flash#
Symptom: SSH connection refused or timeout after tpi flash completes.
Possible causes:
Node hasn’t finished booting — wait 60–90 seconds
IP address changed — check hosts.yml matches the DHCP lease or static IP
known_hosts entry stale — delete the old entry and re-run:
ssh-keygen -R <node-ip>
ansible-playbook pb_all.yml --tags known_hosts
SSH / known_hosts#
Race condition writing known_hosts#
Symptom: known_hosts task fails with concurrent write errors.
Cause: The known_hosts play must run with serial: 1. Parallel writes to
~/.ssh/known_hosts corrupt the file.
Fix: This is already handled in the playbook. If you see this error, check
that you haven’t overridden serial in a custom run.
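For reference, a minimal sketch of what the serialized play looks like (play, task, and variable names here are illustrative, not copied from the actual playbook):

```yaml
# Illustrative: known_hosts updates serialized to one host at a time
- name: Register node host keys
  hosts: nodes
  serial: 1   # parallel writes to ~/.ssh/known_hosts corrupt the file
  tasks:
    - name: Add host key for this node
      ansible.builtin.known_hosts:
        name: "{{ inventory_hostname }}"
        key: "{{ lookup('pipe', 'ssh-keyscan -t ed25519 ' ~ ansible_host) }}"
        state: present
```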
Ansible#
kubernetes.core deprecation warning#
Symptom:
[DEPRECATION WARNING]: The default value for option validate_certs will be changed
Fix: These are warnings, not errors. They come from the kubernetes.core
collection and don’t affect functionality. Suppress with:
export ANSIBLE_DEPRECATION_WARNINGS=false
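To silence the warnings for every run instead of per shell session, the equivalent setting can go in the project's ansible.cfg:

```ini
# ansible.cfg — project-wide equivalent of the environment variable
[defaults]
deprecation_warnings = False
```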
Helm plugin breakage after upgrade#
Symptom: Ansible’s helm module fails with “Error: plugin X not found”.
Fix: Clear the Helm cache and reinstall:
rm -rf ~/.cache/helm ~/.local/share/helm/plugins
ansible-playbook pb_all.yml --tags tools
ArgoCD#
Application stuck in “Syncing”#
Symptom: An application shows Syncing indefinitely in the ArgoCD UI.
Possible causes:
Invalid manifest — check the Events tab for validation errors
Namespace doesn’t exist — ArgoCD creates namespaces only if CreateNamespace=true is set in sync options (this project sets it for all apps)
Resource hooks timing out — check hook pod logs
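For orientation, this is where the sync option sits in an Application manifest (a minimal sketch; app name, source, and destination fields are illustrative):

```yaml
# Minimal sketch of an Application with the CreateNamespace sync option
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app
  namespace: argo-cd
spec:
  syncPolicy:
    automated: {}
    syncOptions:
      - CreateNamespace=true   # ArgoCD creates the target namespace if missing
```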
Fix:
# Force a hard refresh
kubectl -n argo-cd patch app <app-name> --type merge -p '{"operation":{"sync":{"force":true}}}'
# Or delete and let ArgoCD recreate from git
kubectl -n argo-cd delete app <app-name>
# ArgoCD will re-create it from the parent all-cluster-services app
Application stuck in “Running” operation (admission webhooks)#
Symptom: An ArgoCD application shows a perpetual Running operation and
never reaches Synced, even though all resources are healthy.
Cause: Charts like kube-prometheus-stack and ingress-nginx include
admission webhook jobs with helm.sh/hook-delete-policy: hook-succeeded. The
job deletes itself before ArgoCD records completion, leaving the operation stuck.
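The problematic pattern, roughly as it appears in the rendered chart manifests (abbreviated; the job name is illustrative):

```yaml
# Abbreviated: the self-deleting admission-webhook job rendered by such charts
apiVersion: batch/v1
kind: Job
metadata:
  name: example-admission-create
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: hook-succeeded   # job deletes itself on success,
                                                 # before ArgoCD records completion
```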
Fix: Disable admission webhooks in the chart values:
# In kubernetes-services/values.yaml (under the affected app)
admissionWebhooks:
  enabled: false
If the operation is already stuck, clear it manually:
kubectl patch app <app-name> -n argo-cd --type json \
  -p '[{"op": "remove", "path": "/operation"}]'
Browser#
Service shows blank page or stale UI after config change#
Symptom: A web service (typically Headlamp) loads the page chrome but shows no content, or displays an outdated version of the UI. Works correctly in incognito/private mode.
Cause: Browsers aggressively cache JavaScript bundles, service workers, and API responses. After a cluster reconfiguration or branch switch, the cached assets may not match the current backend state.
Fix (per-site):
Open DevTools (F12) → Application tab → Storage → click Clear site data (tick all boxes including “Unregister service workers”)
Hard-reload: Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (macOS)
Fix (nuclear — reset all Chrome state for one site):
Navigate to the affected URL
Click the padlock/tune icon in the address bar → Site settings
Click Clear data to remove cookies, cache, and local storage for that origin
Reload the page
Fix (Chrome profile reset — if the above doesn’t help):
Chrome can cache redirect state in places that “Clear site data” doesn’t reach. A profile reset clears this without deleting bookmarks or saved passwords:
Navigate to chrome://settings/reset
Click Restore settings to their original defaults
Reload the affected page
Fix (other browsers):
Firefox: Settings → Privacy → Clear Data → Cached Web Content
Try an incognito/private window first to confirm it’s a caching issue
Tip
When testing cluster changes that affect web UIs, use an incognito window first. This avoids polluting your browser cache with intermediate states.
Longhorn#
Cannot uninstall Longhorn#
Symptom: helm uninstall longhorn hangs or fails.
Cause: Longhorn requires its uninstall job to detach all volumes and clean up node state. Volumes still attached to running pods prevent uninstall.
Fix:
Scale down all workloads using Longhorn PVCs
Delete any remaining PVCs manually
Run the Longhorn uninstall procedure:
kubectl -n longhorn apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/uninstall/uninstall.yaml
kubectl -n longhorn get job/longhorn-uninstall -w
# Wait for completion, then
helm uninstall longhorn -n longhorn
Volume degraded — replica rebuilding#
Symptom: Longhorn UI shows a volume as Degraded with replicas rebuilding.
Cause: A node was restarted or lost network temporarily. Longhorn automatically rebuilds under-replicated volumes.
Action: No action needed. Monitor progress in the Longhorn UI. Rebuilding typically completes within minutes depending on volume size.
Networking#
Ingress returning 404 or 503#
Symptom: https://<service>.<domain> returns 404 Not Found or 503 Service Unavailable.
Checklist:
Service exists? kubectl get svc -n <namespace>
Endpoints populated? kubectl get endpoints -n <namespace> <svc-name>
Ingress resource correct? kubectl get ingress -n <namespace> -o yaml
TLS certificate ready? kubectl get cert -n <namespace>
DNS resolving? dig <service>.<domain> — should return worker node IPs
Certificate not issuing#
Symptom: kubectl get cert shows False for Ready.
Checklist:
Check cert-manager logs:
kubectl logs -n cert-manager deploy/cert-manager -f
Check the CertificateRequest and Order:
kubectl get certificaterequest -A
kubectl get order -A
kubectl get challenge -A
Cloudflare API token valid? The SealedSecret in additions/cert-manager/templates/cloudflare-api-token-secret.yaml must decrypt to a valid token with Zone:DNS:Edit permission.
Cloudflare Tunnel#
Redirect loop through tunnel#
Symptom: Browser shows ERR_TOO_MANY_REDIRECTS when accessing a
tunnelled service.
Cause: The tunnel service URL uses HTTPS, and ingress-nginx forces an SSL redirect — creating an infinite loop.
Fix: Use http:// (not https://) for the tunnel service URL in
the Cloudflare dashboard. The echo ingress has ssl-redirect: false as
a reference.
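For reference, the annotation that disables the redirect on a single ingress (the pattern the echo ingress uses; other fields elided):

```yaml
# Per-ingress opt-out of the HTTPS redirect (as on the echo ingress)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
```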
WAF blocks access to SSH tunnel#
Symptom: cloudflared access login returns
failed to find Access application.
Fix: Add a WAF skip rule for the SSH hostname. See Set Up a Cloudflare SSH Tunnel for Remote Cluster Access Part 3 for details.
Tunnel not connecting#
Symptom: cloudflared pods are running but the Cloudflare dashboard shows the tunnel as inactive.
Checklist:
Tunnel token valid? The SealedSecret must decrypt to a valid token:
kubectl get secret cloudflared-credentials -n cloudflared -o jsonpath='{.data.TUNNEL_TOKEN}' | base64 -d | head -c 20
Pods running? kubectl get pods -n cloudflared
Logs show errors? kubectl logs -n cloudflared deployment/cloudflared | tail -30
Outbound connectivity? The pod needs to reach *.cloudflareresearch.com on port 7844.
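The head -c 20 check above only previews the decoded secret. As far as I can tell, connector tokens are base64-encoded JSON, so a full decode is a quick well-formedness check. Here is the decode pattern on a stand-in value (the payload below is made up, not a real token):

```shell
# Round-trip on a stand-in value: encode JSON, then decode it the same way
# the secret check does (a real token decodes to a JSON object, not this payload)
json='{"a":"account-tag","t":"tunnel-id","s":"secret"}'
token=$(printf '%s' "$json" | base64)
printf '%s' "$token" | base64 -d
```

If the decode prints binary garbage or truncated JSON, the SealedSecret was sealed from a bad value.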
Connection refused for tunnelled service#
Symptom: Cloudflare returns 502 Bad Gateway.
Checklist:
Service URL correct? The hostname in the tunnel config must match the Kubernetes service DNS name (e.g. ingress-ingress-nginx-controller.ingress-nginx.svc.cluster.local).
Service port correct? Use port 80 (not 443) for HTTP backends.
Ingress resource exists? Check the target namespace has an ingress for the hostname.
NFS Mount Issues#
PVC stuck in Pending#
Symptom: A PersistentVolumeClaim for LLM models stays in Pending.
Checklist:
NFS server reachable?
kubectl run nfs-test --rm -it --image=busybox:1.37 -- ping -c 3 <nfs-ip>
Export path correct? Check kubernetes-services/values.yaml matches the NFS server’s /etc/exports
PV exists and is Available? kubectl get pv
StorageClass mismatch? NFS PVs in this project do not use a StorageClass — the PVC binds directly by name.
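A minimal sketch of that direct binding (all names, paths, and sizes here are illustrative, not copied from the project's manifests):

```yaml
# Illustrative: PVC bound to a specific NFS PV by name, no StorageClass involved
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadWriteMany]
  nfs:
    server: <nfs-ip>
    path: /export/models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models
spec:
  storageClassName: ""      # empty string opts out of dynamic provisioning
  volumeName: llm-models    # bind directly to the PV above
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 100Gi
```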
K3s Control Plane#
API server unreachable#
Symptom: kubectl commands fail with connection refused on port 6443.
Checklist:
K3s service running?
ssh node01 sudo systemctl status k3s
Certificates valid? Check /var/lib/rancher/k3s/server/tls/ on the control plane node
Disk full? etcd can fail if the node runs out of disk space:
ssh node01 df -h /
etcd database too large#
Symptom: K3s logs show mvcc: database space exceeded.
Fix: Take an etcd snapshot, then restart K3s to trigger compaction:
ssh node01
sudo k3s etcd-snapshot save --name manual-backup
sudo systemctl restart k3s
K3s’s embedded etcd auto-compacts, but a restart forces immediate compaction.
Supabase#
Postgres fails on NFS storage#
Symptom: Supabase Postgres pod crashes with chown or permission errors.
Cause: NFS with root_squash prevents the Postgres container (UID 105,
GID 106) from changing file ownership. Unlike most Postgres images that use
UID/GID 999, the Supabase image uses non-standard IDs.
Fix: Use Longhorn (or another block storage provider) instead of NFS for the Postgres PVC. See 6. Supabase Database Storage on NAS via NFS.
Kong OOMKilled#
Symptom: Supabase Kong pod restarts repeatedly with OOMKilled.
Fix: Set Kong memory limit to at least 2Gi. Lower values (512Mi–1Gi) cause consistent OOM kills under normal load.
Edge Function not updating after ConfigMap change#
Symptom: Supabase Edge Function serves stale code after updating the ConfigMap.
Cause: subPath ConfigMap mounts do not receive automatic updates from Kubernetes. The pod must be restarted to pick up changes.
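The mount shape in question, sketched (paths and volume names are illustrative):

```yaml
# Illustrative: a subPath mount — the file is projected once at pod start,
# so later ConfigMap edits never reach the running container
volumeMounts:
  - name: functions
    mountPath: /home/deno/functions/open-brain-mcp/index.ts
    subPath: index.ts
volumes:
  - name: functions
    configMap:
      name: edge-function-code
```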
Fix: Delete the Edge Function pod to force a restart:
kubectl delete pod -n supabase -l app.kubernetes.io/name=supabase-functions
Edge Function returns 404#
Symptom: Requests to the Edge Function return 404 Not Found.
Cause: The Supabase Edge Runtime requires the basePath in the Hono
application to match the function directory name (the subPath mount point).
Fix: Ensure the basePath in the function code matches the directory name.
For example, if the function is mounted at /open-brain-mcp, the Hono app must
use basePath: '/open-brain-mcp'.
MinIO / Supabase Storage#
MinIO CrashLoopBackOff — file access denied#
Symptom: MinIO pod crashes with unable to rename /data/.minio.sys/tmp — file access denied, drive may be faulty.
Cause: The Chainguard MinIO image (cgr.dev/chainguard/minio) runs as
UID 65532 (nonroot). Longhorn PVCs are created with root ownership, so MinIO
cannot write to the volume.
Fix: Add podSecurityContext.fsGroup: 65532 to the MinIO deployment
config in kubernetes-services/templates/supabase.yaml:
deployment:
  minio:
    podSecurityContext:
      fsGroup: 65532
Storage bucket not found after SQL migration#
Symptom: Supabase Storage returns 404 Bucket not found even though
the bucket exists in storage.buckets table.
Cause: SQL migrations create the bucket row in PostgreSQL but MinIO needs to be notified separately. Supabase Storage syncs bucket state on startup.
Fix: Restart the storage pod after creating buckets via SQL:
kubectl rollout restart deployment supabase-supabase-storage -n supabase
Alternatively, create buckets via the Supabase Storage REST API (POST /storage/v1/bucket) which handles both PostgreSQL and MinIO in one call.
MinIO PVC created with wrong size#
Symptom: kubectl get pvc shows MinIO PVC at 1Gi instead of the
configured 50Gi.
Cause: PVC size is set at creation time. If the Helm values were incorrect when the PVC was first created, fixing the values won’t resize the existing PVC.
Fix: Delete the PVC (safe if MinIO has no data yet) and let ArgoCD recreate it:
kubectl delete pod -n supabase -l app.kubernetes.io/name=supabase-minio
kubectl delete pvc supabase-minio -n supabase
# ArgoCD recreates both with correct size
Hardware#
RK1 module not detected in slot#
Symptom: tpi info shows a slot as empty despite a module being seated.
Fix:
Power off the board
Reseat the compute module firmly
Power on and check again:
ssh root@turingpi
tpi info
NPU not available for RKLlama#
Symptom: RKLlama pod can’t access /dev/rknpu.
Checklist:
Node must be an RK1 (not CM4) — NPU is only on RK3588
Pod must run privileged (already set in the DaemonSet)
Node must have label
node-type: rk1:kubectl label node <node> node-type=rk1
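How those checklist items fit together in the DaemonSet pod spec, roughly (field values are illustrative, not copied from the manifest):

```yaml
# Illustrative pod spec fragment: label selector, privileged mode, NPU device
spec:
  nodeSelector:
    node-type: rk1            # only schedule on RK1 (RK3588) nodes
  containers:
    - name: rkllama
      securityContext:
        privileged: true      # required for NPU device access
      volumeMounts:
        - name: rknpu
          mountPath: /dev/rknpu
  volumes:
    - name: rknpu
      hostPath:
        path: /dev/rknpu
```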