llama.cpp CUDA Models#
llama.cpp is an OpenAI-compatible LLM inference server with CUDA acceleration for NVIDIA GPUs. It serves GGUF-format models and integrates with Open WebUI alongside the RKLLama NPU backend.
Prerequisites#
1 — Add an NVIDIA GPU node to the inventory#
In hosts.yml, add your GPU machine to extra_nodes with nvidia_gpu_node: true:
extra_nodes:
  hosts:
    my-gpu-node: # ← your GPU node's hostname
      nvidia_gpu_node: true # installs NVIDIA container toolkit
Then run the full provisioning for that node:
ansible-playbook pb_all.yml --tags servers,k3s --limit my-gpu-node
This will:
Install the NVIDIA GPU driver and container toolkit
Configure k3s’s containerd to use the NVIDIA runtime as default
Label the node nvidia.com/gpu.present=true so the device plugin DaemonSet schedules there
Join the node to the K3s cluster
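Once provisioning finishes, the label from the steps above can be checked from any machine with cluster access. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function names gpu_nodes and node_has_gpu_label are hypothetical helpers, not part of the repository.

```shell
# List nodes that carry the GPU label applied during provisioning.
# "-o name" prints one "node/<name>" entry per line.
gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true -o name
}

# Succeed only if the named node is in that list.
# $1 is the node name, e.g. my-gpu-node from the inventory example.
node_has_gpu_label() {
  # -x matches the whole line, -F treats dots in hostnames literally
  gpu_nodes | grep -qxF "node/$1"
}
```

For example, node_has_gpu_label my-gpu-node && echo "labelled" prints the message only when the label is present.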
Find a compatible GGUF model on Hugging Face#
llama.cpp loads any GGUF-format model. A good starting point:
Go to https://huggingface.co/models and search for GGUF.
Look for quantised variants. Q4_K_M is a good balance of quality and VRAM use:

Token     Meaning
Q4_K_M    4-bit quantisation, medium quality (recommended)
Q5_K_M    5-bit, better quality, more VRAM
Q8_0      8-bit, near-lossless, needs ~8 GB VRAM for 7B
F16       Full float16, maximum quality, needs the most VRAM

Note the exact filename you want (e.g. mistral-7b-instruct-v0.2.Q4_K_M.gguf).
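To compare quantisation levels before downloading, the size of the weights can be roughly estimated as parameters × bits per weight ÷ 8. The sketch below assumes approximate bits-per-weight figures that are not taken from this guide, and the function names are hypothetical. It covers the weights only; the context cache and runtime buffers need additional VRAM on top.

```shell
# Approximate bits per weight for each quantisation type.
# These figures are rough assumptions for illustration only.
bits_per_weight() {
  case "$1" in
    Q4_K_M) echo 4.85 ;;
    Q5_K_M) echo 5.7 ;;
    Q8_0)   echo 8.5 ;;
    F16)    echo 16 ;;
    *)      echo "unknown quantisation: $1" >&2; return 1 ;;
  esac
}

# estimate_weights_gb <params_in_billions> <quant>
# Prints the estimated size of the weights in GB (10^9 bytes).
estimate_weights_gb() {
  bpw=$(bits_per_weight "$2") || return 1
  awk -v p="$1" -v b="$bpw" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
```

For example, estimate_weights_gb 7 Q4_K_M gives the weight size for a 7B model at Q4_K_M; the same call with Q8_0 or F16 shows how quickly the requirement grows.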
Download a model#
Use the llamacpp-pull script (installed by the tools role) to search Hugging Face
and download directly to the cluster’s NFS share:
llamacpp-pull mistral 7b
The script will:
Search Hugging Face for matching GGUF repos
Let you pick a repo and a specific .gguf file
Download it via kubectl exec into the llamacpp pod’s /models volume
Offer to activate the model immediately (--set)
Manual download via kubectl exec#
If you prefer to download manually, exec into the llamacpp pod directly:
kubectl get pods -n llamacpp
# -f makes curl exit with an error on HTTP failures instead of saving an error page as the model
kubectl exec -n llamacpp <pod-name> -- \
  curl -fL -o /models/my-model.Q4_K_M.gguf \
  https://huggingface.co/<owner>/<repo>/resolve/main/<filename>.gguf
The download runs inside the pod and writes directly to NFS.
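The manual steps can be wrapped in a few small functions so the pod name and URL do not have to be typed by hand. This is a sketch under assumptions: the function names are hypothetical, the pod is selected with the app=llamacpp label used in the verification commands on this page, and the file is taken from the repo's main revision as in the URL pattern above.

```shell
# hf_resolve_url <owner> <repo> <filename>
# Builds the direct-download URL for a file on a repo's main revision.
hf_resolve_url() {
  printf 'https://huggingface.co/%s/%s/resolve/main/%s\n' "$1" "$2" "$3"
}

# Prints the name of the first pod labelled app=llamacpp.
llamacpp_pod() {
  kubectl get pods -n llamacpp -l app=llamacpp \
    -o jsonpath='{.items[0].metadata.name}'
}

# download_model <owner> <repo> <filename>
# Runs curl inside the pod so the file is written straight to /models.
download_model() {
  pod=$(llamacpp_pod) || return 1
  kubectl exec -n llamacpp "$pod" -- \
    curl -fL -o "/models/$3" "$(hf_resolve_url "$1" "$2" "$3")"
}
```

Keeping the same filename on both sides of the download, as download_model does, means the name saved under /models is the one to put in values.yaml.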
Update the model filename in values.yaml#
After downloading, update llamacpp.model.file in kubernetes-services/values.yaml
to match the filename you downloaded, then commit and push:
git add kubernetes-services/values.yaml
git commit -m "Update llamacpp model to <new filename>"
git push
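Because the value in values.yaml must match the downloaded filename exactly, it can be worth confirming the file exists on the /models volume before committing. A minimal sketch, assuming the container image provides the test utility; model_present is a hypothetical helper name.

```shell
# model_present <pod-name> <filename>
# Succeeds if /models/<filename> exists as a regular file inside the pod.
model_present() {
  kubectl exec -n llamacpp "$1" -- test -f "/models/$2"
}
```

For example, model_present <pod-name> mistral-7b-instruct-v0.2.Q4_K_M.gguf || echo "file not found" flags a typo before ArgoCD rolls out the change.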
ArgoCD will sync the change, the llamacpp pod will restart, and the model will load automatically. Once running, it appears in Open WebUI’s model dropdown under the OpenAI API section.
Verify the model is loaded#
# Check pod is running
kubectl get pods -n llamacpp
# Confirm the model is ready
kubectl logs -n llamacpp -l app=llamacpp --tail=5
# Should end with: "main: server is listening on http://0.0.0.0:8080"
# Test the API directly
curl http://llamacpp.<your-domain>/v1/models
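Loading a large model can take a while after the pod restarts, so a single curl may fail simply because it ran too early. The sketch below polls the same /v1/models endpoint until it answers successfully. The function name wait_for_models and the WAIT_INTERVAL variable are hypothetical, and the default of 30 attempts at 10-second intervals is an arbitrary assumption.

```shell
# wait_for_models <base-url> [attempts]
# Polls <base-url>/v1/models until curl succeeds or attempts run out.
wait_for_models() {
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit status
    if curl -fsS "$1/v1/models" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # WAIT_INTERVAL (seconds between attempts) is a hypothetical override
    sleep "${WAIT_INTERVAL:-10}"
  done
  return 1
}
```

For example, wait_for_models http://llamacpp.<your-domain> && echo "ready" reports once the API responds.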
Adjust GPU memory usage#
If the model fails to load due to insufficient VRAM, reduce gpuLayers in
kubernetes-services/values.yaml. Each offloaded layer uses a roughly equal amount of VRAM;
with a lower value, the remaining layers run on the CPU:
llamacpp:
model:
gpuLayers: 20 # offload 20 layers to GPU, rest on CPU
memoryLimit: "12Gi" # reduce memory limit accordingly
A 7B Q4_K_M model at full offload (gpuLayers: 99) requires approximately 4–5 GB VRAM.
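Since each layer takes a roughly equal share of VRAM, a starting value for gpuLayers can be estimated by dividing the VRAM budget by the per-layer size. This is a rough sketch with a hypothetical function name; it ignores the context cache and runtime buffers, so leave some headroom, and the layer count of the model has to be supplied by the caller (32 is used below purely as an illustration).

```shell
# estimate_gpu_layers <model_gb> <total_layers> <vram_budget_gb>
# Prints how many layers fit in the budget, capped at the model's layer count.
estimate_gpu_layers() {
  awk -v m="$1" -v n="$2" -v v="$3" 'BEGIN {
    per_layer = m / n              # approximate VRAM per layer, in GB
    layers = int(v / per_layer)    # whole layers that fit in the budget
    if (layers > n) layers = n     # cannot offload more layers than exist
    print layers
  }'
}
```

For example, estimate_gpu_layers 4.2 32 3 suggests a value for a 4.2 GB model with 32 layers and 3 GB of spare VRAM; if the model still fails to load, lower the result by a few layers.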