llama.cpp CUDA Models#

llama.cpp is an OpenAI-compatible LLM inference server with CUDA acceleration for NVIDIA GPUs. It serves GGUF-format models and integrates with Open WebUI alongside the RKLLama NPU backend.

Prerequisites#

Add an NVIDIA GPU node to the inventory#

In hosts.yml, add your GPU machine to extra_nodes with nvidia_gpu_node: true:

extra_nodes:
  hosts:
    my-gpu-node:                   # ← your GPU node's hostname
      nvidia_gpu_node: true  # installs NVIDIA container toolkit

Then run the full provisioning for that node:

ansible-playbook pb_all.yml --tags servers,k3s --limit my-gpu-node

This will:

Install the NVIDIA GPU driver and container toolkit
Configure K3s’s containerd to use the NVIDIA runtime by default
Label the node nvidia.com/gpu.present=true so the device plugin DaemonSet schedules there
Join the node to the K3s cluster
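Once the playbook finishes, the outcome of the steps above can be checked from any machine with kubectl access. The following is a minimal sketch: the helper names gpu_nodes and gpu_count are hypothetical, and the second helper assumes the NVIDIA device plugin advertises the usual nvidia.com/gpu resource on the node.

```shell
# Hypothetical helpers, not part of the playbook.

# List nodes carrying the GPU label applied during provisioning.
gpu_nodes() {
  kubectl get nodes -l nvidia.com/gpu.present=true -o name
}

# Print how many GPUs a node advertises as allocatable.
# Assumes the device plugin exposes the nvidia.com/gpu resource.
gpu_count() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

Running gpu_nodes should list my-gpu-node, and gpu_count my-gpu-node should print a non-empty number once the device plugin pod is running on that node.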

Find a compatible GGUF model on Hugging Face#

llama.cpp loads any GGUF-format model. A good starting point:

Go to https://huggingface.co/models and search for GGUF.

Look for quantised variants — Q4_K_M is a good balance of quality and VRAM use:

Filename token	Meaning
`Q4_K_M`	4-bit K-quant, medium variant; recommended balance of quality and VRAM
`Q5_K_M`	5-bit K-quant, medium variant; better quality, more VRAM
`Q8_0`	8-bit; near-lossless, needs ~8 GB VRAM for a 7B model
`F16`	Unquantised float16; maximum quality, needs the most VRAM

Note the exact filename you want (e.g. mistral-7b-instruct-v0.2.Q4_K_M.gguf).
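To judge whether a given quantisation will fit before downloading it, the weight size can be approximated as parameters times bits per weight divided by 8. The sketch below uses a hypothetical helper, estimate_gguf_gb; the bits-per-weight figure is something you supply, and the result excludes the KV cache and runtime overhead, so it is best read as a lower bound.

```shell
# Hypothetical helper: estimate weight size in GB.
#   $1 = parameter count in billions
#   $2 = average bits per weight for the quantisation
# Excludes KV cache and runtime overhead.
estimate_gguf_gb() {
  LC_ALL=C awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}
```

For example, estimate_gguf_gb 7 8.5 prints 7.4 (Q8_0 stores 8.5 bits per weight) and estimate_gguf_gb 7 16 prints 14.0 for F16. Taking roughly 4.85 bits per weight as an approximate figure for Q4_K_M, estimate_gguf_gb 7 4.85 prints 4.2.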

Download a model#

Use the llamacpp-pull script (installed by the tools role) to search Hugging Face and download directly to the cluster’s NFS share:

llamacpp-pull mistral 7b

The script will:

Search Hugging Face for matching GGUF repos
Let you pick a repo and a specific .gguf file
Download it via kubectl exec into the llamacpp pod’s /models volume
Offer to activate the model immediately (--set)

Manual download via kubectl exec#

If you prefer to download manually, exec into the llamacpp pod directly:

kubectl get pods -n llamacpp

# -f makes curl fail on an HTTP error instead of saving the error page as the model file
kubectl exec -n llamacpp <pod-name> -- \
  curl -fL -o /models/my-model.Q4_K_M.gguf \
  https://huggingface.co/<owner>/<repo>/resolve/main/<filename>.gguf

The download runs inside the pod and writes directly to NFS.
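As a small convenience for the manual route, the sketch below builds the download URL from its three parts and lists the pod's /models volume afterwards so you can confirm the file arrived with a plausible size. Both helper names (hf_resolve_url, list_models) are hypothetical, and the owner and repo values in the usage note are placeholders rather than a real repository.

```shell
# Hypothetical helpers for the manual download path.

# Build the Hugging Face download URL: $1 = owner, $2 = repo, $3 = filename.
hf_resolve_url() {
  printf 'https://huggingface.co/%s/%s/resolve/main/%s\n' "$1" "$2" "$3"
}

# List the models volume inside the pod: $1 = pod name.
list_models() {
  kubectl exec -n llamacpp "$1" -- ls -lh /models
}
```

For instance, hf_resolve_url example-owner example-repo model.Q4_K_M.gguf prints https://huggingface.co/example-owner/example-repo/resolve/main/model.Q4_K_M.gguf. A file of only a few kilobytes in the list_models output usually means an error page was saved instead of the model.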

Update the model filename in values.yaml#

After downloading, update llamacpp.model.file in kubernetes-services/values.yaml to match the filename you downloaded, then commit and push:

git add kubernetes-services/values.yaml
git commit -m "Update llamacpp model to <new filename>"
git push

ArgoCD will sync the change, the llamacpp pod will restart, and the model will load automatically. Once running, it appears in Open WebUI’s model dropdown under the OpenAI API section.
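If you would rather block until the restarted pod is ready than poll by hand, kubectl wait can do that. This is a sketch only: wait_for_llamacpp is a hypothetical helper, the app=llamacpp label is the one used for the log command in this guide, and the 300s default timeout is an arbitrary assumption that large models on slow storage may exceed.

```shell
# Hypothetical helper: wait until pods labelled app=llamacpp report Ready.
#   $1 = optional timeout (default 300s, an arbitrary choice)
wait_for_llamacpp() {
  kubectl wait --for=condition=Ready pod -l app=llamacpp -n llamacpp --timeout="${1:-300s}"
}
```

Call it as wait_for_llamacpp or wait_for_llamacpp 600s. Note that if ArgoCD has not yet started the rollout, the old pod may still be Ready and the command will return immediately, so it is worth checking the pod age afterwards.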

Verify the model is loaded#

# Check pod is running
kubectl get pods -n llamacpp

# Confirm the model is ready
kubectl logs -n llamacpp -l app=llamacpp --tail=5
# Should end with: "main: server is listening on http://0.0.0.0:8080"

# Test the API directly
curl http://llamacpp.<your-domain>/v1/models
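The /v1/models response is JSON, which is hard to read at a glance. The sketch below pulls out just the model ids without requiring jq. The helper name model_ids is hypothetical, and it assumes the response follows the OpenAI list shape, with each model carrying an "id" field.

```shell
# Hypothetical helper: read a /v1/models JSON response on stdin and print one id per line.
# Assumes the OpenAI-style shape {"data":[{"id":"..."}]}.
model_ids() {
  grep -oE '"id"[[:space:]]*:[[:space:]]*"[^"]*"' | sed -E 's/.*"([^"]*)"$/\1/'
}

# Usage (substitute your own domain):
#   curl -s http://llamacpp.<your-domain>/v1/models | model_ids
```

The printed id is what should appear in Open WebUI's model dropdown.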

Adjust GPU memory usage#

If the model fails to load because the GPU runs out of VRAM, reduce gpuLayers in kubernetes-services/values.yaml. Each offloaded layer uses a roughly equal share of VRAM, and any layers that are not offloaded run on the CPU:

llamacpp:
  model:
    gpuLayers: 20    # offload 20 layers to GPU, rest on CPU
    memoryLimit: "12Gi"  # reduce memory limit accordingly

A 7B Q4_K_M model at full offload (gpuLayers: 99) requires approximately 4–5 GB VRAM.
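Because each layer takes a roughly equal share of VRAM, a starting value for gpuLayers can be estimated by scaling the layer count by the fraction of the model that fits. The sketch below uses a hypothetical helper, layers_that_fit. It ignores the KV cache and other runtime overhead, so leave some headroom in the VRAM budget you pass in; the layer count is something you look up for your model (32 is typical for 7B models such as Mistral 7B).

```shell
# Hypothetical helper: estimate how many layers fit in a VRAM budget.
#   $1 = VRAM budget for weights in GB
#   $2 = model weight size in GB
#   $3 = number of layers in the model
layers_that_fit() {
  LC_ALL=C awk -v v="$1" -v m="$2" -v n="$3" 'BEGIN {
    k = int(v / m * n)      # layers whose weights fit in the budget
    if (k > n) k = n        # cannot offload more layers than the model has
    if (k < 0) k = 0
    print k
  }'
}
```

For example, layers_that_fit 3 4.2 32 prints 22, suggesting gpuLayers: 22 as a first attempt for a 4.2 GB model with 3 GB of VRAM to spare, while layers_that_fit 8 4.2 32 prints 32, meaning full offload should fit.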