llama.cpp CUDA Models#

llama.cpp is an OpenAI-compatible LLM inference server with CUDA acceleration for NVIDIA GPUs. It serves GGUF-format models and integrates with Open WebUI alongside the RKLLama NPU backend.

Prerequisites#

1 — Add an NVIDIA GPU node to the inventory#

In hosts.yml, add your GPU machine to extra_nodes with nvidia_gpu_node: true:

extra_nodes:
  hosts:
    my-gpu-node:                   # ← your GPU node's hostname
      nvidia_gpu_node: true  # installs NVIDIA container toolkit

Then run the full provisioning for that node:

ansible-playbook pb_all.yml --tags servers,k3s --limit my-gpu-node

This will:

  • Install the NVIDIA GPU driver and container toolkit

  • Configure k3s’s containerd to use the NVIDIA runtime as default

  • Label the node nvidia.com/gpu.present=true so the device plugin DaemonSet schedules there

  • Join the node to the K3s cluster

2 — Configure your NFS share#

llama.cpp stores models on an NFS PersistentVolume. Keep GGUF models in a separate subdirectory from RKLLama — the two backends use incompatible model formats.

Edit kubernetes-services/values.yaml:

llamacpp:
  nfs:
    server: 192.168.1.3           # ← your NFS server IP
    path: /bigdisk/LMModels/cuda  # ← separate from rkllama's path
  model:
    file: "mistral-7b-instruct-v0.2.Q4_K_M.gguf"  # ← model to load
    gpuLayers: 99      # offload all layers; reduce if VRAM is insufficient
    contextSize: 8192
    parallel: 4
    memoryLimit: "24Gi"

Warning

Do not share the NFS path with RKLLama. RKLLama uses .rkllm files (Rockchip NPU format) and llama.cpp uses .gguf files (GGUF/GGML format). They cannot use each other’s models.

Commit and push the change; ArgoCD will reconcile the PersistentVolume automatically.

Find a compatible GGUF model on Hugging Face#

llama.cpp loads any GGUF-format model. A good starting point:

  1. Go to https://huggingface.co/models and search for GGUF.

  2. Look for quantised variants — Q4_K_M is a good balance of quality and VRAM use:

    Token

    Meaning

    Q4_K_M

    4-bit quantisation, medium quality — recommended

    Q5_K_M

    5-bit — better quality, more VRAM

    Q8_0

    8-bit — near-lossless, needs ~8GB VRAM for 7B

    F16

    Full float16 — maximum quality, needs most VRAM

  3. Note the exact filename you want (e.g. mistral-7b-instruct-v0.2.Q4_K_M.gguf).

Download a model#

Use the llamacpp-pull script (installed by the tools role) to search HuggingFace and download directly to the cluster’s NFS share:

llamacpp-pull mistral 7b

The script will:

  1. Search HuggingFace for matching GGUF repos

  2. Let you pick a repo and a specific .gguf file

  3. Download it via kubectl exec into the llamacpp pod’s /models volume

  4. Offer to activate the model immediately (--set)

Manual download via kubectl exec#

If you prefer to download manually, exec into the llamacpp pod directly:

kubectl get pods -n llamacpp

kubectl exec -n llamacpp <pod-name> -- \
  curl -L -o /models/my-model.Q4_K_M.gguf \
  https://huggingface.co/<owner>/<repo>/resolve/main/<filename>.gguf

The download runs inside the pod and writes directly to NFS.

Update the model filename in values.yaml#

After downloading, update llamacpp.model.file in kubernetes-services/values.yaml to match the filename you downloaded, then commit and push:

git add kubernetes-services/values.yaml
git commit -m "Update llamacpp model to <new filename>"
git push

ArgoCD will sync the change, the llamacpp pod will restart, and the model will load automatically. Once running, it appears in Open WebUI’s model dropdown under the OpenAI API section.

Verify the model is loaded#

# Check pod is running
kubectl get pods -n llamacpp

# Confirm the model is ready
kubectl logs -n llamacpp -l app=llamacpp --tail=5
# Should end with: "main: server is listening on http://0.0.0.0:8080"

# Test the API directly
curl http://llamacpp.<your-domain>/v1/models

Adjust GPU memory usage#

If the model fails to load due to insufficient VRAM, reduce gpuLayers in kubernetes-services/values.yaml. Each layer offloads roughly equal amounts of VRAM; setting a lower value causes remaining layers to run on CPU:

llamacpp:
  model:
    gpuLayers: 20    # offload 20 layers to GPU, rest on CPU
    memoryLimit: "12Gi"  # reduce memory limit accordingly

A 7B Q4_K_M model at full offload (gpuLayers: 99) requires approximately 4–5 GB VRAM.