# llama.cpp CUDA Models [llama.cpp](https://github.com/ggml-org/llama.cpp) is an OpenAI-compatible LLM inference server with CUDA acceleration for NVIDIA GPUs. It serves GGUF-format models and integrates with Open WebUI alongside the RKLLama NPU backend. ## Prerequisites ### 1 — Add an NVIDIA GPU node to the inventory In `hosts.yml`, add your GPU machine to `extra_nodes` with `nvidia_gpu_node: true`: ```yaml extra_nodes: hosts: my-gpu-node: # ← your GPU node's hostname nvidia_gpu_node: true # installs NVIDIA container toolkit ``` Then run the full provisioning for that node: ```bash ansible-playbook pb_all.yml --tags servers,k3s --limit my-gpu-node ``` This will: - Install the NVIDIA GPU driver and container toolkit - Configure k3s's containerd to use the NVIDIA runtime as default - Label the node `nvidia.com/gpu.present=true` so the device plugin DaemonSet schedules there - Join the node to the K3s cluster ### 2 — Configure your NFS share llama.cpp stores models on an NFS PersistentVolume. Keep GGUF models in a **separate subdirectory** from RKLLama — the two backends use incompatible model formats. Edit **`kubernetes-services/values.yaml`**: ```yaml llamacpp: nfs: server: 192.168.1.3 # ← your NFS server IP path: /bigdisk/LMModels/cuda # ← separate from rkllama's path model: file: "mistral-7b-instruct-v0.2.Q4_K_M.gguf" # ← model to load gpuLayers: 99 # offload all layers; reduce if VRAM is insufficient contextSize: 8192 parallel: 4 memoryLimit: "24Gi" ``` :::{warning} Do not share the NFS path with RKLLama. RKLLama uses `.rkllm` files (Rockchip NPU format) and llama.cpp uses `.gguf` files (GGUF/GGML format). They cannot use each other's models. ::: Commit and push the change; ArgoCD will reconcile the PersistentVolume automatically. ## Find a compatible GGUF model on Hugging Face llama.cpp loads any GGUF-format model. A good starting point: 1. Go to and search for `GGUF`. 2. Look for quantised variants — Q4_K_M is a good balance of quality and VRAM use: | Token | Meaning | |---|---| | `Q4_K_M` | 4-bit quantisation, medium quality — recommended | | `Q5_K_M` | 5-bit — better quality, more VRAM | | `Q8_0` | 8-bit — near-lossless, needs ~8GB VRAM for 7B | | `F16` | Full float16 — maximum quality, needs most VRAM | 3. Note the **exact filename** you want (e.g. `mistral-7b-instruct-v0.2.Q4_K_M.gguf`). ## Download a model Use the `llamacpp-pull` script (installed by the `tools` role) to search HuggingFace and download directly to the cluster's NFS share: ```bash llamacpp-pull mistral 7b ``` The script will: 1. Search HuggingFace for matching GGUF repos 2. Let you pick a repo and a specific `.gguf` file 3. Download it via `kubectl exec` into the llamacpp pod's `/models` volume 4. Offer to activate the model immediately (`--set`) ### Manual download via kubectl exec If you prefer to download manually, exec into the llamacpp pod directly: ```bash kubectl get pods -n llamacpp kubectl exec -n llamacpp -- \ curl -L -o /models/my-model.Q4_K_M.gguf \ https://huggingface.co///resolve/main/.gguf ``` The download runs inside the pod and writes directly to NFS. ## Update the model filename in values.yaml After downloading, update `llamacpp.model.file` in `kubernetes-services/values.yaml` to match the filename you downloaded, then commit and push: ```bash git add kubernetes-services/values.yaml git commit -m "Update llamacpp model to " git push ``` ArgoCD will sync the change, the llamacpp pod will restart, and the model will load automatically. Once running, it appears in Open WebUI's model dropdown under the OpenAI API section. ## Verify the model is loaded ```bash # Check pod is running kubectl get pods -n llamacpp # Confirm the model is ready kubectl logs -n llamacpp -l app=llamacpp --tail=5 # Should end with: "main: server is listening on http://0.0.0.0:8080" # Test the API directly curl http://llamacpp./v1/models ``` ## Adjust GPU memory usage If the model fails to load due to insufficient VRAM, reduce `gpuLayers` in `kubernetes-services/values.yaml`. Each layer offloads roughly equal amounts of VRAM; setting a lower value causes remaining layers to run on CPU: ```yaml llamacpp: model: gpuLayers: 20 # offload 20 layers to GPU, rest on CPU memoryLimit: "12Gi" # reduce memory limit accordingly ``` A 7B Q4_K_M model at full offload (`gpuLayers: 99`) requires approximately 4–5 GB VRAM.