How to run CodeLlama using llama.cpp
Clone repo
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Download a model
wget https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/resolve/main/phind-codellama-34b-v2.Q3_K_L.gguf -P ~/models/34b/
Tip: You can also use aria2. It's a great download utility that makes full use of your available bandwidth by opening multiple connections. You can install it with Homebrew: brew install aria2
aria2c --dir ~/models/34b --out phind-codellama-34b-v2.Q3_K_L.gguf "https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/resolve/main/phind-codellama-34b-v2.Q3_K_L.gguf"
- The size of the download is 17.8 GB
- You can pause the download with `Ctrl+C`. Both `wget` and `aria2c` can resume an interrupted download; just run the same download command again.
- The model is hosted on Hugging Face: https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF
- This model will require at least ~20 GB of available RAM.
- Please refer to the following page for help choosing a model that fits your setup: How to choose a model
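As a rough back-of-envelope check, you can estimate the RAM you'll need from the model file size. The 10–15% overhead figure below is an assumption, not a guarantee:

```shell
# Rough rule of thumb (assumption): RAM needed ≈ model file size plus ~10-15%
# overhead for the KV cache and runtime buffers.
MODEL_SIZE_TENTHS_GB=178   # 17.8 GB file size, in tenths of a GB for integer math
OVERHEAD_PCT=12            # assumed overhead percentage
NEEDED=$(( MODEL_SIZE_TENTHS_GB + MODEL_SIZE_TENTHS_GB * OVERHEAD_PCT / 100 ))
echo "~$(( NEEDED / 10 )).$(( NEEDED % 10 )) GB of RAM"
```

For the 17.8 GB file this prints ~19.9 GB, consistent with the ~20 GB figure above.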
Run a model
In the llama.cpp repo, run the following:
make clean
This removes artifacts left over from previous builds.
LLAMA_METAL=1 make -j
This compiles llama.cpp with Metal support enabled, so it can use the GPU on macOS.
./server -m ~/models/34b/phind-codellama-34b-v2.Q3_K_L.gguf -ngl 10
This starts the llama.cpp server with the selected model and offloads 10 layers to the GPU.
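The right value for -ngl depends on how much GPU memory you have. A crude way to pick a starting point (all numbers below are assumptions for this particular quantization; check your model's actual layer count and sizes):

```shell
# Crude starting point for -ngl. A ~17.8 GB model with ~48 layers works out
# to roughly 380 MB per layer (both figures are assumptions, tune for your setup).
GPU_MEM_MB=8192   # assumed GPU memory you are willing to dedicate
LAYER_MB=380      # assumed per-layer size for this quantization
echo "try -ngl $(( GPU_MEM_MB / LAYER_MB ))"
```

If the server runs out of GPU memory, lower the value; if you have headroom, raise it until all layers are offloaded.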
Then, open a browser at http://localhost:8080. This is the llama.cpp web UI.
You can write a message and chat with the model.
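If you prefer the command line over the web UI, the server also exposes an HTTP completion endpoint. A minimal sketch, assuming the server from the previous step is running on port 8080:

```shell
# Minimal sketch of calling the llama.cpp server's /completion endpoint.
# Assumes the server is already running locally on port 8080.
PAYLOAD='{"prompt": "Write a Python function that reverses a string.", "n_predict": 128}'
curl -s -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```

The response is a JSON object whose content field holds the generated text.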
NB
These instructions have been tested on macOS, but they should work on other platforms as well (skip the LLAMA_METAL=1 flag on systems without Metal). If in doubt, refer to the llama.cpp repo.