How to run CodeLlama using llama.cpp
Clone repo
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
Download a model
wget https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/resolve/main/phind-codellama-34b-v2.Q3_K_L.gguf -P ~/models/34b/
Tip: You can also use aria2. It's a great download utility that makes full use of your available bandwidth by opening multiple connections. You can install it with Homebrew: brew install aria2
aria2c --dir ~/models/34b --out phind-codellama-34b-v2.Q3_K_L.gguf "https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF/resolve/main/phind-codellama-34b-v2.Q3_K_L.gguf"
- The size of the download is 17.8 GB
- You can pause the download with `Ctrl+C`. Both `wget` and `aria2c` can resume an interrupted download; just run the same download command again.
- The model is hosted on Hugging Face: https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF
- This model will require at least ~20 GB of available RAM.
- Please refer to the following page for help choosing a model that fits your setup: How to choose a model
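As a rough back-of-envelope check, you can estimate the RAM you'll need from the model file size. The 10–15% overhead figure below is an assumption, not a guarantee:

```shell
# Rough rule of thumb (assumption): RAM needed ≈ model file size plus ~10-15%
# overhead for the KV cache and runtime buffers.
MODEL_SIZE_TENTHS_GB=178   # 17.8 GB file size, in tenths of a GB for integer math
OVERHEAD_PCT=12            # assumed overhead percentage
NEEDED=$(( MODEL_SIZE_TENTHS_GB + MODEL_SIZE_TENTHS_GB * OVERHEAD_PCT / 100 ))
echo "~$(( NEEDED / 10 )).$(( NEEDED % 10 )) GB of RAM"
```

For the 17.8 GB file this prints ~19.9 GB, consistent with the ~20 GB figure above.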
Run a model
In the llama.cpp repo, run the following:
make clean
This removes artifacts left over from previous builds.
LLAMA_METAL=1 make -j
This compiles llama.cpp with Metal support enabled, so it can use the GPU on macOS.
./server -m ~/models/34b/phind-codellama-34b-v2.Q3_K_L.gguf -ngl 10
This starts the llama.cpp server with the selected model and offloads 10 layers to the GPU.
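The right value for -ngl depends on how much GPU memory you have. A crude way to pick a starting point (all numbers below are assumptions for this particular quantization; check your model's actual layer count and sizes):

```shell
# Crude starting point for -ngl. A ~17.8 GB model with ~48 layers works out
# to roughly 380 MB per layer (both figures are assumptions, tune for your setup).
GPU_MEM_MB=8192   # assumed GPU memory you are willing to dedicate
LAYER_MB=380      # assumed per-layer size for this quantization
echo "try -ngl $(( GPU_MEM_MB / LAYER_MB ))"
```

If the server runs out of GPU memory, lower the value; if you have headroom, raise it until all layers are offloaded.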
Then, open a browser at http://localhost:8080. This is the llama.cpp web UI.
You can write a message and chat with the model.
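If you prefer the command line over the web UI, the server also exposes an HTTP completion endpoint. A minimal sketch, assuming the server from the previous step is running on port 8080:

```shell
# Minimal sketch of calling the llama.cpp server's /completion endpoint.
# Assumes the server is already running locally on port 8080.
PAYLOAD='{"prompt": "Write a Python function that reverses a string.", "n_predict": 128}'
curl -s -X POST http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```

The response is a JSON object whose content field holds the generated text.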
NB
These instructions have been tested on macOS, but they should work on other platforms as well (skip the LLAMA_METAL=1 flag on systems without Metal). If in doubt, refer to the llama.cpp repo.