# About quantization
## What is quantization?
Large language models require significant memory and expensive GPUs for inference. Quantization is a way to run large models on setups with limited RAM. A model's weights are stored in floating-point data types with different levels of precision: fp32, fp16, etc. A quantized model stores weights with fewer bits and uses low-bit-precision matrix multiplication. This results in some degradation in quality. However, the tradeoff can be positive - i.e. it's possible to get better results from a larger model that has been quantized than from a smaller model that is not quantized.
Read further:
- https://huggingface.co/blog/hf-bitsandbytes-integration
- https://huggingface.co/blog/4bit-transformers-bitsandbytes
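To illustrate where the quality loss comes from, here is a minimal sketch of symmetric 8-bit quantization in plain Python. This is a deliberately simplified scheme, not the exact method used by bitsandbytes or llama.cpp (real quantizers use block-wise scales and other refinements):

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]
    using a single scale factor derived from the largest magnitude."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quants]

weights = [0.12, -0.53, 0.98, -0.07]
quants, scale = quantize_int8(weights)
restored = dequantize(quants, scale)
# restored is close to weights, but not identical --
# that rounding gap is the source of the quality loss
```

The integers plus one scale take far less memory than the original floats; the price is the small rounding error visible in `restored`.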
## What are the most commonly used quantization types?
TL;DR: if you can, use Q5_K_M or even Q6_K.
If not, try smaller models down to Q3_K_S.
| Code | Comment | Description |
| --- | --- | --- |
| Q4_0 | small, very high quality loss - legacy, prefer using Q3_K_M | |
| Q4_1 | small, substantial quality loss - legacy, prefer using Q3_K_L | |
| Q5_0 | medium, balanced quality - legacy, prefer using Q4_K_M | |
| Q5_1 | medium, low quality loss - legacy, prefer using Q5_K_M | |
| Q2_K | smallest, extreme quality loss - not recommended | New k-quant method. Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors. |
| Q3_K | alias for Q3_K_M | New k-quant method. Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| Q3_K_S | very small, very high quality loss | |
| Q3_K_M | very small, very high quality loss | |
| Q3_K_L | small, substantial quality loss | New k-quant method. Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| Q4_K | alias for Q4_K_M | |
| Q4_K_S | small, significant quality loss | |
| Q4_K_M | medium, balanced quality - recommended | |
| Q5_K | alias for Q5_K_M | |
| Q5_K_S | large, low quality loss - recommended | |
| Q5_K_M | large, very low quality loss - recommended | |
| Q6_K | very large, extremely low quality loss | |
| Q8_0 | very large, extremely low quality loss - not recommended | |
| F16 | extremely large, virtually no quality loss - not recommended | |
| F32 | absolutely huge, lossless - not recommended | |
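The 4- and 5-bit formats in the table store weights in small blocks, each with its own scale factor. The following Python sketch shows that block-wise principle for a symmetric 4-bit quantizer. It is an illustration only, not the exact ggml bit layout (the real formats pack two 4-bit values per byte, use fp16 scales, and the k-quant methods add per-block minimums and mixed precision across tensor types):

```python
BLOCK_SIZE = 32  # ggml quantizes weights in blocks of 32

def quantize_q4_blocks(values):
    """Block-wise symmetric 4-bit quantization: each block of 32 weights
    keeps one float scale plus one integer in [-8, 7] per weight
    (stored here as plain Python ints for clarity)."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        block = values[i:i + BLOCK_SIZE]
        scale = max(abs(v) for v in block) / 7 or 1.0  # avoid div-by-zero on all-zero blocks
        quants = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((scale, quants))
    return blocks

def dequantize_q4_blocks(blocks):
    """Expand the (scale, quants) blocks back into approximate floats."""
    return [q * scale for scale, quants in blocks for q in quants]
```

Keeping one scale per block of 32 rather than per tensor is what limits the error locally: an outlier weight only degrades the precision of its own block, not of the whole tensor.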
## Links
- Quantization methods available in llama.cpp - https://github.com/ggerganov/llama.cpp/blob/06abf8eebabe086ca4003dee2754ab45032cd3fd/examples/make-ggml.py#L35
- Pull request implementing new quantization methods to llama.cpp - https://github.com/ggerganov/llama.cpp/pull/1684