# About quantization
## What is quantization?
Large language models require significant memory and expensive GPUs for inference. Quantization is a way to run large models on setups with limited RAM. A model's weights are stored in floating-point data types with different levels of precision: fp32, fp16, etc. A quantized model stores weights with fewer bits and uses low-bit-precision matrix multiplication. This results in some degradation in quality. However, the tradeoff can be positive - i.e. it's possible to get better results from a larger model that has been quantized than from a smaller model that is not quantized.
Read further:
- https://huggingface.co/blog/hf-bitsandbytes-integration
- https://huggingface.co/blog/4bit-transformers-bitsandbytes
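To illustrate where the quality loss comes from, here is a minimal sketch of symmetric 8-bit quantization in plain Python. This is a deliberately simplified scheme, not the exact method used by bitsandbytes or llama.cpp (real quantizers use block-wise scales and other refinements):

```python
def quantize_int8(values):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]
    using a single scale factor derived from the largest magnitude."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quants]

weights = [0.12, -0.53, 0.98, -0.07]
quants, scale = quantize_int8(weights)
restored = dequantize(quants, scale)
# restored is close to weights, but not identical --
# that rounding gap is the source of the quality loss
```

The integers plus one scale take far less memory than the original floats; the price is the small rounding error visible in `restored`.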
## What are the most commonly used quantization types?
TL;DR: if you can, use Q5_K_M or even Q6_K.
If not, try smaller models down to Q3_K_S.
| Code | Comment | Description |
| --- | --- | --- |
| Q4_0 | small, very high quality loss - legacy, prefer using Q3_K_M | |
| Q4_1 | small, substantial quality loss - legacy, prefer using Q3_K_L | |
| Q5_0 | medium, balanced quality - legacy, prefer using Q4_K_M | |
| Q5_1 | medium, low quality loss - legacy, prefer using Q5_K_M | |
| Q2_K | smallest, extreme quality loss - not recommended | New k-quant method. Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors. |
| Q3_K | alias for Q3_K_M | New k-quant method. Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| Q3_K_S | very small, very high quality loss | |
| Q3_K_M | very small, very high quality loss | |
| Q3_K_L | small, substantial quality loss | New k-quant method. Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K |
| Q4_K | alias for Q4_K_M | |
| Q4_K_S | small, significant quality loss | |
| Q4_K_M | medium, balanced quality - recommended | |
| Q5_K | alias for Q5_K_M | |
| Q5_K_S | large, low quality loss - recommended | |
| Q5_K_M | large, very low quality loss - recommended | |
| Q6_K | very large, extremely low quality loss | |
| Q8_0 | very large, extremely low quality loss - not recommended | |
| F16 | extremely large, virtually no quality loss - not recommended | |
| F32 | absolutely huge, lossless - not recommended | |
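The 4- and 5-bit formats in the table store weights in small blocks, each with its own scale factor. The following Python sketch shows that block-wise principle for a symmetric 4-bit quantizer. It is an illustration only, not the exact ggml bit layout (the real formats pack two 4-bit values per byte, use fp16 scales, and the k-quant methods add per-block minimums and mixed precision across tensor types):

```python
BLOCK_SIZE = 32  # ggml quantizes weights in blocks of 32

def quantize_q4_blocks(values):
    """Block-wise symmetric 4-bit quantization: each block of 32 weights
    keeps one float scale plus one integer in [-8, 7] per weight
    (stored here as plain Python ints for clarity)."""
    blocks = []
    for i in range(0, len(values), BLOCK_SIZE):
        block = values[i:i + BLOCK_SIZE]
        scale = max(abs(v) for v in block) / 7 or 1.0  # avoid div-by-zero on all-zero blocks
        quants = [max(-8, min(7, round(v / scale))) for v in block]
        blocks.append((scale, quants))
    return blocks

def dequantize_q4_blocks(blocks):
    """Expand the (scale, quants) blocks back into approximate floats."""
    return [q * scale for scale, quants in blocks for q in quants]
```

Keeping one scale per block of 32 rather than per tensor is what limits the error locally: an outlier weight only degrades the precision of its own block, not of the whole tensor.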
## Links
- Quantization methods available in llama.cpp - https://github.com/ggerganov/llama.cpp/blob/06abf8eebabe086ca4003dee2754ab45032cd3fd/examples/make-ggml.py#L35
- Pull request implementing new quantization methods to llama.cpp - https://github.com/ggerganov/llama.cpp/pull/1684