LLM Quantization Evaluation: Leave No Bits Behind
Quantization is a powerful technique to reduce the computational and memory cost of ML models. In the context of LLMs, it makes it possible to host models that otherwise would simply not fit in memory. In this article I will focus on personal use of LLMs, where in most cases the problem is not latency or throughput at inference time, but the availability of GPU memory.
Just to recall, in order to calculate a lower bound on the memory required to self-host an LLM entirely on the GPU, you have to sum the memory required for the weights of the model and the memory required for the KV cache. For example, considering LLAMA 3.1 8B with weights in BF16, a single sequence of 128K tokens, and the KV cache in FP16, you would need ~32GB of VRAM: the KV cache takes 128K tokens * 2 (K and V) * 32 layers * 8 KV heads * 128 head dim * 2 bytes ≈ 16 GiB, and the weights take 8B parameters * 2 bytes ≈ 14.9 GiB. That much memory is not easy to obtain when you have a very limited budget: most of the cheapest GPUs available on the cloud (e.g. L4, T4, A10G) have only 24GB of VRAM, so in these cases quantization is not optional, it is a necessity.
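As a quick sanity check, here is a minimal sketch of this lower-bound estimate in Python; the default arguments are the LLAMA 3.1 8B values assumed above, and the estimate deliberately ignores activations and framework overhead:

```python
def vram_lower_bound_gib(n_params_b=8, bytes_per_weight=2,
                         seq_len=128 * 1024, n_layers=32,
                         n_kv_heads=8, head_dim=128, kv_bytes=2):
    """Rough lower bound (GiB) = weights + KV cache, ignoring activations and overhead."""
    weights = n_params_b * 1e9 * bytes_per_weight                          # model weights in bytes
    kv_cache = seq_len * 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V for every layer
    return (weights + kv_cache) / 1024**3

print(f"{vram_lower_bound_gib():.1f} GiB")  # ~30.9 GiB for LLAMA 3.1 8B at 128K context in BF16/FP16
```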
Quantization is very useful, but its main drawback is accuracy loss. Like most of you, I have seen plenty of HuggingFace repos full of quantized models without any metric that tries to estimate that accuracy loss. And that is fine, I mean, they do it for free, and evaluation costs.
What most people do is:
- download a quantized version of a model that looks great at full precision on the leaderboard
- use it
The biggest problem with this approach is that quantization is not “order-preserving” with respect to accuracy: a model that is better than another at full precision could be worse once both are quantized. Additionally, the impact of quantization is still being studied; for example, it is well known that LLAMA 3.0 “quantizes” worse than LLAMA 2.0 (https://arxiv.org/pdf/2404.14047), and the reasons for this are not yet fully understood.
In this article I will not show you how to make a full analysis of the impact of quantization on your model, but rather how risky it is not to perform one.
There are a lot of quantization methods generally used for LLMs, such as GPTQ, GGUF, AWQ, … In this article, I will use GGUF for two reasons:
- it is very fast
- it does not require a calibration set (except when using an Importance Matrix)
I will use LLAMA.CPP (https://github.com/ggerganov/llama.cpp) for the quantization and the computation of the metrics of the quantized models.
You should consider it just as an example; the same arguments also apply to the other methods.
I will measure the impact of quantization with a metric that is not the best one for the job, but that is very easy and (more or less) fast to compute and is well implemented in LLAMA.CPP: perplexity (on the wikitext-2 test set).
I will perform the experiments on the base versions of the models that I will consider (no Instruction-Tuned versions). The details of the code that I have used for the quantization and the computation of the perplexity are written at the end of this article.
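For context, here is a minimal sketch of what such a quantize-then-evaluate loop can look like when driven from Python. It assumes a local build of llama.cpp; the llama-quantize and llama-perplexity binary names, the F16 GGUF filename, the wikitext-2 path, and the list of quantization types are assumptions, not necessarily what I used:

```python
import subprocess

QTYPES = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]  # example set of GGUF types
BASE = "Meta-Llama-3.1-8B.F16.gguf"        # full-precision GGUF export (assumed name)
WIKITEXT = "wikitext-2-raw/wiki.test.raw"  # wikitext-2 test file (assumed path)

for qtype in QTYPES:
    out = BASE.replace("F16", qtype)
    # 1. quantize the F16 GGUF to the target type
    subprocess.run(["./llama-quantize", BASE, out, qtype], check=True)
    # 2. compute perplexity of the quantized model on wikitext-2
    subprocess.run(["./llama-perplexity", "-m", out, "-f", WIKITEXT], check=True)
```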
A better approach would be a full analysis of the models on reasoning tasks and datasets (e.g. MMLU, HellaSwag, WinoGrande, …), but such an analysis would cost more, and I have a limited budget :-(. Perplexity is not easily comparable across different models, so I will show how quantization increases the perplexity of each quantized model with respect to its original full-precision version:
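One natural way to express this comparison is the relative perplexity increase over the full-precision baseline; a minimal sketch of that quantity (this is my formulation of the comparison, not necessarily the exact number behind every plot):

```python
def relative_ppl_increase(ppl_quant: float, ppl_full: float) -> float:
    """Percentage increase in perplexity of the quantized model over its full-precision baseline."""
    return 100.0 * (ppl_quant - ppl_full) / ppl_full

# e.g. a jump from 6.0 to 6.6 perplexity is a 10% increase
print(relative_ppl_increase(6.6, 6.0))
```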
I will use these GGUF quantization types:
LLAMA 3.1 8B, LLAMA 3.0 8B, LLAMA 2 7B
With LLAMA 3, Meta released a very powerful model, but wrt LLAMA 2 it was observed that LLAMA 3.0 degrades more when quantized. Citing https://arxiv.org/pdf/2404.14047:
“Our findings indicate that while LLAMA 3.0 still demonstrates superior performance after quantization, the performance degradation associated with quantization is significant and can lead to larger declines”.
They did not include GGUF in the paper, but it is quite easy to perform a similar analysis just considering perplexity. The paper compares only LLAMA 2 and LLAMA 3.0 (it was written before LLAMA 3.1), so in my analysis I also added LLAMA 3.1 to make it more complete:
As expected, these analyses also show that LLAMA 2 7B degrades less than LLAMA 3.0 8B. What is very interesting is that LLAMA 3.1 8B has a curve very similar to LLAMA 3.0 8B: below Q3_K_M it would be better not to quantize LLAMA 3.0 8B or LLAMA 3.1 8B with GGUF. You can if you want, but the loss in accuracy could be huge. The reason behind this discrepancy between LLAMA 2 7B and LLAMA 3.X 8B is not yet well understood; there are multiple hypotheses, and you can find some interesting discussions here:
- https://www.reddit.com/r/LocalLLaMA/comments/1cci5w6/quantizing_llama_3_8b_seems_more_harmful_compared/
- https://www.reddit.com/r/LocalLLaMA/comments/1cetn9z/quantization_seems_to_hurt_the_quality_of_llama_3/
Probably the main reason for this difference is that LLAMA 3.X 8B is trained on far more data than LLAMA 2 7B (15T tokens versus 2T), and, putting it in very simple terms, this leads to a “denser” use of all the available bits to represent the weights. This “overtraining” effect should be more evident for smaller models and less so for bigger ones. This interpretation is quite logical, and following the same argument we should see a smaller drop for larger models (the 70B and 405B versions of LLAMA 3.X), but also with LLAMA 2 70B and LLAMA 3.0 70B we obtain a similar discrepancy:
But maybe the problem is that 15T tokens is still a lot even for the 70B version of LLAMA 3.0. Fortunately, Grimulkan (https://www.reddit.com/user/Grimulkan/) computed perplexity over wikitext also for LLAMA 3.1 405B (graph below), and it seems that LLAMA 3.1 405B shows the same type of drop, which was quite surprising to me. The quantization method used by Grimulkan is different from GGUF, and I do not have the resources to run the same analysis on the 405B version. These results suggest that the hypothesis I mentioned should be analyzed in more depth. Personally, I think that training on a lot of tokens has a non-uniform impact on the “precision” required by the weights, and the analysis should be made locally, layer by layer: it could be that a denser numerical representation is learned only in some layers of the network, and/or that it is learned regardless of the size of the network.
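To give an idea of what such a layer-by-layer analysis could look like, here is a toy sketch that measures, per weight matrix, the relative error introduced by a naive 4-bit round-to-nearest quantization. This is only a proxy, not the GGUF k-quant scheme, and the checkpoint path and flat state-dict layout are assumptions:

```python
import torch

def rtn_quant_error(weight: torch.Tensor, bits: int = 4) -> float:
    """Relative L2 error of round-to-nearest quantization with per-row absmax scaling."""
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    dequant = (weight / scale).round().clamp(-qmax - 1, qmax) * scale
    return ((dequant - weight).norm() / weight.norm()).item()

# Hypothetical usage: check whether some layers are much harder to quantize than others.
state_dict = torch.load("model.pth", map_location="cpu")  # checkpoint path is an assumption
for name, w in state_dict.items():
    if w.ndim == 2:
        print(f"{name}: {rtn_quant_error(w.float()):.4f}")
```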
LLAMA 3.1 8B, GEMMA 2 9B, QWEN2 7B
We have seen in the previous analysis that LLAMA 3.X 8B is strongly impacted by quantization when we go below Q3_K_M, but is this something specific to LLAMA 3.X? How does it compare to other SOTA models, like GEMMA 2 9B and QWEN2 7B? GEMMA 2 9B was pretrained on 8T tokens, while QWEN2 7B used 7T tokens. The results are shown in the figure:
As shown in the figure, GEMMA 2 9B degrades in performance, but less than LLAMA 3.1 8B. Why does this happen? Maybe because it is bigger than LLAMA 3.1 and was trained on fewer tokens? Or maybe because the original precision of GEMMA 2 9B is FP32 while the original precision of LLAMA 3.1 8B is BF16? What is clear from this plot is that GEMMA 2 and QWEN2 can be quantized better than LLAMA 3.1 at Q3_K_M, and this is very interesting. These analyses use perplexity only as a proxy for accuracy on reasoning tasks, so a deeper investigation should be made.
In any case, I hope these analyses show that you should always test, in some way, what you obtain or download after quantization. If you have two models A and B, and A performs better than B on reasoning tasks, it is not safe to assume that A after quantization will still perform better than B after the same type of quantization.
Conclusion
Please, test your quantized model before using it. It takes time, but it could save you from using something that is strongly affected by quantization. To be honest, I think that LLM leaderboards are very, very useful, but it would be really great to have something similar also for quantized models, since most people use them. I said that this article is mainly focused on personal use of LLMs, but quantized versions of LLMs (below Q8) are also used in production or in the development of POCs, so it is a very important subject.
Reproducibility
The weights of the quantized models can be found in:
- https://huggingface.co/fedric95/gemma-2-9b-GGUF
- https://huggingface.co/fedric95/Meta-Llama-3.1-8B-GGUF
- https://huggingface.co/fedric95/Qwen2-7B-GGUF
In these repositories I have added the perplexity of each model. For LLAMA 2 70B I have used the results reported in: https://www.reddit.com/r/LocalLLaMA/comments/197mip0/some_perplexities_for_the_new_2_bit_sota_gguf
For LLAMA 2 7B I have computed the perplexity starting from the weights available at https://huggingface.co/TheBloke/Llama-2-7B-GGUF
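If you want to reproduce a single perplexity number, a minimal sketch of pulling one of the quantized GGUF files from the Hub looks like this; the exact filename inside the repo is an assumption, so check the repo's file list:

```python
from huggingface_hub import hf_hub_download

# Download one quantized GGUF file (filename is an assumption; check the repo's file list)
path = hf_hub_download(
    repo_id="fedric95/Meta-Llama-3.1-8B-GGUF",
    filename="Meta-Llama-3.1-8B.Q4_K_M.gguf",
)
print(path)  # pass this path to llama-perplexity with -m <path> -f wikitext-2-raw/wiki.test.raw
```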
ANNEX
GEMMA 2 9B, GEMMA 2 2B, GEMMA 7B (update 22/08)
Just for completeness, here is a plot that compares GEMMA 2 9B, GEMMA 7B (6T training tokens) and GEMMA 2 2B (2T pretraining tokens).
Acknowledgements
The Story has been written by:
Federico Ricciuti
Feel free to add me on Linkedin:
linkedin.com/in/federico-ricciuti-b490ab59