SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
Large language models (LLMs) achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique that has been extensively investigated for LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially at bit-widths below 4. Standard PTQ methods that use group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise find it hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine optimal bit-widths and quantizers for accurate LLM quantization, while aligning bit-width partitions to groups for compact memory usage and fast integer inference. Specifically, the proposed SliM-LLM mainly relies on two novel techniques: (1) Salience-Determined Bit Allocation utilizes the clustering characteristics of the salience distribution to allocate the bit-width of each group, improving the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the quantizer parameters by considering the element-wise salience within each group, balancing the preservation of salient information against the minimization of quantization error. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bit-widths; e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs, and 48
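To make the two techniques above more concrete, here is a minimal sketch, not the authors' implementation: it uses an activation-scaled weight-magnitude proxy for salience, a simple "shift the most and least salient quarter of groups up or down by one bit" allocation rule, and a salience-weighted grid search over the clipping scale. All function names, the salience proxy, and the allocation rule are assumptions for illustration only; the paper's actual formulation may differ.

```python
import numpy as np

def group_salience(weights, act_scale, group_size=128):
    """Per-group salience proxy: activation-scaled |weight|, summed per column group.
    (A hypothetical stand-in for the paper's salience metric.)"""
    scored = np.abs(weights) * act_scale[None, :]            # element-wise salience
    n_groups = weights.shape[1] // group_size
    grouped = scored[:, : n_groups * group_size].reshape(weights.shape[0], n_groups, group_size)
    return grouped.sum(axis=(0, 2))                          # one score per group

def allocate_bits(salience, target_bits=2, low=1, high=3):
    """Salience-determined bit allocation (simplified): the most salient quarter of
    groups gets `high` bits, the least salient quarter gets `low`, so the average
    bit-width stays at `target_bits`."""
    n = len(salience)
    k = n // 4
    order = np.argsort(salience)
    bits = np.full(n, target_bits)
    bits[order[:k]] = low                                     # least salient -> fewer bits
    bits[order[-k:]] = high                                   # most salient  -> more bits
    return bits

def calibrate_group(w_group, sal_group, bits, n_grid=50):
    """Salience-weighted quantizer calibration (simplified): grid-search the clipping
    fraction whose (scale, zero-point) minimizes salience-weighted reconstruction error."""
    qmax = 2 ** int(bits) - 1
    best_err, best = np.inf, None
    for frac in np.linspace(0.5, 1.0, n_grid):
        scale = max(frac * (w_group.max() - w_group.min()) / qmax, 1e-8)
        zero = np.round(-frac * w_group.min() / scale)
        q = np.clip(np.round(w_group / scale + zero), 0, qmax)
        deq = (q - zero) * scale
        err = np.sum(sal_group * (w_group - deq) ** 2)        # errors weighted by salience
        if err < best_err:
            best_err, best = err, (scale, zero)
    return best
```

Under these assumptions, a layer would be quantized group by group: compute per-group salience, allocate bit-widths so the layer-wide average matches the target, then calibrate each group's scale and zero-point with its own salience weights, keeping the mixed precision aligned to groups rather than individual elements.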
Further reading
- Access Paper in arXiv.org