ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification

The KV cache stores key and value states from previous tokens to avoid re-computation, yet it demands substantial storage space, especially for long sequences. Adaptive KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing tokens of less importance. However, previous methods of this kind exhibit significant performance degradation at high compression ratios due to inaccuracies in identifying salient tokens. In this paper, we present ZipCache, an accurate and efficient KV cache quantization method for LLMs. First, we construct a strong baseline for quantizing the KV cache: through the proposed channel-separable tokenwise quantization scheme, the memory overhead of quantization parameters is substantially reduced compared to fine-grained groupwise quantization. To enhance the compression ratio, we propose the normalized attention score as an effective metric for identifying salient tokens, accounting for the lower-triangular structure of the attention matrix. Moreover, we develop an efficient approximation method that decouples the saliency metric from full attention scores, enabling compatibility with fast attention implementations such as FlashAttention. Extensive experiments demonstrate that ZipCache achieves superior compression ratios, fast generation speed, and minimal performance loss compared with previous KV cache compression methods. For instance, when evaluating the Mistral-7B model on the GSM8k dataset, ZipCache compresses the KV cache by 4.98× with only a 0.38% drop in accuracy. In terms of efficiency, ZipCache also achieves a 37.3% reduction in prefill-phase latency, a 56.9% reduction in decoding-phase latency, and a 19.8% reduction in GPU memory usage when evaluating the LLaMA3-8B model with an input length of 4096.
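The abstract only names the channel-separable tokenwise quantization baseline; the sketch below gives one plausible reading of it, assuming a shared per-channel scale first absorbs the large magnitude variation across channels and each token is then quantized with a single scale/zero-point pair, so parameter overhead grows with tokens + channels rather than tokens × (channels / group size) as in groupwise quantization. All function and variable names here are illustrative, not taken from the paper.

```python
import torch

def channel_separable_tokenwise_quant(x: torch.Tensor, n_bits: int = 4):
    """Quantize (tokens, channels) KV states with O(tokens + channels) overhead.

    A per-channel scale (shared by all tokens) absorbs channel-wise magnitude
    variation; each token then gets one scale/zero-point pair, instead of one
    pair per group as in fine-grained groupwise quantization.
    """
    # Per-channel normalization, shared across all tokens.
    channel_scale = x.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)  # (1, C)
    x_norm = x / channel_scale
    # Tokenwise asymmetric quantization of the channel-normalized states.
    qmax = 2 ** n_bits - 1
    x_min = x_norm.amin(dim=1, keepdim=True)                           # (T, 1)
    token_scale = (x_norm.amax(dim=1, keepdim=True) - x_min).clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round((x_norm - x_min) / token_scale), 0, qmax)
    return codes.to(torch.uint8), channel_scale, token_scale, x_min

def channel_separable_dequant(codes, channel_scale, token_scale, x_min):
    # Invert: tokenwise dequantization, then undo the channel normalization.
    return (codes.float() * token_scale + x_min) * channel_scale
```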
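The normalized attention score follows from the lower-triangle observation: under a causal mask, token j can only be attended by the queries at positions j and later, so raw accumulated attention systematically favors early tokens. A minimal sketch of such a metric (our naming, not the paper's code):

```python
import torch

def normalized_attention_saliency(attn: torch.Tensor) -> torch.Tensor:
    """Per-token saliency from a causal (lower-triangular) attention matrix.

    attn: (n, n) attention probabilities, rows = queries, cols = keys.
    """
    n = attn.shape[-1]
    accumulated = attn.sum(dim=-2)  # total attention each key token receives
    # Key j is only visible to the (n - j) queries at positions j, ..., n-1,
    # so divide by the attend count to remove the head start of early tokens.
    counts = torch.arange(n, 0, -1, device=attn.device)
    return accumulated / counts
```

Salient tokens can then be kept at higher precision, e.g. via `normalized_attention_saliency(attn).topk(k).indices`, while the remaining tokens are quantized aggressively.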
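The abstract does not spell out the approximation method. One way to decouple the metric from full attention scores, consistent with the description, is to materialize score rows only for a small set of probe queries and estimate per-token saliency from them, while the attention output itself is produced by a fused kernel such as FlashAttention that never exposes the full matrix. A hypothetical sketch:

```python
import torch

def approx_saliency(q: torch.Tensor, k: torch.Tensor, probe_idx: torch.Tensor):
    """Estimate per-token saliency from a few probe query rows only.

    q, k: (n, d) query/key states of one head; probe_idx: (p,) indices of
    probe queries whose attention rows are materialized explicitly. The
    actual attention output can still come from a fused kernel that never
    exposes the full score matrix.
    """
    n, d = k.shape
    scores = q[probe_idx] @ k.transpose(0, 1) / d ** 0.5           # (p, n)
    # Re-apply the causal mask: probe query i only sees keys j <= i.
    mask = torch.arange(n, device=k.device)[None, :] > probe_idx[:, None]
    probs = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
    # Normalize by how many probe queries can actually attend to each key.
    counts = (~mask).sum(dim=0).clamp(min=1)
    return probs.sum(dim=0) / counts
```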
