Researchers from Nvidia have developed RateQuant, a novel approach to mixed-precision quantization of key-value (KV) caches used in large language models (LLMs). The paper, titled "RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory," was submitted to arXiv on April 22, 2026. The research addresses the memory bottleneck caused by the KV cache, which grows linearly with sequence length during LLM generation.
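To make the scale of that bottleneck concrete, the following back-of-the-envelope sketch computes KV cache size for an 8B-class model with grouped-query attention. The layer, head, and dimension values are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope KV cache size. The dimensions below are
# illustrative assumptions for an 8B-class model, not paper values.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    # 2x accounts for the separate key and value tensors in each layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(36, 8, 128, 32_768, 2)      # 16-bit cache at 32K context
avg_2p5 = fp16 * (2.5 / 16)                       # same cache at 2.5 average bits

print(f"FP16 KV cache:    {fp16 / 2**30:.2f} GiB")     # ~4.50 GiB, linear in seq_len
print(f"2.5-bit KV cache: {avg_2p5 / 2**30:.2f} GiB")  # ~0.70 GiB
```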
The study highlights that current quantizers typically assign the same bit-width to every attention head, overlooking the fact that heads differ in importance. RateQuant instead allocates more bits to important heads and fewer to the rest. However, different quantizers exhibit different distortion curves, so a bit allocation tuned to one quantizer's distortion behavior can degrade performance when paired with another. The authors term this issue "distortion model mismatch."
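The paper's exact distortion model is not given in this summary. As a hedged illustration, the sketch below assumes the classic exponential form D(b) ≈ a * 2^(-2b) from rate-distortion theory and fits the per-head scale a on calibration data, with a plain uniform quantizer standing in for KIVI or QuaRot; all function names here are stand-ins, not the paper's API.

```python
import numpy as np

def uniform_quantize(x, bits):
    # Simple per-tensor uniform quantizer, a stand-in for KIVI/QuaRot.
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def fit_distortion_scale(head_cache, bit_widths=(2, 3, 4, 6, 8)):
    # Fit a in D(b) ~= a * 2**(-2b) from measured MSE at a few bit-widths.
    mses = np.array([np.mean((head_cache - uniform_quantize(head_cache, b)) ** 2)
                     for b in bit_widths])
    bits = np.array(bit_widths, dtype=float)
    # log2 D = log2 a - 2b, so log2 a is estimated as mean(log2 D + 2b).
    return 2.0 ** np.mean(np.log2(mses + 1e-12) + 2.0 * bits)

# Usage on synthetic "calibration" activations for one head.
rng = np.random.default_rng(0)
a_hat = fit_distortion_scale(rng.normal(size=4096))
```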
RateQuant resolves this mismatch by fitting a per-quantizer distortion model on a small calibration set, then solving the bit-allocation problem with reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, RateQuant reduced KIVI's perplexity from 49.3 to 14.9 (a 70% reduction) and improved QuaRot's perplexity by 6.6 points. Calibration takes 1.6 seconds on a single GPU, and the method adds no overhead during inference.
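Reverse waterfilling itself is standard. Under the same assumed D(b) = a * 2^(-2b) model, the allocation that equalizes marginal distortion across heads is found by bisecting on a distortion "water level," clipping heads to zero bits once their distortion already sits below it. The following is a sketch of that step, not the paper's implementation.

```python
import numpy as np

def reverse_waterfill(a, avg_bits, iters=60):
    """Allocate per-head bit-widths under the model D_i(b) = a_i * 2**(-2b).

    Reverse waterfilling: each head is quantized down to a common distortion
    water level theta, giving b_i = max(0, 0.5 * log2(a_i / theta)); heads
    whose distortion scale a_i is already below theta get 0 bits. Bisect on
    theta until the average-bit budget is met. All a_i must be positive.
    """
    a = np.asarray(a, dtype=float)
    budget = avg_bits * a.size

    def total_bits(theta):
        return np.maximum(0.0, 0.5 * np.log2(a / theta)).sum()

    lo, hi = a.min() * 2.0 ** -64, a.max()  # bracket the water level
    for _ in range(iters):
        mid = np.sqrt(lo * hi)              # geometric bisection
        if total_bits(mid) > budget:
            lo = mid                        # too many bits: raise the level
        else:
            hi = mid
    theta = np.sqrt(lo * hi)
    return np.maximum(0.0, 0.5 * np.log2(a / theta))

# Example: heads with larger fitted distortion scales receive more bits.
bits = reverse_waterfill([4.0, 1.0, 0.25, 0.01], avg_bits=2.5)
print(np.round(bits, 2), "avg:", round(float(bits.mean()), 2))
```

A deployed allocator would presumably round these continuous bit-widths to whatever integer precisions the underlying quantizer supports; the continuous solution is the rate-distortion optimum that such rounding would start from.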
The paper's authors are Fei Zuo, Zikang Zhou, Hao Cong, Xiaoyan Xi, and Ho Fai Leung. The research falls under the categories of Machine Learning (cs.LG), Computation and Language (cs.CL), and Information Theory (cs.IT).