Google Research Squeezes the KV Cache to 3 Bits With TurboQuant, an ICLR 2026 Breakthrough
A two-stage compression algorithm from Google Research cuts the KV cache 6x and accelerates attention 8x on H100s, with no fine-tuning and no measurable accuracy loss.
A new algorithm out of Google Research is rewriting what is possible at the memory bottleneck of large language model inference. TurboQuant, presented at ICLR 2026, compresses the key-value (KV) cache, the per-token store of keys and values that attention re-reads at every decoding step, down to roughly three bits per value, delivers at least a 6x reduction in KV memory on needle-in-haystack benchmarks, and runs up to 8x faster than 32-bit attention on H100 GPUs. Crucially, it does all of this without any retraining or fine-tuning of the underlying model.
The KV cache is the silent cost driver of every long-context model on the market. Each new token forces the system to store and re-read growing tensors of keys and values, and at long contexts that memory pressure, not the weights themselves, is what limits throughput and drives up serving costs. Previous quantization techniques shaved bits off those tensors, but they typically required calibration data and gave up accuracy on long-context tasks. TurboQuant's pitch is that you no longer have to trade accuracy for the memory savings.
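To see the scale of the problem, a back-of-envelope calculation is enough. The sketch below assumes Llama-3.1-8B's published configuration (32 layers, 8 grouped-query KV heads, head dimension 128) and fp16 storage; the figures are illustrative arithmetic, not numbers reported in the paper.

```python
# Back-of-envelope KV-cache footprint, assuming Llama-3.1-8B's published
# architecture (32 layers, 8 grouped-query KV heads, head dim 128) and fp16.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

# Keys and values for every layer and KV head, stored per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 128 KiB

for context in (8_192, 32_768, 131_072):
    gib = bytes_per_token * context / 2**30
    print(f"{context:>7} tokens -> {gib:5.1f} GiB of KV cache")
# At 131,072 tokens the cache reaches ~16 GiB, comparable to the fp16 weights
# of the 8B model itself; at ~3 bits per value it would shrink to roughly 3 GiB.
```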
The method works in two stages. First, PolarQuant rotates each key and value vector with a random rotation matrix and converts it to polar coordinates, redistributing variance evenly across dimensions so that a fixed quantization grid works without calibration. Then a Quantized Johnson-Lindenstrauss correction, called QJL, encodes the remaining quantization error as single sign bits, at effectively zero additional memory cost. The authors, including research scientist Amir Zandieh and Google Fellow Vahab Mirrokni, report that the combination preserves near-lossless quality across LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval on Gemma, Mistral, and Llama-3.1-8B-Instruct.
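For intuition, here is a minimal NumPy sketch of those two stages as the article describes them, not the paper's actual kernels. The function names, bit widths, uniform radius grid, and sketch width are assumptions chosen for a readable toy demo, and the sign-bit correction uses the standard Gaussian sign-sketch estimator for inner products.

```python
# Toy sketch of the rotate-then-polar-quantize idea plus a sign-bit residual
# correction; parameters and grids are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, m, n_keys = 128, 512, 200   # vector dim, sign-sketch width, number of test keys

def random_rotation(d):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(x, rot, theta_bits=3, r_bits=2, r_max=4.0):
    """Stage 1 (PolarQuant-style): rotate, quantize coordinate pairs on a fixed
    polar grid, and return the dequantized vector."""
    pairs = (rot @ x).reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    # Fixed, calibration-free uniform grids over angle and radius.
    t_q = np.round((theta + np.pi) / (2 * np.pi) * (2**theta_bits - 1))
    r_q = np.round(np.clip(r / r_max, 0.0, 1.0) * (2**r_bits - 1))
    theta_hat = t_q / (2**theta_bits - 1) * 2 * np.pi - np.pi
    r_hat = r_q / (2**r_bits - 1) * r_max
    pairs_hat = np.stack([r_hat * np.cos(theta_hat), r_hat * np.sin(theta_hat)], axis=1)
    return rot.T @ pairs_hat.reshape(-1)

def qjl_encode(residual, sketch):
    """Stage 2 (QJL-style): keep only sign bits of a JL projection of the residual."""
    return np.sign(sketch @ residual), np.linalg.norm(residual)

def qjl_dot(query, signs, res_norm, sketch):
    """Sign-sketch estimator: E[(Sq) . sign(Sr)] = m * sqrt(2/pi) * <q,r> / ||r||."""
    return np.sqrt(np.pi / 2) * res_norm / sketch.shape[0] * (sketch @ query) @ signs

rot = random_rotation(d)
sketch = rng.standard_normal((m, d))
query = rng.standard_normal(d)
keys = rng.standard_normal((n_keys, d))

err_polar, err_corrected = [], []
for k in keys:
    k_hat = polar_quantize(k, rot)
    signs, res_norm = qjl_encode(k - k_hat, sketch)
    exact = query @ k
    err_polar.append(exact - query @ k_hat)
    err_corrected.append(exact - (query @ k_hat + qjl_dot(query, signs, res_norm, sketch)))

print("RMS logit error, polar grid only :", np.sqrt(np.mean(np.square(err_polar))))
print("RMS logit error, with QJL signs  :", np.sqrt(np.mean(np.square(err_corrected))))
```

In this toy setting the sign-bit term noticeably reduces the attention-logit error that the coarse polar grid leaves behind, which matches the division of labor described above; the sketch width here is deliberately generous for a visible effect, whereas the paper's accounting puts the correction's memory overhead near zero.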
The performance numbers are aggressive. Four-bit TurboQuant delivers up to 8x faster attention logit computation on H100 accelerators versus 32-bit baselines, and the technique extends naturally to vector search: on the GloVe-200 benchmark, TurboQuant beats both product quantization (PQ) and the recent RaBitQ quantizer on 1@k recall, suggesting it could underpin retrieval systems and semantic search at scale alongside its role inside the attention block.
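For context on the metric, 1@k recall measures how often a query's true nearest neighbor survives into the top k candidates ranked by the approximate scores. A toy computation, with random stand-in vectors rather than GloVe-200 and synthetic noise standing in for quantization error, looks like this:

```python
# Toy 1@k recall: fraction of queries whose true nearest neighbor (by exact
# inner product) appears in the top-k list ranked by approximate scores.
import numpy as np

rng = np.random.default_rng(1)
base = rng.standard_normal((10_000, 200))     # random stand-ins for GloVe-200 vectors
queries = rng.standard_normal((100, 200))

exact = queries @ base.T                                   # exact inner-product scores
approx = exact + 2.0 * rng.standard_normal(exact.shape)    # pretend quantization noise

k = 10
true_nn = exact.argmax(axis=1)                       # each query's true nearest neighbor
top_k = np.argpartition(-approx, k, axis=1)[:, :k]   # top-k by approximate score
recall = np.mean([nn in row for nn, row in zip(true_nn, top_k)])
print(f"1@{k} recall: {recall:.2f}")
```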
Adoption has been almost instantaneous. The paper was published on March 24, and within weeks independent developers had built working implementations of the math in PyTorch, in MLX for Apple Silicon, and in a llama.cpp CUDA backend that advertises a 5.2x KV memory reduction with near-lossless quality. For inference vendors squeezing margins out of GPU clusters, a free 6x cut to KV memory with no model surgery is the kind of result that gets reflected in pricing within a quarter. It also strengthens Google's case that some of the most consequential infrastructure work in AI right now is still happening in research rather than in product launches.