Google Research Drops TurboQuant Whitepaper for LLM Compression
Summary
Google Research published a whitepaper introducing TurboQuant, a new compression algorithm that shrinks LLM memory by up to 6x with zero accuracy loss (no dumbing down), while achieving up to 8x speed gains on H100 GPUs. The method is set to be presented at ICLR (International Conference on Learning Representations) 2026.
AI models use massive amounts of memory to store data called “vectors,” essentially the model’s way of understanding meaning. A big bottleneck is the key-value (KV) cache (think of it as the model’s short-term memory during a conversation). It gets bloated fast, slowing everything down and costing a lot of compute.
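To see why the KV cache bloats so fast, here is a back-of-the-envelope size calculation. The model dimensions below are hypothetical (loosely 7B-class), not taken from the whitepaper; the point is that cache size grows linearly with context length.

```python
# Rough KV-cache size for a decoder-only transformer.
# Dimensions are illustrative assumptions, not from the paper.

def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

# fp16 cache for a 32-layer model at a 128k-token context:
size = kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # → 62.5 GiB, for a single sequence
```

Tens of gigabytes per conversation is exactly the kind of overhead a 6x compressor eats into.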
TurboQuant
TurboQuant is a compression algorithm from Google Research that massively shrinks LLM memory with zero accuracy loss, designed to speed up both AI inference and vector search.
TurboQuant is built on two sub-methods:
1. PolarQuant — Instead of storing data using standard X/Y/Z coordinates, it converts vectors into polar coordinates (think “5 blocks at a 37° angle” instead of “3 blocks East, 4 blocks North”). This eliminates the extra memory overhead that traditional compression methods carry.
2. QJL (1-bit trick) — It uses a mathematical technique to shrink high-dimensional data and then reduces each number down to a single sign bit (+1 or -1), creating a high-speed shorthand that requires zero memory overhead.
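Both ideas can be shown on toy numbers. This is an illustration of the two concepts only, not the paper's actual algorithm; the vectors and values are made up.

```python
# Toy illustration of the two ideas above, not TurboQuant itself.
import math

# PolarQuant idea: store (radius, angle) instead of (x, y) coordinates.
x, y = 3.0, 4.0                          # "3 blocks East, 4 blocks North"
r = math.hypot(x, y)                     # radius: 5.0 blocks
theta = math.degrees(math.atan2(y, x))   # ~53.1° from East
# (measured from North instead, the same point is at ~37°)

# QJL idea: keep only the sign (+1/-1) of each coordinate — one bit each.
vec = [0.8, -0.2, 1.5, -0.9]
signs = [1 if v >= 0 else -1 for v in vec]
print(r, round(theta, 1), signs)
```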
TurboQuant runs PolarQuant first for the heavy lifting, then uses QJL (Quantized Johnson-Lindenstrauss) to clean up residual errors.
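The "coarse quantize first, then 1-bit the residual" pattern can be sketched in a few lines. The scalar quantizer and scaling below are illustrative stand-ins, not TurboQuant's actual codecs; the sketch only shows why a cheap second pass over the residual reduces total error.

```python
# Minimal two-stage residual-quantization sketch (stand-in codecs,
# not TurboQuant's actual PolarQuant/QJL implementations).
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(64)

# Stage 1: coarse low-bit quantization (uniform 3-bit stand-in).
scale = np.abs(v).max() / 3
coarse = np.clip(np.round(v / scale), -4, 3) * scale

# Stage 2: 1-bit quantization of the residual — sign plus one
# shared magnitude for the whole vector.
residual = v - coarse
residual_hat = np.sign(residual) * np.abs(residual).mean()

v_hat = coarse + residual_hat
err_one_stage = np.linalg.norm(v - coarse)
err_two_stage = np.linalg.norm(v - v_hat)
print(err_two_stage < err_one_stage)  # the residual stage shrinks the error
```

The second stage costs roughly one extra bit per value but cleans up the error the coarse stage leaves behind, which matches the division of labor the article describes.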
These are the results:
- Compresses the KV cache down to just 3 bits without any retraining or fine-tuning, while maintaining full model accuracy.
- 4-bit TurboQuant achieves up to an 8x speed increase over standard 32-bit models on NVIDIA H100 GPUs.
- Reduces KV memory by at least 6x on long-context tasks with near-zero performance degradation.
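As a sanity check on those figures, the raw bit-width ratios work out as follows. The baseline precisions and the zero-overhead assumption are mine for illustration; the paper's exact accounting (scales, metadata, baseline format) may differ.

```python
# Pure arithmetic behind the compression figures; baselines are assumptions.
for baseline_bits in (32, 16):        # fp32 and fp16 baselines
    for quant_bits in (4, 3):
        ratio = baseline_bits / quant_bits
        print(f"{baseline_bits}-bit -> {quant_bits}-bit: {ratio:.1f}x")
# → 8.0x, 10.7x, 4.0x, 5.3x
```

Against a 32-bit baseline, 3-bit storage alone gives over 10x, comfortably above the "at least 6x" figure even after metadata overhead.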
It makes running large language models significantly cheaper and faster, relevant for anything from Gemini to future AI deployments. The fact that it requires no retraining is a big deal for practical deployment at scale.
Official link to the whitepaper — https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression