3-Bit KV Cache Quantization for LLMs (TurboQuant-Inspired)
A from-scratch PyTorch benchmark of TurboQuant-inspired 3-bit KV cache quantization, compressing the cache ~4.9× while holding key reconstruction above 0.999 cosine similarity. Measured across the Qwen2.5 family on consumer hardware.
- Primary language: Python · PyTorch
Overview
When an LLM generates text, it caches the key and value tensors for every token it has already seen so it doesn't recompute them at each decoding step. The catch: this KV cache grows linearly with context length, and also with the model's layer count and KV head width, so at long contexts or large batch sizes it can rival or exceed the memory taken by the model weights themselves. That memory wall is often what decides whether a 7B model can hold an 8K-token conversation on a single consumer GPU or Mac.
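The linear growth can be made concrete with a few lines of arithmetic. The sketch below assumes the Qwen2.5-7B attention shape is 28 layers, 4 KV heads (grouped-query attention), and a head dimension of 128, with an fp16 baseline; those values and the helper name `kv_cache_bytes` are assumptions for illustration, not taken from this project's code. Under them, the 8K figure works out to 448 MiB, which agrees with the uncompressed number reported in Results.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: float = 2.0, batch: int = 1) -> float:
    """Size of the KV cache: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Assumed Qwen2.5-7B shape: 28 layers, 4 KV heads, head_dim 128, fp16 (2 bytes per value).
fp16_8k = kv_cache_bytes(28, 4, 128, seq_len=8192)
fp16_32k = kv_cache_bytes(28, 4, 128, seq_len=32768)

print(f"8K context:  {fp16_8k / 2**20:.0f} MiB")   # 448 MiB
print(f"32K context: {fp16_32k / 2**20:.0f} MiB")  # 1792 MiB
```

Doubling the context doubles the cache, while the weights stay fixed, which is why the cache dominates only once contexts or batches get large.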
TurboQuant (Google Research & NYU, ICLR 2026) showed you can quantize that cache down to 3 bits with near-lossless quality. This project takes its core ideas and implements a simplified version from scratch in PyTorch, then benchmarks how much of the memory win actually holds up on the Qwen2.5 model family. The driving question: how aggressively can you compress the cache before attention quality starts to break down?
How it works
Three stages, applied to the cached keys and values:
- Walsh–Hadamard rotation: spreads activation outliers evenly across dimensions so no single channel dominates the quantization range.
- 3-bit scalar quantization: a Lloyd–Max quantizer fit to the (Beta-like) distribution of the rotated values, placing quantization levels where the data actually lives.
- Optional QJL error correction: a toggleable 1-bit pass over the quantization residual that recovers precision lost in the scalar quantization stage. This is the part that keeps inner products (and so attention scores) honest.
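The three stages can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names (`hadamard`, `fit_levels`, `qjl_correction`) are hypothetical, the random sign flip before the rotation is an added assumption, the codebook is fit empirically with Lloyd's algorithm rather than to an analytic Beta density, and the 3-bit codes are left unpacked. Because the keys are synthetic, the printed similarity should not be compared with the figures reported in Results.

```python
import math
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Walsh-Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / math.sqrt(n)

def fit_levels(x: torch.Tensor, bits: int = 3, iters: int = 30) -> torch.Tensor:
    """Lloyd's algorithm in 1-D: alternate nearest-level assignment and centroid update."""
    levels = torch.quantile(x, torch.linspace(0, 1, 2**bits + 2)[1:-1])  # quantile init
    for _ in range(iters):
        idx = (x[:, None] - levels).abs().argmin(dim=1)
        for k in range(levels.numel()):
            if (idx == k).any():
                levels[k] = x[idx == k].mean()
    return levels

def qjl_correction(q: torch.Tensor, resid: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    """QJL-style estimate of <q, resid> using only sign(S @ resid) and the residual norm."""
    signs = torch.sign(resid @ S.T)                # (tokens, m): one bit per projection
    scale = math.sqrt(math.pi / 2) / S.shape[0]
    return scale * resid.norm(dim=-1) * (signs @ (S @ q))

torch.manual_seed(0)
d, tokens = 128, 512
keys = torch.randn(tokens, d)
keys[:, 5] *= 10.0                                 # synthetic outlier channel
query = torch.randn(d)

# Stage 1: normalize, flip signs (assumption), rotate with the Hadamard matrix.
H = hadamard(d)
flip = torch.randint(0, 2, (d,)).float() * 2 - 1
norms = keys.norm(dim=-1, keepdim=True)
rotated = (keys / norms * flip) @ H

# Stage 2: fit 8 levels and store the index of the nearest level per value.
levels = fit_levels(rotated.flatten())
codes = (rotated[..., None] - levels).abs().argmin(dim=-1)   # integers in 0..7

# Offline reconstruction: look up levels, undo the rotation, sign flip, and scaling.
recon = (levels[codes] @ H.T) * flip * norms
cos = torch.nn.functional.cosine_similarity(keys, recon, dim=-1).mean().item()
print(f"mean key cosine similarity: {cos:.4f}")

# Stage 3 (optional): add a 1-bit residual estimate to the attention logits.
S = torch.randn(d, d)                              # one Gaussian projection per dimension
resid = keys - recon
scores_plain = recon @ query
scores_qjl = scores_plain + qjl_correction(query, resid, S)
```

The sign-based estimate is unbiased for the residual inner product, so it removes systematic error from the logits at the cost of extra variance. With only `d` projections that variance is noticeable on any single score, which is one reason to keep the pass toggleable and measure its effect rather than assume it helps.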
Results
~4.9× smaller KV cache with > 0.999 cosine similarity on key reconstruction — near-lossless.
- 7B @ 8K context: 448 MB → 91 MB
- Perplexity impact (3B): +0.58 (estimated offline)
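The reported sizes also imply an effective bit-rate, which is a useful sanity check on the "3-bit" label. The short calculation below assumes the 448 MB baseline stores 16 bits per value; the variable names are illustrative only.

```python
fp16_mb, quant_mb = 448, 91          # sizes reported above for 7B at 8K context

ratio = fp16_mb / quant_mb           # compression ratio
bits_per_value = 16 / ratio          # effective bits per cached value, fp16 baseline assumed
overhead_bits = bits_per_value - 3   # what is stored beyond the 3-bit codes

print(f"compression ratio: {ratio:.2f}x")
print(f"effective bits per value: {bits_per_value:.2f}")
print(f"overhead beyond 3 bits: {overhead_bits:.2f}")
```

A pure 3-bit encoding would give 16 / 3 ≈ 5.3×, so the gap down to ~4.9× corresponds to roughly a quarter of a bit per value of side information. Per-vector metadata such as norms or scales would be a plausible source, though that is a guess rather than something stated here.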
Tested on Qwen2.5-0.5B, 1.5B, 3B, and 7B (7B partially, under memory constraints). Runs on Apple Silicon (MPS) and CPU.
Scope
This is a measurement and analysis tool, and it's deliberately clear about its edges. It quantifies the memory-versus-fidelity trade-off and reconstructs the cache offline to verify quality. It does not yet wire quantized KV into the live inference path, and it makes no wall-clock speedup claims — real latency gains need custom CUDA/Metal kernels, which is a separate body of work. Perplexity figures are approximated offline rather than measured during generation.