
Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression


Full paper · available on arxiv.org


Autoregressive language models rely on a Key-Value (KV) Cache to avoid re-computing past hidden states during generation. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection.

Unlike many alternatives, Q-Filters is compatible with FlashAttention. Experimental results show that Q-Filters is competitive with attention-based compression methods on retrieval tasks while consistently outperforming efficient compression schemes in generation setups. Q-Filters achieves 99% accuracy on the needle-in-a-haystack task with 32x compression and reduces the degradation in generation perplexity by up to 65% compared to Streaming-LLM.
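To make the filtering idea concrete, here is a minimal sketch of scoring cached keys against a single fixed direction and keeping only the top-scoring Key-Value pairs. It is an illustration under stated assumptions, not the paper's implementation: the function name `compress_kv`, the `direction` vector (assumed to be precomputed once per attention head), and the convention that a higher projection means "keep" are all hypothetical choices made for this example.

```python
import numpy as np


def compress_kv(keys, values, direction, keep):
    """Keep the `keep` KV pairs whose keys score highest along `direction`.

    keys, values: arrays of shape (seq_len, head_dim) for one attention head.
    direction: vector of shape (head_dim,), assumed fixed and context-agnostic.
    The sign convention (higher score = keep) is an assumption of this sketch.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    keep = min(keep, keys.shape[0])
    unit = direction / np.linalg.norm(direction)
    scores = keys @ unit  # one scalar per cached key, no attention weights needed
    # Indices of the `keep` largest scores, sorted back into sequence order.
    top = np.sort(np.argpartition(scores, -keep)[-keep:])
    return keys[top], values[top], top


# Usage: 32x compression of a 1024-token cache for one head (random data).
rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))
values = rng.normal(size=(1024, 64))
direction = rng.normal(size=64)  # stand-in for a precomputed projection
k_small, v_small, kept = compress_kv(keys, values, direction, keep=1024 // 32)
print(k_small.shape)  # (32, 64)
```

Because the score depends only on the keys and a fixed vector, a scheme of this shape never needs the attention matrix itself, which fused kernels such as FlashAttention do not materialize.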
