Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression


August 12, 2025

Full paper · available on arxiv.org


Autoregressive language models rely on a Key-Value (KV) Cache to avoid re-computing past hidden states during generation. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection.
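As a rough illustration of what filtering by a single projection can look like, the sketch below scores each cached key by its dot product with one fixed direction and keeps only the highest-scoring pairs. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `compress_kv_cache`, the `direction` vector, the rule that a higher projection means a more important pair, and the toy data are all hypothetical. How the direction is obtained and which sign convention applies are not covered in this summary.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def compress_kv_cache(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    direction: Vector,
    keep: int,
) -> Tuple[List[Vector], List[Vector], List[int]]:
    """Keep the `keep` KV pairs whose keys project highest onto `direction`.

    Assumption: a larger projection marks a more important pair. The real
    method may use a different sign convention or scoring rule.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")
    if keep < 0:
        raise ValueError("keep must be non-negative")

    # One dot product per cached key; no attention weights are needed.
    scores = [sum(k_i * d_i for k_i, d_i in zip(key, direction)) for key in keys]

    # Indices of the top-`keep` scores, then restored to original token order.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)

    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Toy cache with four positions and 2-dimensional keys (made-up numbers).
toy_keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [-1.0, 2.0]]
toy_values = [[10.0], [20.0], [30.0], [40.0]]
toy_direction = [1.0, 0.0]  # hypothetical filter direction

new_keys, new_values, kept_positions = compress_kv_cache(
    toy_keys, toy_values, toy_direction, keep=2
)
print(kept_positions)  # [0, 2]
```

Because the score in this sketch depends only on the keys and a fixed vector, it never needs the attention matrix itself.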

Unlike many alternatives, Q-Filters is compatible with FlashAttention. Experimental results show that Q-Filters is competitive with attention-based compression methods on retrieval tasks while consistently outperforming efficient compression schemes in generation setups. Q-Filters achieves 99% accuracy on the needle-in-a-haystack task with 32x compression while reducing the drop in generation perplexity by up to 65% compared to Streaming-LLM.

