Q-Filters: Efficient KV Cache Compression
Autoregressive language models rely on a Key-Value (KV) Cache to avoid re-computing past hidden states during generation. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection.
Unlike many alternatives, Q-Filters is compatible with FlashAttention. Experimental results demonstrate that Q-Filters is competitive with attention-based compression methods on retrieval tasks while consistently outperforming efficient compression schemes in generation setups. Q-Filters achieves 99% accuracy on the needle-in-a-haystack task at a 32x compression ratio, while reducing the perplexity drop in text generation by up to 65% compared to Streaming-LLM.
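
To make the core idea concrete, the sketch below shows one way a single context-agnostic projection could drive cache eviction for one attention head: each cached key is scored by a dot product with a fixed direction vector estimated offline, and only the highest-scoring Key-Value pairs are kept. This is a minimal illustration under assumed names (`compress_kv`, `q_filter`, `keep_ratio`), not the paper's implementation.

```python
import torch

def compress_kv(keys, values, q_filter, keep_ratio=0.5):
    """Prune a per-head KV Cache with a context-agnostic projection (sketch).

    keys, values: (seq_len, head_dim) tensors for one attention head
    q_filter:     (head_dim,) fixed unit vector estimated offline (hypothetical name)
    keep_ratio:   fraction of Key-Value pairs to retain
    """
    scores = keys @ q_filter                            # one dot product per cached key
    k = max(1, int(keep_ratio * keys.shape[0]))         # number of pairs to retain
    kept = torch.topk(scores, k).indices.sort().values  # keep top scores, preserve order
    return keys[kept], values[kept]

# Toy usage: random cache for a single head, compressed 4x
keys = torch.randn(128, 64)
values = torch.randn(128, 64)
q_filter = torch.nn.functional.normalize(torch.randn(64), dim=0)
k_small, v_small = compress_kv(keys, values, q_filter, keep_ratio=0.25)
print(k_small.shape, v_small.shape)  # torch.Size([32, 64]) torch.Size([32, 64])
```

Because the score is a plain dot product against a precomputed vector rather than a function of the current attention weights, this kind of filter needs no access to the attention matrix, which is what keeps it compatible with FlashAttention-style kernels.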




