Deploying large language models (LLMs) is challenging because of the high memory demands of the Key-Value (KV) cache, which grow with context length. Existing solutions typically fine-tune the model or rely on attention scores to decide which KV pairs to keep; we propose a more efficient strategy.
Our analysis of decoder-only Transformers reveals consistent attention patterns across layers and a strong correlation between the L2 norm of a key embedding and the attention score its KV pair receives: keys with a low L2 norm tend to attract high attention. This suggests that a KV pair's importance can be estimated from the key embedding alone, before any query is observed.
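To make the observation concrete, here is a minimal sketch (not the authors' code) of how the correlation can be checked for a single attention head: it compares each cached key's L2 norm with the total attention mass that key receives under standard causal softmax attention. The function name and tensor shapes are illustrative assumptions.

```python
import torch

def key_norm_vs_attention(queries: torch.Tensor, keys: torch.Tensor) -> float:
    """queries, keys: (seq_len, head_dim) tensors for a single attention head."""
    head_dim = keys.shape[-1]
    scores = (queries @ keys.T) / head_dim ** 0.5            # (seq_len, seq_len)
    causal = torch.tril(torch.ones_like(scores, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))      # token i attends only to j <= i
    attn = scores.softmax(dim=-1)                            # each row sums to 1
    attn_mass_per_key = attn.sum(dim=0)                      # total attention each key receives
    key_norms = keys.norm(dim=-1)                            # L2 norm of each key embedding
    # A negative value here would mirror the reported pattern: low-norm keys, high attention.
    return torch.corrcoef(torch.stack([key_norms, attn_mass_per_key]))[0, 1].item()

# Random tensors only make the snippet runnable; real queries/keys come from a trained model.
corr = key_norm_vs_attention(torch.randn(128, 64), torch.randn(128, 64))
print(f"correlation between key L2 norm and attention mass: {corr:.3f}")
```

On queries and keys taken from a trained model, a clearly negative correlation would match the pattern reported above; the random inputs here serve only to keep the example self-contained.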
Leveraging this insight, we developed a compression method that reduces the KV cache size by 50% on language modeling and needle-in-a-haystack tasks, and by 90% on passkey retrieval tasks, without sacrificing accuracy. Because it relies only on key norms rather than on attention scores, the method is also compatible with FlashAttention, broadening its applicability across models and tasks.
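As a rough illustration of the compression rule, and under the assumption that the lowest-norm keys are the ones worth keeping, the sketch below prunes a per-head KV cache to a target fraction by ranking keys on their L2 norm. The function and parameter names are hypothetical, not the paper's API.

```python
import torch

def compress_kv_cache(keys: torch.Tensor, values: torch.Tensor, keep_ratio: float = 0.5):
    """keys, values: (seq_len, head_dim) for one head; keep_ratio=0.5 mirrors the 50% setting."""
    n_keep = max(1, int(keys.shape[0] * keep_ratio))
    key_norms = keys.norm(dim=-1)                                   # (seq_len,)
    # Keep the KV pairs whose keys have the smallest L2 norms,
    # i.e. those expected to receive the most attention.
    keep_idx = key_norms.topk(n_keep, largest=False).indices.sort().values
    return keys[keep_idx], values[keep_idx]

k, v = torch.randn(1024, 64), torch.randn(1024, 64)
k_small, v_small = compress_kv_cache(k, v, keep_ratio=0.5)          # cache halved to 512 entries
```

Because the ranking uses only the cached keys, no attention matrix needs to be materialized, which is what keeps this style of eviction compatible with FlashAttention-like kernels.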