Analysing The Impact of Sequence Composition on Language Model Pre-Training
The composition of pre-training sequences plays a critical role in language model performance. With traditional causal masking, tokens can attend to unrelated documents packed into the same sequence, and this distracting context hinders effectiveness. Intra-document causal masking, which conditions each token only on earlier tokens within the same document, addresses this issue and improves results. In addition, the BM25Chunk method groups related documents together, improving in-context learning, knowledge retention, and context utilization while maintaining efficiency.
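To make the masking idea concrete, here is a minimal sketch of how an intra-document causal mask could be built for one packed sequence. The function name `intra_document_causal_mask` and the `doc_ids` representation (one document index per token) are assumptions made for illustration; they are not taken from the paper's code.

```python
def intra_document_causal_mask(doc_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    doc_ids is a hypothetical per-token list of document indices for one
    packed training sequence. Token i may attend to token j only if j is
    not in the future (j <= i) and both tokens come from the same document.
    """
    n = len(doc_ids)
    return [
        [j <= i and doc_ids[j] == doc_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Two documents packed into one sequence:
# tokens 0-2 belong to document 0, tokens 3-4 belong to document 1.
doc_ids = [0, 0, 0, 1, 1]
mask = intra_document_causal_mask(doc_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under plain causal masking, token 3 would also see tokens 0 to 2. Here its row allows only position 3, so the first token of the second document starts from a clean context.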
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression
Deploying large language models (LLMs) is challenging because of the high memory demands of the Key-Value (KV) cache, especially at longer context lengths. Existing solutions either fine-tune the model or use attention scores to decide which parts of the sequence to keep; we propose a more efficient strategy. Our analysis of decoder-only Transformers reveals consistent attention patterns across layers and a strong correlation between the L2 norm of key embeddings and their attention scores. Keys with a low L2 norm tend to receive high attention, suggesting that the importance of a KV pair is largely determined by the key embedding itself, before any query is seen. Leveraging this insight, we developed a compression method that reduces KV cache size by 50% for language modeling and needle-in-a-haystack tasks, and by 90% for passkey retrieval tasks, all without sacrificing accuracy. Because the method does not rely on attention scores, it is also compatible with FlashAttention, which broadens its utility across models and tasks.
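The L2-norm heuristic described above can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: the function name `compress_kv_cache`, the `keep_ratio` parameter, and the flat list-of-vectors cache layout are assumptions, and a real cache would apply the selection per layer and per attention head.

```python
import math


def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the KV pairs whose key vectors have the lowest L2 norm.

    keys and values are hypothetical parallel lists, one entry per cached
    token. keep_ratio is the fraction of pairs to retain (an assumed knob).
    Returns the kept keys, the kept values, and their original positions.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(keys) * keep_ratio))
    norms = [math.sqrt(sum(x * x for x in key)) for key in keys]
    # Rank positions by ascending key norm; low norm is treated as important.
    ranked = sorted(range(len(keys)), key=lambda i: norms[i])
    # Restore the original token order among the positions that survive.
    kept = sorted(ranked[:n_keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


keys = [[3.0, 4.0], [0.1, 0.2], [1.0, 0.0], [5.0, 5.0]]
values = ["v0", "v1", "v2", "v3"]
small_keys, small_values, kept = compress_kv_cache(keys, values, keep_ratio=0.5)
print(kept, small_values)
```

The selection uses only the stored keys, so no attention matrix has to be materialised. This is why a scheme of this kind can sit alongside fused attention kernels that never expose attention scores.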