Analysing The Impact of Sequence Composition on Language Model Pre-Training

How documents are composed into pre-training sequences plays a critical role in language model performance. Because documents are typically concatenated into fixed-length training sequences, traditional causal masking lets tokens attend to content from unrelated documents packed into the same sequence, and this distracting context degrades downstream performance. Intra-document causal masking, which conditions each token only on preceding tokens from the same document, removes this cross-document interference and improves results. Additionally, the BM25Chunk method, which uses retrieval to place related documents in the same training sequence, improves in-context learning, knowledge memorization, and context utilization without sacrificing training efficiency.
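
To make the masking idea concrete, below is a minimal sketch of how an intra-document causal attention mask can be built for a packed sequence. It assumes each token is tagged with the index of the document it came from; the function name and tensor layout are illustrative, not taken from the paper.

```python
import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for a packed sequence.

    doc_ids: (seq_len,) tensor where doc_ids[i] is the index of the document
    token i was drawn from. A True entry at (q, k) means query position q may
    attend to key position k.
    """
    seq_len = doc_ids.shape[0]
    # Standard causal constraint: a token may only attend to earlier positions.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Intra-document constraint: query and key must belong to the same document.
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return causal & same_doc

# Example: a packed sequence holding three documents of lengths 3, 2, and 3.
doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
mask = intra_document_causal_mask(doc_ids)
# mask[4, 1] is False: token 4 (document 1) cannot attend to token 1
# (document 0), even though a plain causal mask would allow it.
```

With this mask, attention never crosses document boundaries, so the distracting context from unrelated documents is removed while the packed-sequence training setup stays unchanged.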
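
The retrieval-based packing idea can be sketched in a similar spirit. The snippet below greedily groups lexically related chunks using the third-party `rank_bm25` package; the function name, the greedy grouping strategy, and the fixed group size are assumptions for illustration, not the paper's exact BM25Chunk procedure.

```python
from rank_bm25 import BM25Okapi  # third-party package, assumed available

def pack_related_chunks(chunks: list[str], chunks_per_sequence: int) -> list[list[str]]:
    """Greedily group chunks so each training sequence holds related text.

    Illustrative sketch only: the real BM25Chunk method may score, select,
    and deduplicate candidates differently.
    """
    tokenized = [c.lower().split() for c in chunks]
    bm25 = BM25Okapi(tokenized)
    unused = set(range(len(chunks)))
    sequences = []
    while unused:
        seed = unused.pop()
        group = [seed]
        # Score every chunk against the seed and keep the best unused matches.
        scores = bm25.get_scores(tokenized[seed])
        ranked = sorted(unused, key=lambda i: scores[i], reverse=True)
        for i in ranked[: chunks_per_sequence - 1]:
            group.append(i)
            unused.discard(i)
        sequences.append([chunks[i] for i in group])
    return sequences
```

Because grouping is done once, offline, before sequences are packed, this kind of retrieval-based composition leaves the per-step training cost unchanged.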