Steering Knowledge Selection Behaviours in LLMs

Large language models (LLMs) often face conflicts between stored knowledge and contextual information, which can lead to outdated or incorrect responses. Analyzing LLMs’ internal activations, we find that mid-layer signals can detect these context-memory knowledge conflicts. To address this, we introduce SpARE, a training-free representation engineering approach using pre-trained sparse auto-encoders (SAEs) to steer knowledge selection during inference. By editing specific internal activations, SpARE effectively manages knowledge conflicts, improving accuracy in open-domain question-answering tasks by 10% over existing methods and 15% over contrastive decoding.
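As an illustration of the general mechanism (not the paper's exact implementation), below is a minimal sketch of SAE-based activation steering; the `sae_encode`, `sae_decode`, and `steer_idx` names are illustrative placeholders for a pre-trained sparse auto-encoder and the latent features associated with the desired knowledge-selection behaviour.

```python
import torch

def steer_with_sae(hidden, sae_encode, sae_decode, steer_idx, scale=4.0):
    """Edit a residual-stream activation by amplifying selected SAE latents.

    hidden:      (d_model,) activation from a mid layer
    sae_encode:  callable mapping (d_model,) -> (d_sae,) sparse latent code
    sae_decode:  callable mapping (d_sae,)  -> (d_model,) reconstruction
    steer_idx:   indices of latents associated with the target behaviour
                 (e.g. "use the context" vs. "use parametric memory")
    """
    z = sae_encode(hidden)                               # sparse latent code
    z_steered = z.clone()
    z_steered[steer_idx] = z_steered[steer_idx] * scale  # amplify chosen features
    # Apply only the *difference* of the reconstructions, so the parts of the
    # activation the SAE does not capture are left untouched.
    delta = sae_decode(z_steered) - sae_decode(z)
    return hidden + delta
```

In practice such an edit would be applied through a forward hook on the chosen mid layer during decoding; the specific layers, latents, and scaling used by SpARE come from its own analysis and are not reproduced here.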
Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations

Neural networks deliver exceptional performance but can be impractical for applications with limited hardware or energy resources due to their high memory and computational demands. This work introduces a novel algorithm that identifies efficient low-rank subnetworks during training, significantly reducing both training and evaluation costs. The approach restricts weight matrices to a low-rank manifold and updates only the low-rank factors during training. Building on dynamical model-order reduction techniques, the method comes with approximation, stability, and descent guarantees, and it adapts the ranks dynamically throughout training to maintain the desired accuracy. Numerical experiments on fully-connected and convolutional networks demonstrate the efficiency of this technique.
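To make the idea concrete, here is a minimal, hypothetical sketch of a layer whose weight is kept in factored form with only the low-rank factors trained, plus a crude SVD-based rank truncation; the paper's actual method relies on dynamical low-rank (matrix differential equation) integrators rather than this simplified scheme.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer with weights constrained to W = U @ V^T of rank r.

    Only the factors U and V are trained, so the parameter count drops from
    (out * in) to roughly r * (out + in).
    """
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(in_features, rank) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # (x @ V) @ U^T never materialises the full out x in weight matrix.
        return (x @ self.V) @ self.U.T + self.bias

    @torch.no_grad()
    def truncate_rank(self, tol=1e-2):
        """Crude rank adaptation: drop directions with small singular values."""
        W = self.U @ self.V.T
        P, S, Qt = torch.linalg.svd(W, full_matrices=False)
        r = max(1, int((S > tol * S[0]).sum()))
        self.U = nn.Parameter(P[:, :r] * S[:r])
        self.V = nn.Parameter(Qt[:r].T)
        return r
```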
Robust low-rank training via approximate orthonormal constraints

As models and datasets grow, pruning techniques using low-rank matrix factorizations have become popular for reducing resource demands while maintaining accuracy. However, we find that these methods often degrade robustness against adversarial attacks due to exploding singular values in the low-rank matrices. To address this, we propose a robust low-rank training algorithm that keeps weights on the low-rank manifold while enforcing approximate orthonormal constraints. This approach reduces both training and inference costs while improving model conditioning and adversarial robustness, all without compromising accuracy. Our theoretical analysis and experimental results confirm that the robust low-rank network closely approximates the performance of full models when effective low-rank sub-networks are available.
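One simple way to approximate such a constraint in code is a soft penalty that pushes the low-rank factors towards orthonormal columns; this is only an illustrative sketch (reusing the hypothetical `LowRankLinear` factors from the previous sketch), not the paper's actual training scheme.

```python
import torch

def orthonormality_penalty(U, V):
    """Soft penalty pushing the factors of W = U @ V^T towards orthonormal
    columns, which keeps the singular values of W well conditioned."""
    I_u = torch.eye(U.shape[1], device=U.device)
    I_v = torch.eye(V.shape[1], device=V.device)
    return ((U.T @ U - I_u) ** 2).sum() + ((V.T @ V - I_v) ** 2).sum()

# Illustrative use with a low-rank layer:
#   loss = task_loss + lam * orthonormality_penalty(layer.U, layer.V)
```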
Are We Done with MMLU?

Our analysis uncovers significant issues with the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to assess LLMs. We found numerous ground truth errors, with 57% of the Virology subset’s questions containing inaccuracies, obscuring the true capabilities of models. To address this, we introduce a new error taxonomy and create MMLU-Redux—a refined subset of 3,000 manually re-annotated questions across 30 subjects. Our experiments with MMLU-Redux reveal notable discrepancies in previously reported model performance metrics, underscoring the need for revising MMLU’s flawed questions. We invite the community to contribute to further annotations to enhance the reliability of this important benchmark.
Enhancing AI Model Robustness with Natural Language Explanations

In this paper, we explore how natural language explanations (NLEs) can improve the robustness of large language models (LLMs) in tasks like natural language inference and paraphrase detection. By prompting LLMs with a mix of human-generated and AI-produced NLEs, we observe notable improvements in handling adversarial inputs. Our findings indicate that this method consistently outperforms traditional approaches, offering a more effective way to enhance model accuracy in challenging scenarios.
Probing the Emergence of Cross-lingual Alignment during LLM Training

Multilingual LLMs excel at zero-shot cross-lingual transfer, likely by aligning languages without parallel sentence supervision. This study uses intrinsic probing to measure how far the same neurons encode linguistic features across languages, and correlates this overlap with cross-lingual transfer performance. By examining BLOOM checkpoints across training steps and model scales, a strong link between neuron overlap and downstream performance is identified. The findings also reveal phases of pre-training in which alignment and multilingual abilities degrade, offering new insights into multilingual training dynamics.
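As a rough illustration of what such an analysis can look like, the sketch below computes a Jaccard overlap between the top-ranked neurons a probe attributes to the same linguistic feature in two languages, and correlates overlaps with transfer scores; the paper's actual probing method and statistics may differ.

```python
import numpy as np

def neuron_overlap(scores_lang_a, scores_lang_b, k=50):
    """Jaccard overlap of the top-k neurons ranked by probe importance
    for the same linguistic feature in two languages."""
    top_a = set(np.argsort(scores_lang_a)[-k:])
    top_b = set(np.argsort(scores_lang_b)[-k:])
    return len(top_a & top_b) / len(top_a | top_b)

def overlap_transfer_correlation(overlaps, transfer_scores):
    """Pearson correlation between per-language-pair neuron overlap and
    zero-shot cross-lingual transfer performance."""
    return float(np.corrcoef(overlaps, transfer_scores)[0, 1])
```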
Using Natural Language Explanations to Improve Robustness of In-context Learning

This work explores improving the robustness of LLMs against adversarial inputs by augmenting in-context learning (ICL) with natural language explanations (NLEs). Prompting models to generate NLEs from a small set of human-crafted examples yields better results than zero-shot ICL and than using only human-generated NLEs. Evaluated across five LLMs, the approach delivers a 6% improvement on eight adversarial datasets. Additionally, while prompt selection strategies boost ICL on standard tests, they prove less effective for robustness, showing an 8% accuracy drop compared to this method.
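The sketch below shows one plausible way to assemble an ICL prompt in which every demonstration carries an NLE; the field names and template are illustrative and not the paper's exact format.

```python
def build_icl_prompt_with_nles(demos, query):
    """Assemble a few-shot prompt in which each demonstration includes a
    natural language explanation alongside its label.

    demos: list of dicts with 'premise', 'hypothesis', 'label', 'explanation'
    query: dict with 'premise' and 'hypothesis'
    """
    parts = []
    for d in demos:
        parts.append(
            f"Premise: {d['premise']}\n"
            f"Hypothesis: {d['hypothesis']}\n"
            f"Answer: {d['label']}\n"
            f"Explanation: {d['explanation']}\n"
        )
    # The query is left open so the model generates both answer and explanation.
    parts.append(
        f"Premise: {query['premise']}\n"
        f"Hypothesis: {query['hypothesis']}\n"
        f"Answer:"
    )
    return "\n".join(parts)
```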
A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression

The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reducing the KV cache size either fine-tune the model to learn a compression strategy or leverage attention scores to shorten the cached sequence. We analyze the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the L2 norm of a key embedding and the attention scores over cached KV pairs, where a low L2 norm usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before it is queried. Based on this observation, we compress the KV cache according to the L2 norm of the key embeddings, retaining the pairs whose keys have the lowest norm. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modeling and needle-in-a-haystack tasks and by 90% on passkey retrieval tasks without losing accuracy. Moreover, because it does not rely on attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.
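A simplified sketch of the core selection rule, keeping the cached positions whose key embeddings have the smallest L2 norm, is given below; the per-task compression ratios and any special handling of particular token positions follow the paper's experiments and are not reproduced here.

```python
import torch

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the cached positions whose key embeddings have the smallest L2 norm.

    keys, values: tensors of shape (num_heads, seq_len, head_dim)
    Returns compressed (keys, values) with ~keep_ratio * seq_len positions per head.
    """
    num_heads, seq_len, head_dim = keys.shape
    n_keep = max(1, int(seq_len * keep_ratio))
    norms = keys.norm(dim=-1)                                 # (num_heads, seq_len)
    keep = norms.topk(n_keep, dim=-1, largest=False).indices  # lowest-norm keys
    keep, _ = keep.sort(dim=-1)                               # preserve positional order
    idx = keep.unsqueeze(-1).expand(-1, -1, head_dim)
    return keys.gather(1, idx), values.gather(1, idx)
```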
SPARSEFIT: Few-shot Prompting with Sparse Fine-tuning for Jointly Generating Predictions and Natural Language Explanations

This work introduces SparseFit, a sparse few-shot fine-tuning strategy for generating natural language explanations (NLEs) with pre-trained language models (PLMs). SparseFit uses discrete prompts to jointly generate predictions and NLEs while fine-tuning only 6.8% of the model’s parameters, making it more efficient than full fine-tuning. Tested on three T5 model sizes and four datasets, SparseFit achieves competitive task performance and NLE quality, outperforming other parameter-efficient fine-tuning (PEFT) methods on average in both predictive accuracy and explanation quality.
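The general recipe of sparse fine-tuning can be sketched as freezing the model and unfreezing only a small, named subset of parameters; the keyword choice below is purely illustrative and does not correspond to the specific ~6.8% of weights selected by SparseFit.

```python
def apply_sparse_finetuning(model, trainable_keywords=("layer_norm", "lm_head")):
    """Freeze every parameter except those whose name matches one of the given
    keywords, so only a small fraction of weights receives gradients."""
    n_total, n_trainable = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        n_total += param.numel()
        n_trainable += param.numel() if param.requires_grad else 0
    print(f"Fine-tuning {100 * n_trainable / n_total:.1f}% of parameters")
    return model
```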