TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
We investigate the cross-lingual transferability of backdoor attacks in instruction-tuned large language models. Our findings reveal that backdoors ca…
Steering Knowledge Selection Behaviours in LLMs
We investigate how large language models (LLMs) select and utilise knowledge when generating responses. Our analysis reveals that LLMs exhibit systematic biases in knowledge selection, often favouring certain types of information over others regardless of relevance or accuracy. Through controlled experiments using knowledge-steering techniques, we demonstrate that it’s possible to influence LLMs’ knowledge selection behaviours. We introduce novel methods for steering models towards more balanced and contextually appropriate knowledge utilisation, significantly improving response quality and factual accuracy. Our findings have important implications for developing more reliable and controllable language models, particularly in knowledge-intensive applications where accurate information retrieval and utilisation are critical.
Mixtures of In-Context Learners
We introduce Mixtures of In-Context Learners (MiCL), a novel approach that combines multiple in-context learning strategies to improve few-shot perfor…
Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations
Neural networks deliver exceptional performance but can be impractical for applications with limited hardware or energy resources due to their high memory and computational demands. This work introduces a novel algorithm to identify efficient low-rank subnetworks during the training phase, significantly reducing both training and evaluation costs. The approach restricts weight matrices to a low-rank manifold and updates only the low-rank factors during training. Using dynamic model order reduction techniques, the method ensures approximation, stability, and descent guarantees. It also adapts ranks dynamically throughout training to maintain the desired accuracy. Numerical experiments on fully-connected and convolutional networks demonstrate the efficiency of this technique.
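The idea of storing a weight matrix only through low-rank factors, and shrinking the rank when some directions stop mattering, can be illustrated with a minimal NumPy sketch. This is not the paper's training algorithm: the factorisation W ≈ U S Vᵀ, the function names, and the relative threshold `tol` are all assumptions chosen for illustration, and the truncation rule is a plain SVD cut-off on the small core matrix.

```python
import numpy as np

def low_rank_factors(W, rank):
    """Rank-`rank` factors (U, S, V) of W from a truncated SVD, so W ~ U @ S @ V.T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank], np.diag(s[:rank]), Vt[:rank, :].T

def adapt_rank(U, S, V, tol):
    """Drop directions whose singular value in the small core S is negligible.

    Hypothetical rule: keep singular values above tol * ||S||_F (at least one).
    """
    P, s, Qt = np.linalg.svd(S)
    keep = max(1, int(np.sum(s > tol * np.linalg.norm(s))))
    return U @ P[:, :keep], np.diag(s[:keep]), V @ Qt[:keep, :].T

def forward(x, U, S, V):
    """Apply W = U S V^T to x without ever forming the full matrix W."""
    return U @ (S @ (V.T @ x))

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 15))  # true rank 3
U, S, V = low_rank_factors(W, rank=8)      # start with a generous rank
U, S, V = adapt_rank(U, S, V, tol=1e-8)    # rank shrinks to what W needs
print(S.shape)
```

Only the factors are stored, so a layer of shape m × n with rank r needs roughly (m + n) r + r² numbers instead of m n.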
Robust low-rank training via approximate orthonormal constraints
As models and datasets grow, pruning techniques using low-rank matrix factorizations have become popular for reducing resource demands while maintaining accuracy. However, we find that these methods often degrade robustness against adversarial attacks due to exploding singular values in the low-rank matrices. To address this, we propose a robust low-rank training algorithm that keeps weights on the low-rank manifold while enforcing approximate orthonormal constraints. This approach reduces both training and inference costs while improving model conditioning and adversarial robustness, all without compromising accuracy. Our theoretical analysis and experimental results confirm that the robust low-rank network closely approximates the performance of full models when effective low-rank sub-networks are available.
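How far a low-rank factor is from having orthonormal columns can be measured directly, as the following NumPy sketch shows. The summary above does not say how the constraint is enforced, so the QR-based re-orthonormalisation here is a standard stand-in and not the paper's method; the function names are hypothetical.

```python
import numpy as np

def orthonormality_gap(U):
    """Frobenius norm of U^T U - I; zero exactly when the columns are orthonormal."""
    r = U.shape[1]
    return float(np.linalg.norm(U.T @ U - np.eye(r)))

def reorthonormalise(U, S):
    """Replace U by an orthonormal basis Q of its column space.

    The triangular factor R is absorbed into the core, so Q @ (R @ S) == U @ S
    and the represented weight matrix is unchanged.
    """
    Q, R = np.linalg.qr(U)  # reduced QR: Q is m x r, R is r x r
    return Q, R @ S

rng = np.random.default_rng(0)
U = rng.standard_normal((10, 4))  # factor that has drifted away from orthonormality
S = rng.standard_normal((4, 4))
Q, S_new = reorthonormalise(U, S)
print(orthonormality_gap(U), orthonormality_gap(Q))
```

A factor with orthonormal columns has all singular values equal to one, which is the sense in which such a constraint keeps the factor well conditioned.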
Are We Done with MMLU?
Our analysis uncovers significant issues with the Massive Multitask Language Understanding (MMLU) benchmark, which is widely used to assess LLMs. We found numerous ground truth errors, with 57% of the Virology subset’s questions containing inaccuracies, obscuring the true capabilities of models. To address this, we introduce a new error taxonomy and create MMLU-Redux, a refined subset of 3,000 manually re-annotated questions across 30 subjects. Our experiments with MMLU-Redux reveal notable discrepancies in previously reported model performance metrics, underscoring the need for revising MMLU’s flawed questions. We invite the community to contribute to further annotations to enhance the reliability of this important benchmark.
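Why label errors can shift a reported score is easy to see on a toy example. The sketch below is purely illustrative: the question ids, answers, and the choice to drop an unanswerable question are made-up assumptions, not data or procedure from MMLU-Redux.

```python
def accuracy(predictions, gold):
    """Fraction of questions in `gold` whose prediction matches the gold answer."""
    ids = list(gold)
    return sum(predictions[q] == gold[q] for q in ids) / len(ids)

# Hypothetical toy data: q3 has a wrong original label, and q4 is judged
# unanswerable and dropped during re-annotation.
original_gold = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
reannotated_gold = {"q1": "A", "q2": "C", "q3": "D"}
predictions = {"q1": "A", "q2": "B", "q3": "D", "q4": "A"}

print(accuracy(predictions, original_gold))     # scored against original labels
print(accuracy(predictions, reannotated_gold))  # scored against corrected labels
```

The same predictions receive different scores depending on which label set is used, which is the kind of discrepancy a re-annotated subset is meant to expose.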
Enhancing AI Model Robustness with Natural Language Explanations
In this paper, we explore how natural language explanations (NLEs) can improve the robustness of large language models (LLMs) in tasks like natural language inference and paraphrase detection. By prompting LLMs with a mix of human-generated and AI-produced NLEs, we observed notable improvements in handling adversarial inputs. Our findings indicate that this method consistently outperforms traditional approaches, offering a more effective way to enhance model accuracy in challenging scenarios.
Probing the Emergence of Cross-lingual Alignment during LLM Training
Multilingual LLMs excel at zero-shot cross-lingual transfer, likely by aligning languages without parallel sentence supervision. This study uses intrinsic probing to measure how much the neurons encoding linguistic features overlap, and correlates this overlap with transfer performance. By examining BLOOM checkpoints across training steps and model scales, it identifies a strong link between neuron overlap and downstream performance. The findings also reveal phases in pre-training where alignment and multilingual abilities degrade, offering new insights into multilingual training dynamics.
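One simple way to quantify neuron overlap is to compare the sets of most informative neurons found for the same feature in two languages. The summary above does not specify the study's overlap metric, so the Jaccard index, the top-k selection, and the per-neuron scores below are assumptions made for illustration.

```python
def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def neuron_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k neuron sets of two languages."""
    a, b = top_k_neurons(scores_a, k), top_k_neurons(scores_b, k)
    return len(a & b) / len(a | b)

# Hypothetical probe scores for four neurons in two languages.
scores_lang_a = [0.9, 0.1, 0.8, 0.3]
scores_lang_b = [0.7, 0.2, 0.1, 0.6]
print(neuron_overlap(scores_lang_a, scores_lang_b, k=2))
```

Computing such a value at each checkpoint gives a curve that can be compared against downstream transfer scores over the course of training.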
Using Natural Language Explanations to Improve Robustness of In-context Learning
This work explores improving the robustness of LLMs against adversarial inputs by augmenting in-context learning (ICL) with natural language explanations (NLEs). Prompting models to generate NLEs from a small set of human-crafted examples yields better results than both zero-shot ICL and ICL that uses only human-generated NLEs. Evaluated across five LLMs, the approach delivers a 6% improvement on eight adversarial datasets. Additionally, while prompt selection strategies boost ICL on standard tests, they prove less effective for robustness, showing an 8% accuracy drop compared to this method.
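Augmenting a few-shot prompt with explanations can be sketched as follows for a natural language inference task. The template, field names, and example text are hypothetical and are not taken from the paper; the prompt ends at "Explanation:" so that the model is asked to write its own explanation before the label.

```python
def format_demo(premise, hypothesis, explanation, label):
    """Render one demonstration, with the explanation placed before the label."""
    return (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
            f"Explanation: {explanation}\nLabel: {label}\n")

def build_prompt(demos, premise, hypothesis):
    """Few-shot prompt: explained demonstrations followed by the open query."""
    parts = [format_demo(**d) for d in demos]
    parts.append(f"Premise: {premise}\nHypothesis: {hypothesis}\nExplanation:")
    return "\n".join(parts)

demos = [{
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A person is performing music.",
    "explanation": "Playing a guitar on stage is a form of performing music.",
    "label": "entailment",
}]
print(build_prompt(demos, "A dog sleeps on the sofa.", "An animal is resting."))
```

The `explanation` field of each demonstration can hold either a human-written explanation or one generated earlier by a model, which is the distinction the summary above draws.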