Steering Knowledge Selection Behaviours in LLMs


Large language models (LLMs) often face conflicts between their stored parametric knowledge and the information supplied in context, which can lead to outdated or incorrect responses. Analyzing LLMs' internal activations, we find that such context-memory knowledge conflicts can be detected from signals in the models' mid-layer activations. To address this, we introduce SpARE, a training-free representation engineering approach that uses pre-trained sparse auto-encoders (SAEs) to steer knowledge selection at inference time. By editing a small set of internal activations, SpARE effectively resolves knowledge conflicts, improving accuracy on open-domain question-answering tasks by 10% over existing methods and by 15% over contrastive decoding.
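
For intuition, the sketch below shows what SAE-based activation steering at inference time could look like in PyTorch. It is a simplified illustration under assumptions, not SpARE's exact procedure: the SAE architecture, the feature indices (`context_features`, `memory_features`), and the editing rule are all hypothetical placeholders.

```python
# Minimal sketch of SAE-based activation steering at inference time.
# All class/argument names and the editing rule are illustrative assumptions,
# not SpARE's actual implementation.
import torch
import torch.nn as nn


class SparseAutoEncoder(nn.Module):
    """Simple SAE with a ReLU latent, as commonly used on LLM residual streams."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.dec(z)


def steer_hidden_state(
    h: torch.Tensor,              # residual-stream activation, shape (..., d_model)
    sae: SparseAutoEncoder,
    context_features: list[int],  # latent units associated with answering from context
    memory_features: list[int],   # latent units associated with answering from memory
    alpha: float = 4.0,           # steering strength (hypothetical value)
) -> torch.Tensor:
    """Amplify 'use context' features, suppress 'use memory' features, and write
    the difference between the edited and original reconstructions back into the
    residual stream, leaving the SAE reconstruction error untouched."""
    z = sae.encode(h)
    z_edit = z.clone()
    z_edit[..., context_features] = z[..., context_features] * alpha
    z_edit[..., memory_features] = 0.0
    return h + sae.decode(z_edit) - sae.decode(z)


if __name__ == "__main__":
    d_model, d_latent = 64, 512
    sae = SparseAutoEncoder(d_model, d_latent)   # in practice: a pre-trained SAE for one mid layer
    h = torch.randn(1, 8, d_model)               # dummy batch of residual-stream activations
    h_steered = steer_hidden_state(h, sae, context_features=[3, 17], memory_features=[42])
    print(h_steered.shape)                       # torch.Size([1, 8, 64])
```

In practice, a function like `steer_hidden_state` would be applied via a forward hook at a chosen mid layer during generation, so that only the selected SAE features are modified while the rest of the forward pass is left unchanged.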