Using Natural Language Explanations to Improve Robustness of In-context Learning

This work explores improving the robustness of LLMs against adversarial inputs by augmenting in-context learning (ICL) with natural language explanations (NLEs). Prompting the model to generate NLEs from a small set of human-crafted examples outperforms both zero-shot ICL and ICL that relies only on human-generated NLEs. Evaluated across five LLMs, the approach delivers a 6% improvement on eight adversarial datasets. Additionally, while prompt selection strategies boost ICL on standard, non-adversarial test sets, they prove less effective for robustness, showing an 8% accuracy drop compared to the proposed approach.
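
The prompting change at the core of this setup can be illustrated with a minimal sketch: each in-context demonstration pairs the input with an explanation placed before its label, so the model is nudged to explain before it predicts. The sketch below assumes an NLI-style task (premise/hypothesis/label); the field names, demonstration texts, and the helper `build_nle_prompt` are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of ICL prompt construction augmented with natural language
# explanations (NLEs). All texts below are illustrative placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    premise: str
    hypothesis: str
    explanation: str  # NLE: human-written, or previously generated by an LLM
    label: str


def build_nle_prompt(demos: List[Demonstration], premise: str, hypothesis: str) -> str:
    """Assemble a few-shot prompt in which every demonstration carries an
    explanation before its label; the test instance ends at 'Explanation:'
    so the model continues with an explanation followed by a label."""
    blocks = []
    for d in demos:
        blocks.append(
            f"Premise: {d.premise}\n"
            f"Hypothesis: {d.hypothesis}\n"
            f"Explanation: {d.explanation}\n"
            f"Label: {d.label}\n"
        )
    blocks.append(
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Explanation:"
    )
    return "\n".join(blocks)


if __name__ == "__main__":
    demos = [
        Demonstration(
            premise="A man is playing a guitar on stage.",
            hypothesis="A musician is performing.",
            explanation="Playing a guitar on stage is a form of musical performance.",
            label="entailment",
        ),
    ]
    print(build_nle_prompt(demos, "A dog sleeps on the couch.", "The dog is running."))
```

The same construction covers both variants compared above: human-generated NLEs fill the `explanation` field directly, while the model-generated variant first prompts an LLM to produce explanations for the demonstrations and then reuses them here.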