PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Generating accurate and concise textual summaries from multimodal documents is challenging, especially when dealing with visually complex content like…
Inverse Scaling in Test-Time Compute
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse sc…
Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs
Understanding time from visual representations is a fundamental cognitive skill, yet it remains a challenge for multimodal large language models (MLLM…
SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
We present SynDARin, a methodology for synthesizing high-quality reasoning datasets in low-resource languages. Our approach combines template-based ge…
Adaptive Computation Modules: Granular Conditional Computation for Efficient Inference
We introduce Adaptive Computation Modules (ACMs) that enable fine-grained conditional computation for more efficient neural network inference. Our app…
Self-Training Large Language Models for Tool-Use Without Demonstrations
We propose a self-training approach that enables large language models to learn tool use without requiring human demonstrations. Our method uses self-…
When Can Proxies Improve the Sample Complexity of Preference Learning?
We provide theoretical and empirical analysis of when proxy rewards can improve sample efficiency in preference learning. Our work establishes conditi…
Is Complex Query Answering Really Complex?
We challenge conventional wisdom about complex query answering by demonstrating that many supposedly complex queries can be solved with surprisingly s…
An Auditing Test to Detect Behavioral Shift in Language Models
We develop a comprehensive auditing framework to detect behavioral shifts in language models across different contexts and time periods. Our approach …