MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly


We introduce MMLongBench, the first benchmark to cover a diverse set of long-context vision-language tasks, designed to evaluate long-context vision-language models (LCVLMs) effectively and thoroughly. MMLongBench comprises 13,331 examples spanning five categories of downstream tasks, including Visual RAG and Many-Shot ICL.

All examples are delivered at five standardized input lengths (8K–128K tokens). By benchmarking 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of current models' long-context vision-language ability. Our results show that both closed-source and open-source models struggle on long-context vision-language tasks, indicating substantial room for future improvement.
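Delivering every example at fixed lengths implies mapping each raw context onto one of five buckets. Below is a minimal sketch, assuming the lengths double from 8K to 128K tokens; the bucket values and the helper name are illustrative assumptions, not the benchmark's actual code:

```python
# Assumed standardized lengths: doubling from 8K to 128K tokens.
# (Illustrative only; MMLongBench's actual construction may differ.)
STANDARD_LENGTHS = (8_192, 16_384, 32_768, 65_536, 131_072)

def target_length(token_count: int) -> int:
    """Return the smallest standardized length that fits an example,
    capping at the largest bucket (128K)."""
    for length in STANDARD_LENGTHS:
        if token_count <= length:
            return length
    return STANDARD_LENGTHS[-1]
```

For example, an example with a 10,000-token context would be padded or filled out to the 16K bucket, while anything beyond 128K would be capped at the largest bucket.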
