MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

Full paper · available on arxiv.org

We introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, designed to evaluate long-context vision-language models (LCVLMs) effectively and thoroughly. MMLongBench comprises 13,331 examples spanning five categories of downstream tasks, including Visual RAG and Many-Shot ICL.

All examples are provided at five standardized input lengths, ranging from 8K to 128K tokens. By benchmarking 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of current models’ long-context vision-language ability. Our results show that both closed-source and open-source models struggle with long-context vision-language tasks, indicating substantial room for improvement.
