PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models

1 Shanghai Artificial Intelligence Laboratory,  2 The University of Sydney,  3 The Chinese University of Hong Kong,  4 University of North Carolina at Chapel Hill,  5 Michigan State University,  6 Fudan University,  7 Tsinghua University

Equal Contribution    * Corresponding Author

Benchmark Overview

PhysUniBench includes diverse multimodal physics questions.

Overview

PhysUniBench is a large-scale, multimodal benchmark specifically designed to evaluate the advanced reasoning capabilities of MLLMs on undergraduate-level physics problems. It aims to fill a critical gap in current benchmark ecosystems by offering a challenging, diverse, and diagnostic dataset that reflects the complexity and multimodal nature of real-world scientific problem solving.

Unlike prior benchmarks that focus on text-only math or physics tasks, PhysUniBench emphasizes multimodal scientific reasoning: all questions are paired with visual diagrams, requiring models to integrate textual and visual information to arrive at correct answers. This makes PhysUniBench uniquely suited to test the limits of current MLLMs in performing concept-rich, symbol-heavy, and context-dependent reasoning.
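
As a concrete illustration, the sketch below shows one way to present such a diagram-paired question to a multimodal model through an OpenAI-compatible chat API. The function name, field layout, and the choice of `gpt-4o` are illustrative assumptions, not part of the benchmark's released tooling.

```python
import base64

from openai import OpenAI  # standard OpenAI Python client

client = OpenAI()

def ask_physics_item(question: str, image_path: str, model: str = "gpt-4o") -> str:
    """Send one diagram-plus-text physics problem to a multimodal chat model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```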

The benchmark comprises a total of 3,304 problems, divided into:

  • 2,057 open-ended questions (QA format), requiring free-form answers that test the model's generation and justification capabilities.
  • 1,247 multiple-choice questions (MCQ format), constructed by converting especially difficult QA items into single-choice questions with model-generated distractors; a sketch of this conversion follows below.
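
The conversion can be pictured roughly as follows. This is a minimal sketch assuming the distractors arrive as a list of model-generated wrong answers; the option labels and shuffling scheme are illustrative rather than the paper's exact procedure.

```python
import random

def to_multiple_choice(question: str, answer: str,
                       distractors: list[str], seed: int = 0) -> tuple[str, str]:
    """Turn an open-ended item into a single-choice question.

    `distractors` are assumed to be plausible-but-wrong final answers
    produced by a strong LLM; the exact generation prompt is not shown here.
    """
    options = distractors + [answer]
    random.Random(seed).shuffle(options)          # deterministic option order
    labels = "ABCDEFGH"[:len(options)]
    stem = question + "\n" + "\n".join(
        f"({l}) {o}" for l, o in zip(labels, options))
    correct = labels[options.index(answer)]       # label of the true answer
    return stem, correct
```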

PhysUniBench spans 8 major subfields of university physics, including:

(1) Electromagnetism and Electrodynamics;
(2) Classical Mechanics;
(3) Optics;
(4) Atomic, Molecular, and Subatomic Physics;
(5) Relativity;
(6) Solid-State Physics and Measurement;
(7) Thermodynamics;
(8) Quantum Mechanics.

Subfield Distribution

Figure 1: Distribution of PhysUniBench

The problems in PhysUniBench are meticulously curated from resources aligned with undergraduate physics curricula. The benchmark covers eight major subfields, enabling a broad evaluation of a model's physics knowledge and reasoning skills; a detailed breakdown of the problem distribution across these sub-disciplines is provided in Figure 1.

To ensure a meaningful and discriminative evaluation, all problems in PhysUniBench are annotated with a difficulty level from 1 to 5, calibrated based on the performance of a strong baseline MLLM (Qwen2.5-VL-72B) through a 16-sample roll-out protocol. Problems that were trivially solved by the model were filtered out to raise the difficulty floor.
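
A minimal sketch of how such a calibration might look, assuming difficulty is binned by the baseline's success rate over the 16 rollouts; the bin edges and the filtering threshold below are illustrative assumptions, not the paper's published values.

```python
def difficulty_level(num_correct: int, num_rollouts: int = 16) -> int | None:
    """Map a baseline model's rollout success rate to a 1-5 difficulty level.

    Returns None for problems the baseline solves in essentially every
    rollout; those are filtered out to raise the difficulty floor.
    NOTE: the bin edges here are assumptions, not the paper's exact calibration.
    """
    rate = num_correct / num_rollouts
    if rate >= 0.95:                      # trivially solved -> dropped
        return None
    for level, edge in enumerate([0.75, 0.50, 0.25, 0.10], start=1):
        if rate >= edge:                  # higher success rate -> easier
            return level
    return 5                              # hardest: (almost) never solved
```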

Example Problems from PhysUniBench

Dataset Statistics

Key Statistics of PhysUniBench

Total questions                 3,304
  Multiple-choice questions     1,247
  Open-ended questions          2,057
Unique images                   3,304
Difficulty level-1 questions      663
Difficulty level-2 questions      661
Difficulty level-3 questions      660
Difficulty level-4 questions      661
Difficulty level-5 questions      659
Average question tokens         150.7
Average option tokens           184.0
Average answer tokens           441.9
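
For reference, averages like these can be reproduced in a few lines once a tokenizer is fixed. The sketch below assumes a GPT-style BPE via `tiktoken`, which may differ from the tokenizer actually used for the table above.

```python
import tiktoken  # assumption: a GPT-style BPE; the paper's tokenizer is unspecified

enc = tiktoken.get_encoding("cl100k_base")

def average_tokens(texts: list[str]) -> float:
    """Mean token count over a list of strings (questions, options, or answers)."""
    return sum(len(enc.encode(t)) for t in texts) / len(texts)
```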

Subfield Distribution by Language and Topic

Experimental Results

Radar Chart of Model Accuracy

Figure: Accuracy comparison across subfields for different models

Table 1: Accuracy of Different Models Across Subfields

Abbreviations: OP = Optics; AMS = Atomic, Molecular, and Subatomic Physics; ME = Classical Mechanics; SP = Solid-State Physics and Measurement; TH = Thermodynamics and Statistical Physics; EM = Electromagnetism and Electrodynamics; RE = Relativity; QM = Quantum Mechanics.

Models              Overall   OP     AMS    ME     SP     TH     EM     RE     QM

Multiple-choice Questions (MCQs)
GPT-4o               33.7    42.9   39.2   40.1   33.3   35.4    –      –      –
Claude-3.5-Sonnet    44.0    45.5   44.9    –      –      –      –      –      –
Qwen2.5-VL-72B       33.4    31.9   32.9   40.8   23.9   26.9   29.8   38.3   33.9
Gemini-2.5-Pro       26.5    27.8   29.7   26.6   25.5   25.0   24.7   24.3   35.9
GPT-o4-mini          36.7    57.9   55.3   42.1   31.2   24.7   41.2   30.9   35.9
InternVL-3-38B       33.6    41.3   41.9   37.6   21.6   26.6   29.2   12.1   32.1

Open-Ended Questions (OEQs)
GPT-4o               20.9    30.5   24.1   27.8    5.0    3.4   20.2    2.1    6.2
Claude-3.5-Sonnet    19.0    37.6   28.5   26.2    8.0    4.8   17.9    2.1    0.0
Qwen2.5-VL-72B       23.7    38.5   29.1   29.5    –      –     21.4    2.1    0.0
Gemini-2.5-Pro        –       –      –      –      6.0    4.1    –      –      0.0
GPT-o4-mini          26.5    51.2   31.0   38.2   10.0    6.2   28.4    2.1    0.0
InternVL-3-38B       17.7    27.4   17.9   21.0    5.5    3.4   19.3    2.1    0.0

Dashes (–) indicate values that are not available.
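
Per-subfield accuracies like those above reduce to a simple aggregation over per-item grading results. The sketch below assumes each graded item is a record with a subfield tag and a correctness flag; this record format is illustrative, not the released evaluation harness.

```python
from collections import defaultdict

def subfield_accuracy(results: list[dict]) -> dict[str, float]:
    """Aggregate per-item correctness into per-subfield accuracy (percent).

    Each record is assumed to look like {"subfield": "OP", "correct": True};
    the real evaluation pipeline's record format is not specified here.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subfield"]] += 1
        hits[r["subfield"]] += int(r["correct"])
    return {s: 100.0 * hits[s] / totals[s] for s in totals}
```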

BibTeX

@misc{wang2025physunibenchundergraduatelevelphysicsreasoning,
      title={PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models}, 
      author={Lintao Wang and Encheng Su and Jiaqi Liu and Pengze Li and Peng Xia and Jiabei Xiao and Wenlong Zhang and Xinnan Dai and Xi Chen and Yuan Meng and Mingyu Ding and Lei Bai and Wanli Ouyang and Shixiang Tang and Aoran Wang and Xinzhu Ma},
      year={2025},
      eprint={2506.17667},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.17667}, 
}