PhysUniBench is a large-scale, multimodal benchmark specifically designed to evaluate the advanced reasoning capabilities of MLLMs on undergraduate-level physics problems. It aims to fill a critical gap in current benchmark ecosystems by offering a challenging, diverse, and diagnostic dataset that reflects the complexity and multimodal nature of real-world scientific problem solving.
Unlike prior benchmarks that focus on text-only math or physics tasks, PhysUniBench emphasizes multimodal scientific reasoning: all questions are paired with visual diagrams, requiring models to integrate textual and visual information to arrive at correct answers. This makes PhysUniBench uniquely suited to test the limits of current MLLMs in performing concept-rich, symbol-heavy, and context-dependent reasoning.
The benchmark comprises a total of 3,304 problems, divided into:
- 2,057 open-ended questions (QA format), requiring free-form answers that test the model's generation and justification capabilities.
- 1,247 multiple-choice questions (MCQ format), constructed by converting especially difficult QA items into single-choice questions with model-generated distractors.
PhysUniBench spans 8 major subfields of university physics, including:
(1) Electromagnetism and Electrodynamics;
(2) Classical Mechanics;
(3) Optics;
(4) Atomic, Molecular, and Subatomic Physics;
(5) Relativity;
(6) Solid-State Physics and Measurement;
(7) Thermodynamics;
(8) Quantum Mechanics.
Figure 1: Distribution of PhysUniBench problems across subfields.
The problems in PhysUniBench are meticulously curated from resources aligned with undergraduate physics curricula. Together, the eight subfields enable a broad evaluation of a model's physics knowledge and reasoning skills; a detailed breakdown of the problem distribution across these sub-disciplines is provided in Figure 1.
To ensure a meaningful and discriminative evaluation, every problem in PhysUniBench is annotated with a difficulty level from 1 to 5, calibrated against the performance of a strong baseline MLLM (Qwen2.5-VL-72B) under a 16-sample rollout protocol. Problems the model solved trivially were filtered out to raise the difficulty floor.
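The calibration step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper specifies 16 rollouts per problem and a 1–5 difficulty scale, but the exact pass-rate thresholds and filtering rule shown here are assumptions.

```python
def assign_difficulty(num_correct: int, num_samples: int = 16) -> int:
    """Map a rollout pass rate to a 1-5 difficulty level.

    Hypothetical binning: PhysUniBench calibrates difficulty from 16
    sampled attempts by a baseline MLLM, but the exact thresholds are
    not stated, so these bins are illustrative only.
    """
    pass_rate = num_correct / num_samples
    if pass_rate >= 0.8:
        return 1  # solved in most rollouts -> easiest
    elif pass_rate >= 0.6:
        return 2
    elif pass_rate >= 0.4:
        return 3
    elif pass_rate >= 0.2:
        return 4
    else:
        return 5  # rarely or never solved -> hardest


def keep_problem(num_correct: int, num_samples: int = 16) -> bool:
    """Drop trivially solved problems (assumed rule: correct in every
    rollout) to raise the benchmark's difficulty floor."""
    return num_correct < num_samples
```

Under this sketch, a problem solved in all 16 rollouts is discarded, while one solved in, say, 5 of 16 is kept and labeled level 4.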