Scientific General Intelligence (SGI) is defined as an AI system that can autonomously navigate the full, iterative cycle of scientific inquiry—Deliberation, Conception, Action, and Perception—with the versatility and proficiency of a human scientist. SGI-Bench operationalizes this definition via four scientist-aligned task families: deep research, idea generation, AI-assisted experiments (dry/wet), and multimodal experimental reasoning. The benchmark spans nine disciplines and ~1,000 expert-curated samples inspired by Science's 125 Big Questions.
- Deep research with multi-hop retrieval and meta-analysis-style quantitative synthesis.
- Structured idea generation and multi-dimensional comparative evaluation.
- AI-assisted experiments: dry (code/simulation) and wet (lab protocol).
- Multimodal reasoning over process, observation, simulation, experiment, and visualization images.
Grounded in the Practical Inquiry Model (PIM), SGI-Bench views science as an iterative cycle that links deliberation, conception, action, and perception. Under this lens, Scientific General Intelligence (SGI) denotes an AI's capacity to traverse that cycle coherently and autonomously, integrating knowledge retrieval, idea formation, action execution, and interpretation into a unified loop of inquiry.
- Expert-curated texts and images across nine domains drawn from Science's 125 Big Questions.
- Questions constructed by 100+ graduate and PhD annotators, with continuous expert review for scientific value.
- Multi-stage cleaning: rule-based validation, model checks, and expert QA to ensure executability and unique answers.
- Difficulty filtering: samples solvable by more than 50% of strong LLMs are removed to maintain high challenge.
SGI-Bench data is scientist-aligned and high-fidelity: an expert-sourced corpus spanning nine disciplines (inspired by Science's 125 Big Questions), questions constructed by 100+ graduate/PhD annotators with continuous scientist-in-the-loop review, multi-stage cleaning (rules, model checks, and expert QA) to ensure executability and unique answers, and difficulty filtering that removes items solved by more than 50% of strong LLMs. The result is a set of authentic, challenging, and broadly representative scientific tasks.
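A minimal sketch of the difficulty filter, assuming each sample carries boolean solve records from a panel of strong reference models; the field and function names here are hypothetical, not the benchmark's actual pipeline code:

```python
# Minimal sketch of the difficulty filter described above.
# Assumes each sample carries boolean solve records from a panel of strong
# reference LLMs; names ("solved_by", "keep_hard_samples") are hypothetical.
from typing import Dict, List


def keep_hard_samples(samples: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Drop samples solved by more than `threshold` of the reference models."""
    kept = []
    for sample in samples:
        records = sample["solved_by"]  # e.g. {"model_a": True, "model_b": False}
        solve_rate = sum(records.values()) / max(len(records), 1)
        if solve_rate <= threshold:
            kept.append(sample)
    return kept


if __name__ == "__main__":
    pool = [
        {"id": 1, "solved_by": {"m1": True, "m2": True, "m3": False}},   # 0.67 -> dropped
        {"id": 2, "solved_by": {"m1": False, "m2": True, "m3": False}},  # 0.33 -> kept
    ]
    print([s["id"] for s in keep_hard_samples(pool)])  # [2]
```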
- Workflow: question selection → metric customization → prediction and evaluation → report generation.
- Tools: web search, PDF parser, Python interpreter, file reader, and metric functions.
- Core metrics: EM/SLA (exact match, step-level accuracy); implementation similarity; PassAll@k and SER (execution success); MCA/RV (multi-choice accuracy, reasoning validity).
- Extensible: scientist-aligned metrics (e.g., rigor, feasibility) can be added on demand.
The evaluation framework is an agent-based stack that coordinates specialized agents and tools to assess models end to end with task-specific and customizable metrics. By formalizing question selection, metric construction, scoring, and reporting into traceable stages, it strengthens reproducibility, mitigates evaluator-model coupling bias, and offers scientist-aligned, actionable insights for model selection and iteration.
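As an illustration of how these stages could be wired together, a minimal skeleton follows; all class and function names are hypothetical assumptions and this is not the framework's actual API:

```python
# Illustrative skeleton of the four-stage agentic evaluation loop
# (question selection -> metric customization -> predict & eval -> report
# generation). All names are hypothetical; the real framework routes these
# stages through specialized agents and tools (web search, PDF parser,
# Python interpreter, file reader, metric functions).
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalReport:
    scores: Dict[str, float] = field(default_factory=dict)

    def summary(self) -> str:
        return ", ".join(f"{k}={v:.2f}" for k, v in self.scores.items())


def run_evaluation(
    questions: List[dict],
    select: Callable[[List[dict]], List[dict]],
    metrics: Dict[str, Callable[[str, dict], float]],
    predict: Callable[[dict], str],
) -> EvalReport:
    """Run selection, prediction, scoring, and reporting as traceable stages."""
    report = EvalReport()
    selected = select(questions)                      # 1. question selection
    for name, metric in metrics.items():              # 2. metric customization
        values = []
        for q in selected:
            answer = predict(q)                       # 3. predict & eval
            values.append(metric(answer, q))
        report.scores[name] = sum(values) / max(len(values), 1)
    return report                                     # 4. report generation


if __name__ == "__main__":
    qs = [{"prompt": "2+2?", "gold": "4"}]
    rep = run_evaluation(
        qs,
        select=lambda x: x,
        metrics={"exact_match": lambda a, q: float(a.strip() == q["gold"])},
        predict=lambda q: "4",
    )
    print(rep.summary())  # exact_match=1.00
```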
- Goal: handle idea generation without ground truth by optimizing novelty at test time, using online retrieval as a moving baseline.
- Reward: R = R_format + R_novelty. The format term enforces the XML structure (<think>, <answer>); the novelty term rewards embedding dissimilarity from retrieved prior work, with a gating threshold.
- Setup: GRPO on Qwen3-8B (ms-swift); group sampling G=8, high temperature, bfloat16; online retrieval with n=4.
- Result: the format reward saturates quickly while novelty climbs steadily, improving average novelty from 49.36 to 62.06 without any labels.
TTRL converts open-ended scientific exploration into a measurable test-time optimization process and can be extended to multi-objective scientist-aligned rewards (rigor, feasibility, safety, cost). In practice, it improves idea novelty without labels by coupling strict output structure with retrieval-grounded rewards, and it generalizes to multi-objective optimization that balances creativity with rigor and feasibility, making scientific ideation auditable and adaptable at inference time.
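A minimal sketch of such a reward follows, assuming a generic sentence-embedding function; the gating value and helper names are illustrative assumptions, not the exact implementation:

```python
# Minimal sketch of the TTRL reward R = R_format + R_novelty.
# `embed` stands in for any sentence-embedding model; the gate value and
# helper names are illustrative assumptions, not the exact implementation.
import re
from typing import Callable, List, Sequence


def format_reward(output: str) -> float:
    """1.0 if the output follows the required <think>...</think><answer>...</answer> structure."""
    ok = re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", output, re.DOTALL)
    return 1.0 if ok else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv + 1e-8)


def novelty_reward(
    output: str,
    retrieved_docs: List[str],
    embed: Callable[[str], Sequence[float]],
    gate: float = 0.8,
) -> float:
    """Reward dissimilarity to retrieved prior work; gate out near-duplicates."""
    if not retrieved_docs:            # nothing retrieved -> treat as fully novel
        return 1.0
    idea_vec = embed(output)
    max_sim = max(cosine(idea_vec, embed(doc)) for doc in retrieved_docs)
    if max_sim >= gate:               # too close to existing work -> no novelty credit
        return 0.0
    return 1.0 - max_sim              # higher reward for more dissimilar ideas


def total_reward(output: str, retrieved_docs: List[str], embed) -> float:
    return format_reward(output) + novelty_reward(output, retrieved_docs, embed)
```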
Explore scientific multimodal inputs across process, observation, experiment, simulation, and visualization images.
Tables below reflect the exact results reported in SGI-Bench Evaluation.
| Model | Type | Deep Research | Idea Generation | Dry Experiment | Wet Experiment | Experimental Reasoning | SGI-Score |
|---|---|---|---|---|---|---|---|
Overview Results Across SGI-Bench Tasks: Aggregated performance across Deep Research, Idea Generation, Dry/Wet Experiment, and Scientific Experimental Reasoning. The scores for Deep Research are based on the exact match metric (the strictest metric). Idea Generation scores are the average of four metrics evaluating ideas. Dry Experiment scores are based on PassAll@5 (the strictest metric). Wet Experiment scores are the average of action sequence similarity and parameter accuracy. Experimental Reasoning scores are based on the multi-choice accuracy metric (the strictest metric). The SGI-Score is the average across these five tasks, reflecting the overall capability of an AI model in various scientific research scenarios. Note: The Experimental Reasoning metrics for Qwen3-Max and Qwen3-8B come from their multimodal versions.
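As a worked illustration of this aggregation, the SGI-Score is the unweighted mean of the five task-level scores; the numbers below are placeholders, not reported results:

```python
# Worked illustration of the SGI-Score aggregation described above:
# the unweighted mean of the five task-level scores.
# The numbers are placeholders, not results from the paper.
task_scores = {
    "deep_research": 15.0,          # exact match accuracy
    "idea_generation": 40.0,        # average of four idea metrics
    "dry_experiment": 30.0,         # PassAll@5
    "wet_experiment": 25.0,         # mean of sequence similarity and parameter accuracy
    "experimental_reasoning": 50.0, # multi-choice accuracy
}
sgi_score = sum(task_scores.values()) / len(task_scores)
print(f"SGI-Score = {sgi_score:.2f}")  # SGI-Score = 32.00
```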
| Model | Type | Exact Match Accuracy | Step-Level Accuracy |
|---|---|---|---|
Analysis: Models can often generate partially correct reasoning steps, with moderate Step-Level Accuracy (SLA, up to ~50% for agents), but fail to maintain end-to-end coherence, leading to low Exact-Match accuracy (~10-20% in the best cases). Performance is especially weak on dataset and property questions where evidence is numerically dispersed across sources, with category accuracy rarely exceeding 30%, revealing failures in cross-paper numeric aggregation and mechanism synthesis rather than in retrieval.
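For intuition, a rough sketch of how Step-Level Accuracy and Exact-Match can diverge; the per-step matching rule used here is a simplification, not the benchmark's exact procedure:

```python
# Rough sketch contrasting Exact-Match accuracy with Step-Level Accuracy (SLA).
# The step-matching rule (normalized string equality per step) is a
# simplification, not the benchmark's exact matching procedure.
from typing import List


def exact_match(pred_steps: List[str], gold_steps: List[str]) -> float:
    """Full credit only when every step matches in order."""
    norm = lambda s: s.strip().lower()
    return float(list(map(norm, pred_steps)) == list(map(norm, gold_steps)))


def step_level_accuracy(pred_steps: List[str], gold_steps: List[str]) -> float:
    """Partial credit: fraction of gold steps reproduced at the right position."""
    norm = lambda s: s.strip().lower()
    hits = sum(1 for p, g in zip(pred_steps, gold_steps) if norm(p) == norm(g))
    return hits / max(len(gold_steps), 1)


if __name__ == "__main__":
    gold = ["retrieve dataset A", "extract reported value", "compute ratio"]
    pred = ["retrieve dataset A", "extract reported value", "compute difference"]
    print(exact_match(pred, gold))          # 0.0
    print(step_level_accuracy(pred, gold))  # 0.666...
```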
| Model | Type | Effectiveness | Novelty | Detailedness | Feasibility | Average |
|---|---|---|---|---|---|---|
Analysis: Generated ideas are typically fluent and structurally well-formatted (instruction following >91, best 98.02) but lack implementability: feasibility scores are low (mostly 0-2, best only 3.81) and steps are underspecified, missing parameters, data flow, compute assumptions, and solver or evaluation choices. This indicates a persistent gap between high-level hypothesis articulation and executable planning.
| Model | Type | PassAll@5 | PassAll@3 | PassAll@1 | AET (s) | SER |
|---|---|---|---|---|---|---|
Analysis: Most models generate syntactically valid, runnable code (SER >90%, best 98.85) but show low test-case correctness under strict evaluation (best PassAll@1 ~42.07%, dropping to ~36.64% at PassAll@5), and they collapse on numerical integration and simulation fidelity. Code fluency therefore does not imply scientific computational competence; numerical and physics-aware algorithm selection is the dominant failure mode.
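A sketch of a PassAll@k-style metric under the reading suggested by these numbers (all k sampled solutions must pass every test case, so the metric becomes stricter as k grows); this is an interpretation, not the benchmark's reference implementation:

```python
# Sketch of a PassAll@k-style metric under the reading described above:
# an item counts only if all k sampled solutions pass every test case.
# This is an interpretation, not the benchmark's reference implementation.
from typing import List


def pass_all_at_k(results: List[List[bool]], k: int) -> float:
    """
    results: per item, a list of per-sample booleans, each True when that
             sampled solution passed all test cases.
    Returns the fraction of items whose first k samples all passed.
    """
    scored = [all(item[:k]) for item in results if len(item) >= k]
    return sum(scored) / max(len(scored), 1)


if __name__ == "__main__":
    per_item = [
        [True, True, True, True, True],    # passes at k=1 and k=5
        [True, False, True, True, True],   # passes at k=1, fails at k=5
    ]
    print(pass_all_at_k(per_item, k=1))  # 1.0
    print(pass_all_at_k(per_item, k=5))  # 0.5
```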
| Model | Type | Action Sequence Similarity | Parameter Accuracy | Average |
|---|---|---|---|---|
Analysis: LLMs can list reasonable lab actions but fail to organize them into correct experimental trajectories, showing uniformly low sequence similarity and parameter accuracy despite permutation-equivalence checks. They frequently add irrelevant steps, omit essential operations, or misorder branches, and they struggle with temporal sampling, multi-sample pipeline alignment, and clear delineation of each sample's purpose.
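One plausible way to compute the two wet-experiment scores is sketched below, using a normalized edit distance over action sequences and exact matching of gold parameters; the benchmark additionally checks permutation-equivalent orderings, which this simplified version omits:

```python
# Plausible sketch of the two wet-experiment scores: an edit-distance-based
# action sequence similarity and a parameter accuracy over matched actions.
# The benchmark also checks permutation-equivalent orderings, omitted here.
from typing import Dict, List


def sequence_similarity(pred: List[str], gold: List[str]) -> float:
    """1 - normalized Levenshtein distance between action sequences."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gold[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return 1.0 - dp[m][n] / max(m, n, 1)


def parameter_accuracy(pred: Dict[str, Dict[str, str]], gold: Dict[str, Dict[str, str]]) -> float:
    """Fraction of gold (action, parameter) pairs reproduced exactly."""
    total = hits = 0
    for action, params in gold.items():
        for key, value in params.items():
            total += 1
            hits += int(pred.get(action, {}).get(key) == value)
    return hits / max(total, 1)


if __name__ == "__main__":
    gold_seq = ["centrifuge", "aspirate supernatant", "resuspend pellet"]
    pred_seq = ["centrifuge", "resuspend pellet"]
    print(round(sequence_similarity(pred_seq, gold_seq), 2))  # 0.67
    gold_params = {"centrifuge": {"speed": "3000 rpm", "time": "5 min"}}
    pred_params = {"centrifuge": {"speed": "3000 rpm", "time": "10 min"}}
    print(parameter_accuracy(pred_params, gold_params))  # 0.5
```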
| Model | Type | Multi-choice Accuracy | Reasoning Validity |
|---|---|---|---|
Analysis: Open-source models substantially lag behind closed ones in Multi-choice Accuracy (MCA), yet most LLMs achieve higher Reasoning Validity than answer accuracy: causal detection is strong while comparative reasoning is weak, particularly on cross-image contrast and in the materials, life, and Earth science domains. This reveals persistent cognitive bottlenecks in subtle discrimination and numeric extraction over heterogeneous scientific visual evidence.
This work advances the study of Scientific General Intelligence (SGI) from both theory and practice. Grounded in the Practical Inquiry Model, we formalize SGI as the capacity to navigate the iterative cycle of Deliberation, Conception, Action, and Perception with the versatility of a human scientist. Building on this principle-grounded definition, we operationalize SGI through SGI-Bench—a comprehensive, scientist-aligned benchmark that instantiates four core task families: Scientific Deep Research, Idea Generation, AI-Assisted Scientific Experiment (dry/wet), and Scientific Experimental Reasoning. Complemented by our agentic evaluation framework and multi-metric protocol, SGI-Bench enables scalable, transparent, and domain-faithful assessment.
Experiments reveal a consistent pattern: in Deep Research, models show step-level alignment but low exact-match accuracy (10-20%), with brittleness in quantitative reasoning; in Idea Generation, hypotheses are fluent but underspecified and infeasible; in Dry Experiment, code is executable but PassAll@k remains low; in Wet Experiment, sequences show omissions and misordering; and in Experimental Reasoning, causal reasoning outperforms comparative reasoning, with persistent multimodal challenges. These results highlight gaps between linguistic fluency and integrated scientific cognition. Moreover, SGI exhibits dynamic capacity: Test-Time Reinforcement Learning with novelty rewards improves idea generation without reference answers.
Taken together, SGI-Bench clarifies both what SGI is and where current systems fail. By integrating principled task design, multi-metric evaluation, and agentic tool use, our framework provides a concrete foundation for systematically advancing SGI. Looking forward, the combination of numerically robust reasoning, planning-aware conception, executable experimentation, comparative multimodal inference, dynamic test-time learning, and efficient tool ecosystems charts a clear path toward AI systems capable of genuine scientific discovery, bridging isolated competencies into fully integrated scientific intelligence.
@article{sgi2025,
  title={SGI-Bench: Scientific Intelligence Benchmark via Scientist-Aligned Workflows},
  author={Research Team},
  journal={arXiv preprint arXiv:2401.xxxxx},
  year={2025}
}