Scientific ML Agent Benchmark

NatureBench

Can coding agents match the published SOTA of Nature-family papers?

We evaluate 10 coding-agent configurations on 90 Nature-family scientific ML tasks in isolated containers, with web search disabled and a 4-hour budget per task.

Leaderboard

g: SOTA-normalized relative gap, g = dir * (m - m_sota) / |m_sota|, where dir handles metric direction; g = 0 matches paper SOTA.
Surpass-SOTA: share of judge-valid tasks with g > 0.1.
Match-SOTA: share of judge-valid tasks with g >= 0.
CR: Completion Rate, the valid-score rate.
SR: Score Rate, the any-score rate.
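The definitions of g, Surpass-SOTA, and Match-SOTA can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code. It assumes that dir is +1 for higher-is-better metrics and -1 for lower-is-better metrics, and that both rates are divided by the number of judge-valid tasks; the function names `relative_gap` and `sota_rates` are hypothetical.

```python
from typing import Sequence, Tuple


def relative_gap(m: float, m_sota: float, higher_is_better: bool = True) -> float:
    """SOTA-normalized relative gap g = dir * (m - m_sota) / |m_sota|.

    Assumption: dir is +1 when a higher metric is better and -1 otherwise,
    so a positive g always means the agent beat the paper's reported SOTA.
    """
    if m_sota == 0:
        raise ValueError("g is undefined when m_sota is 0")
    direction = 1.0 if higher_is_better else -1.0
    return direction * (m - m_sota) / abs(m_sota)


def sota_rates(gaps: Sequence[float], threshold: float = 0.1) -> Tuple[float, float]:
    """Return (Surpass-SOTA, Match-SOTA) over a list of judge-valid g values.

    Assumption: the denominator is the number of judge-valid tasks.
    """
    if not gaps:
        return 0.0, 0.0
    n = len(gaps)
    surpass = sum(g > threshold for g in gaps) / n  # strictly above 0.1
    match = sum(g >= 0 for g in gaps) / n           # g = 0 matches paper SOTA
    return surpass, match


if __name__ == "__main__":
    # Illustrative values only, not benchmark results.
    print(relative_gap(0.90, 0.80))                         # accuracy-style metric
    print(relative_gap(0.45, 0.50, higher_is_better=False))  # error-style metric
    print(sota_rates([0.2, 0.05, -0.1, 0.0]))
```

Under these assumptions Match-SOTA includes every task counted by Surpass-SOTA, since g > 0.1 implies g >= 0.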

Tasks: 90
Agent-model configurations: 10
Scientific domains: 6
Best Surpass-SOTA: 17.8%
Best Match-SOTA: 47.8%
Top completion: 100.0%

Main Leaderboard

Numeric ranking of agent-model configurations by Surpass-SOTA.

Rank | Model | Harness | Surpass-SOTA | Match-SOTA | Median g (all) | CR | SR | Invalid

Score Distribution

Task-level g bins per configuration

Domain Breakdown

Domain labels follow the six-domain taxonomy used in the NatureBench paper. Domain rankings use the same Surpass-SOTA rule as the overall leaderboard.
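Applying the Surpass-SOTA rule per domain might look like the sketch below. The record layout, the configuration names, the domain names, and the tie-break by Match-SOTA are all assumptions made for illustration; the source states only that domain rankings use the same Surpass-SOTA rule as the overall leaderboard.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record: (configuration, domain, g), with g = None when the
# task has no judge-valid score.
Record = Tuple[str, str, Optional[float]]


def domain_winners(records: Iterable[Record], threshold: float = 0.1) -> Dict[str, str]:
    """Pick the configuration with the highest Surpass-SOTA in each domain."""
    gaps: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for config, domain, g in records:
        if g is not None:  # only judge-valid tasks contribute
            gaps[(domain, config)].append(g)

    best: Dict[str, Tuple[str, Tuple[float, float]]] = {}
    for (domain, config), gs in gaps.items():
        surpass = sum(g > threshold for g in gs) / len(gs)
        match = sum(g >= 0 for g in gs) / len(gs)
        key = (surpass, match)  # assumed tie-break: Match-SOTA
        if domain not in best or key > best[domain][1]:
            best[domain] = (config, key)
    return {domain: config for domain, (config, _) in best.items()}


if __name__ == "__main__":
    # Illustrative records only, not benchmark results.
    demo = [
        ("agent-a", "biology", 0.2),
        ("agent-a", "biology", -0.1),
        ("agent-b", "biology", 0.05),
        ("agent-b", "biology", 0.0),
        ("agent-a", "chemistry", None),
        ("agent-a", "chemistry", 0.0),
        ("agent-b", "chemistry", 0.3),
    ]
    print(domain_winners(demo))
```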

Domain Ranking

Domain Winners

Best configuration within each scientific domain

Domain | N | Winner | Surpass-SOTA | Match-SOTA | Median g (all)

Case-Level View

Each cell shows the task-level SOTA-normalized relative gap g. Values above 0.1 count as Surpass-SOTA; nonnegative values match or exceed the paper's reported SOTA.

Legend:
Surpass-SOTA: g > 0.1
Match-SOTA: 0 <= g <= 0.1
Below SOTA: g < 0
Invalid
No score/submission
Outlined cell: best configuration for that task
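The legend categories can be expressed as a small classification function. This is a sketch under stated assumptions: an invalid result is represented as `g = None` with a submission present, a missing score or submission is flagged by `submitted=False`, and the names `classify_cell` and `bin_counts` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def classify_cell(g: Optional[float], submitted: bool = True) -> str:
    """Map one task-level result to its legend category."""
    if not submitted:
        return "No score/submission"
    if g is None:
        return "Invalid"
    if g > 0.1:
        return "Surpass-SOTA"
    if g >= 0:
        return "Match-SOTA"  # 0 <= g <= 0.1
    return "Below SOTA"


def bin_counts(gaps: Iterable[Optional[float]]) -> Counter:
    """Count legend categories for one configuration's submitted tasks."""
    return Counter(classify_cell(g) for g in gaps)


if __name__ == "__main__":
    # Illustrative values only, not benchmark results.
    print(bin_counts([0.25, 0.1, 0.0, -0.3, None]))
```

Note that a cell with g exactly 0.1 falls in the Match-SOTA bin, because Surpass-SOTA requires g strictly greater than 0.1.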

Per-Case Score Matrix