NatureBench Leaderboard


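g: SOTA-normalized relative gap, g = dir * (m - m_sota) / |m_sota|, where m is the configuration's metric value, m_sota is the paper-reported SOTA value, and dir is the sign that orients the metric so that g > 0 always means better than SOTA. Surpass-SOTA: share of judge-valid tasks with g > 0.1. Match-SOTA: share of judge-valid tasks with g >= 0, where g = 0 exactly matches the paper's SOTA. CR: Completion Rate, the share of tasks with a valid score. SR: Score Rate, the share of tasks with any score.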
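The definitions above can be illustrated with a minimal Python sketch. It assumes that dir is +1 for higher-is-better metrics and -1 for lower-is-better metrics, and that every judge-valid task carries a numeric score; the `TaskResult` record and its field names are hypothetical and not part of NatureBench.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    # Hypothetical record; field names are illustrative only.
    score: Optional[float]      # metric value m, or None if no score exists
    sota: float                 # paper-reported SOTA value m_sota
    higher_is_better: bool      # assumed encoding of the metric direction
    judge_valid: bool           # whether the judge accepted the result


def relative_gap(m: float, m_sota: float, higher_is_better: bool) -> float:
    """g = dir * (m - m_sota) / |m_sota|, with dir assumed to be +1 or -1."""
    if m_sota == 0:
        raise ValueError("g is undefined when m_sota is 0")
    direction = 1.0 if higher_is_better else -1.0
    return direction * (m - m_sota) / abs(m_sota)


def sota_rates(results, surpass_threshold=0.1):
    """Return (Surpass-SOTA, Match-SOTA) as shares of judge-valid tasks."""
    gaps = [
        relative_gap(r.score, r.sota, r.higher_is_better)
        for r in results
        if r.judge_valid and r.score is not None
    ]
    if not gaps:
        return 0.0, 0.0
    surpass = sum(g > surpass_threshold for g in gaps) / len(gaps)
    match = sum(g >= 0 for g in gaps) / len(gaps)
    return surpass, match


# Made-up example values, not leaderboard data.
example = [
    TaskResult(0.90, 0.80, True, True),    # g is about 0.125: surpasses SOTA
    TaskResult(1.90, 2.00, False, True),   # g is about 0.05: matches SOTA
    TaskResult(0.50, 1.00, True, True),    # g = -0.5: below SOTA
    TaskResult(None, 1.00, True, False),   # not judge-valid: excluded
]
surpass_rate, match_rate = sota_rates(example)
```

Because Match-SOTA uses g >= 0, every task counted toward Surpass-SOTA is also counted toward Match-SOTA in this sketch.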

Tasks

Agent-model configurations

Scientific domains

Best Surpass-SOTA: 17.8%

Best Match-SOTA: 47.8%

Top completion: 100.0%

Main Leaderboard

Numeric ranking of agent-model configurations by Surpass-SOTA.

Rank	Model	Harness	Surpass-SOTA	Match-SOTA	Median g (all)	CR	SR	Invalid

Case Studies

Representative cases distilled from full agent traces

3 cases

Score Distribution

Task-level g bins per configuration

Domain Breakdown

Domain labels follow the six-domain taxonomy used in the NatureBench paper. Domain rankings use the same Surpass-SOTA rule as the overall leaderboard.
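As a rough sketch of how a per-domain winner could be derived under the same Surpass-SOTA rule, the following Python groups task-level g values by domain and configuration. The input layout, the function name `domain_winners`, and the tie handling (the first configuration seen keeps the lead) are assumptions for illustration; the page does not state how ties are resolved.

```python
from collections import defaultdict


def domain_winners(rows, threshold=0.1):
    """Pick the configuration with the highest Surpass-SOTA share per domain.

    rows: iterable of (domain, configuration, g) tuples for judge-valid tasks.
    Returns {domain: (configuration, surpass_share)}.
    """
    gaps = defaultdict(list)
    for domain, config, g in rows:
        gaps[(domain, config)].append(g)

    best = {}
    for (domain, config), values in gaps.items():
        share = sum(g > threshold for g in values) / len(values)
        # Strict comparison: on a tie the earlier configuration is kept
        # (an assumption, not a documented NatureBench rule).
        if domain not in best or share > best[domain][1]:
            best[domain] = (config, share)
    return best


# Placeholder domain and configuration names with made-up g values.
rows = [
    ("domain_a", "config_1", 0.20),
    ("domain_a", "config_1", -0.10),
    ("domain_a", "config_2", 0.05),
    ("domain_a", "config_2", 0.00),
    ("domain_b", "config_1", -0.30),
    ("domain_b", "config_2", 0.15),
]
winners = domain_winners(rows)
```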

Domain Ranking

Domain

Domain Winners

Best configuration within each scientific domain

Domain	N	Winner	Surpass-SOTA	Match-SOTA	Median g (all)

Case-Level View

Each cell shows the task-level normalized relative gap g. Values above 0.1 count as Surpass-SOTA; nonnegative values match or exceed the paper's reported SOTA.

Surpass-SOTA: g > 0.1
Match-SOTA: 0 <= g <= 0.1
Below SOTA: g < 0
Invalid
No score/submission
Outlined cell: best configuration for that task
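The legend categories can be expressed as a small classification function. This is a sketch under assumptions: the function name `classify_cell` is hypothetical, a missing submission is passed in as `submitted=False`, and an invalid result is represented by `g` being `None` or NaN; the page does not specify how these two states are encoded.

```python
import math
from typing import Optional


def classify_cell(g: Optional[float], submitted: bool = True,
                  threshold: float = 0.1) -> str:
    """Map a task-level gap g to one of the legend categories."""
    if not submitted:
        return "No score/submission"
    if g is None or math.isnan(g):
        # Assumed encoding of a result that exists but is not valid.
        return "Invalid"
    if g > threshold:
        return "Surpass-SOTA"
    if g >= 0:
        return "Match-SOTA"
    return "Below SOTA"


labels = [classify_cell(g) for g in (0.25, 0.1, 0.0, -0.02, None)]
```

Note that g = 0.1 falls in the Match-SOTA bin, since Surpass-SOTA requires g strictly greater than 0.1.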

Per-Case Score Matrix

Domain