Leaderboard
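g: SOTA-normalized relative gap, g = dir * (m - m_sota) / |m_sota|, where m is the agent's score, m_sota is the paper's reported SOTA score, and dir sets the sign so that g > 0 means better than SOTA regardless of metric direction.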
Surpass-SOTA: share of judge-valid tasks with g > 0.1.
Match-SOTA: share of judge-valid tasks with g >= 0, so it also counts tasks that surpass SOTA; g = 0 means the result exactly equals the paper's SOTA.
CR: Completion Rate, the valid-score rate.
SR: Score Rate, the any-score rate.
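The definitions of g, Surpass-SOTA, and Match-SOTA above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, dir is assumed to be +1 for higher-is-better metrics and -1 for lower-is-better ones, and the input to the rate function is assumed to contain only judge-valid tasks. CR and SR are left out because their denominators are not spelled out here.

```python
from typing import Sequence


def relative_gap(m: float, m_sota: float, higher_is_better: bool = True) -> float:
    """Compute g = dir * (m - m_sota) / |m_sota|.

    Assumption: dir is +1 when higher scores are better, -1 otherwise.
    """
    if m_sota == 0:
        # The normalization divides by |m_sota|, so g is undefined here.
        raise ValueError("g is undefined when m_sota is 0")
    direction = 1.0 if higher_is_better else -1.0
    return direction * (m - m_sota) / abs(m_sota)


def sota_rates(gaps: Sequence[float]) -> tuple[float, float]:
    """Return (Surpass-SOTA, Match-SOTA) as shares of the given gaps.

    Assumption: `gaps` holds one g value per judge-valid task.
    """
    if not gaps:
        return 0.0, 0.0
    n = len(gaps)
    surpass = sum(g > 0.1 for g in gaps) / n  # strictly above 0.1
    match = sum(g >= 0 for g in gaps) / n     # nonnegative, includes surpass
    return surpass, match
```

For example, a lower-is-better metric with m = 0.5 against m_sota = 1.0 gives g = 0.5, which counts toward both Surpass-SOTA and Match-SOTA.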
Main Leaderboard
Numeric ranking of agent-model configurations by Surpass-SOTA.
| Rank | Model | Harness | Surpass-SOTA | Match-SOTA | Median g (all) | CR | SR | Invalid |
|---|---|---|---|---|---|---|---|---|
Case Studies
Representative cases distilled from full agent traces
Score Distribution
Task-level g bins per configuration
Domain Breakdown
Domain labels follow the six-domain taxonomy used in the NatureBench paper. Domain rankings use the same Surpass-SOTA rule as the overall leaderboard.
Domain Ranking
Domain Winners
Best configuration within each scientific domain
| Domain | N | Winner | Surpass-SOTA | Match-SOTA | Median g (all) |
|---|---|---|---|---|---|
Case-Level View
Each cell shows task-level normalized relative gap g. Values above 0.1 count as Surpass-SOTA; nonnegative values match the paper's reported SOTA.
Surpass-SOTA, g > 0.1
Match-SOTA, 0 <= g <= 0.1
Below SOTA, g < 0
Invalid
No score/submission
Outlined cell: best configuration for that task
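As a rough sketch of how a cell could be mapped to the legend above, the snippet below bins a single g value and picks the outlined configuration for one task. The function names and the data representation are hypothetical: a missing submission is assumed to be flagged separately, an invalid result is assumed to be stored as None, and "best" is assumed to mean the highest valid g.

```python
from typing import Optional


def legend_bin(g: Optional[float], has_submission: bool = True) -> str:
    """Map one task-level cell to its legend category."""
    if not has_submission:
        return "No score/submission"
    if g is None:
        # Assumption: an invalid result carries no usable g value.
        return "Invalid"
    if g > 0.1:
        return "Surpass-SOTA"
    if g >= 0:
        return "Match-SOTA"
    return "Below SOTA"


def best_configuration(row: dict[str, Optional[float]]) -> Optional[str]:
    """Return the configuration with the highest valid g for one task.

    Assumption: "best" means largest g; returns None if no valid g exists.
    """
    valid = {name: g for name, g in row.items() if g is not None}
    if not valid:
        return None
    return max(valid, key=valid.get)
```

Note that g = 0.1 exactly falls in the Match-SOTA bin, since Surpass-SOTA requires g strictly above 0.1.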