Main Leaderboard
Ranked by the primary attributable-failure-zero score used in the EnterpriseClawBench report.
Agent-model configurations
Primary score combines text and visual judge routing with attributable failures scored as zero.
| Rank | Model | Harness | Primary | Completed Avg | Completion | Visual N | Text N | Cost RMB | Cost / Completed | Median Time |
|---|---|---|---|---|---|---|---|---|---|---|
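As a rough illustration of the "attributable failures scored as zero" rule behind the Primary column, the sketch below averages per-task judge scores for one configuration. The `TaskResult` fields, the score scale, and the choice to skip unscored runs that are not attributable to the agent are all assumptions made for this example, not the report's actual scoring code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    """Hypothetical record for one task run by one model-harness configuration."""
    judge: str                   # "text" or "visual": which judge the task was routed to
    score: Optional[float]       # judge score, or None when no score was produced
    attributable_failure: bool   # True when the failure is blamed on the agent itself


def primary_score(results: List[TaskResult]) -> float:
    """Mean score with attributable failures counted as zero."""
    scored = []
    for r in results:
        if r.attributable_failure:
            # The agent failed the task, so it contributes a zero.
            scored.append(0.0)
        elif r.score is not None:
            # Use whichever judge (text or visual) the task was routed to.
            scored.append(r.score)
        # Assumption: unscored runs that are NOT attributable to the agent
        # (for example infrastructure errors) are left out of the average.
    return sum(scored) / len(scored) if scored else 0.0


results = [
    TaskResult("text", 80.0, False),
    TaskResult("visual", 60.0, False),
    TaskResult("text", None, True),     # attributable failure, counted as 0
    TaskResult("visual", None, False),  # non-attributable, skipped
]
print(primary_score(results))
```

Under these assumptions, the Completed Avg column would correspond to the same mean taken only over runs that produced a score.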
Performance Cockpit
Multiple views of score, cost, runtime, reliability, and judge routing.
Top Scores
Highest primary scores across configurations.
Score vs Cost
Each point is one model-harness configuration.
Score vs Runtime
Each point is one model-harness configuration.
Harness Heatmap
Aggregated metrics by execution harness.
Score Distribution
Primary-score bins across all configurations.
Five-Dimension Profile
Semantic score profile across the benchmark's five evaluation dimensions.
Task Class Breakdown
Enterprise task categories with class-level scores, winners, and hardest subclasses.
Hardest Subclasses
Subclasses with the lowest average score.
Highest Variance Tasks
Tasks where model-harness configurations disagree the most.
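One plausible way to measure how much configurations "disagree" on a task is the variance of their scores on it. The sketch below uses population variance; the function name, input shape, and the choice of variance as the disagreement measure are assumptions for illustration only.

```python
from statistics import pvariance
from typing import Dict, List, Tuple


def highest_variance_tasks(
    scores_by_task: Dict[str, List[float]], top_n: int = 3
) -> List[Tuple[str, float]]:
    """Rank tasks by the population variance of scores across configurations."""
    ranked = sorted(
        (
            (task, pvariance(scores))
            for task, scores in scores_by_task.items()
            if len(scores) > 1  # variance across one configuration is not informative
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical scores: each list holds one score per model-harness configuration.
example = {
    "task_a": [0.0, 100.0],
    "task_b": [50.0, 50.0],
    "task_c": [40.0, 60.0],
}
print(highest_variance_tasks(example, top_n=2))
```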
Efficiency Board
Configurations with the highest primary score per RMB spent.
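The ranking described here can be sketched as primary score divided by total cost in RMB. The tuple layout and the decision to drop configurations with no recorded cost are assumptions for this example.

```python
from typing import List, Tuple


def efficiency_board(
    configs: List[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    """Rank configurations by primary score per RMB spent.

    Each input tuple is (name, primary_score, cost_rmb).
    """
    rows = [
        (name, score / cost)
        for name, score, cost in configs
        if cost > 0  # assumption: zero or missing cost cannot be ranked
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical configurations, not taken from the leaderboard.
print(efficiency_board([
    ("config_a", 80.0, 40.0),
    ("config_b", 60.0, 10.0),
    ("config_c", 90.0, 0.0),
]))
```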
Reliability Board
Completion, missing primary evidence, and warning-aware ranking.
Benchmark Diagnostics
How tasks are filtered into the benchmark and how model performance varies by deliverable type.
Construction Funnel
Raw enterprise TaskInstances through benchmark packaging.
Artifact Insights
Model performance grouped by the primary artifact type required by each task.
Case Studies
Representative public-safe examples of generated artifacts and scoring behavior.