EnterpriseClawBench Leaderboard

Main Leaderboard

Ranked by the primary attributable-failure-zero score used in the EnterpriseClawBench report.


Primary score combines text and visual judge routing with attributable failures scored as zero.
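As a rough illustration of the scoring rule described above, the sketch below averages per-task judge scores and counts attributable failures as zero. The `TaskResult` type, its field names, the 0-to-1 score scale, and the choice to exclude unscored non-attributable tasks are all assumptions made for this example; the report's actual aggregation may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    judge: str                      # "text" or "visual" (judge routing)
    score: Optional[float]          # judge score, assumed in [0, 1]; None if unscored
    attributable_failure: bool = False


def primary_score(results: Iterable[TaskResult]) -> Optional[float]:
    """Mean score where attributable failures contribute 0.0.

    Assumption: tasks with no score and no attributable failure
    are left out of the denominator.
    """
    values = []
    for r in results:
        if r.attributable_failure:
            values.append(0.0)
        elif r.score is not None:
            values.append(r.score)
    return sum(values) / len(values) if values else None


def completed_avg(results: Iterable[TaskResult]) -> Optional[float]:
    """Mean score over tasks that completed and were judged."""
    values = [r.score for r in results
              if r.score is not None and not r.attributable_failure]
    return sum(values) / len(values) if values else None
```

Under these assumptions, a configuration with two judged tasks and one attributable failure has a lower primary score than its completed average, which is the gap the Primary and Completed Avg columns would expose.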


Rank	Model	Harness	Primary	Completed Avg	Completion	Visual N	Text N	Cost RMB	Cost / Completed	Median Time

Multiple views of score, cost, runtime, reliability, and judge routing.

Highest primary scores across configurations.

Each point is one model-harness configuration.


Aggregated metrics by execution harness.

Primary-score bins across all configurations.

Semantic score profile across the benchmark's five evaluation dimensions.


Enterprise task categories with class-level scores, winners, and hardest subclasses.

Subclasses with the lowest average score.

Tasks where model-harness configurations disagree the most.
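One plausible way to quantify disagreement is the spread of scores a task receives across configurations. The sketch below uses the population standard deviation; that metric, the function name, and the input shape are assumptions for illustration, since the page does not state how disagreement is measured.

```python
from statistics import pstdev
from typing import Dict, List, Tuple


def rank_tasks_by_disagreement(
    scores_by_task: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Sort tasks by score spread across configurations, largest first.

    scores_by_task maps a task id to the scores it received from each
    model-harness configuration. Tasks with fewer than two scores are
    skipped because spread is not meaningful for them.
    """
    spread = {
        task: pstdev(scores)
        for task, scores in scores_by_task.items()
        if len(scores) >= 2
    }
    return sorted(spread.items(), key=lambda item: item[1], reverse=True)
```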

Configurations with the highest primary score per RMB spent.
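A minimal sketch of this cost-efficiency ranking, assuming each configuration is given as a (name, primary score, total cost in RMB) tuple. The function name and the decision to skip zero-cost entries are choices made for this example, not details taken from the leaderboard.

```python
from typing import Iterable, List, Tuple


def rank_by_score_per_rmb(
    configs: Iterable[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    """Return (name, primary score per RMB) pairs, most efficient first.

    Entries with non-positive cost are skipped to avoid dividing by zero.
    """
    ranked = []
    for name, primary, cost_rmb in configs:
        if cost_rmb <= 0:
            continue
        ranked.append((name, primary / cost_rmb))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

A ratio like this favors cheap configurations even when their absolute score is modest, so it is best read alongside the main ranking rather than in place of it.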

Completion, missing primary evidence, and warning-aware ranking.

How tasks are filtered into the benchmark and how model performance varies by deliverable type.

Raw enterprise TaskInstances through benchmark packaging.

Model performance grouped by the primary artifact type required by each task.

Representative public-safe examples of generated artifacts and scoring behavior.