Enterprise Agent Benchmark

EnterpriseClawBench

A release-oriented leaderboard for evaluating coding agents on realistic enterprise workflows, file deliverables, and multimodal business artifacts.

Primary board: hard120

Main Leaderboard

Configurations are ranked by the primary score used in the EnterpriseClawBench report, in which attributable failures count as zero.

Agent-model configurations

The primary score routes each task to a text or visual judge and scores attributable failures as zero.

Rank | Model | Harness | Primary | Completed | Avg Completion | Visual N | Text N | Cost (RMB) | Cost / Completed | Median Time
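
The scoring rule described above can be illustrated with a minimal sketch. The page does not give the exact aggregation, so everything here is an assumption: the `TaskResult` type, its field names, the [0, 1] score range, the use of a plain mean, and the choice to exclude failures that are not attributable to the agent are all hypothetical and may differ from the EnterpriseClawBench report.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResult:
    """One task outcome for a model-harness configuration (hypothetical schema)."""
    judge: str                      # "text" or "visual": which judge the task was routed to
    score: Optional[float]          # judge score in [0, 1]; None if nothing was judged
    attributable_failure: bool = False  # True if the failure is attributed to the agent


def primary_score(results: list[TaskResult]) -> float:
    """Mean score with attributable failures counted as zero.

    Assumption: tasks that failed for non-attributable reasons (no score,
    not attributable) are excluded from the mean rather than penalized.
    """
    counted = []
    for r in results:
        if r.attributable_failure:
            counted.append(0.0)      # failure-zero rule
        elif r.score is not None:
            counted.append(r.score)  # judged normally by the routed judge
        # otherwise: non-attributable failure, skipped (assumption)
    return sum(counted) / len(counted) if counted else 0.0


def judge_counts(results: list[TaskResult]) -> dict[str, int]:
    """Number of tasks routed to each judge, as in the Visual N / Text N columns."""
    return dict(Counter(r.judge for r in results))


if __name__ == "__main__":
    demo = [
        TaskResult("text", 0.8),
        TaskResult("visual", 0.6),
        TaskResult("text", None, attributable_failure=True),
        TaskResult("visual", None),  # e.g. an infrastructure error
    ]
    print(f"primary = {primary_score(demo):.3f}", judge_counts(demo))
```

Under these assumptions, a configuration cannot raise its score by failing to produce a deliverable: each attributable failure adds a zero to the mean instead of dropping out of it.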

Performance Cockpit

Multiple views of score, cost, runtime, reliability, and judge routing.

Top Scores

Highest primary scores across configurations.

Score vs Cost

Each point is one model-harness configuration.

Task Class Breakdown

Enterprise task categories with class-level scores, winners, and hardest subclasses.

Benchmark Diagnostics

How tasks are filtered into the benchmark and how model performance varies by deliverable type.

Construction Funnel

Stages from raw enterprise TaskInstances through to benchmark packaging.

Artifact Insights

Model performance grouped by the primary artifact type required by each task.

Case Studies

Representative public-safe examples of generated artifacts and scoring behavior.