Basic Information

The repository provides an extensible benchmark for measuring how well large language model (LLM) agents perform consequential, real-world professional tasks. It is designed to evaluate agents that act like digital workers by browsing the web, writing code, running programs, and communicating with coworkers. The benchmark bundles tasks as Docker images with standardized layouts (utilities, instructions, and workspaces), includes pre-baked server services for realistic environments, and documents end-to-end procedures to deploy servers, run tasks, and grade agent behavior. The project is accompanied by a research paper and a public leaderboard and is intended for researchers and practitioners who need reproducible, scalable evaluation of multi-step agent capabilities across diverse workplace roles and data types.
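As a rough illustration of that deploy, run, and grade flow, a driver script might look like the sketch below. The image name, entrypoint paths, and container name are placeholders for illustration, not the repository's actual conventions; consult its documentation for the real layout.

```python
# Minimal sketch of the documented deploy -> run -> grade flow, driven from Python.
# The task image name and the /utils entrypoint paths below are assumptions.
import subprocess

TASK_IMAGE = "example-registry/benchmark-task:latest"  # hypothetical task image

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Pull the task image and start a container that stays up while the agent works.
run(["docker", "pull", TASK_IMAGE])
run(["docker", "run", "-d", "--name", "task", TASK_IMAGE, "sleep", "infinity"])

# 2. Initialize the workspace via the task's init entrypoint (path assumed).
run(["docker", "exec", "task", "bash", "/utils/init.sh"])

# 3. ... the agent browses, writes code, and communicates inside the environment ...

# 4. Grade the final state via the task's eval entrypoint (path assumed).
run(["docker", "exec", "task", "python", "/utils/eval.py"])

# 5. Clean up the container.
run(["docker", "rm", "-f", "task"])
```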

App Details

Features
The repo structures every task as a Docker image with a utilities folder (init and eval entrypoints) and an instruction file, enabling automated setup and grading. It ships 175 task images and includes example servers (GitLab, Plane, ownCloud, RocketChat) with pre-baked data that simulate workplace environments. Tasks span diverse professional roles and data types, support multi-agent interaction such as communicating with coworkers, and are graded by a comprehensive scoring system that combines result-based evaluation with subcheckpoints. Multiple evaluation methods are available, including deterministic evaluators and LLM-based evaluators. Quick-start scripts automate server setup with Docker and Docker Compose. The project is extensible, so users can add their own tasks, evaluators, and checkpoints, and it optionally integrates with the OpenHands evaluation harness.
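To make the checkpoint idea concrete, here is a minimal sketch of a deterministic, subcheckpoint-style evaluator. The workspace path, checkpoint definitions, and point values are invented for illustration and do not reflect the repository's actual eval scripts.

```python
# Hypothetical sketch of result-based scoring with subcheckpoints: each checkpoint
# is a deterministic predicate over the final workspace state, and partial credit
# accrues per checkpoint passed. Paths and checks are illustrative only.
from pathlib import Path

WORKSPACE = Path("/workspace")  # assumed mount point inside the task container

def report_exists() -> bool:
    return (WORKSPACE / "report.md").exists()

def report_mentions_budget() -> bool:
    f = WORKSPACE / "report.md"
    return f.exists() and "budget" in f.read_text(errors="ignore").lower()

# (name, check function, points) -- weights are made up for this example.
CHECKPOINTS = [
    ("report file created", report_exists, 1),
    ("report covers budget", report_mentions_budget, 1),
]

def evaluate() -> dict:
    passed, score, total = {}, 0, 0
    for name, check, pts in CHECKPOINTS:
        ok = check()
        passed[name] = ok
        total += pts
        if ok:
            score += pts
    return {"checkpoints": passed, "score": score, "total": total}

if __name__ == "__main__":
    print(evaluate())
```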
Use Cases
This benchmark lets teams quantify and compare agent performance on realistic job tasks, helping researchers measure agent capabilities and helping industry assess readiness for automation. It provides reproducible infrastructure and tooling to run baseline experiments, collect trajectories, and perform standardized grading with the provided eval scripts. Integration options include the OpenHands harness or manual Docker workflows, and the documentation covers server provisioning, task initialization, execution, and grading. The scoring and evaluator options support both automated deterministic checks and judgment-based grading via LLM evaluators. Because the framework is extensible, organizations can add tasks and evaluators that mirror their own workflows, making it useful for benchmarking, development, and policy-oriented studies of agent automation.
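For the LLM-based evaluation path, a judge-style check could be sketched as follows; the judge model, prompt, and PASS/FAIL convention are assumptions rather than the benchmark's actual evaluator interface.

```python
# Illustrative-only sketch of an LLM-based evaluator: a judge model is asked whether
# the agent's final output satisfies the task's grading rubric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(rubric: str, agent_output: str) -> bool:
    """Return True if the judge model deems the output to satisfy the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Answer only PASS or FAIL."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nAgent output:\n{agent_output}"},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

if __name__ == "__main__":
    print(llm_judge("The summary must list all four servers that were set up.",
                    "Deployed GitLab, Plane, ownCloud, and RocketChat via Docker Compose."))
```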
