Basic Information

The repository provides an extensible benchmark for measuring how well large language model (LLM) agents perform consequential, real-world professional tasks. It is designed to evaluate agents that act like digital workers by browsing the web, writing code, running programs, and communicating with coworkers. The benchmark bundles tasks as Docker images with standardized layouts (utilities, instructions, and workspaces), includes pre-baked server services for realistic environments, and documents end-to-end procedures to deploy servers, run tasks, and grade agent behavior. The project is accompanied by a research paper and a public leaderboard and is intended for researchers and practitioners who need reproducible, scalable evaluation of multi-step agent capabilities across diverse workplace roles and data types.
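As a rough illustration of that deploy, run, and grade flow, a driver script might look like the sketch below. The image name, entrypoint paths, and container name are placeholders for illustration, not the repository's actual conventions; consult its documentation for the real layout.

```python
# Minimal sketch of the documented deploy -> run -> grade flow, driven from Python.
# The task image name and the /utils entrypoint paths below are assumptions.
import subprocess

TASK_IMAGE = "example-registry/benchmark-task:latest"  # hypothetical task image

def run(cmd):
    """Run a shell command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Pull the task image and start a container that stays up while the agent works.
run(["docker", "pull", TASK_IMAGE])
run(["docker", "run", "-d", "--name", "task", TASK_IMAGE, "sleep", "infinity"])

# 2. Initialize the workspace via the task's init entrypoint (path assumed).
run(["docker", "exec", "task", "bash", "/utils/init.sh"])

# 3. ... the agent browses, writes code, and communicates inside the environment ...

# 4. Grade the final state via the task's eval entrypoint (path assumed).
run(["docker", "exec", "task", "python", "/utils/eval.py"])

# 5. Clean up the container.
run(["docker", "rm", "-f", "task"])
```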

App Details

Features
The repo structures every task as a Docker image with a utilities folder (init and eval entrypoints) and an instruction file, enabling automated setup and grading. It ships 175 task images and includes example servers (GitLab, Plane, ownCloud, RocketChat) with pre-baked data that simulate workplace environments. Tasks span diverse professional roles and data types, support multi-agent interaction such as communicating with coworkers, and are graded by a comprehensive scoring system that combines result-based evaluation with subcheckpoints. Multiple evaluation methods are available, including deterministic evaluators and LLM-based evaluators. Quick-start scripts automate server setup with Docker and Docker Compose. The project is extensible, so users can add their own tasks, evaluators, and checkpoints, and it optionally integrates with the OpenHands evaluation harness.
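To make the checkpoint idea concrete, here is a minimal sketch of a deterministic, subcheckpoint-style evaluator. The workspace path, checkpoint definitions, and point values are invented for illustration and do not reflect the repository's actual eval scripts.

```python
# Hypothetical sketch of result-based scoring with subcheckpoints: each checkpoint
# is a deterministic predicate over the final workspace state, and partial credit
# accrues per checkpoint passed. Paths and checks are illustrative only.
from pathlib import Path

WORKSPACE = Path("/workspace")  # assumed mount point inside the task container

def report_exists() -> bool:
    return (WORKSPACE / "report.md").exists()

def report_mentions_budget() -> bool:
    f = WORKSPACE / "report.md"
    return f.exists() and "budget" in f.read_text(errors="ignore").lower()

# (name, check function, points) -- weights are made up for this example.
CHECKPOINTS = [
    ("report file created", report_exists, 1),
    ("report covers budget", report_mentions_budget, 1),
]

def evaluate() -> dict:
    passed, score, total = {}, 0, 0
    for name, check, pts in CHECKPOINTS:
        ok = check()
        passed[name] = ok
        total += pts
        if ok:
            score += pts
    return {"checkpoints": passed, "score": score, "total": total}

if __name__ == "__main__":
    print(evaluate())
```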
Use Cases
This benchmark lets teams quantify and compare agent performance on realistic job tasks, helping researchers measure agent capabilities and helping industry assess readiness for automation. It provides reproducible infrastructure and tooling to run baseline experiments, collect trajectories, and perform standardized grading with the provided eval scripts. Integration options include the OpenHands harness or manual Docker workflows, and the documentation covers server provisioning, task initialization, execution, and grading. The scoring and evaluator options support both automated deterministic checks and judgment-based grading via LLM evaluators. Because the framework is extensible, organizations can add tasks and evaluators that mirror their own workflows, making it useful for benchmarking, development, and policy-oriented studies of agent automation.
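For the LLM-based evaluation path, a judge-style check could be sketched as follows; the judge model, prompt, and PASS/FAIL convention are assumptions rather than the benchmark's actual evaluator interface.

```python
# Illustrative-only sketch of an LLM-based evaluator: a judge model is asked whether
# the agent's final output satisfies the task's grading rubric.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_judge(rubric: str, agent_output: str) -> bool:
    """Return True if the judge model deems the output to satisfy the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Answer only PASS or FAIL."},
            {"role": "user",
             "content": f"Rubric:\n{rubric}\n\nAgent output:\n{agent_output}"},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")

if __name__ == "__main__":
    print(llm_judge("The summary must list all four servers that were set up.",
                    "Deployed GitLab, Plane, ownCloud, and RocketChat via Docker Compose."))
```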
