TheAgentCompany
Basic Information
The repository provides an extensible benchmark for measuring how well large language model (LLM) agents perform consequential, real-world professional tasks. It evaluates agents that act as digital workers: browsing the web, writing code, running programs, and communicating with coworkers. The benchmark packages each task as a Docker image with a standardized layout (utilities, instructions, and workspace), ships pre-built server services that simulate a realistic workplace environment, and documents the end-to-end procedure for deploying the servers, running tasks, and grading agent behavior. The project is accompanied by a research paper and a public leaderboard, and it is intended for researchers and practitioners who need reproducible, scalable evaluation of multi-step agent capabilities across diverse workplace roles and data types.
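To make the task-as-Docker-image workflow concrete, the sketch below shows one way a harness might pull a task image, run an agent command inside it, and collect a grading artifact. It is a minimal illustration only: the image name, the result path (/tmp/result.json), and the evaluator command are hypothetical placeholders, not the repository's actual layout or CLI; consult the project's own documentation for the real procedure.

```python
"""Minimal sketch of driving a single benchmark task container.

Assumptions (not taken from the repository): the task image can run a
long-lived shell, the evaluator is invoked as a command inside the
container, and it writes a JSON score file to /tmp/result.json.
"""

import json
import subprocess

# Hypothetical task image; real task images follow the project's own
# naming and registry scheme.
TASK_IMAGE = "example.registry/theagentcompany/example-task:latest"


def run_task(image: str, agent_cmd: list[str]) -> dict:
    """Pull a task image, run a command inside it, and read back a JSON
    result file (the path is an assumed convention for this sketch)."""
    subprocess.run(["docker", "pull", image], check=True)

    # Start the task container detached so we can exec into it.
    container = subprocess.run(
        ["docker", "run", "-d", image, "sleep", "infinity"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

    try:
        # Run the agent / evaluator command inside the task container.
        subprocess.run(["docker", "exec", container, *agent_cmd], check=True)

        # Copy out the (assumed) grading artifact and parse it.
        subprocess.run(
            ["docker", "cp", f"{container}:/tmp/result.json", "result.json"],
            check=True,
        )
        with open("result.json") as f:
            return json.load(f)
    finally:
        subprocess.run(["docker", "rm", "-f", container], check=True)


if __name__ == "__main__":
    # "/utils/evaluator.py" is a placeholder for whatever grading entry
    # point a real task image provides.
    print(run_task(TASK_IMAGE, ["python", "/utils/evaluator.py"]))
```

In practice the benchmark's documented scripts handle service deployment and grading; a wrapper like this is only meant to show the shape of the flow (pull image, execute inside the standardized workspace, read the score).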