Basic Information

OSWorld is an open research environment and benchmark for evaluating multimodal agents on open-ended tasks in real computer environments. It provides a programmatic DesktopEnv for running agent policies against virtual machines or containerized desktops and includes the interfaces and baseline agents used in the paper. The repository bundles task definitions, evaluation examples, scripts for running single and parallel experiments, and guidance for local and public verification of results. It targets researchers and developers who want to test agent behavior on GUI and web tasks inside reproducible VM- or Docker-based environments, compare performance against published baselines, and submit results to the OSWorld-Verified leaderboard. The project also supplies documentation, a citation for academic use, and downloadable init-state files to accelerate setup.
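
As an illustration, the Gym-style loop below is a minimal sketch modeled on the repository's quick-start. The import path, the constructor's action_space argument, and the step/reset signatures follow the documented usage, but the task dict here is abridged; real task configs live in evaluation_examples and also carry config and evaluator fields, so check the current README before relying on this exact schema.

```python
from desktop_env.desktop_env import DesktopEnv

# Abridged task definition for illustration only; real task configs in
# evaluation_examples/ additionally include "config" (init steps) and
# "evaluator" (scoring) fields.
example = {
    "id": "demo-task",
    "instruction": "Open the file manager.",
}

# Assumes a local provider (e.g. Docker or VMware) is already set up.
env = DesktopEnv(action_space="pyautogui")

obs = env.reset(task_config=example)   # boot the desktop and apply init state
obs, reward, done, info = env.step("pyautogui.rightClick()")  # one action
env.close()                            # release the VM/container
```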

App Details

Features
- Provides a DesktopEnv API and agent/environment interfaces with example usage and baseline agents.
- Supports multiple providers: VMware (including Fusion on Apple silicon), VirtualBox, Docker with KVM support, and AWS for large-scale parallel evaluation.
- Includes runnable scripts (run.py, run_multienv.py, show_result.py) and experiment tooling that collects screenshots, actions, and video recordings into a results directory (see the aggregation sketch below).
- Offers installation instructions, requirements, and a convenience package (desktop-env).
- Contains evaluation guidelines, public verification procedures, account and proxy configuration guidance, a data viewer, the evaluation_examples dataset, and an OSWorld-Verified workflow for verified leaderboard submissions.
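
For instance, a results folder produced by run.py can be summarized with a few lines of Python. The layout assumed below (one result.txt per finished task directory, holding a single numeric score) is modeled on what show_result.py consumes and is not a guaranteed schema; adjust the path and file name to match your checkout.

```python
import os

def summarize(results_dir: str = "./results") -> float:
    """Walk a results tree and average per-task scores.

    Assumes each finished task directory holds a result.txt containing
    a single numeric score (e.g. 0.0 or 1.0); this mirrors the layout
    show_result.py reads but may differ between versions.
    """
    scores = []
    for root, _dirs, files in os.walk(results_dir):
        if "result.txt" in files:
            with open(os.path.join(root, "result.txt")) as f:
                scores.append(float(f.read().strip()))
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    print(f"mean score: {summarize():.3f}")
```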
Use Cases
OSWorld lets researchers and practitioners reproduce and benchmark agent behavior in realistic desktop settings so multimodal models can be tested on real interaction tasks. It helps scale evaluations via Docker and AWS parallelization to reduce experiment time, and supports reproducible local runs on VMware or VirtualBox. The collected logs, screenshots, and recordings assist debugging and result analysis. Guidance for Google OAuth and proxy configuration enables tasks that need web account access. The framework makes it easier to run baseline comparisons, validate results for the verified leaderboard, and cite a consistent academic benchmark when publishing agent performance.
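
To illustrate the scaling path, the snippet below shells out to run_multienv.py with Docker as the provider. The flag names (--provider_name, --num_envs, --headless, --result_dir) are assumptions based on the script's documented usage and may differ between versions; verify them with the script's --help output before running.

```python
import subprocess

# Hypothetical invocation of the parallel runner; confirm flag names
# with `python run_multienv.py --help` in your checkout.
subprocess.run(
    [
        "python", "run_multienv.py",
        "--provider_name", "docker",   # containerized desktops with KVM
        "--headless",                  # no visible VM window
        "--num_envs", "4",             # four environments in parallel
        "--result_dir", "./results",   # logs, screenshots, recordings
    ],
    check=True,                        # raise if the runner exits nonzero
)
```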
