WindowsAgentArena

Basic Information

Windows Agent Arena (WAA) is a reproducible, scalable platform for testing and benchmarking multimodal desktop AI agents in a realistic Windows OS environment. The repository provides the infrastructure, scripts, images, and example agents needed to deploy, run, and evaluate agentic workflows that interact with a Windows 11 virtual machine. It is intended for researchers and developers who want to measure agent performance across many GUI-driven tasks, compare screen-understanding pipelines, and run experiments locally or at scale on Azure ML. WAA includes automation to prepare a golden Windows VM image, Docker images to host the server components, configuration files for OpenAI or Azure OpenAI keys, and orchestration scripts to run baseline agents, customize agent parameters, and collect benchmark outputs.
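As a sketch of how the API-key configuration fits in, the snippet below loads and sanity-checks a config.json before a run. The field names (OPENAI_API_KEY, AZURE_API_KEY, AZURE_ENDPOINT) are assumptions based on a typical OpenAI/Azure OpenAI setup, not a verbatim copy of the repository's template, so check the shipped config.json for the exact schema.

    import json
    import sys
    from pathlib import Path

    # Hedged sketch: validate the API-key config before launching a run.
    # Field names are assumptions; confirm them against the repository's
    # config.json template.
    CONFIG_PATH = Path("config.json")

    def load_config(path: Path) -> dict:
        cfg = json.loads(path.read_text())
        has_openai = bool(cfg.get("OPENAI_API_KEY"))
        has_azure = bool(cfg.get("AZURE_API_KEY")) and bool(cfg.get("AZURE_ENDPOINT"))
        if not (has_openai or has_azure):
            sys.exit("config.json needs an OpenAI key, or an Azure key plus endpoint")
        return cfg

    if __name__ == "__main__":
        config = load_config(CONFIG_PATH)
        print("Config OK; populated keys:", sorted(k for k, v in config.items() if v))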

Features
WAA bundles a base Docker image with build scripts, a reproducible workflow for preparing a Windows 11 golden image, and run scripts for both local and Azure deployment. Core artifacts include run-local.sh, build-container-image.sh, run_azure.py with experiments.json for parallel Azure ML runs, a config.json template for API keys, and utilities such as show_results.py for parsing benchmark results. The project ships example agents (including the Navi agent's modes and support for OmniParser), multiple screen-observation backends (selectable som-origin options plus the uia and win32 accessibility backends), optional GPU acceleration, and development tips for attaching debuggers and live-editing code inside running containers. It also documents cost and time estimates for large runs and provides guides for task development and agent contributions.
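To illustrate the result-collection side, the hypothetical aggregator below shows the kind of parsing show_results.py performs, under the assumption that each task run leaves a result.json with a domain name and a success flag. It is not the actual utility, whose interface and file layout may differ.

    import json
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical aggregator in the spirit of show_results.py (not the real
    # utility). Assumes each task run wrote results/<task>/result.json
    # containing {"domain": "...", "success": true/false}.
    RESULTS_DIR = Path("results")

    def aggregate(results_dir: Path) -> dict:
        per_domain = defaultdict(lambda: {"passed": 0, "total": 0})
        for result_file in results_dir.rglob("result.json"):
            record = json.loads(result_file.read_text())
            bucket = per_domain[record.get("domain", "unknown")]
            bucket["total"] += 1
            bucket["passed"] += bool(record.get("success"))
        return per_domain

    if __name__ == "__main__":
        for domain, stats in sorted(aggregate(RESULTS_DIR).items()):
            rate = stats["passed"] / stats["total"] if stats["total"] else 0.0
            print(f"{domain:20s} {stats['passed']:3d}/{stats['total']:3d} ({rate:.1%})")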
Use Cases
The repository helps researchers reproduce and scale evaluations of GUI-capable AI agents by providing an end-to-end environment, from VM preparation through parallel execution to result aggregation. Users can run the full benchmark locally to validate an agent, or cut wall-clock time by launching many Azure ML workers in parallel. The platform standardizes task configurations and experiment specifications so that different agents and screen-understanding stacks can be compared fairly. It lowers engineering overhead with scripts for building images, uploading VM storage, and collecting logs, and with a bring-your-own-agent (BYOA) template for plugging in custom agents. Documentation and development tips simplify debugging, resource tuning, and iteration on agent designs.
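As a rough sketch of what plugging in a custom agent looks like, the skeleton below assumes an observation-to-actions loop in which the harness calls a predict method with the task instruction and the current screen observation. The class name, method signature, and parameters here are hypothetical; the repository's BYOA template defines the actual interface.

    from dataclasses import dataclass, field

    # Hypothetical bring-your-own-agent skeleton; names and signature are
    # assumptions, not WAA's actual BYOA interface.
    @dataclass
    class MyAgent:
        model: str = "gpt-4o"  # backing model; assumed parameter
        history: list = field(default_factory=list)

        def predict(self, instruction: str, observation: dict) -> list:
            """Map the instruction plus the current observation (screenshot,
            accessibility tree, ...) to a list of GUI actions."""
            # A real agent would prompt its model here; this stub records the
            # step and declares the task done, enough to exercise a harness.
            self.history.append((instruction, observation.get("screenshot")))
            return ["DONE"]

    if __name__ == "__main__":
        agent = MyAgent()
        print(agent.predict("Open Notepad", {"screenshot": b"", "a11y_tree": "<tree/>"}))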
