AgentBench

Basic Information

AgentBench is a research benchmark and accompanying framework for evaluating large language models acting as autonomous agents across diverse simulated environments. It provides datasets, task servers, scripts, and configuration to run multi-turn agent interactions over eight environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. The repo offers Dev and Test splits, a leaderboard of model scores, documentation of the framework structure and configuration, and quick start instructions showing how to run tasks with an OpenAI model such as gpt-3.5-turbo-0613. It also links to VisualAgentBench, an extended suite for visual foundation agents, along with trajectory data for behavior cloning. The project is intended for reproducible evaluation, comparison, and extension of LLM-as-agent research and includes citation details for academic use.
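
As a rough sketch of that quick start flow, the shell commands below string together the module names mentioned in this listing (src.client.agent_test, src.start_task, src.assigner); the flags, the API-key variable, and the exact ordering are assumptions to be checked against the repository's README rather than a verbatim copy of it.

```bash
# Sketch of the quick-start flow; flags and the exact order of steps are assumptions
# to verify against the repository's README.
export OPENAI_API_KEY="sk-..."       # API key consumed by the OpenAI-backed agent config

pip install -r requirements.txt      # install the framework's Python dependencies

# 1. Check that the configured agent (e.g. gpt-3.5-turbo-0613 via
#    configs/agents/openai-chat.yaml) answers correctly before a full run.
python -m src.client.agent_test

# 2. Start the task workers (the Docker-backed environments) in one terminal.
python -m src.start_task -a

# 3. In another terminal, run the assigner to dispatch tasks to the agent and collect results.
python -m src.assigner
```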

App Details

Features
AgentBench includes eight benchmark environments with two dataset splits (Dev and Test), whose multi-turn interactions require roughly 4k and 13k model responses respectively. The suite bundles a modular framework architecture, configuration files for agents (for example configs/agents/openai-chat.yaml), and scripts to validate agents (src.client.agent_test), start task workers (src.start_task), and assign tasks to them (src.assigner). It provides Docker images and build instructions for the task environments and lists additional Docker images to download for the remaining tasks. The repo ships task resource guidance, an extension guide for adding new tasks, a leaderboard of Test results, and VisualAgentBench for evaluating 17 large multimodal models across embodied, GUI, and visual design settings. It also covers the dataset release, reproducible setup steps, and an academic citation.
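
To illustrate what an API-backed agent entry might look like, here is a hypothetical config written via a shell heredoc. The file name, module path, and every field are assumptions modeled loosely on the shipped configs/agents/openai-chat.yaml, not a copy of it; mirror the real file's structure when adapting.

```bash
# Hypothetical agent config; all field names below are assumptions, not the repo's actual file.
cat > configs/agents/my-openai-agent.yaml <<'EOF'
module: src.client.agents.HTTPAgent        # assumed agent class; check the shipped config
parameters:
  url: https://api.openai.com/v1/chat/completions
  headers:
    Authorization: Bearer sk-...           # placeholder; use real key handling as in the repo's config
  body:
    model: gpt-3.5-turbo-0613
    temperature: 0
EOF
```

Writing a separate file like this, rather than editing openai-chat.yaml in place, keeps the stock configuration intact while experimenting with other models or endpoints.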
Use Cases
This repository helps researchers and developers benchmark LLMs acting as agents by providing end-to-end tooling to run, reproduce, and extend multi-turn agent evaluations. Users can launch preconfigured tasks with Docker, configure API-backed agents, run workers and assigners, and collect standardized scores for comparison against the included leaderboard. The provided datasets and Dev/Test splits enable controlled experiments and ablations, and the extension guide and framework updates make it easier to add new tasks or plug in different models. VisualAgentBench support and trajectory data enable training and evaluation of visual foundation agents. The README also supplies resource and deployment notes, such as KnowledgeGraph service setup and per-task memory and startup profiles, to help plan experiments.
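
For the Docker-backed environments, that flow boils down to pulling (or building) the task images listed in the README and then starting workers against them. The image name below is a made-up placeholder, not a real image published by the project.

```bash
# "example/agentbench-task" is a hypothetical placeholder; substitute the image
# names that the README lists for each task environment.
docker pull example/agentbench-task
docker images | grep agentbench       # confirm the task images are present locally
python -m src.start_task -a           # then launch the task workers against them
```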
