AgentBench
Basic Information
AgentBench is a research benchmark and accompanying framework for evaluating large language models acting as autonomous agents across diverse simulated environments. It provides datasets, task servers, scripts, and configuration for running multi-turn agent interactions over eight environments: operating system, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. The repository offers Dev and Test splits, a leaderboard of model scores, documentation covering the framework structure and configuration, and quick-start instructions for running tasks with an OpenAI model such as gpt-3.5-turbo-0613. It also links to VisualAgentBench, an extended suite for visual foundation agents that additionally provides trajectory data for behavior cloning. The project is intended for reproducible evaluation, comparison, and extension of LLM-as-agent research, and it includes citation details for academic use.
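To make the multi-turn interaction pattern concrete, the sketch below shows the general shape of such an agent loop: the model receives a task description, proposes an action, and gets an observation back from a task environment until the episode ends. It is a minimal illustration under stated assumptions only; the EchoEnvironment class, its reset/step methods, and run_episode are hypothetical stand-ins and are not AgentBench's actual interfaces, configuration files, or entry points.

```python
# Hypothetical multi-turn agent loop illustrating the evaluation pattern;
# the environment class below is a toy stand-in, not AgentBench's API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class EchoEnvironment:
    """Toy stand-in for a task server (e.g., an OS or database environment)."""

    def reset(self) -> str:
        # Return the initial task instruction shown to the agent.
        return "List the files in the current directory and report the count."

    def step(self, action: str) -> tuple[str, bool]:
        # A real task server would execute the action and return a genuine
        # observation plus a done flag; here we just echo the action back.
        return f"(observation for action: {action!r})", True


def run_episode(model: str = "gpt-3.5-turbo-0613", max_turns: int = 5) -> None:
    env = EchoEnvironment()
    messages = [
        {"role": "system", "content": "You are an agent interacting with an environment."},
        {"role": "user", "content": env.reset()},
    ]
    for _ in range(max_turns):
        # The model proposes the next action given the dialogue so far.
        reply = client.chat.completions.create(model=model, messages=messages)
        action = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": action})

        # The environment responds with an observation; stop when the episode ends.
        observation, done = env.step(action)
        messages.append({"role": "user", "content": observation})
        if done:
            break


if __name__ == "__main__":
    run_episode()
```

In the actual framework, the environment side is provided by the per-task servers and configuration shipped with the repository, and episode outcomes are scored to produce the leaderboard numbers described above.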