GamingAgent
Basic Information
This repository provides tooling and benchmarks for building and testing LLM/VLM-based agents in standardized interactive video game environments. Models are evaluated in two main ways: single-model evaluation of vision-language models (VLMs) without a gaming harness, and agentic evaluation that uses a customized GamingAgent workflow (the gaming harness) to improve gaming performance.

The project includes a benchmark suite called lmgame-Bench, a leaderboard, an accompanying paper, and support for running agents locally as computer-use agents on PCs and laptops. Game interfaces are standardized through Gymnasium and Retro integrations, with guidance for adding custom games and configuring environments. The repository is organized to run parallel evaluations, reproduce experiments via notebooks, and generate replay videos from logged episodes. It targets researchers and developers who want to measure and improve how large models perform on classical and puzzle-style video games.
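To picture what a standardized Gymnasium-style game interface looks like, here is a minimal sketch of a hypothetical grid-puzzle environment. The class name MyPuzzleEnv, the observation/action spaces, and the reward logic are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of a Gymnasium-style game environment, assuming a simple
# grid puzzle. Class name, spaces, and game logic are illustrative only,
# not the repository's actual API.
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class MyPuzzleEnv(gym.Env):
    """Hypothetical puzzle game exposed through the standard Gymnasium interface."""

    def __init__(self, size: int = 4):
        super().__init__()
        self.size = size
        # Observation: the board rendered as an RGB frame a VLM can look at.
        self.observation_space = spaces.Box(
            low=0, high=255, shape=(size * 16, size * 16, 3), dtype=np.uint8
        )
        # Action: four directional moves (up, down, left, right).
        self.action_space = spaces.Discrete(4)
        self._board = np.zeros((size, size), dtype=np.int64)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._board[:] = 0
        return self._render_frame(), {}

    def step(self, action):
        # Apply the move, compute a score delta, and report termination.
        reward = float(self._apply_move(action))
        terminated = bool(self._board.max() >= 2048)  # example win condition
        truncated = False
        return self._render_frame(), reward, terminated, truncated, {}

    def _apply_move(self, action: int) -> int:
        # Placeholder game logic for the sketch; a real game would update
        # the board here and return the score gained by the move.
        return 0

    def _render_frame(self) -> np.ndarray:
        # Upscale each board cell into a 16x16 block of an RGB image.
        frame = np.kron(self._board, np.ones((16, 16), dtype=np.int64))
        return np.stack([frame.astype(np.uint8)] * 3, axis=-1)
```

An agent loop would then drive this environment in the usual Gymnasium pattern, calling reset() once and step() repeatedly, sending each rendered frame to the model and mapping its reply back to one of the discrete actions.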