Basic Information

This repository provides tooling and benchmarks for enabling and testing LLM/VLM-based agents in standardized interactive video game environments. It evaluates models in two main ways: direct vision-language model (VLM) evaluation without a gaming harness, and agentic evaluation using a customized GamingAgent workflow (the gaming harness) to improve gaming performance. The project includes a benchmark suite called lmgame-Bench, a leaderboard, an accompanying paper, and support for running agents locally as computer-use agents on PCs and laptops. It standardizes game interfaces through Gymnasium and Retro integrations and includes guidance for adding custom games and configuring environments. The repo is organized to run parallel evaluations, reproduce experiments via notebooks, and generate replay videos from logged episodes. It targets researchers and developers who want to measure and improve how large models perform on classical and puzzle-style video games.
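
The standardized game interface follows the usual Gymnasium reset/step contract. As a rough illustration of what a custom game wrapper can look like, here is a minimal, hypothetical sketch; the class name, grid encoding, and reward are invented for illustration and are not the repository's actual wrappers.

```python
# Hypothetical sketch of a custom game exposed through the standard Gymnasium API.
import gymnasium as gym
from gymnasium import spaces
import numpy as np


class ToyGridEnv(gym.Env):
    """A tiny grid-navigation environment exposing the standard reset/step interface."""

    metadata = {"render_modes": ["ansi"]}

    def __init__(self, size: int = 5):
        super().__init__()
        self.size = size
        # Observation: flattened grid of cell codes (0 = empty, 3 = agent).
        self.observation_space = spaces.Box(low=0, high=3, shape=(size * size,), dtype=np.int64)
        # Actions: 0 = up, 1 = down, 2 = left, 3 = right.
        self.action_space = spaces.Discrete(4)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._agent = np.array([0, 0])
        return self._observe(), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        self._agent = np.clip(self._agent + np.array(moves[action]), 0, self.size - 1)
        # Toy objective: reach the bottom-right corner.
        terminated = bool((self._agent == self.size - 1).all())
        reward = 1.0 if terminated else 0.0
        return self._observe(), reward, terminated, False, {}

    def _observe(self):
        grid = np.zeros((self.size, self.size), dtype=np.int64)
        grid[tuple(self._agent)] = 3
        return grid.flatten()


if __name__ == "__main__":
    env = ToyGridEnv()
    obs, info = env.reset(seed=0)
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```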

App Details

Features
- lmgame-Bench benchmark and evaluation scripts for launching parallel runs and comparing agentic versus vanilla (single-model) evaluations.
- Support for Gymnasium and Retro environments, with out-of-the-box games such as Sokoban, Tetris, 2048, Candy Crush, Super Mario Bros, and Ace Attorney, plus instructions for integrating ROMs and using pyboy for Game Boy emulation.
- Command-line tooling (run.py and single_agent_runner.py) with configurable harness_mode and parallelism options.
- Model API compatibility with OpenAI, Anthropic, Gemini, xAI, Deepseek, and Qwen.
- Configuration-driven agent settings in gamingagent/configs and modular prompts in module_prompts.json (a hedged sketch of this pattern follows the list).
- Utilities for generating replay videos from episode logs, plus an evaluation notebook and Colab integration for analysis and leaderboard comparisons.
- A computer_use directory with instructions for running agents locally as computer-use agents.
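
To make the configuration-driven pattern concrete, the following is a minimal, hypothetical sketch of how per-game prompts (as stored in a file like module_prompts.json), a harness_mode toggle, and parallel episode launches could fit together. The prompt keys, the query_model stub, and the episode loop are assumptions for illustration, not the repository's actual API.

```python
# Hypothetical sketch: config-driven prompts, a harness_mode flag, and parallel runs.
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the contents of a prompts config such as module_prompts.json (assumed keys).
PROMPTS = {"sokoban": {"system": "You control the player; push every box onto a target."}}


def query_model(prompt: str, observation: str) -> str:
    """Placeholder for a VLM/LLM API call; returns the chosen action as text."""
    return "NOOP"


def run_episode(game: str, harness_mode: bool, steps: int = 10) -> float:
    """Run one toy episode, optionally adding an extra 'harness' reflection prompt."""
    system_prompt = PROMPTS[game]["system"]
    observation, total_reward = "<initial state>", 0.0
    for _ in range(steps):
        prompt = system_prompt + ("\nReflect on prior moves before acting." if harness_mode else "")
        _action = query_model(prompt, observation)
        observation, reward = "<next state>", 0.0   # stand-in for env.step(action)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    # Mirror the parallel-run option by evaluating several episodes concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda h: run_episode("sokoban", h), [True, False, True, False]))
    print(results)
```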
Use Cases
The repository standardizes evaluation of multimodal LLM/VLM agents on a diverse set of games, enabling reproducible comparisons across models and configurations. It lets users run large-scale, parallelized experiments and compare single-model and agentic harness performance to quantify improvements from agent workflows. Config-driven environment and prompt files simplify adapting agents to new games and make it straightforward to add custom environments following Gymnasium or Stable Retro conventions. Built-in support for replay video generation and evaluation notebooks aids qualitative and quantitative analysis. The included model compatibility list and scripts reduce engineering overhead for testing popular commercial and research models, while the computer-use agent instructions allow live, local deployment for experimentation. The repo also links to a paper and leaderboard to contextualize results.
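
As a rough sketch of the replay-generation idea, the snippet below stitches logged per-step RGB frames into an animated file using imageio; this is a stand-in under assumed frame formats and file names, not the repository's own replay utility.

```python
# Minimal sketch: turn logged episode frames (RGB arrays) into an animated replay.
import numpy as np
import imageio.v2 as imageio  # assumes a recent imageio installation


def frames_to_replay(frames: list[np.ndarray], out_path: str = "replay.gif") -> None:
    """Stitch per-step RGB frames of shape (H, W, 3), dtype uint8, into one replay file."""
    imageio.mimsave(out_path, frames)


if __name__ == "__main__":
    # Fake a short episode: 20 frames of a bright square sliding down a dark board.
    frames = []
    for t in range(20):
        frame = np.zeros((64, 64, 3), dtype=np.uint8)
        frame[4 + 2 * t : 12 + 2 * t, 10:18] = 255
        frames.append(frame)
    frames_to_replay(frames)
```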
