
Basic Information

AppAgent is an open-source research framework that turns large multimodal language models into agents capable of operating smartphone applications. It provides a pipeline for controlling Android apps through a simplified, human-like action space (taps, swipes, and similar gestures) without requiring back-end access to the target apps. The repository implements a two-phase method: an exploration phase, in which the agent autonomously explores or learns from human demonstrations to build a documentation base of UI elements, and a deployment phase, in which the agent uses that documentation to complete user-specified tasks. The project includes Python scripts for learning and running agents, configuration via a YAML file covering model choice and request settings, and support for real devices or Android emulators connected through adb. The codebase and benchmark were released alongside a CHI paper and are provided under the MIT license.
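
As a concrete illustration of the simplified action space, the sketch below drives taps and swipes through the standard adb shell input commands. It is a minimal example assuming a single device or emulator is connected and adb is on the PATH; the function names and coordinates are illustrative, not the repository's actual controller API.

import subprocess

def adb_shell(*args: str) -> None:
    # Run an `adb shell` command on the connected device or emulator.
    subprocess.run(["adb", "shell", *args], check=True)

def tap(x: int, y: int) -> None:
    # Tap at absolute screen coordinates (x, y).
    adb_shell("input", "tap", str(x), str(y))

def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    # Swipe from (x1, y1) to (x2, y2) over duration_ms milliseconds.
    adb_shell("input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

if __name__ == "__main__":
    tap(540, 960)               # tap near the centre of a 1080x1920 screen
    swipe(540, 1500, 540, 500)  # swipe upward to scroll a list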

App Details

Features
Multimodal LLM integration, with example support for gpt-4-vision-preview and the alternative qwen-vl-max model.
Two exploration modes: fully autonomous exploration, and learning from human demonstrations that label interactive UI elements.
Automated documentation generation that records elements and interactions during exploration for reuse in deployment.
A simplified action interface focused on taps and swipes, with an optional grid overlay for targeting unlabeled UI elements.
Device connectivity through the Android Debug Bridge (adb), supporting real phones as well as Android Studio emulators.
Configurable model and request parameters via config.yaml, plus a model plugin point in scripts/model.py (see the sketch after this list).
Demo videos, an evaluation benchmark, and installation via pip requirements.
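
The sketch below shows one plausible way such a YAML configuration could be read and applied as a request throttle. The key names (MODEL, REQUEST_INTERVAL) and default values are assumptions for illustration; consult the repository's config.yaml for its actual schema.

import time
import yaml  # PyYAML

def load_config(path: str = "config.yaml") -> dict:
    # Parse the YAML configuration file into a plain dictionary.
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
model_name = config.get("MODEL", "gpt-4-vision-preview")  # which multimodal model to call
request_interval = config.get("REQUEST_INTERVAL", 10)     # seconds to wait between API requests

def throttled_call(send_request, *args, **kwargs):
    # Send one model request, then pause so successive requests respect the configured interval.
    response = send_request(*args, **kwargs)
    time.sleep(request_interval)
    return response
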
Use Cases
AppAgent helps researchers and developers prototype and evaluate LLM-based GUI agents that can interact with Android applications without modifying the apps. By generating a reusable documentation base during exploration, the agent can perform complex, multi-step tasks in deployment with fewer manual rules. The human-demonstration option enables quick bootstrapping of app knowledge when autonomous exploration is impractical. Configurability of models and request intervals lets users balance performance and API cost, and support for emulators lowers the barrier to experimentation. The repository also provides benchmark data and example workflows to reproduce experiments from the accompanying CHI paper, making it useful for reproducible research and further development of smartphone-operating agents.
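
To make the exploration-to-deployment handoff concrete, the sketch below stores per-element notes gathered during exploration as JSON and reloads them at deployment time. The directory name, file layout, and field structure are assumptions for illustration rather than the repository's actual documentation format.

import json
from pathlib import Path

DOC_DIR = Path("app_docs")  # hypothetical location for per-app documentation files

def save_element_doc(app: str, element_id: str, description: str) -> None:
    # Record (or update) what a UI element does, as observed during exploration.
    path = DOC_DIR / f"{app}.json"
    docs = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    docs[element_id] = description
    DOC_DIR.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(docs, indent=2), encoding="utf-8")

def load_element_docs(app: str) -> dict:
    # Reload the documentation base so the deployment phase can consult it before acting.
    path = DOC_DIR / f"{app}.json"
    return json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}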
