Basic Information

ScreenAgent provides an environment and codebase to build, run, and train Visual Language Model (VLM) agents that observe computer screenshots and control desktop GUIs by issuing mouse and keyboard actions. The repository bundles a PyQt5 controller that connects to a VNC server, collects screenshots, sends formatted prompts to a VLM inferencer, parses returned action sequences, and executes them on the remote desktop. It implements an iterative planning-execution-reflection loop to handle multi-step tasks. The project also includes data and processing code for multiple datasets, model worker interfaces for several VLMs, training scripts to fine-tune a ScreenAgent model, and configuration and prompt templates. Setup instructions cover preparing a VNC-enabled desktop or Docker container, a clipboard service for long text input, selecting or running an inferencer, and launching the controller.
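
To make that loop concrete, the sketch below shows one way a controller might chain planning, execution, and reflection around a VLM call. It is an assumption-laden illustration: the vnc and inferencer objects, the parse_actions helper, and the prompt wording are hypothetical stand-ins, not the repository's actual API.

# Hypothetical sketch of a planning-execution-reflection loop; the vnc and
# inferencer objects and the prompt text are illustrative, not ScreenAgent's API.
import json
import time


def parse_actions(raw: str) -> list[dict]:
    """Parse the model's reply as a JSON list of action dicts (illustrative format)."""
    try:
        actions = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return actions if isinstance(actions, list) else []


def run_task(task: str, vnc, inferencer, max_rounds: int = 10) -> None:
    """Drive one task through repeated plan -> execute -> reflect rounds."""
    for _ in range(max_rounds):
        screenshot = vnc.capture_screenshot()  # observe the current desktop

        # Planning: ask the VLM for a sub-task plan given the task and screenshot.
        plan = inferencer.infer(screenshot, prompt=f"Plan the next steps for: {task}")

        # Execution: ask for concrete actions, parse them, and replay them over VNC.
        raw = inferencer.infer(screenshot, prompt=f"Actions for plan: {plan}")
        for action in parse_actions(raw):
            vnc.execute(action)  # e.g. a mouse click or key press
            time.sleep(0.5)      # give the GUI time to settle

        # Reflection: ask whether the task looks finished on the new screenshot.
        after = vnc.capture_screenshot()
        verdict = inferencer.infer(after, prompt=f"Is the task '{task}' complete? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            break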

App Details

Features
- A complete controller implementing a planning-execution-reflection control loop for continuous GUI interaction.
- A VNC-based action space supporting basic mouse and keyboard operations with coordinate-level commands (see the sketch after this list).
- Ready-to-run controller client code using PyQt5, with prompt templates and a configuration file.
- Model worker adapters for GPT-4V, LLaVA-1.5, CogAgent, and ScreenAgent, plus examples showing how to add new inferencers.
- The ScreenAgent dataset, plus preprocessing for COCO, Rico/widget-caption, and Mind2Web for visual grounding and web browsing tasks.
- Training scripts, dataset mixture configuration, and utilities to fine-tune and merge model weights.
- Docker image recommendations and a clipboard server for reliable long-text keyboard input.
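
To illustrate what coordinate-level commands can look like, the sketch below models mouse and keyboard actions as small dataclasses and replays them through a VNC client. The dataclasses and the click/type_text methods are assumptions for illustration, not ScreenAgent's actual classes or method names.

# Illustrative coordinate-level actions; the dataclasses and the VNC client
# methods used here are assumptions, not ScreenAgent's actual interface.
from dataclasses import dataclass


@dataclass
class MouseAction:
    x: int                 # pixel coordinates on the remote screen
    y: int
    button: str = "left"
    double: bool = False


@dataclass
class KeyboardAction:
    text: str              # text to type, or a named key such as "Return"


def execute(action, vnc_client) -> None:
    """Replay one parsed action over a VNC connection (hypothetical client methods)."""
    if isinstance(action, MouseAction):
        clicks = 2 if action.double else 1
        for _ in range(clicks):
            vnc_client.click(action.x, action.y, button=action.button)
    elif isinstance(action, KeyboardAction):
        vnc_client.type_text(action.text)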
Use Cases
ScreenAgent helps researchers and developers prototype and evaluate VLM-driven desktop agents by providing an integrated stack from data to inference to execution. It removes boilerplate around VNC integration, screenshot collection, prompt construction, and action parsing, so teams can focus on model behavior and task design. The included ScreenAgent dataset and dataset processing tools accelerate training for visual grounding and GUI interaction. Multi-model inferencer examples let users compare GPT-4V and open models or deploy custom workers. Training scripts and configuration files enable reproducible fine-tuning and weight merging. The controller UI and automation loop support running and debugging multi-step tasks, and the repository documents setup for desktop or container environments and clipboard handling so practical experiments can be run end to end.
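
For teams deploying custom workers, an adapter typically just takes a screenshot and a prompt and returns the model's text reply. The sketch below shows one possible shape for such an adapter; the Inferencer base class, the HTTP endpoint, and the payload fields are placeholders, not the repository's actual worker interface.

# Minimal inferencer adapter sketch; the base class, endpoint URL, and payload
# fields are placeholders, not ScreenAgent's actual worker API.
import base64
from abc import ABC, abstractmethod

import requests


class Inferencer(ABC):
    @abstractmethod
    def infer(self, screenshot_png: bytes, prompt: str) -> str:
        """Return the model's raw text reply for one screenshot and prompt."""


class HTTPWorkerInferencer(Inferencer):
    """Calls a locally hosted model worker over HTTP (hypothetical endpoint)."""

    def __init__(self, url: str = "http://localhost:8000/generate"):
        self.url = url

    def infer(self, screenshot_png: bytes, prompt: str) -> str:
        payload = {
            "image": base64.b64encode(screenshot_png).decode("ascii"),
            "prompt": prompt,
        }
        response = requests.post(self.url, json=payload, timeout=120)
        response.raise_for_status()
        return response.json()["text"]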
