ScreenAgent
Basic Information
ScreenAgent provides an environment and codebase to build, run, and train Visual Language Model (VLM) agents that observe computer screenshots and control desktop GUIs by issuing mouse and keyboard actions. The repository bundles a PyQt5 controller that connects to a VNC server, captures screenshots, sends formatted prompts to a VLM inferencer, parses the returned action sequences, and executes them on the remote desktop. The controller runs an iterative planning-execution-reflection loop to handle multi-step tasks. The project also includes data and processing code for multiple datasets, model worker interfaces for several VLMs, training scripts for fine-tuning a ScreenAgent model, and configuration files and prompt templates. Setup instructions cover preparing a VNC-enabled desktop or Docker container, setting up a clipboard service for long text input, selecting or running an inferencer, and launching the controller.
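The planning-execution-reflection loop described above can be illustrated with a minimal sketch. This is not the repository's actual code: the names `Action`, `parse_actions`, and `run_task`, the JSON action format, and the "done" reflection verdict are all hypothetical, chosen only to show the control flow. The screenshot source, the VLM call, and the action executor are passed in as plain callables so the sketch runs without a VNC server or a model.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Action:
    """One GUI action. The field names are assumptions for illustration."""
    kind: str                                   # e.g. "click" or "type"
    args: Dict[str, object] = field(default_factory=dict)


def parse_actions(reply: str) -> List[Action]:
    """Parse a model reply, assumed here to be a JSON list of actions."""
    return [Action(item["kind"], item.get("args", {})) for item in json.loads(reply)]


def run_task(
    task: str,
    capture: Callable[[], bytes],               # returns a screenshot
    infer: Callable[[str, str, bytes], str],    # (phase, task, screenshot) -> reply
    execute: Callable[[Action], None],          # performs one action on the desktop
    max_rounds: int = 5,
) -> List[Action]:
    """Repeat plan -> execute -> reflect until the model reports "done"."""
    executed: List[Action] = []
    for _ in range(max_rounds):
        # Planning: ask the model for actions given the current screen.
        reply = infer("plan", task, capture())
        # Execution: run each parsed action in order.
        for action in parse_actions(reply):
            execute(action)
            executed.append(action)
        # Reflection: show the model the new screen and ask whether to stop.
        verdict = infer("reflect", task, capture())
        if verdict.strip().lower() == "done":
            break
    return executed


# Stand-ins for the real screenshot source, inferencer, and executor.
def fake_capture() -> bytes:
    return b""


def fake_infer(phase: str, task: str, screenshot: bytes) -> str:
    if phase == "plan":
        return '[{"kind": "click", "args": {"x": 10, "y": 20}}]'
    return "done"


performed: List[Action] = []
log = run_task("open the text editor", fake_capture, fake_infer, performed.append)
```

In a real setup, `capture` and `execute` would be backed by the VNC connection and `infer` by the chosen VLM inferencer. The `max_rounds` cap is a design choice of this sketch that keeps the loop from running forever if the model never reports completion.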