GPT-4V-Act
Basic Information
GPT-4V-Act is a multimodal Chromium copilot that combines GPT-4V(ision) with a web browser to mirror a human operator's input (the screen) and output (low-level mouse and keyboard actions). The project aims to bridge human-computer interaction by interpreting screenshots and task prompts to determine precise UI actions, enabling accessibility enhancements, workflow automation, and automated UI testing. It uses a JS DOM auto-labeler to assign unique numeric IDs to interactable elements, so the model can refer to elements by ID rather than by exact pixel coordinates. The design emphasizes structured action outputs (click, type, scroll, request-info, remember-info, done) so downstream systems can execute the corresponding UI gestures. The repository includes a demo UI and instructions for running locally with npm, so developers can experiment with the agent's behavior and extend its capabilities.
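As a rough illustration of what such structured action outputs could look like, here is a minimal TypeScript sketch assuming a JSON-style contract between the model and a browser-side executor. The `AgentAction` type and `dispatchAction` helper are hypothetical names, not taken from the repository; only the action vocabulary (click, type, scroll, request-info, remember-info, done) and the use of auto-labeler element IDs come from the description above.

```typescript
// Hypothetical discriminated union covering the action vocabulary listed above.
// Element references use the numeric IDs assigned by the DOM auto-labeler,
// so no pixel coordinates are required.
type AgentAction =
  | { type: "click"; elementId: number }
  | { type: "type"; elementId: number; text: string }
  | { type: "scroll"; direction: "up" | "down" }
  | { type: "request-info"; question: string } // ask the user for missing details
  | { type: "remember-info"; info: string }    // stash information for later steps
  | { type: "done" };

// Hypothetical dispatcher: a downstream system would translate each action
// into a real mouse/keyboard gesture against the labeled element.
function dispatchAction(action: AgentAction): void {
  switch (action.type) {
    case "click":
      console.log(`click element #${action.elementId}`);
      break;
    case "type":
      console.log(`type "${action.text}" into element #${action.elementId}`);
      break;
    case "scroll":
      console.log(`scroll ${action.direction}`);
      break;
    case "request-info":
      console.log(`ask user: ${action.question}`);
      break;
    case "remember-info":
      console.log(`remember: ${action.info}`);
      break;
    case "done":
      console.log("task complete");
      break;
  }
}

// Example: an action the model might emit after reading a labeled screenshot.
dispatchAction({ type: "type", elementId: 12, text: "weather in Berlin" });
```

Modeling the output as a discriminated union like this keeps the executor simple: it only needs a switch over the action type, and the auto-labeler IDs decouple the model's decisions from screen resolution and layout.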