Basic Information

GPT-4V-Act is a multimodal Chromium copilot that combines GPT-4V(ision) with a web browser to mirror a human operator's inputs (viewing the screen) and outputs (low-level mouse and keyboard actions). The project aims to bridge human-computer interaction by interpreting screenshots together with task prompts to determine precise UI actions, enabling accessibility enhancements, workflow automation, and automated UI testing. A JS DOM auto-labeler assigns numeric IDs to interactable elements so the model can refer to elements by label rather than raw pixel coordinates; the system then resolves each label to exact on-screen coordinates when executing an action. The design emphasizes structured action outputs (click, type, scroll, request-info, remember-info, done) so downstream systems can execute the corresponding UI gestures. The repository includes a demo UI and instructions to run locally with npm so developers can experiment with the agent's behavior and extend its capabilities.
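To make the labeling idea concrete, here is a minimal TypeScript sketch of how a numeric label painted onto the screenshot might be resolved to coordinates and dispatched as a click. The types and the execute function are illustrative assumptions, not code from the repository.

```typescript
// Illustrative sketch (not the project's actual code): how a numeric label
// assigned by the DOM auto-labeler could be resolved to page coordinates
// and dispatched as a low-level click by the embedding browser shell.

interface LabeledElement {
  id: number;   // numeric ID painted onto the screenshot
  x: number;    // center of the element, in CSS pixels
  y: number;
  tag: string;  // e.g. "button", "input"
}

type UiAction =
  | { action: "click"; element: number }
  | { action: "type"; element: number; text: string }
  | { action: "done" };

// Hypothetical executor: translates a model-chosen label into a mouse event.
function execute(action: UiAction, labels: Map<number, LabeledElement>): void {
  if (action.action === "click") {
    const el = labels.get(action.element);
    if (!el) throw new Error(`unknown label ${action.element}`);
    // In the real project this dispatch goes through the Chromium automation
    // layer; here we only log the resolved coordinates.
    console.log(`click at (${el.x}, ${el.y}) on <${el.tag}> #${el.id}`);
  }
}
```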

App Details

Features
The README documents several core features alongside current limitations. Highlights include the JS DOM auto-labeler, integration of GPT-4V(ision) with Set-of-Mark Prompting, and support for clicking and typing character sequences. The project defines a JSON response schema for agent actions and provides example demonstration prompts showing how the assistant returns a brief explanation together with a nextAction object (see the sketch below). Capabilities marked as partial are limited vision features, basic typing of letters and numbers, and the auto-labeler's COCO export. Features explicitly called out as missing are an AI auto-labeler, special keycode typing, scrolling, user prompting, and memory for task-relevant information. The README also includes quick start steps: clone the repository, run npm install, then npm start to launch the demo.
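The sketch below restates the documented response shape in TypeScript: a brief explanation plus a nextAction object. The exact field names and the full action union are assumptions inferred from the README's description rather than a verbatim copy of its schema.

```typescript
// Assumed shape of an agent reply, based on the README's description of a
// brief explanation plus a nextAction object.

type NextAction =
  | { action: "click"; element: number }
  | { action: "type"; element: number; text: string }
  | { action: "scroll"; direction: "up" | "down" }
  | { action: "request-info"; prompt: string }
  | { action: "remember-info"; info: string }
  | { action: "done" };

interface AgentResponse {
  briefExplanation: string;
  nextAction: NextAction;
}

// Example of what a parsed model reply might look like for a search task.
const example: AgentResponse = {
  briefExplanation: "The search box is labeled 7; type the query there.",
  nextAction: { action: "type", element: 7, text: "wireless headphones" },
};
console.log(JSON.stringify(example, null, 2));
```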
Use Cases
This repository is a practical starting point for developers and researchers building UI-aware agents that operate a browser the way a human operator would. It shows how to pair visual input (screenshots) with instruction prompts to produce structured UI actions, which can accelerate the creation of accessibility tools, automated UI workflows, and end-to-end UI test scripts. The numeric auto-labeling approach enables precise targeting of elements for clicks and typing. The provided demo, example prompts, and response schema serve as a template for prototyping agent behavior and integrating it into automation pipelines. The README also lists current gaps, so contributors can prioritize AI auto-labeling, scrolling, special keycodes, and task memory for richer interactions.
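As one way to picture that integration, the following hypothetical driver loop wires the screenshot-prompt-action cycle into an automation pipeline. captureScreenshot, askModel, and performAction are placeholder callbacks for whatever browser and model plumbing a real integration would use; they are not functions exposed by the project.

```typescript
// Hypothetical driver loop: capture a labeled screenshot, ask the model for
// the next action, execute it, and stop when the model reports "done".

interface StepResult {
  briefExplanation: string;
  nextAction: { action: string } & Record<string, unknown>;
}

async function runTask(
  task: string,
  captureScreenshot: () => Promise<string>,            // base64 screenshot with numeric labels
  askModel: (task: string, screenshot: string) => Promise<StepResult>,
  performAction: (action: StepResult["nextAction"]) => Promise<void>,
  maxSteps = 20,
): Promise<void> {
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await captureScreenshot();
    const { briefExplanation, nextAction } = await askModel(task, screenshot);
    console.log(`[step ${step}] ${briefExplanation}`);
    if (nextAction.action === "done") return;  // task finished
    await performAction(nextAction);           // click / type / scroll, etc.
  }
}
```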
