SeeAct
Basic Information
SeeAct is a research codebase and framework for building and running generalist web agents that autonomously carry out tasks on live websites using large multimodal models (LMMs). It provides a Playwright-based tool that connects an LMM-driven agent to a real browser: the tool translates model-predicted actions into browser events and feeds browser observations back to the model. The repository bundles a Python package, example scripts and configuration files, demo and auto modes for running tasks, a crawler mode for exploration, and utilities for generating webpage screenshots and annotations. The project also publishes a multimodal dataset aligned with web HTML and screenshots, intended for training and evaluation. It emphasizes safe, human-in-the-loop monitoring and configurable grounding strategies for element selection. The code and data are released under Open RAIL licenses, and the project includes an open-source Chrome extension and references for reproducing the published experiments.
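The step of translating a model-predicted action into a browser event can be illustrated with a minimal sketch. This is not SeeAct's actual code: the `Action` schema, the `execute` function, and the `RecordingPage` stand-in are hypothetical names introduced here for illustration. The only assumption about Playwright is that its `Page` object exposes `click`, `fill`, `select_option`, and `hover`, each taking a selector.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Action:
    """A model-predicted action (hypothetical schema, not SeeAct's own)."""
    op: str                      # "CLICK", "TYPE", "SELECT", or "HOVER"
    selector: str                # CSS selector of the grounded element
    value: Optional[str] = None  # text to type or option to select


def execute(page: Any, action: Action) -> None:
    """Translate one predicted action into a call on a Playwright-style page."""
    op = action.op.upper()
    if op == "CLICK":
        page.click(action.selector)
    elif op == "TYPE":
        if action.value is None:
            raise ValueError("TYPE requires a value")
        page.fill(action.selector, action.value)
    elif op == "SELECT":
        if action.value is None:
            raise ValueError("SELECT requires a value")
        page.select_option(action.selector, action.value)
    elif op == "HOVER":
        page.hover(action.selector)
    else:
        raise ValueError(f"unsupported action: {action.op}")


class RecordingPage:
    """Stand-in for a Playwright Page that records calls instead of driving a browser."""

    def __init__(self) -> None:
        self.calls = []

    def click(self, selector: str) -> None:
        self.calls.append(("click", selector))

    def fill(self, selector: str, value: str) -> None:
        self.calls.append(("fill", selector, value))

    def select_option(self, selector: str, value: str) -> None:
        self.calls.append(("select_option", selector, value))

    def hover(self, selector: str) -> None:
        self.calls.append(("hover", selector))
```

With Playwright installed, the same `execute` function could presumably be passed a real page, for example one obtained from `sync_playwright()` via `browser.new_page()`. The recording stand-in simply makes the dispatch logic easy to inspect without launching a browser.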