Basic Information

SeeAct is a research codebase and framework for building and running generalist web agents that autonomously carry out tasks on live websites using large multimodal models (LMMs). It provides a Playwright-based tool that connects an LMM-driven agent to a real browser, translating model-predicted actions into browser events and piping browser observations back to the model. The repository bundles a Python package, example scripts and configuration files, demo and auto modes for running tasks, a crawler mode for exploration, and utilities for generating webpage screenshots and annotations. It also publishes the Multimodal-Mind2Web dataset, which aligns web HTML with rendered screenshots for training and evaluation. The project emphasizes human-in-the-loop monitoring for safety and configurable grounding strategies for element selection. The code and data are released under Open RAIL licenses, and the project includes an open-sourced Chrome extension and references for reproducing published experiments.
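The loop below is a minimal sketch of that agent-browser interaction using the packaged Python API. The import path, the method names (start, predict, execute, stop), and the complete_flag attribute follow the repository's example usage but are treated as assumptions here and may differ between releases.

```python
# Minimal sketch of the SeeAct agent-browser loop.
# Assumed API surface: exact class/method names may vary by release.
import asyncio
from seeact.agent import SeeActAgent  # assumed import path

async def run_task():
    agent = SeeActAgent(
        model="gpt-4o",                                  # multimodal backbone
        default_task="Find the documentation page for Playwright locators",
        default_website="https://www.google.com/",
    )
    await agent.start()                  # launch the Playwright-controlled browser
    while not agent.complete_flag:       # loop until the agent signals task completion
        prediction = await agent.predict()   # LMM proposes the next action
        await agent.execute(prediction)      # translate it into a browser event
    await agent.stop()                   # close the browser session

if __name__ == "__main__":
    asyncio.run(run_task())
```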

App Details

Features
- A Playwright-based SeeAct tool that mediates between an LMM agent and a live browser to execute click, typing, and selection actions.
- A SeeActAgent Python API with configurable inputs such as model, grounding strategy, default task and website, temperature, crawler settings, and save directory.
- Demo, auto, and crawler modes for interactive trials, batch evaluation, and randomized web exploration.
- Support for multiple multimodal models, including OpenAI GPT-4V, GPT-4-turbo, and GPT-4o, Google Gemini, and Ollama LLaVA.
- TOML configuration files for reproducible runs and experiment parameters (see the sketch after this list).
- Utilities for screenshot generation and overlay annotation to produce multimodal data.
- Human monitoring and intervention options before action execution to mitigate safety risks.
- The published Multimodal-Mind2Web dataset, aligning HTML with screenshots for training and evaluation.
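The following TOML fragment is an illustrative sketch of such a run configuration. The section and key names are assumptions for this example; the configuration files packaged in the repository remain the authoritative schema.

```toml
# Illustrative SeeAct run configuration (key names are assumptions for this sketch).
[basic]
save_file_dir = "../online_results"          # where screenshots and logs are written
default_task = "Find the documentation page for Playwright locators"
default_website = "https://www.google.com/"

[model]
model = "gpt-4o"                             # multimodal backbone to query
temperature = 0.0

[experiment]
grounding_strategy = "text_choice"           # element-selection (grounding) method
monitor = true                               # require human confirmation before each action
```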
Use Cases
SeeAct helps researchers and developers evaluate and prototype web agents that leverage large multimodal models to perform tasks on real websites. It lowers integration overhead by providing a Playwright interface, ready-to-use agent loop code, configuration-driven experiments, and packaged demo scripts to reproduce online evaluations. The included Multimodal-Mind2Web dataset and screenshot generation tools simplify multimodal training and inference by aligning HTML with rendered images. Built-in modes allow interactive debugging, batch benchmarking, and crawler-driven data collection. Model compatibility and configurable grounding strategies let users test different LMMs and element-selection approaches. Safety and monitoring features support human oversight during development and experiments. The repository and open-sourced extension further enable replication of published results and adaptation to new web tasks.
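As a simple illustration of the screenshot-plus-HTML alignment mentioned above, the sketch below uses Playwright's Python API directly (it is not the repository's own utility) to capture a page's HTML and a full-page screenshot together.

```python
# Illustrative only: capture a page's HTML alongside a rendered screenshot,
# the two modalities that the dataset and tooling align.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page(viewport={"width": 1280, "height": 720})
    page.goto("https://example.com", wait_until="networkidle")

    # Raw HTML observation of the rendered page.
    with open("example.html", "w", encoding="utf-8") as f:
        f.write(page.content())

    # Corresponding image observation.
    page.screenshot(path="example.png", full_page=True)

    browser.close()
```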
