
Basic Information

Mind2Web provides the dataset, code, and pretrained models used to develop and evaluate generalist web agents that follow language instructions to complete complex tasks on real websites. It is a research-focused repository that publishes a large, diverse benchmark of human-annotated tasks and action traces collected from real webpages, along with raw Playwright traces, HAR network captures, video recordings, DOM snapshots, and screenshots to support replay and analysis. The repo includes candidate generation and action prediction code, evaluation and fine-tuning scripts, example prompts for LLM-based action selection, and pointers to trained models on Hugging Face. It is intended for researchers and developers who want to train, fine-tune, or benchmark models that perform web interactions, experiment with LLM or seq2seq approaches to element selection and action prediction, and reproduce the results reported in the paper.
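For a quick first look at the data, the training split can be pulled straight from the Hugging Face Hub with the `datasets` library. The following is a minimal sketch, assuming the dataset is published under the ID `osunlp/Mind2Web` and that fields such as `confirmed_task`, `website`, and `actions` are present; check the dataset card linked from the repository for the exact identifiers.

```python
# Minimal sketch: load the Mind2Web training split from the Hugging Face Hub.
# The dataset ID and field names below are assumptions; verify them against
# the dataset card linked from the repository.
from datasets import load_dataset

train = load_dataset("osunlp/Mind2Web", split="train")

sample = train[0]
print(sample["confirmed_task"])              # natural-language task instruction
print(sample["website"], sample["domain"])   # source website and domain
print(len(sample["actions"]))                # number of recorded steps in the trace
```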

App Details

Features
The repository contains a multi-part dataset with over 2,000 open-ended tasks collected from 137 websites across 31 domains, organized into a train split and three test splits (cross-task, cross-website, cross-domain). Each task comes with detailed data fields, including raw and cleaned HTML, action sequences, positive and negative element candidates, and unique identifiers for actions and elements. A raw dump adds Playwright trace files, network HARs, per-page videos, DOM snapshot files, and MHTML snapshots. The codebase offers candidate generation implemented with a DeBERTa cross-encoder, action prediction implemented with T5-based seq2seq models, Hydra configuration files for training and evaluation, example LLM prompts, evaluation scripts that output predictions and metrics, and an inspector notebook for exploring traces. Licensing: the dataset is released under CC BY 4.0 and the code under MIT.
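To show how the per-task fields fit together, the sketch below walks one task's action sequence and prints the operation and the ground-truth target element for each step. The field names used here (`actions`, `operation`, `pos_candidates`, `action_uid`, `backend_node_id`) follow the dataset documentation but are assumptions in this sketch; confirm them against an actual record before relying on them.

```python
# Hedged sketch of iterating over one task's action trace; field names are
# assumed from the dataset documentation and should be confirmed on real data.
def summarize_task(task: dict) -> None:
    """Print one line per action: action id, operation, typed value, target node."""
    print(task["confirmed_task"])
    for action in task["actions"]:
        op = action["operation"]               # e.g. {"op": "CLICK", "value": ""}
        positives = action["pos_candidates"]   # ground-truth target element(s)
        target = positives[0]["backend_node_id"] if positives else "?"
        print(f'  {action["action_uid"]}: {op["op"]} "{op.get("value", "")}" -> node {target}')
```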
Use Cases
Mind2Web is useful for building and assessing web automation and agent systems because it exposes realistic, diverse interaction patterns on real websites rather than in simulated environments, enabling evaluation of generalization across tasks, unseen websites, and unseen domains. Researchers can use the provided candidate generation and action prediction modules and pretrained checkpoints to reproduce baseline results, fine-tune models on the training split, and evaluate performance on the held-out test splits. The raw traces, HAR files, DOM snapshots, and videos support replay, debugging, and multimodal experiments. The repo also includes example prompts and an LLM evaluation pathway to compare traditional models with instruction-following LLMs. Overall, it lowers the effort required to benchmark and iterate on models that must locate page elements and execute multi-step web tasks in realistic settings.
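As a rough illustration of the two-stage setup the baselines use, the sketch below shortlists candidate elements with a cross-encoder and then asks a seq2seq model to choose the next action from the shortlist. The checkpoint names are generic placeholders rather than the repo's released models, and the prompt format is invented for illustration; see the repository's Hugging Face pointers and example prompts for the real ones.

```python
# Hedged sketch of the two-stage pipeline: candidate ranking with a
# cross-encoder, then action prediction with a seq2seq model.
# Checkpoint names are placeholders, not the repo's released models.
import torch
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForSeq2SeqLM, AutoTokenizer)

rank_name = "microsoft/deberta-v3-base"   # placeholder; would be fine-tuned for ranking
act_name = "google/flan-t5-base"          # placeholder; would be fine-tuned for actions

rank_tok = AutoTokenizer.from_pretrained(rank_name)
ranker = AutoModelForSequenceClassification.from_pretrained(rank_name, num_labels=1)
act_tok = AutoTokenizer.from_pretrained(act_name)
actor = AutoModelForSeq2SeqLM.from_pretrained(act_name)

def rank_candidates(task: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score each (task, candidate-element) pair and keep the top_k elements."""
    enc = rank_tok([task] * len(candidates), candidates,
                   padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = ranker(**enc).logits.squeeze(-1)
    order = torch.argsort(scores, descending=True)[:top_k].tolist()
    return [candidates[i] for i in order]

def predict_action(task: str, shortlisted: list[str]) -> str:
    """Ask the seq2seq model to pick an element and operation from the shortlist."""
    prompt = f"Task: {task}\nCandidates:\n" + "\n".join(shortlisted) + "\nNext action:"
    ids = act_tok(prompt, return_tensors="pt").input_ids
    out = actor.generate(ids, max_new_tokens=32)
    return act_tok.decode(out[0], skip_special_tokens=True)
```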
