Basic Information

Bananalyzer is an open-source evaluation framework and dataset for testing AI agents on web tasks using Playwright. It provides a CLI that runs structured evaluation examples defined in an examples.json file and serves static historical snapshots of pages as MHTML, so tests remain reproducible despite site changes, latency, or anti-bot protections. Users plug in their agent by implementing an AgentRunner interface and exposing an agent instance, and the tool dynamically constructs pytest test suites to execute the examples. The repo includes utilities and a notebook for capturing pages, a basic FastAPI server that exposes example data and API docs, a schema inspired by existing web datasets, and a roadmap to add multi-step interactions and to translate other web evaluation datasets.
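The plug-in contract described above can be sketched roughly as follows. Note this is an illustration only: the method name `run`, its signature, and the `EchoAgent` class are assumptions for this sketch, not the project's exact interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AgentRunner(ABC):
    """Contract a user agent implements so the harness can drive it.

    Sketch only: the real interface is defined in the bananalyzer package.
    """

    @abstractmethod
    async def run(self, url: str, goal: str) -> Dict[str, Any]:
        """Visit the (locally served) snapshot and return structured JSON."""


class EchoAgent(AgentRunner):
    """Trivial agent used here only to show the shape of the contract."""

    async def run(self, url: str, goal: str) -> Dict[str, Any]:
        # A real agent would drive a browser here; this one just echoes.
        return {"url": url, "goal": goal}


# A module-level instance like this is what the CLI would be pointed at.
agent = EchoAgent()
```

The key design point is that the evaluation harness owns the test loop and the snapshot server; the user supplies only the agent object, so any agent architecture can be benchmarked behind the same interface.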

App Details

Features
Key features include a Playwright-based mechanism that serves local MHTML snapshots for reproducible page state, an examples.json dataset schema with ExampleType enums for listing, detail, and listing_detail tests, and a CLI wrapper that converts examples into pytest tests. The project defines an AgentRunner interface for user agents, supports filtering tests by intent, category, and id, and provides options such as headless mode and worker count. It also ships a FastAPI server for browsing example data, a notebook used to capture examples, utilities for MHTML line-ending conversion, and a documented roadmap covering pagination, multi-step navigation, and dataset expansion.
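A dataset entry built on that schema might look like the sketch below. The ExampleType values mirror the three test categories named above, but the other field names (`id`, `category`, `intent`, `expected`) are hypothetical and chosen only to illustrate the idea of pairing an intent with expected structured output.

```python
import json
from enum import Enum


class ExampleType(str, Enum):
    # Enum values mirror the three test categories named in the docs.
    LISTING = "listing"
    DETAIL = "detail"
    LISTING_DETAIL = "listing_detail"


# Hypothetical examples.json entry; field names are illustrative,
# not the project's exact schema.
example = {
    "id": "example-001",
    "type": ExampleType.DETAIL.value,
    "category": "ecommerce",
    "intent": "Extract the product name and price",
    "expected": {"name": "Banana Slicer", "price": "$4.99"},
}

serialized = json.dumps(example, indent=2)
```

Filtering by intent, category, or id then amounts to selecting entries whose fields match before the harness turns them into pytest tests.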
Use Cases
Bananalyzer helps developers and researchers evaluate web-capable agents by providing reproducible, versioned web test scenarios and a standardized way to measure agent output against expected structured JSON. By using static MHTML snapshots and a common example schema, the project reduces flakiness from live site changes and enables consistent comparisons across agents and over time. The AgentRunner contract and pytest integration make it straightforward to plug in custom agents and run selective test suites via the CLI. The FastAPI server and documented examples streamline inspection of dataset contents, and planned features aim to broaden coverage to navigation, clicks, sign-in flows, and integration of other web evaluation datasets.
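One way "measuring agent output against expected structured JSON" could look in practice is a simple per-field comparison. The scoring rule below is an illustration of the idea, not Bananalyzer's actual evaluator:

```python
from typing import Any, Dict


def field_match_score(expected: Dict[str, Any], actual: Dict[str, Any]) -> float:
    """Return the fraction of expected fields the agent reproduced exactly.

    Illustrative metric only; the project's real evaluation logic may differ.
    """
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


score = field_match_score(
    {"name": "Banana Slicer", "price": "$4.99"},
    {"name": "Banana Slicer", "price": "$5.99", "rating": "4.7"},
)
# score == 0.5: one of the two expected fields matched exactly
```

Because the expected output is stored alongside each example, the same deterministic check can be replayed against any agent, which is what makes comparisons across agents and over time consistent.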
