Basic Information

Tarsier is a developer-focused Python library that provides visual perception utilities for web interaction agents. It converts web pages and screenshots into structured, LLM-friendly representations and tags interactable page elements with stable IDs so an LLM can reference and act on them. The project addresses common problems in LLM-driven browser automation: how to represent page structure, how to map natural-language actions back to DOM elements, and how to convey visual layout to text-only models. The README includes usage examples that integrate with Playwright and show how to obtain a text representation and a mapping from tags to xpaths. The project is distributed on PyPI and intended for use inside agent stacks such as LangChain and LlamaIndex.

App Details

Features
Tarsier tags visible interactable elements on a page with bracketed IDs so agents can perform actions like CLICK [23]. Distinct tag markers convey element semantics: [#ID] for text-insertable fields, [@ID] for hyperlinks, [$ID] for other interactable controls, plus optional plain-text tagging. It includes an OCR-based converter that turns page screenshots into a whitespace-structured text representation that text-only LLMs can interpret. The library ships adapters for OCR services, including Google Cloud Vision, and its examples also reference a Microsoft Azure OCR service class. It exposes a page_to_text API that yields printable page text together with a tag-to-xpath mapping, and it provides examples for LangChain and LlamaIndex agents, Playwright integration, a pip-installable package, and development tooling such as TypeScript build scripts, pytest-based tests, and formatting scripts.
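The tag formats above can be illustrated with a small parsing sketch. Only the bracketed marker conventions come from the README; the regex, the `extract_tags` helper, and the sample string are hypothetical, not part of Tarsier's API:

```python
import re

# Tag formats described in the Tarsier README (the parser itself is a
# hypothetical sketch, not library code):
#   [#ID] -> text-insertable field      [@ID] -> hyperlink
#   [$ID] -> other interactable control [ID]  -> plain text (optional)
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]")
TAG_KINDS = {"#": "text_input", "@": "link", "$": "interactable", "": "text"}

def extract_tags(tagged_text: str) -> list[tuple[int, str]]:
    """Return (id, kind) pairs for every bracketed tag in a tagged page."""
    return [(int(tag_id), TAG_KINDS[marker])
            for marker, tag_id in TAG_PATTERN.findall(tagged_text)]

sample = '[@0] Home   [#1] Search...   [$2] Submit   [3] Welcome!'
print(extract_tags(sample))
# → [(0, 'link'), (1, 'text_input'), (2, 'interactable'), (3, 'text')]
```

An agent prompt built on such a parse can then restrict the model's action space to the tag IDs that actually appear on the page.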
Use Cases
Tarsier bridges the gap between web pages and language models by producing representations an LLM can reason about and by making element references actionable. Tagging creates a stable mapping so a language model can specify UI actions that translate directly to DOM operations. The OCR text output lets unimodal (text-only) LLMs grasp visual layout and textual context without relying on vision-capable models; the README reports that this approach improved performance in internal benchmarks. Examples and cookbooks demonstrate how to embed Tarsier into web agents so developers can prototype autonomous web tasks, capture page text, and receive a tag-to-xpath mapping for execution. The library reduces the engineering overhead of building a perception layer for web automation and agent integration.
