Basic Information

ViDoRAG is a research and engineering repository for building and evaluating retrieval-augmented generation (RAG) systems over visually rich documents. It provides a multi-agent RAG framework in which iterative, actor-critic-style reasoning agents handle complex multi-hop queries against large document collections. The project includes the ViDoSeek benchmark, a dataset designed for retrieval-reason-answer tasks on visually rich documents, along with tooling to preprocess PDFs into page images, optional text extraction via OCR or a vision-language model, and scripts to build an index and run retrieval and generation end to end. The README documents dependency setup, ingestion and embedding steps, dynamic retrieval options, a multi-agent generation entry point, and an evaluation pipeline. The codebase is aimed at researchers and developers who want to reproduce the experiments, try out multimodal retrievers, or extend agent-based RAG for visual documents.
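To orient readers, that flow can be sketched as a single driver script. Everything below is illustrative: the module paths, helper names, and call signatures (convert_pdfs, build_index, and the HybridSearchEngine and ViDoRAGAgents constructors) are assumptions inferred from the script and class names mentioned in this description, not the repository's verified API.

```python
# Hypothetical end-to-end sketch of the ViDoRAG pipeline. Module paths and
# signatures are assumptions inferred from the README, not the real API.

# 1. Preprocess: render each PDF page to an image for visual retrieval.
#    (The repo ships a pdf2images script; this call shape is assumed.)
from pdf2images import convert_pdfs  # hypothetical helper
convert_pdfs(src_dir="./data/ViDoSeek/pdf", out_dir="./data/ViDoSeek/img")

# 2. Ingest: embed page images and build a Llama-Index-backed index database.
from ingestion import build_index  # hypothetical helper name
index = build_index(image_dir="./data/ViDoSeek/img", embed_model="vl_embedding")

# 3. Retrieve and generate: hybrid retrieval feeds the multi-agent generator.
from search_engine import HybridSearchEngine  # class named in this description
from vidorag_agents import ViDoRAGAgents      # module named in this description

engine = HybridSearchEngine(index)            # constructor shape assumed
agents = ViDoRAGAgents(llm="qwen-vl-max")     # model choice is illustrative

query = "Which year shows the largest revenue jump in the bar chart?"
pages = engine.search(query, top_k=10)        # dynamic recall may adjust depth
print(agents.run(query, pages))               # iterative actor-critic loop
```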

App Details

Features
The repository bundles the labeled ViDoSeek dataset, with queries and answers in a JSON format plus metadata identifying reference pages. It implements a GMM-based multimodal hybrid retrieval strategy and supports single-modal and hybrid retrievers through the SearchEngine and HybridSearchEngine classes. Tooling includes vl_embedding.py for checking embedding models, ingestion.py for building Llama-Index-based index databases, pdf2images and OCR scripts for converting documents and extracting text, and optional VLM-based OCR. A multi-agent generation module (vidorag_agents.py) wraps an LLM interface to run the ViDoRAG agents, and eval.py provides LLM-based evaluation plus configuration options for different retrieval and generation experiment types. Dependency and environment setup instructions are included.
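The GMM-based dynamic recall is not spelled out here, but the general idea behind such strategies, fitting a two-component Gaussian mixture to the retrieval scores and keeping only candidates in the high-score component, can be sketched with scikit-learn. This is an illustrative reconstruction of the technique, not ViDoRAG's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dynamic_recall(scores: np.ndarray, max_k: int = 20) -> np.ndarray:
    """Pick a query-dependent cutoff: fit a 2-component GMM to retrieval
    scores and keep candidates assigned to the higher-mean component.
    Illustrative sketch, not the repository's code."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    labels = gmm.fit_predict(scores.reshape(-1, 1))
    high = np.argmax(gmm.means_.ravel())        # component with larger mean
    keep = np.where(labels == high)[0]
    # Rank surviving candidates by score and cap the recall depth.
    return keep[np.argsort(-scores[keep])][:max_k]

# Example: scores from a hybrid retriever over 100 candidate pages.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.3, 0.05, 92),   # background pages
                         rng.normal(0.8, 0.05, 8)])   # truly relevant pages
print(dynamic_recall(scores))   # indices in the high-score cluster
```

The payoff of this approach is that the number of retrieved pages adapts per query: an easy query with a few clearly relevant pages recalls few candidates, while an ambiguous one recalls more.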
Use Cases
ViDoRAG helps researchers and developers build, test, and evaluate multimodal retrieval-augmented generation workflows on visually rich document collections. It supplies a benchmark dataset and end-to-end scripts to ingest documents, produce multimodal embeddings, and index content for retrieval. The hybrid retrieval implementation and tunable GMM-based dynamic recall let users compare single-modal and multimodal strategies. The multi-agent generation component demonstrates iterative reasoning with LLMs and can be integrated into other pipelines. The evaluation tooling enables reproducible assessment of retrieval and generation quality using LLM-based metrics. Installation guidance and example usage make it straightforward to reproduce the experiments or adapt the components to new datasets and embedding models.
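To make "LLM-based assessment" concrete, an evaluation loop in the spirit of eval.py might prompt a judge model to compare each generated answer against the reference. The client setup, judge model, prompt wording, and results-file schema below are all assumptions for illustration, not the repository's actual evaluation code.

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint; assumed choice

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "Question: {q}\nReference answer: {ref}\nModel answer: {pred}\n"
    "Does the model answer convey the same information as the reference? "
    "Reply with exactly one word: correct or incorrect."
)

def judge(q: str, ref: str, pred: str) -> bool:
    """Ask a judge LLM whether a prediction matches the reference answer."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is an illustrative choice
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(q=q, ref=ref, pred=pred)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("correct")

# Score a results file with one record per query; the field names
# "query", "reference_answer", "predicted_answer" are an assumed schema.
with open("results.json") as f:
    records = json.load(f)
acc = sum(judge(r["query"], r["reference_answer"], r["predicted_answer"])
          for r in records) / len(records)
print(f"LLM-judged accuracy: {acc:.3f}")
```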
