OpenContracts

Basic Information

OpenContracts is a free, open-source document analytics platform, licensed under GPL-3, designed to ingest, analyze, and annotate unstructured documents, primarily PDFs and plain-text files. It provides an enterprise-oriented workspace and APIs for managing document corpora, extracting layout-aware text blocks, generating vector embeddings, and running LLM-backed queries. The project exposes a pluggable parsing pipeline and a microservice analyzer architecture, so teams can add new ingestion engines, custom parsers, embedders, and thumbnail generators. It includes a web-based human annotation interface, tooling for bulk data extraction across many documents, and integrations with vector search tooling such as a Django-backed pgvector store and LlamaIndex wrappers. Documentation and quickstart guides are provided, and the project is structured to be deployed locally or in container environments.

App Details

Features
The repository documents and implements a modular pipeline with three component types: parsers, which extract text and structure; embedders, which generate vector representations; and thumbnailers, which create previews. Core features include PDF layout parsing that maps text to visual blocks, automatic generation of vector embeddings for blocks and documents, a pluggable microservice analyzer architecture for automatic annotations, and a web-based human annotation UI that supports multi-page annotations. The platform also offers LlamaIndex integration for LLM retrieval and question answering, a data-extract grid for bulk querying, and a documented API with walkthroughs for writing custom data extractors and analyzers. The stack pairs Django with pgvector as a hybrid vector database and ships sample integrations with LlamaIndex and Marvin.
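As a rough illustration of the pluggable pipeline idea described above, the following Python sketch defines abstract parser and embedder base classes and chains one toy implementation of each. Every name here (TextBlock, BaseParser, BaseEmbedder, PlainTextParser, LetterCountEmbedder, run_pipeline) is hypothetical and chosen for this example; none of it is taken from the OpenContracts codebase, and a thumbnailer is left out for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """A span of text tied to a page and a bounding box (hypothetical shape)."""
    page: int
    bbox: tuple[float, float, float, float]  # left, top, right, bottom
    text: str
    embedding: list[float] = field(default_factory=list)


class BaseParser(ABC):
    """Hypothetical base class: turns raw bytes into layout-aware blocks."""

    @abstractmethod
    def parse(self, raw: bytes) -> list[TextBlock]: ...


class BaseEmbedder(ABC):
    """Hypothetical base class: turns text into a vector."""

    @abstractmethod
    def embed(self, text: str) -> list[float]: ...


class PlainTextParser(BaseParser):
    """Toy parser: one block per non-empty line, with a made-up bounding box."""

    def parse(self, raw: bytes) -> list[TextBlock]:
        lines = [ln for ln in raw.decode("utf-8").splitlines() if ln.strip()]
        return [
            TextBlock(page=1, bbox=(0.0, 12.0 * i, 100.0, 12.0 * (i + 1)), text=ln.strip())
            for i, ln in enumerate(lines)
        ]


class LetterCountEmbedder(BaseEmbedder):
    """Toy embedder: counts of a, e, i, o, u. A stand-in for a real model."""

    def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [float(lowered.count(vowel)) for vowel in "aeiou"]


def run_pipeline(raw: bytes, parser: BaseParser, embedder: BaseEmbedder) -> list[TextBlock]:
    """Parse the document, then attach an embedding to every block."""
    blocks = parser.parse(raw)
    for block in blocks:
        block.embedding = embedder.embed(block.text)
    return blocks


blocks = run_pipeline(
    b"Lease Agreement\n\nTerm: five years\n",
    PlainTextParser(),
    LetterCountEmbedder(),
)
```

Because run_pipeline depends only on the two base classes, swapping in a different parser or embedder requires no change to the pipeline itself, which is the property the pluggable design is meant to provide.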
Use Cases
OpenContracts helps teams build searchable, LLM-augmented document workflows by standardizing how textual and layout data are represented and stored, which makes annotations and extracted data portable. Its pluggable pipeline model and base classes for processors reduce the engineering effort needed to add new document formats or custom analytics. Legal, compliance, research, or analytics teams can use the annotation UI to create training data, and can use the vector-backed store together with the LlamaIndex wrappers to implement question answering and bulk data extraction across corpora. The platform supports rapid prototyping of bespoke analyzers and reusable extractors, and provides deployment and developer documentation to ease local or containerized setup. A noted limitation is that support currently focuses on PDF and plain-text formats.
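To make the vector-backed question answering step more concrete, here is a minimal pure-Python sketch of similarity search over stored block embeddings. It assumes cosine similarity as the ranking metric and uses small hand-written vectors; in the platform itself this lookup would be handled by the pgvector store rather than by Python code, and the function names (cosine_similarity, top_k) and sample data are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-written 3-dimensional "embeddings" standing in for real model output.
store = [
    ("Tenant shall pay rent monthly.", [0.9, 0.1, 0.0]),
    ("This agreement is governed by New York law.", [0.1, 0.9, 0.1]),
    ("Rent increases 3% each year.", [0.8, 0.0, 0.3]),
]

# A query vector that points along the first ("rent-like") dimension.
results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The retrieved blocks would then be passed to an LLM as context for answering the question, which is the role the LlamaIndex wrappers play in the description above.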
