DataFlow
Basic Information
DataFlow is an open-source, data-centric AI system and toolkit for preparing, generating, processing and evaluating high-quality training and retrieval data for large language models. It takes noisy sources such as PDFs, plain text and low-quality QA datasets and transforms them into curated datasets suitable for pre-training, supervised fine-tuning, reinforcement learning and retrieval-augmented generation. The repository provides a modular operator abstraction, composable pipelines built from those operators, and a DataFlow-agent that can author operators and dynamically assemble pipelines to meet task objectives. The project targets domain-specific model improvement, and it reports empirical validation in domains such as healthcare, finance and law. It also includes a CLI, Gradio web interfaces for the operators and the agent, package installation instructions, and references to a managed SaaS deployment on an intelligent data platform.
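To make the operator and pipeline idea concrete, here is a minimal sketch in plain Python. It is not DataFlow's actual API: the names `Pipeline`, `strip_whitespace` and `drop_short_answers`, along with the record layout, are hypothetical and chosen only to illustrate how small, single-purpose operators might be chained over a noisy QA dataset.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical record layout: one QA pair per dict (an assumption, not DataFlow's schema).
Record = Dict[str, str]
# In this sketch an operator is any function mapping a list of records to a list of records.
Operator = Callable[[List[Record]], List[Record]]


def strip_whitespace(records: List[Record]) -> List[Record]:
    """Collapse runs of whitespace and trim both ends of every field."""
    return [{k: " ".join(v.split()) for k, v in r.items()} for r in records]


def drop_short_answers(min_chars: int) -> Operator:
    """Build a filter operator that removes records whose answer is too short."""
    def op(records: List[Record]) -> List[Record]:
        return [r for r in records if len(r.get("answer", "")) >= min_chars]
    return op


@dataclass
class Pipeline:
    """Runs operators in order, feeding each one's output to the next."""
    operators: List[Operator] = field(default_factory=list)

    def run(self, records: List[Record]) -> List[Record]:
        for op in self.operators:
            records = op(records)
        return records


# Illustrative input only; not taken from any real dataset.
noisy = [
    {"question": "  What is   RAG? ", "answer": "Retrieval-augmented generation."},
    {"question": "Define SFT", "answer": "n/a"},
]

pipeline = Pipeline([strip_whitespace, drop_short_answers(min_chars=10)])
cleaned = pipeline.run(noisy)
print(cleaned)
```

Because every operator shares one signature, steps can be reordered, swapped or generated programmatically, which is the property an agent would need in order to assemble pipelines on its own.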