curator
Basic Information
Bespoke Curator is a Python library and toolset for building scalable synthetic data pipelines and running bulk inference in post-training workflows. It is designed to generate, curate, and monitor large labeled datasets for fine-tuning and distillation, to extract structured outputs reliably, and to prepare data for retrieval-augmented or domain-specific training. The project provides a programmatic LLM block abstraction that pairs prompt construction with structured parsing of responses, along with built-in caching and retries, asynchronous and parallel execution, and an optional hosted viewer for streaming and inspecting generated responses. Curator supports a wide range of model providers and backends, including OpenAI-compatible APIs, LiteLLM, vLLM, Ollama, and DeepSeek, as well as specialized batch APIs that reduce inference costs. The repository also includes examples for reasoning datasets, multimodal generation, code execution pipelines, and fine-tuning data preparation, plus CLI utilities and documented environment variables for reproducible large-scale data generation.
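To make the block idea concrete, here is a minimal sketch of a prompt/parse block with caching, retries, and parallel execution, written with the Python standard library only. It is not Curator's API: `SketchBlock`, `fake_backend`, `max_retries`, and `backend_calls` are hypothetical names chosen for illustration, and the model call is replaced by a local stub so the example runs offline.

```python
import hashlib
import json
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class SketchBlock:
    """Hypothetical prompt/parse block for illustration; not Curator's real API."""

    def __init__(self, backend: Callable[[str], str], max_retries: int = 3):
        self.backend = backend            # any callable: prompt text -> raw response text
        self.max_retries = max_retries
        self.cache: Dict[str, str] = {}   # prompt hash -> raw response that parsed cleanly
        self.backend_calls = 0            # counts real (uncached) backend calls
        self._lock = threading.Lock()     # guards cache and counter across worker threads

    def prompt(self, row: dict) -> str:
        # Build the prompt text from one input row.
        return f"Return JSON with a 'label' key for: {row['text']}"

    def parse(self, row: dict, response: str) -> dict:
        # Turn the raw response into a structured output row; raises on bad output.
        data = json.loads(response)
        return {"text": row["text"], "label": data["label"]}

    def _run_one(self, row: dict) -> dict:
        text = self.prompt(row)
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        last_error = None
        for _ in range(self.max_retries):
            with self._lock:
                raw = self.cache.get(key)
            if raw is None:
                with self._lock:
                    self.backend_calls += 1
                raw = self.backend(text)
            try:
                parsed = self.parse(row, raw)
            except (ValueError, KeyError, TypeError) as exc:
                last_error = exc          # malformed response: try again
                continue
            with self._lock:
                self.cache[key] = raw     # cache only responses that parsed
            return parsed
        raise RuntimeError(
            f"no parseable response after {self.max_retries} attempts"
        ) from last_error

    def __call__(self, rows: List[dict]) -> List[dict]:
        # Process rows in parallel; pool.map preserves input order.
        with ThreadPoolExecutor(max_workers=4) as pool:
            return list(pool.map(self._run_one, rows))


def fake_backend(prompt_text: str) -> str:
    # Stand-in for a model call (assumption): labels text by its final character.
    label = "question" if prompt_text.rstrip().endswith("?") else "statement"
    return json.dumps({"label": label})


block = SketchBlock(fake_backend)
rows = [{"text": "Is it raining?"}, {"text": "It is raining."}]
first = block(rows)    # two backend calls, one per distinct prompt
second = block(rows)   # answered from the in-memory cache
```

Keying the cache on a hash of the prompt text means a rerun over the same rows costs no further backend calls, which is the property that makes large generation jobs resumable. A production pipeline would persist the cache to disk and rate-limit requests; both are omitted here for brevity.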