curator


Basic Information

Bespoke Curator is a Python library and toolset for building scalable synthetic data pipelines and performing bulk inference for post-training workflows. It is designed to generate, curate, and monitor large labeled datasets for fine-tuning and distillation, to extract structured outputs reliably, and to prepare data for retrieval-augmented or domain-specific training. The project provides a programmatic LLM block abstraction for prompt construction and structured parsing, built-in caching and retries, asynchronous and parallel execution, and optional hosted viewer integration to stream and inspect generated responses. Curator supports a wide range of model providers and backends, including OpenAI-compatible APIs, LiteLLM, vLLM, Ollama, and DeepSeek, as well as specialized batch APIs that cut inference costs. The repository also includes examples for reasoning datasets, multimodal generation, code execution pipelines, and fine-tuning data preparation, plus CLI utilities and documented environment variables for reproducible large-scale data generation.
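
The prompt-and-parse block idea described above can be illustrated with a minimal, library-free sketch. It does not use Curator's actual API: the `SentimentBlock` class, the `Label` schema, and the `fake_model` stub are all hypothetical names chosen for illustration, and a dataclass stands in for a pydantic model so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Label:
    """Structured output schema (a stand-in for a pydantic model)."""
    text: str
    sentiment: str


class SentimentBlock:
    """Hypothetical block with one prompt hook and one parse hook."""

    def __init__(self, model: Callable[[str], str]):
        self.model = model  # any callable mapping a prompt to raw text

    def prompt(self, row: dict) -> str:
        # Build the prompt for a single input row.
        return f"Classify the sentiment of: {row['text']}. Reply as JSON."

    def parse(self, row: dict, response: str) -> Label:
        # Turn raw model text into a typed record; fail loudly on bad labels.
        data = json.loads(response)
        if data["sentiment"] not in {"positive", "negative"}:
            raise ValueError(f"unexpected label: {data['sentiment']}")
        return Label(text=row["text"], sentiment=data["sentiment"])

    def __call__(self, rows: list[dict]) -> list[Label]:
        # Run prompt -> model -> parse for every row, in order.
        return [self.parse(r, self.model(self.prompt(r))) for r in rows]


def fake_model(prompt: str) -> str:
    # Stub standing in for a real provider call (assumption, not a real API).
    label = "positive" if "love" in prompt else "negative"
    return json.dumps({"sentiment": label})


labels = SentimentBlock(fake_model)([{"text": "I love this"}, {"text": "It broke"}])
```

Keeping prompt construction and response parsing in separate hooks means the same block can be pointed at a different model callable without touching the labeling logic.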

App Details

Features
Curator exposes a typed LLM block API with prompt and parse hooks, plus pydantic-based structured outputs that produce machine-readable labels. It supports batch mode for cost-efficient bulk inference, provider backends such as OpenAI, Anthropic, Gemini via LiteLLM, vLLM, Ollama, DeepSeek, and kluster.ai, and configurable backend parameters for rate limiting and retries. Performance features include asynchronous execution, caching, fault recovery, and parallelism across CPUs or clusters. A hosted Curator viewer streams generation progress and visualizes responses, while optional authentication links datasets to a Bespoke Labs account. Code execution is supported through multiple backends (local multiprocessing, Ray, Docker, e2b) to run and validate generated code. The repo includes many examples, covering fine-tuning and distillation, reasoning dataset generation, multimodal recipes, synthetic charts, ungrounded QA, and function calling. Installation is via pip.
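
How asynchronous execution, caching, and retries fit together can be sketched with the standard library alone. This is an illustration under stated assumptions, not Curator's implementation: `flaky_model`, `generate`, `run`, the in-memory `CACHE`, and the concurrency limit of 2 are all hypothetical.

```python
import asyncio
import hashlib

CACHE: dict[str, str] = {}   # in-memory stand-in for a persistent response cache
CALLS = {"count": 0}         # counts stub provider calls, for illustration only


async def flaky_model(prompt: str) -> str:
    # Stub provider: the first attempt for each prompt fails, the next succeeds.
    CALLS["count"] += 1
    if prompt not in flaky_model.seen:
        flaky_model.seen.add(prompt)
        raise ConnectionError("transient failure")
    return prompt.upper()

flaky_model.seen = set()


async def generate(prompt: str, limit: asyncio.Semaphore, max_retries: int = 3) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in CACHE:                      # cache hit: skip the provider entirely
        return CACHE[key]
    async with limit:                     # cap the number of in-flight requests
        for attempt in range(max_retries):
            try:
                CACHE[key] = await flaky_model(prompt)
                return CACHE[key]
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise                 # give up after the last attempt
                await asyncio.sleep(0.01 * 2 ** attempt)  # exponential backoff


async def run(prompts: list[str]) -> list[str]:
    limit = asyncio.Semaphore(2)          # hypothetical concurrency limit
    return await asyncio.gather(*(generate(p, limit) for p in prompts))


first = asyncio.run(run(["a", "b", "c"]))
second = asyncio.run(run(["a", "b", "c"]))  # served entirely from the cache
```

In this sketch the second run makes no provider calls at all, which is the property that makes interrupted or repeated dataset runs cheap to resume.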
Use Cases
Curator helps researchers and engineers rapidly create high-quality datasets at scale by automating prompt orchestration, parallel inference, result parsing, caching, and retries. Batch mode support and provider integrations reduce token costs and simplify the use of batch APIs from major providers. Structured output support increases label consistency for fine-tuning and RAG workflows, and the viewer enables visual inspection of, and collaboration on, generated datasets. Code execution backends let teams validate and render generated artifacts such as charts or animations. Environment variables and optional hosted authentication enable reproducible runs, dataset access control, and cost reporting when linked to a Bespoke Labs account. Built-in telemetry is minimal and users can opt out of it, and comprehensive examples and documentation speed up onboarding for production dataset pipelines.
