Synthetic Data Generator

Basic Information

The Synthetic Data Generator (SDG) is an open source framework for creating high-quality synthetic tabular data that preserves the statistical properties of the original datasets while avoiding exposure of sensitive information. It is designed for use cases such as data sharing, model training, debugging, and system development and testing, where privacy-safe replicas are needed. The project provides models, data connectors, a Synthesizer API, example workflows and Colab demos. It supports generation from metadata when no training data is available and includes tools for handling large-scale datasets with memory optimizations. Distribution options include a prebuilt Docker image and a PyPI package, and the codebase is accompanied by documentation, benchmarks and contribution guidance. The project is licensed under Apache-2.0.
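
A minimal quick-start sketch, assuming the PyPI package name sdgx from this listing and the CsvConnector, Synthesizer and download_demo_data components described under Features below; the import paths and the CTGANSynthesizerModel class name follow the project's published quick-start as far as known here and should be verified against the current documentation.

    # Minimal quick-start sketch; install with `pip install sdgx` (PyPI package named in this listing).
    # Import paths and the CTGANSynthesizerModel name are assumptions; verify against the docs.
    from sdgx.data_connectors.csv_connector import CsvConnector
    from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
    from sdgx.synthesizer import Synthesizer
    from sdgx.utils import download_demo_data

    # Download a small demo CSV and expose it through a data connector.
    demo_csv_path = download_demo_data()
    connector = CsvConnector(path=demo_csv_path)

    # Fit a CTGAN-based model on the connected data, then sample synthetic rows.
    synthesizer = Synthesizer(
        model=CTGANSynthesizerModel(epochs=1),  # tiny epoch count, just a smoke test
        data_connector=connector,
    )
    synthesizer.fit()
    synthetic_table = synthesizer.sample(1000)
    print(synthetic_table.head())

Training and sampling both go through the single Synthesizer object, so switching to another synthesis approach should only require swapping the model class passed to it.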

App Details

Features
SDG ships multiple synthesis approaches, including CTGAN and a CTGAN variant optimized for billion-level data, references to GAN- and VAE-based methods, and a GaussianCopula model integrated into the Data Processor system. It also integrates an LLM-based model, SingleTableGPTModel, which can generate synthetic tables from metadata and perform off-table feature inference. The Data Processor module converts and restores column formats, handles null values, infers and records metadata for single- and multi-table schemas, and exposes a plugin system for extensibility. The repository includes data connectors such as CsvConnector, an example Synthesizer class, demo utilities like download_demo_data, Colab examples, the Docker image idsteam/sdgx:latest, the PyPI package sdgx, and documentation and benchmarks comparing memory use against other tools.
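
The metadata-only path can be sketched as follows; apart from the SingleTableGPTModel name taken from this listing, the import path, the metadata object and the method names are assumptions for illustration, not the documented API.

    # Illustrative only: apart from the SingleTableGPTModel name, the import path
    # and method calls below are assumptions to check against the documentation.
    from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel  # assumed path

    # `schema_metadata` stands for a metadata object describing column names and
    # types; its construction is omitted because the exact API is not assumed here.
    schema_metadata = ...

    model = SingleTableGPTModel()            # LLM-backed single-table model
    model.fit(schema_metadata)               # assumed: fit from metadata alone, no training rows
    synthetic_rows = model.sample(50)        # assumed sampling interface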
Use Cases
SDG helps teams and researchers obtain privacy-preserving tabular datasets for experimentation, model training and software testing without exposing original sensitive records. It enables synthetic data generation when upstream data is unavailable by using metadata-driven LLM synthesis, and it can infer new columns via off-table feature inference. The Data Processor ensures correct handling of types such as datetime and discrete fields and reduces memory usage for large categorical datasets, enabling large-scale training. Plugin and connector support make it adaptable to custom pipelines, while Docker and PyPI distribution plus Colab examples lower the barrier to trial and deployment. Documentation, benchmarks and an Apache-2.0 license support adoption and contribution.
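
As a generic illustration of the convert-and-restore idea behind the Data Processor (plain pandas, not the sdgx API): datetime columns are encoded as numeric timestamps before training and decoded after sampling, and string columns are stored as pandas categoricals to reduce memory for large categorical datasets.

    # Generic illustration of the convert/restore pattern, not sdgx code.
    import pandas as pd

    def convert(df: pd.DataFrame) -> pd.DataFrame:
        """Make columns model-friendly: datetimes -> int64 nanoseconds, strings -> category."""
        out = df.copy()
        for col in out.columns:
            if pd.api.types.is_datetime64_any_dtype(out[col]):
                out[col] = out[col].astype("int64")          # nanoseconds since epoch
            elif pd.api.types.is_object_dtype(out[col]):
                out[col] = out[col].astype("category")       # shrinks large categorical columns
        return out

    def restore(df: pd.DataFrame, datetime_cols: list[str]) -> pd.DataFrame:
        """Undo the conversion so sampled data matches the original column formats."""
        out = df.copy()
        for col in out.columns:
            if isinstance(out[col].dtype, pd.CategoricalDtype):
                out[col] = out[col].astype(object)           # categories back to plain strings
        for col in datetime_cols:
            out[col] = pd.to_datetime(out[col])              # nanoseconds back to datetimes
        return out

    raw = pd.DataFrame({"signup": pd.to_datetime(["2021-01-01", "2021-06-15"]),
                        "plan": ["free", "pro"]})
    encoded = convert(raw)                                   # what a model would train on
    decoded = restore(encoded, datetime_cols=["signup"])     # what a user gets back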
