Synthetic Data Generator

Basic Information

The Synthetic Data Generator (SDG) is an open source framework for creating high-quality synthetic tabular data that preserves the statistical properties of the original datasets while avoiding exposure of sensitive information. It is designed for use cases such as data sharing, model training, debugging, and system development and testing, where privacy-safe replicas are needed. The project provides models, data connectors, a Synthesizer API, example workflows and Colab demos. It supports generation from metadata when no training data is available and includes tools for handling large-scale datasets with memory optimizations. Distribution options include a prebuilt Docker image and a PyPI package, and the codebase is accompanied by documentation, benchmarks and contribution guidance. The project is licensed under Apache-2.0.
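
A minimal quick-start sketch, assuming the PyPI package name sdgx from this listing and the CsvConnector, Synthesizer and download_demo_data components described under Features below; the import paths and the CTGANSynthesizerModel class name follow the project's published quick-start as far as known here and should be verified against the current documentation.

    # Minimal quick-start sketch; install with `pip install sdgx` (PyPI package named in this listing).
    # Import paths and the CTGANSynthesizerModel name are assumptions; verify against the docs.
    from sdgx.data_connectors.csv_connector import CsvConnector
    from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
    from sdgx.synthesizer import Synthesizer
    from sdgx.utils import download_demo_data

    # Download a small demo CSV and expose it through a data connector.
    demo_csv_path = download_demo_data()
    connector = CsvConnector(path=demo_csv_path)

    # Fit a CTGAN-based model on the connected data, then sample synthetic rows.
    synthesizer = Synthesizer(
        model=CTGANSynthesizerModel(epochs=1),  # tiny epoch count, just a smoke test
        data_connector=connector,
    )
    synthesizer.fit()
    synthetic_table = synthesizer.sample(1000)
    print(synthetic_table.head())

Training and sampling both go through the single Synthesizer object, so switching to another synthesis approach should only require swapping the model class passed to it.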

App Details

Features
SDG ships multiple synthesis approaches, including CTGAN and a CTGAN variant optimized for billion-level data, references to GAN- and VAE-based methods, and a GaussianCopula model integrated into the Data Processor system. It also integrates an LLM-based model, SingleTableGPTModel, which can generate synthetic tables from metadata and perform off-table feature inference. The Data Processor module converts and restores column formats, handles null values, infers and records metadata for single- and multi-table schemas, and exposes a plugin system for extensibility. The repository includes data connectors such as CsvConnector, an example Synthesizer class, demo utilities like download_demo_data, Colab examples, the Docker image idsteam/sdgx:latest, the PyPI package sdgx, and documentation and benchmarks comparing memory use against other tools.
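
The metadata-only path can be sketched as follows; apart from the SingleTableGPTModel name taken from this listing, the import path, the metadata object and the method names are assumptions for illustration, not the documented API.

    # Illustrative only: apart from the SingleTableGPTModel name, the import path
    # and method calls below are assumptions to check against the documentation.
    from sdgx.models.LLM.single_table.gpt import SingleTableGPTModel  # assumed path

    # `schema_metadata` stands for a metadata object describing column names and
    # types; its construction is omitted because the exact API is not assumed here.
    schema_metadata = ...

    model = SingleTableGPTModel()            # LLM-backed single-table model
    model.fit(schema_metadata)               # assumed: fit from metadata alone, no training rows
    synthetic_rows = model.sample(50)        # assumed sampling interface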
Use Cases
SDG helps teams and researchers obtain privacy-preserving tabular datasets for experimentation, model training and software testing without exposing original sensitive records. It enables synthetic data generation when upstream data is unavailable by using metadata-driven LLM synthesis, and it can infer new columns via off-table feature inference. The Data Processor ensures correct handling of types such as datetime and discrete fields and reduces memory usage for large categorical datasets, enabling large-scale training. Plugin and connector support make it adaptable to custom pipelines, while Docker and PyPI distribution plus Colab examples lower the barrier to trial and deployment. Documentation, benchmarks and an Apache-2.0 license support adoption and contribution.
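
As a generic illustration of the convert-and-restore idea behind the Data Processor (plain pandas, not the sdgx API): datetime columns are encoded as numeric timestamps before training and decoded after sampling, and string columns are stored as pandas categoricals to reduce memory for large categorical datasets.

    # Generic illustration of the convert/restore pattern, not sdgx code.
    import pandas as pd

    def convert(df: pd.DataFrame) -> pd.DataFrame:
        """Make columns model-friendly: datetimes -> int64 nanoseconds, strings -> category."""
        out = df.copy()
        for col in out.columns:
            if pd.api.types.is_datetime64_any_dtype(out[col]):
                out[col] = out[col].astype("int64")          # nanoseconds since epoch
            elif pd.api.types.is_object_dtype(out[col]):
                out[col] = out[col].astype("category")       # shrinks large categorical columns
        return out

    def restore(df: pd.DataFrame, datetime_cols: list[str]) -> pd.DataFrame:
        """Undo the conversion so sampled data matches the original column formats."""
        out = df.copy()
        for col in out.columns:
            if isinstance(out[col].dtype, pd.CategoricalDtype):
                out[col] = out[col].astype(object)           # categories back to plain strings
        for col in datetime_cols:
            out[col] = pd.to_datetime(out[col])              # nanoseconds back to datetimes
        return out

    raw = pd.DataFrame({"signup": pd.to_datetime(["2021-01-01", "2021-06-15"]),
                        "plan": ["free", "pro"]})
    encoded = convert(raw)                                   # what a model would train on
    decoded = restore(encoded, datetime_cols=["signup"])     # what a user gets back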
