ChatTTS

Basic Information

ChatTTS is an open-source text-to-speech project focused on dialogue applications and conversational speech synthesis. The repository contains the model and inference code, example scripts, a WebUI, and command-line inference examples, with instructions to install from PyPI, from GitHub, or as a local editable install. It provides pretrained models and supporting files: the main research model is trained on over 100,000 hours of Chinese and English audio, and an open-source 40,000-hour pretrained checkpoint is available on Hugging Face. The code is licensed under AGPLv3+ and the model under CC BY-NC 4.0 for academic and research use. The README documents the supported languages (English and Chinese), usage examples for basic and advanced inference, and contact channels and community resources for discussion. The project also includes a disclaimer and technical notes about streaming audio generation and planned future releases.
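To make the basic flow concrete, here is a minimal sketch following the README's basic-inference example; method names such as `Chat.load` and the exact output array shapes can vary between releases, so treat the details as illustrative.

```python
import torch
import torchaudio

import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)  # compile=True trades startup time for faster inference

texts = ["Hello, this is a short test of conversational speech synthesis."]
wavs = chat.infer(texts)  # one 24 kHz waveform (numpy array) per input text

# Save with torchaudio, as in the README examples; depending on your
# torchaudio version the tensor may need an extra channel dimension
# (torch.from_numpy(wavs[0]).unsqueeze(0)).
torchaudio.save("output0.wav", torch.from_numpy(wavs[0]), 24000)
```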

App Details

Features
The repository emphasizes conversational TTS optimized for dialogue, with multi-speaker capability and fine-grained prosodic control such as laughter, pauses, and interjections. It supports token-level control units like [laugh], [uv_break], and [lbreak], and exposes APIs to sample speakers and set decoding parameters such as temperature, top_P, and top_K. Other features include streaming audio generation, a DVAE encoder and zero-shot inference code on the roadmap for open-source release, pretrained vocoder integration, example notebooks, and a WebUI for live testing. Installation is supported via pip, conda environment notes are provided, and optional GPU-related packages are documented. The README includes usage snippets for basic CLI and Python programmatic inference, advanced word- and sentence-level controls, and examples of saving outputs with torchaudio.
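As an illustration of those controls, the sketch below mirrors the README's advanced examples: a random speaker embedding fixes the timbre, decoding parameters shape sampling, and control tokens are embedded in the text. The parameter classes `ChatTTS.Chat.InferCodeParams` and `ChatTTS.Chat.RefineTextParams` follow recent releases; older versions passed plain dicts, so check the version you install.

```python
import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

# Sample a random speaker embedding so the voice stays fixed across calls.
rand_spk = chat.sample_random_speaker()

params_infer_code = ChatTTS.Chat.InferCodeParams(
    spk_emb=rand_spk,   # sampled speaker timbre
    temperature=0.3,    # lower values give more stable prosody
    top_P=0.7,
    top_K=20,
)

# Sentence-level prosody controls applied during the text-refinement pass.
params_refine_text = ChatTTS.Chat.RefineTextParams(
    prompt="[oral_2][laugh_0][break_6]",
)

# Word-level control tokens can be embedded directly in the input text.
text = "What is [uv_break]your favorite english food?[laugh][lbreak]"

wavs = chat.infer(
    [text],
    params_refine_text=params_refine_text,
    params_infer_code=params_infer_code,
)
```
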
Use Cases
ChatTTS helps researchers and developers who need an expressive, dialogue-oriented TTS system for prototyping, research experiments and demos. The pretrained models reduce the cost and time of training from scratch and enable exploration of speaker timbre sampling, prosody control and token-level behavior in conversational contexts. Provided examples and a WebUI allow quick evaluation and integration into downstream systems such as LLM assistants or conversational agents. Streaming support and documented inference parameters let users tune generation speed and quality on available hardware. Licensing and safety notes clarify non-commercial research use and describe mitigations included during training to deter misuse. Community channels and detailed installation steps make it straightforward to reproduce results or extend the codebase.
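For tuning latency on a given machine, the streaming path mentioned above can be exercised as in the hypothetical loop below; it assumes `infer` accepts a `stream` flag that yields partial waveforms, which matches recent ChatTTS examples but may not hold for every release.

```python
import numpy as np

import ChatTTS

chat = ChatTTS.Chat()
chat.load(compile=False)

chunks = []
# Assumes stream=True turns infer into a generator of partial waveforms,
# one entry per input text; entries may be None when no new audio is ready.
for wavs in chat.infer(["A longer passage, synthesized incrementally."], stream=True):
    if wavs[0] is not None and wavs[0].size > 0:
        chunks.append(wavs[0])

audio = np.concatenate(chunks, axis=-1)  # full 24 kHz waveform
```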
