Basic Information

CleanS2S is a prototype Speech-to-Speech (S2S) agent that demonstrates a high-quality, streaming, interactive Chinese voice interface implemented as a compact single-file pipeline. The repository aims to give researchers and developers a readable reference implementation of an end-to-end S2S pipeline that combines Automatic Speech Recognition (ASR), a Large Language Model (LLM) handler, and Text-to-Speech (TTS) into a real-time conversational agent. It emphasizes a Linguistic User Interface (LUI) style of experience, with features for proactive action initiation and subjective action judgement. The project includes demo conversations, backend server scripts for running the streaming pipeline, optional retrieval-augmented generation (RAG) and web search extensions, and a frontend client for trying interactions in a browser. The design targets quick exploration, validation of ideas, and easy customization of models and components.

App Details

Features
The repository highlights a single-file implementation that consolidates the S2S pipeline for easy reading and modification. It supports real-time streaming via WebSockets, full-duplex interaction that allows simultaneous speaking and listening, and interruption handling so that user input can preempt the agent. The pipeline composes ASR, LLM, and TTS components and uses multithreading and queueing for asynchronous, non-blocking processing. Optional integrations include web search and RAG for grounding responses, a configurable choice between an LLM API and local models, and voice conversion (timbre transfer) using reference audio. The project supplies backend scripts, requirements files that include the RAG dependencies, setup instructions for the recommended ASR and TTS models, and frontend instructions with a Docker-based development workflow.
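The threaded, queue-based composition described above can be illustrated with a minimal sketch. This is not code from CleanS2S: the stage functions (`fake_asr`, `fake_llm`, `fake_tts`), the `run_stage` and `run_pipeline` helpers, and the way the interrupt flag is used are all hypothetical stand-ins, assuming each stage runs in its own thread and passes results downstream through a FIFO queue.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def fake_asr(chunk: bytes) -> str:
    """Hypothetical ASR stand-in: treats the audio chunk as UTF-8 text."""
    return chunk.decode("utf-8")


def fake_llm(text: str) -> str:
    """Hypothetical LLM stand-in: echoes the recognized text."""
    return f"echo: {text}"


def fake_tts(reply: str) -> bytes:
    """Hypothetical TTS stand-in: encodes the reply as bytes."""
    return reply.encode("utf-8")


def run_stage(handler, inbox, outbox, interrupt):
    """Consume items from inbox, transform them, and forward to outbox."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        if interrupt.is_set():
            continue  # assumed policy: drop pending work once interrupted
        outbox.put(handler(item))


def run_pipeline(audio_chunks, interrupt=None):
    """Run ASR -> LLM -> TTS, one thread per stage, and collect the output."""
    if interrupt is None:
        interrupt = threading.Event()
    q_audio, q_text, q_reply, q_speech = (queue.Queue() for _ in range(4))
    stages = [
        (fake_asr, q_audio, q_text),
        (fake_llm, q_text, q_reply),
        (fake_tts, q_reply, q_speech),
    ]
    threads = [
        threading.Thread(target=run_stage, args=(h, i, o, interrupt), daemon=True)
        for h, i, o in stages
    ]
    for t in threads:
        t.start()
    for chunk in audio_chunks:
        q_audio.put(chunk)
    q_audio.put(_STOP)

    results = []
    while True:
        item = q_speech.get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([b"hello", b"world"]))
```

Because each stage only blocks on its own inbox, a slow TTS step does not stall recognition of the next utterance. In this sketch a set interrupt flag simply discards queued work; a real agent would presumably also need to cancel audio that is already playing, which is outside what this example shows.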
Use Cases
CleanS2S helps researchers and developers quickly prototype and experience an end-to-end speech conversational system without extensive project overhead. The single-file design lowers the barrier to inspect and modify pipeline logic, swap ASR/LLM/TTS models, and experiment with interruption strategies and proactive behaviors. Real-time streaming and full-duplex support enable human-like interactive testing, while optional web search and RAG add access to external information for more informed replies. Provided examples, demo videos, and runnable backend and frontend instructions simplify reproducing experiments and evaluating conversational behaviors. The repo also documents required model downloads and environment steps, making it practical for validating S2S ideas and extending the pipeline for research or demonstration purposes.