
Basic Information

AI00 RWKV Server is an inference API server for the RWKV family of language models, built on the web-rwkv inference engine. It is a compact, Rust-based server that exposes OpenAI ChatGPT-compatible API endpoints for chat completions, text completions, and embeddings. The project targets GPU acceleration via Vulkan, so it can run on AMD cards and integrated GPUs without CUDA or PyTorch. The repository supplies pre-built executables, instructions for building from source with Rust, a conversion tool for turning PyTorch .pth models into safetensors .st files, and configuration files for model paths and quantization. Documented use cases include chatbots, text generation, translation, and Q&A, and the server includes a WebUI on port 65530 for interactive use.
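
Because the API is OpenAI-compatible, a plain HTTP request is usually enough to try the server out. The sketch below is illustrative only: the base URL, port, and route are assumptions (the WebUI is documented on port 65530, but the exact API path should be checked against the project README).

    # Minimal sketch of a chat completion request against the server.
    # The base URL and route are assumptions; adjust to your deployment.
    import requests

    BASE_URL = "http://localhost:65530"        # assumed: API served on the WebUI port
    ROUTE = "/api/oai/v1/chat/completions"     # assumed OpenAI-style route

    payload = {
        "messages": [{"role": "user", "content": "Translate 'hello' into French."}],
        "max_tokens": 128,
    }

    resp = requests.post(BASE_URL + ROUTE, json=payload, timeout=60)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])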

App Details

Features
The README highlights Vulkan-based parallel and concurrent batched inference for accelerating models on non-NVIDIA GPUs. The server is OpenAI API compatible, exposing endpoints for models, chat/completions, and embeddings, and it is compact, requiring neither PyTorch nor CUDA. It supports int8 and NF4 quantization, LoRA and tuned initial states, SSE push for streaming responses, and batch serving for parallel inference. A model converter is provided (as a Python script or a standalone binary) to convert .pth models to safetensors .st files. BNF sampling is included to constrain model outputs to specified formats. Distribution includes pre-built binaries, configuration via assets/configs/Config.toml, and build instructions using cargo for release builds.
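
As a rough illustration of the SSE streaming mode, the following sketch consumes server-sent events from the chat endpoint. The route, the "stream" flag, and the delta field names are assumptions modeled on the OpenAI streaming format; verify them against the server's documentation.

    # Hedged sketch: stream tokens over SSE from the OpenAI-style endpoint.
    import json
    import requests

    url = "http://localhost:65530/api/oai/v1/chat/completions"  # assumed route
    payload = {
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": True,  # ask the server to push incremental chunks
    }

    with requests.post(url, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue
            data = line[len("data:"):].strip()
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            # Each chunk carries an incremental delta, as in the OpenAI format.
            print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
    print()
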
Use Cases
This project helps developers and operators run RWKV language models locally or on non-CUDA GPUs by providing a ready-to-run inference server with OpenAI-compatible APIs. It lowers deployment friction by removing dependencies on PyTorch and CUDA, and by offering pre-built binaries, a WebUI for quick testing, and a converter for model format compatibility. Quantization, LoRA, and parallel batch serving reduce memory use and latency, making it practical to serve large models on commodity hardware. BNF sampling enables structured outputs for downstream integrations. The documented API endpoints and the README's Python example make it straightforward to integrate the server into applications for chatbots, text generation, translation, Q&A, and embedding-based workflows.
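
For embedding-based workflows, the embeddings endpoint can be called in the same style. The route and the response shape below are assumptions modeled on the OpenAI embeddings API; check the README for the exact contract.

    # Hedged sketch of an embeddings request, e.g. for similarity search.
    import requests

    url = "http://localhost:65530/api/oai/v1/embeddings"  # assumed route
    payload = {"input": "RWKV is an RNN with transformer-level performance."}

    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    vector = resp.json()["data"][0]["embedding"]
    print(len(vector), vector[:5])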
