This guide assumes you have already installed Fish Audio locally or set up Docker deployment.
Download Weights
Before running inference, download the required model weights from Hugging Face.

OpenAudio S1-mini is the open-source distilled version (0.5B parameters) optimized for local deployment. The full S1 model (4B parameters) is available exclusively on Fish Audio cloud.
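For example, the S1-mini weights can be fetched with the Hugging Face CLI. The repository ID and target directory below are assumptions based on the project's usual naming; check the official instructions for the current values:

```bash
# Sketch: download the OpenAudio S1-mini weights with huggingface-cli.
# The repo ID and local directory are assumptions; verify them against
# the official documentation before running.
huggingface-cli download fishaudio/openaudio-s1-mini \
    --local-dir checkpoints/openaudio-s1-mini
```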
Command Line Inference
Command line inference provides maximum control and is ideal for scripting and batch processing.

Step 1: Extract VQ Tokens from Reference Audio
First, encode your reference audio to get voice characteristics:
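A minimal sketch of this step, assuming the script layout and checkpoint path of a typical fish-speech checkout (adjust both to match your installation):

```bash
# Sketch: encode a reference clip into VQ tokens. The script path and
# checkpoint path are assumptions; check the repository docs for your version.
python fish_speech/models/dac/inference.py \
    -i reference.wav \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```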
This step produces two files:
- `fake.npy` - VQ tokens representing voice characteristics
- `fake.wav` - Reconstructed audio for verification
Skip this step if you want random voice generation - the model can generate speech without reference audio.
Step 2: Generate Semantic Tokens from Text
Convert your text to semantic tokens using the language model:
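A sketch of this step using the flags documented below; the script path is an assumption based on a typical fish-speech checkout:

```bash
# Sketch: generate semantic tokens for the target text, conditioned on the
# reference transcript and the VQ tokens from Step 1. The script path is an
# assumption; the flags match those described below.
python fish_speech/models/text2semantic/inference.py \
    --text "The text you want the model to speak." \
    --prompt-text "Transcript of the reference audio." \
    --prompt-tokens fake.npy \
    --compile
```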
- `--text`: The text to synthesize
- `--prompt-text`: Transcription of the reference audio (for voice cloning)
- `--prompt-tokens`: Path to VQ tokens from Step 1 (for voice cloning)
- `--compile`: Enable kernel fusion for faster inference (~10x speedup on RTX 4090)

For random voice generation, omit the `--prompt-text` and `--prompt-tokens` parameters.

The command writes `codes_N.npy` files (where N starts from 0) containing the semantic tokens.
For GPUs that don't support bf16 (bfloat16), add the `--half` flag to use fp16 instead.

Step 3: Generate Audio from Semantic Tokens
Finally, convert semantic tokens to audio:
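A sketch of the decoding step, reusing the same (assumed) decoder script and checkpoint as in Step 1:

```bash
# Sketch: decode the semantic tokens from Step 2 back into a waveform.
# Script and checkpoint paths are assumptions; adjust to your install.
python fish_speech/models/dac/inference.py \
    -i codes_0.npy \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```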
Full Example
Here's a complete workflow for voice cloning:
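The sketch below chains the three steps end to end. As above, the script and checkpoint paths are assumptions based on a typical fish-speech checkout; the flags come from the steps documented earlier.

```bash
# Sketch of an end-to-end voice-cloning run. Adjust paths to your installation.

# 1. Extract VQ tokens from the reference audio (produces fake.npy / fake.wav)
python fish_speech/models/dac/inference.py \
    -i reference.wav \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth

# 2. Generate semantic tokens for the target text (produces codes_0.npy)
python fish_speech/models/text2semantic/inference.py \
    --text "Text to synthesize in the cloned voice." \
    --prompt-text "Transcript of the reference audio." \
    --prompt-tokens fake.npy \
    --compile

# 3. Decode the semantic tokens back to audio
python fish_speech/models/dac/inference.py \
    -i codes_0.npy \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```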
HTTP API Inference
The HTTP API provides a programmatic interface for integrations and production deployments.

Start API Server
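A minimal sketch of starting the server on the default port 8080. The entry point and `--listen` flag are assumptions; depending on your version, additional checkpoint arguments may be required:

```bash
# Sketch: start the HTTP API server (default port 8080).
# The module path (tools.api_server) is an assumption; check the repository.
python -m tools.api_server \
    --listen 0.0.0.0:8080
```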
Add the `--compile` flag to enable torch.compile optimization for faster inference.

Access API Documentation
Once the server is running, access the interactive API documentation at the server's address (default `http://localhost:8080`). The documentation covers:
- Text-to-speech synthesis
- Voice cloning with reference audio
- Batch processing
- Model information
Example API Request
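A minimal sketch of a plain text-to-speech request. The endpoint path and JSON field shown are assumptions; confirm the exact request schema in the interactive API documentation served by the running server:

```bash
# Sketch: synthesize speech from text and save the result as a WAV file.
# Endpoint path and payload fields are assumptions; see the interactive docs.
curl -X POST "http://localhost:8080/v1/tts" \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello from the Fish Audio API server."}' \
    --output output.wav
```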
WebUI Inference
The WebUI provides an intuitive interface for interactive testing and development.

Start WebUI
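A sketch of launching the WebUI; the module path is an assumption based on a typical fish-speech checkout:

```bash
# Sketch: launch the Gradio WebUI (listens on port 7860 by default).
# The entry point (tools.run_webui) is an assumption; check the repository.
python -m tools.run_webui
```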
Add the `--compile` flag for faster inference during interactive sessions.

Access WebUI
The WebUI starts on port 7860 by default. Access it at `http://localhost:7860`.

Configure with Environment Variables
Customize the WebUI using Gradio environment variables:
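For example, the standard Gradio variables below control the bind address, port, and public sharing (these are Gradio's own variables, not Fish Audio-specific; the WebUI entry point is assumed to be the same as in Start WebUI above):

```bash
# Standard Gradio environment variables respected by the WebUI process.
export GRADIO_SERVER_NAME="0.0.0.0"   # listen on all interfaces
export GRADIO_SERVER_PORT="7860"      # change the port
export GRADIO_SHARE="true"            # create a temporary public share link
python -m tools.run_webui
```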
Using Reference Audio Library
For a faster workflow, pre-save reference audio:
- Create a `references/` directory in the project root
- Create subdirectories named by voice ID: `references/<voice_id>/`
- Place two files in each subdirectory:
  - `sample.wav` - Reference audio file
  - `sample.lab` - Text transcription of the audio
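For example, registering a hypothetical voice called "alice" would look like this (the voice ID, source path, and transcript are placeholders):

```bash
# Sketch: add one voice to the reference library. "alice" is a hypothetical
# voice ID; sample.wav and sample.lab follow the layout described above.
mkdir -p references/alice
cp /path/to/alice_reference.wav references/alice/sample.wav
echo "Transcript of the reference audio." > references/alice/sample.lab
```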
GUI Inference
For users who prefer a native desktop application, a PyQt6-based GUI is available.

Download GUI Client
Download the latest release from the Fish Speech GUI repository. Supported platforms:
- Linux
- Windows
- macOS
Connect to API Server
The GUI client connects to a running API server (see HTTP API Inference above).
- Start the API server
- Launch the GUI client
- Configure the API endpoint (default: `http://localhost:8080`)
Docker Inference
If you're using Docker deployment, refer to the Docker Deployment guide for detailed instructions on:
- Running pre-built WebUI containers
- Running pre-built API server containers
- Customizing container configuration
- Volume mounts for models and references
Performance Optimization
Enable Compilation
Torch compilation provides ~10x speedup on compatible GPUs:
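The switch is the same `--compile` flag used throughout this guide, for example on the command-line generation step (script path assumed as in Step 2):

```bash
# Sketch: the first compiled run is slow (kernel compilation overhead);
# subsequent runs are much faster. Script path is an assumption.
python fish_speech/models/text2semantic/inference.py \
    --text "Warm-up run to trigger compilation." \
    --compile
```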
Compilation requires:
- CUDA-compatible GPU
- Triton library (not supported on Windows/macOS)
- First run will be slow due to compilation overhead
Use Mixed Precision
For GPUs without bf16 support, use fp16:
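For example, on the command-line generation step (script path assumed as in Step 2):

```bash
# Sketch: run the language model in fp16 on GPUs without bf16 support.
python fish_speech/models/text2semantic/inference.py \
    --text "Hello from fp16 inference." \
    --half
```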
Batch Processing
For multiple audio generations, use batch processing to amortize model loading overhead:
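One way to do this is to keep a single API server running (so the model is loaded once) and loop over your inputs. A sketch, assuming the same `/v1/tts` endpoint and JSON field as the API example above:

```bash
# Sketch: send one request per line of texts.txt to a running API server.
# Endpoint path and payload field are assumptions; lines must not contain
# unescaped double quotes for this naive JSON construction to work.
i=0
while IFS= read -r line; do
    curl -s -X POST "http://localhost:8080/v1/tts" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"${line}\"}" \
        --output "output_${i}.wav"
    i=$((i + 1))
done < texts.txt
```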
Emotion Control
OpenAudio S1 supports emotional markers for expressive speech synthesis:

Basic Emotions
Advanced Emotions
Tone Markers
Special Effects
Example Usage
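A sketch of marker placement: markers are written inline, in parentheses, inside the input text. The specific markers shown are illustrative and the script path is assumed as in Step 2; see the emotion lists above and the Emotion Control Best Practices guide for the supported set.

```bash
# Sketch: emotional markers appear inline in the text to synthesize.
# Markers and script path here are illustrative assumptions.
python fish_speech/models/text2semantic/inference.py \
    --text "(excited) Welcome to Fish Audio! (whispering) And this part is a secret."
```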
Emotion control is currently supported for English, Chinese, and Japanese. More languages coming soon!
Troubleshooting
Out of Memory Errors
If you encounter CUDA out of memory errors:
- Reduce input text length
- Use the `--half` flag for fp16 inference
- Close other GPU applications
- Use a smaller batch size
Slow Inference
To improve speed:
- Enable the `--compile` flag
- Verify the GPU is being used (check with `nvidia-smi`)
- Ensure your CUDA version matches the PyTorch installation
- Use fp16 instead of bf16 on older GPUs
Poor Audio Quality
For better quality:
- Use high-quality reference audio (clear, no background noise)
- Ensure reference text accurately matches reference audio
- Use 10-30 seconds of reference audio
- See Voice Cloning Best Practices
Model Loading Errors
If models fail to load:
- Verify model weights are downloaded completely
- Check checkpoint paths are correct
- Ensure sufficient disk space
- Re-download weights if corrupted
Next Steps
- Emotion Control Best Practices - Master expressive speech
- Voice Cloning Best Practices - Optimize voice cloning quality
- API Reference - Integrate with your applications
- Cloud API - Compare with managed service performance