This guide assumes you have already installed Fish Audio locally or set up Docker deployment.
Download Weights
Before running inference, download the required model weights from Hugging Face.

OpenAudio S1-mini is the open-source distilled version (0.5B parameters) optimized for local deployment. The full S1 model (4B parameters) is available exclusively on Fish Audio cloud.
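For example, the S1-mini weights can be fetched with the Hugging Face CLI. The repository ID and target directory below are assumptions based on the project's usual naming; check the official instructions for the current values:

```bash
# Sketch: download the OpenAudio S1-mini weights with huggingface-cli.
# The repo ID and local directory are assumptions; verify them against
# the official documentation before running.
huggingface-cli download fishaudio/openaudio-s1-mini \
    --local-dir checkpoints/openaudio-s1-mini
```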
Command Line Inference
Command line inference provides maximum control and is ideal for scripting and batch processing.

Step 1: Extract VQ Tokens from Reference Audio
First, encode your reference audio to get voice characteristics:
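A minimal sketch of this step, assuming the script layout and checkpoint path of a typical fish-speech checkout (adjust both to match your installation):

```bash
# Sketch: encode a reference clip into VQ tokens. The script path and
# checkpoint path are assumptions; check the repository docs for your version.
python fish_speech/models/dac/inference.py \
    -i reference.wav \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```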
This step produces two files:
- `fake.npy` - VQ tokens representing voice characteristics
- `fake.wav` - Reconstructed audio for verification
Skip this step if you want random voice generation - the model can generate speech without reference audio.
Step 2: Generate Semantic Tokens from Text
Convert your text to semantic tokens using the language model:
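A sketch of this step using the flags documented below; the script path is an assumption based on a typical fish-speech checkout:

```bash
# Sketch: generate semantic tokens for the target text, conditioned on the
# reference transcript and the VQ tokens from Step 1. The script path is an
# assumption; the flags match those described below.
python fish_speech/models/text2semantic/inference.py \
    --text "The text you want the model to speak." \
    --prompt-text "Transcript of the reference audio." \
    --prompt-tokens fake.npy \
    --compile
```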
- `--text`: The text to synthesize
- `--prompt-text`: Transcription of the reference audio (for voice cloning)
- `--prompt-tokens`: Path to VQ tokens from Step 1 (for voice cloning)
- `--compile`: Enable kernel fusion for faster inference (~10x speedup on RTX 4090)

For random voice generation, omit the `--prompt-text` and `--prompt-tokens` parameters.

The command writes `codes_N.npy` files (where N starts from 0) containing the semantic tokens.
For GPUs that don't support bf16 (bfloat16), add the `--half` flag to use fp16 instead.

Step 3: Generate Audio from Semantic Tokens
Finally, convert semantic tokens to audio:
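A sketch of the decoding step, reusing the same (assumed) decoder script and checkpoint as in Step 1:

```bash
# Sketch: decode the semantic tokens from Step 2 back into a waveform.
# Script and checkpoint paths are assumptions; adjust to your install.
python fish_speech/models/dac/inference.py \
    -i codes_0.npy \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```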
Full Example
Here's a complete workflow for voice cloning:
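The sketch below chains the three steps end to end. As above, the script and checkpoint paths are assumptions based on a typical fish-speech checkout; the flags come from the steps documented earlier.

```bash
# Sketch of an end-to-end voice-cloning run. Adjust paths to your installation.

# 1. Extract VQ tokens from the reference audio (produces fake.npy / fake.wav)
python fish_speech/models/dac/inference.py \
    -i reference.wav \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth

# 2. Generate semantic tokens for the target text (produces codes_0.npy)
python fish_speech/models/text2semantic/inference.py \
    --text "Text to synthesize in the cloned voice." \
    --prompt-text "Transcript of the reference audio." \
    --prompt-tokens fake.npy \
    --compile

# 3. Decode the semantic tokens back to audio
python fish_speech/models/dac/inference.py \
    -i codes_0.npy \
    --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
```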
HTTP API Inference
The HTTP API provides a programmatic interface for integrations and production deployments.

Start API Server
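A minimal sketch of starting the server on the default port 8080. The entry point and `--listen` flag are assumptions; depending on your version, additional checkpoint arguments may be required:

```bash
# Sketch: start the HTTP API server (default port 8080).
# The module path (tools.api_server) is an assumption; check the repository.
python -m tools.api_server \
    --listen 0.0.0.0:8080
```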
Add the `--compile` flag to enable torch.compile optimization for faster inference.

Access API Documentation
Once the server is running, access the interactive API documentation at the server's address (default `http://localhost:8080`). The documentation covers:
- Text-to-speech synthesis
- Voice cloning with reference audio
- Batch processing
- Model information
Example API Request
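A minimal sketch of a plain text-to-speech request. The endpoint path and JSON field shown are assumptions; confirm the exact request schema in the interactive API documentation served by the running server:

```bash
# Sketch: synthesize speech from text and save the result as a WAV file.
# Endpoint path and payload fields are assumptions; see the interactive docs.
curl -X POST "http://localhost:8080/v1/tts" \
    -H "Content-Type: application/json" \
    -d '{"text": "Hello from the Fish Audio API server."}' \
    --output output.wav
```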
WebUI Inference
The WebUI provides an intuitive interface for interactive testing and development.

Start WebUI
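A sketch of launching the WebUI; the module path is an assumption based on a typical fish-speech checkout:

```bash
# Sketch: launch the Gradio WebUI (listens on port 7860 by default).
# The entry point (tools.run_webui) is an assumption; check the repository.
python -m tools.run_webui
```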
Add the `--compile` flag for faster inference during interactive sessions.

Access WebUI
The WebUI starts on port 7860 by default. Access it at `http://localhost:7860`.

Configure with Environment Variables
Customize the WebUI using Gradio environment variables:
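For example, the standard Gradio variables below control the bind address, port, and public sharing (these are Gradio's own variables, not Fish Audio-specific; the WebUI entry point is assumed to be the same as in Start WebUI above):

```bash
# Standard Gradio environment variables respected by the WebUI process.
export GRADIO_SERVER_NAME="0.0.0.0"   # listen on all interfaces
export GRADIO_SERVER_PORT="7860"      # change the port
export GRADIO_SHARE="true"            # create a temporary public share link
python -m tools.run_webui
```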
Using Reference Audio Library
For a faster workflow, pre-save reference audio:
- Create a `references/` directory in the project root
- Create subdirectories named by voice ID: `references/<voice_id>/`
- Place two files in each subdirectory:
  - `sample.wav` - Reference audio file
  - `sample.lab` - Text transcription of the audio
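For example, registering a hypothetical voice called "alice" would look like this (the voice ID, source path, and transcript are placeholders):

```bash
# Sketch: add one voice to the reference library. "alice" is a hypothetical
# voice ID; sample.wav and sample.lab follow the layout described above.
mkdir -p references/alice
cp /path/to/alice_reference.wav references/alice/sample.wav
echo "Transcript of the reference audio." > references/alice/sample.lab
```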
GUI Inference
For users who prefer a native desktop application, a PyQt6-based GUI is available.

Download GUI Client
Download the latest release from the Fish Speech GUI repository. Supported platforms:
- Linux
- Windows
- macOS
Connect to API Server
The GUI client connects to a running API server (see HTTP API Inference above).
- Start the API server
- Launch the GUI client
- Configure the API endpoint (default: `http://localhost:8080`)
Docker Inference
If you're using Docker deployment, refer to the Docker Deployment guide for detailed instructions on:
- Running pre-built WebUI containers
- Running pre-built API server containers
- Customizing container configuration
- Volume mounts for models and references
Performance Optimization
Enable Compilation
Torch compilation provides ~10x speedup on compatible GPUs:
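The switch is the same `--compile` flag used throughout this guide, for example on the command-line generation step (script path assumed as in Step 2):

```bash
# Sketch: the first compiled run is slow (kernel compilation overhead);
# subsequent runs are much faster. Script path is an assumption.
python fish_speech/models/text2semantic/inference.py \
    --text "Warm-up run to trigger compilation." \
    --compile
```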
Compilation requires:
- CUDA-compatible GPU
- Triton library (not supported on Windows/macOS)
- First run will be slow due to compilation overhead
Use Mixed Precision
For GPUs without bf16 support, use fp16:
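For example, on the command-line generation step (script path assumed as in Step 2):

```bash
# Sketch: run the language model in fp16 on GPUs without bf16 support.
python fish_speech/models/text2semantic/inference.py \
    --text "Hello from fp16 inference." \
    --half
```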
Batch Processing
For multiple audio generations, use batch processing to amortize model loading overhead:
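One way to do this is to keep a single API server running (so the model is loaded once) and loop over your inputs. A sketch, assuming the same `/v1/tts` endpoint and JSON field as the API example above:

```bash
# Sketch: send one request per line of texts.txt to a running API server.
# Endpoint path and payload field are assumptions; lines must not contain
# unescaped double quotes for this naive JSON construction to work.
i=0
while IFS= read -r line; do
    curl -s -X POST "http://localhost:8080/v1/tts" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"${line}\"}" \
        --output "output_${i}.wav"
    i=$((i + 1))
done < texts.txt
```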
Emotion Control
OpenAudio S1 supports emotional markers for expressive speech synthesis:

Basic Emotions
Advanced Emotions
Tone Markers
Special Effects
Example Usage
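A sketch of marker placement: markers are written inline, in parentheses, inside the input text. The specific markers shown are illustrative and the script path is assumed as in Step 2; see the emotion lists above and the Emotion Control Best Practices guide for the supported set.

```bash
# Sketch: emotional markers appear inline in the text to synthesize.
# Markers and script path here are illustrative assumptions.
python fish_speech/models/text2semantic/inference.py \
    --text "(excited) Welcome to Fish Audio! (whispering) And this part is a secret."
```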
Emotion control is currently supported for English, Chinese, and Japanese. More languages coming soon!
Troubleshooting
Out of Memory Errors
If you encounter CUDA out of memory errors:
- Reduce input text length
- Use the `--half` flag for fp16 inference
- Close other GPU applications
- Use a smaller batch size
Slow Inference
To improve speed:
- Enable the `--compile` flag
- Verify the GPU is being used (check with `nvidia-smi`)
- Ensure your CUDA version matches the PyTorch installation
- Use fp16 instead of bf16 on older GPUs
Poor Audio Quality
For better quality:
- Use high-quality reference audio (clear, no background noise)
- Ensure reference text accurately matches reference audio
- Use 10-30 seconds of reference audio
- See Voice Cloning Best Practices
Model Loading Errors
If models fail to load:
- Verify model weights are downloaded completely
- Check checkpoint paths are correct
- Ensure sufficient disk space
- Re-download weights if corrupted
Next Steps
- Emotion Control Best Practices - Master expressive speech
- Voice Cloning Best Practices - Optimize voice cloning quality
- API Reference - Integrate with your applications
- Cloud API - Compare with managed service performance