For better speech quality and lower latency, upload reference audio via the create model endpoint. This method uses the Fish Audio SDK and provides a more streamlined approach.
Using the Fish Audio SDK
First, make sure you have the Fish Audio SDK installed. You can install it from GitHub or PyPI.

Example Usage
- Using a `reference_id`: This option uses a model that you've previously uploaded or chosen from the playground. Replace "MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND" with the actual model ID.
- Using reference audio: This option allows you to provide a reference audio file and its corresponding text directly in the request.
- Using a specific TTS model: You can specify which model to use with the `backend` parameter when calling the `tts` method. Available options include: `speech-1.5` (default), `speech-1.6`, and `s1`.
- Controlling speech generation:
  - `temperature` (default: 0.7): Controls randomness in the speech generation. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.1) make it more deterministic.
  - `top_p` (default: 0.7): Controls diversity via nucleus sampling. Lower values (e.g., 0.1) make the output more focused, while higher values (e.g., 1.0) allow more diversity.

Replace "your_api_key" with your actual API key, and adjust the file paths as needed.
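Putting these options together, a minimal sketch might look like the following. It assumes the SDK exposes `Session` and `TTSRequest` as in its README, and uses a hypothetical `FISH_API_KEY` environment variable to keep the live call inert by default; check the SDK documentation for exact names.

```python
import os

def tts_kwargs(text, reference_id=None, temperature=0.7, top_p=0.7):
    """Collect the generation parameters described above into keyword
    arguments for a TTS request."""
    kwargs = {"text": text, "temperature": temperature, "top_p": top_p}
    if reference_id is not None:
        kwargs["reference_id"] = reference_id
    return kwargs

# Live call, gated behind a hypothetical environment variable so this
# sketch can be run without the SDK or an API key installed.
if os.environ.get("FISH_API_KEY"):
    from fish_audio_sdk import Session, TTSRequest  # pip install fish-audio-sdk

    session = Session(os.environ["FISH_API_KEY"])
    request = TTSRequest(**tts_kwargs(
        "Hello from Fish Audio.",
        reference_id="MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND",
    ))
    with open("output.mp3", "wb") as f:
        # backend selects the TTS model: "speech-1.5" (default),
        # "speech-1.6", or "s1".
        for chunk in session.tts(request, backend="speech-1.5"):
            f.write(chunk)
```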
Raw WebSocket API Usage
The WebSocket API provides real-time, bidirectional communication for Text-to-Speech streaming. Here's how the protocol works:

WebSocket Protocol
- Connection Endpoint:
  - URL: wss://api.fish.audio/v1/tts/live
- Connection Headers:
  - `Authorization`: Bearer token authentication with your API key
  - `model` (optional): Specify which TTS model to use. Available options include: `speech-1.5` (default), `speech-1.6`, and `s1`
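As a sketch, the connection headers could be assembled like this in plain Python and passed to whichever WebSocket client you use:

```python
def connection_headers(api_key: str, model: str = "speech-1.5") -> dict:
    """Build the connection headers listed above."""
    return {
        # Bearer token authentication with your API key.
        "Authorization": f"Bearer {api_key}",
        # Optional: "speech-1.5" (default), "speech-1.6", or "s1".
        "model": model,
    }
```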
- Events:
  a. `start` - Initializes the TTS session.
  b. `text` - Sends text chunks. There is a text buffer on the server side; only when this buffer reaches a certain size will an `audio` event be generated. Sending a `stop` event will force the buffer to be flushed, return an `audio` event, and end the session.
  c. `audio` - Receives audio data (server response).
  d. `stop` - Ends the session.
  e. `flush` - Flushes the text buffer. This immediately generates and returns the audio; if the text is too short, it may lead to lower-quality audio.
  f. `finish` - Ends the session (server side).
  g. `log` - Logs messages from the server if `debug` is true.
- Message Format: All messages use MessagePack encoding
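To make the event flow concrete, here is a sketch of the client-side messages. The `event` names come from the list above, but the exact payload field names are assumptions; on the wire, each dict would be MessagePack-encoded (e.g. with `ormsgpack.packb`).

```python
def start_event(temperature=0.7, top_p=0.7, reference_id=None):
    # Initializes the TTS session with the generation settings.
    request = {"text": "", "temperature": temperature, "top_p": top_p}
    if reference_id is not None:
        request["reference_id"] = reference_id
    return {"event": "start", "request": request}

def text_event(text):
    # Text accumulates in a server-side buffer; the server emits "audio"
    # events once the buffer is large enough (or after a flush/stop).
    return {"event": "text", "text": text}

def flush_event():
    # Forces immediate synthesis of whatever text is buffered.
    return {"event": "flush"}

def stop_event():
    # Flushes the buffer, returns a final "audio" event, ends the session.
    return {"event": "stop"}
```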
Example Usage with OpenAI + MPV
The `websocket_example.py` script demonstrates:
- Real-time text streaming with WebSocket connection
- Handling audio chunks as they arrive
- Using MPV player for real-time audio playback
- Reference audio support for voice cloning
- Proper connection handling and cleanup
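A minimal version of such a script might look like this. It is a sketch assuming the third-party `websockets` and `ormsgpack` packages and a hypothetical `FISH_API_KEY` environment variable; payload fields beyond `event` are assumptions about the server's message shape.

```python
import asyncio
import os

def chunk_text(text, size=100):
    """Split outgoing text into pieces for successive "text" events."""
    return [text[i:i + size] for i in range(0, len(text), size)]

async def stream_tts(api_key, text):
    # pip install websockets ormsgpack; note that newer websockets
    # versions use additional_headers= instead of extra_headers=.
    import ormsgpack
    import websockets

    audio = b""
    async with websockets.connect(
        "wss://api.fish.audio/v1/tts/live",
        extra_headers={"Authorization": f"Bearer {api_key}"},
    ) as ws:
        await ws.send(ormsgpack.packb({"event": "start", "request": {"text": ""}}))
        for piece in chunk_text(text):
            await ws.send(ormsgpack.packb({"event": "text", "text": piece}))
        await ws.send(ormsgpack.packb({"event": "stop"}))
        async for message in ws:
            data = ormsgpack.unpackb(message)
            if data.get("event") == "audio":
                audio += data["audio"]
            elif data.get("event") == "finish":
                break
    return audio

# Gated behind a hypothetical environment variable so the sketch is
# inert without credentials.
if os.environ.get("FISH_API_KEY"):
    data = asyncio.run(stream_tts(os.environ["FISH_API_KEY"], "Hello, world!"))
    print(f"received {len(data)} bytes of audio")
```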
Install mpv for audio playback:

- Linux: `apt-get install mpv`
- macOS: `brew install mpv`
- Windows: Download from mpv.io
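For the playback side, a common pattern is to pipe audio bytes into an mpv subprocess as they arrive. A sketch (the flags follow mpv's usual stdin-streaming usage; the hypothetical `RUN_MPV_DEMO` gate just keeps the sketch inert by default):

```python
import os
import subprocess

def mpv_command():
    # "fd://0" tells mpv to read the audio stream from standard input;
    # --no-cache keeps latency low for real-time playback.
    return ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"]

def open_player():
    """Start mpv with a pipe we can write audio chunks into."""
    return subprocess.Popen(mpv_command(), stdin=subprocess.PIPE)

if os.environ.get("RUN_MPV_DEMO"):
    player = open_player()
    with open("output.mp3", "rb") as f:  # in practice, write chunks as they arrive
        player.stdin.write(f.read())
    player.stdin.close()
    player.wait()
```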