For better speech quality and lower latency, upload your reference audio ahead of time via the create model endpoint rather than sending it inline with each request. The examples below use the Fish Audio SDK, which offers a streamlined interface to the WebSocket API.
## Using the Fish Audio SDK
First, make sure you have the Fish Audio SDK installed. You can install it from GitHub or PyPI.
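For example, from PyPI (the package name `fish-audio-sdk` is assumed here):

```bash
pip install fish-audio-sdk
```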
### Example Usage
```python
from fish_audio_sdk import WebSocketSession, TTSRequest, ReferenceAudio

sync_websocket = WebSocketSession("your_api_key")

def stream():
    text = "Well, you know, machine learning is like, um, this really fascinating field that's basically teaching computers to, eh, figure things out on their own."
    for line in text.split():
        yield line + " "

tts_request = TTSRequest(
    text="",  # Initial text or empty string
    reference_id="MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND",
    temperature=0.7,  # Controls randomness in speech generation
    top_p=0.7,  # Controls diversity via nucleus sampling
)

# Or you can use reference audio directly instead of a reference_id:
# tts_request = TTSRequest(
#     text="",
#     references=[
#         ReferenceAudio(
#             audio=open("lengyue.wav", "rb").read(),
#             text="Text in reference AUDIO",
#         )
#     ],
#     temperature=0.7,
#     top_p=0.7,
# )

with open("output.mp3", "wb") as f:
    for chunk in sync_websocket.tts(
        tts_request,
        stream(),  # Stream the text
        backend="speech-1.6",  # Specify which TTS model to use
    ):
        f.write(chunk)
```
This example demonstrates how to use the Text-to-Speech WebSocket API:

- **Using a `reference_id`**: uses a model that you've previously uploaded or chosen from the playground. Replace `"MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND"` with the actual model ID (a sketch of creating a model programmatically follows this list).
- **Using reference audio**: provide a reference audio file and its corresponding text directly in the request.
- **Using a specific TTS model**: specify which model to use with the `backend` parameter when calling the `tts` method. Available options include:
  - `"speech-1.5"` (default)
  - `"speech-1.6"`
  - `"s1"`
  - `"s1-mini"`
- **Controlling speech generation**:
  - `temperature` (default: 0.7): controls randomness in speech generation. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.1) make it more deterministic.
  - `top_p` (default: 0.7): controls diversity via nucleus sampling. Lower values (e.g., 0.1) make the output more focused, while higher values (e.g., 1.0) allow more diversity.

Make sure to replace `"your_api_key"` with your actual API key, and adjust the file paths as needed.
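The `reference_id` above identifies a model created beforehand. If you want to create one programmatically rather than through the playground, here is a minimal sketch; the `Session.create_model` helper, its keyword arguments, and the local `voice.wav` sample are assumptions to adapt to your setup:

```python
from fish_audio_sdk import Session

session = Session("your_api_key")

# Assumption: create_model uploads reference audio and returns a model
# whose id can be reused as reference_id in later TTS requests.
with open("voice.wav", "rb") as voice_file:
    model = session.create_model(
        title="my-voice",  # hypothetical model title
        description="Reference voice for TTS",
        voices=[voice_file.read()],
    )

print(model.id)  # pass this as reference_id in TTSRequest
```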
## Raw WebSocket API Usage
The WebSocket API provides real-time, bidirectional communication for Text-to-Speech streaming. Here’s how the protocol works:
### WebSocket Protocol
**Connection Endpoint**

- URL: `wss://api.fish.audio/v1/tts/live`

**Connection Headers**

- `Authorization`: Bearer token authentication with your API key
- `model` (optional): specify which TTS model to use. Available options include `speech-1.5` (default), `speech-1.6`, `s1`, and `s1-mini`.
**Events**

a. `start` - Initializes the TTS session:

```json
{
  "event": "start",
  "request": {
    "text": "",          // Initial empty text
    "latency": "normal", // "normal" or "balanced"
    "format": "opus",    // "opus", "mp3", or "wav"
    "temperature": 0.7,  // Controls randomness in speech generation
    "top_p": 0.7,        // Controls diversity via nucleus sampling
    // Optional: use prosody to control speech speed and volume
    "prosody": {
      "speed": 1.0,      // Speech speed (0.5-2.0)
      "volume": 0        // Volume adjustment in dB
    },
    "reference_id": "MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND",
    // Optional: use reference audio instead of reference_id
    "references": [{
      "audio": "<binary_audio_data>",
      "text": "Reference text for the audio"
    }]
  }
}
```
b. `text` - Sends text chunks:

```json
{
  "event": "text",
  "text": "Hello world " // Keep the trailing space, since consecutive chunks are concatenated
}
```

There is a text buffer on the server side; an `audio` event is generated only once that buffer reaches a certain size. Sending a `stop` event forces the buffer to flush, returns a final `audio` event, and ends the session.
c. `audio` - Receives audio data (server response):

```json
{
  "event": "audio",
  "audio": "<binary_audio_data>",
  "time": 3.012 // Time taken in milliseconds
}
```
d. `stop` - Ends the session. The client sends `{"event": "stop"}` (as in the example below); any buffered text is flushed as a final `audio` event before the session ends.

e. `flush` - Flushes the text buffer. This immediately generates and returns audio for whatever text is buffered; if that text is too short, the resulting audio may be of lower quality.

f. `finish` - Ends the session (server side):

```json
{
  "event": "finish",
  "reason": "stop" // or "error"
}
```
g. `log` - Log messages from the server, sent when `debug` is set to `true` in the start event:

```json
{
  "event": "log",
  "message": "Log message from server"
}
```
**Message Format**

All messages use MessagePack encoding.
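To make the wire format concrete, here is a minimal sketch of packing and unpacking these events with `ormsgpack` (the MessagePack library used in the example below); the event shapes mirror the protocol described above:

```python
import ormsgpack

# Client -> server: each event is a dict packed as MessagePack.
start = ormsgpack.packb({
    "event": "start",
    "request": {"text": "", "latency": "normal", "format": "opus"},
})
text = ormsgpack.packb({"event": "text", "text": "Hello world "})
stop = ormsgpack.packb({"event": "stop"})  # flushes the buffer and ends the session

# Server -> client: unpack the bytes and dispatch on the "event" field.
def handle_message(raw: bytes):
    msg = ormsgpack.unpackb(raw)
    if msg["event"] == "audio":
        return msg["audio"]  # binary audio chunk to play or save
    if msg["event"] == "finish":
        return None  # session ended with reason "stop" or "error"
```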
### Example Usage with OpenAI + MPV
```python
import asyncio
import shutil
import subprocess

import ormsgpack
import websockets
from openai import AsyncOpenAI

aclient = AsyncOpenAI()


def is_installed(lib_name):
    """Check if a system command is available."""
    return shutil.which(lib_name) is not None


async def stream_audio(audio_stream):
    """
    Stream audio data using the mpv player.

    Args:
        audio_stream: Async iterator yielding audio chunks.
    """
    if not is_installed("mpv"):
        raise ValueError(
            "mpv not found, necessary to stream audio. "
            "Install instructions: https://mpv.io/installation/"
        )

    # Launch mpv reading from stdin for real-time audio playback
    mpv_process = subprocess.Popen(
        ["mpv", "--no-cache", "--no-terminal", "--", "fd://0"],
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    async for chunk in audio_stream:
        if chunk:
            mpv_process.stdin.write(chunk)
            mpv_process.stdin.flush()

    if mpv_process.stdin:
        mpv_process.stdin.close()
    mpv_process.wait()


async def text_to_speech_stream(text_iterator):
    """
    Stream text to speech over the WebSocket API.

    Args:
        text_iterator: Async iterator yielding text chunks.
    """
    uri = "wss://api.fish.audio/v1/tts/live"
    async with websockets.connect(
        uri, extra_headers={"Authorization": "Bearer YOUR_API_KEY"}
    ) as websocket:
        # Send the initial configuration (start event)
        await websocket.send(
            ormsgpack.packb(
                {
                    "event": "start",
                    "request": {
                        "text": "",
                        "latency": "normal",
                        "format": "opus",
                        "reference_id": "MODEL_ID_UPLOADED_OR_CHOSEN_FROM_PLAYGROUND",
                    },
                    "debug": True,  # Enable server-side log events
                }
            )
        )

        # Yield audio chunks as they arrive from the server
        async def listen():
            while True:
                try:
                    message = await websocket.recv()
                    data = ormsgpack.unpackb(message)
                    if data["event"] == "audio":
                        yield data["audio"]
                except websockets.exceptions.ConnectionClosed:
                    break

        # Start the audio playback task
        listen_task = asyncio.create_task(stream_audio(listen()))

        # Stream text chunks as they become available
        async for text in text_iterator:
            if text:
                await websocket.send(ormsgpack.packb({"event": "text", "text": text}))

        # Flush the remaining text and end the session
        await websocket.send(ormsgpack.packb({"event": "stop"}))
        await listen_task


async def chat_completion(query):
    """Retrieve text from OpenAI and pass it to the text-to-speech function."""
    response = await aclient.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
        max_completion_tokens=512,
        temperature=1,
        stream=True,
    )

    async def text_iterator():
        async for chunk in response:
            delta = chunk.choices[0].delta
            yield delta.content

    await text_to_speech_stream(text_iterator())


# Main execution
if __name__ == "__main__":
    user_query = "Hello, tell me a very short story, including filler words, don't use * or #."
    asyncio.run(chat_completion(user_query))
```
This example demonstrates:

- Real-time text streaming over a WebSocket connection
- Handling audio chunks as they arrive
- Using the mpv player for real-time audio playback
- Reference audio support for voice cloning
- Proper connection handling and cleanup

Make sure to install the required dependencies:

```bash
pip install websockets ormsgpack openai
```
You will also need the mpv player for audio playback (this example requires it):

- Linux: `apt-get install mpv`
- macOS: `brew install mpv`
- Windows: download from https://mpv.io/