
Prerequisites

Sign up for a free Fish Audio account to get started with our API.
  1. Go to fish.audio/auth/signup
  2. Fill in your details to create an account, then complete the steps to verify it
  3. Log in to your account and navigate to the API section
Once you have an account, you’ll need an API key to authenticate your requests.
  1. Log in to your Fish Audio Dashboard
  2. Navigate to the API Keys section
  3. Click “Create New Key”, give it a descriptive name, and set an expiration if desired
  4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
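A common way to keep the key out of your code is an environment variable loaded at startup. A minimal sketch; the variable name `FISH_AUDIO_API_KEY` and the `api_key` parameter are assumptions here, so check the SDK reference for the exact names your client expects:

```python
import os

def load_api_key(var: str = "FISH_AUDIO_API_KEY") -> str:
    """Fetch the API key from the environment instead of hard-coding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    return key

# Hypothetical usage -- verify the parameter name in the SDK reference:
# client = FishAudio(api_key=load_api_key())
```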

Understanding TTS Methods

The SDK provides three methods for text-to-speech generation, each optimized for different use cases:
| Method | Returns | Best For |
| --- | --- | --- |
| convert() | Complete audio bytes | Most use cases: simple, gets the full audio at once |
| stream() | AudioStream | Chunk-by-chunk processing, memory-efficient transfer |
| stream_websocket() | Audio bytes iterator | Real-time streaming with dynamic text (LLM responses, conversational AI) |
Use convert() for most use cases. Use stream() for memory efficiency when handling large files. Use stream_websocket() when text is generated dynamically in real-time.

Basic Usage

Generate speech from text with a single function call:
from fishaudio import FishAudio
from fishaudio.utils import save, play

client = FishAudio()

# Generate speech (returns bytes)
audio = client.tts.convert(text="Hello, welcome to Fish Audio!")

# Play or save the audio
play(audio)
save(audio, "output.mp3")

Using Voice Models

Specify a voice model for consistent voice characteristics:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Use a specific voice
audio = client.tts.convert(
    text="This uses a specific voice model",
    reference_id="bf322df2096a46f18c579d0baa36f41d"  # Adrian
)
play(audio)

Finding Voice Models

Get voice model IDs from the Fish Audio website or programmatically:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# List available voices
voices = client.voices.list(language="en", tags="male")

for voice in voices.items:
    print(f"{voice.title}: {voice.id}")

# Use a voice from the list
audio = client.tts.convert(
    text="Generated with discovered voice",
    reference_id=voices.items[0].id
)
play(audio)
Learn more in the Voice Cloning guide.

Emotions and Expressions

Add emotional expressions to make speech more natural:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

text = """
(happy) I'm excited to announce this!
(sad) Unfortunately, it didn't work out.
(angry) This is so frustrating!
(calm) Let me explain the details.
"""

audio = client.tts.convert(
    text=text,
    reference_id="933563129e564b19a115bedd57b7406a"  # Sarah
)
play(audio)
See the Emotion Reference for all available emotions and Fine-grained Control for advanced usage.

Audio Formats

Choose the output format based on your needs:
from fishaudio import FishAudio

client = FishAudio()

# MP3 (default) - good balance of quality and size
audio = client.tts.convert(
    text="MP3 format",
    format="mp3"
)

# WAV - uncompressed, highest quality
audio = client.tts.convert(
    text="WAV format",
    format="wav"
)

# PCM - raw audio data for streaming
audio = client.tts.convert(
    text="PCM format",
    format="pcm"
)

Prosody Control

Adjust speech speed and volume for natural-sounding output:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Simple speed adjustment
audio = client.tts.convert(
    text="This will be spoken faster",
    speed=1.5  # 1.5x speed (range: 0.5-2.0)
)
play(audio)
For combined speed and volume control, use TTSConfig with Prosody:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play

client = FishAudio()

# Configure prosody with TTSConfig
audio = client.tts.convert(
    text="Adjusted speech with custom speed and volume",
    config=TTSConfig(
        prosody=Prosody(
            speed=1.2,   # 20% faster
            volume=5     # Louder (range: -20 to 20)
        )
    )
)
play(audio)

Reusable TTS Configuration

Create a configuration once and reuse it across multiple generations:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody

client = FishAudio()

# Define config once
my_config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    format="wav",
    latency="balanced"
)

# Reuse across multiple generations
audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config)
audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)

Chunk-by-Chunk Streaming

Use stream() for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments):
from fishaudio import FishAudio

client = FishAudio()

# Collect all chunks efficiently
audio_stream = client.tts.stream(text="Long text here")
audio = audio_stream.collect()  # Returns complete audio as bytes
For streaming to a file or over the network without buffering the whole audio in memory:
from fishaudio import FishAudio

client = FishAudio()

# Stream directly to file (memory efficient for large audio)
audio_stream = client.tts.stream(text="Very long text...")
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)  # Write each chunk as it arrives
Use stream() when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use stream_websocket() instead.

Real-time WebSocket Streaming

For real-time applications where text is generated dynamically, use stream_websocket(). This is perfect for LLM integrations, conversational AI, and live captions:

Basic WebSocket Streaming

from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Stream dynamically generated text
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming text!"

audio_stream = client.tts.stream_websocket(
    text_chunks(),
    latency="balanced"
)

play(audio_stream)

Understanding FlushEvent

The FlushEvent forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn’t reached the optimal chunk size.
from fishaudio import FishAudio
from fishaudio.types import FlushEvent

client = FishAudio()

# Use FlushEvent to force immediate generation
def text_with_flush():
    yield "This is the first sentence. "
    yield "This is the second sentence. "
    yield FlushEvent()  # Force audio generation NOW
    yield "This starts a new segment. "
    yield "And continues here."
    yield FlushEvent()  # Force final generation

audio_stream = client.tts.stream_websocket(text_with_flush())

# Process each audio chunk as it arrives
for chunk in audio_stream:
    print(f"Received audio chunk: {len(chunk)} bytes")
Without FlushEvent, the engine automatically generates audio when the buffer reaches an optimal size. Use FlushEvent to control exactly when audio should be generated, which can reduce perceived latency in interactive applications.
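The buffering behavior described above can be illustrated with a small stand-alone simulation. This is plain Python with no SDK calls; the `FlushEvent` class and the threshold value are local stand-ins for illustration, not the engine's actual internals:

```python
class FlushEvent:
    """Local stand-in for fishaudio.types.FlushEvent in this simulation."""

def simulate_buffering(events, threshold=40):
    """Accumulate text; emit a segment at the size threshold or on a flush."""
    buffer, segments = "", []
    for event in events:
        if isinstance(event, FlushEvent):
            if buffer:  # flush whatever has accumulated, even if small
                segments.append(buffer)
                buffer = ""
        else:
            buffer += event
            if len(buffer) >= threshold:  # automatic generation point
                segments.append(buffer)
                buffer = ""
    if buffer:  # final partial buffer
        segments.append(buffer)
    return segments

segments = simulate_buffering(
    ["Short text. ", FlushEvent(), "More ", "text here."]
)
print(segments)  # ['Short text. ', 'More text here.']
```

Without the flush, the first short sentence would have stayed in the buffer until enough text accumulated; the flush forces it out as its own segment.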

TextEvent vs Plain Strings

You can yield plain strings (recommended for simplicity) or use TextEvent for explicit control:
from fishaudio import FishAudio
from fishaudio.types import TextEvent

client = FishAudio()

# Both approaches are equivalent
def text_as_strings():
    yield "Hello, "
    yield "world!"

def text_as_events():
    yield TextEvent(text="Hello, ")
    yield TextEvent(text="world!")

# Use whichever style you prefer
audio1 = client.tts.stream_websocket(text_as_strings())
audio2 = client.tts.stream_websocket(text_as_events())

LLM Integration Pattern

WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Simulate streaming LLM response
def llm_stream():
    """Simulates text chunks from an LLM"""
    tokens = [
        "The ", "weather ", "today ", "is ", "sunny ",
        "with ", "clear ", "skies. ", "Perfect ",
        "for ", "outdoor ", "activities!"
    ]
    for token in tokens:
        yield token

# Stream to speech in real-time
audio_stream = client.tts.stream_websocket(llm_stream())
play(audio_stream)
The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don’t need to manually batch tokens unless you want to force generation at specific points using FlushEvent.
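Real LLM streaming APIs often emit deltas that can be `None` or empty alongside actual text. A small adapter, shown as a hedged pattern sketch rather than part of the SDK, keeps the generator you pass to `stream_websocket()` clean:

```python
def to_text_stream(deltas):
    """Adapter: turn an LLM delta stream (which may contain None or empty
    entries, as some streaming APIs emit) into plain text chunks."""
    for delta in deltas:
        if delta:  # skip None and empty-string deltas
            yield delta

# With a real LLM stream, you would pass the adapted generator along:
# audio_stream = client.tts.stream_websocket(to_text_stream(llm_deltas))
```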
Learn more in the WebSocket Streaming guide.

Advanced Configuration

Comprehensive TTSConfig with all available parameters:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody

client = FishAudio()

# All TTSConfig parameters
config = TTSConfig(
    # Audio output settings
    format="mp3",
    sample_rate=44100,         # Custom sample rate (optional)
    mp3_bitrate=192,           # 64, 128, or 192 kbps
    opus_bitrate=64,           # For Opus format: -1000, 24, 32, 48, or 64
    normalize=True,            # Normalize audio levels

    # Generation settings
    chunk_length=200,          # Characters per chunk (100-300)
    latency="balanced",        # "normal" or "balanced"

    # Voice/style settings
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    prosody=Prosody(speed=1.1, volume=0),
    # references=[ReferenceAudio(...)]  # For instant cloning

    # Model parameters
    temperature=0.7,           # Randomness (0.0-1.0)
    top_p=0.7                  # Token selection (0.0-1.0)
)

# Use with any client
audio = client.tts.convert(text="Your text here", config=config)
TTSConfig works the same for both sync and async clients. See TTSConfig API Reference for detailed documentation on each parameter and their defaults.

Error Handling

Handle common TTS errors gracefully:
from fishaudio import FishAudio
from fishaudio.exceptions import (
    RateLimitError,
    ValidationError,
    NotFoundError,
    FishAudioError
)
import time

client = FishAudio()

try:
    audio = client.tts.convert(
        text="Your text here",
        reference_id="voice_id"
    )
except RateLimitError:
    print("Rate limit exceeded. Please wait before retrying.")
    time.sleep(60)  # Wait before retry
except NotFoundError:
    print("Voice model not found. Check the reference_id.")
except ValidationError as e:
    print(f"Invalid request: {e}")
except FishAudioError as e:
    print(f"API error: {e}")
Common exceptions include RateLimitError, ValidationError, NotFoundError, and FishAudioError.

Best Practices

For long texts, adjust chunk_length in TTSConfig:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig

client = FishAudio()

audio = client.tts.convert(
    text="Very long text...",
    config=TTSConfig(chunk_length=250)  # Larger chunks for efficiency
)
If you generate the same speech repeatedly, cache the results:
import os
from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio()

def get_or_generate_speech(text, cache_file):
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return f.read()

    audio = client.tts.convert(text=text)
    save(audio, cache_file)
    return audio
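If cache files should be derived from the text itself rather than named by hand, a content hash gives stable, collision-resistant filenames. This hashing convention is our own, not part of the SDK; `generate` stands in for any function returning audio bytes, such as a `client.tts.convert` wrapper:

```python
import hashlib
import os

def cache_path_for(text: str, cache_dir: str = "tts_cache") -> str:
    """Derive a stable cache filename from the text content."""
    os.makedirs(cache_dir, exist_ok=True)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"{digest}.mp3")

def cached_speech(text, generate, cache_dir="tts_cache"):
    """Call generate(text) only on a cache miss."""
    path = cache_path_for(text, cache_dir)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    audio = generate(text)
    with open(path, "wb") as f:
        f.write(audio)
    return audio
```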
Implement exponential backoff for rate limits:
from fishaudio import FishAudio
from fishaudio.exceptions import RateLimitError
import time

client = FishAudio()

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.tts.convert(text=text)
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
Balance speed vs quality based on your use case:
from fishaudio import FishAudio

client = FishAudio()

# For real-time applications
audio = client.tts.convert(text="Fast response", latency="balanced")

# For highest quality
audio = client.tts.convert(text="Best quality", latency="normal")

Next Steps