
Prerequisites

Sign up for a free Fish Audio account to get started with our API.
  1. Go to fish.audio/auth/signup
  2. Fill in your details to create an account, then complete the steps to verify it
  3. Log in to your account and navigate to the API section
Once you have an account, you’ll need an API key to authenticate your requests.
  1. Log in to your Fish Audio Dashboard
  2. Navigate to the API Keys section
  3. Click “Create New Key”, give it a descriptive name, and set an expiration if desired
  4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
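A common way to keep the key out of your code is an environment variable loaded at startup. A minimal sketch; the variable name `FISH_AUDIO_API_KEY` and the `api_key` parameter are assumptions here, so check the SDK reference for the exact names your client expects:

```python
import os

def load_api_key(var: str = "FISH_AUDIO_API_KEY") -> str:
    """Fetch the API key from the environment instead of hard-coding it."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable first")
    return key

# Hypothetical usage -- verify the parameter name in the SDK reference:
# client = FishAudio(api_key=load_api_key())
```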

Understanding TTS Methods

The SDK provides three methods for text-to-speech generation, each optimized for different use cases:
| Method | Returns | Best For |
| --- | --- | --- |
| convert() | Complete audio bytes | Most use cases: simple, gets the full audio at once |
| stream() | AudioStream | Chunk-by-chunk processing, memory-efficient transfer |
| stream_websocket() | Audio bytes iterator | Real-time streaming with dynamic text (LLM responses, conversational AI) |
Use convert() for most use cases. Use stream() for memory efficiency when handling large files. Use stream_websocket() when text is generated dynamically in real-time.

Basic Usage

Generate speech from text with a single function call:
from fishaudio import FishAudio
from fishaudio.utils import save, play

client = FishAudio()

# Generate speech (returns bytes)
audio = client.tts.convert(text="Hello, welcome to Fish Audio!")

# Play or save the audio
play(audio)
save(audio, "output.mp3")

Using Voice Models

Specify a voice model for consistent voice characteristics:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Use a specific voice
audio = client.tts.convert(
    text="This uses a specific voice model",
    reference_id="bf322df2096a46f18c579d0baa36f41d"  # Adrian
)
play(audio)

Finding Voice Models

Get voice model IDs from the Fish Audio website or programmatically:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# List available voices
voices = client.voices.list(language="en", tags="male")

for voice in voices.items:
    print(f"{voice.title}: {voice.id}")

# Use a voice from the list
audio = client.tts.convert(
    text="Generated with discovered voice",
    reference_id=voices.items[0].id
)
play(audio)
Learn more in the Voice Cloning guide.

Emotions and Expressions

Add emotional expressions to make speech more natural:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

text = """
(happy) I'm excited to announce this!
(sad) Unfortunately, it didn't work out.
(angry) This is so frustrating!
(calm) Let me explain the details.
"""

audio = client.tts.convert(
    text=text,
    reference_id="933563129e564b19a115bedd57b7406a"  # Sarah
)
play(audio)
See the Emotion Reference for all available emotions and Fine-grained Control for advanced usage.

Audio Formats

Choose the output format based on your needs:
from fishaudio import FishAudio

client = FishAudio()

# MP3 (default) - good balance of quality and size
audio = client.tts.convert(
    text="MP3 format",
    format="mp3"
)

# WAV - uncompressed, highest quality
audio = client.tts.convert(
    text="WAV format",
    format="wav"
)

# PCM - raw audio data for streaming
audio = client.tts.convert(
    text="PCM format",
    format="pcm"
)

Prosody Control

Adjust speech speed and volume for natural-sounding output:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Simple speed adjustment
audio = client.tts.convert(
    text="This will be spoken faster",
    speed=1.5  # 1.5x speed (range: 0.5-2.0)
)
play(audio)
For combined speed and volume control, use TTSConfig with Prosody:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play

client = FishAudio()

# Configure prosody with TTSConfig
audio = client.tts.convert(
    text="Adjusted speech with custom speed and volume",
    config=TTSConfig(
        prosody=Prosody(
            speed=1.2,   # 20% faster
            volume=5     # Louder (range: -20 to 20)
        )
    )
)
play(audio)

Reusable TTS Configuration

Create a configuration once and reuse it across multiple generations:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody

client = FishAudio()

# Define config once
my_config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    format="wav",
    latency="balanced"
)

# Reuse across multiple generations
audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config)
audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)

Chunk-by-Chunk Streaming

Use stream() for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments):
from fishaudio import FishAudio

client = FishAudio()

# Collect all chunks efficiently
audio_stream = client.tts.stream(text="Long text here")
audio = audio_stream.collect()  # Returns complete audio as bytes
For streaming to a file or over the network without buffering the whole audio in memory:
from fishaudio import FishAudio

client = FishAudio()

# Stream directly to file (memory efficient for large audio)
audio_stream = client.tts.stream(text="Very long text...")
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)  # Write each chunk as it arrives
Use stream() when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use stream_websocket() instead.

Real-time WebSocket Streaming

For real-time applications where text is generated dynamically, use stream_websocket(). This is perfect for LLM integrations, conversational AI, and live captions:

Basic WebSocket Streaming

from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Stream dynamically generated text
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming text!"

audio_stream = client.tts.stream_websocket(
    text_chunks(),
    latency="balanced"
)

play(audio_stream)

Understanding FlushEvent

The FlushEvent forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn’t reached the optimal chunk size.
from fishaudio import FishAudio
from fishaudio.types import FlushEvent

client = FishAudio()

# Use FlushEvent to force immediate generation
def text_with_flush():
    yield "This is the first sentence. "
    yield "This is the second sentence. "
    yield FlushEvent()  # Force audio generation NOW
    yield "This starts a new segment. "
    yield "And continues here."
    yield FlushEvent()  # Force final generation

audio_stream = client.tts.stream_websocket(text_with_flush())

# Process each audio chunk as it arrives
for chunk in audio_stream:
    print(f"Received audio chunk: {len(chunk)} bytes")
Without FlushEvent, the engine automatically generates audio when the buffer reaches an optimal size. Use FlushEvent to control exactly when audio should be generated, which can reduce perceived latency in interactive applications.
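The buffering behavior described above can be illustrated with a small stand-alone simulation. This is plain Python with no SDK calls; the `FlushEvent` class and the threshold value are local stand-ins for illustration, not the engine's actual internals:

```python
class FlushEvent:
    """Local stand-in for fishaudio.types.FlushEvent in this simulation."""

def simulate_buffering(events, threshold=40):
    """Accumulate text; emit a segment at the size threshold or on a flush."""
    buffer, segments = "", []
    for event in events:
        if isinstance(event, FlushEvent):
            if buffer:  # flush whatever has accumulated, even if small
                segments.append(buffer)
                buffer = ""
        else:
            buffer += event
            if len(buffer) >= threshold:  # automatic generation point
                segments.append(buffer)
                buffer = ""
    if buffer:  # final partial buffer
        segments.append(buffer)
    return segments

segments = simulate_buffering(
    ["Short text. ", FlushEvent(), "More ", "text here."]
)
print(segments)  # ['Short text. ', 'More text here.']
```

Without the flush, the first short sentence would have stayed in the buffer until enough text accumulated; the flush forces it out as its own segment.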

TextEvent vs Plain Strings

You can yield plain strings (recommended for simplicity) or use TextEvent for explicit control:
from fishaudio import FishAudio
from fishaudio.types import TextEvent

client = FishAudio()

# Both approaches are equivalent
def text_as_strings():
    yield "Hello, "
    yield "world!"

def text_as_events():
    yield TextEvent(text="Hello, ")
    yield TextEvent(text="world!")

# Use whichever style you prefer
audio1 = client.tts.stream_websocket(text_as_strings())
audio2 = client.tts.stream_websocket(text_as_events())

LLM Integration Pattern

WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio:
from fishaudio import FishAudio
from fishaudio.utils import play

client = FishAudio()

# Simulate streaming LLM response
def llm_stream():
    """Simulates text chunks from an LLM"""
    tokens = [
        "The ", "weather ", "today ", "is ", "sunny ",
        "with ", "clear ", "skies. ", "Perfect ",
        "for ", "outdoor ", "activities!"
    ]
    for token in tokens:
        yield token

# Stream to speech in real-time
audio_stream = client.tts.stream_websocket(llm_stream())
play(audio_stream)
The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don’t need to manually batch tokens unless you want to force generation at specific points using FlushEvent.
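Real LLM streaming APIs often emit deltas that can be `None` or empty alongside actual text. A small adapter, shown as a hedged pattern sketch rather than part of the SDK, keeps the generator you pass to `stream_websocket()` clean:

```python
def to_text_stream(deltas):
    """Adapter: turn an LLM delta stream (which may contain None or empty
    entries, as some streaming APIs emit) into plain text chunks."""
    for delta in deltas:
        if delta:  # skip None and empty-string deltas
            yield delta

# With a real LLM stream, you would pass the adapted generator along:
# audio_stream = client.tts.stream_websocket(to_text_stream(llm_deltas))
```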
Learn more in the WebSocket Streaming guide.

Advanced Configuration

Comprehensive TTSConfig with all available parameters:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody

client = FishAudio()

# All TTSConfig parameters
config = TTSConfig(
    # Audio output settings
    format="mp3",
    sample_rate=44100,         # Custom sample rate (optional)
    mp3_bitrate=192,           # 64, 128, or 192 kbps
    opus_bitrate=64,           # For Opus format: -1000, 24, 32, 48, or 64
    normalize=True,            # Normalize audio levels

    # Generation settings
    chunk_length=200,          # Characters per chunk (100-300)
    latency="balanced",        # "normal" or "balanced"

    # Voice/style settings
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    prosody=Prosody(speed=1.1, volume=0),
    # references=[ReferenceAudio(...)]  # For instant cloning

    # Model parameters
    temperature=0.7,           # Randomness (0.0-1.0)
    top_p=0.7                  # Token selection (0.0-1.0)
)

# Use with any client
audio = client.tts.convert(text="Your text here", config=config)
TTSConfig works the same for both sync and async clients. See TTSConfig API Reference for detailed documentation on each parameter and their defaults.

Error Handling

Handle common TTS errors gracefully:
from fishaudio import FishAudio
from fishaudio.exceptions import (
    RateLimitError,
    ValidationError,
    NotFoundError,
    FishAudioError
)
import time

client = FishAudio()

try:
    audio = client.tts.convert(
        text="Your text here",
        reference_id="voice_id"
    )
except RateLimitError:
    print("Rate limit exceeded. Please wait before retrying.")
    time.sleep(60)  # Wait before retry
except NotFoundError:
    print("Voice model not found. Check the reference_id.")
except ValidationError as e:
    print(f"Invalid request: {e}")
except FishAudioError as e:
    print(f"API error: {e}")
Common exceptions include RateLimitError, ValidationError, NotFoundError, and FishAudioError.

Best Practices

For long texts, adjust chunk_length in TTSConfig:
from fishaudio import FishAudio
from fishaudio.types import TTSConfig

client = FishAudio()

audio = client.tts.convert(
    text="Very long text...",
    config=TTSConfig(chunk_length=250)  # Larger chunks for efficiency
)
If you generate the same speech repeatedly, cache the results:
import os
from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio()

def get_or_generate_speech(text, cache_file):
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return f.read()

    audio = client.tts.convert(text=text)
    save(audio, cache_file)
    return audio
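If cache files should be derived from the text itself rather than named by hand, a content hash gives stable, collision-resistant filenames. This hashing convention is our own, not part of the SDK; `generate` stands in for any function returning audio bytes, such as a `client.tts.convert` wrapper:

```python
import hashlib
import os

def cache_path_for(text: str, cache_dir: str = "tts_cache") -> str:
    """Derive a stable cache filename from the text content."""
    os.makedirs(cache_dir, exist_ok=True)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return os.path.join(cache_dir, f"{digest}.mp3")

def cached_speech(text, generate, cache_dir="tts_cache"):
    """Call generate(text) only on a cache miss."""
    path = cache_path_for(text, cache_dir)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    audio = generate(text)
    with open(path, "wb") as f:
        f.write(audio)
    return audio
```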
Implement exponential backoff for rate limits:
from fishaudio import FishAudio
from fishaudio.exceptions import RateLimitError
import time

client = FishAudio()

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.tts.convert(text=text)
        except RateLimitError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
Balance speed vs quality based on your use case:
from fishaudio import FishAudio

client = FishAudio()

# For real-time applications
audio = client.tts.convert(text="Fast response", latency="balanced")

# For highest quality
audio = client.tts.convert(text="Best quality", latency="normal")

Next Steps