Text to Speech Guide

Overview

Transform any text into natural, expressive speech using Fish Audio’s advanced TTS models. Choose from pre-made voices or use your own cloned voices.

Quick Start

Web Interface

The easiest way to generate speech:

1. Visit Playground: Go to fish.audio and log in.
2. Enter Your Text: Type or paste the text you want to convert.
3. Choose a Voice: Select from available voices or use your own.
4. Generate: Click “Generate” and download your audio.

Using the SDK

Installation

Install the Fish Audio SDK:

pip install fish-audio-sdk

Basic Usage

Generate speech with just a few lines of code:

from fish_audio_sdk import Session, TTSRequest

# Initialize session
session = Session("your_api_key")

# Generate speech
with open("output.mp3", "wb") as f:
    for chunk in session.tts(TTSRequest(
        text="Hello, world!",
        reference_id="your_voice_model_id"
    )):
        f.write(chunk)

Voice Options

Using Pre-made Voices

Browse and select voices from the playground:

# Use a voice from the playground.
# Iterate over the result, as in Basic Usage, to receive the audio chunks.
session.tts(TTSRequest(
    text="Welcome to Fish Audio!",
    reference_id="7f92f8afb8ec43bf81429cc1c9199cb1"
))

Using Your Cloned Voice

Use voices you’ve created:

# Use your own cloned voice
session.tts(TTSRequest(
    text="This is my custom voice speaking",
    reference_id="your_model_id"
))

Using Reference Audio

Provide reference audio directly:

from fish_audio_sdk import ReferenceAudio

# Read the sample with a context manager so the file handle is closed
with open("voice_sample.wav", "rb") as sample:
    sample_bytes = sample.read()

# Use reference audio on-the-fly
session.tts(TTSRequest(
    text="Hello from reference audio",
    references=[
        ReferenceAudio(
            audio=sample_bytes,
            text="Sample text from the audio"
        )
    ]
))

Model Selection

Choose the right model for your needs:

Model	Best For	Quality	Speed
s1	Latest features	Excellent	Fast
speech-1.6	Stable production	Very Good	Fast
speech-1.5	Legacy support	Good	Fastest

Specify a model in your request:

# Using the latest S1 model
session.tts(
    TTSRequest(text="Hello world"),
    backend="s1"
)

Advanced Options

Audio Formats

Choose your output format:

session.tts(TTSRequest(
    text="Your text here",
    format="mp3",  # Options: "mp3", "wav", "pcm"
    mp3_bitrate=128  # For MP3: 64, 128, or 192
))

Chunk Length

Control text processing chunks:

session.tts(TTSRequest(
    text="Long text content...",
    chunk_length=200  # 100-300 characters per chunk
))

Latency Mode

Optimize for speed or quality:

session.tts(TTSRequest(
    text="Quick response needed",
    latency="balanced"  # "normal" or "balanced"
))

Balanced mode reduces latency to ~300ms but may slightly decrease stability.

Direct API Usage

For direct API calls without the SDK:

import httpx
import ormsgpack

# Prepare request
request_data = {
    "text": "Hello, world!",
    "reference_id": "your_model_id",
    "format": "mp3"
}

# Make API call
with httpx.Client() as client:
    response = client.post(
        "https://api.fish.audio/v1/tts",
        content=ormsgpack.packb(request_data),
        headers={
            "authorization": "Bearer YOUR_API_KEY",
            "content-type": "application/msgpack",
            "model": "s1"
        }
    )
    # Fail early on HTTP errors instead of saving an error body as audio
    response.raise_for_status()

    # Save audio
    with open("output.mp3", "wb") as f:
        f.write(response.content)

Streaming Audio

Stream audio for real-time applications:

# Stream audio chunks
with open("stream_output.mp3", "wb") as f:
    for chunk in session.tts(TTSRequest(
        text="Streaming this text in real-time",
        reference_id="model_id"
    )):
        f.write(chunk)
        # Process chunk immediately for real-time playback
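
The streaming loop above can be wrapped in a small helper. This is a minimal sketch that assumes only what the examples in this guide show: that `session.tts(...)` yields `bytes` chunks. The names `save_stream` and `on_chunk` are hypothetical and not part of the SDK.

```python
from typing import Callable, Iterable, Optional


def save_stream(
    chunks: Iterable[bytes],
    path: str,
    on_chunk: Optional[Callable[[bytes], None]] = None,
) -> int:
    """Write audio chunks to `path` as they arrive and return the byte count."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
            if on_chunk is not None:
                # Hypothetical hook: hand the chunk to a player or a socket
                on_chunk(chunk)
    return total
```

It could be called as `save_stream(session.tts(TTSRequest(text="...", reference_id="model_id")), "stream_output.mp3")`. Passing the chunk iterable in, rather than the session, keeps the helper easy to test with fake data.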

Adding Emotions

Make your speech more expressive:

# Add emotion markers to your text
emotional_text = """
(excited) I just won the lottery!
(sad) But then I lost the ticket.
(laughing) Just kidding, I found it!
"""

session.tts(TTSRequest(
    text=emotional_text,
    reference_id="model_id"
))

Available emotions:

Basic: (happy), (sad), (angry), (excited), (calm)
Tones: (shouting), (whispering), (soft tone)
Effects: (laughing), (sighing), (crying)

For more precise control over pronunciation and additional paralanguage features like pauses and breathing, see Fine-grained Control.
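
A typo in a marker is easy to miss, so it can help to check text before sending it. The sketch below compares parenthesized tokens against the markers listed above. `KNOWN_MARKERS` and `unknown_markers` are hypothetical names, the list only mirrors this guide (the service may accept more), and ordinary parenthetical remarks in the text will also be reported.

```python
import re
from typing import List

# Markers listed in this guide; treat the set as illustrative, not exhaustive.
KNOWN_MARKERS = {
    "happy", "sad", "angry", "excited", "calm",
    "shouting", "whispering", "soft tone",
    "laughing", "sighing", "crying",
}

# Matches the text inside one pair of parentheses, without nesting.
MARKER_PATTERN = re.compile(r"\(([^()]+)\)")


def unknown_markers(text: str) -> List[str]:
    """Return parenthesized tokens in `text` that are not in KNOWN_MARKERS."""
    found = MARKER_PATTERN.findall(text)
    return [m for m in found if m.strip().lower() not in KNOWN_MARKERS]
```

For example, `unknown_markers("(excited) Hi! (sleepy) Bye.")` returns `["sleepy"]`.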

Best Practices

Text Preparation

Do:

Use proper punctuation for natural pauses
Add emotion markers for expression
Break long texts into paragraphs
Use consistent formatting

Don’t:

Use ALL CAPS (unless shouting)
Mix multiple languages randomly
Include special characters unnecessarily
Forget punctuation
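
Breaking long text into paragraphs can be done before any request is made. The following is a minimal sketch using only the standard library: it splits on blank lines, then packs whole sentences into pieces of roughly `max_chars` characters. `split_text` and the default of 1000 are assumptions chosen for illustration, not limits documented by the service, and a single sentence longer than `max_chars` is kept whole.

```python
import re
from typing import List


def split_text(text: str, max_chars: int = 1000) -> List[str]:
    """Split text at paragraph, then sentence boundaries, aiming for max_chars per piece."""
    pieces: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                pieces.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            pieces.append(current)
    return pieces
```

Each returned piece can then be sent as its own request, for example with the loop shown under Batch Processing.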

Performance Tips

Batch Processing: Process multiple texts efficiently
Cache Models: Store frequently used model IDs
Optimize Chunk Size: Use 200 characters for best balance
Handle Errors: Implement retry logic for network issues

Quality Optimization

For best results:

Use high-quality reference audio for cloning
Choose appropriate emotion markers
Test different latency modes
Monitor API rate limits

Troubleshooting

Common Issues

No audio output:

Check API key validity
Verify model ID exists
Ensure proper audio format

Poor quality:

Use better reference audio
Try normal latency mode
Check text formatting

Slow generation:

Use balanced latency mode
Reduce chunk length
Check network connection

Code Examples

Batch Processing

texts = [
    "First announcement",
    "Second announcement",
    "Third announcement"
]

for i, text in enumerate(texts):
    with open(f"output_{i}.mp3", "wb") as f:
        for chunk in session.tts(TTSRequest(
            text=text,
            reference_id="model_id"
        )):
            f.write(chunk)

Error Handling

import time

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            audio_data = b""
            for chunk in session.tts(TTSRequest(
                text=text,
                reference_id="model_id"
            )):
                audio_data += chunk
            return audio_data
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
            else:
                raise  # Re-raise the last error with its original traceback

API Reference

Request Parameters

Parameter	Type	Description	Default
text	string	Text to convert	Required
reference_id	string	Model/voice ID	None
format	string	Audio format	"mp3"
chunk_length	integer	Characters per chunk	200
normalize	boolean	Normalize text	true
latency	string	Speed vs quality	"normal"

Response

Returns the audio data as a binary stream in the requested format.

Get Support

Need help with text-to-speech?

API Documentation: Developer Docs
Discord Community: Join our Discord
Email Support: support@fish.audio
