Overview

Transform any text into natural, expressive speech using Fish Audio’s advanced TTS models. Choose from pre-made voices or use your own cloned voices.

Quick Start

Web Interface

The easiest way to generate speech:
  1. Visit Playground: Go to fish.audio and log in
  2. Enter Your Text: Type or paste the text you want to convert
  3. Choose a Voice: Select from available voices or use your own
  4. Generate: Click “Generate” and download your audio

Using the SDK

Installation

Install the Fish Audio SDK:
pip install fish-audio-sdk

Basic Usage

Generate speech with just a few lines of code:
from fish_audio_sdk import Session, TTSRequest

# Initialize session
session = Session("your_api_key")

# Generate speech
with open("output.mp3", "wb") as f:
    for chunk in session.tts(TTSRequest(
        text="Hello, world!",
        reference_id="your_voice_model_id"
    )):
        f.write(chunk)
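
A literal key in source code is easy to commit by accident. The following is a minimal sketch of reading it from the environment instead. The variable name FISH_AUDIO_API_KEY and the helper load_api_key are assumptions made for illustration; they are not names defined by the SDK.

```python
import os


def load_api_key(env_var: str = "FISH_AUDIO_API_KEY") -> str:
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical convention, not something the SDK
    reads on its own.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your Fish Audio API key")
    return key
```

The result can be passed wherever the example uses a literal key, for instance Session(load_api_key()).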

Voice Options

Using Pre-made Voices

Browse and select voices from the playground:
# Use a voice from the playground
session.tts(TTSRequest(
    text="Welcome to Fish Audio!",
    reference_id="7f92f8afb8ec43bf81429cc1c9199cb1"
))

Using Your Cloned Voice

Use voices you’ve created:
# Use your own cloned voice
session.tts(TTSRequest(
    text="This is my custom voice speaking",
    reference_id="your_model_id"
))

Using Reference Audio

Provide reference audio directly:
from fish_audio_sdk import ReferenceAudio

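# Use reference audio on-the-fly
# Read the sample inside a with-block so the file handle is closed promptly
with open("voice_sample.wav", "rb") as audio_file:
    reference_bytes = audio_file.read()

session.tts(TTSRequest(
    text="Hello from reference audio",
    references=[
        ReferenceAudio(
            audio=reference_bytes,
            text="Sample text from the audio"
        )
    ]
))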

Model Selection

Choose the right model for your needs:
Model        Best For            Quality     Speed
s1           Latest features     Excellent   Fast
speech-1.6   Stable production   Very Good   Fast
speech-1.5   Legacy support      Good        Fastest
Specify a model in your request:
# Using the latest S1 model
session.tts(
    TTSRequest(text="Hello world"),
    backend="s1"
)
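
If the model choice depends on the deployment, it may help to keep the mapping in one place. This is only a sketch: the model names come from the table above, while the priority labels and the helper pick_backend are hypothetical.

```python
# Model names copied from the table above; the priority labels are hypothetical.
BACKEND_BY_PRIORITY = {
    "latest": "s1",
    "stable": "speech-1.6",
    "legacy": "speech-1.5",
}


def pick_backend(priority: str = "stable") -> str:
    """Map a deployment priority to one of the model names in the table."""
    try:
        return BACKEND_BY_PRIORITY[priority]
    except KeyError:
        options = ", ".join(sorted(BACKEND_BY_PRIORITY))
        raise ValueError(
            f"Unknown priority {priority!r}; expected one of: {options}"
        ) from None
```

The returned string can be passed as the backend argument shown above, for example backend=pick_backend("latest").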

Advanced Options

Audio Formats

Choose your output format:
session.tts(TTSRequest(
    text="Your text here",
    format="mp3",  # Options: "mp3", "wav", "pcm"
    mp3_bitrate=128  # For MP3: 64, 128, or 192
))
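
A typo in the format or bitrate is cheaper to catch locally than after a network round trip. The sketch below checks the values against the options listed in the comments above. The helper audio_options is hypothetical, and it assumes that mp3_bitrate is only meaningful for MP3 output.

```python
ALLOWED_FORMATS = ("mp3", "wav", "pcm")  # from the comment in the example above
ALLOWED_MP3_BITRATES = (64, 128, 192)    # likewise


def audio_options(fmt: str = "mp3", mp3_bitrate: int = 128) -> dict:
    """Return keyword arguments for TTSRequest after a local sanity check."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    options = {"format": fmt}
    if fmt == "mp3":
        if mp3_bitrate not in ALLOWED_MP3_BITRATES:
            raise ValueError(
                f"mp3_bitrate must be one of {ALLOWED_MP3_BITRATES}, got {mp3_bitrate}"
            )
        # Assumption: the bitrate only applies to MP3, so it is omitted otherwise.
        options["mp3_bitrate"] = mp3_bitrate
    return options
```

Usage would look like TTSRequest(text="Your text here", **audio_options("wav")).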

Chunk Length

Control how many characters of text are processed per chunk:
session.tts(TTSRequest(
    text="Long text content...",
    chunk_length=200  # 100-300 characters per chunk
))

Latency Mode

Optimize for speed or quality:
session.tts(TTSRequest(
    text="Quick response needed",
    latency="balanced"  # "normal" or "balanced"
))
Balanced mode reduces latency to ~300ms but may slightly decrease stability.
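
To compare the two modes on your own text, one option is to time how long the first audio chunk takes to arrive. The helper below is a sketch, not part of the SDK. It accepts any iterable of bytes, so it can wrap the iterator returned by session.tts. The timer starts when iteration begins.

```python
import time
from typing import Iterable, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Consume a chunk iterator; return (seconds until first chunk, all audio)."""
    start = time.perf_counter()
    first_chunk_at = None
    audio = bytearray()
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        audio.extend(chunk)
    if first_chunk_at is None:
        raise ValueError("the iterator produced no audio chunks")
    return first_chunk_at, bytes(audio)
```

Calling it once with latency="normal" and once with latency="balanced" on the same text gives two numbers to compare. A single run is noisy, so several repetitions are advisable.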

Direct API Usage

For direct API calls without the SDK:
import httpx
import ormsgpack

# Prepare request
request_data = {
    "text": "Hello, world!",
    "reference_id": "your_model_id",
    "format": "mp3"
}

# Make API call
with httpx.Client() as client:
    response = client.post(
        "https://api.fish.audio/v1/tts",
        content=ormsgpack.packb(request_data),
        headers={
            "authorization": "Bearer YOUR_API_KEY",
            "content-type": "application/msgpack",
            "model": "s1"
        }
    )
    # Fail loudly on HTTP errors instead of saving an error body as audio
    response.raise_for_status()

    # Save audio
    with open("output.mp3", "wb") as f:
        f.write(response.content)

Streaming Audio

Stream audio for real-time applications:
# Stream audio chunks
with open("stream_output.mp3", "wb") as f:
    for chunk in session.tts(TTSRequest(
        text="Streaming this text in real-time",
        reference_id="model_id"
    )):
        f.write(chunk)
        # Process chunk immediately for real-time playback
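
The "process chunk immediately" step can be separated from the file handling with a callback. The function below is a sketch under that assumption. stream_to_file and on_chunk are hypothetical names, and the callback stands in for whatever feeds your audio player.

```python
from typing import Callable, Iterable, Optional


def stream_to_file(
    chunks: Iterable[bytes],
    path: str,
    on_chunk: Optional[Callable[[bytes], None]] = None,
) -> int:
    """Write chunks to path as they arrive; return the number of bytes written."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            if on_chunk is not None:
                on_chunk(chunk)  # e.g. hand the chunk to a playback buffer
            total += len(chunk)
    return total
```

It can wrap the iterator from the example above: stream_to_file(session.tts(TTSRequest(...)), "stream_output.mp3", on_chunk=player_feed), where player_feed is your own function.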

Adding Emotions

Make your speech more expressive:
# Add emotion markers to your text
emotional_text = """
(excited) I just won the lottery!
(sad) But then I lost the ticket.
(laughing) Just kidding, I found it!
"""

session.tts(TTSRequest(
    text=emotional_text,
    reference_id="model_id"
))
Available emotions:
  • Basic: (happy), (sad), (angry), (excited), (calm)
  • Tones: (shouting), (whispering), (soft tone)
  • Effects: (laughing), (sighing), (crying)
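
When markers are added programmatically, a small helper can guard against misspelled names. This sketch only knows the markers listed above; the service may accept others. The function with_emotion is a hypothetical name.

```python
# Markers copied from the list above.
KNOWN_MARKERS = {
    "happy", "sad", "angry", "excited", "calm",
    "shouting", "whispering", "soft tone",
    "laughing", "sighing", "crying",
}


def with_emotion(text: str, marker: str) -> str:
    """Prefix text with an emotion marker such as "(excited)"."""
    name = marker.strip().strip("()").lower()
    if name not in KNOWN_MARKERS:
        raise ValueError(f"Unknown emotion marker: {marker!r}")
    return f"({name}) {text.strip()}"
```

For example, with_emotion("I just won the lottery!", "excited") produces the first line of the sample text above.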

Best Practices

Text Preparation

Do:
  • Use proper punctuation for natural pauses
  • Add emotion markers for expression
  • Break long texts into paragraphs
  • Use consistent formatting
Don’t:
  • Use ALL CAPS (unless shouting)
  • Mix multiple languages randomly
  • Include special characters unnecessarily
  • Forget punctuation
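
Some of these rules can be checked automatically before a request is sent. The function below is a rough heuristic sketch, not a feature of the SDK. It flags paragraphs without end punctuation and words of four or more capital letters, so it will also flag acronyms such as "NASA".

```python
import re


def lint_tts_text(text: str) -> list:
    """Return human-readable warnings for text that breaks the tips above."""
    warnings = []
    # Drop emotion markers like "(excited)" so they are not linted as prose.
    prose = re.sub(r"\([a-z ]+\)", "", text)
    for paragraph in filter(None, (p.strip() for p in prose.split("\n"))):
        # Ignore closing quotes and brackets when looking for end punctuation.
        core = paragraph.rstrip("\"')”’")
        if not core or core[-1] not in ".!?…":
            warnings.append(f"missing end punctuation: {paragraph[:30]!r}")
        shouted = re.findall(r"\b[A-Z]{4,}\b", paragraph)
        if shouted:
            warnings.append(f"all-caps words: {', '.join(shouted)}")
    return warnings
```

An empty list means no rule was triggered. It does not guarantee good output.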

Performance Tips

  1. Batch Processing: Process multiple texts efficiently
  2. Cache Models: Store frequently used model IDs
  3. Optimize Chunk Size: Use 200 characters for best balance
  4. Handle Errors: Implement retry logic for network issues
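
For tip 2, one simple way to store frequently used model IDs is a named lookup table, so the rest of the code refers to voices by a readable name. The names in VOICES and the helper voice_id are hypothetical.

```python
# Friendly names mapped to model IDs. The first ID is the sample ID used
# earlier on this page; the second is a placeholder to replace with your own.
VOICES = {
    "playground_sample": "7f92f8afb8ec43bf81429cc1c9199cb1",
    "my_clone": "your_model_id",
}


def voice_id(name: str) -> str:
    """Look up a stored model ID by its friendly name."""
    try:
        return VOICES[name]
    except KeyError:
        raise KeyError(f"No voice named {name!r}; known: {sorted(VOICES)}") from None
```

A request then reads TTSRequest(text="...", reference_id=voice_id("my_clone")).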

Quality Optimization

For best results:
  • Use high-quality reference audio for cloning
  • Choose appropriate emotion markers
  • Test different latency modes
  • Monitor API rate limits

Troubleshooting

Common Issues

No audio output:
  • Check API key validity
  • Verify model ID exists
  • Ensure proper audio format
Poor quality:
  • Use better reference audio
  • Try normal latency mode
  • Check text formatting
Slow generation:
  • Use balanced latency mode
  • Reduce chunk length
  • Check network connection
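
When a saved file will not play, a quick first check is whether the bytes look like audio at all or are, for example, an error message. The sketch below inspects standard file signatures only. It assumes the output starts with the usual MP3 or WAV header, and it cannot validate raw PCM, which has no header.

```python
def looks_like_audio(data: bytes, fmt: str = "mp3") -> bool:
    """Cheap check that data starts like a file of the given format."""
    if not data:
        return False
    if fmt == "wav":
        # WAV files start with a RIFF header whose form type is "WAVE".
        return data[:4] == b"RIFF" and data[8:12] == b"WAVE"
    if fmt == "mp3":
        # MP3 data starts with an ID3 tag or an 11-bit frame sync pattern.
        if data[:3] == b"ID3":
            return True
        return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    # Raw PCM has no header, so non-empty is all that can be checked here.
    return True
```

If the check fails, printing the first few hundred bytes of the output often reveals the actual problem.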

Code Examples

Batch Processing

texts = [
    "First announcement",
    "Second announcement",
    "Third announcement"
]

for i, text in enumerate(texts):
    with open(f"output_{i}.mp3", "wb") as f:
        for chunk in session.tts(TTSRequest(
            text=text,
            reference_id="model_id"
        )):
            f.write(chunk)
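
The loop above handles one text at a time. If throughput matters, the requests could be issued from a small thread pool, as sketched below. The synthesize callable is a hypothetical wrapper around your TTS call. This page does not state whether a single Session may be shared across threads or what rate limits apply, so treat both as open questions and keep max_workers low.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def synthesize_all(
    texts: List[str],
    synthesize: Callable[[str], bytes],
    max_workers: int = 3,
) -> Dict[int, bytes]:
    """Run synthesize over texts in parallel; map input index to audio bytes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with texts.
        results = pool.map(synthesize, texts)
        return dict(enumerate(results))
```

A wrapper for the SDK might be lambda t: b"".join(session.tts(TTSRequest(text=t, reference_id="model_id"))). Any exception raised by a worker is re-raised when its result is collected.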

Error Handling

import time

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            audio_data = b""
            for chunk in session.tts(TTSRequest(
                text=text,
                reference_id="model_id"
            )):
                audio_data += chunk
            return audio_data
        except Exception:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...
            else:
                raise  # Out of retries: re-raise with the original traceback

API Reference

Request Parameters

Parameter      Type      Description            Default
text           string    Text to convert        Required
reference_id   string    Model/voice ID         None
format         string    Audio format           "mp3"
chunk_length   integer   Characters per chunk   200
normalize      boolean   Normalize text         true
latency        string    Speed vs quality       "normal"

Response

Returns the audio data as a binary stream in the requested format.

Get Support

Need help with text-to-speech?