
Overview

Transform any text into natural, expressive speech using Fish Audio’s advanced TTS models. Choose from pre-made voices or use your own cloned voices. Discover the world’s best cloned voice models on our Discovery page.

Quick Start

Web Interface

The easiest way to generate speech:
  1. Visit Playground: Go to fish.audio and log in
  2. Enter Your Text: Type or paste the text you want to convert
  3. Choose a Voice: Select from available voices or use your own
  4. Generate: Click “Generate” and download your audio

Using the SDK

  1. Install the SDK:

pip install fish-audio-sdk

  2. Basic Usage: Generate speech with just a few lines of code:
from fishaudio import FishAudio
from fishaudio.utils import save

# Initialize client
client = FishAudio(api_key="your_api_key_here")

# Generate speech
audio = client.tts.convert(
    text="Hello, world!",
    reference_id="your_voice_model_id"
)
save(audio, "output.mp3")

print("✓ Audio saved to output.mp3")

Voice Options

Using Pre-made Voices

Browse and select voices from the playground:
# Use a voice from the playground
audio = client.tts.convert(
    text="Welcome to Fish Audio!",
    reference_id="7f92f8afb8ec43bf81429cc1c9199cb1"
)

Using Your Cloned Voice

Use voices you’ve created:
# Use your own cloned voice
audio = client.tts.convert(
    text="This is my custom voice speaking",
    reference_id="your_model_id"
)

Using Reference Audio

Provide reference audio directly:
from fishaudio.types import ReferenceAudio

# Use reference audio on-the-fly
with open("voice_sample.wav", "rb") as f:
    audio = client.tts.convert(
        text="Hello from reference audio",
        references=[
            ReferenceAudio(
                audio=f.read(),
                text="Sample text from the audio"
            )
        ]
    )

Model Selection

Choose the right model for your needs:
Model     Best For          Quality     Speed
s1        Prototyping       Excellent   Fast
s2-pro    Latest features   Excellent   Fastest
The latest model is used by default; to pin a specific model in a raw request, set the model header (see Direct API Usage):
# Using the latest model (default)
audio = client.tts.convert(text="Hello world")

Advanced Options

Audio Formats

Choose your output format:
audio = client.tts.convert(
    text="Your text here",
    format="mp3",  # Options: "mp3", "wav", "pcm", "opus"
    mp3_bitrate=128  # For MP3: 64, 128, or 192
)

Chunk Length

Control text processing chunks:
audio = client.tts.convert(
    text="Long text content...",
    chunk_length=200  # 100-300 characters per chunk
)

Latency Mode

Optimize for speed or quality:
audio = client.tts.convert(
    text="Quick response needed",
    latency="balanced"  # "normal" or "balanced"
)
Balanced mode reduces latency to ~300ms but may slightly decrease stability.

Direct API Usage

For direct API calls without the SDK:
import httpx
import ormsgpack

# Prepare request
request_data = {
    "text": "Hello, world!",
    "reference_id": "your_model_id",
    "format": "mp3"
}

# Make API call
with httpx.Client() as client:
    response = client.post(
        "https://api.fish.audio/v1/tts",
        content=ormsgpack.packb(request_data),
        headers={
            "authorization": "Bearer YOUR_API_KEY",
            "content-type": "application/msgpack",
            "model": "s2-pro"
        }
    )
    response.raise_for_status()  # Fail fast on HTTP errors

    # Save audio
    with open("output.mp3", "wb") as f:
        f.write(response.content)

Streaming Audio

Stream audio for real-time applications:
# Stream audio chunks
audio_stream = client.tts.stream(
    text="Streaming this text in real-time",
    reference_id="model_id"
)

with open("stream_output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)
        # Process chunk immediately for real-time playback

Streaming with Timestamps

Use the Text to Speech Stream with Timestamps API when you need generated audio and alignment data in the same stream. This endpoint returns Server-Sent Events where each event includes an audio_base64 chunk and, when available, the latest cumulative alignment snapshot for a chunk_seq. Clients should concatenate audio chunks in arrival order and replace stored alignment snapshots by chunk_seq.
Timestamped streaming is best for karaoke-style highlighting, synchronized captions, phrase progress indicators, and timeline editing. For this endpoint, prefer opus over mp3 when possible because Opus provides cleaner streaming boundaries for alignment.
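A minimal sketch of the client-side bookkeeping described above, assuming each event’s data payload is JSON carrying the audio_base64, chunk_seq, and alignment fields (confirm the exact event schema against the API reference):

```python
import base64
import json

def merge_stream_events(events):
    """Concatenate audio chunks in arrival order and keep only the
    latest cumulative alignment snapshot per chunk_seq."""
    audio = bytearray()
    alignments = {}  # chunk_seq -> latest alignment snapshot
    for raw in events:
        event = json.loads(raw)
        audio.extend(base64.b64decode(event["audio_base64"]))
        if "alignment" in event:
            # Later snapshots for the same chunk_seq replace earlier ones
            alignments[event["chunk_seq"]] = event["alignment"]
    return bytes(audio), alignments
```

The audio bytes can be written to a file or fed to a player as they accumulate, while the alignment dict drives caption or highlight updates.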

Adding Emotions

The (parenthesis) syntax below applies to the S1 model. S2 uses [bracket] syntax with natural language descriptions and is not limited to a fixed set of tags. See the Models Overview for details.
Make your speech more expressive:
# Add emotion markers to your text
emotional_text = """
(excited) I just won the lottery!
(sad) But then I lost the ticket.
(laughing) Just kidding, I found it!
"""

audio = client.tts.convert(
    text=emotional_text,
    reference_id="model_id"
)
Available emotions:
  • Basic: (happy), (sad), (angry), (excited), (calm)
  • Tones: (shouting), (whispering), (soft tone)
  • Effects: (laughing), (sighing), (crying)
For more precise control over pronunciation and additional paralanguage features like pauses and breathing, see Fine-grained Control.

Best Practices

Text Preparation

Do:
  • Use proper punctuation for natural pauses
  • Add emotion markers for expression
  • Break long texts into paragraphs
  • Use consistent formatting
Don’t:
  • Use ALL CAPS (unless shouting)
  • Mix multiple languages randomly
  • Include special characters unnecessarily
  • Forget punctuation

Performance Tips

  1. Batch Processing: Process multiple texts efficiently
  2. Cache Models: Store frequently used model IDs
  3. Optimize Chunk Size: Use 200 characters for best balance
  4. Handle Errors: Implement retry logic for network issues

Quality Optimization

For best results:
  • Use high-quality reference audio for cloning
  • Choose appropriate emotion markers
  • Test different latency modes
  • Monitor API rate limits

Troubleshooting

Common Issues

No audio output:
  • Check API key validity
  • Verify model ID exists
  • Ensure proper audio format
Poor quality:
  • Use better reference audio
  • Try normal latency mode
  • Check text formatting
Slow generation:
  • Use balanced latency mode
  • Reduce chunk length
  • Check network connection

Code Examples

Batch Processing

from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio(api_key="your_api_key_here")

texts = [
    "First announcement",
    "Second announcement",
    "Third announcement"
]

for i, text in enumerate(texts):
    audio = client.tts.convert(
        text=text,
        reference_id="model_id"
    )
    save(audio, f"output_{i}.mp3")

Error Handling

import time
from fishaudio.exceptions import FishAudioError

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            audio = client.tts.convert(
                text=text,
                reference_id="model_id"
            )
            return audio
        except FishAudioError as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise  # Re-raise with the original traceback after exhausting retries

API Reference

Request Parameters

Parameter      Type      Description            Default
text           string    Text to convert        Required
reference_id   string    Model/voice ID         None
format         string    Audio format           "mp3"
chunk_length   integer   Characters per chunk   200
normalize      boolean   Normalize text         true
latency        string    Speed vs quality       "normal"
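The defaults above can be collected into a small helper that assembles a payload for the direct API; build_tts_request is a hypothetical name used for illustration, not part of the SDK:

```python
def build_tts_request(text, reference_id=None, **overrides):
    """Assemble a TTS request payload, filling in the documented defaults."""
    payload = {
        "text": text,
        "reference_id": reference_id,
        "format": "mp3",
        "chunk_length": 200,
        "normalize": True,
        "latency": "normal",
    }
    payload.update(overrides)  # e.g. format="wav", latency="balanced"
    return payload

# Override just the format, keeping the other defaults
request = build_tts_request("Hello, world!", reference_id="your_model_id", format="wav")
```

The resulting dict can be serialized with ormsgpack.packb() and posted exactly as shown in the Direct API Usage section.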

Response

Returns audio data in the specified format as a binary stream.

Get Support

Need help with text-to-speech?