Overview - Fish Audio

Core features

Text to Speech

Convert text into lifelike speech with the s2.1-pro, s2-pro, and s1 models.

Speech to Text

Transcribe audio to text with per-segment timestamps.

Voice Cloning

Clone a voice instantly from a clip, or train a persistent model.

Realtime Streaming

Stream audio as it generates — for voice agents and live apps.

Manage Voices

List, inspect, update, and delete your voice models.

Also in the web app

These run in the browser, no code required — see the Platform guide.

Voice Changer

Transform existing audio into a different voice.

Story Studio

Produce multi-speaker, long-form audio — audiobooks and narration.

Music & Sound Effects

Generate music and cinematic sound effects from a prompt.

Audio Separation

Split audio into stems, and use related processing utilities.

Models

These text-to-speech models power most capabilities:

s2.1-pro: the recommended production model, with improved quality, latency, and throughput over s2-pro.

s2.1-pro-free: the same model at $0 for testing, prototyping, development, and smaller businesses, without TTFA or DPA guarantees.

s2-pro: the previous-generation S2 model, with multi-speaker and natural-language expression control.

s1: the previous generation, with emotion tags written in (parentheses).

See Models Overview and Choosing a Model for the full lineup, languages, and limits.
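As a rough illustration of the lineup above, the sketch below encodes the four model IDs and a simple way to choose between them. Only the model IDs and their descriptions come from this page; the `MODELS` table, the `pick_model` function, and its parameters are hypothetical names invented for this example and are not part of the Fish Audio SDK or API.

```python
# Hypothetical helper: the model IDs come from the list above, but the
# selection rules and every name here are illustrative assumptions,
# not part of any Fish Audio library.
MODELS = {
    "s2.1-pro": "recommended production model",
    "s2.1-pro-free": "same model at $0, without TTFA or DPA guarantees",
    "s2-pro": "previous-generation S2 model",
    "s1": "previous generation, emotion tags in parentheses",
}


def pick_model(production: bool = True, parenthesis_tags: bool = False) -> str:
    """Return a model ID for a coarse description of the use case."""
    if parenthesis_tags:
        # Scripts that already rely on (parenthesis) emotion tags.
        return "s1"
    # Paid model for production traffic, free variant for testing.
    return "s2.1-pro" if production else "s2.1-pro-free"


if __name__ == "__main__":
    model_id = pick_model(production=False)
    print(model_id, "-", MODELS[model_id])
```

A real application would pass the chosen ID to whichever SDK or API call it uses; consult Models Overview and Choosing a Model for the actual selection criteria.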

Pick your path

Use the web app

No code — generate audio, clone voices, and produce projects in your browser.

Build with the SDK

Integrate the Python library into your application.

Call the API

Raw REST and WebSocket endpoints for any language.

Use your AI coding agent

Install the Fish Audio skill so your agent writes correct code.
