POST /v1/tts/stream/with-timestamp
Stream With Timestamps
curl --no-buffer --request POST \
  --url https://api.fish.audio/v1/tts/stream/with-timestamp \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: s2-pro' \
  --data '{
    "text": "[happy] I can’t believe it’s been this long. It feels like forever since we last really talked. I’ve missed hearing your voice, your stories, even the little things you used to say. How have you been? I’ve thought about calling you so many times, but I never knew where to start. Seeing you again now makes me realize just how much I’ve missed you. We have so much to catch up on, and I don’t even know which part of my life to tell you about first.",
    "format": "opus",
    "normalize": true,
    "temperature": 0.9,
    "chunk_length": 100,
    "top_p": 0.9,
    "latency": "balanced",
    "sample_rate": 48000,
    "reference_id": "fbe02f8306fc4d3d915e9871722a39d5"
  }'
"data: {\"audio_base64\": \"SUQzBAAAAAAA...\", \"content\": \"I can’t believe it’s been this long. It feels like forever since we last really talked. I’ve missed hearing your voice, your stories, even the little things you used to say. How have you been? I’ve thought about calling you so many times, but I never knew where to start.\", \"alignment\": {\"segments\": [{\"text\": \"I\", \"start\": 0.0, \"end\": 0.16}, {\"text\": \"can't\", \"start\": 0.16, \"end\": 0.48}, {\"text\": \"believe\", \"start\": 0.48, \"end\": 0.8}, {\"text\": \"its\", \"start\": 0.8, \"end\": 1.12}, {\"text\": \"been\", \"start\": 1.2, \"end\": 1.44}, {\"text\": \"this\", \"start\": 1.44, \"end\": 1.76}, {\"text\": \"long\", \"start\": 1.76, \"end\": 2.48}, {\"text\": \"It\", \"start\": 2.56, \"end\": 2.64}, {\"text\": \"feels\", \"start\": 2.72, \"end\": 3.04}, {\"text\": \"like\", \"start\": 3.12, \"end\": 3.28}, {\"text\": \"forever\", \"start\": 3.36, \"end\": 4.0}, {\"text\": \"since\", \"start\": 4.0, \"end\": 4.32}, {\"text\": \"we\", \"start\": 4.32, \"end\": 4.48}, {\"text\": \"last\", \"start\": 4.48, \"end\": 4.96}, {\"text\": \"really\", \"start\": 4.96, \"end\": 5.28}, {\"text\": \"talked\", \"start\": 5.28, \"end\": 5.84}, {\"text\": \"Ive\", \"start\": 6.0, \"end\": 6.24}, {\"text\": \"missed\", \"start\": 6.24, \"end\": 6.64}, {\"text\": \"hearing\", \"start\": 6.64, \"end\": 6.96}, {\"text\": \"your\", \"start\": 6.96, \"end\": 7.2}, {\"text\": \"voice\", \"start\": 7.2, \"end\": 7.76}, {\"text\": \"your\", \"start\": 7.76, \"end\": 7.92}, {\"text\": \"stories\", \"start\": 7.92, \"end\": 8.48}, {\"text\": \"even\", \"start\": 8.48, \"end\": 8.72}, {\"text\": \"the\", \"start\": 8.72, \"end\": 8.8}, {\"text\": \"little\", \"start\": 8.8, \"end\": 9.2}, {\"text\": \"things\", \"start\": 9.2, \"end\": 9.52}, {\"text\": \"you\", \"start\": 9.52, \"end\": 9.68}, {\"text\": \"used\", \"start\": 9.68, \"end\": 10.0}, {\"text\": \"to\", \"start\": 10.0, \"end\": 10.08}, {\"text\": 
\"say\", \"start\": 10.08, \"end\": 10.64}, {\"text\": \"How\", \"start\": 10.64, \"end\": 10.96}, {\"text\": \"have\", \"start\": 10.96, \"end\": 11.12}, {\"text\": \"you\", \"start\": 11.12, \"end\": 11.36}, {\"text\": \"been\", \"start\": 11.36, \"end\": 11.92}, {\"text\": \"Ive\", \"start\": 12.0, \"end\": 12.24}, {\"text\": \"thought\", \"start\": 12.24, \"end\": 12.48}, {\"text\": \"about\", \"start\": 12.48, \"end\": 12.8}, {\"text\": \"calling\", \"start\": 12.8, \"end\": 13.2}, {\"text\": \"you\", \"start\": 13.2, \"end\": 13.36}, {\"text\": \"so\", \"start\": 13.36, \"end\": 13.68}, {\"text\": \"many\", \"start\": 13.68, \"end\": 13.92}, {\"text\": \"times\", \"start\": 13.92, \"end\": 14.56}, {\"text\": \"but\", \"start\": 14.56, \"end\": 14.72}, {\"text\": \"I\", \"start\": 14.72, \"end\": 14.88}, {\"text\": \"never\", \"start\": 14.88, \"end\": 15.2}, {\"text\": \"knew\", \"start\": 15.2, \"end\": 15.36}, {\"text\": \"where\", \"start\": 15.36, \"end\": 15.6}, {\"text\": \"to\", \"start\": 15.6, \"end\": 15.6}, {\"text\": \"start\", \"start\": 15.68, \"end\": 16.24}], \"audio_duration\": 16.24}}\n\n"


This endpoint returns text/event-stream. Each SSE message event contains one JSON payload with a base64-encoded audio chunk.
Use this endpoint when you need both progressive audio delivery and text-to-audio alignment data, such as karaoke-style highlighting, word or phrase progress indicators, captions synchronized to generated speech, or timeline editing.

How the Stream Works

The response is a Server-Sent Events stream. Every event includes:
Field          Type           Description
audio_base64   string         One base64-encoded audio chunk. Concatenate all chunks in arrival order to reconstruct the complete audio.
content        string         The text covered by this event's generated audio chunk. Long input can be split into multiple content chunks.
alignment      object | null  Timestamp alignment for this content chunk. Audio-only continuation events can return null.
When latency is set to balanced, long input can be split into several text chunks. Each text chunk may produce one non-null alignment event, followed by one or more audio-only events where alignment is null.
Collect every non-null alignment in stream order. Do not keep only the first or last alignment event.
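The collection rule above can be sketched as a small helper. The events below are hypothetical, already-decoded JSON payloads; the function name is illustrative, not part of the API:

```python
import base64

def collect_stream(events):
    """Split decoded SSE event dicts into an audio buffer and an
    ordered list of non-null alignments (illustrative sketch)."""
    audio_chunks = []
    alignments = []
    for event in events:
        # Every event carries audio, including audio-only continuations.
        audio_chunks.append(base64.b64decode(event["audio_base64"]))
        # Keep every non-null alignment, in stream order.
        if event["alignment"] is not None:
            alignments.append(event["alignment"])
    return b"".join(audio_chunks), alignments
```

Note that audio and alignment are accumulated independently: an event with a null alignment still contributes audio bytes.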

Alignment Shape

Each non-null alignment contains the generated audio duration and ordered timing segments:
{
  "alignment": {
    "audio_duration": 16.24,
    "segments": [
      {
        "text": "Hello",
        "start": 0,
        "end": 0.42
      },
      {
        "text": "world",
        "start": 0.42,
        "end": 0.86
      }
    ]
  }
}
start and end are measured in seconds from the start of that content chunk’s generated audio. Use audio_duration to offset later chunks when you need a single global timeline.

Minimal Request

curl --no-buffer --request POST \
  --url https://api.fish.audio/v1/tts/stream/with-timestamp \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: s2-pro' \
  --data '{
    "text": "Hello! Welcome to Fish Audio.",
    "reference_id": "model-id",
    "format": "opus",
    "latency": "balanced"
  }'

Parsing the Stream

The stream payload uses standard SSE framing. Parse each data: line as JSON, append every decoded audio_base64 chunk to your audio buffer, and store non-null alignments separately.
import base64
import json
import requests

response = requests.post(
    "https://api.fish.audio/v1/tts/stream/with-timestamp",
    headers={
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
        "model": "s2-pro",
    },
    json={
        "text": "Hello! Welcome to Fish Audio.",
        "reference_id": "model-id",
        "format": "opus",
        "latency": "balanced",
    },
    stream=True,
)

audio_chunks = []
alignments = []

for line in response.iter_lines(decode_unicode=True):
    # SSE frames arrive as "data: {...}" lines separated by blank lines.
    if not line or not line.startswith("data: "):
        continue

    event = json.loads(line.removeprefix("data: "))
    # Every event carries audio; append in arrival order.
    audio_chunks.append(base64.b64decode(event["audio_base64"]))

    # Only some events carry timing data; audio-only continuations are null.
    if event["alignment"] is not None:
        alignments.append(event["alignment"])

audio = b"".join(audio_chunks)
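Once the stream ends, the joined buffer is the complete audio file and can be written straight to disk. The helper and filename below are illustrative, not part of the API:

```python
def save_audio(audio: bytes, path: str) -> int:
    """Write the reconstructed audio buffer to disk; returns bytes written."""
    with open(path, "wb") as f:
        return f.write(audio)

# e.g. save_audio(audio, "output.opus") for the opus request above
```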

Handling Split Content Chunks

Long input can produce multiple content chunks. Treat audio and alignment as two related streams:
  1. Append every decoded audio_base64 chunk in event order. Do this even when alignment is null.
  2. Keep only non-null alignment objects for timing data.
  3. Convert each alignment’s local segment times into global times by adding the duration of all previous aligned content chunks.
audio_base64 chunks are transport chunks, not sentence or word boundaries. Do not try to align each audio chunk individually. Use alignment.segments for text timing, and use alignment.audio_duration to offset later aligned content chunks.
For example, if the first aligned content chunk has audio_duration: 16.24, add 16.24 seconds to every segment in the next non-null alignment before rendering it on the complete audio timeline.
def build_global_timeline(alignments):
    """Merge per-chunk alignments into one timeline on the full audio."""
    timeline = []
    offset_seconds = 0.0

    for alignment in alignments:
        for segment in alignment["segments"]:
            timeline.append({
                "text": segment["text"],
                "start": segment["start"] + offset_seconds,
                "end": segment["end"] + offset_seconds,
            })

        # Later chunks start after all previously aligned audio.
        offset_seconds += alignment["audio_duration"]

    return timeline
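As a quick numeric check, here is the same offset logic applied to two hypothetical aligned chunks (durations and segments invented for illustration):

```python
chunks = [
    {"audio_duration": 16.24,
     "segments": [{"text": "start", "start": 15.68, "end": 16.24}]},
    {"audio_duration": 4.8,
     "segments": [{"text": "Seeing", "start": 0.0, "end": 0.4}]},
]

offset = 0.0
timeline = []
for chunk in chunks:
    for seg in chunk["segments"]:
        # Shift local segment times by the total duration of earlier chunks.
        timeline.append((seg["text"], seg["start"] + offset, seg["end"] + offset))
    offset += chunk["audio_duration"]

# The second chunk's segment lands at roughly 16.24–16.64 s on the
# global timeline, because it starts after the first 16.24 s of audio.
```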

Format Guidance

For timestamped streaming, we recommend opus with the default 48 kHz sample rate when your client supports it. Opus is designed for streaming and gives the best balance of quality, latency, and bandwidth for this endpoint. wav and pcm avoid lossy codec artifacts and are straightforward to align, but they produce much larger payloads. Use them when you need uncompressed audio, direct sample-level processing, or a playback pipeline that already expects raw audio.
Use mp3 only when broad playback compatibility is more important than the cleanest streaming boundaries. MP3 encoding uses overlapping audio windows, so this endpoint must flush complete sentence audio before emitting alignment data. Around sentence boundaries, that flush can introduce a small quality loss or discontinuity compared with opus.
This endpoint accepts the same TTS request fields as the Text to Speech API, including reference_id, references, prosody, temperature, top_p, chunk_length, format, and latency.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

model
enum<string>
default:s2-pro
required

Specify which TTS model to use. We recommend s2-pro.

Available options:
s1,
s2-pro

Body

Request body for text-to-speech synthesis. Supports single-speaker synthesis on all compatible TTS models. Multi-speaker dialogue synthesis is only available with the S2-Pro model.

Single Speaker

Provide either reference_id (string) pointing to a voice model, or references (array of ReferenceAudio) for zero-shot cloning.

Multiple Speakers (Dialogue, S2-Pro only)

For multi-speaker synthesis, provide:

  • reference_id: array of voice model IDs, e.g., ["speaker-0-id", "speaker-1-id"]
  • text: use speaker tags <|speaker:0|>, <|speaker:1|>, etc. to indicate speaker changes, e.g., "<|speaker:0|>Hello!<|speaker:1|>Hi there!"

Alternatively, for zero-shot multi-speaker:

  • references: 2D array where each inner array contains references for one speaker
  • reference_id: array of identifiers (can be arbitrary strings for zero-shot)

Example (Multi-Speaker with Model IDs)

{
  "text": "<|speaker:0|>Good morning!<|speaker:1|>Good morning! How are you?<|speaker:0|>I'm great, thanks!",
  "reference_id": ["model-id-alice", "model-id-bob"]
}
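The speaker-tag string can be assembled programmatically from (speaker index, line) pairs. The build_dialogue_text helper below is an illustrative sketch, not part of the API:

```python
def build_dialogue_text(turns):
    """Join (speaker_index, line) pairs into the <|speaker:N|> tag format."""
    return "".join(f"<|speaker:{idx}|>{line}" for idx, line in turns)

payload = {
    "text": build_dialogue_text([
        (0, "Good morning!"),
        (1, "Good morning! How are you?"),
        (0, "I'm great, thanks!"),
    ]),
    "reference_id": ["model-id-alice", "model-id-bob"],
}
```

Keeping the tags out of hand-written strings avoids typos like a missing `|` that would be read as literal text rather than a speaker change.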
text
string
required

Text to convert to speech.

temperature
number
default:0.7

Controls expressiveness. Higher is more varied, lower is more consistent.

Required range: 0 <= x <= 1
top_p
number
default:0.7

Controls diversity via nucleus sampling.

Required range: 0 <= x <= 1
references

Single speaker: array of reference audio samples

reference_id

Single speaker: voice model ID string

prosody
ProsodyControl · object

Speed and volume adjustments for the output.

chunk_length
integer
default:300

Text segment size for processing.

Required range: 100 <= x <= 300
normalize
boolean
default:true

Normalizes text for English and Chinese, improving stability for numbers.

format
enum<string>
default:mp3

Output audio format.

Available options:
wav,
pcm,
mp3,
opus
sample_rate
integer | null

Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).

mp3_bitrate
enum<integer>
default:128

MP3 bitrate in kbps. Only applies when format is mp3.

Available options:
64,
128,
192
opus_bitrate
enum<integer>
default:-1000

Opus bitrate in bps. -1000 for automatic. Only applies when format is opus.

Available options:
-1000,
24000,
32000,
48000,
64000
latency
enum<string>
default:normal

Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.

Available options:
low,
normal,
balanced
max_new_tokens
integer
default:1024

Maximum audio tokens to generate per text chunk.

repetition_penalty
number
default:1.2

Penalty for repeating audio patterns. Values above 1.0 reduce repetition.

min_chunk_length
integer
default:50

Minimum characters before splitting into a new chunk.

Required range: 0 <= x <= 100
condition_on_previous_chunks
boolean
default:true

Use previous audio as context for voice consistency.

early_stop_threshold
number
default:1

Early stopping threshold for batch processing.

Required range: 0 <= x <= 1

Response

Server-Sent Events stream. Each message event contains a JSON payload with one base64 audio chunk. Concatenate every audio_base64 chunk in arrival order to reconstruct the complete audio. In balanced streaming, long input can be split into multiple text chunks. Each text chunk may produce a non-null alignment event, followed by one or more audio-only events for the same content where alignment is null. Clients should collect every non-null alignment in order instead of keeping only the first or last event.

Each SSE message payload contains the following fields:

audio_base64
string
required

Base64 encoded audio chunk. Concatenate every chunk in event order to reconstruct the full audio.

content
string
required

Text content covered by this event's text chunk. Long input may be split into multiple content chunks in one stream.

alignment
TTSTimestampAlignment · object | null
required

Timestamp information for this content chunk. Balanced streaming can produce multiple non-null alignments, one for each text chunk. Additional audio events for the same content chunk may return null.