POST /v1/tts/stream/with-timestamp
Stream With Timestamps
curl --no-buffer --request POST \
  --url https://api.fish.audio/v1/tts/stream/with-timestamp \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: s2-pro' \
  --data '{
    "text": "[happy] I can’t believe it’s been this long. It feels like forever since we last really talked. I’ve missed hearing your voice, your stories, even the little things you used to say. How have you been? I’ve thought about calling you so many times, but I never knew where to start. Seeing you again now makes me realize just how much I’ve missed you. We have so much to catch up on, and I don’t even know which part of my life to tell you about first.",
    "format": "opus",
    "normalize": true,
    "temperature": 0.9,
    "chunk_length": 100,
    "top_p": 0.9,
    "latency": "balanced",
    "sample_rate": 48000,
    "reference_id": "fbe02f8306fc4d3d915e9871722a39d5"
  }'

data: {"audio_base64": "SUQzBAAAAAAA...", "content": "I can’t believe it’s been this long. It feels like forever since we last really talked. I’ve missed hearing your voice, your stories, even the little things you used to say. How have you been? I’ve thought about calling you so many times, but I never knew where to start.", "alignment": {"segments": [{"text": "I", "start": 0.0, "end": 0.16}, {"text": "can't", "start": 0.16, "end": 0.48}, {"text": "believe", "start": 0.48, "end": 0.8}, {"text": "its", "start": 0.8, "end": 1.12}, {"text": "been", "start": 1.2, "end": 1.44}, {"text": "this", "start": 1.44, "end": 1.76}, {"text": "long", "start": 1.76, "end": 2.48}, {"text": "It", "start": 2.56, "end": 2.64}, {"text": "feels", "start": 2.72, "end": 3.04}, {"text": "like", "start": 3.12, "end": 3.28}, {"text": "forever", "start": 3.36, "end": 4.0}, {"text": "since", "start": 4.0, "end": 4.32}, {"text": "we", "start": 4.32, "end": 4.48}, {"text": "last", "start": 4.48, "end": 4.96}, {"text": "really", "start": 4.96, "end": 5.28}, {"text": "talked", "start": 5.28, "end": 5.84}, {"text": "Ive", "start": 6.0, "end": 6.24}, {"text": "missed", "start": 6.24, "end": 6.64}, {"text": "hearing", "start": 6.64, "end": 6.96}, {"text": "your", "start": 6.96, "end": 7.2}, {"text": "voice", "start": 7.2, "end": 7.76}, {"text": "your", "start": 7.76, "end": 7.92}, {"text": "stories", "start": 7.92, "end": 8.48}, {"text": "even", "start": 8.48, "end": 8.72}, {"text": "the", "start": 8.72, "end": 8.8}, {"text": "little", "start": 8.8, "end": 9.2}, {"text": "things", "start": 9.2, "end": 9.52}, {"text": "you", "start": 9.52, "end": 9.68}, {"text": "used", "start": 9.68, "end": 10.0}, {"text": "to", "start": 10.0, "end": 10.08}, {"text": "say", "start": 10.08, "end": 10.64}, {"text": "How", "start": 10.64, "end": 10.96}, {"text": "have", "start": 10.96, "end": 11.12}, {"text": "you", "start": 11.12, "end": 11.36}, {"text": "been", "start": 11.36, "end": 11.92}, {"text": "Ive", "start": 12.0, "end": 12.24}, {"text": "thought", "start": 12.24, "end": 12.48}, {"text": "about", "start": 12.48, "end": 12.8}, {"text": "calling", "start": 12.8, "end": 13.2}, {"text": "you", "start": 13.2, "end": 13.36}, {"text": "so", "start": 13.36, "end": 13.68}, {"text": "many", "start": 13.68, "end": 13.92}, {"text": "times", "start": 13.92, "end": 14.56}, {"text": "but", "start": 14.56, "end": 14.72}, {"text": "I", "start": 14.72, "end": 14.88}, {"text": "never", "start": 14.88, "end": 15.2}, {"text": "knew", "start": 15.2, "end": 15.36}, {"text": "where", "start": 15.36, "end": 15.6}, {"text": "to", "start": 15.6, "end": 15.6}, {"text": "start", "start": 15.68, "end": 16.24}], "audio_duration": 16.24}, "chunk_seq": 0, "chunk_audio_offset_sec": 0.0}


This endpoint returns text/event-stream. Each SSE message event contains one JSON payload with a base64-encoded audio chunk.
Use this endpoint when you need both progressive audio delivery and text-to-audio alignment data, such as karaoke-style highlighting, word or phrase progress indicators, captions synchronized to generated speech, or timeline editing.

How the Stream Works

The response is a Server-Sent Events stream. Every event includes:
| Field | Type | Description |
| --- | --- | --- |
| audio_base64 | string | One base64-encoded audio chunk. Concatenate all chunks in arrival order to reconstruct the complete audio. |
| content | string | Text content described by this event’s latest alignment snapshot. Long input can be split into multiple content chunks. |
| alignment | object \| null | Latest cumulative timestamp snapshot for chunk_seq. When present, replace the previous snapshot for that chunk_seq; do not append segments. |
| chunk_seq | integer | Sequence number of the text chunk described by alignment. Bucket alignment snapshots by this value. |
| chunk_audio_offset_sec | number | Absolute start time of this text chunk within the full audio, in seconds. Add this to segment-local start and end values for a global audio timeline. |
audio_base64 is the transport stream. alignment is a metadata snapshot for chunk_seq. They are delivered together in the same SSE event, but the alignment is not a per-audio-packet delta. When latency is set to balanced, long input can be split into several text chunks. A chunk may produce multiple non-null alignment snapshots as more audio is rendered. Each newer snapshot supersedes the previous snapshot for the same chunk_seq.
Store alignments in a map keyed by chunk_seq. On every non-null alignment, replace the stored value for that key. Do not collect every non-null alignment as a separate final result.
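The latest-wins rule above can be sketched as a small helper. The name `apply_event` and its boolean return value are illustrative assumptions, not part of the API; the event field names are the ones listed in the table.

```python
def apply_event(alignment_by_chunk, event):
    """Store the newest alignment snapshot for event["chunk_seq"].

    Returns True when the stored snapshot changed, False otherwise.
    `apply_event` is a hypothetical helper name, not part of the API.
    """
    alignment = event.get("alignment")
    if alignment is None:
        # Audio-only event: there is no snapshot to store.
        return False

    chunk_seq = event["chunk_seq"]
    snapshot = {
        "content": event["content"],
        "offset": event["chunk_audio_offset_sec"],
        "alignment": alignment,
    }
    changed = alignment_by_chunk.get(chunk_seq) != snapshot
    # Replace, never append: a newer snapshot supersedes the older one.
    alignment_by_chunk[chunk_seq] = snapshot
    return changed
```

Returning whether the snapshot changed lets a client skip re-rendering when a later audio event simply repeats the snapshot it already has.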

Alignment Shape

Each non-null alignment contains the current cumulative timing segments for a single text chunk:
{
  "audio_base64": "SUQzBAAAAAAA...",
  "content": "Hello world",
  "chunk_seq": 0,
  "chunk_audio_offset_sec": 0.0,
  "alignment": {
    "audio_duration": 0.86,
    "segments": [
      {
        "text": "Hello",
        "start": 0,
        "end": 0.42
      },
      {
        "text": "world",
        "start": 0.42,
        "end": 0.86
      }
    ]
  }
}
start and end are measured in seconds from the start of that text chunk’s generated audio. Add chunk_audio_offset_sec to get timestamps on the complete audio timeline. alignment can be null before the first snapshot is available or when alignment is unavailable. After a snapshot exists, later audio events may repeat the latest snapshot so clients can continue using a simple latest-wins update model.

Minimal Request

curl --no-buffer --request POST \
  --url https://api.fish.audio/v1/tts/stream/with-timestamp \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: s2-pro' \
  --data '{
    "text": "Hello! Welcome to Fish Audio.",
    "reference_id": "model-id",
    "format": "opus",
    "latency": "balanced"
  }'

Parsing the Stream

The stream payload uses standard SSE framing. Parse each data: line as JSON, append every decoded audio_base64 chunk to your audio buffer, and replace the latest alignment snapshot for chunk_seq whenever alignment is non-null.
import base64
import json
import requests

response = requests.post(
    "https://api.fish.audio/v1/tts/stream/with-timestamp",
    headers={
        "Authorization": "Bearer <token>",
        "Content-Type": "application/json",
        "model": "s2-pro",
    },
    json={
        "text": "Hello! Welcome to Fish Audio.",
        "reference_id": "model-id",
        "format": "opus",
        "latency": "balanced",
    },
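    stream=True,
)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
# SSE payloads are UTF-8. Without an explicit charset, requests may guess
# another encoding for text/* responses and garble non-ASCII content.
response.encoding = "utf-8"

audio_chunks = []
alignment_by_chunk = {}  # chunk_seq -> latest snapshot

for line in response.iter_lines(decode_unicode=True):
    # Skip blank event separators and any non-data SSE lines.
    if not line or not line.startswith("data:"):
        continue

    # The space after "data:" is optional in SSE, so strip it separately.
    event = json.loads(line[len("data:"):].lstrip())
    # Append audio on every event, even when alignment is null.
    audio_chunks.append(base64.b64decode(event["audio_base64"]))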

    if event["alignment"] is not None:
        alignment_by_chunk[event["chunk_seq"]] = {
            "content": event["content"],
            "offset": event["chunk_audio_offset_sec"],
            "alignment": event["alignment"],
        }

audio = b"".join(audio_chunks)
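The loop above treats every `data:` line as a complete JSON document, which matches the example payloads on this page. If you would rather follow general SSE framing (comment lines, an optional space after the colon, events terminated by a blank line), a stricter parser could look like the sketch below. `iter_sse_json` is a hypothetical helper name, and the final flush of an unterminated event is a deliberate leniency that the SSE specification does not require.

```python
import json


def iter_sse_json(lines):
    """Yield one decoded JSON payload per SSE event.

    `lines` is any iterable of already-decoded text lines, for example
    response.iter_lines(decode_unicode=True).
    """
    data_lines = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # At most one leading space is stripped from the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:
        # Lenient: parse a final event that was not followed by a blank line.
        yield json.loads("\n".join(data_lines))
```

Each yielded dictionary can then be handled exactly like `event` in the loop above.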

Handling Split Content Chunks

Long input can produce multiple text chunks. Treat audio and alignment as two related streams:
  1. Append every decoded audio_base64 chunk in event order. Do this even when alignment is null.
  2. For non-null alignment, replace the stored snapshot for chunk_seq.
  3. Convert each snapshot’s local segment times into global times by adding chunk_audio_offset_sec.
audio_base64 chunks are transport chunks, not sentence or word boundaries. Do not try to align each audio chunk individually. Use alignment.segments plus chunk_audio_offset_sec for text timing.
For example, if an event has chunk_audio_offset_sec: 16.24, add 16.24 seconds to every segment in that event’s alignment before rendering it on the complete audio timeline.
def build_global_timeline(alignment_by_chunk):
    timeline = []

    for chunk_seq, item in sorted(alignment_by_chunk.items()):
        offset_seconds = item["offset"]
        alignment = item["alignment"]

        for segment in alignment["segments"]:
            timeline.append({
                "text": segment["text"],
                "start": segment["start"] + offset_seconds,
                "end": segment["end"] + offset_seconds,
                "chunk_seq": chunk_seq,
            })

    return timeline
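For karaoke-style highlighting, a player needs to find the segment that is active at the current playback position. The sketch below does this with a binary search over the global timeline. `active_segment` is a hypothetical helper, and it assumes the timeline is sorted by start time, which holds when chunk offsets increase with chunk_seq.

```python
from bisect import bisect_right


def active_segment(timeline, playback_sec):
    """Return the segment being spoken at playback_sec, or None.

    `timeline` is a list of dicts with global "start" and "end" times,
    such as the output of build_global_timeline, sorted by "start".
    """
    starts = [segment["start"] for segment in timeline]
    # Index of the last segment whose start is <= playback_sec.
    index = bisect_right(starts, playback_sec) - 1
    if index < 0:
        return None
    segment = timeline[index]
    # Pauses between words have no active segment.
    return segment if playback_sec < segment["end"] else None
```

For long recordings, compute `starts` once and reuse it instead of rebuilding the list on every lookup.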

Format Guidance

For timestamped streaming, we recommend opus with the default 48 kHz sample rate when your client supports it. Opus is designed for streaming and gives the best balance of quality, latency, and bandwidth for this endpoint. wav and pcm avoid lossy codec artifacts and are straightforward to align, but they produce much larger payloads. Use them when you need uncompressed audio, direct sample-level processing, or a playback pipeline that already expects raw audio.
Use mp3 only when broad playback compatibility is more important than the cleanest streaming boundaries. MP3 encoding uses overlapping audio windows, so its encoded chunks may not line up as neatly with timestamp snapshot updates as Opus.
This endpoint accepts the same TTS request fields as the Text to Speech API, including reference_id, references, prosody, temperature, top_p, chunk_length, format, and latency.

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

model
enum<string>
default:s2-pro
required

Specify which TTS model to use. We recommend s2-pro.

Available options:
s1,
s2-pro

Body

Request body for streaming text-to-speech synthesis with timestamp alignment. The request fields match the standard TTS endpoint, but the response is delivered as a Server-Sent Events stream. Each SSE payload includes an audio chunk and, when available, the latest cumulative alignment snapshot for a chunk_seq. Clients should concatenate audio_base64 chunks in arrival order and replace the stored alignment for each chunk_seq whenever a newer snapshot is received.

text
string
required

Text to convert to speech.

temperature
number
default:0.7

Controls expressiveness. Higher is more varied, lower is more consistent.

Required range: 0 <= x <= 1

top_p
number
default:0.7

Controls diversity via nucleus sampling.

Required range: 0 <= x <= 1

references

Single speaker: array of reference audio samples

reference_id

Single speaker: voice model ID string

prosody
ProsodyControl · object

Speed and volume adjustments for the output.

chunk_length
integer
default:300

Text segment size for processing.

Required range: 100 <= x <= 300

normalize
boolean
default:true

Normalizes text for English and Chinese, improving stability for numbers.

format
enum<string>
default:mp3

Output audio format.

Available options:
wav,
pcm,
mp3,
opus

sample_rate
integer | null

Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).

mp3_bitrate
enum<integer>
default:128

MP3 bitrate in kbps. Only applies when format is mp3.

Available options:
64,
128,
192

opus_bitrate
enum<integer>
default:-1000

Opus bitrate in bps. -1000 for automatic. Only applies when format is opus.

Available options:
-1000,
24000,
32000,
48000,
64000

latency
enum<string>
default:normal

Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.

Available options:
low,
normal,
balanced

max_new_tokens
integer
default:1024

Maximum audio tokens to generate per text chunk.

repetition_penalty
number
default:1.2

Penalty for repeating audio patterns. Values above 1.0 reduce repetition.

min_chunk_length
integer
default:50

Minimum characters before splitting into a new chunk.

Required range: 0 <= x <= 100

condition_on_previous_chunks
boolean
default:true

Use previous audio as context for voice consistency.

early_stop_threshold
number
default:1

Early stopping threshold for batch processing.

Required range: 0 <= x <= 1
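The ranges and enumerations listed above can be checked on the client before sending a request. The sketch below is one way to do that; `validate_body`, `ALLOWED`, and `RANGES` are hypothetical names, and the checks cover only the constraints stated on this page, so the server may still reject a body that passes.

```python
# Enumerated options, copied from the parameter list above.
ALLOWED = {
    "format": {"wav", "pcm", "mp3", "opus"},
    "latency": {"low", "normal", "balanced"},
    "mp3_bitrate": {64, 128, 192},
    "opus_bitrate": {-1000, 24000, 32000, 48000, 64000},
}
# Inclusive numeric ranges, copied from the parameter list above.
RANGES = {
    "temperature": (0, 1),
    "top_p": (0, 1),
    "chunk_length": (100, 300),
    "min_chunk_length": (0, 100),
    "early_stop_threshold": (0, 1),
}


def validate_body(body):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if not isinstance(body.get("text"), str):
        problems.append("text is required and must be a string")
    for name, allowed in ALLOWED.items():
        if name in body and body[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    for name, (low, high) in RANGES.items():
        if name in body and not low <= body[name] <= high:
            problems.append(f"{name} must satisfy {low} <= x <= {high}")
    return problems
```

Fields that are absent from the body are skipped, so the server-side defaults still apply.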

Response

Server-Sent Events stream. Each message event contains a JSON payload with one base64 audio chunk. Concatenate every audio_base64 chunk in arrival order to reconstruct the complete audio. alignment is the latest cumulative timestamp snapshot for chunk_seq; clients should replace the previous snapshot for that chunk instead of appending segments. chunk_audio_offset_sec can be added to segment times to derive absolute timestamps in the full audio.

Each Server-Sent Events message carries one JSON payload with the following fields.

audio_base64
string
required

Base64 encoded audio chunk. Concatenate every chunk in event order to reconstruct the full audio.

content
string
required

Text content described by this event's latest alignment snapshot. Long input may be split into multiple content chunks in one stream.

alignment
TTSTimestampAlignment · object | null
required

Latest cumulative timestamp snapshot for chunk_seq. When present, replace the previous alignment for the same chunk_seq; do not append segments. Null means no alignment snapshot has been produced yet or alignment is unavailable.

chunk_seq
integer
required

Sequence number of the text chunk described by alignment. Clients should bucket alignment snapshots by this value.

Required range: x >= 0

chunk_audio_offset_sec
number
required

Absolute start time of this text chunk within the full audio, in seconds.

Required range: x >= 0
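Because chunk_audio_offset_sec places every chunk on the full audio timeline, the response fields are enough to derive caption cues. The sketch below writes WebVTT from segments whose times are already global. `format_vtt_time`, `to_webvtt`, and the `words_per_cue` grouping are illustrative assumptions, not something the API provides.

```python
def format_vtt_time(seconds):
    """Format seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(segments, words_per_cue=7):
    """Group global-timeline segments into WebVTT cues.

    `segments` is a list of dicts with "text", "start", and "end" keys,
    where start and end already include chunk_audio_offset_sec.
    """
    lines = ["WEBVTT", ""]
    for i in range(0, len(segments), words_per_cue):
        group = segments[i:i + words_per_cue]
        start = format_vtt_time(group[0]["start"])
        end = format_vtt_time(group[-1]["end"])
        lines.append(f"{start} --> {end}")
        lines.append(" ".join(segment["text"] for segment in group))
        lines.append("")  # blank line ends the cue
    return "\n".join(lines)
```

Grouping several words per cue keeps captions readable and makes zero-length cues less likely when a single segment reports identical start and end times.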