Text to Speech Stream with Timestamps
Stream generated speech with timestamp alignment snapshots
How the Stream Works
The response is a Server-Sent Events stream. Every event includes:

| Field | Type | Description |
|---|---|---|
| audio_base64 | string | One base64-encoded audio chunk. Concatenate all chunks in arrival order to reconstruct the complete audio. |
| content | string | Text content described by this event’s latest alignment snapshot. Long input can be split into multiple content chunks. |
| alignment | object or null | Latest cumulative timestamp snapshot for chunk_seq. When present, replace the previous snapshot for that chunk_seq; do not append segments. |
| chunk_seq | integer | Sequence number of the text chunk described by alignment. Bucket alignment snapshots by this value. |
| chunk_audio_offset_sec | number | Absolute start time of this text chunk within the full audio, in seconds. Add this to segment-local start and end values for a global audio timeline. |

audio_base64 is the transport stream. alignment is a metadata snapshot for
chunk_seq. They are delivered together in the same SSE event, but the
alignment is not a per-audio-packet delta.
When latency is set to balanced, long input can be split into several text
chunks. A chunk may produce multiple non-null alignment snapshots as more audio
is rendered. Each newer snapshot supersedes the previous snapshot for the same
chunk_seq.
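The latest-wins rule can be sketched as a small dictionary keyed by chunk_seq. This is an illustrative sketch, not the official client: the helper name apply_event and the sample events are hypothetical, and the text key inside each segment is an assumption about the segment shape.

```python
def apply_event(snapshots: dict, event: dict) -> None:
    """Store the newest alignment snapshot for the event's chunk_seq."""
    alignment = event.get("alignment")
    if alignment is None:
        # No snapshot in this event: keep whatever is already stored.
        return
    # Replace, never append: each snapshot is cumulative for its chunk.
    snapshots[event["chunk_seq"]] = alignment


# Hypothetical events for one text chunk; "text" is an assumed segment field.
events = [
    {"chunk_seq": 0, "alignment": None},
    {"chunk_seq": 0, "alignment": {"segments": [
        {"text": "Hello", "start": 0.0, "end": 0.4},
    ]}},
    {"chunk_seq": 0, "alignment": {"segments": [
        {"text": "Hello", "start": 0.0, "end": 0.4},
        {"text": "world", "start": 0.4, "end": 0.9},
    ]}},
]

snapshots = {}
for event in events:
    apply_event(snapshots, event)

# Only the newest cumulative snapshot survives for chunk 0.
segment_count = len(snapshots[0]["segments"])
print(segment_count)
```

Appending instead of replacing would duplicate the first segment here, because the second snapshot already contains it.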
Alignment Shape
Each non-null alignment contains the current cumulative timing segments for a
single text chunk.
start and end are measured in seconds from the start of that text chunk’s
generated audio. Add chunk_audio_offset_sec to get timestamps on the complete
audio timeline.
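One way to model this shape in client code is with typed dictionaries. The sketch below assumes a segments list whose items carry start and end; the text field and the helper name shift_segment are assumptions for illustration, not confirmed field names.

```python
from typing import List, TypedDict


class Segment(TypedDict):
    text: str     # assumed field name, not confirmed by this page
    start: float  # seconds from the start of this chunk's audio
    end: float    # seconds from the start of this chunk's audio


class Alignment(TypedDict):
    segments: List[Segment]


def shift_segment(segment: Segment, chunk_audio_offset_sec: float) -> Segment:
    """Return a copy of the segment placed on the complete audio timeline."""
    return Segment(
        text=segment["text"],
        start=segment["start"] + chunk_audio_offset_sec,
        end=segment["end"] + chunk_audio_offset_sec,
    )


local = Segment(text="world", start=0.5, end=1.0)
shifted = shift_segment(local, 16.0)
print(shifted["start"], shifted["end"])  # 16.5 17.0
```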
alignment can be null before the first snapshot is available or when
alignment is unavailable. After a snapshot exists, later audio events may repeat
the latest snapshot so clients can continue using a simple latest-wins update
model.
Minimal Request
Parsing the Stream
The stream payload uses standard SSE framing. Parse each data: line as JSON,
append every decoded audio_base64 chunk to your audio buffer, and replace the
latest alignment snapshot for chunk_seq whenever alignment is non-null.
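A minimal sketch of that parsing loop follows. It works on any iterable of text lines, so the HTTP client is left out; how you obtain those lines from the response is up to your client library. The function names and the demo payloads are hypothetical, and the text key inside a segment is an assumption.

```python
import base64
import json
from typing import Iterable, Iterator, Tuple


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one JSON payload per SSE event (events end at a blank line)."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE drops a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored.
    if data_lines:
        # Lenient: flush a final event even without a trailing blank line.
        yield json.loads("\n".join(data_lines))


def collect(lines: Iterable[str]) -> Tuple[bytes, dict]:
    """Append all audio and keep the latest alignment per chunk_seq."""
    audio = bytearray()
    latest = {}
    for payload in iter_sse_payloads(lines):
        audio.extend(base64.b64decode(payload["audio_base64"]))
        if payload.get("alignment") is not None:
            latest[payload["chunk_seq"]] = payload["alignment"]
    return bytes(audio), latest


def _frame(payload: dict) -> list:
    """Build the lines of one SSE event for the demo below."""
    return ["data: " + json.dumps(payload), ""]


# Hypothetical two-event stream; real audio chunks are much larger.
demo_lines = _frame({
    "audio_base64": base64.b64encode(b"\x00\x01").decode("ascii"),
    "content": "Hi", "alignment": None,
    "chunk_seq": 0, "chunk_audio_offset_sec": 0.0,
}) + _frame({
    "audio_base64": base64.b64encode(b"\x02\x03").decode("ascii"),
    "content": "Hi",
    "alignment": {"segments": [{"text": "Hi", "start": 0.0, "end": 0.4}]},
    "chunk_seq": 0, "chunk_audio_offset_sec": 0.0,
})

audio, latest = collect(demo_lines)
print(len(audio), sorted(latest))
```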
Handling Split Content Chunks
Long input can produce multiple text chunks. Treat audio and alignment as two related streams:
- Append every decoded audio_base64 chunk in event order. Do this even when alignment is null.
- For non-null alignment, replace the stored snapshot for chunk_seq.
- Convert each snapshot’s local segment times into global times by adding chunk_audio_offset_sec.
audio_base64 chunks are transport chunks, not sentence or word boundaries.
Do not try to align each audio chunk individually. Use alignment.segments
plus chunk_audio_offset_sec for text timing.
For example, if an event reports chunk_audio_offset_sec: 16.24, add 16.24
seconds to every segment in that event’s alignment before rendering it on the
complete audio timeline.
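The three steps can be combined in one small collector. This is a sketch under stated assumptions: the class name TimestampedStream, the sample events, and the text key inside each segment are hypothetical. Times are rounded to milliseconds only to keep floating-point noise out of the output.

```python
import base64


class TimestampedStream:
    """Collects audio bytes and the latest alignment snapshot per chunk_seq."""

    def __init__(self):
        self.audio = bytearray()
        self.snapshots = {}  # chunk_seq -> (chunk_audio_offset_sec, segments)

    def feed(self, event: dict) -> None:
        # 1. Audio is appended for every event, even when alignment is null.
        self.audio.extend(base64.b64decode(event["audio_base64"]))
        # 2. A non-null alignment replaces the stored snapshot for its chunk.
        alignment = event.get("alignment")
        if alignment is not None:
            self.snapshots[event["chunk_seq"]] = (
                event["chunk_audio_offset_sec"],
                alignment["segments"],
            )

    def timeline(self) -> list:
        # 3. Shift chunk-local times by the chunk offset, in chunk order.
        result = []
        for seq in sorted(self.snapshots):
            offset, segments = self.snapshots[seq]
            for seg in segments:
                result.append({
                    **seg,
                    "start": round(seg["start"] + offset, 3),
                    "end": round(seg["end"] + offset, 3),
                })
        return result


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


# Hypothetical events: two text chunks, the second starting at 16.24 s.
stream = TimestampedStream()
stream.feed({"audio_base64": _b64(b"aa"), "chunk_seq": 0,
             "chunk_audio_offset_sec": 0.0,
             "alignment": {"segments": [{"text": "First", "start": 0.0, "end": 0.6}]}})
stream.feed({"audio_base64": _b64(b"bb"), "chunk_seq": 1,
             "chunk_audio_offset_sec": 16.24, "alignment": None})
stream.feed({"audio_base64": _b64(b"cc"), "chunk_seq": 1,
             "chunk_audio_offset_sec": 16.24,
             "alignment": {"segments": [{"text": "Second", "start": 0.0, "end": 0.5}]}})

timeline = stream.timeline()
print(timeline[-1]["start"], timeline[-1]["end"])  # 16.24 16.74
```

Keeping the offset next to the stored segments means the timeline can be rebuilt at any point in the stream, not only after the last event.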
Format Guidance
For timestamped streaming, we recommend opus with the default 48 kHz sample
rate when your client supports it. Opus is designed for streaming and gives the
best balance of quality, latency, and bandwidth for this endpoint.
wav and pcm avoid lossy codec artifacts and are straightforward to align,
but they produce much larger payloads. Use them when you need uncompressed
audio, direct sample-level processing, or a playback pipeline that already
expects raw audio.
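To get a rough sense of the payload difference, the arithmetic can be sketched as below. It assumes 16-bit mono PCM at 44100 Hz and an Opus bitrate of 32000 bps, and it ignores container and SSE framing overhead, so treat the numbers as estimates rather than measured values. The helper names are hypothetical.

```python
import base64


def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw PCM data rate; 16-bit mono is an assumption, not a documented default."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def base64_length(raw_bytes: int) -> int:
    """Length of the padded base64 text for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


pcm_rate = pcm_bytes_per_second(44100)  # 88200 bytes per second of audio
opus_rate = 32000 // 8                  # 4000 bytes per second at 32000 bps

# Base64 adds about one third on top of the raw size in each event.
pcm_b64 = base64_length(pcm_rate)
opus_b64 = base64_length(opus_rate)

print(pcm_rate, opus_rate, pcm_b64, opus_b64)
```

Under these assumptions, one second of PCM costs roughly 22 times as many bytes on the wire as one second of Opus.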
This endpoint accepts the same TTS request fields as the Text to Speech API, including reference_id, references, prosody, temperature, top_p, chunk_length, format, and latency.
Authorizations
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
Headers
Specify which TTS model to use. We recommend s2-pro.
Options: s1, s2-pro
Body
Request body for streaming text-to-speech synthesis with timestamp alignment. The request fields match the standard TTS endpoint, but the response is delivered as a Server-Sent Events stream. Each SSE payload includes an audio chunk and, when available, the latest cumulative alignment snapshot for a chunk_seq. Clients should concatenate audio_base64 chunks in arrival order and replace the stored alignment for each chunk_seq whenever a newer snapshot is received.
Text to convert to speech.
Controls expressiveness. Higher is more varied, lower is more consistent.
Range: 0 <= x <= 1
Controls diversity via nucleus sampling.
Range: 0 <= x <= 1
Single speaker: array of reference audio samples
Single speaker: voice model ID string
Speed and volume adjustments for the output.
Text segment size for processing.
Range: 100 <= x <= 300
Normalizes text for English and Chinese, improving stability for numbers.
Output audio format.
Options: wav, pcm, mp3, opus
Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).
MP3 bitrate in kbps. Only applies when format is mp3.
Options: 64, 128, 192
Opus bitrate in bps. -1000 for automatic. Only applies when format is opus.
Options: -1000, 24000, 32000, 48000, 64000
Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.
Options: low, normal, balanced
Maximum audio tokens to generate per text chunk.
Penalty for repeating audio patterns. Values above 1.0 reduce repetition.
Minimum characters before splitting into a new chunk.
Range: 0 <= x <= 100
Use previous audio as context for voice consistency.
Early stopping threshold for batch processing.
Range: 0 <= x <= 1
Response
Server-Sent Events stream. Each message event contains a JSON payload with one base64 audio chunk. Concatenate every audio_base64 chunk in arrival order to reconstruct the complete audio. alignment is the latest cumulative timestamp snapshot for chunk_seq; clients should replace the previous snapshot for that chunk instead of appending segments. chunk_audio_offset_sec can be added to segment times to derive absolute timestamps in the full audio.
One Server-Sent Events message payload for streaming TTS with timestamps. Each event contains one audio chunk. Concatenate all audio_base64 chunks in arrival order to reconstruct the complete audio. alignment is the latest cumulative timestamp snapshot for the reported chunk_seq; clients should replace the previous snapshot for that chunk instead of appending segments.
Base64 encoded audio chunk. Concatenate every chunk in event order to reconstruct the full audio.
Text content described by this event's latest alignment snapshot. Long input may be split into multiple content chunks in one stream.
Latest cumulative timestamp snapshot for chunk_seq. When present, replace the previous alignment for the same chunk_seq; do not append segments. Null means no alignment snapshot has been produced yet or alignment is unavailable.
Sequence number of the text chunk described by alignment. Clients should bucket alignment snapshots by this value.
Range: x >= 0
Absolute start time of this text chunk within the full audio, in seconds.
Range: x >= 0
