Stream generated speech and timestamp alignment events
| Field | Type | Description |
|---|---|---|
| `audio_base64` | string | One base64-encoded audio chunk. Concatenate all chunks in arrival order to reconstruct the complete audio. |
| `content` | string | The text covered by this event's generated audio chunk. Long input can be split into multiple content chunks. |
| `alignment` | object \| null | Timestamp alignment for this content chunk. Audio-only continuation events can return null. |
When `latency` is set to `balanced`, long input can be split into several text chunks. Each text chunk may produce one non-null alignment event, followed by one or more audio-only events where `alignment` is null.
The `alignment` object contains the generated audio duration (`audio_duration`) and ordered timing segments (`segments`):

`start` and `end` are measured in seconds from the start of that content chunk's generated audio. Use `audio_duration` to offset later chunks when you need a single global timeline.
Parse each `data:` line as JSON, append every decoded `audio_base64` chunk to your audio buffer, and store non-null alignments separately.

Long input can be split into multiple content chunks. Treat audio and alignment as two related streams:

- Append every `audio_base64` chunk in event order. Do this even when `alignment` is null.
- Collect non-null `alignment` objects for timing data.
- `audio_base64` chunks are transport chunks, not sentence or word boundaries.
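This consumption pattern can be sketched in Python as below. It assumes each SSE `data:` line carries the JSON payload described above; the exact event framing may differ depending on your SSE client library.

```python
import base64
import json

def consume_events(lines):
    """Collect audio bytes and non-null alignments from raw SSE lines."""
    audio = bytearray()
    alignments = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alives, comments, other SSE fields
        event = json.loads(line[len("data: "):])
        # Always append audio, even for audio-only events where alignment is null.
        audio.extend(base64.b64decode(event["audio_base64"]))
        if event.get("alignment") is not None:
            alignments.append(event["alignment"])
    return bytes(audio), alignments
```

Keeping the audio buffer and the alignment list separate mirrors the two-streams view above: transport chunks accumulate into one audio blob, while timing data stays ordered per content chunk.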
Do not try to align each audio chunk individually. Use `alignment.segments` for text timing, and use `alignment.audio_duration` to offset later aligned content chunks. For example, if the first alignment reports `audio_duration: 16.24`, add 16.24 seconds to every segment in the next non-null alignment before rendering it on the complete audio timeline.
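The offsetting rule can be written as a small helper. This is a sketch: it assumes each segment is a dict with numeric `start` and `end` keys as described above.

```python
def to_global_timeline(alignments):
    """Shift each alignment's segments by the total duration of earlier chunks."""
    offset = 0.0
    global_segments = []
    for alignment in alignments:
        for seg in alignment["segments"]:
            shifted = dict(seg)
            shifted["start"] = seg["start"] + offset
            shifted["end"] = seg["end"] + offset
            global_segments.append(shifted)
        # Later chunks begin after all audio generated so far.
        offset += alignment["audio_duration"]
    return global_segments
```

Note that the offset advances by `audio_duration`, not by the last segment's `end`, since a chunk's audio can extend past its final timed segment.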
Use `opus` with the default 48 kHz sample rate when your client supports it. Opus is designed for streaming and gives the best balance of quality, latency, and bandwidth for this endpoint.
`wav` and `pcm` avoid lossy codec artifacts and are straightforward to align, but they produce much larger payloads. Use them when you need uncompressed audio, direct sample-level processing, or a playback pipeline that already expects raw audio.
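One reason raw PCM is straightforward to align is that its duration is a pure function of byte count. A sketch, assuming 16-bit mono samples (the sample width and channel count are assumptions, not documented here):

```python
def pcm_duration_seconds(num_bytes, sample_rate=44100, bytes_per_sample=2, channels=1):
    """Duration of a raw PCM buffer, assuming 16-bit mono at 44.1 kHz by default."""
    return num_bytes / (sample_rate * bytes_per_sample * channels)
```

This lets you cross-check a chunk's byte length against the `audio_duration` reported in its alignment.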
This endpoint accepts the same TTS request fields as the Text to Speech API, including `reference_id`, `references`, `prosody`, `temperature`, `top_p`, `chunk_length`, `format`, and `latency`.

Authentication uses a Bearer header of the form `Bearer <token>`, where `<token>` is your auth token.
Specify which TTS model to use. We recommend `s2-pro`.

Available options: `s1`, `s2-pro`

Request body for text-to-speech synthesis. Supports single-speaker synthesis on all compatible TTS models. Multi-speaker dialogue synthesis is only available with the S2-Pro model.
Provide either reference_id (string) pointing to a voice model, or references (array of ReferenceAudio) for zero-shot cloning.
For multi-speaker synthesis, provide:

- `reference_id`: array of voice model IDs, e.g., `["speaker-0-id", "speaker-1-id"]`
- `text`: use speaker tags `<|speaker:0|>`, `<|speaker:1|>`, etc. to indicate speaker changes, e.g., `"<|speaker:0|>Hello!<|speaker:1|>Hi there!"`

Alternatively, for zero-shot multi-speaker:

- `references`: 2D array where each inner array contains references for one speaker
- `reference_id`: array of identifiers (can be arbitrary strings for zero-shot)

```json
{
  "text": "<|speaker:0|>Good morning!<|speaker:1|>Good morning! How are you?<|speaker:0|>I'm great, thanks!",
  "reference_id": ["model-id-alice", "model-id-bob"]
}
```

Text to convert to speech.
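A minimal sketch of sending a multi-speaker payload like the one above, using only the Python standard library. The endpoint URL here is an assumption for illustration; check the API reference for the actual streaming-with-timestamps path, and the model IDs are placeholders.

```python
import json
import urllib.request

API_URL = "https://api.fish.audio/v1/tts"  # assumed path, not documented here

def build_request(token, payload):
    """Build a POST request with the Bearer auth scheme described above."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

payload = {
    "text": "<|speaker:0|>Good morning!<|speaker:1|>Good morning! How are you?",
    "reference_id": ["model-id-alice", "model-id-bob"],  # placeholder model IDs
    "model": "s2-pro",  # multi-speaker dialogue requires S2-Pro
    "format": "opus",
    "latency": "balanced",
}
```

Pass the built request to `urllib.request.urlopen` and read the response body line by line to consume the SSE stream.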
Controls expressiveness. Higher is more varied, lower is more consistent.
Range: `0 <= x <= 1`

Controls diversity via nucleus sampling.

Range: `0 <= x <= 1`

Single speaker: array of reference audio samples.
Single speaker: voice model ID string
Speed and volume adjustments for the output.
Text segment size for processing.
Range: `100 <= x <= 300`

Normalizes text for English and Chinese, improving stability for numbers.
Output audio format.
Available options: `wav`, `pcm`, `mp3`, `opus`

Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).
MP3 bitrate in kbps. Only applies when format is mp3.
Available options: `64`, `128`, `192`

Opus bitrate in bps. `-1000` for automatic. Only applies when `format` is `opus`.

Available options: `-1000`, `24000`, `32000`, `48000`, `64000`

Latency-quality trade-off. `normal`: best quality; `balanced`: reduced latency; `low`: lowest latency.

Available options: `low`, `normal`, `balanced`

Maximum audio tokens to generate per text chunk.
Penalty for repeating audio patterns. Values above 1.0 reduce repetition.
Minimum characters before splitting into a new chunk.
Range: `0 <= x <= 100`

Use previous audio as context for voice consistency.
Early stopping threshold for batch processing.
Range: `0 <= x <= 1`

Server-Sent Events stream. Each message event contains a JSON payload with one base64 audio chunk. Concatenate every `audio_base64` chunk in arrival order to reconstruct the complete audio. In balanced streaming, long input can be split into multiple text chunks. Each text chunk may produce a non-null alignment event, followed by one or more audio-only events for the same content where `alignment` is null. Clients should collect every non-null alignment in order instead of keeping only the first or last event.
One Server-Sent Events message payload for streaming TTS with timestamps. Each event contains one audio chunk. Concatenate all audio_base64 chunks in arrival order to reconstruct the complete audio. Long input may be split into multiple content chunks. Each chunk can have its own non-null alignment, followed by additional audio-only events for that chunk where alignment is null. Collect every non-null alignment in order.
Base64 encoded audio chunk. Concatenate every chunk in event order to reconstruct the full audio.
Text content covered by this event's text chunk. Long input may be split into multiple content chunks in one stream.
Timestamp information for this content chunk. Balanced streaming can produce multiple non-null alignments, one for each text chunk. Additional audio events for the same content chunk may return null.