Convert text to speech
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
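As a minimal sketch of the header described above, the helper below builds a standard HTTP Authorization header from a token. The function name build_auth_headers is hypothetical, and the whitespace check is an added safeguard rather than a documented rule.

```python
def build_auth_headers(token: str) -> dict[str, str]:
    """Return HTTP headers carrying the token as 'Bearer <token>'."""
    # Reject empty tokens and tokens with whitespace, which would
    # produce a malformed header value (an assumption, not an API rule).
    if not token or any(ch.isspace() for ch in token):
        raise ValueError("token must be a non-empty string without whitespace")
    return {"Authorization": f"Bearer {token}"}


headers = build_auth_headers("my-auth-token")
print(headers)
```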
Specify which TTS model to use. We recommend s2
Allowed values: s1, s2-pro
Request body for text-to-speech synthesis. Supports both single-speaker and multi-speaker synthesis.
Provide either reference_id (string) pointing to a voice model, or references (array of ReferenceAudio) for zero-shot cloning.
For multi-speaker synthesis, provide:
reference_id: array of voice model IDs, e.g., ["speaker-0-id", "speaker-1-id"]
text: use speaker tags <|speaker:0|>, <|speaker:1|>, etc. to indicate speaker changes, e.g., "<|speaker:0|>Hello!<|speaker:1|>Hi there!"
Alternatively, for zero-shot multi-speaker:
references: 2D array where each inner array contains references for one speaker
reference_id: array of identifiers (can be arbitrary strings for zero-shot)
{
"text": "<|speaker:0|>Good morning!<|speaker:1|>Good morning! How are you?<|speaker:0|>I'm great, thanks!",
"reference_id": ["model-id-alice", "model-id-bob"]
}
Text to convert to speech.
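The multi-speaker body shown above can be assembled programmatically. The sketch below assumes that speaker tag N refers to the N-th entry of reference_id, which the "speaker-0-id" / "speaker-1-id" naming suggests but does not state outright. The helper name build_multi_speaker_body is hypothetical.

```python
import json


def build_multi_speaker_body(turns, voice_ids):
    """Build a request body from (speaker_index, utterance) pairs.

    turns:     list of (int, str) tuples in speaking order
    voice_ids: list of voice model IDs, indexed by speaker number (assumed)
    """
    for idx, _ in turns:
        if not 0 <= idx < len(voice_ids):
            raise ValueError(f"speaker index {idx} has no matching voice ID")
    # Prefix every utterance with its speaker tag and concatenate.
    text = "".join(f"<|speaker:{idx}|>{utterance}" for idx, utterance in turns)
    return {"text": text, "reference_id": list(voice_ids)}


body = build_multi_speaker_body(
    [(0, "Good morning!"), (1, "Good morning! How are you?"), (0, "I'm great, thanks!")],
    ["model-id-alice", "model-id-bob"],
)
print(json.dumps(body, indent=2))
```

Validating the speaker indices before sending catches a tag that points at a missing voice, which would otherwise surface only as a server-side error.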
Controls expressiveness. Higher is more varied, lower is more consistent.
Range: 0 <= x <= 1
Controls diversity via nucleus sampling.
Range: 0 <= x <= 1
Single speaker: array of reference audio samples
Single speaker: voice model ID string
Speed and volume adjustments for the output.
Text segment size for processing.
Range: 100 <= x <= 300
Normalizes text for English and Chinese, improving stability for numbers.
Output audio format.
Allowed values: wav, pcm, mp3, opus
Audio sample rate in Hz. When null, uses the format's default (44100 Hz for most formats, 48000 Hz for opus).
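The null-handling rule for the sample rate can be sketched on the client side as follows. This assumes every format other than opus defaults to 44100 Hz; the text above only says "most formats", so treat the non-opus branch as an approximation. The function name resolve_sample_rate is hypothetical.

```python
def resolve_sample_rate(audio_format, sample_rate=None):
    """Return the effective sample rate in Hz for a given output format."""
    if sample_rate is not None:
        # An explicit value always wins over the format default.
        return sample_rate
    # Assumption: all non-opus formats share the 44100 Hz default.
    return 48000 if audio_format == "opus" else 44100


print(resolve_sample_rate("opus"))
print(resolve_sample_rate("mp3"))
print(resolve_sample_rate("wav", 22050))
```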
MP3 bitrate in kbps. Only applies when format is mp3.
Allowed values: 64, 128, 192
Opus bitrate in bps. -1000 for automatic. Only applies when format is opus.
Allowed values: -1000, 24, 32, 48, 64
Latency-quality trade-off. normal: best quality, balanced: reduced latency, low: lowest latency.
Allowed values: low, normal, balanced
Maximum audio tokens to generate per text chunk.
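The enumerated output options above lend themselves to a client-side check before a request is sent. The sketch below is one way to do it. The argument names (audio_format, mp3_bitrate, opus_bitrate, latency) are illustrative and not taken from this section, and flagging a bitrate supplied for the wrong format is a design choice here: the service may simply ignore it.

```python
FORMATS = {"wav", "pcm", "mp3", "opus"}
MP3_BITRATES = {64, 128, 192}
OPUS_BITRATES = {-1000, 24, 32, 48, 64}
LATENCY_MODES = {"low", "normal", "balanced"}


def check_output_options(audio_format, mp3_bitrate=None, opus_bitrate=None, latency=None):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if audio_format not in FORMATS:
        problems.append(f"unknown format: {audio_format!r}")
    if mp3_bitrate is not None:
        if audio_format != "mp3":
            problems.append("MP3 bitrate only applies when format is mp3")
        elif mp3_bitrate not in MP3_BITRATES:
            problems.append(f"MP3 bitrate must be one of {sorted(MP3_BITRATES)}")
    if opus_bitrate is not None:
        if audio_format != "opus":
            problems.append("Opus bitrate only applies when format is opus")
        elif opus_bitrate not in OPUS_BITRATES:
            problems.append(f"Opus bitrate must be one of {sorted(OPUS_BITRATES)}")
    if latency is not None and latency not in LATENCY_MODES:
        problems.append(f"latency must be one of {sorted(LATENCY_MODES)}")
    return problems


print(check_output_options("mp3", mp3_bitrate=128, latency="balanced"))
print(check_output_options("wav", mp3_bitrate=128))
```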
Penalty for repeating audio patterns. Values above 1.0 reduce repetition.
Minimum characters before splitting into a new chunk.
Range: 0 <= x <= 100
Use previous audio as context for voice consistency.
Early stopping threshold for batch processing.
Range: 0 <= x <= 1
Request fulfilled, document follows