API reference
Every parameter for POST /v1/voice-design.
Pricing
Voice Design is billed per successful generation request.
Voice cloning
Create a reusable voice model from reference audio.
When to use it
Explore voice concepts
Generate several candidate voices from a short creative brief.
Preview narration styles
Provide preview text to hear how a generated voice reads a specific line.
Seed creative workflows
Use generated candidates to choose a voice direction before longer TTS
production.
Stateless API calls
Get generated audio directly without creating batches, samples, or voice
models.
Quick start
Send a JSON request with a prompt and receive generated candidates. The current candidate audio payload is WAV bytes encoded as base64.
Prompt and preview text
instruction is the main voice design prompt. Describe the voice, age, delivery, tone, accent, pacing, and context in natural language.
reference_text is optional. When you provide it, candidates read that text so you can compare voices on the same line. Keep it short; the API accepts up to 300 characters.
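As a rough illustration of the request described above, the sketch below builds a JSON body from instruction and an optional reference_text, checking the documented length limits before sending. The host in API_URL and the bearer-token header are assumptions made for the example; this page does not specify either, so substitute the values from your account.

```python
import json
import urllib.request

# Assumption: placeholder host. Only the path /v1/voice-design comes from this page.
API_URL = "https://api.example.com/v1/voice-design"


def build_request_body(instruction, reference_text=None):
    """Build the JSON body, enforcing the documented length limits."""
    if not 1 <= len(instruction) <= 2000:
        raise ValueError("instruction must be 1 to 2000 characters")
    if reference_text is not None and len(reference_text) > 300:
        raise ValueError("reference_text accepts up to 300 characters")
    body = {"instruction": instruction}
    if reference_text is not None:
        body["reference_text"] = reference_text
    return body


def design_voice(body, api_key):
    """POST the body and return the parsed JSON response.

    Assumption: bearer-token authentication; check your account's auth scheme.
    """
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call might look like design_voice(build_request_body("A warm, calm narrator in her forties", "Welcome back."), api_key). Using the same reference_text across requests keeps candidates comparable on one line.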
Parameters
| Field | Default | Notes |
|---|---|---|
| instruction | Required | Voice design prompt. 1 to 2000 characters. |
| reference_text | null | Optional preview text. Up to 300 characters. |
| language | null | Optional language hint such as en, zh, or ja. |
| n | 2 | Number of candidates to generate. Range: 1 to 4. |
| speed | 1.0 | Speaking speed multiplier. Must be greater than 0 and at most 3. |
| num_step | 32 | Diffusion steps. Range: 1 to 128. |
| guidance_scale | 2.0 | Higher values follow the prompt more strongly. Must be at least 0. |
| instruct_guidance_scale | 0.0 | Prompt conditioning guidance. Must be at least 0. |
| seed | null | Optional deterministic seed for candidate generation. |
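The ranges in the table can be checked on the client before a request is sent. The sketch below is one possible way to do that; the helper name build_params is hypothetical, and the checks mirror only the constraints listed above. The server remains the authority on what it accepts.

```python
# Documented defaults, used here only to know which optional fields exist.
OPTIONAL_DEFAULTS = {
    "reference_text": None,
    "language": None,
    "n": 2,
    "speed": 1.0,
    "num_step": 32,
    "guidance_scale": 2.0,
    "instruct_guidance_scale": 0.0,
    "seed": None,
}


def build_params(instruction, **overrides):
    """Return a request body containing instruction plus validated overrides."""
    unknown = set(overrides) - set(OPTIONAL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    if not 1 <= len(instruction) <= 2000:
        raise ValueError("instruction must be 1 to 2000 characters")

    # Merge with defaults so every range check sees a concrete value.
    merged = {**OPTIONAL_DEFAULTS, **overrides}
    if merged["reference_text"] is not None and len(merged["reference_text"]) > 300:
        raise ValueError("reference_text accepts up to 300 characters")
    if not 1 <= merged["n"] <= 4:
        raise ValueError("n must be in the range 1 to 4")
    if not 0 < merged["speed"] <= 3:
        raise ValueError("speed must be greater than 0 and at most 3")
    if not 1 <= merged["num_step"] <= 128:
        raise ValueError("num_step must be in the range 1 to 128")
    if merged["guidance_scale"] < 0:
        raise ValueError("guidance_scale must be at least 0")
    if merged["instruct_guidance_scale"] < 0:
        raise ValueError("instruct_guidance_scale must be at least 0")

    # Send only what the caller set; omitted fields fall back to server defaults.
    return {"instruction": instruction, **overrides}
```

For example, build_params("A gravelly old sailor", n=4, seed=7) returns a body with three fields, while build_params("A gravelly old sailor", n=5) raises ValueError.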
Response
The response contains one or more generated candidates. Each candidate includes index, which preserves the order returned by the model, and id, a stable candidate identifier for this response. Optional fields such as text, instruct, and language appear only when available.
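Because the candidate audio is WAV bytes encoded as base64, a client typically decodes each candidate before playing or saving it. The sketch below shows one way to do that. The top-level key candidates and the audio field name audio_base64 are assumptions for illustration; this page names only index, id, and the optional fields, so confirm the real names against an actual response.

```python
import base64
from pathlib import Path

# Assumptions: neither name below is confirmed by this page.
CANDIDATES_KEY = "candidates"
AUDIO_FIELD = "audio_base64"


def save_candidates(response, out_dir):
    """Decode each candidate's base64 WAV payload and write it to out_dir.

    Returns the written paths, ordered by the candidate's index field.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for candidate in sorted(response[CANDIDATES_KEY], key=lambda c: c["index"]):
        wav_bytes = base64.b64decode(candidate[AUDIO_FIELD])
        target = out_path / f"candidate_{candidate['index']}.wav"
        target.write_bytes(wav_bytes)
        written.append(target)
    return written
```

Files are named by index rather than id here because the format of id is not described on this page.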

