Voice Design creates short voice candidates from a natural-language prompt. Use it when you want to explore a voice direction before building a longer text-to-speech workflow or creating a persistent voice model.

API reference: Every parameter for POST /v1/voice-design.

Pricing: Voice Design is billed per successful generation request.

Voice cloning: Create a reusable voice model from reference audio.

When to use it

Explore voice concepts: Generate several candidate voices from a short creative brief.

Preview narration styles: Provide preview text to hear how a generated voice reads a specific line.

Seed creative workflows: Use generated candidates to choose a voice direction before longer TTS production.

Stateless API calls: Get generated audio directly without creating batches, samples, or voice models.

Quick start

Send a JSON request with a prompt and receive generated candidates. Each candidate's audio is currently returned as base64-encoded WAV bytes.
curl --request POST https://api.fish.audio/v1/voice-design \
  --header "Authorization: Bearer $FISH_API_KEY" \
  --header "Content-Type: application/json" \
  --header "model: voice-design-1" \
  --data '{
    "instruction": "Warm, confident studio narrator with a natural tone",
    "reference_text": "Welcome to Fish Audio.",
    "language": "en",
    "n": 2
  }' | jq -r '.candidates[0].audio_base64' | base64 --decode > voice.wav
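
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch rather than an official client: the helper names `build_request`, `first_candidate_wav`, and `design_voice` are hypothetical, while the endpoint, headers, and field names are copied from the curl example above.

```python
import base64
import json
import urllib.request

API_URL = "https://api.fish.audio/v1/voice-design"  # endpoint from the curl example


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST request with the same headers as the curl example."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "model": "voice-design-1",
        },
        method="POST",
    )


def first_candidate_wav(response_body: bytes) -> bytes:
    """Decode the first candidate's base64 audio into raw WAV bytes."""
    body = json.loads(response_body)
    return base64.b64decode(body["candidates"][0]["audio_base64"])


def design_voice(api_key: str, payload: dict, out_path: str = "voice.wav") -> None:
    """Send the request and write the first candidate to out_path (hypothetical helper)."""
    with urllib.request.urlopen(build_request(api_key, payload)) as resp:
        wav = first_candidate_wav(resp.read())
    with open(out_path, "wb") as f:
        f.write(wav)
```

Calling `design_voice(os.environ["FISH_API_KEY"], {"instruction": "Warm, confident studio narrator with a natural tone", "n": 2})` would mirror the curl pipeline. `urllib.request.urlopen` raises `urllib.error.HTTPError` on non-2xx responses, so a failed request does not write a file.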

Prompt and preview text

instruction is the main voice design prompt. Describe the voice, age, delivery, tone, accent, pacing, and context in natural language.
{
  "instruction": "Energetic young presenter, bright tone, crisp diction, friendly but not cartoonish",
  "reference_text": "Here is your weekly product update.",
  "language": "en",
  "n": 3
}
reference_text is optional. When you provide it, candidates read that text so you can compare voices on the same line. Keep it short; the API accepts up to 300 characters.

Parameters

| Field | Default | Notes |
| --- | --- | --- |
| instruction | Required | Voice design prompt. 1 to 2000 characters. |
| reference_text | null | Optional preview text. Up to 300 characters. |
| language | null | Optional language hint such as en, zh, or ja. |
| n | 2 | Number of candidates to generate. Range: 1 to 4. |
| speed | 1.0 | Speaking speed multiplier. Must be greater than 0 and at most 3. |
| num_step | 32 | Diffusion steps. Range: 1 to 128. |
| guidance_scale | 2.0 | Higher values follow the prompt more strongly. Must be at least 0. |
| instruct_guidance_scale | 0.0 | Prompt conditioning guidance. Must be at least 0. |
| seed | null | Optional deterministic seed for candidate generation. |

Voice Design accepts JSON only. Do not send MessagePack, multipart form data, inline reference audio, or service-internal fields such as features, features_json_file, or include_audio_base64.
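
The limits in the Parameters section can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The sketch below is an illustration only: `validate_voice_design` is a hypothetical helper, and it assumes the character limits are counted in Unicode code points, which the server may measure differently.

```python
def validate_voice_design(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passes these checks."""
    errors = []

    instruction = payload.get("instruction")
    if not isinstance(instruction, str) or not 1 <= len(instruction) <= 2000:
        errors.append("instruction: required, 1 to 2000 characters")

    reference_text = payload.get("reference_text")
    if reference_text is not None and len(reference_text) > 300:
        errors.append("reference_text: at most 300 characters")

    n = payload.get("n", 2)
    if not isinstance(n, int) or not 1 <= n <= 4:
        errors.append("n: integer from 1 to 4")

    speed = payload.get("speed", 1.0)
    if not 0 < speed <= 3:
        errors.append("speed: greater than 0 and at most 3")

    num_step = payload.get("num_step", 32)
    if not isinstance(num_step, int) or not 1 <= num_step <= 128:
        errors.append("num_step: integer from 1 to 128")

    for name in ("guidance_scale", "instruct_guidance_scale"):
        if payload.get(name, 0.0) < 0:
            errors.append(f"{name}: must be at least 0")

    # Fields the documentation says not to send.
    for name in ("features", "features_json_file", "include_audio_base64"):
        if name in payload:
            errors.append(f"{name}: service-internal field, do not send")

    return errors
```

The server remains the source of truth for validation; a check like this only catches the documented ranges early.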

Response

The response contains one or more generated candidates:
{
  "candidates": [
    {
      "id": "candidate-id",
      "index": 0,
      "audio_base64": "UklGRg...",
      "sample_rate": 44100,
      "duration_ms": 3100,
      "text": "Welcome to Fish Audio.",
      "language": "en"
    }
  ]
}
Use index to preserve the order returned by the model. id is a stable candidate identifier for this response. Optional fields such as text, instruct, and language appear only when available.
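
One way to consume this response is to sort candidates by index, decode each audio_base64 value, and write one WAV file per candidate. The following is a sketch under stated assumptions: the function names are hypothetical, and `wav_duration_ms` assumes the audio is uncompressed PCM, which is what Python's `wave` module reads; the source only states that the payload is WAV.

```python
import base64
import io
import wave


def decode_candidates(response: dict) -> list[tuple[int, bytes]]:
    """Return (index, wav_bytes) pairs sorted by the index field."""
    ordered = sorted(response["candidates"], key=lambda c: c["index"])
    return [(c["index"], base64.b64decode(c["audio_base64"])) for c in ordered]


def wav_duration_ms(wav_bytes: bytes) -> int:
    """Compute the duration from the WAV header (assumes PCM audio)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return round(1000 * w.getnframes() / w.getframerate())


def save_candidates(response: dict, prefix: str = "candidate") -> list[str]:
    """Write each candidate to '<prefix>-<index>.wav' and return the paths."""
    paths = []
    for index, wav_bytes in decode_candidates(response):
        path = f"{prefix}-{index}.wav"
        with open(path, "wb") as f:
            f.write(wav_bytes)
        paths.append(path)
    return paths
```

Comparing `wav_duration_ms` against the reported duration_ms is a cheap sanity check that the decoded bytes are intact.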

Billing and errors

Voice Design is billed once per successful generation request, not once per candidate. Authentication errors, validation errors, insufficient API credit, concurrency limits, upstream service errors, and empty candidate responses are not billed. For the full error format and retry guidance, see Errors.
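
Because billing is per successful request rather than per candidate, requesting more candidates per call means fewer billed requests for the same number of voices. The sketch below splits a desired candidate count into per-request n values using the documented maximum of 4; `plan_requests` is a hypothetical helper and makes no assumption about the price of a request.

```python
MAX_CANDIDATES_PER_REQUEST = 4  # upper bound of n from the Parameters section


def plan_requests(total_candidates: int) -> list[int]:
    """Split a desired candidate count into the fewest per-request n values."""
    if total_candidates < 1:
        raise ValueError("total_candidates must be at least 1")
    full, rest = divmod(total_candidates, MAX_CANDIDATES_PER_REQUEST)
    return [MAX_CANDIDATES_PER_REQUEST] * full + ([rest] if rest else [])
```

For example, ten candidates need three requests with n set to 4, 4, and 2, instead of ten requests with n set to 1.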