POST /v1/tts
Text to Speech
curl --request POST \
  --url https://api.fish.audio/v1/tts \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'model: <model>' \
  --data '{
  "text": "<string>",
  "temperature": 0.9,
  "top_p": 0.9,
  "references": [
    {
      "text": "<string>"
    }
  ],
  "reference_id": "<string>",
  "prosody": {
    "speed": 1,
    "volume": 0
  },
  "chunk_length": 200,
  "normalize": true,
  "format": "mp3",
  "sample_rate": 44100,
  "mp3_bitrate": 128,
  "opus_bitrate": 32,
  "latency": "normal"
}'
This endpoint only accepts application/json and application/msgpack.
For best results, upload reference audio using the create model endpoint before calling this one. This improves speech quality and reduces latency.
To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the instructions.
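The direct-upload path described above could be sketched as follows. A MessagePack body can carry the reference clip as raw bytes, which plain JSON cannot. The `audio` field name, both helper names, and the use of the third-party `msgpack` package are assumptions that this page does not confirm, so check the ReferenceAudio schema before relying on them.

```python
def build_inline_reference_body(text, clip_bytes, clip_transcript):
    """Build a request body that embeds one reference clip directly.

    Assumption: each ReferenceAudio object takes the clip under "audio"
    (raw bytes) next to the "text" field shown in the curl example.
    """
    return {
        "text": text,
        "references": [{"audio": clip_bytes, "text": clip_transcript}],
        "format": "mp3",
    }


def encode_msgpack(body):
    """Serialize the body for Content-Type: application/msgpack."""
    import msgpack  # third-party package: pip install msgpack

    return msgpack.packb(body, use_bin_type=True)


# Placeholder bytes stand in for the contents of a real audio file.
body = build_inline_reference_body("Hello!", b"<audio bytes>", "Sample transcript")
```

The packed bytes would then be sent as the request body with the `Content-Type: application/msgpack` header in place of `application/json`.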
Audio formats supported:
  • WAV / PCM
    • Sample Rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • 16-bit, mono
  • MP3
    • Sample Rate: 32kHz, 44.1kHz
    • Default Sample Rate: 44.1kHz
    • mono
    • Bitrate: 64kbps, 128kbps (default), 192kbps
  • Opus
    • Sample Rate: 48kHz
    • Default Sample Rate: 48kHz
    • mono
    • Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
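As a minimal sketch, the combinations listed above can be checked on the client before a request is sent. The function name is hypothetical, and the sketch assumes that `sample_rate` is passed in Hz (for example 44100 for 44.1kHz), which this page does not state explicitly.

```python
# Allowed values transcribed from the list above (rates in Hz, bitrates in kbps).
SAMPLE_RATES = {
    "wav": {8000, 16000, 24000, 32000, 44100},
    "pcm": {8000, 16000, 24000, 32000, 44100},
    "mp3": {32000, 44100},
    "opus": {48000},
}
DEFAULT_SAMPLE_RATE = {"wav": 44100, "pcm": 44100, "mp3": 44100, "opus": 48000}
BITRATES = {"mp3": {64, 128, 192}, "opus": {-1000, 24, 32, 48, 64}}


def check_audio_options(fmt, sample_rate=None, bitrate=None):
    """Return (sample_rate, bitrate), raising ValueError if unsupported."""
    if fmt not in SAMPLE_RATES:
        raise ValueError(f"unsupported format: {fmt}")
    if sample_rate is None:
        # Fall back to the documented default for this format.
        sample_rate = DEFAULT_SAMPLE_RATE[fmt]
    if sample_rate not in SAMPLE_RATES[fmt]:
        raise ValueError(f"{fmt} does not support {sample_rate} Hz")
    # WAV and PCM have no bitrate options, so any bitrate is rejected for them.
    if bitrate is not None and bitrate not in BITRATES.get(fmt, set()):
        raise ValueError(f"{fmt} does not support bitrate {bitrate}")
    return sample_rate, bitrate
```

For example, `check_audio_options("mp3", 48000)` raises `ValueError`, because 48kHz is listed only for Opus.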

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Headers

model
enum<string>
default:s1
required

Specifies which TTS model to use. We recommend s1.

Available options:
s1,
speech-1.6,
speech-1.5
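The headers described so far could be assembled along these lines. This is a sketch only: `build_headers` is a hypothetical helper, not part of any official SDK.

```python
MODELS = ("s1", "speech-1.6", "speech-1.5")
CONTENT_TYPES = ("application/json", "application/msgpack")


def build_headers(token, model="s1", content_type="application/json"):
    """Return the request headers, rejecting values this page does not list."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": content_type,
        "model": model,
    }
```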

Body

text
string
required

Text to be converted to speech

temperature
number
default:0.9

Controls randomness in the speech generation. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.1) make it more deterministic. We recommend 0.9 for the s1 model.

Required range: 0 <= x <= 1

top_p
number
default:0.9

Controls diversity via nucleus sampling. Lower values (e.g., 0.1) make the output more focused, while higher values (e.g., 1.0) allow more diversity. We recommend 0.9 for the s1 model.

Required range: 0 <= x <= 1
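Both sampling parameters share the same closed range, so a client could validate them together before building the body. A small sketch, with a hypothetical helper name:

```python
def sampling_params(temperature=0.9, top_p=0.9):
    """Return the two sampling fields, enforcing 0 <= x <= 1 on each."""
    for name, value in (("temperature", temperature), ("top_p", top_p)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must satisfy 0 <= x <= 1, got {value}")
    return {"temperature": temperature, "top_p": top_p}
```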
references
ReferenceAudio · object[] | null

Reference audio clips to be used for the speech. This requires MessagePack serialization and overrides reference_voices and reference_texts.

reference_id
string | null

ID of the reference model to be used for the speech

prosody
object | null

Prosody settings (speed and volume) to be used for the speech

chunk_length
integer
default:200

Chunk length to be used for the speech

Required range: 100 <= x <= 300

normalize
boolean
default:true

Whether to normalize the speech. This reduces latency but may reduce performance on numbers and dates.

format
enum<string>
default:mp3

Format to be used for the speech

Available options:
wav,
pcm,
mp3,
opus

sample_rate
integer | null

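Sample rate to be used for the speech. Supported values and the default depend on the chosen format (see the supported audio formats above).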

mp3_bitrate
enum<integer>
default:128

MP3 bitrate, in kbps, to be used for the speech

Available options:
64,
128,
192

opus_bitrate
enum<integer>
default:32

Opus bitrate, in kbps, to be used for the speech. -1000 selects an automatic bitrate.

Available options:
-1000,
24,
32,
48,
64

latency
enum<string>
default:normal

Latency mode to be used for the speech. balanced reduces latency but may lead to performance degradation.

Available options:
normal,
balanced
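Putting several of the body fields above together, a JSON body for a pre-uploaded voice might be built like this. It is a sketch under assumptions: `build_json_body` and its keyword names are hypothetical, and the `prosody` keys are taken from the curl example rather than from a documented schema.

```python
import json


def build_json_body(text, reference_id=None, chunk_length=200,
                    normalize=True, latency="normal", speed=1.0, volume=0.0):
    """Encode a JSON request body, enforcing the documented constraints."""
    if not 100 <= chunk_length <= 300:
        raise ValueError("chunk_length must satisfy 100 <= x <= 300")
    if latency not in ("normal", "balanced"):
        raise ValueError(f"unknown latency mode: {latency}")
    body = {
        "text": text,
        "chunk_length": chunk_length,
        "normalize": normalize,
        "latency": latency,
        # Key names follow the curl example at the top of the page.
        "prosody": {"speed": speed, "volume": volume},
    }
    if reference_id is not None:
        # Only include the voice ID when one was supplied.
        body["reference_id"] = reference_id
    return json.dumps(body).encode("utf-8")
```

Leaving `reference_id` out of the body when it is unset, instead of sending an explicit null, is a design choice of this sketch; the field is typed as `string | null`, so either form should be acceptable.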

Response

Request fulfilled, document follows
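The page does not document the response schema. The sketch below assumes that a successful response body is the encoded audio itself and writes it to disk. Both function names are hypothetical; only the URL and header names come from the curl example.

```python
import json
import urllib.request

TTS_URL = "https://api.fish.audio/v1/tts"


def make_tts_request(token, text, model="s1"):
    """Build (but do not send) a POST request for MP3 output."""
    body = json.dumps({"text": text, "format": "mp3"}).encode("utf-8")
    # urllib normalizes header capitalization; HTTP header names are
    # case-insensitive, so "model" is still received as intended.
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "model": model,
        },
    )


def synthesize_to_file(token, text, path):
    """Send the request and save the response bytes (assumed to be audio)."""
    request = make_tts_request(token, text)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `synthesize_to_file("<token>", "Hello!", "hello.mp3")` would perform the network call. `urlopen` raises `urllib.error.HTTPError` on non-2xx responses, which a real client would want to handle.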