# Emotion Reference
Source: https://docs.fish.audio/api-reference/emotion-reference
Complete reference guide for all 64+ emotional expressions in Fish Audio
## Complete Emotion List
This reference guide provides a comprehensive list of all 64+ supported emotional expressions and voice styles available in Fish Audio's S1 TTS model. The latest S2-Pro model supports free-form natural language emotion tags.
The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
## Basic Emotions (24)
| Emotion | Tag | Description | Example Context |
| ----------- | --------------- | ----------------------- | --------------------------- |
| Happy | `(happy)` | Cheerful, upbeat tone | Good news, greetings |
| Sad | `(sad)` | Melancholic, downcast | Sympathy, bad news |
| Angry | `(angry)` | Frustrated, aggressive | Complaints, warnings |
| Excited | `(excited)` | Energetic, enthusiastic | Announcements, celebrations |
| Calm | `(calm)` | Peaceful, relaxed | Instructions, meditation |
| Nervous | `(nervous)` | Anxious, uncertain | Disclaimers, apologies |
| Confident | `(confident)` | Assertive, self-assured | Presentations, sales |
| Surprised | `(surprised)` | Shocked, amazed | Reactions, discoveries |
| Satisfied | `(satisfied)` | Content, pleased | Confirmations, reviews |
| Delighted | `(delighted)` | Very pleased, joyful | Celebrations, compliments |
| Scared | `(scared)` | Frightened, fearful | Warnings, horror stories |
| Worried | `(worried)` | Concerned, troubled | Concerns, questions |
| Upset | `(upset)` | Disturbed, distressed | Complaints, problems |
| Frustrated | `(frustrated)` | Annoyed, exasperated | Technical issues, delays |
| Depressed | `(depressed)` | Very sad, hopeless | Serious topics |
| Empathetic | `(empathetic)` | Understanding, caring | Support, counseling |
| Embarrassed | `(embarrassed)` | Ashamed, awkward | Apologies, mistakes |
| Disgusted | `(disgusted)` | Repelled, revolted | Negative reviews |
| Moved | `(moved)` | Emotionally touched | Heartfelt moments |
| Proud | `(proud)` | Accomplished, satisfied | Achievements, praise |
| Relaxed | `(relaxed)` | At ease, casual | Casual conversation |
| Grateful | `(grateful)` | Thankful, appreciative | Thanks, appreciation |
| Curious | `(curious)` | Inquisitive, interested | Questions, exploration |
| Sarcastic | `(sarcastic)` | Ironic, mocking | Humor, criticism |
## Advanced Emotions (25)
| Emotion | Tag | Description | Example Context |
| ------------- | ----------------- | ------------------------ | ---------------------- |
| Disdainful | `(disdainful)` | Contemptuous, scornful | Criticism, rejection |
| Unhappy | `(unhappy)` | Discontent, dissatisfied | Complaints, feedback |
| Anxious | `(anxious)` | Very worried, uneasy | Urgent matters |
| Hysterical | `(hysterical)` | Uncontrollably emotional | Extreme reactions |
| Indifferent | `(indifferent)` | Uncaring, neutral | Neutral responses |
| Uncertain | `(uncertain)` | Doubtful, unsure | Speculation, questions |
| Doubtful | `(doubtful)` | Skeptical, questioning | Disbelief, questioning |
| Confused | `(confused)` | Puzzled, perplexed | Clarification requests |
| Disappointed | `(disappointed)` | Let down, dissatisfied | Unmet expectations |
| Regretful | `(regretful)` | Sorry, remorseful | Apologies, mistakes |
| Guilty | `(guilty)` | Culpable, responsible | Confessions, apologies |
| Ashamed | `(ashamed)` | Deeply embarrassed | Serious mistakes |
| Jealous | `(jealous)` | Envious, resentful | Comparisons |
| Envious | `(envious)` | Wanting what others have | Admiration with desire |
| Hopeful | `(hopeful)` | Optimistic about future | Future plans |
| Optimistic | `(optimistic)` | Positive outlook | Encouragement |
| Pessimistic | `(pessimistic)` | Negative outlook | Warnings, doubts |
| Nostalgic | `(nostalgic)` | Longing for the past | Memories, stories |
| Lonely | `(lonely)` | Isolated, alone | Emotional content |
| Bored | `(bored)` | Uninterested, weary | Disinterest |
| Contemptuous | `(contemptuous)` | Showing contempt | Strong criticism |
| Sympathetic | `(sympathetic)` | Showing sympathy | Condolences |
| Compassionate | `(compassionate)` | Showing deep care | Support, help |
| Determined | `(determined)` | Resolved, decided | Goals, commitments |
| Resigned | `(resigned)` | Accepting defeat | Giving up, acceptance |
## Tone Markers (5)
| Tone | Tag | Description | When to Use |
| ---------- | ------------------- | -------------------- | -------------------------- |
| Hurried | `(in a hurry tone)` | Rushed, urgent | Time-sensitive information |
| Shouting | `(shouting)` | Loud, calling out | Getting attention |
| Screaming | `(screaming)` | Very loud, panicked | Emergencies, fear |
| Whispering | `(whispering)` | Very soft, secretive | Secrets, quiet scenes |
| Soft | `(soft tone)` | Gentle, quiet | Comfort, lullabies |
## Audio Effects (10)
| Effect | Tag | Description | Suggested Text |
| ------------- | ----------------- | ---------------------------- | -------------- |
| Laughing | `(laughing)` | Full laughter | Ha, ha, ha |
| Chuckling | `(chuckling)` | Light laugh | Heh, heh |
| Sobbing | `(sobbing)` | Crying heavily | (optional) |
| Crying Loudly | `(crying loudly)` | Intense crying | (optional) |
| Sighing | `(sighing)` | Exhale of relief/frustration | sigh |
| Groaning | `(groaning)` | Sound of frustration | ugh |
| Panting | `(panting)` | Out of breath | huff, puff |
| Gasping | `(gasping)` | Sharp intake of breath | gasp |
| Yawning | `(yawning)` | Tired sound | yawn |
| Snoring | `(snoring)` | Sleep sound | zzz |
## Special Effects
| Effect | Tag | Description |
| ------------------- | ----------------------- | ------------------------ |
| Audience Laughter | `(audience laughing)` | Crowd laughing sound |
| Background Laughter | `(background laughter)` | Ambient laughter |
| Crowd Laughter | `(crowd laughing)` | Large group laughing |
| Short Pause | `(break)` | Brief pause in speech |
| Long Pause | `(long-break)` | Extended pause in speech |
## Usage Examples
### Single Emotion
```
(happy) What a beautiful day!
(sad) I'm sorry for your loss.
(excited) We won the championship!
```
### Combined Effects
```
(sad)(whispering) I'll miss you so much.
(angry)(shouting) Get out of here now!
(excited)(laughing) We did it! Ha ha ha!
```
### Natural Expressions
```
That's hilarious! Ha ha ha! // Natural laughter
(sighing) Sigh... what a long day.
(panting) Huff... puff... almost there!
```
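Because tags are plain text prefixes, they can be composed programmatically before the text is sent to TTS. A minimal Python sketch (the helper and the small tag subset here are illustrative, not part of the SDK):

```python
# Hypothetical helper: compose S1-style emotion-tagged text. The tag set here
# is a small illustrative subset of the tables above, not the full list.
SUPPORTED_TAGS = {"happy", "sad", "excited", "whispering", "shouting", "laughing"}

def tag_text(text: str, *tags: str) -> str:
    """Prefix text with (tag) markers, rejecting tags outside the known set."""
    for tag in tags:
        if tag not in SUPPORTED_TAGS:
            raise ValueError(f"unsupported tag: {tag}")
    return "".join(f"({t})" for t in tags) + " " + text

print(tag_text("I'll miss you so much.", "sad", "whispering"))
# (sad)(whispering) I'll miss you so much.
```

Validating tags up front avoids the "unsupported custom tags" mistake listed later on this page.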
## Quick Selection Guide
### For Customer Service
* **Greetings**: `(friendly)`, `(cheerful)`, `(helpful)`
* **Understanding**: `(empathetic)`, `(concerned)`, `(sympathetic)`
* **Problem-solving**: `(confident)`, `(determined)`, `(professional)`
* **Apologies**: `(apologetic)`, `(regretful)`, `(sincere)`
### For Storytelling
* **Narration**: `(narrator)`, `(calm)`, `(mysterious)`
* **Character emotions**: Any from basic/advanced lists
* **Atmosphere**: `(whispering)`, `(dramatic)`, background effects
* **Action**: `(shouting)`, `(panting)`, `(struggling)`
### For Educational Content
* **Introduction**: `(enthusiastic)`, `(welcoming)`, `(friendly)`
* **Explanations**: `(calm)`, `(clear)`, `(patient)`
* **Questions**: `(curious)`, `(encouraging)`, `(thoughtful)`
* **Praise**: `(proud)`, `(delighted)`, `(impressed)`
### For Marketing
* **Excitement**: `(excited)`, `(enthusiastic)`, `(energetic)`
* **Trust**: `(confident)`, `(professional)`, `(sincere)`
* **Urgency**: `(urgent)`, `(in a hurry tone)`, `(important)`
* **Celebration**: `(celebrating)`, `(triumphant)`, `(joyful)`
## Emotion Categories
### Positive Emotions
`(happy)` `(excited)` `(delighted)` `(satisfied)` `(proud)` `(grateful)` `(confident)` `(relaxed)` `(hopeful)` `(optimistic)` `(moved)` `(compassionate)`
### Negative Emotions
`(sad)` `(angry)` `(frustrated)` `(depressed)` `(upset)` `(worried)` `(scared)` `(nervous)` `(disappointed)` `(regretful)` `(guilty)` `(ashamed)` `(lonely)` `(bored)`
### Neutral/Complex Emotions
`(calm)` `(curious)` `(surprised)` `(confused)` `(uncertain)` `(doubtful)` `(indifferent)` `(nostalgic)` `(sarcastic)` `(determined)` `(resigned)`
### Social/Interpersonal Emotions
`(empathetic)` `(sympathetic)` `(embarrassed)` `(jealous)` `(envious)` `(disdainful)` `(contemptuous)` `(disgusted)`
## Model Support Matrix
| Model | Basic | Advanced | Tones | Effects | Intensity |
| ----------------- | ----- | -------- | ----- | ------- | --------- |
| Fish Speech 1.5 | ✓ | Limited | ✓ | 6/10 | No |
| Fish Audio S1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fish Audio S2-Pro | ✓ | ✓ | ✓ | ✓ | ✓ |
## Tips for Natural Speech
1. **Start Simple**: Begin with basic emotions before combining
2. **Test Variations**: Different voices handle emotions differently
3. **Context Matters**: Match emotions to content logically
4. **Less is More**: Avoid overusing emotions in short text
5. **Natural Flow**: Space out emotional changes
6. **Sound Effects**: Include appropriate text after audio tags
7. **Preview Often**: Test how emotions sound with your voice
## Common Mistakes to Avoid
* ❌ Placing emotion tags mid-sentence in English
* ❌ Forgetting parentheses around tags
* ❌ Using unsupported custom tags
* ❌ Mixing conflicting emotions
* ❌ Overusing effects in short text
* ❌ Missing text for sound effects
* ❌ Using wrong language placement rules
## See Also
* [Emotion Control Guide](/developer-guide/core-features/emotions) - Technical implementation
* [Text-to-Speech Best Practices](/developer-guide/core-features/text-to-speech)
* [API Reference](/api-reference/introduction)
* [Try it live](https://fish.audio) - Test emotions in the playground
# Create Model
Source: https://docs.fish.audio/api-reference/endpoint/model/create-model
post /model
Create a new voice model
Since this endpoint requires uploading files, it only accepts `multipart/form-data` and `application/msgpack`.
# Delete Model
Source: https://docs.fish.audio/api-reference/endpoint/model/delete-model
delete /model/{id}
Delete an existing model
# Get Model
Source: https://docs.fish.audio/api-reference/endpoint/model/get-model
get /model/{id}
Get details of a specific model
# List Models
Source: https://docs.fish.audio/api-reference/endpoint/model/list-models
get /model
Get a list of all models
# Update Model
Source: https://docs.fish.audio/api-reference/endpoint/model/update-model
patch /model/{id}
Update an existing model
# Speech to Text
Source: https://docs.fish.audio/api-reference/endpoint/openapi-v1/speech-to-text
post /v1/asr
Transcribe audio to text
This BETA endpoint only accepts `multipart/form-data` and `application/msgpack`.
# Text to Speech
Source: https://docs.fish.audio/api-reference/endpoint/openapi-v1/text-to-speech
post /v1/tts
Convert text to speech
This endpoint only accepts `application/json` and `application/msgpack`.
For best results, upload reference audio with the [create model](/api-reference/endpoint/model/create-model) endpoint before calling this one. This improves speech quality and reduces latency.
To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the [instructions](/developer-guide/core-features/text-to-speech#direct-api-usage).
Audio formats supported:
* WAV / PCM
  * Sample rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz (default: 44.1kHz)
  * 16-bit, mono
* MP3
  * Sample rate: 32kHz, 44.1kHz (default: 44.1kHz)
  * Mono
  * Bitrate: 64kbps, 128kbps (default), 192kbps
* Opus
  * Sample rate: 48kHz (default)
  * Mono
  * Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps
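As a sketch, the format-specific defaults above can be captured when assembling a request body. Field names follow the SDK's `TTSRequest` reference (`format`, `sample_rate`, `mp3_bitrate`, `opus_bitrate`); the helper itself is hypothetical:

```python
# Hypothetical helper: build a /v1/tts JSON body with the documented
# per-format defaults. Any keyword override replaces the default.
def tts_body(text: str, fmt: str = "mp3", **overrides):
    defaults = {
        "wav": {"sample_rate": 44100},
        "mp3": {"sample_rate": 44100, "mp3_bitrate": 128},
        "opus": {"sample_rate": 48000, "opus_bitrate": 32},
    }
    if fmt not in defaults:
        raise ValueError(f"unsupported format: {fmt}")
    body = {"text": text, "format": fmt, **defaults[fmt]}
    body.update(overrides)
    return body

body = tts_body("Hello", fmt="opus")  # {'text': 'Hello', 'format': 'opus', 'sample_rate': 48000, 'opus_bitrate': 32}
```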
# Get API Credit
Source: https://docs.fish.audio/api-reference/endpoint/wallet/get-api-credit
get /wallet/{user_id}/api-credit
Get current API credit balance
# Get User Premium
Source: https://docs.fish.audio/api-reference/endpoint/wallet/get-user-package
get /wallet/{user_id}/package
Get current user premium information
# WebSocket TTS Streaming
Source: https://docs.fish.audio/api-reference/endpoint/websocket/tts-live
Real-time text-to-speech streaming via WebSocket
The WebSocket TTS endpoint enables bidirectional streaming for low-latency text-to-speech generation with MessagePack serialization.
The `request` payload inside `StartEvent` uses the same parameters as the HTTP [Text to Speech API](/api-reference/endpoint/openapi-v1/text-to-speech). For more detailed field guidance, model-specific behavior, and examples, see that page. In WebSocket mode, `request.text` is typically empty in `StartEvent`, and the text content is sent through subsequent `TextEvent` messages.
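The message flow above can be illustrated as plain dicts, using the event names documented for the SDK (`start`, `text`, `flush`, `stop`). This is a hedged sketch of the ordering only; each dict would be MessagePack-serialized before being sent over the WebSocket, and `session_events` is a hypothetical helper, not an SDK function:

```python
# Illustrative event sequence for one streaming session: start carries the
# TTS request (with empty text), text events carry the content, flush forces
# synthesis of buffered text, and stop closes the stream.
def session_events(request: dict, chunks: list[str]) -> list[dict]:
    events = [{"event": "start", "request": {**request, "text": ""}}]
    events += [{"event": "text", "text": chunk} for chunk in chunks]
    events.append({"event": "flush"})
    events.append({"event": "stop"})
    return events

seq = session_events({"format": "mp3"}, ["Hello, ", "world!"])
```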
# Introduction
Source: https://docs.fish.audio/api-reference/introduction
How to use the Fish Audio API
## Welcome
You can generate a new API key at [https://fish.audio/app/api-keys/](https://fish.audio/app/api-keys/).
## Quick Start
See our [Quick Start](/developer-guide/getting-started/quickstart) guide to generate audio in under 2 minutes.
## Create a Voice Clone
Use our [/model endpoint](/api-reference/endpoint/model/create-model) to create a voice clone model.
## Generate Speech
Use our [/v1/tts endpoint](/api-reference/endpoint/openapi-v1/text-to-speech) to generate speech.
## Real-time Streaming
Use our [Python SDK](/developer-guide/sdk-guide/python/websocket) or [JavaScript SDK](/developer-guide/sdk-guide/javascript/websocket) for real-time audio streaming with WebSocket.
## Rate Limits
You can find the rate limits for each endpoint in the [Rate Limits](/developer-guide/models-pricing/pricing-and-rate-limits) section.
# API Reference
Source: https://docs.fish.audio/api-reference/sdk/javascript/api-reference
Complete reference for Fish Audio JavaScript SDK
## Client
Import and initialize the client:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
```
## Text to Speech
### convert()
Generate speech from text.
```typescript theme={null}
const audio = await fishAudio.textToSpeech.convert({ text: "Hello" });
```
Parameters: `request` (TTSRequest), `model?` (Backends)
Returns: `Promise>`
### convertRealtime()
Realtime streaming TTS over WebSocket.
```typescript theme={null}
async function* textStream() { yield "Hello, "; yield "world!"; }
const conn = await fishAudio.textToSpeech.convertRealtime({ text: "" }, textStream());
```
Parameters: `request` (TTSRequest with `text: ""`), `textStream` (`AsyncIterable`), `backend?` (Backends)
Returns: `RealtimeConnection` (`EventEmitter`-like connection) emitting `RealtimeEvents`
## Speech to Text
### convert()
Transcribe audio to text.
```typescript theme={null}
const res = await fishAudio.speechToText.convert({ audio: myAudio });
console.log(res.text);
```
Parameters: `request` (STTRequest)
Returns: `STTResponse`
## Voices
### search()
List/search available voice models.
```typescript theme={null}
const results = await fishAudio.voices.search();
```
Parameters: `request?` (ModelListRequest)
Returns: `ModelListResponse`
### get()
Get model details.
```typescript theme={null}
const model = await fishAudio.voices.get("model_id");
```
Parameters: `voiceId` (string)
Returns: `ModelEntity`
### ivc.create()
Create a new voice model from audio samples.
```typescript theme={null}
const res = await fishAudio.voices.ivc.create({ title, voices: [file], cover_image: file });
```
Parameters: `request` (ModelCreateRequest)
Returns: `ModelEntity`
### update()
Update model metadata.
```typescript theme={null}
await fishAudio.voices.update("model_id", { title: "New Title" });
```
Parameters: `voiceId` (string), `request` (UpdateModelRequest)
Returns: `UpdateVoiceResponse`
### delete()
Delete a model.
```typescript theme={null}
await fishAudio.voices.delete("model_id");
```
Parameters: `voiceId` (string)
Returns: `DeleteVoiceResponse`
## User
### get\_api\_credit()
Check API credit balance.
```typescript theme={null}
await fishAudio.user.get_api_credit();
```
Returns: `APICreditResponse`
### get\_package()
Get subscription package details.
```typescript theme={null}
await fishAudio.user.get_package();
```
Returns: `PackageResponse`
## Request Classes
### TTSRequest
Text-to-speech parameters.
```typescript theme={null}
{
  text: "Hello",
  reference_id: "model_id",
  references: [{ audio: File, text: "sample" }],
  format: "mp3",
  prosody: { speed: 1.0, volume: 0 },
}
```
Fields: `text`, `reference_id`, `references`, `format`, `mp3_bitrate`, `opus_bitrate`, `sample_rate`, `prosody`, `latency`, `chunk_length`, `normalize`, `temperature`, `top_p`
### STTRequest
Speech-to-text parameters.
```typescript theme={null}
{ audio: File, language?: "en", ignore_timestamps?: boolean }
```
Fields: `audio`, `language?`, `ignore_timestamps?`
### ReferenceAudio
Reference audio for voice cloning.
```typescript theme={null}
{ audio: File, text: "spoken text" }
```
Fields: `audio`, `text`
### Prosody
Speed and volume control.
```typescript theme={null}
{ speed: 1.2, volume: 5 }
```
Fields: `speed` (0.5–2.0), `volume` (-20 to 20)
### Backends
The backend model to use.
```typescript theme={null}
Backends = 's1' | 's2-pro';
```
## Response Classes
### STTResponse
Transcription result.
```typescript theme={null}
response.text // Complete transcription
response.duration // Duration in seconds
response.segments // ASRSegment[]
```
### ASRSegment
Timestamped text segment.
Fields: `text` (string), `start` (number, seconds), `end` (number, seconds)
### ModelEntity
Voice model information.
Fields: `_id`, `title`, `description`, `visibility`, `created_at`, `updated_at`, `tags`
### ModelListResponse
List response for voices.
Fields: `items` (ModelEntity\[]), `total` (number)
### APICreditResponse
API credit information.
Fields: `_id` (string), `user_id` (string), `credit` (string), `created_at` (string), `updated_at` (string), `has_phone_sha256` (boolean), `has_free_credit?` (boolean)
### PackageResponse
Subscription package details.
Fields: `user_id` (string), `type` (string), `total` (number), `balance` (number), `created_at` (string), `updated_at` (string), `finished_at` (string)
## WebSocket Classes
### RealtimeEvents
Events emitted by `convertRealtime` connections.
| Event | Meaning |
| ------------- | ---------------------- |
| `OPEN` | Connection established |
| `AUDIO_CHUNK` | Audio chunk received |
| `ERROR` | Error occurred |
| `CLOSE` | Connection closed |
## Event Classes
### StartEvent
Stream start event.
Fields: `event` ("start"), `request` (TTSRequest)
### TextEvent
Text chunk event.
Fields: `event` ("text"), `text` (string)
### FlushEvent
Flush text chunks event.
Fields: `event` ("flush")
### CloseEvent
Stream close event.
Fields: `event` ("stop")
## Exceptions
### FishAudioError
Generic error with status code, body, rawResponse.
### FishAudioTimeoutError
Connection timeout error.
# Client
Source: https://docs.fish.audio/api-reference/sdk/python/client
# fishaudio.client
Main Fish Audio client classes.
## FishAudio Objects
```python theme={null}
class FishAudio()
```
Synchronous Fish Audio API client.
**Example**:
```python theme={null}
from fishaudio import FishAudio

client = FishAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.convert(text="Hello world")
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

# List voices
voices = client.voices.list(page_size=20)
print(f"Found {voices.total} voices")
```
#### \_\_init\_\_
```python theme={null}
def __init__(*,
             api_key: Optional[str] = None,
             base_url: str = "https://api.fish.audio",
             timeout: float = 240.0,
             httpx_client: Optional[httpx.Client] = None)
```
Initialize Fish Audio client.
**Arguments**:
* `api_key` - API key (can also use FISH\_API\_KEY env var)
* `base_url` - API base URL
* `timeout` - Request timeout in seconds
* `httpx_client` - Optional custom HTTP client
#### tts
```python theme={null}
@property
def tts() -> TTSClient
```
Access TTS (text-to-speech) operations.
#### asr
```python theme={null}
@property
def asr() -> ASRClient
```
Access ASR (speech-to-text) operations.
#### voices
```python theme={null}
@property
def voices() -> VoicesClient
```
Access voice management operations.
#### account
```python theme={null}
@property
def account() -> AccountClient
```
Access account/billing operations.
#### close
```python theme={null}
def close() -> None
```
Close the HTTP client.
## AsyncFishAudio Objects
```python theme={null}
class AsyncFishAudio()
```
Asynchronous Fish Audio API client.
**Example**:
```python theme={null}
import asyncio

import aiofiles

from fishaudio import AsyncFishAudio

async def main():
    client = AsyncFishAudio(api_key="your_api_key")

    # Generate speech
    audio = await client.tts.convert(text="Hello world")
    async with aiofiles.open("output.mp3", "wb") as f:
        async for chunk in audio:
            await f.write(chunk)

    # List voices
    voices = await client.voices.list(page_size=20)
    print(f"Found {voices.total} voices")

asyncio.run(main())
```
#### \_\_init\_\_
```python theme={null}
def __init__(*,
             api_key: Optional[str] = None,
             base_url: str = "https://api.fish.audio",
             timeout: float = 240.0,
             httpx_client: Optional[httpx.AsyncClient] = None)
```
Initialize async Fish Audio client.
**Arguments**:
* `api_key` - API key (can also use FISH\_API\_KEY env var)
* `base_url` - API base URL
* `timeout` - Request timeout in seconds
* `httpx_client` - Optional custom async HTTP client
#### tts
```python theme={null}
@property
def tts() -> AsyncTTSClient
```
Access TTS (text-to-speech) operations.
#### asr
```python theme={null}
@property
def asr() -> AsyncASRClient
```
Access ASR (speech-to-text) operations.
#### voices
```python theme={null}
@property
def voices() -> AsyncVoicesClient
```
Access voice management operations.
#### account
```python theme={null}
@property
def account() -> AsyncAccountClient
```
Access account/billing operations.
#### close
```python theme={null}
async def close() -> None
```
Close the HTTP client.
# Core
Source: https://docs.fish.audio/api-reference/sdk/python/core
# fishaudio.core.client\_wrapper
HTTP client wrapper for managing requests and authentication.
## BaseClientWrapper Objects
```python theme={null}
class BaseClientWrapper()
```
Base wrapper with shared logic for sync/async clients.
#### get\_headers
```python theme={null}
def get_headers(
    additional_headers: Optional[dict[str, str]] = None) -> dict[str, str]
```
Build headers including authentication and user agent.
## ClientWrapper Objects
```python theme={null}
class ClientWrapper(BaseClientWrapper)
```
Wrapper for httpx.Client that handles authentication and error handling.
#### request
```python theme={null}
def request(method: str,
            path: str,
            *,
            request_options: Optional[RequestOptions] = None,
            **kwargs: Any) -> httpx.Response
```
Make an HTTP request with error handling.
**Arguments**:
* `method` - HTTP method (GET, POST, etc.)
* `path` - API endpoint path
* `request_options` - Optional request-level overrides
* `**kwargs` - Additional arguments to pass to httpx.request
**Returns**:
httpx.Response object
**Raises**:
* `APIError` - On non-2xx responses
#### client
```python theme={null}
@property
def client() -> httpx.Client
```
Get underlying httpx.Client for advanced usage (e.g., WebSockets).
#### close
```python theme={null}
def close() -> None
```
Close the HTTP client.
## AsyncClientWrapper Objects
```python theme={null}
class AsyncClientWrapper(BaseClientWrapper)
```
Wrapper for httpx.AsyncClient that handles authentication and error handling.
#### request
```python theme={null}
async def request(method: str,
                  path: str,
                  *,
                  request_options: Optional[RequestOptions] = None,
                  **kwargs: Any) -> httpx.Response
```
Make an async HTTP request with error handling.
**Arguments**:
* `method` - HTTP method (GET, POST, etc.)
* `path` - API endpoint path
* `request_options` - Optional request-level overrides
* `**kwargs` - Additional arguments to pass to httpx.request
**Returns**:
httpx.Response object
**Raises**:
* `APIError` - On non-2xx responses
#### client
```python theme={null}
@property
def client() -> httpx.AsyncClient
```
Get underlying httpx.AsyncClient for advanced usage (e.g., WebSockets).
#### close
```python theme={null}
async def close() -> None
```
Close the HTTP client.
# fishaudio.core.request\_options
Request-level options for API calls.
## RequestOptions Objects
```python theme={null}
class RequestOptions()
```
Options that can be provided on a per-request basis to override client defaults.
**Attributes**:
* `timeout` - Override the client's default timeout (in seconds)
* `max_retries` - Override the client's default max retries
* `additional_headers` - Additional headers to include in the request
* `additional_query_params` - Additional query parameters to include
#### get\_timeout
```python theme={null}
def get_timeout() -> Optional[httpx.Timeout]
```
Convert timeout to httpx.Timeout if set.
# fishaudio.core.iterators
Audio stream wrappers with collection utilities.
## AudioStream Objects
```python theme={null}
class AudioStream()
```
Wrapper for sync audio byte streams with collection utilities.
This class wraps an iterator of audio bytes and provides a convenient
`.collect()` method to gather all chunks into a single bytes object.
**Examples**:
```python theme={null}
from fishaudio import FishAudio

client = FishAudio(api_key="...")

# Collect all audio at once
audio = client.tts.stream(text="Hello!").collect()

# Or stream chunks manually
for chunk in client.tts.stream(text="Hello!"):
    process_chunk(chunk)
```
#### \_\_init\_\_
```python theme={null}
def __init__(iterator: Iterator[bytes])
```
Initialize the audio iterator wrapper.
**Arguments**:
* `iterator` - The underlying iterator of audio bytes
#### \_\_iter\_\_
```python theme={null}
def __iter__() -> Iterator[bytes]
```
Allow direct iteration over audio chunks.
#### collect
```python theme={null}
def collect() -> bytes
```
Collect all audio chunks into a single bytes object.
This consumes the iterator and returns all audio data as bytes.
After calling this method, the iterator cannot be used again.
**Returns**:
Complete audio data as bytes
**Examples**:
```python theme={null}
audio = client.tts.stream(text="Hello!").collect()
with open("output.mp3", "wb") as f:
    f.write(audio)
```
## AsyncAudioStream Objects
```python theme={null}
class AsyncAudioStream()
```
Wrapper for async audio byte streams with collection utilities.
This class wraps an async iterator of audio bytes and provides a convenient
`.collect()` method to gather all chunks into a single bytes object.
**Examples**:
```python theme={null}
from fishaudio import AsyncFishAudio
client = AsyncFishAudio(api_key="...")
# Collect all audio at once
stream = await client.tts.stream(text="Hello!")
audio = await stream.collect()
# Or stream chunks manually
async for chunk in await client.tts.stream(text="Hello!"):
    await process_chunk(chunk)
```
#### \_\_init\_\_
```python theme={null}
def __init__(async_iterator: AsyncIterator[bytes])
```
Initialize the async audio iterator wrapper.
**Arguments**:
* `async_iterator` - The underlying async iterator of audio bytes
#### \_\_aiter\_\_
```python theme={null}
def __aiter__() -> AsyncIterator[bytes]
```
Allow direct async iteration over audio chunks.
#### collect
```python theme={null}
async def collect() -> bytes
```
Collect all audio chunks into a single bytes object.
This consumes the async iterator and returns all audio data as bytes.
After calling this method, the iterator cannot be used again.
**Returns**:
Complete audio data as bytes
**Examples**:
```python theme={null}
stream = await client.tts.stream(text="Hello!")
audio = await stream.collect()
with open("output.mp3", "wb") as f:
    f.write(audio)
```
# fishaudio.core.websocket\_options
WebSocket-level options for WebSocket connections.
## WebSocketOptions Objects
```python theme={null}
class WebSocketOptions()
```
Options for configuring WebSocket connections.
These options are passed directly to httpx\_ws's connect\_ws/aconnect\_ws functions.
For complete documentation, see [https://frankie567.github.io/httpx-ws/reference/httpx\_ws/](https://frankie567.github.io/httpx-ws/reference/httpx_ws/)
**Attributes**:
* `keepalive_ping_timeout_seconds` - Maximum delay the client will wait for an answer to its Ping event. If the delay is exceeded, `WebSocketNetworkError` will be raised and the connection closed. Default: 20 seconds.
* `keepalive_ping_interval_seconds` - Interval at which the client will automatically send a Ping event to keep the connection alive. Set to `None` to disable this mechanism. Default: 20 seconds.
* `max_message_size_bytes` - Maximum message size in bytes to receive from the server. Default: 65536 bytes (64 KiB).
* `queue_size` - Size of the queue where received messages are held until they are consumed. If the queue is full, the client stops receiving messages from the server until the queue has room available. Default: 512.
**Notes**:
Parameter descriptions adapted from httpx\_ws documentation.
#### to\_httpx\_ws\_kwargs
```python theme={null}
def to_httpx_ws_kwargs() -> dict[str, Any]
```
Convert to kwargs dict for httpx\_ws aconnect\_ws/connect\_ws.
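The shape of this conversion can be sketched from the documented attributes and defaults. The dataclass below is an illustrative re-sketch, not the SDK's actual implementation; the keyword names match what `httpx_ws`'s `connect_ws`/`aconnect_ws` accept:

```python
from dataclasses import dataclass, asdict
from typing import Any, Optional

# Sketch of the documented options and their defaults; the real class forwards
# these as keyword arguments to httpx_ws's connect_ws / aconnect_ws.
@dataclass
class WSOptions:
    keepalive_ping_timeout_seconds: Optional[float] = 20.0
    keepalive_ping_interval_seconds: Optional[float] = 20.0
    max_message_size_bytes: int = 65536  # 64 KiB
    queue_size: int = 512

    def to_httpx_ws_kwargs(self) -> dict[str, Any]:
        return asdict(self)

opts = WSOptions(queue_size=1024)  # override one default, keep the rest
```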
# fishaudio.core.omit
OMIT sentinel for distinguishing None from not-provided parameters.
# Exceptions
Source: https://docs.fish.audio/api-reference/sdk/python/exceptions
# fishaudio.exceptions
Custom exceptions for the Fish Audio SDK.
## FishAudioError Objects
```python theme={null}
class FishAudioError(Exception)
```
Base exception for all Fish Audio SDK errors.
## APIError Objects
```python theme={null}
class APIError(FishAudioError)
```
Raised when the API returns an error response.
## AuthenticationError Objects
```python theme={null}
class AuthenticationError(APIError)
```
Raised when authentication fails (401).
## PermissionError Objects
```python theme={null}
class PermissionError(APIError)
```
Raised when permission is denied (403).
## NotFoundError Objects
```python theme={null}
class NotFoundError(APIError)
```
Raised when a resource is not found (404).
## RateLimitError Objects
```python theme={null}
class RateLimitError(APIError)
```
Raised when rate limit is exceeded (429).
## ServerError Objects
```python theme={null}
class ServerError(APIError)
```
Raised when the server encounters an error (5xx).
## WebSocketError Objects
```python theme={null}
class WebSocketError(FishAudioError)
```
Raised when WebSocket connection or streaming fails.
## ValidationError Objects
```python theme={null}
class ValidationError(FishAudioError)
```
Raised when request validation fails.
## DependencyError Objects
```python theme={null}
class DependencyError(FishAudioError)
```
Raised when a required dependency is missing.
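The status-code mapping above can be summarized in one lookup. The helper below is a hypothetical illustration of the hierarchy, not SDK code:

```python
# Hypothetical helper mirroring the documented exception hierarchy: pick the
# most specific class name for an HTTP status, falling back to APIError.
STATUS_TO_EXCEPTION = {
    401: "AuthenticationError",
    403: "PermissionError",
    404: "NotFoundError",
    429: "RateLimitError",
}

def exception_for_status(status: int) -> str:
    if status in STATUS_TO_EXCEPTION:
        return STATUS_TO_EXCEPTION[status]
    if 500 <= status < 600:
        return "ServerError"
    return "APIError"

print(exception_for_status(429))  # RateLimitError
```

In practice, `RateLimitError` (429) and `ServerError` (5xx) are the cases usually worth retrying with backoff, while 4xx errors indicate a problem with the request itself.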
# Overview
Source: https://docs.fish.audio/api-reference/sdk/python/overview
Fish Audio Python SDK for text-to-speech and voice cloning

# Fish Audio Python SDK
The official Python library for the Fish Audio API
**Documentation:** [Python SDK Guide](https://docs.fish.audio/developer-guide/sdk-guide/python/) | [API Reference](https://docs.fish.audio/api-reference/sdk/python/)
> \[!IMPORTANT]
>
> ## Changes to PyPI Versioning
>
> For existing users on Fish Audio Python SDK, please note that the starting version is now `1.0.0`. The last version before this was `2025.6.3`. You may need to adjust your version constraints accordingly.
>
> The original API in the `fish_audio_sdk` package has NOT been removed, but you will not receive any updates if you continue using the old versioning scheme.
>
> The simplest fix is to update your dependency to `fish-audio-sdk>=1.0.0` to continue receiving updates, or to pin a specific version like `fish-audio-sdk==1.0.0` in your package manager. There are no changes to the API itself in this transition.
>
> If you're using the legacy `fish_audio_sdk` and would like to switch to the newer, more robust `fishaudio` package, see the [migration guide](https://docs.fish.audio/archive/python-sdk-legacy/migration-guide) to upgrade.
## Installation
```bash theme={null}
pip install fish-audio-sdk
# With audio playback utilities
pip install "fish-audio-sdk[utils]"
```
## Authentication
Get your API key from [fish.audio/app/api-keys](https://fish.audio/app/api-keys):
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
Or provide directly:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(api_key="your_api_key")
```
## Quick Start
**Synchronous:**
```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play, save
client = FishAudio()
# Generate audio
audio = client.tts.convert(text="Hello, world!")
# Play or save
play(audio)
save(audio, "output.mp3")
```
**Asynchronous:**
```python theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play, save

async def main():
    client = AsyncFishAudio()
    audio = await client.tts.convert(text="Hello, world!")
    play(audio)
    save(audio, "output.mp3")

asyncio.run(main())
```
## Core Features
### Text-to-Speech
**With custom voice:**
```python theme={null}
# Use a specific voice by ID
audio = client.tts.convert(
    text="Custom voice",
    reference_id="802e3bc2b27e49c2995d23ef70e6ac89"
)
```
**With speed control:**
```python theme={null}
audio = client.tts.convert(
    text="Speaking faster!",
    speed=1.5  # 1.5x speed
)
```
**Reusable configuration:**
```python theme={null}
from fishaudio.types import TTSConfig, Prosody
config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="933563129e564b19a115bedd57b7406a",
    format="wav",
    latency="balanced"
)
# Reuse across generations
audio1 = client.tts.convert(text="First message", config=config)
audio2 = client.tts.convert(text="Second message", config=config)
```
**Chunk-by-chunk processing:**
```python theme={null}
# Stream and process chunks as they arrive
for chunk in client.tts.stream(text="Long content..."):
    send_to_websocket(chunk)

# Or collect all chunks
audio = client.tts.stream(text="Hello!").collect()
```
[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/text-to-speech)
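If you feed raw PCM from `stream()` into an audio device that expects constant-size buffers, you typically regroup the variable-size network chunks first. A minimal, SDK-independent sketch of that buffering (the frame size and the in-memory chunks are illustrative, not SDK defaults):

```python theme={null}
from typing import Iterable, Iterator

def rechunk(chunks: Iterable[bytes], frame_size: int) -> Iterator[bytes]:
    """Regroup variable-size chunks into fixed-size frames.

    The final partial frame (if any) is yielded as-is.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_size:
            yield bytes(buffer[:frame_size])
            del buffer[:frame_size]
    if buffer:
        yield bytes(buffer)

# In-memory chunks standing in for a TTS stream:
frames = list(rechunk([b"abc", b"defg", b"hi"], frame_size=4))
# frames == [b"abcd", b"efgh", b"i"]
```

In practice the iterable would be `client.tts.stream(text=..., format="pcm")` and `frame_size` would match your playback buffer.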
### Speech-to-Text
```python theme={null}
# Transcribe audio
with open("audio.wav", "rb") as f:
    result = client.asr.transcribe(audio=f.read(), language="en")

print(result.text)

# Access timestamped segments
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
```
[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/speech-to-text)
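Because each segment carries `start`, `end`, and `text`, transcripts convert naturally to subtitle formats. A hedged sketch that renders segments as SRT (the `Segment` dataclass here is a stand-in for the SDK's segment objects, not part of the SDK):

```python theme={null}
from dataclasses import dataclass

@dataclass
class Segment:
    # Stand-in for the SDK's timestamped segment objects.
    start: float
    end: float
    text: str

def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments) -> str:
    """Render timestamped segments as an SRT subtitle document."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_srt_timestamp(seg.start)} --> {to_srt_timestamp(seg.end)}\n"
            f"{seg.text}\n"
        )
    return "\n".join(blocks)

srt = segments_to_srt([Segment(0.0, 1.5, "Hello."), Segment(1.5, 3.25, "World.")])
```

With the real SDK you would pass `result.segments` directly, since those objects expose the same three attributes.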
### Real-time Streaming
Stream dynamically generated text for conversational AI and live applications:
**Synchronous:**
```python theme={null}
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming!"

audio_stream = client.tts.stream_websocket(text_chunks(), latency="balanced")
play(audio_stream)
```
**Asynchronous:**
```python theme={null}
async def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming!"

audio_stream = await client.tts.stream_websocket(text_chunks(), latency="balanced")
play(audio_stream)
```
[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket)
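When the text source is an LLM emitting small tokens, it can help to buffer until sentence boundaries before yielding to `stream_websocket`, so synthesis receives natural prosody units rather than arbitrary fragments. A minimal sketch (the splitting rule is a simplistic illustration, not SDK behavior):

```python theme={null}
import re
from typing import Iterable, Iterator

def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer incremental text and emit complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Split after sentence-ending punctuation followed by whitespace.
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if not match:
                break
            yield buffer[:match.end()]
            buffer = buffer[match.end():]
    if buffer:
        yield buffer  # Flush whatever remains at end of stream.

chunks = list(sentence_chunks(["Hello", " world. ", "How are", " you?"]))
# chunks == ["Hello world. ", "How are you?"]
```

The generator could then be passed straight to `client.tts.stream_websocket(sentence_chunks(llm_tokens))`, where `llm_tokens` is whatever token stream your application produces.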
### Voice Cloning
**Instant cloning:**
```python theme={null}
from fishaudio.types import ReferenceAudio
# Clone voice on-the-fly
with open("reference.wav", "rb") as f:
    audio = client.tts.convert(
        text="Cloned voice speaking",
        references=[ReferenceAudio(
            audio=f.read(),
            text="Text spoken in reference"
        )]
    )
```
**Persistent voice models:**
```python theme={null}
# Create voice model for reuse
with open("voice_sample.wav", "rb") as f:
    voice = client.voices.create(
        title="My Voice",
        voices=[f.read()],
        description="Custom voice clone"
    )

# Use the created model
audio = client.tts.convert(
    text="Using my saved voice",
    reference_id=voice.id
)
```
[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/voice-cloning)
## Resource Clients
| Resource | Description | Key Methods |
| ---------------- | ------------------ | ----------------------------------------------------- |
| `client.tts` | Text-to-speech | `convert()`, `stream()`, `stream_websocket()` |
| `client.asr` | Speech recognition | `transcribe()` |
| `client.voices` | Voice management | `list()`, `get()`, `create()`, `update()`, `delete()` |
| `client.account` | Account info | `get_credits()`, `get_package()` |
## Error Handling
```python theme={null}
from fishaudio.exceptions import (
    AuthenticationError,
    RateLimitError,
    ValidationError,
    FishAudioError
)

try:
    audio = client.tts.convert(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded")
except ValidationError as e:
    print(f"Invalid request: {e}")
except FishAudioError as e:
    print(f"API error: {e}")
```
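When a `RateLimitError` surfaces, a common recovery is exponential backoff before retrying. A minimal, SDK-independent sketch of the delay schedule (the retry loop that would wrap `client.tts.convert`, and the specific bounds, are illustrative assumptions; production code would usually add jitter):

```python theme={null}
from typing import Iterator

def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 30.0) -> Iterator[float]:
    """Yield exponentially increasing delays, capped at `cap` seconds."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))

delays = list(backoff_delays())
# delays == [1.0, 2.0, 4.0, 8.0, 16.0]
```

A caller would `time.sleep(delay)` between attempts and re-raise once the schedule is exhausted.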
## Resources
* **Documentation:** [SDK Guide](https://docs.fish.audio/developer-guide/sdk-guide/python/) | [API Reference](https://docs.fish.audio/api-reference/sdk/python/)
* **Package:** [PyPI](https://pypi.org/project/fish-audio-sdk/) | [GitHub](https://github.com/fishaudio/fish-audio-python)
* **Legacy SDK:** [Documentation](https://docs.fish.audio/archive/python-sdk-legacy) | [Migration Guide](https://docs.fish.audio/archive/python-sdk-legacy/migration-guide)
## License
This project is licensed under the Apache-2.0 License - see the [LICENSE](LICENSE) file for details.
# Resources
Source: https://docs.fish.audio/api-reference/sdk/python/resources
# fishaudio.resources.voices
Voice management namespace client.
## VoicesClient Objects
```python theme={null}
class VoicesClient()
```
Synchronous voice management operations.
#### list
```python theme={null}
def list(
    *,
    page_size: int = 10,
    page_number: int = 1,
    title: Optional[str] = OMIT,
    tags: Optional[Union[list[str], str]] = OMIT,
    self_only: bool = False,
    author_id: Optional[str] = OMIT,
    language: Optional[Union[list[str], str]] = OMIT,
    title_language: Optional[Union[list[str], str]] = OMIT,
    sort_by: str = "task_count",
    request_options: Optional[RequestOptions] = None
) -> PaginatedResponse[Voice]
```
List available voices/models.
**Arguments**:
* `page_size` - Number of results per page
* `page_number` - Page number (1-indexed)
* `title` - Filter by title
* `tags` - Filter by tags (single tag or list)
* `self_only` - Only return user's own voices
* `author_id` - Filter by author ID
* `language` - Filter by language(s)
* `title_language` - Filter by title language(s)
* `sort_by` - Sort field ("task\_count" or "created\_at")
* `request_options` - Request-level overrides
**Returns**:
Paginated response with total count and voice items
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
# List all voices
voices = client.voices.list(page_size=20)
print(f"Total: {voices.total}")
for voice in voices.items:
    print(f"{voice.title}: {voice.id}")
# Filter by tags
tagged = client.voices.list(tags=["male", "english"])
```
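To walk every page of a listing, you can loop on `page_number` until the reported total is exhausted. A minimal, SDK-independent sketch (here `fetch` is a stand-in for a wrapper around `client.voices.list` that returns `(page.items, page.total)`):

```python theme={null}
from typing import Callable, Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")

def iter_all_pages(fetch: Callable[[int], Tuple[Sequence[T], int]]) -> Iterator[T]:
    """Walk a 1-indexed paginated endpoint until `total` items are seen."""
    page_number = 1
    seen = 0
    while True:
        items, total = fetch(page_number)
        for item in items:
            yield item
            seen += 1
        if seen >= total or not items:
            return
        page_number += 1

# Demonstration with a fake in-memory endpoint (25 items, 10 per page):
data = list(range(25))
PAGE_SIZE = 10

def fake_fetch(page_number: int):
    start = (page_number - 1) * PAGE_SIZE
    return data[start:start + PAGE_SIZE], len(data)

all_items = list(iter_all_pages(fake_fetch))
# all_items == data
```

With the real client, `fetch` would be something like `lambda n: ((p := client.voices.list(page_number=n)).items, p.total)`.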
#### get
```python theme={null}
def get(voice_id: str,
        *,
        request_options: Optional[RequestOptions] = None) -> Voice
```
Get voice by ID.
**Arguments**:
* `voice_id` - Voice model ID
* `request_options` - Request-level overrides
**Returns**:
Voice model details
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
voice = client.voices.get("voice_id_here")
print(voice.title, voice.description)
```
#### create
```python theme={null}
def create(*,
           title: str,
           voices: builtins.list[bytes],
           description: Optional[str] = OMIT,
           texts: Optional[builtins.list[str]] = OMIT,
           tags: Optional[builtins.list[str]] = OMIT,
           cover_image: Optional[bytes] = OMIT,
           visibility: Visibility = "private",
           train_mode: str = "fast",
           enhance_audio_quality: bool = True,
           request_options: Optional[RequestOptions] = None) -> Voice
```
Create/clone a new voice.
**Arguments**:
* `title` - Voice model name
* `voices` - List of audio file bytes for training
* `description` - Voice description
* `texts` - Transcripts for voice samples
* `tags` - Tags for categorization
* `cover_image` - Cover image bytes
* `visibility` - Visibility setting (public, unlist, private)
* `train_mode` - Training mode (currently only "fast" supported)
* `enhance_audio_quality` - Whether to enhance audio quality
* `request_options` - Request-level overrides
**Returns**:
Created voice model
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
with open("voice1.wav", "rb") as f1, open("voice2.wav", "rb") as f2:
    voice = client.voices.create(
        title="My Voice",
        voices=[f1.read(), f2.read()],
        description="Custom voice clone",
        tags=["custom", "english"]
    )

print(f"Created: {voice.id}")
```
#### update
```python theme={null}
def update(voice_id: str,
           *,
           title: Optional[str] = OMIT,
           description: Optional[str] = OMIT,
           cover_image: Optional[bytes] = OMIT,
           visibility: Optional[Visibility] = OMIT,
           tags: Optional[builtins.list[str]] = OMIT,
           request_options: Optional[RequestOptions] = None) -> None
```
Update voice metadata.
**Arguments**:
* `voice_id` - Voice model ID
* `title` - New title
* `description` - New description
* `cover_image` - New cover image bytes
* `visibility` - New visibility setting
* `tags` - New tags
* `request_options` - Request-level overrides
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
client.voices.update(
    "voice_id_here",
    title="Updated Title",
    visibility="public"
)
```
#### delete
```python theme={null}
def delete(voice_id: str,
           *,
           request_options: Optional[RequestOptions] = None) -> None
```
Delete a voice.
**Arguments**:
* `voice_id` - Voice model ID
* `request_options` - Request-level overrides
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
client.voices.delete("voice_id_here")
```
## AsyncVoicesClient Objects
```python theme={null}
class AsyncVoicesClient()
```
Asynchronous voice management operations.
#### list
```python theme={null}
async def list(
    *,
    page_size: int = 10,
    page_number: int = 1,
    title: Optional[str] = OMIT,
    tags: Optional[Union[list[str], str]] = OMIT,
    self_only: bool = False,
    author_id: Optional[str] = OMIT,
    language: Optional[Union[list[str], str]] = OMIT,
    title_language: Optional[Union[list[str], str]] = OMIT,
    sort_by: str = "task_count",
    request_options: Optional[RequestOptions] = None
) -> PaginatedResponse[Voice]
```
List available voices/models (async). See sync version for details.
#### get
```python theme={null}
async def get(voice_id: str,
              *,
              request_options: Optional[RequestOptions] = None) -> Voice
```
Get voice by ID (async). See sync version for details.
#### create
```python theme={null}
async def create(*,
                 title: str,
                 voices: builtins.list[bytes],
                 description: Optional[str] = OMIT,
                 texts: Optional[builtins.list[str]] = OMIT,
                 tags: Optional[builtins.list[str]] = OMIT,
                 cover_image: Optional[bytes] = OMIT,
                 visibility: Visibility = "private",
                 train_mode: str = "fast",
                 enhance_audio_quality: bool = True,
                 request_options: Optional[RequestOptions] = None) -> Voice
```
Create/clone a new voice (async). See sync version for details.
#### update
```python theme={null}
async def update(voice_id: str,
                 *,
                 title: Optional[str] = OMIT,
                 description: Optional[str] = OMIT,
                 cover_image: Optional[bytes] = OMIT,
                 visibility: Optional[Visibility] = OMIT,
                 tags: Optional[builtins.list[str]] = OMIT,
                 request_options: Optional[RequestOptions] = None) -> None
```
Update voice metadata (async). See sync version for details.
#### delete
```python theme={null}
async def delete(voice_id: str,
                 *,
                 request_options: Optional[RequestOptions] = None) -> None
```
Delete a voice (async). See sync version for details.
# fishaudio.resources.account
Account namespace client for billing and credits.
## AccountClient Objects
```python theme={null}
class AccountClient()
```
Synchronous account operations.
#### get\_credits
```python theme={null}
def get_credits(*,
                check_free_credit: Optional[bool] = OMIT,
                request_options: Optional[RequestOptions] = None) -> Credits
```
Get API credit balance.
**Arguments**:
* `check_free_credit` - Whether to check free credit availability
* `request_options` - Request-level overrides
**Returns**:
Credits information
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
credits = client.account.get_credits()
print(f"Available credits: {float(credits.credit)}")
# Check free credit availability
credits = client.account.get_credits(check_free_credit=True)
if credits.has_free_credit:
    print("Free credits available!")
```
#### get\_package
```python theme={null}
def get_package(*,
                request_options: Optional[RequestOptions] = None) -> Package
```
Get package information.
**Arguments**:
* `request_options` - Request-level overrides
**Returns**:
Package information
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
package = client.account.get_package()
print(f"Balance: {package.balance}/{package.total}")
```
## AsyncAccountClient Objects
```python theme={null}
class AsyncAccountClient()
```
Asynchronous account operations.
#### get\_credits
```python theme={null}
async def get_credits(
    *,
    check_free_credit: Optional[bool] = OMIT,
    request_options: Optional[RequestOptions] = None
) -> Credits
```
Get API credit balance (async).
**Arguments**:
* `check_free_credit` - Whether to check free credit availability
* `request_options` - Request-level overrides
**Returns**:
Credits information
**Example**:
```python theme={null}
client = AsyncFishAudio(api_key="...")
credits = await client.account.get_credits()
print(f"Available credits: {float(credits.credit)}")
# Check free credit availability
credits = await client.account.get_credits(check_free_credit=True)
if credits.has_free_credit:
    print("Free credits available!")
```
#### get\_package
```python theme={null}
async def get_package(
    *,
    request_options: Optional[RequestOptions] = None
) -> Package
```
Get package information (async).
**Arguments**:
* `request_options` - Request-level overrides
**Returns**:
Package information
**Example**:
```python theme={null}
client = AsyncFishAudio(api_key="...")
package = await client.account.get_package()
print(f"Balance: {package.balance}/{package.total}")
```
# fishaudio.resources.tts
TTS (Text-to-Speech) namespace client.
## TTSClient Objects
```python theme={null}
class TTSClient()
```
Synchronous TTS operations.
#### stream
```python theme={null}
def stream(*,
           text: str,
           reference_id: Optional[str] = None,
           references: Optional[list[ReferenceAudio]] = None,
           format: Optional[AudioFormat] = None,
           latency: Optional[LatencyMode] = None,
           speed: Optional[float] = None,
           config: TTSConfig = TTSConfig(),
           model: Model = "s2-pro",
           request_options: Optional[RequestOptions] = None) -> AudioStream
```
Stream text-to-speech audio chunks.
**Arguments**:
* `text` - Text to synthesize
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `request_options` - Request-level overrides
**Returns**:
AudioStream object that can be iterated for audio chunks
**Example**:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(api_key="...")
# Stream and process chunks
for chunk in client.tts.stream(text="Hello world"):
    process_audio_chunk(chunk)

# Or collect all at once
audio = client.tts.stream(text="Hello world").collect()
```
#### convert
```python theme={null}
def convert(*,
            text: str,
            reference_id: Optional[str] = None,
            references: Optional[list[ReferenceAudio]] = None,
            format: Optional[AudioFormat] = None,
            latency: Optional[LatencyMode] = None,
            speed: Optional[float] = None,
            config: TTSConfig = TTSConfig(),
            model: Model = "s2-pro",
            request_options: Optional[RequestOptions] = None) -> bytes
```
Convert text to speech and return complete audio as bytes.
This is a convenience method that streams all audio chunks and combines them.
For chunk-by-chunk processing, use stream() instead.
**Arguments**:
* `text` - Text to synthesize
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `request_options` - Request-level overrides
**Returns**:
Complete audio as bytes
**Example**:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play, save
client = FishAudio(api_key="...")
# Get complete audio
audio = client.tts.convert(text="Hello world")
# Play it
play(audio)
# Or save it
save(audio, "output.mp3")
```
#### stream\_websocket
```python theme={null}
def stream_websocket(
    text_stream: Iterable[Union[str, TextEvent, FlushEvent]],
    *,
    reference_id: Optional[str] = None,
    references: Optional[list[ReferenceAudio]] = None,
    format: Optional[AudioFormat] = None,
    latency: Optional[LatencyMode] = None,
    speed: Optional[float] = None,
    config: TTSConfig = TTSConfig(),
    model: Model = "s2-pro",
    max_workers: int = 10,
    ws_options: Optional[WebSocketOptions] = None) -> Iterator[bytes]
```
Stream text and receive audio in real-time via WebSocket.
Perfect for conversational AI, live captioning, and streaming applications.
**Arguments**:
* `text_stream` - Iterator of text chunks to stream
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `max_workers` - ThreadPoolExecutor workers for concurrent sender
* `ws_options` - WebSocket connection options for configuring timeouts, message size limits, etc.
Useful for long-running generations that may exceed default timeout values.
See WebSocketOptions class for available parameters.
**Returns**:
Iterator of audio bytes
**Example**:
```python theme={null}
from fishaudio import FishAudio, TTSConfig, ReferenceAudio, WebSocketOptions
client = FishAudio(api_key="...")
def text_generator():
    yield "Hello, "
    yield "this is "
    yield "streaming text!"

# Simple usage with defaults
with open("output.mp3", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(text_generator()):
        f.write(audio_chunk)

# With format and speed parameters
with open("output.wav", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        format="wav",
        speed=1.3
    ):
        f.write(audio_chunk)

# With reference_id parameter
with open("output.mp3", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(text_generator(), reference_id="your_model_id"):
        f.write(audio_chunk)

# With references parameter
with open("output.mp3", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        references=[ReferenceAudio(audio=audio_bytes, text="sample")]
    ):
        f.write(audio_chunk)

# With WebSocket options for long-running generations
# Useful if you're generating very long responses that may take >20 seconds
ws_options = WebSocketOptions(keepalive_ping_timeout_seconds=60.0)
with open("output.mp3", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        ws_options=ws_options
    ):
        f.write(audio_chunk)

# Parameters override config values
config = TTSConfig(format="mp3", latency="balanced")
with open("output.wav", "wb") as f:
    for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        format="wav",  # Parameter wins
        config=config
    ):
        f.write(audio_chunk)
```
## AsyncTTSClient Objects
```python theme={null}
class AsyncTTSClient()
```
Asynchronous TTS operations.
#### stream
```python theme={null}
async def stream(
    *,
    text: str,
    reference_id: Optional[str] = None,
    references: Optional[list[ReferenceAudio]] = None,
    format: Optional[AudioFormat] = None,
    latency: Optional[LatencyMode] = None,
    speed: Optional[float] = None,
    config: TTSConfig = TTSConfig(),
    model: Model = "s2-pro",
    request_options: Optional[RequestOptions] = None) -> AsyncAudioStream
```
Stream text-to-speech audio chunks (async).
**Arguments**:
* `text` - Text to synthesize
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `request_options` - Request-level overrides
**Returns**:
AsyncAudioStream object that can be iterated for audio chunks
**Example**:
```python theme={null}
from fishaudio import AsyncFishAudio
client = AsyncFishAudio(api_key="...")
# Stream and process chunks
async for chunk in await client.tts.stream(text="Hello world"):
    await process_audio_chunk(chunk)

# Or collect all at once
stream = await client.tts.stream(text="Hello world")
audio = await stream.collect()
```
#### convert
```python theme={null}
async def convert(*,
                  text: str,
                  reference_id: Optional[str] = None,
                  references: Optional[list[ReferenceAudio]] = None,
                  format: Optional[AudioFormat] = None,
                  latency: Optional[LatencyMode] = None,
                  speed: Optional[float] = None,
                  config: TTSConfig = TTSConfig(),
                  model: Model = "s2-pro",
                  request_options: Optional[RequestOptions] = None) -> bytes
```
Convert text to speech and return complete audio as bytes (async).
This is a convenience method that streams all audio chunks and combines them.
For chunk-by-chunk processing, use stream() instead.
**Arguments**:
* `text` - Text to synthesize
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `request_options` - Request-level overrides
**Returns**:
Complete audio as bytes
**Example**:
```python theme={null}
from fishaudio import AsyncFishAudio
from fishaudio.utils import play, save
client = AsyncFishAudio(api_key="...")
# Get complete audio
audio = await client.tts.convert(text="Hello world")
# Play it
play(audio)
# Or save it
save(audio, "output.mp3")
```
#### stream\_websocket
```python theme={null}
async def stream_websocket(
    text_stream: AsyncIterable[Union[str, TextEvent, FlushEvent]],
    *,
    reference_id: Optional[str] = None,
    references: Optional[list[ReferenceAudio]] = None,
    format: Optional[AudioFormat] = None,
    latency: Optional[LatencyMode] = None,
    speed: Optional[float] = None,
    config: TTSConfig = TTSConfig(),
    model: Model = "s2-pro",
    ws_options: Optional[WebSocketOptions] = None)
```
Stream text and receive audio in real-time via WebSocket (async).
Perfect for conversational AI, live captioning, and streaming applications.
**Arguments**:
* `text_stream` - Async iterator of text chunks to stream
* `reference_id` - Voice reference ID (overrides config.reference\_id if provided)
* `references` - Reference audio samples (overrides config.references if provided)
* `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided)
* `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided)
* `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided)
* `config` - TTS configuration (audio settings, voice, model parameters)
* `model` - TTS model to use
* `ws_options` - WebSocket connection options for configuring timeouts, message size limits, etc.
Useful for long-running generations that may exceed default timeout values.
See WebSocketOptions class for available parameters.
**Returns**:
Async iterator of audio bytes
**Example**:
```python theme={null}
import aiofiles
from fishaudio import AsyncFishAudio, TTSConfig, ReferenceAudio, WebSocketOptions

client = AsyncFishAudio(api_key="...")

async def text_generator():
    yield "Hello, "
    yield "this is "
    yield "async streaming!"

# Simple usage with defaults
async with aiofiles.open("output.mp3", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(text_generator()):
        await f.write(audio_chunk)

# With format and speed parameters
async with aiofiles.open("output.wav", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        format="wav",
        speed=1.3
    ):
        await f.write(audio_chunk)

# With reference_id parameter
async with aiofiles.open("output.mp3", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(text_generator(), reference_id="your_model_id"):
        await f.write(audio_chunk)

# With references parameter
async with aiofiles.open("output.mp3", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        references=[ReferenceAudio(audio=audio_bytes, text="sample")]
    ):
        await f.write(audio_chunk)

# With WebSocket options for long-running generations
# Useful if you're generating very long responses that may take >20 seconds
ws_options = WebSocketOptions(keepalive_ping_timeout_seconds=60.0)
async with aiofiles.open("output.mp3", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        ws_options=ws_options
    ):
        await f.write(audio_chunk)

# Parameters override config values
config = TTSConfig(format="mp3", latency="balanced")
async with aiofiles.open("output.wav", "wb") as f:
    async for audio_chunk in client.tts.stream_websocket(
        text_generator(),
        format="wav",  # Parameter wins
        config=config
    ):
        await f.write(audio_chunk)
```
# fishaudio.resources.realtime
Real-time WebSocket streaming helpers.
#### iter\_websocket\_audio
```python theme={null}
def iter_websocket_audio(ws) -> Iterator[bytes]
```
Process WebSocket audio messages (sync).
Receives messages from WebSocket, yields audio chunks, handles errors.
Unknown events are ignored and iteration continues.
**Arguments**:
* `ws` - WebSocket connection from httpx\_ws.connect\_ws
**Yields**:
Audio bytes
**Raises**:
* `WebSocketError` - On disconnect or error finish event
#### aiter\_websocket\_audio
```python theme={null}
async def aiter_websocket_audio(ws) -> AsyncIterator[bytes]
```
Process WebSocket audio messages (async).
Receives messages from WebSocket, yields audio chunks, handles errors.
Unknown events are ignored and iteration continues.
**Arguments**:
* `ws` - WebSocket connection from httpx\_ws.aconnect\_ws
**Yields**:
Audio bytes
**Raises**:
* `WebSocketError` - On disconnect or error finish event
# fishaudio.resources.asr
ASR (Automatic Speech Recognition) namespace client.
## ASRClient Objects
```python theme={null}
class ASRClient()
```
Synchronous ASR operations.
#### transcribe
```python theme={null}
def transcribe(
    *,
    audio: bytes,
    language: Optional[str] = OMIT,
    include_timestamps: bool = True,
    request_options: Optional[RequestOptions] = None) -> ASRResponse
```
Transcribe audio to text.
**Arguments**:
* `audio` - Audio file bytes
* `language` - Language code (e.g., "en", "zh"). Auto-detected if not provided.
* `include_timestamps` - Whether to include timestamp information for segments
* `request_options` - Request-level overrides
**Returns**:
ASRResponse with transcription text, duration, and segments
**Example**:
```python theme={null}
client = FishAudio(api_key="...")
with open("audio.mp3", "rb") as f:
    audio_bytes = f.read()

result = client.asr.transcribe(audio=audio_bytes, language="en")
print(result.text)
for segment in result.segments:
    print(f"{segment.start}-{segment.end}: {segment.text}")
```
## AsyncASRClient Objects
```python theme={null}
class AsyncASRClient()
```
Asynchronous ASR operations.
#### transcribe
```python theme={null}
async def transcribe(
    *,
    audio: bytes,
    language: Optional[str] = OMIT,
    include_timestamps: bool = True,
    request_options: Optional[RequestOptions] = None) -> ASRResponse
```
Transcribe audio to text (async).
**Arguments**:
* `audio` - Audio file bytes
* `language` - Language code (e.g., "en", "zh"). Auto-detected if not provided.
* `include_timestamps` - Whether to include timestamp information for segments
* `request_options` - Request-level overrides
**Returns**:
ASRResponse with transcription text, duration, and segments
**Example**:
```python theme={null}
import aiofiles

client = AsyncFishAudio(api_key="...")
async with aiofiles.open("audio.mp3", "rb") as f:
    audio_bytes = await f.read()

result = await client.asr.transcribe(audio=audio_bytes, language="en")
print(result.text)
for segment in result.segments:
    print(f"{segment.start}-{segment.end}: {segment.text}")
```
# Types
Source: https://docs.fish.audio/api-reference/sdk/python/types
# fishaudio.types.voices
Voice and model management types.
## Sample Objects
```python theme={null}
class Sample(BaseModel)
```
A sample audio for a voice model.
**Attributes**:
* `title` - Title/name of the audio sample
* `text` - Transcription of the spoken content in the sample
* `task_id` - Unique identifier for the sample task
* `audio` - URL or path to the audio file
## Author Objects
```python theme={null}
class Author(BaseModel)
```
Voice model author information.
**Attributes**:
* `id` - Unique author identifier
* `nickname` - Author's display name
* `avatar` - URL to author's avatar image
## Voice Objects
```python theme={null}
class Voice(BaseModel)
```
A voice model.
Represents a TTS voice that can be used for synthesis.
**Attributes**:
* `id` - Unique voice model identifier (use as reference\_id in TTS)
* `type` - Model type. Options: "svc" (singing voice conversion), "tts" (text-to-speech)
* `title` - Voice model title/name
* `description` - Detailed description of the voice model
* `cover_image` - URL to the voice model's cover image
* `train_mode` - Training mode used. Options: "fast"
* `state` - Current model state (e.g., "ready", "training", "failed")
* `tags` - List of tags for categorization (e.g., \["male", "english", "young"])
* `samples` - List of audio samples demonstrating the voice
* `created_at` - Timestamp when the model was created
* `updated_at` - Timestamp when the model was last updated
* `languages` - List of supported language codes (e.g., \["en", "zh"])
* `visibility` - Model visibility. Options: "public", "private", "unlist"
* `lock_visibility` - Whether visibility setting is locked
* `like_count` - Number of likes the model has received
* `mark_count` - Number of bookmarks/favorites
* `shared_count` - Number of times the model has been shared
* `task_count` - Number of times the model has been used for generation
* `liked` - Whether the current user has liked this model. Default: False
* `marked` - Whether the current user has bookmarked this model. Default: False
* `author` - Information about the voice model's creator
# fishaudio.types.account
Account-related types (credits, packages, etc.).
## Credits Objects
```python theme={null}
class Credits(BaseModel)
```
User's API credit balance.
**Attributes**:
* `id` - Unique credits record identifier
* `user_id` - User identifier
* `credit` - Current credit balance (decimal for precise accounting)
* `created_at` - Timestamp when the credits record was created
* `updated_at` - Timestamp when the credits were last updated
* `has_phone_sha256` - Whether the user has a verified phone number. Optional
* `has_free_credit` - Whether the user has received free credits. Optional
## Package Objects
```python theme={null}
class Package(BaseModel)
```
User's prepaid package information.
**Attributes**:
* `id` - Unique package identifier
* `user_id` - User identifier
* `type` - Package type identifier
* `total` - Total units in the package
* `balance` - Remaining units in the package
* `created_at` - Timestamp when the package was purchased
* `updated_at` - Timestamp when the package was last updated
* `finished_at` - Timestamp when the package was fully consumed. None if still active
# fishaudio.types.tts
TTS-related types.
## ReferenceAudio Objects
```python theme={null}
class ReferenceAudio(BaseModel)
```
Reference audio for voice cloning/style.
**Attributes**:
* `audio` - Audio file bytes for the reference sample
* `text` - Transcription of what is spoken in the reference audio. Should match exactly
what's spoken and include punctuation for proper prosody.
## Prosody Objects
```python theme={null}
class Prosody(BaseModel)
```
Speech prosody settings (speed and volume).
**Attributes**:
* `speed` - Speech speed multiplier. Range: 0.5-2.0. Default: 1.0.
  Examples: 1.5 = 50% faster, 0.8 = 20% slower
* `volume` - Volume adjustment in decibels. Range: -20.0 to 20.0. Default: 0.0 (no change).
Positive values increase volume, negative values decrease it.
#### from\_speed\_override
```python theme={null}
@classmethod
def from_speed_override(cls,
                        speed: float,
                        base: Optional["Prosody"] = None) -> "Prosody"
```
Create Prosody with speed override, preserving volume from base.
**Arguments**:
* `speed` - Speed value to use
* `base` - Base prosody to preserve volume from (if any)
**Returns**:
New Prosody instance with overridden speed
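The merge behavior can be sketched without the SDK. This illustrative snippet uses plain dicts in place of the `Prosody` model; the field names and the 0.0 dB default follow the attributes documented above:

```python theme={null}
# Standalone sketch of the from_speed_override merge logic, with plain
# dicts standing in for Prosody.
def speed_override(speed, base=None):
    # Keep the base volume when a base prosody is supplied; otherwise
    # fall back to the documented default of 0.0 dB (no change).
    volume = base["volume"] if base is not None else 0.0
    return {"speed": speed, "volume": volume}
```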
## TTSConfig Objects
```python theme={null}
class TTSConfig(BaseModel)
```
TTS generation configuration.
Reusable configuration for text-to-speech requests. Create once, use multiple times.
All parameters have sensible defaults.
**Attributes**:
* `format` - Audio output format. Options: "mp3", "wav", "pcm", "opus". Default: "mp3"
* `sample_rate` - Audio sample rate in Hz. If None, uses format-specific default.
* `mp3_bitrate` - MP3 bitrate in kbps. Options: 64, 128, 192. Default: 128
* `opus_bitrate` - Opus bitrate in kbps. Options: -1000, 24, 32, 48, 64. Default: 32
* `normalize` - Whether to normalize/clean the input text. Default: True
* `chunk_length` - Characters per generation chunk. Range: 100-300. Default: 200.
Lower values = faster initial response, higher values = better quality
* `latency` - Generation mode. Options: "normal" (higher quality), "balanced" (faster). Default: "balanced"
* `reference_id` - Voice model ID from fish.audio (e.g., "802e3bc2b27e49c2995d23ef70e6ac89").
Find IDs in voice URLs or via voices.list()
* `references` - List of reference audio samples for instant voice cloning. Default: \[]
* `prosody` - Speech speed and volume settings. Default: None (uses natural prosody)
* `top_p` - Nucleus sampling parameter for token selection. Range: 0.0-1.0. Default: 0.7
* `temperature` - Randomness in generation. Range: 0.0-1.0. Default: 0.7.
Higher = more varied, lower = more consistent
* `max_new_tokens` - Maximum number of tokens to generate. Default: 1024
* `repetition_penalty` - Penalty for repeated tokens. Default: 1.2
* `min_chunk_length` - Minimum chunk length for generation. Default: 50
* `condition_on_previous_chunks` - Whether to condition generation on previous chunks. Default: True
* `early_stop_threshold` - Threshold for early stopping. Default: 1.0
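To illustrate the "create once, use multiple times" idea, here is a plain-dict sketch (not SDK code) of the documented defaults with per-request overrides; the helper name is hypothetical:

```python theme={null}
# Sketch only: the documented TTSConfig defaults as a plain dict.
TTS_DEFAULTS = {
    "format": "mp3",
    "mp3_bitrate": 128,
    "opus_bitrate": 32,
    "normalize": True,
    "chunk_length": 200,
    "latency": "balanced",
    "top_p": 0.7,
    "temperature": 0.7,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.2,
}

def make_tts_config(**overrides):
    # Create once, then reuse with per-request overrides.
    cfg = dict(TTS_DEFAULTS)
    cfg.update(overrides)
    return cfg
```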
## TTSRequest Objects
```python theme={null}
class TTSRequest(BaseModel)
```
Request parameters for text-to-speech generation.
This model is used internally for WebSocket streaming.
For the HTTP API, parameters are passed directly to methods.
**Attributes**:
* `text` - Text to synthesize into speech
* `chunk_length` - Characters per generation chunk. Range: 100-300. Default: 200
* `format` - Audio output format. Options: "mp3", "wav", "pcm", "opus". Default: "mp3"
* `sample_rate` - Audio sample rate in Hz. If None, uses format-specific default
* `mp3_bitrate` - MP3 bitrate in kbps. Options: 64, 128, 192. Default: 128
* `opus_bitrate` - Opus bitrate in kbps. Options: -1000, 24, 32, 48, 64. Default: 32
* `references` - List of reference audio samples for voice cloning. Default: \[]
* `reference_id` - Voice model ID for using a specific voice. Default: None
* `normalize` - Whether to normalize/clean the input text. Default: True
* `latency` - Generation mode. Options: "normal", "balanced". Default: "balanced"
* `prosody` - Speech speed and volume settings. Default: None
* `top_p` - Nucleus sampling for token selection. Range: 0.0-1.0. Default: 0.7
* `temperature` - Randomness in generation. Range: 0.0-1.0. Default: 0.7
* `max_new_tokens` - Maximum number of tokens to generate. Default: 1024
* `repetition_penalty` - Penalty for repeated tokens. Default: 1.2
* `min_chunk_length` - Minimum chunk length for generation. Default: 50
* `condition_on_previous_chunks` - Whether to condition generation on previous chunks. Default: True
* `early_stop_threshold` - Threshold for early stopping. Default: 1.0
## StartEvent Objects
```python theme={null}
class StartEvent(BaseModel)
```
WebSocket start event to initiate TTS streaming.
**Attributes**:
* `event` - Event type identifier, always "start"
* `request` - TTS configuration for the streaming session
## TextEvent Objects
```python theme={null}
class TextEvent(BaseModel)
```
WebSocket event to send a text chunk for synthesis.
**Attributes**:
* `event` - Event type identifier, always "text"
* `text` - Text chunk to synthesize
## FlushEvent Objects
```python theme={null}
class FlushEvent(BaseModel)
```
WebSocket event to force immediate audio generation from buffered text.
Use this to ensure all buffered text is synthesized without waiting for more input.
**Attributes**:
* `event` - Event type identifier, always "flush"
## CloseEvent Objects
```python theme={null}
class CloseEvent(BaseModel)
```
WebSocket event to end the streaming session.
**Attributes**:
* `event` - Event type identifier, always "stop"
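Taken together, the four events describe a session lifecycle. A rough sketch of the message sequence, using plain dicts rather than the event models:

```python theme={null}
# Sketch of a streaming session: start -> text (repeated) -> flush -> stop.
def session_events(request, text_chunks):
    yield {"event": "start", "request": request}
    for chunk in text_chunks:
        yield {"event": "text", "text": chunk}
    yield {"event": "flush"}  # force synthesis of any buffered text
    yield {"event": "stop"}   # note: CloseEvent uses the identifier "stop"
```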
# fishaudio.types.shared
Shared types used across the SDK.
## PaginatedResponse Objects
```python theme={null}
class PaginatedResponse(BaseModel, Generic[T])
```
Generic paginated response.
**Attributes**:
* `total` - Total number of items across all pages
* `items` - List of items on the current page
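A common pattern is to walk pages until `total` is exhausted. A generic sketch, where `fetch_page` is a stand-in for any list endpoint that returns `total` and `items`:

```python theme={null}
# Sketch: iterate every item across a paginated endpoint.
def iter_all(fetch_page, page_size=10):
    offset = 0
    while True:
        page = fetch_page(offset=offset, limit=page_size)
        yield from page["items"]
        offset += len(page["items"])
        # Stop once all items are consumed or the server returns an empty page.
        if offset >= page["total"] or not page["items"]:
            break
```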
#### warn\_if\_deprecated\_model
```python theme={null}
def warn_if_deprecated_model(model: str) -> None
```
Emit a deprecation warning if a legacy model is used.
# fishaudio.types.asr
ASR (Automatic Speech Recognition) related types.
## ASRSegment Objects
```python theme={null}
class ASRSegment(BaseModel)
```
A timestamped segment of transcribed text.
**Attributes**:
* `text` - The transcribed text for this segment
* `start` - Segment start time in seconds
* `end` - Segment end time in seconds
## ASRResponse Objects
```python theme={null}
class ASRResponse(BaseModel)
```
Response from speech-to-text transcription.
**Attributes**:
* `text` - Complete transcription of the entire audio
* `duration` - Total audio duration in milliseconds
* `segments` - List of timestamped text segments. Empty if include\_timestamps=False
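Since segments carry `start`/`end` in seconds, they are easy to turn into timestamped lines. An illustrative formatter, using plain dicts in place of `ASRSegment`:

```python theme={null}
# Sketch: render ASR segments as simple "[MM:SS] text" lines.
def format_segments(segments):
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)
```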
#### duration
Duration in milliseconds
# Utils
Source: https://docs.fish.audio/api-reference/sdk/python/utils
# fishaudio.utils.play
Audio playback utility.
#### play
```python theme={null}
def play(audio: Union[bytes, Iterable[bytes]],
         *,
         notebook: bool = False,
         use_ffmpeg: bool = True) -> None
```
Play audio using various playback methods.
**Arguments**:
* `audio` - Audio bytes or iterable of bytes
* `notebook` - Use Jupyter notebook playback (IPython.display.Audio)
* `use_ffmpeg` - Use ffplay for playback (default, falls back to sounddevice)
**Raises**:
* `DependencyError` - If required playback tool is not installed
**Examples**:
```python theme={null}
from fishaudio import FishAudio, play
client = FishAudio(api_key="...")
audio = client.tts.convert(text="Hello world")
# Play directly
play(audio)
# In Jupyter notebook
play(audio, notebook=True)
# Force sounddevice fallback
play(audio, use_ffmpeg=False)
```
# fishaudio.utils.save
Audio saving utility.
#### save
```python theme={null}
def save(audio: Union[bytes, Iterable[bytes]], filename: str) -> None
```
Save audio to a file.
**Arguments**:
* `audio` - Audio bytes or iterable of bytes
* `filename` - Path to save the audio file
**Examples**:
```python theme={null}
from fishaudio import FishAudio, save
client = FishAudio(api_key="...")
audio = client.tts.convert(text="Hello world")
# Save to file
save(audio, "output.mp3")
# Works with iterators too
audio_stream = client.tts.convert(text="Another example")
save(audio_stream, "another.mp3")
```
# fishaudio.utils.stream
Audio streaming utility.
#### stream
```python theme={null}
def stream(audio_stream: Iterator[bytes]) -> bytes
```
Stream audio in real-time while playing it with mpv.
This function plays the audio as it's being generated and
simultaneously captures it to return the complete audio buffer.
**Arguments**:
* `audio_stream` - Iterator of audio byte chunks
**Returns**:
Complete audio bytes after streaming finishes
**Raises**:
* `DependencyError` - If mpv is not installed
**Examples**:
```python theme={null}
from fishaudio import FishAudio, stream
client = FishAudio(api_key="...")
audio_stream = client.tts.convert(text="Hello world")
# Stream and play in real-time, get complete audio
complete_audio = stream(audio_stream)
# Save the captured audio
with open("output.mp3", "wb") as f:
    f.write(complete_audio)
```
# Legacy
Source: https://docs.fish.audio/archive/python-sdk-legacy/index
Archived documentation for the legacy Session-based Python SDK
This documentation is for the legacy Python SDK using the Session-based API. This API is deprecated.
**Please migrate to the [new Python SDK](/developer-guide/sdk-guide/python)** which uses a modern client-based architecture.
See the [migration guide](/archive/python-sdk-legacy/migration-guide) for help upgrading.
## About the Legacy SDK
This archive contains documentation for the `fish_audio_sdk` module using the Session-based API. While this API still functions, it is no longer actively maintained and lacks the modern features available in the new SDK.
### What's Different in the New SDK
The new Python SDK (`fishaudio` module) offers:
* **Modern client-based architecture** - More intuitive and consistent with modern Python libraries
* **Full async support** - Native asyncio integration for better performance
* **Better type safety** - Comprehensive type hints and better IDE support
* **Improved error handling** - More detailed error messages and exception hierarchy
* **Enhanced utilities** - Built-in audio playback, streaming, and file management
* **Active maintenance** - Regular updates and new features
### Migration Path
We strongly recommend migrating to the new SDK. The [migration guide](/archive/python-sdk-legacy/migration-guide) provides:
* Side-by-side code comparisons
* Complete list of breaking changes
* Common migration patterns
* Troubleshooting tips
## Migration
Complete guide to upgrading from the legacy SDK to the new client-based API
## Legacy Documentation Pages
* How to install the legacy SDK
* Session initialization and API keys
* TTS with the Session-based API
* Reference audio and voice models
* ASR transcription with legacy API
* Real-time streaming with WebSocketSession
# Contributing
Source: https://docs.fish.audio/contributing
Help improve Fish Audio and contribute to our open source projects.
# Contributing to Fish Audio
First off, thanks for taking the time to contribute!
All types of contributions are encouraged and valued. See the sections below for different ways to help and details about how this project handles them. Please make sure to read the relevant section before making your contribution. It will make it a lot easier for us maintainers and smooth out the experience for all involved. The community looks forward to your contributions.
If you like the project but don't have time to contribute, there are other easy ways to support Fish Audio:
* Star our repositories
* Tweet about it
* Reference Fish Audio in your project's readme
* Mention the project at local meetups and tell your friends/colleagues
## Code of Conduct
This project and everyone participating in it is governed by the Fish Audio Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to our community team.
## I Have a Question
Before you ask a question, please read the available [Documentation](https://docs.fish.audio).
It's best to search for existing [Issues](https://github.com/fishaudio) that might help you. If you find a suitable issue and still need clarification, you can write your question there. It is also advisable to search the internet for answers first.
If you still need to ask a question:
1. Open an [Issue](https://github.com/fishaudio) in the relevant repository
2. Provide as much context as you can about what you're running into
3. Provide project and platform versions (Node.js, Python, OS, etc.), depending on what seems relevant
We will take care of the issue as soon as possible.
## I Want To Contribute
**Legal Notice**
When contributing to this project, you must agree that you have authored 100% of the content, that you have the necessary rights to the content, and that the content you contribute may be provided under the project license.
### Reporting Bugs
#### Before Submitting a Bug Report
A good bug report shouldn't leave others needing to chase you up for more information. Please investigate carefully, collect information, and describe the issue in detail:
* Make sure you are using the latest version
* Determine if your bug is really a bug and not an error on your side (e.g., incompatible environment components/versions)
* Check if there is already a bug report for your issue in the bug tracker
* Search the internet (including Stack Overflow) to see if others have discussed the issue
* Collect information about the bug:
* Stack trace (Traceback)
* OS, Platform and Version (Windows, Linux, macOS, x86, ARM)
* Version of the interpreter, compiler, SDK, runtime environment, package manager
* Your input and the output
* Can you reliably reproduce the issue? Can you reproduce it with older versions?
#### How Do I Submit a Good Bug Report?
You must never report security-related issues, vulnerabilities, or bugs that include sensitive information to the issue tracker. Instead, such reports must be sent by email to our security team.
We use GitHub issues to track bugs and errors. If you run into an issue:
1. Open an [Issue](https://github.com/fishaudio) in the relevant repository
2. Explain the behavior you would expect and the actual behavior
3. Provide as much context as possible and describe the **reproduction steps** that someone else can follow to recreate the issue on their own
4. Provide the information you collected in the previous section
Once filed:
* The project team will label the issue accordingly
* A team member will try to reproduce the issue with your provided steps
* If there are no reproduction steps, the team will ask for them and mark the issue as `needs-repro`
* If the team reproduces the issue, it will be marked `needs-fix` and left to be implemented
### Suggesting Enhancements
This section guides you through submitting an enhancement suggestion for Fish Audio, including completely new features and minor improvements to existing functionality.
#### Before Submitting an Enhancement
* Make sure you are using the latest version
* Read the [documentation](https://docs.fish.audio) carefully to see if the functionality already exists
* Perform a [search](https://github.com/fishaudio) to see if the enhancement has already been suggested
* Consider whether your idea fits with the scope and aims of the project
#### How Do I Submit a Good Enhancement Suggestion?
Enhancement suggestions are tracked as GitHub issues:
* Use a **clear and descriptive title** for the issue
* Provide a **step-by-step description** of the suggested enhancement in as much detail as possible
* **Describe the current behavior** and **explain which behavior you expected to see instead** and why
* Include **screenshots or screen recordings** if applicable
* **Explain why this enhancement would be useful** to most Fish Audio users
### Your First Code Contribution
We welcome first-time contributors! Here's how to get started:
1. **Fork the repository** you want to contribute to
2. **Clone your fork** locally
3. **Create a new branch** for your changes
4. **Make your changes** following our styleguides
5. **Test your changes** thoroughly
6. **Commit your changes** with clear commit messages
7. **Push to your fork** and submit a pull request
Look for issues labeled `good first issue` or `help wanted` for beginner-friendly tasks.
### Improving The Documentation
Documentation improvements are always welcome! This includes:
* Fixing typos and grammatical errors
* Adding missing information or clarifications
* Improving code examples
* Adding new guides or tutorials
* Translating documentation
See our [documentation repository](https://github.com/fishaudio/fish-docs) to get started.
## Styleguides
### Commit Messages
* Use clear and meaningful commit messages
* Start with a verb in the present tense (e.g., "Add", "Fix", "Update", "Remove")
* Keep the first line under 72 characters
* Reference issues and pull requests when relevant
* Provide additional context in the commit body if needed
Example:
```
Add voice cloning support for Python SDK
- Implement VoiceCloneClient class
- Add comprehensive error handling
- Include usage examples in docstrings
Closes #123
```
### Code Style
* Follow the existing code style in each repository
* Use meaningful variable and function names
* Add comments for complex logic
* Write tests for new features
* Ensure all tests pass before submitting
## Attribution
This contribution guide is based on the **contributing.md** generator. Fish Audio is committed to open source and welcomes contributions from developers worldwide.
# Emotion & Expression Control
Source: https://docs.fish.audio/developer-guide/best-practices/emotion-control
Make your AI voices express emotions naturally
## Overview
Control how your AI voice expresses emotions, from happy and excited to sad and contemplative. Add natural pauses, laughter, and other human-like elements to make speech more engaging.
The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
## How to Use
Simply wrap emotion tags in parentheses before your text:
```
(happy) What a beautiful day!
(sad) I'm sorry to hear that.
(excited) This is amazing news!
```
Include tone markers or audio effects:
```
(whispering) Let me tell you something.
(laughing) Ha ha ha, wow that's so funny!
```
## Important Rules
### Placement Matters
**For all languages:**
* Emotion tags MUST go at the beginning of sentences
* Tone controls can go anywhere in the text
* Sound effects can go anywhere in the text
**Correct:**
```
(happy) What a wonderful day!
```
**Incorrect:**
```
What a (happy) wonderful day!
```
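If you build tagged text programmatically, prepending the tag keeps it at the sentence start automatically. A small hypothetical helper (not part of the SDK):

```python theme={null}
def with_emotion(emotion, text):
    # Prepend the tag so it always lands at the beginning of the sentence.
    return f"({emotion}) {text}"
```

For example, `with_emotion("happy", "What a wonderful day!")` produces the correct form shown above.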
## Best Practices
**Do:**
* Use one emotion per sentence
* Add sounds after relevant words
* Keep tags simple and clear
* Test different combinations
**Don't:**
* Overuse tags in short text
* Mix conflicting emotions
* Create custom tags
* Forget the parentheses
## Available Emotions
See the [Emotion Reference](/api-reference/emotion-reference) for the full list of supported emotions.
## Scene Examples
**Customer Service:**
```
(friendly) Hello! How can I help you today?
(empathetic) I understand your frustration.
(confident) I'll resolve this for you right away.
```
**Storytelling:**
```
(mysterious)(whispering) Once upon a midnight dreary...
(excited) Suddenly, the door burst open!
(scared)(shouting) Run for your lives!
```
**Educational Content:**
```
(enthusiastic) Welcome to today's lesson!
(curious) Have you ever wondered why the sky is blue?
(proud) Great job! You got it right!
```
## Real-World Examples
### Virtual Assistant
```
(friendly) Good morning!
(helpful) I've prepared your schedule for today.
(concerned) You have three urgent emails.
(encouraging) Let's tackle them together!
```
### Audiobook Narration
```
(narrator) Chapter One: The Beginning
(mysterious) The old house stood silent in the fog.
(scared)(whispering) "Is anyone there?" she asked.
(relieved)(sighing) No one answered. Phew.
```
### Game Character
```
(brave) I'll defeat the dragon!
(struggling)(panting) This is... harder than... I thought!
(triumphant)(shouting) Victory is mine!
(laughing) Ha ha ha!
```
## Advanced Techniques
### Emotion Transitions
Gradually change emotions:
```
(happy) I got the promotion!
(uncertain) But... it means moving away.
(sad) I'll miss everyone here.
```
### Background Effects
Add atmosphere:
```
The comedy show was amazing (audience laughing)
Everyone was having fun (background laughter)
The crowd loved it (crowd laughing)
```
## Troubleshooting
### Emotion Not Working?
1. Check tag placement (beginning of sentence for emotions)
2. Verify spelling exactly matches the list
3. Don't use quotes around tags
4. Include parentheses
### Unnatural Sound?
* Add appropriate text after sound tags
* Don't overuse in short sentences
* Space out emotional changes
* Test with different voices
### Tips for Success
1. **Start simple** - Use basic emotions first
2. **Preview often** - Test how it sounds
3. **Be consistent** - Keep character emotions logical
4. **Less is more** - Don't overuse tags
## Get Creative
Experiment with combinations to create unique character voices and engaging narratives. The key is finding the right balance between emotional expression and natural speech flow.
## Support
Need help with emotions?
* **Try it live:** [fish.audio](https://fish.audio)
* **Community:** [Discord](https://discord.gg/fish-audio)
* **Email:** [support@fish.audio](mailto:support@fish.audio)
# Real-time Voice Streaming
Source: https://docs.fish.audio/developer-guide/best-practices/real-time-streaming
Stream voice generation in real-time for interactive applications
## Overview
Real-time streaming lets you generate speech as you type or speak, perfect for chatbots, virtual assistants, and live applications.
## When to Use Streaming
**Perfect for:**
* Live chat applications
* Virtual assistants
* Interactive storytelling
* Real-time translations
* Gaming dialogue
**Not ideal for:**
* Pre-recorded content
* Batch processing
## Getting Started
### Web Playground
Try real-time streaming instantly:
1. Visit [fish.audio](https://fish.audio)
2. Enable "Streaming Mode"
3. Start typing and hear voice generation in real-time
### Using the SDK
Stream text as it's being written:
```python theme={null}
from fishaudio import FishAudio
# Initialize client
client = FishAudio(api_key="your_api_key")
# Stream text word by word
def stream_text():
    text = "Hello, this is being generated in real time"
    for word in text.split():
        yield word + " "

# Generate speech as text streams
audio_stream = client.tts.stream_websocket(
    stream_text(),
    reference_id="your_voice_model_id",
    temperature=0.7,  # Controls variation
    top_p=0.7,        # Controls diversity
    latency="balanced"
)

with open("output.mp3", "wb") as f:
    for audio_chunk in audio_stream:
        f.write(audio_chunk)
```
```javascript theme={null}
import { FishAudioClient, RealtimeEvents } from "fish-audio";
import { writeFile } from "fs/promises";
import path from "path";
const apiKey = "your_api_key";
const referenceId = "your_voice_model_id";
async function* makeTextStream() {
  const chunks = [
    "Hello from Fish Audio! ",
    "This is a realtime text-to-speech test. ",
    "We are streaming multiple chunks over WebSocket.",
  ];
  for (const chunk of chunks) {
    yield chunk;
    await new Promise((r) => setTimeout(r, 200));
  }
}

async function main() {
  const client = new FishAudioClient({ apiKey });

  // For realtime, set text to "" and stream content via makeTextStream
  const request = {
    text: "",
    reference_id: referenceId,
  };

  const connection = await client.textToSpeech.convertRealtime(
    request,
    makeTextStream()
  );

  // Collect audio and write to a file when the stream ends
  const chunks = [];
  connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened"));
  connection.on(RealtimeEvents.AUDIO_CHUNK, (audio) => {
    if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) {
      chunks.push(Buffer.from(audio));
    }
  });
  connection.on(RealtimeEvents.ERROR, (err) =>
    console.error("WebSocket error:", err)
  );
  connection.on(RealtimeEvents.CLOSE, async () => {
    const outPath = path.resolve(process.cwd(), "out.mp3");
    await writeFile(outPath, Buffer.concat(chunks));
    console.log("Saved to", outPath);
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```
## Configuration Options
### Speed vs Quality
**Latency Modes:**
* **Normal:** Best quality, \~500ms latency
* **Balanced:** Good quality, \~300ms latency
```python theme={null}
# Use latency parameter with stream_websocket
audio_stream = client.tts.stream_websocket(
    text_chunks(),
    reference_id="model_id",
    latency="balanced"  # For faster response
)
```
```javascript theme={null}
const request = {
  text: "",
  reference_id: "model_id",
  latency: "balanced", // For faster response
};
```
### Voice Control
**Temperature** (0.1 - 1.0):
* Lower: More consistent, predictable
* Higher: More varied, expressive
**Top-p** (0.1 - 1.0):
* Lower: More focused
* Higher: More diverse
## Real-time Applications
### Chatbot Integration
Stream responses as they're generated:
```python theme={null}
def chatbot_response(user_input):
    # Get AI response (streaming)
    ai_text = get_ai_response(user_input)

    # Convert to speech in real-time
    audio_stream = client.tts.stream_websocket(ai_text)
    for audio_chunk in audio_stream:
        play_audio(audio_chunk)
```
```javascript theme={null}
async function chatbotResponse(userInput) {
  // Get AI response (streaming)
  const aiTextStream = getAiResponse(userInput); // async iterable of strings

  // Convert to speech in real-time
  for await (const textChunk of aiTextStream) {
    for await (const audioChunk of ttsStream(textChunk)) {
      playAudio(audioChunk);
    }
  }
}
```
### Live Translation
Translate and speak simultaneously:
```python theme={null}
def live_translate(source_audio):
    # Transcribe source audio
    text = transcribe(source_audio)

    # Translate text
    translated = translate(text, target_language)

    # Stream translated speech
    for chunk in stream_text(translated):
        generate_speech(chunk)
```
```javascript theme={null}
async function liveTranslate(sourceAudio) {
  // Transcribe source audio
  const text = await transcribe(sourceAudio);

  // Translate text
  const translated = await translate(text, targetLanguage);

  // Stream translated speech
  for await (const chunk of streamText(translated)) {
    generateSpeech(chunk);
  }
}
```
## Best Practices
### Text Buffering
**Do:**
* Send complete words with spaces
* Use punctuation for natural pauses
* Buffer 5-10 words for smoothness
**Don't:**
* Send individual characters
* Forget spaces between words
* Send huge chunks at once
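One way to follow this advice is to group whole words before releasing them to the synthesizer. A minimal sketch; the group size and the punctuation set used for flushing are arbitrary choices:

```python theme={null}
# Buffer words into small groups, flushing early at punctuation so the
# synthesizer receives natural phrase boundaries.
def buffer_words(words, group_size=5):
    buf = []
    for word in words:
        buf.append(word)
        if len(buf) >= group_size or word.rstrip().endswith((".", "!", "?", ",")):
            yield " ".join(buf) + " "  # keep the trailing space between chunks
            buf = []
    if buf:
        yield " ".join(buf) + " "
```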
### Connection Management
1. **Keep connections alive** for multiple generations
2. **Handle disconnections** gracefully
3. **Implement retry logic** for reliability
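A minimal retry sketch for point 3, assuming a `connect` callable that raises `ConnectionError` on failure; the attempt count and backoff are illustrative:

```python theme={null}
import time

# Retry a flaky connection attempt with exponential backoff before giving up.
def with_retries(connect, attempts=3, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            time.sleep(base_delay * (2 ** attempt))
```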
### Audio Playback
For smooth playback:
* Buffer 2-3 audio chunks
* Use cross-fading between chunks
* Handle network delays gracefully
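Pre-buffering can be as simple as holding back playback until a few chunks have arrived. An illustrative sketch:

```python theme={null}
# Hold back playback until `prefill` chunks have arrived (or the stream ends),
# so brief network stalls don't cause audible gaps.
def prebuffered(chunks, prefill=3):
    it = iter(chunks)
    head = [chunk for _, chunk in zip(range(prefill), it)]
    yield from head  # play the buffered chunks first
    yield from it    # then drain the remainder as it arrives
```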
## Common Use Cases
### Interactive Story
```python theme={null}
def interactive_story():
    story_parts = [
        "Once upon a time,",
        "in a land far away,",
        "there lived a brave knight..."
    ]
    for part in story_parts:
        # Generate and play each part
        stream_speech(part)
        # Wait for user input
        user_choice = get_user_input()
        # Continue based on choice
```
```javascript theme={null}
function interactiveStory() {
  const storyParts = [
    "Once upon a time,",
    "in a land far away,",
    "there lived a brave knight...",
  ];
  for (const part of storyParts) {
    // Generate and play each part
    streamSpeech(part);
    // Wait for user input
    const userChoice = getUserInput();
    // Continue based on choice
  }
}
```
### Virtual Assistant
```python theme={null}
def virtual_assistant():
    while True:
        # Listen for wake word
        if detect_wake_word():
            # Start streaming response
            response = process_command()
            stream_speech(response)
```
```javascript theme={null}
async function virtualAssistant() {
  while (true) {
    // Listen for wake word
    if (detectWakeWord()) {
      // Start streaming response
      const response = processCommand();
      streamSpeech(response);
    }
  }
}
```
### Live Commentary
```python theme={null}
def live_commentary(event_stream):
    for event in event_stream:
        # Generate commentary
        commentary = generate_commentary(event)
        # Stream immediately
        stream_speech(commentary)
```
```javascript theme={null}
async function liveCommentary(eventStream) {
  for await (const event of eventStream) {
    // Generate commentary
    const commentary = generateCommentary(event);
    // Stream immediately
    streamSpeech(commentary);
  }
}
```
## Troubleshooting
### Audio Gaps
**Problem:** Gaps between audio chunks
**Solution:**
* Increase buffer size
* Use balanced latency mode
* Check network connection
### Delayed Response
**Problem:** Long wait before audio starts
**Solution:**
* Use balanced latency mode
* Send initial text immediately
* Reduce chunk size
### Choppy Playback
**Problem:** Audio cuts in and out
**Solution:**
* Buffer more chunks before playing
* Check network stability
* Use consistent chunk sizes
## Advanced Features
### Dynamic Voice Switching
Change voices mid-stream:
```python theme={null}
# Start with one voice
def text1():
    yield "Hello from voice one."

audio1 = client.tts.stream_websocket(text1(), reference_id="voice1")
for chunk in audio1:
    play_audio(chunk)

# Switch to another
def text2():
    yield "And now voice two!"

audio2 = client.tts.stream_websocket(text2(), reference_id="voice2")
for chunk in audio2:
    play_audio(chunk)
```
```javascript theme={null}
// Start with one voice
const request1 = { reference_id: "voice1" };
streamSpeech("Hello from voice one.", request1);
// Switch to another
const request2 = { reference_id: "voice2" };
streamSpeech("And now voice two!", request2);
```
### Emotion Injection
Add emotions dynamically:
```python theme={null}
def emotional_speech(text, emotion):
    emotional_text = f"({emotion}) {text}"
    stream_speech(emotional_text)
```
```javascript theme={null}
function emotionalSpeech(text, emotion) {
  const emotionalText = `(${emotion}) ${text}`;
  streamSpeech(emotionalText);
}
```
### Speed Control
Adjust speaking speed:
```python theme={null}
from fishaudio.types import Prosody
# Use speed and volume with stream_websocket
audio_stream = client.tts.stream_websocket(
    text_chunks(),
    speed=1.5  # 1.5x speed
)
# Note: For full prosody control including volume, use TTSConfig
```
```javascript theme={null}
const request = {
  text: "",
  prosody: {
    speed: 1.5, // 1.5x speed
    volume: 0, // Normal volume
  },
};
```
## Performance Tips
1. **Pre-load voices** for instant start
2. **Use connection pooling** for multiple streams
3. **Monitor latency** and adjust settings
4. **Cache common phrases** for instant playback
## Get Support
Need help with streaming?
* **Discord Community:** [Join our Discord](https://discord.gg/fish-audio)
* **Email Support:** [support@fish.audio](mailto:support@fish.audio)
* **Status Page:** [status.fish.audio](https://status.fish.audio)
# Voice Cloning Best Practices
Source: https://docs.fish.audio/developer-guide/best-practices/voice-cloning
Simple tips to get the best voice cloning results with Fish Audio
## Getting Started
Voice cloning lets you create a digital version of any voice. Record at least 10 seconds of audio to get studio-quality results, right in the Playground or via the API.
## Recording Your Voice
### Find a Quiet Space
**Good places to record:**
* A bedroom with curtains and carpet
* Inside a parked car
* A quiet office or study room
* Any room with soft furniture
**Avoid recording near:**
* Open windows with traffic noise
* Running appliances (AC, fans, refrigerators)
* Other people talking
* TVs or music playing
### Use What You Have
**Best options:**
* USB microphone or gaming headset
* Phone voice recorder app (place it on a stable surface)
* Earbuds with microphone (hold them steady)
**Quick tip:** Keep the microphone about a hand's width from your mouth and speak normally.
## What to Say
**Best approach:** Record 2-3 clips of 15-20 seconds each that form a complete paragraph.
Here's a sample script you can read naturally:
```
"Hello, my name is Alex, and I enjoy reading books about technology
and science. Yesterday, I walked through the park, observing the
beautiful autumn leaves. The weather was quite pleasant, with a
gentle breeze and warm sunshine. I often think about how amazing
our world is, full of interesting discoveries waiting to be made."
```
### Recording Tips
**Must Have:**
* Only one person speaking
* Steady volume throughout
* Consistent tone and emotion
* Small pauses between sentences (about half a second)
**Nice to Have:**
* No background noise
* No room echo
* Professional mic (but phone is fine too!)
**Avoid:**
* Multiple speakers in one recording
* Big changes in volume or emotion
* Background music or TV
* Rushing through without pauses
## Troubleshooting
### Common Problems
**Voice sounds robotic?**
* Try recording for longer, 30-60 seconds
* Speak more naturally and add pauses
**Voice doesn't sound like you?**
* Make sure you're the only person speaking in the recording
* Check that there's no background music or TV
**Poor audio quality?**
* Find a quieter room to record
* Move closer to your microphone
* Try using a different recording device
## Important: Getting Permission
Only clone voices you have permission to use:
* Your own voice
* Someone who gave you written permission
* Never use voices from the internet without permission
* Never use celebrity or public figure voices without permission
## How to Upload Your Recording
1. Visit [fish.audio](https://fish.audio) and log in
2. Find the voice creation button in your dashboard
3. Select your recorded file and give your voice a name
4. Wait for processing (it usually takes just a few seconds)
5. Type some text and hear your cloned voice speak!
## Making Different Voices
Want to create character voices or different styles? Try these:
### Different Emotions
Record the same text with different feelings:
* Happy and energetic
* Calm and relaxed
* Serious and professional
### Different Characters
Create unique voices for:
* Storytelling and audiobooks
* Game characters
* Educational content
* Podcast intros
## Get Help
Need assistance? We're here to help:
* **Community Forum**: [Join our Discord](https://discord.gg/fish-audio)
* **Email Support**: [support@fish.audio](mailto:support@fish.audio)
* **Video Tutorials**: Coming soon!
# Creating Voice Models
Source: https://docs.fish.audio/developer-guide/core-features/creating-models
Learn how to create custom voice models with Fish Audio
## Overview
Create custom voice models to generate consistent, high-quality speech. You can create models through our web interface or programmatically via API.
## Web Interface
The easiest way to create a voice model:
1. Visit [fish.audio](https://fish.audio) and log in
2. Click on "Models" in your dashboard
3. Select "Create New Model"
4. Add 1 or more voice samples (at least 10 seconds each)
5. Choose privacy settings and training options
6. Click "Create" and wait for processing
## Using the API
### Using the SDK
Create models with the Python or JavaScript SDK:
First, install the SDK:
```bash theme={null}
pip install fish-audio-sdk
```
Then create a model:
```python theme={null}
from fish_audio_sdk import Session
# Initialize session with your API key
session = Session("your_api_key")
# Create the model
model = session.create_model(
    title="My Voice Model",
    description="Custom voice for storytelling",
    voices=[
        voice_file1.read(),
        voice_file2.read(),
    ],
    cover_image=image_file.read(),  # Optional
)
print(f"Model created: {model.id}")
```
First, install the SDK:
```bash theme={null}
npm install fish-audio
```
Then create a model:
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { createReadStream } from "fs";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
const title = "My Voice Model";
const audioFile1 = createReadStream("sample1.mp3");
// Optionally add more samples:
// const audioFile2 = createReadStream("sample2.wav");
const coverImageFile = createReadStream("cover.png"); // optional
try {
  const response = await fishAudio.voices.ivc.create({
    title,
    voices: [audioFile1],
    cover_image: coverImageFile,
    description: "Custom voice for storytelling",
    visibility: "private",
  });
  console.log("Voice created:", {
    id: response._id,
    title: response.title,
    state: response.state,
  });
} catch (err) {
  console.error("Create voice request failed:", err);
}
```
### Direct API
Create models directly using the REST API:
```python theme={null}
import requests
response = requests.post(
    "https://api.fish.audio/model",
    files=[
        ("voices", open("sample1.mp3", "rb")),
        ("voices", open("sample2.wav", "rb")),
    ],
    data=[
        ("title", "My Voice Model"),
        ("description", "Custom voice model"),
        ("visibility", "private"),
        ("type", "tts"),
        ("train_mode", "fast"),
        ("enhance_audio_quality", "true"),
    ],
    headers={
        "Authorization": "Bearer YOUR_API_KEY"
    },
)
result = response.json()
print(f"Model ID: {result['id']}")
```
```javascript theme={null}
import { readFile } from "fs/promises";
const form = new FormData();
form.append("title", "My Voice Model");
form.append("description", "Custom voice model");
form.append("visibility", "private");
form.append("type", "tts");
form.append("train_mode", "fast");
form.append("enhance_audio_quality", "true");
const v1 = await readFile("sample1.mp3");
const v2 = await readFile("sample2.wav");
form.append("voices", new File([v1], "sample1.mp3"));
form.append("voices", new File([v2], "sample2.wav"));
const res = await fetch("https://api.fish.audio/model", {
  method: "POST",
  headers: { Authorization: "Bearer YOUR_API_KEY" },
  body: form,
});
const result = await res.json();
console.log("Model ID:", result.id);
```
## Model Settings
### Required Parameters
| Parameter | Description | Type | Options |
| ----------------- | --------------------------------------------------------------------- | -------------- | ----------------------- |
| **title** | Name of your model | `string` | Any text |
| **voices** | Audio samples | `Array` | .mp3, .wav, .m4a, .opus |
| **type**\* | Model type | `enum` | `tts` |
| **train\_mode**\* | Training mode; `fast` makes the model available immediately after creation | `enum` | `fast` |
\*Automatically set by Python and JavaScript SDKs
### Optional Parameters
| Parameter | Description | Type | Options |
| --------------------------- | -------------------------------------------------- | --------------- | ---------------------------------------------------- |
| **visibility** | Who can use your model | `enum` | `private`, `public`, `unlist` (default: `public`) |
| **description** | Model description | `string` | Any text |
| **cover\_image** | Model cover image, required if the model is public | `File` | .jpg, .png |
| **texts** | Transcripts of audio samples | `Array` | Must match number of audio files |
| **tags** | Tags for your model | `string[]` | Any text |
| **enhance\_audio\_quality** | Remove background noise | `boolean` | `true`, `false` (default: `false`) |
For detailed explanations view our [API reference](/api-reference/endpoint/model/create-model).
## Audio Requirements
### Quality Guidelines
**Minimum Requirements:**
* At least 1 audio sample
* 10+ seconds per sample
**Best Practices:**
* Use multiple diverse samples
* 1 consistent speaker throughout
* Include different emotions and tones
* Record in a quiet environment
* Maintain steady volume
## Adding Transcripts
Including text transcripts improves model quality:
```python theme={null}
response = requests.post(
    "https://api.fish.audio/model",
    files=[
        ("voices", open("hello.mp3", "rb")),
        ("voices", open("world.wav", "rb")),
    ],
    data=[
        ("title", "Enhanced Model"),
        ("texts", "Hello, this is my first recording."),
        ("texts", "Welcome to the world of AI voices."),
        # ... other parameters
    ],
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
```
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { createReadStream } from "fs";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
const response = await fishAudio.voices.ivc.create({
  title: "Enhanced Model",
  voices: [
    createReadStream("hello.mp3"),
    createReadStream("world.wav"),
  ],
  texts: [
    "Hello, this is my first recording.",
    "Welcome to the world of AI voices.",
  ],
  // other optional fields:
  // visibility: "private",
  // enhance_audio_quality: true,
});
console.log("Model ID:", response._id);
```
Text transcripts must match the exact number of audio files. If you provide 3 audio files, you must provide exactly 3 text transcripts.
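A quick pre-flight check avoids a rejected upload. This helper is a sketch, not part of the SDK:

```python theme={null}
def validate_transcripts(voice_paths, texts):
    # The API expects exactly one transcript per audio sample
    if len(voice_paths) != len(texts):
        raise ValueError(
            f"{len(voice_paths)} audio files but {len(texts)} transcripts; "
            "counts must match exactly."
        )
    return list(zip(voice_paths, texts))
```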
## Using Your Model
Once training is complete:
```python theme={null}
import requests

# Generate speech with your model
response = requests.post(
    "https://api.fish.audio/v1/tts",
    json={
        "text": "Hello from my custom voice!",
        "model_id": model_id,
        "format": "mp3",
    },
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Save the audio
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { writeFile } from "fs/promises";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
const audio = await fishAudio.textToSpeech.convert({
  text: "Hello from my custom voice!",
  model_id: "your_model_id_here",
  format: "mp3",
});
const buffer = Buffer.from(await new Response(audio).arrayBuffer());
await writeFile("output.mp3", buffer);
console.log("✓ Audio saved to output.mp3");
```
## Troubleshooting
### Common Issues
**Model training fails:**
* Check audio quality and format
* Ensure single speaker in all samples
* Verify files are not corrupted
**Poor voice quality:**
* Add more diverse audio samples
* Enable audio enhancement
* Use higher quality recording
## Best Practices
1. **Start Simple:** Begin with 2-3 samples in fast mode to test
2. **Iterate:** Refine with more samples and quality mode
3. **Document:** Keep track of which samples work best
4. **Test Thoroughly:** Try different texts and emotions
5. **Privacy First:** Keep personal models private
## Support
Need help creating models?
* **API Documentation:** [Full API Reference](/api-reference/introduction)
* **Discord Community:** [Join our Discord](https://discord.gg/fish-audio)
* **Email Support:** [support@fish.audio](mailto:support@fish.audio)
# Emotion Control
Source: https://docs.fish.audio/developer-guide/core-features/emotions
Add natural emotions and expressions to your AI-generated speech
## Overview
Fish Audio models support 64+ emotional expressions and voice styles that can be controlled through text markers in your input. Add natural pauses, laughter, and other human-like elements to make speech more engaging and realistic.
The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
## How It Works
Simply wrap emotion tags in parentheses within your text:
```
(happy) What a beautiful day!
(sad) I'm sorry to hear that.
(excited) This is amazing news!
```
The TTS models will automatically recognize these markers and adjust the voice accordingly.
## Complete Emotion Reference
### Basic Emotions (24 expressions)
| Emotion | Tag | Description | Example Context |
| ----------- | --------------- | ----------------------- | --------------------------- |
| Happy | `(happy)` | Cheerful, upbeat tone | Good news, greetings |
| Sad | `(sad)` | Melancholic, downcast | Sympathy, bad news |
| Angry | `(angry)` | Frustrated, aggressive | Complaints, warnings |
| Excited | `(excited)` | Energetic, enthusiastic | Announcements, celebrations |
| Calm | `(calm)` | Peaceful, relaxed | Instructions, meditation |
| Nervous | `(nervous)` | Anxious, uncertain | Disclaimers, apologies |
| Confident | `(confident)` | Assertive, self-assured | Presentations, sales |
| Surprised | `(surprised)` | Shocked, amazed | Reactions, discoveries |
| Satisfied | `(satisfied)` | Content, pleased | Confirmations, reviews |
| Delighted | `(delighted)` | Very pleased, joyful | Celebrations, compliments |
| Scared | `(scared)` | Frightened, fearful | Warnings, horror stories |
| Worried | `(worried)` | Concerned, troubled | Concerns, questions |
| Upset | `(upset)` | Disturbed, distressed | Complaints, problems |
| Frustrated | `(frustrated)` | Annoyed, exasperated | Technical issues, delays |
| Depressed | `(depressed)` | Very sad, hopeless | Serious topics |
| Empathetic | `(empathetic)` | Understanding, caring | Support, counseling |
| Embarrassed | `(embarrassed)` | Ashamed, awkward | Apologies, mistakes |
| Disgusted | `(disgusted)` | Repelled, revolted | Negative reviews |
| Moved | `(moved)` | Emotionally touched | Heartfelt moments |
| Proud | `(proud)` | Accomplished, satisfied | Achievements, praise |
| Relaxed | `(relaxed)` | At ease, casual | Casual conversation |
| Grateful | `(grateful)` | Thankful, appreciative | Thanks, appreciation |
| Curious | `(curious)` | Inquisitive, interested | Questions, exploration |
| Sarcastic | `(sarcastic)` | Ironic, mocking | Humor, criticism |
### Advanced Emotions (25 expressions)
| Emotion | Tag | Description | Example Context |
| ------------- | ----------------- | ------------------------ | ---------------------- |
| Disdainful | `(disdainful)` | Contemptuous, scornful | Criticism, rejection |
| Unhappy | `(unhappy)` | Discontent, dissatisfied | Complaints, feedback |
| Anxious | `(anxious)` | Very worried, uneasy | Urgent matters |
| Hysterical | `(hysterical)` | Uncontrollably emotional | Extreme reactions |
| Indifferent | `(indifferent)` | Uncaring, neutral | Neutral responses |
| Uncertain | `(uncertain)` | Doubtful, unsure | Speculation, questions |
| Doubtful | `(doubtful)` | Skeptical, questioning | Disbelief, questioning |
| Confused | `(confused)` | Puzzled, perplexed | Clarification requests |
| Disappointed | `(disappointed)` | Let down, dissatisfied | Unmet expectations |
| Regretful | `(regretful)` | Sorry, remorseful | Apologies, mistakes |
| Guilty | `(guilty)` | Culpable, responsible | Confessions, apologies |
| Ashamed | `(ashamed)` | Deeply embarrassed | Serious mistakes |
| Jealous | `(jealous)` | Envious, resentful | Comparisons |
| Envious | `(envious)` | Wanting what others have | Admiration with desire |
| Hopeful | `(hopeful)` | Optimistic about future | Future plans |
| Optimistic | `(optimistic)` | Positive outlook | Encouragement |
| Pessimistic | `(pessimistic)` | Negative outlook | Warnings, doubts |
| Nostalgic | `(nostalgic)` | Longing for the past | Memories, stories |
| Lonely | `(lonely)` | Isolated, alone | Emotional content |
| Bored | `(bored)` | Uninterested, weary | Disinterest |
| Contemptuous | `(contemptuous)` | Showing contempt | Strong criticism |
| Sympathetic | `(sympathetic)` | Showing sympathy | Condolences |
| Compassionate | `(compassionate)` | Showing deep care | Support, help |
| Determined | `(determined)` | Resolved, decided | Goals, commitments |
| Resigned | `(resigned)` | Accepting defeat | Giving up, acceptance |
### Tone Markers (5 expressions)
Control volume and intensity:
| Tone | Tag | Description | When to Use |
| ---------- | ------------------- | -------------------- | -------------------------- |
| Hurried | `(in a hurry tone)` | Rushed, urgent | Time-sensitive information |
| Shouting | `(shouting)` | Loud, calling out | Getting attention |
| Screaming | `(screaming)` | Very loud, panicked | Emergencies, fear |
| Whispering | `(whispering)` | Very soft, secretive | Secrets, quiet scenes |
| Soft | `(soft tone)` | Gentle, quiet | Comfort, lullabies |
### Audio Effects (10 expressions)
Add natural human sounds:
| Effect | Tag | Description | Suggested Text |
| ------------- | ----------------- | ---------------------------- | -------------- |
| Laughing | `(laughing)` | Full laughter | Ha, ha, ha |
| Chuckling | `(chuckling)` | Light laugh | Heh, heh |
| Sobbing | `(sobbing)` | Crying heavily | (optional) |
| Crying Loudly | `(crying loudly)` | Intense crying | (optional) |
| Sighing | `(sighing)` | Exhale of relief/frustration | sigh |
| Groaning | `(groaning)` | Sound of frustration | ugh |
| Panting | `(panting)` | Out of breath | huff, puff |
| Gasping | `(gasping)` | Sharp intake of breath | gasp |
| Yawning | `(yawning)` | Tired sound | yawn |
| Snoring | `(snoring)` | Sleep sound | zzz |
### Special Effects
Additional markers for atmosphere and context:
| Effect | Tag | Description |
| ------------------- | ----------------------- | ------------------------ |
| Audience Laughter | `(audience laughing)` | Crowd laughing sound |
| Background Laughter | `(background laughter)` | Ambient laughter |
| Crowd Laughter | `(crowd laughing)` | Large group laughing |
| Short Pause | `(break)` | Brief pause in speech |
| Long Pause | `(long-break)` | Extended pause in speech |
You can also use natural expressions like "Ha, ha, ha" for laughter without tags.
## Usage Guidelines
### Placement Rules
**For English and Most Languages:**
* Emotion tags MUST go at the beginning of sentences
* Tone controls can go anywhere in the text
* Sound effects can go anywhere in the text
**Correct:**
```
(happy) What a wonderful day!
```
**Incorrect:**
```
What a (happy) wonderful day!
```
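If you build text programmatically, a small helper can keep tags at sentence starts. A sketch (the sentence splitter is naive; adapt it to your content):

```python theme={null}
import re

def tag_sentences(text, emotion):
    # Prefix every sentence with the tag, as S1 requires for English
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(f"({emotion}) {s}" for s in sentences if s)
```

For example, `tag_sentences("What a day! It went well.", "happy")` returns `(happy) What a day! (happy) It went well.`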
## Advanced Techniques
### Combining Effects
You can layer multiple emotions for complex expressions:
```
(sad)(whispering) I miss you so much.
(angry)(shouting) Get out of here now!
(excited)(laughing) We won! Ha ha!
```
### Emotion Transitions
Create natural emotional progressions:
```
(happy) I got the promotion!
(uncertain) But... it means relocating.
(sad) I'll miss everyone here.
(hopeful) Though it's a great opportunity.
(determined) I'm going to make it work!
```
### Background Effects
Add atmospheric sounds:
```
The comedy show was amazing (audience laughing)
Everyone was having fun (background laughter)
The crowd loved it (crowd laughing)
```
### Intensity Modifiers
Fine-tune emotional intensity with descriptive modifiers:
```
(slightly sad) I'm a bit disappointed.
(very excited) This is absolutely amazing!
(extremely angry) This is unacceptable!
```
## Language Support
All 13 supported languages can use emotion markers. Emotions must be at sentence start for these languages:
* **English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese**
## Best Practices
### Do's
* Use one primary emotion per sentence
* Test different emotion combinations
* Match emotions to context logically
* Add appropriate text after sound effects (e.g., "Ha ha" after laughing)
* Use natural expressions when possible
* Space out emotional changes for realism
### Don'ts
* Don't overuse emotion tags in short text
* Don't mix conflicting emotions
* Don't create custom tags - use only supported ones
* Don't forget parentheses
* Don't place emotion tags mid-sentence in English
## Common Use Cases
### Customer Service
```
(friendly) Hello! How can I help you today?
(empathetic) I understand your frustration.
(confident) I'll resolve this for you right away.
(grateful) Thank you for your patience!
```
### Storytelling
```
(narrator) Once upon a time...
(mysterious)(whispering) The old house stood silent.
(scared) "Is anyone there?" she called out.
(relieved)(sighing) No one answered. Phew.
```
### Educational Content
```
(enthusiastic) Welcome to today's lesson!
(curious) Have you ever wondered why?
(encouraging) That's a great question!
(proud) Excellent work!
```
### Marketing & Sales
```
(excited) Introducing our newest product!
(confident) You won't find better quality anywhere.
(urgent) Limited time offer!
(satisfied) Join thousands of happy customers!
```
## Troubleshooting
### Emotion Not Working?
1. **Check placement** - Emotions must be at the beginning of sentences for English
2. **Verify spelling** - Tags must match exactly as listed
3. **Include parentheses** - Tags must be wrapped in parentheses
### Unnatural Sound?
* Space out emotional changes
* Use appropriate intensity
* Test with different voices
* Add context text after sound effects
### Performance Notes
* Emotion markers don't count toward token limits
* No additional latency for emotion processing
* All emotions available on all pricing tiers
* Maximum of 3 combined emotions per sentence recommended
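The three-emotion recommendation is easy to enforce with a quick check. A sketch using the tag syntax shown above:

```python theme={null}
import re

def too_many_tags(sentence, limit=3):
    # Count the run of (tag) markers at the start of the sentence
    leading = re.match(r"(?:\([^)]+\)\s*)+", sentence)
    tags = re.findall(r"\([^)]+\)", leading.group(0)) if leading else []
    return len(tags) > limit
```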
## Quick Reference Tables
### Emotion Intensity Scale
| Base Emotion | Mild | Moderate | Intense |
| ------------ | ------------ | -------- | --------- |
| Happy | satisfied | happy | delighted |
| Sad | disappointed | sad | depressed |
| Angry | frustrated | angry | furious |
| Scared | nervous | scared | terrified |
| Excited | interested | excited | ecstatic |
### Common Combinations
| Scenario | Emotion Combo | Example |
| ---------------- | ------------------------ | ------------------------------------- |
| Whispered Secret | (mysterious)(whispering) | "I have something to tell you..." |
| Angry Shout | (angry)(shouting) | "Stop right there!" |
| Sad Sigh | (sad)(sighing) | "I wish things were different. Sigh." |
| Excited Laugh | (excited)(laughing) | "We did it! Ha ha!" |
| Nervous Question | (nervous)(uncertain) | "Are you sure about this?" |
## See Also
* [Emotion Reference Guide](/api-reference/emotion-reference) - Complete emotion list with examples
* [API Reference](/api-reference/introduction) - Implementation details
* [Text-to-Speech Guide and Best Practices](/developer-guide/core-features/text-to-speech)
# Fine-grained Control
Source: https://docs.fish.audio/developer-guide/core-features/fine-grained-control
Advanced control over speech generation
## Getting Started
To use fine-grained control, you can use either our SDK, API, or Playground.
SDK/API: We recommend disabling normalization by setting `"normalize": false` in the request body. This ensures that the API doesn't alter the intonation of control tags.
Playground: select the V1.6 Control Model; no other options are required.
Disabling normalization may reduce the stability of reading numbers, dates, and URLs. You'll need to handle these cases manually for best results.
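When calling the REST endpoint directly, the request body just needs `"normalize": false` alongside your usual fields. A sketch of the payload (field names follow the TTS examples elsewhere in these docs):

```python theme={null}
import json

def build_tts_payload(text, reference_id):
    # Disabling normalization lets control tags pass through unchanged;
    # handle numbers, dates, and URLs in `text` yourself.
    return json.dumps({
        "text": text,
        "reference_id": reference_id,
        "normalize": False,
    })
```

POST this body to the TTS endpoint with your usual `Authorization` header.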
## Phoneme Control
Phoneme control allows you to specify exact pronunciations for words or characters. Currently, we support:
* CMU Arpabet (for English)
* Pinyin (for Chinese)
To use phoneme control, wrap the desired pronunciation in `<|phoneme_start|>` and `<|phoneme_end|>` tags. Each tag should contain a single word or character.
### English Example
Standard: "I am an engineer."
With phoneme control: "I am an `<|phoneme_start|>EH N JH AH N IH R<|phoneme_end|>`."
### Chinese Example
Standard: "我是一个工程师。"
With phoneme control: "我是一个`<|phoneme_start|>gong1<|phoneme_end|><|phoneme_start|>cheng2<|phoneme_end|><|phoneme_start|>shi1<|phoneme_end|>`。"
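A tiny helper keeps the tag syntax out of your text-building code. This sketch reconstructs the Chinese example above:

```python theme={null}
def phoneme(pron):
    # Each tag pair wraps the pronunciation of one word or character
    return f"<|phoneme_start|>{pron}<|phoneme_end|>"

sentence = "我是一个" + "".join(phoneme(p) for p in ["gong1", "cheng2", "shi1"]) + "。"
```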
## Paralanguage
Paralanguage controls allow you to add natural speech elements and pauses to make the generated speech sound more human-like. There are two main types of controls:
### Pause Words
You can use common pause words like "um", "uh", "嗯", "啊" to control the rhythm of the speech.
### Special Effects
The following special effects can be added using parentheses:
| Effect | Description | First Available | Stage |
| ---------------- | ------------------ | --------------- | ------------ |
| `(break)` | Short pause | V1.6 | Experimental |
| `(long-break)` | Extended pause | V1.6 | Experimental |
| `(breath)` | Breathing sound | V1.6 | Experimental |
| `(laugh)` | Laughter sound | V1.6 | Experimental |
| `(cough)` | Coughing sound | V1.6 | Experimental |
| `(lip-smacking)` | Lip smacking sound | V1.6 | Experimental |
| `(sigh)` | Sighing sound | V1.6 | Experimental |
The effects `(laugh)`, `(cough)`, `(lip-smacking)`, and `(sigh)` are still in development. You may need to repeat them multiple times for better results.
Example:
Standard: "I am an engineer."
With paralanguage: "I am, um, an (break) engineer."
# Speech to Text Guide
Source: https://docs.fish.audio/developer-guide/core-features/speech-to-text
Convert audio recordings into accurate text transcriptions
## Overview
Transform any audio recording into text with Fish Audio's speech recognition. Perfect for transcriptions, subtitles, and voice commands.
## Getting Started
### Web Interface
Transcribe audio instantly:
1. Go to [fish.audio](https://fish.audio) and log in
2. Click on "Speech to Text" in your dashboard
3. Select your audio file (MP3, WAV, M4A)
4. Click "Transcribe" and copy your text
## Supported Formats
### Audio Files
**Accepted formats:**
* MP3 (recommended)
* WAV
* M4A
* OGG
* FLAC
* AAC
**File requirements:**
* Maximum size: 20MB
* Maximum duration: 60 minutes
* Minimum duration: 1 second
## Language Support
### Automatic Detection
The system automatically detects the language spoken in your audio. No configuration needed!
### Manual Selection
For better accuracy, specify the language:
**Major Languages:**
* English (en)
* Chinese (zh)
* Japanese (ja)
Support for **additional languages** is coming soon!
## Audio Quality Tips
### For Best Results
**Recording Environment:**
* Quiet room with minimal echo
* No background music
* Clear, consistent speaking voice
* One speaker at a time
**Audio Settings:**
* Sample rate: 16kHz or higher
* Bit rate: 128kbps or higher
* Mono or stereo (mono preferred)
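For WAV files, Python's standard `wave` module can check the sample-rate guideline before you upload. A sketch:

```python theme={null}
import wave

def meets_quality_bar(path, min_rate=16000):
    # Compare the file's sample rate against the 16 kHz guideline
    with wave.open(path, "rb") as w:
        return w.getframerate() >= min_rate
```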
### Common Issues
**Poor transcription quality?**
* Remove background noise
* Increase microphone volume
* Speak clearly and not too fast
* Avoid multiple speakers talking over each other
## Use Cases
### Meeting Transcription
Convert recorded meetings into searchable text:
1. Record your meeting (Zoom, Teams, etc.)
2. Export the audio file
3. Upload to Fish Audio
4. Get formatted transcription with timestamps
### Podcast Transcripts
Create written versions of your podcasts:
* Generate show notes automatically
* Create searchable content
* Improve accessibility
* Enable translations
### Video Subtitles
Generate subtitles for your videos:
1. Extract audio from video
2. Transcribe with Fish Audio
3. Get timestamped text
4. Import into video editor
### Voice Notes
Convert voice memos to text:
* Dictate ideas quickly
* Transcribe later for editing
* Search through voice notes
* Share as text documents
## Advanced Features
### Timestamps
Get precise timing for each spoken segment:
```
[00:00:00] Welcome to our podcast.
[00:00:03] Today we're discussing AI technology.
[00:00:07] Let's dive right in.
```
Perfect for:
* Creating subtitles
* Navigating long recordings
* Synchronizing with video
* Building searchable archives
### Speaker Detection
Identify different speakers in conversations:
```
Speaker 1: "What do you think about the proposal?"
Speaker 2: "I think it has potential."
Speaker 1: "Let's discuss the details."
```
### Punctuation & Formatting
Automatic formatting includes:
* Sentence capitalization
* Punctuation marks
* Paragraph breaks
* Number formatting
## Tips for Different Content
### Interviews
**Best practices:**
* Use a good microphone for each speaker
* Record in a quiet environment
* Speak one at a time
* Keep consistent volume levels
### Lectures & Presentations
**Optimize for:**
* Clear articulation of technical terms
* Pause between topics
* Repeat important points
* Avoid reading too fast
### Phone Calls
**Considerations:**
* Phone audio is lower quality
* Expect slightly lower accuracy
* Speak clearly and slowly
* Avoid speakerphone if possible
## Accuracy Expectations
### What Affects Accuracy
**Positive factors:**
* Clear audio quality
* Native speaker accent
* Common vocabulary
* Single speaker
**Challenging factors:**
* Heavy accents
* Technical jargon
* Multiple speakers
* Background noise
### Typical Accuracy Rates
* **Professional recording:** 95-98%
* **Clean amateur recording:** 90-95%
* **Phone/video calls:** 85-90%
* **Noisy environments:** 75-85%
## Post-Processing Tips
### Editing Transcriptions
After transcription:
1. **Review for accuracy** - Check names and technical terms
2. **Add formatting** - Break into paragraphs
3. **Correct errors** - Fix any misheard words
4. **Add context** - Include speaker names
### Export Options
Save your transcriptions as:
* Plain text (.txt)
* Word document (.docx)
* Subtitle file (.srt)
* PDF document
## Common Applications
### Business
* Meeting minutes
* Interview transcripts
* Call recordings
* Training materials
### Education
* Lecture notes
* Research interviews
* Student recordings
* Language learning
### Content Creation
* Video scripts
* Podcast show notes
* Social media captions
* Blog post drafts
### Accessibility
* Hearing impaired support
* Multi-language content
* Searchable archives
* Documentation
## Troubleshooting
### No Text Output
**Check:**
* Audio file isn't corrupted
* File format is supported
* Audio contains speech
* Volume is audible
### Incorrect Language
**Solutions:**
* Manually select the correct language
* Ensure majority of audio is in one language
* Separate multi-language content
### Missing Words
**Common causes:**
* Speaking too fast
* Mumbling or unclear speech
* Technical terms not recognized
* Very quiet sections
## Privacy & Security
### Your Data
* Audio files are processed securely
* Transcriptions are private to your account
* Files are not used for training
* Delete anytime from your account
### Sensitive Content
For confidential audio:
* Use on-premise solutions if available
* Review privacy policy
* Consider redacting sensitive information
* Download and delete after processing
## Best Practices Summary
1. **Start with quality audio** - Good input = good output
2. **Choose the right environment** - Quiet spaces work best
3. **Speak clearly** - Articulate and consistent pace
4. **Review and edit** - All transcriptions benefit from review
5. **Use appropriate tools** - Different content needs different approaches
## Get Support
Need help with transcription?
* **Try it free:** [fish.audio](https://fish.audio)
* **Community:** [Discord](https://discord.gg/fish-audio)
* **Email:** [support@fish.audio](mailto:support@fish.audio)
* **Status:** [status.fish.audio](https://status.fish.audio)
# Text to Speech
Source: https://docs.fish.audio/developer-guide/core-features/text-to-speech
Convert text to natural-sounding speech with Fish Audio
## Overview
Transform any text into natural, expressive speech using Fish Audio's advanced TTS models. Choose from pre-made voices or use your own cloned voices.
Discover the world's best cloned voice models on our [Discovery](https://fish.audio/discovery) page.
## Quick Start
### Web Interface
The easiest way to generate speech:
1. Go to [fish.audio](https://fish.audio) and log in
2. Type or paste the text you want to convert
3. Select from available voices or use your own
4. Click "Generate" and download your audio
## Using the SDK
```bash theme={null}
pip install fish-audio-sdk
```
Generate speech with just a few lines of code:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import save

# Initialize client
client = FishAudio(api_key="your_api_key_here")

# Generate speech
audio = client.tts.convert(
    text="Hello, world!",
    reference_id="your_voice_model_id"
)

save(audio, "output.mp3")
print("✓ Audio saved to output.mp3")
```
```bash theme={null}
npm install fish-audio
```
Generate speech with just a few lines of code:
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { writeFile } from "fs/promises";

// Initialize client
const fishAudio = new FishAudioClient({ apiKey: "your_api_key_here" });

const audio = await fishAudio.textToSpeech.convert({
  text: "Hello, world!",
  reference_id: "your_voice_model_id",
});

const buffer = Buffer.from(await new Response(audio).arrayBuffer());
await writeFile("output.mp3", buffer);
console.log("✓ Audio saved to output.mp3");
```
## Voice Options
### Using Pre-made Voices
Browse and select voices from the playground:
```python theme={null}
# Use a voice from the playground
audio = client.tts.convert(
    text="Welcome to Fish Audio!",
    reference_id="7f92f8afb8ec43bf81429cc1c9199cb1"
)
```
```javascript theme={null}
// Use a voice from the playground
const audio = await fishAudio.textToSpeech.convert({
  text: "Welcome to Fish Audio!",
  reference_id: "7f92f8afb8ec43bf81429cc1c9199cb1",
});
```
### Using Your Cloned Voice
Use voices you've created:
```python theme={null}
# Use your own cloned voice
audio = client.tts.convert(
    text="This is my custom voice speaking",
    reference_id="your_model_id"
)
```
```javascript theme={null}
// Use your own cloned voice
const audio = await fishAudio.textToSpeech.convert({
  text: "This is my custom voice speaking",
  reference_id: "your_model_id",
});
```
### Using Reference Audio
Provide reference audio directly:
```python theme={null}
from fishaudio.types import ReferenceAudio

# Use reference audio on-the-fly
with open("voice_sample.wav", "rb") as f:
    audio = client.tts.convert(
        text="Hello from reference audio",
        references=[
            ReferenceAudio(
                audio=f.read(),
                text="Sample text from the audio"
            )
        ]
    )
```
```javascript theme={null}
import { readFile } from "fs/promises";

// Use reference audio on-the-fly
const fileBuffer = await readFile("voice_sample.wav");
const voiceFile = new File([fileBuffer], "voice_sample.wav");

const audio = await fishAudio.textToSpeech.convert({
  text: "Hello from reference audio",
  references: [
    { audio: voiceFile, text: "Sample text from the audio" }
  ],
});
```
## Model Selection
Choose the right model for your needs:
| Model | Best For | Quality | Speed |
| ---------- | --------------- | --------- | ------- |
| **s1** | Prototyping | Excellent | Fast |
| **s2-pro** | Latest features | Excellent | Fastest |
Specify a model in your request:
```python theme={null}
# Using the latest model (default)
audio = client.tts.convert(text="Hello world")
```
```javascript theme={null}
// Using the latest S2-Pro model
const audio = await fishAudio.textToSpeech.convert(
  { text: "Hello world" },
  "s2-pro"
);
```
## Advanced Options
### Audio Formats
Choose your output format:
```python theme={null}
audio = client.tts.convert(
    text="Your text here",
    format="mp3",     # Options: "mp3", "wav", "pcm", "opus"
    mp3_bitrate=128   # For MP3: 64, 128, or 192
)
```
```javascript theme={null}
const audio = await fishAudio.textToSpeech.convert({
  text: "Your text here",
  format: "mp3",     // Options: "mp3", "wav", "pcm", "opus"
  mp3_bitrate: 128,  // For MP3: 64, 128, or 192
});
```
### Chunk Length
Control text processing chunks:
```python theme={null}
audio = client.tts.convert(
    text="Long text content...",
    chunk_length=200  # 100-300 characters per chunk
)
```
```javascript theme={null}
const audio = await fishAudio.textToSpeech.convert({
  text: "Long text content...",
  chunk_length: 200,  // 100-300 characters per chunk
});
```
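If you prefer to pre-split long documents yourself, you can chunk on sentence boundaries before sending requests. A minimal sketch (this helper is illustrative, not part of the SDK; sentences longer than the limit become oversized chunks of their own):

```python
import re

def split_text(text, chunk_length=200):
    """Split text into chunks of at most `chunk_length` characters,
    breaking at sentence boundaries where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk if appending would exceed the limit
        if current and len(current) + 1 + len(sentence) > chunk_length:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be passed to `client.tts.convert` as its own request.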
### Latency Mode
Optimize for speed or quality:
```python theme={null}
audio = client.tts.convert(
    text="Quick response needed",
    latency="balanced"  # "normal" or "balanced"
)
```
```javascript theme={null}
const audio = await fishAudio.textToSpeech.convert({
  text: "Quick response needed",
  latency: "balanced",  // "normal" or "balanced"
});
```
Balanced mode reduces latency to \~300ms but may slightly decrease stability.
## Direct API Usage
For direct API calls without the SDK:
```python theme={null}
import httpx
import ormsgpack

# Prepare request
request_data = {
    "text": "Hello, world!",
    "reference_id": "your_model_id",
    "format": "mp3"
}

# Make API call
with httpx.Client() as client:
    response = client.post(
        "https://api.fish.audio/v1/tts",
        content=ormsgpack.packb(request_data),
        headers={
            "authorization": "Bearer YOUR_API_KEY",
            "content-type": "application/msgpack",
            "model": "s2-pro"
        }
    )

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)
```
```javascript theme={null}
import { encode } from "@msgpack/msgpack";
import { writeFile } from "fs/promises";

const body = encode({
  text: "Hello, world!",
  reference_id: "your_model_id",
  format: "mp3",
});

const res = await fetch("https://api.fish.audio/v1/tts", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/msgpack",
    model: "s2-pro",
  },
  body,
});

const buffer = Buffer.from(await res.arrayBuffer());
await writeFile("output.mp3", buffer);
```
## Streaming Audio
Stream audio for real-time applications:
```python theme={null}
# Stream audio chunks
audio_stream = client.tts.stream(
    text="Streaming this text in real-time",
    reference_id="model_id"
)

with open("stream_output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)
        # Process chunk immediately for real-time playback
```
```javascript theme={null}
// Use a WebSocket to stream real-time audio
import { FishAudioClient, RealtimeEvents } from "fish-audio";
import { writeFile } from "fs/promises";
import path from "path";

// Simple async generator that yields text chunks
async function* makeTextStream() {
  const textChunks = [
    "Hello from Fish Audio! ",
    "This is a realtime text-to-speech test. ",
    "We are streaming multiple chunks over WebSocket.",
  ];
  for (const chunk of textChunks) {
    yield chunk;
  }
}

const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });

// For realtime, set text to "" and stream the content via makeTextStream
const request = { text: "" };
const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream());

// Collect audio and write to a file when the stream ends
const chunks = [];
connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened"));
connection.on(RealtimeEvents.AUDIO_CHUNK, (audio) => {
  if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) {
    chunks.push(Buffer.from(audio));
  }
});
connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err));
connection.on(RealtimeEvents.CLOSE, async () => {
  const outPath = path.resolve(process.cwd(), "out.mp3");
  await writeFile(outPath, Buffer.concat(chunks));
  console.log("Saved to", outPath);
});
```
## Adding Emotions
The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
Make your speech more expressive:
```python theme={null}
# Add emotion markers to your text
emotional_text = """
(excited) I just won the lottery!
(sad) But then I lost the ticket.
(laughing) Just kidding, I found it!
"""

audio = client.tts.convert(
    text=emotional_text,
    reference_id="model_id"
)
```
```javascript theme={null}
// Add emotion markers to your text
const emotionalText = `(excited) I just won the lottery!
(sad) But then I lost the ticket.
(laughing) Just kidding, I found it!`;

const audio = await fishAudio.textToSpeech.convert({
  text: emotionalText,
  reference_id: "model_id",
});
```
Available emotions:
* Basic: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)`
* Tones: `(shouting)`, `(whispering)`, `(soft tone)`
* Effects: `(laughing)`, `(sighing)`, `(crying)`
For more precise control over pronunciation and additional paralanguage features like pauses and breathing, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control).
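Because S1 emotion tags are plain `(tag)` prefixes in the text, they are easy to compose programmatically. A minimal sketch (the helper and the tag set shown here are illustrative, limited to the tags listed on this page rather than the full 64+ set, and not part of the SDK):

```python
# Subset of S1 emotion tags from the list above (the full model supports 64+)
S1_EMOTIONS = {
    "happy", "sad", "angry", "excited", "calm",
    "shouting", "whispering", "soft tone",
    "laughing", "sighing", "crying",
}

def tag_line(text, emotion):
    """Prefix a line with an S1 emotion tag, validating against known tags."""
    if emotion not in S1_EMOTIONS:
        raise ValueError(f"Unknown S1 emotion tag: {emotion}")
    return f"({emotion}) {text}"

# Build a multi-line emotional script
script = "\n".join([
    tag_line("I just won the lottery!", "excited"),
    tag_line("But then I lost the ticket.", "sad"),
])
```

The resulting `script` string can be passed directly as the `text` parameter of a TTS request.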
## Best Practices
### Text Preparation
**Do:**
* Use proper punctuation for natural pauses
* Add emotion markers for expression
* Break long texts into paragraphs
* Use consistent formatting
**Don't:**
* Use ALL CAPS (unless shouting)
* Mix multiple languages randomly
* Include special characters unnecessarily
* Forget punctuation
### Performance Tips
1. **Batch Processing:** Process multiple texts efficiently
2. **Cache Models:** Store frequently used model IDs
3. **Optimize Chunk Size:** Use 200 characters for best balance
4. **Handle Errors:** Implement retry logic for network issues
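The batch-processing tip above can be sketched with a thread pool. The synthesis step is passed in as a callable (e.g. a small wrapper around `client.tts.convert`) so the concurrency pattern is shown without assuming anything about SDK thread-safety or your rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_batch(texts, synthesize, max_workers=4):
    """Run a synthesis callable over many texts concurrently.

    `pool.map` preserves input order, so results[i] corresponds to texts[i].
    Keep max_workers modest to stay within API rate limits.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize, texts))

# Usage sketch (network call, shown as a comment):
# results = synthesize_batch(
#     texts, lambda t: client.tts.convert(text=t, reference_id="model_id")
# )
```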
### Quality Optimization
For best results:
* Use high-quality reference audio for cloning
* Choose appropriate emotion markers
* Test different latency modes
* Monitor API rate limits
## Troubleshooting
### Common Issues
**No audio output:**
* Check API key validity
* Verify model ID exists
* Ensure proper audio format
**Poor quality:**
* Use better reference audio
* Try normal latency mode
* Check text formatting
**Slow generation:**
* Use balanced latency mode
* Reduce chunk length
* Check network connection
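When debugging the issues above against the direct API, it helps to map HTTP status codes to a probable cause before digging deeper. The mapping below follows standard HTTP semantics and is an illustration, not an official Fish Audio error table:

```python
def likely_cause(status_code):
    """Map an HTTP status from a TTS request to a probable cause.

    Based on generic HTTP semantics; consult the API reference for
    the authoritative error responses.
    """
    causes = {
        401: "Invalid or expired API key",
        404: "Reference/model ID not found",
        422: "Malformed request body (check text and format fields)",
        429: "Rate limit exceeded - retry with backoff",
    }
    if 500 <= status_code < 600:
        return "Server error - retry later"
    return causes.get(status_code, "Unexpected status - check the API reference")
```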
## Code Examples
### Batch Processing
```python theme={null}
from fishaudio.utils import save

texts = [
    "First announcement",
    "Second announcement",
    "Third announcement"
]

for i, text in enumerate(texts):
    audio = client.tts.convert(
        text=text,
        reference_id="model_id"
    )
    save(audio, f"output_{i}.mp3")
```
```javascript theme={null}
const texts = [
  "First announcement",
  "Second announcement",
  "Third announcement",
];

for (let i = 0; i < texts.length; i++) {
  const audio = await fishAudio.textToSpeech.convert({
    text: texts[i],
    reference_id: "model_id",
  });
  const buffer = Buffer.from(await new Response(audio).arrayBuffer());
  await writeFile(`output_${i}.mp3`, buffer);
}
```
### Error Handling
```python theme={null}
import time
from fishaudio.exceptions import FishAudioError

def generate_with_retry(text, max_retries=3):
    for attempt in range(max_retries):
        try:
            audio = client.tts.convert(
                text=text,
                reference_id="model_id"
            )
            return audio
        except FishAudioError as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise e
```
```javascript theme={null}
async function generateWithRetry(text, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const audio = await fishAudio.textToSpeech.convert({
        text,
        reference_id: "model_id",
      });
      return Buffer.from(await new Response(audio).arrayBuffer());
    } catch (err) {
      if (attempt < maxRetries - 1) {
        const delayMs = 2 ** attempt * 1000; // Exponential backoff
        await new Promise((r) => setTimeout(r, delayMs));
      } else {
        throw err;
      }
    }
  }
}

const buffer = await generateWithRetry("Hello with retry");
await writeFile("retry_output.mp3", buffer);
```
## API Reference
### Request Parameters
| Parameter | Type | Description | Default |
| ----------------- | ------- | -------------------- | -------- |
| **text** | string | Text to convert | Required |
| **reference\_id** | string | Model/voice ID | None |
| **format** | string | Audio format | "mp3" |
| **chunk\_length** | integer | Characters per chunk | 200 |
| **normalize** | boolean | Normalize text | true |
| **latency** | string | Speed vs quality | "normal" |
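A request can be sanity-checked against this table before it is sent, which catches typos locally instead of burning an API call. A minimal client-side validator (the allowed values mirror this page and are not an exhaustive server-side schema):

```python
def validate_tts_request(req):
    """Validate a TTS request dict against the documented parameters."""
    if not req.get("text"):
        raise ValueError("text is required")
    if req.get("format", "mp3") not in {"mp3", "wav", "pcm", "opus"}:
        raise ValueError("format must be one of: mp3, wav, pcm, opus")
    chunk = req.get("chunk_length", 200)
    if not 100 <= chunk <= 300:
        raise ValueError("chunk_length must be between 100 and 300")
    if req.get("latency", "normal") not in {"normal", "balanced"}:
        raise ValueError("latency must be 'normal' or 'balanced'")
    return req
```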
### Response
Returns audio data in the specified format as a binary stream.
## Get Support
Need help with text-to-speech?
* [API Reference](/api-reference/introduction)
* **Discord Community:** [Join our Discord](https://discord.gg/fish-audio)
* **Email Support:** [support@fish.audio](mailto:support@fish.audio)
# Changelog
Source: https://docs.fish.audio/developer-guide/getting-started/changelog
Complete release history and version updates for all Fish Audio products
## Fish Audio S2
Next-generation text-to-speech model with inline emotion cues, multi-speaker dialogue support, and 80+ languages.
S2 introduces `[bracket]` syntax for natural language control over emotion and paralinguistic cues (e.g., `[whisper]`, `[laugh]`, `[emphasis]`). Tags are treated as standard text rather than dedicated control tokens, so you are not limited to a fixed set of expressions. Built on the Qwen3-4B backbone and fully open-source.
Use model ID `s2-pro` in the API. S1 remains supported for existing integrations.
[GitHub](https://github.com/fishaudio/fish-speech) | [HuggingFace](https://huggingface.co/fishaudio)
## Fish Audio S1
Historic rebrand from Fish Speech to Fish Audio. #1 ranking on TTS-Arena2 with industry-leading performance.
S1 (4B params): 0.008 WER, 0.004 CER - Available on Fish Audio Playground
S1-mini (0.5B params): 0.011 WER, 0.005 CER - Open source on Hugging Face
64+ emotional expressions with RLHF integration and multilingual support for English, Chinese, Japanese, and more.
[Read More about S1](https://fish.audio/blog/introducing-s1/)
## v1.5.1
Fixed critical PyTorch security settings and improved inference speed significantly. Added ONNX export support for better deployment options and enhanced text processing for Arabic and Hebrew languages. Includes bug fixes for Apple Silicon (MPS) compatibility and reorganized library structure for cleaner codebase.
## v1.5.0
Introduced v1.5 model architecture with improved dataset handling and bearer token authentication for APIs.
Added reference audio caching by hash for faster performance and better Apple Silicon support. Includes OpenAPI documentation refactoring and base64 reference data support in JSON format.
## v1.4.3
Introduced Fish Agent for conversational AI with streaming capabilities and real-time interactions.
Added comprehensive Korean language documentation and fixed critical non-English speech issues. Improved WebUI streaming functionality and PyTorch version compatibility.
## v1.4.2
Documentation-focused release with comprehensive updates for v1.4, macOS support, and multiple language translations.
Improved Docker support and API enhancements for JSON format handling. Added audio selection to WebUI and fixed various stability issues including cache handling and backend performance.
## v1.4.1
Infrastructure improvements focused on Docker optimization and multi-platform builds.
Updated the PyTorch version and replaced the sox audio backend for better performance. Enhanced the CI/CD pipeline with buildx support and fixed various Docker-related issues.
## v1.4.0
Major release with new VQGAN architecture for improved audio quality and faster inference.
Updated WebUI with enhanced interface and better language switching. Added Japanese documentation translation and fixed inference warmup issues for better performance.
## v1.2.1
Replaced Whisper with SenseVoice for better ASR and added native Apple Silicon support.
Includes Portuguese (Brazil) localization, streaming audio functionality, and CPU-only inference improvements. Pinned PyTorch to 2.3.1 to fix inference speed issues and aligned API with official closed-source version.
## v1.2
Introduced auto-reranking system for better results along with bilingual support and model quantization.
Replaced standard Whisper with Faster Whisper for improved speed and added Japanese documentation. Enhanced model stability and inference performance with optimized v1.2 architecture.
## v1.1.2
Minor release adding Chinese text normalization support and a streaming audio download button in the WebUI.
Fixed LoRA merging issues and improved Firefly performance.
## v1.1.1
Breaking changes: Replaced zibai with uvicorn for API server, new text-splitter with byte-based length calculation, and license change to CC-BY-NC-SA 4.0.
Added Apple Silicon (MPS) support, Windows one-click installation, and automatic model downloading with resume capability. Improved WebUI with better file selection and download progress indicators.
## v1.1.0
Added VITS decoder integration with full streaming support and queue management for real-time audio generation.
Introduced internationalization (i18n) with Spanish translation and improved Windows packaging. Optimized GPU memory usage and CPU-only inference performance while adding LoRA support to the Gradio UI.
## v1.0.0
Major milestone release introducing new VQ-GAN architecture with VITS decoder support, LoRA fine-tuning, and streaming inference capabilities.
Breaking changes include removal of the Rust-based data server, new tokenizer replacing phonemizer, and updated model architecture (VQ + DiT + Reflow). Achieved 4x memory reduction during loading and added WebUI for training and annotation.
## v0.2.0
First public release of Fish Speech featuring a complete text-to-speech pipeline with VQ-GAN audio codec and LLAMA-based language model.
Includes multi-language support (Chinese, English, Japanese), Gradio WebUI for inference, HTTP API server, and Docker support. Added special optimizations for Chinese users including mirror downloads and localized documentation.
# Overview of Fish Audio
Source: https://docs.fish.audio/developer-guide/getting-started/introduction
Discover Fish Audio's powerful voice generation platform and what you can build
## What is Fish Audio?
Fish Audio is a cutting-edge AI platform for voice generation, voice cloning, and audio storytelling.
Our technology brings dynamic, natural-sounding voices to your applications, enabling immersive experiences across industries.
Introducing our latest generation voice models:
**Fish Audio S2-Pro:** Our latest model delivers unparalleled naturalness and emotion, setting a new standard for AI-generated speech. [Learn more about our models →](/developer-guide/models-pricing/models-overview)
## Core Capabilities
Generate natural, expressive speech from text in multiple languages and styles
Create custom voice models from as little as 15 seconds of audio
Build multi-character narratives with emotion and dynamic voice switching
## Try It Now
Test our voices in the interactive playground - no code required
Browse available voice models and their capabilities
## Ready to Start?
Get your API key and make your first API call in minutes.
Generate your first AI voice in under 5 minutes
## Platform Capabilities
Fish Audio empowers developers to create innovative voice experiences across diverse industries. Whether you're building consumer apps, enterprise solutions, or creative tools, our platform provides the flexibility and power you need.
### What You Can Build
Automate podcast production, YouTube narration, and audiobook generation
Create dynamic NPC dialogue and real-time character voices
Build interactive language learning tools and accessible educational content
Deploy natural-sounding IVR systems and support agents
Develop screen readers and voice restoration tools
Generate ASMR content, music vocals, interactive stories, and adult content
### Key Features
Stream audio in real-time for live applications
Industry-leading naturalness and clarity
Generate speech in 30+ languages
Fine-tune prosody, emotion, and speaking style
RESTful API with SDKs for Python, Node.js, and more
Handle everything from prototypes to production workloads
## Learn More
* [Models & Pricing](/developer-guide/models-pricing/models-overview) - Explore voice models and pricing options
* [Core Features](/developer-guide/core-features/text-to-speech) - Deep dive into TTS and voice cloning
* [SDKs & Tools](/developer-guide/sdk-guide/python/installation) - Install language-specific libraries
* [Best Practices](/developer-guide/best-practices/voice-cloning) - Production-ready tips and optimization for voice cloning, emotion and expression control, and real-time voice streaming
# Quick Start
Source: https://docs.fish.audio/developer-guide/getting-started/quickstart
Generate your first AI voice with Fish Audio in under 5 minutes
## Overview
This guide will walk you through generating your first text-to-speech audio with Fish Audio. By the end, you'll have converted text into natural-sounding speech using our API.
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account and complete the verification steps
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Your First TTS Request
Choose your preferred method to generate speech:
Store your API key as an environment variable (recommended approach):
```bash theme={null}
export FISH_API_KEY="replace_me"
```
Run this [cURL](https://curl.se/) command to generate your first speech:
```bash theme={null}
curl -X POST https://api.fish.audio/v1/tts \
  -H "Authorization: Bearer $FISH_API_KEY" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "Hello! Welcome to Fish Audio. This is my first AI-generated voice.",
    "format": "mp3"
  }' \
  --output welcome.mp3
```
The audio has been saved as `welcome.mp3`. You can play it by:
* Double-clicking the file or opening it in any media player
* Or using the command line:
```bash theme={null}
# On macOS
afplay welcome.mp3
# On Linux
mpg123 welcome.mp3
# On Windows
start welcome.mp3
```
```bash theme={null}
pip install fish-audio-sdk
```
Create a Python script:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import save
# Initialize with your API key
client = FishAudio(api_key="your_api_key_here")
# Generate speech
audio = client.tts.convert(text="Hello! Welcome to Fish Audio.")
save(audio, "welcome.mp3")
print("✓ Audio saved to welcome.mp3")
```
```bash theme={null}
python generate_speech.py
```
The audio has been saved as `welcome.mp3`. You can play it by:
* Double-clicking the file or opening it in any media player
* Or using the command line:
```bash theme={null}
# On macOS
afplay welcome.mp3
# On Linux
mpg123 welcome.mp3
# On Windows
start welcome.mp3
```
```bash theme={null}
npm install fish-audio
```
Create a JavaScript script:
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { writeFile } from "fs/promises";

const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });

const audio = await fishAudio.textToSpeech.convert({
  text: "Hello, world!",
});

const buffer = Buffer.from(await new Response(audio).arrayBuffer());
await writeFile("welcome.mp3", buffer);
console.log("✓ Audio saved to welcome.mp3");
```
```bash theme={null}
node generate_speech.mjs
```
The audio has been saved as `welcome.mp3`. You can play it by:
* Double-clicking the file or opening it in any media player
* Or using the command line:
```bash theme={null}
# On macOS
afplay welcome.mp3
# On Linux
mpg123 welcome.mp3
# On Windows
start welcome.mp3
```
## Customizing Your Voice
The examples above use the default voice. To use a different voice, add the `reference_id` parameter with a model ID from [fish.audio](https://fish.audio). You can find the model ID in the URL or use the copy button when viewing any voice.
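Since voice pages follow the pattern `https://fish.audio/m/<model_id>`, extracting the ID can be automated. A small hypothetical helper (not part of the SDK):

```python
from urllib.parse import urlparse

def model_id_from_url(url):
    """Extract the reference_id from a fish.audio voice URL
    of the form https://fish.audio/m/<id>."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "m":
        return parts[1]
    raise ValueError(f"Not a fish.audio voice URL: {url}")
```

The returned ID is what you pass as `reference_id` in a TTS request.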
Choose a voice to try:
From: [https://fish.audio/m/8ef4a238714b45718ce04243307c57a7](https://fish.audio/m/8ef4a238714b45718ce04243307c57a7)
```bash theme={null}
export REFERENCE_ID="8ef4a238714b45718ce04243307c57a7"
```
From: [https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89](https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89)
```bash theme={null}
export REFERENCE_ID="802e3bc2b27e49c2995d23ef70e6ac89"
```
Then generate speech with your chosen voice:
```bash theme={null}
curl -X POST https://api.fish.audio/v1/tts \
  -H "Authorization: Bearer $FISH_API_KEY" \
  -H "Content-Type: application/json" \
  -H "model: s2-pro" \
  -d '{
    "text": "This is a custom voice from Fish Audio! You can explore hundreds of different voices on the platform, or even create your own.",
    "reference_id": "'"$REFERENCE_ID"'",
    "format": "mp3"
  }' \
  --output custom_voice.mp3
```
```python theme={null}
import os
from fishaudio import FishAudio
from fishaudio.utils import save

client = FishAudio(api_key="your_api_key_here")

# Generate speech with custom voice
audio = client.tts.convert(
    text="This is a custom voice from Fish Audio! You can explore hundreds of different voices on the platform, or even create your own.",
    reference_id=os.environ.get("REFERENCE_ID")
)

save(audio, "custom_voice.mp3")
```
```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { writeFile } from "fs/promises";

const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });

const audio = await fishAudio.textToSpeech.convert({
  text: "This is a custom voice from Fish Audio! You can explore hundreds of different voices on the platform, or even create your own.",
  reference_id: process.env.REFERENCE_ID,
});

const buffer = Buffer.from(await new Response(audio).arrayBuffer());
await writeFile("custom_voice.mp3", buffer);
console.log("✓ Audio saved to custom_voice.mp3");
```
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# LiveKit
Source: https://docs.fish.audio/developer-guide/integrations/livekit
Build real-time voice AI agents with Fish Audio and LiveKit
[LiveKit Agents](https://github.com/livekit/agents) is an open source framework for building real-time voice and multimodal AI agents. It handles streaming audio pipelines, turn detection, interruptions, and LLM orchestration so you can focus on your agent's behavior.
Fish Audio integrates with LiveKit through the `fishaudio` plugin, providing text-to-speech synthesis with support for both chunked and real-time WebSocket streaming modes.
## Prerequisites
* A [Fish Audio account](https://fish.audio) with an API key
* Python 3.9 or higher
## Installation
Install LiveKit Agents with Fish Audio support:
```bash theme={null}
pip install "livekit-agents[fishaudio]"
```
## Configuration
Set your Fish Audio API key as an environment variable:
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
## Basic usage
Add Fish Audio TTS to your LiveKit agent:
```python theme={null}
from livekit.plugins.fishaudio import TTS

tts = TTS(
    reference_id="your_voice_model_id",  # Optional: use a specific voice
    model="s1",
    sample_rate=24000,
    latency_mode="balanced"
)
```
### Key parameters
| Parameter | Description |
| --------------- | ------------------------------------------------------------------------- |
| `api_key` | Your Fish Audio API key (or use `FISH_API_KEY` env var) |
| `model` | TTS model/backend to use (default: `s1`) |
| `reference_id`  | Voice model ID from the [Fish Audio library](https://fish.audio/discovery) |
| `output_format` | Audio format: `pcm`, `mp3`, `wav`, or `opus` (default: `pcm`) |
| `sample_rate` | Audio sample rate in Hz (default: `24000`) |
| `num_channels` | Number of audio channels (default: `1`) |
| `base_url` | Custom API endpoint (default: `https://api.fish.audio`) |
| `latency_mode` | `normal` (\~500ms) or `balanced` (\~300ms, default) |
### Streaming modes
The plugin supports two synthesis modes:
```python theme={null}
# Chunked (non-streaming) synthesis
stream = tts.synthesize("Hello, world!")
# Real-time WebSocket streaming
stream = tts.stream()
```
## Resources
* [LiveKit Agents Documentation](https://docs.livekit.io/agents/)
* [LiveKit GitHub](https://github.com/livekit/agents)
* [Fish Audio Plugin Reference](https://docs.livekit.io/reference/python/v1/livekit/plugins/fishaudio/index.html)
* [Fish Audio Voice Library](https://fish.audio/discovery)
# n8n
Source: https://docs.fish.audio/developer-guide/integrations/n8n
Automate workflows with Fish Audio and n8n
[n8n](https://n8n.io/) is a fair-code licensed workflow automation platform. The Fish Audio community node brings text-to-speech, speech-to-text, and voice cloning capabilities to your n8n workflows.
## Installation
Install from n8n community nodes:
1. Go to **Settings** > **Community Nodes**
2. Select **Install**
3. Enter `n8n-nodes-fishaudio`
4. Accept the risks and install
See the [n8n community nodes guide](https://docs.n8n.io/integrations/community-nodes/installation/) for details.
## Configuration
1. Go to **Credentials** > **Add Credential**
2. Search for "Fish Audio API"
3. Enter your API key from [fish.audio/app/api-keys](https://fish.audio/app/api-keys)
## Features
The node supports:
* **Text-to-Speech** — Generate audio from text using any voice model
* **Speech-to-Text** — Transcribe audio files
* **Voice Models** — List, create, and manage custom voices
* **Account** — Check credit balance
The node is also available as an AI tool for use with n8n's AI Agent nodes.
## Resources
* [npm package](https://www.npmjs.com/package/n8n-nodes-fishaudio)
* [GitHub](https://github.com/fishaudio/fish-audio-n8n)
* [n8n Community Nodes](https://docs.n8n.io/integrations/community-nodes/)
# Pipecat
Source: https://docs.fish.audio/developer-guide/integrations/pipecat
Build voice AI agents with Fish Audio and Pipecat
[Pipecat](https://github.com/pipecat-ai/pipecat) is an open source framework for building voice and multimodal conversational AI. It handles the orchestration of audio, AI services, and conversation pipelines so you can focus on what makes your agent unique.
Fish Audio integrates with Pipecat through `FishAudioTTSService`, which provides real-time text-to-speech synthesis using WebSocket streaming for low-latency conversational applications.
## Prerequisites
* A [Fish Audio account](https://fish.audio) with an API key
* Python 3.9 or higher
## Installation
Install Pipecat with Fish Audio support:
```bash theme={null}
pip install "pipecat-ai[fish]"
```
## Configuration
Set your Fish Audio API key as an environment variable:
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
## Basic usage
Add `FishAudioTTSService` to your Pipecat pipeline:
```python theme={null}
import os

from pipecat.services.fish import FishAudioTTSService

tts = FishAudioTTSService(
    api_key=os.getenv("FISH_API_KEY"),
    reference_id="your_voice_model_id",  # Optional: use a specific voice
    model_id="s1",
    params=FishAudioTTSService.InputParams(
        latency="normal",
        prosody_speed=1.0
    )
)
```
### Key parameters
| Parameter | Description |
| --------------- | ------------------------------------------------------------------------- |
| `api_key` | Your Fish Audio API key |
| `reference_id`  | Voice model ID from the [Fish Audio library](https://fish.audio/discovery) |
| `model_id` | TTS model version (default: `s1`) |
| `output_format` | Audio format: `pcm`, `mp3`, `wav`, or `opus` |
### Prosody controls
Customize speech characteristics with `InputParams`:
```python theme={null}
params=FishAudioTTSService.InputParams(
    latency="balanced",   # "normal" or "balanced"
    prosody_speed=1.2,    # 0.5 to 2.0
    prosody_volume=0,     # Volume adjustment in dB
    normalize=True        # Audio normalization
)
```
## Resources
* [Pipecat Documentation](https://docs.pipecat.ai/server/services/tts/fish)
* [Pipecat GitHub](https://github.com/pipecat-ai/pipecat)
* [Fish Audio Voice Library](https://fish.audio/discovery)
# Choosing a Model
Source: https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model
Select the right Fish Audio model for your use case and requirements
We recommend **Fish Audio S2-Pro** for all projects. It is our flagship model, with industry-leading quality and performance.
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# Model Deprecations
Source: https://docs.fish.audio/developer-guide/models-pricing/deprecations
Track deprecated models and migration timelines for Fish Audio services
## Available Models
Currently available models:
* **Fish Audio S2** (Recommended) - Latest generation with best performance
* **Fish Audio S1** - Highly expressive and natural sounding
## Deprecated Models
* **speech-1.6** - Fish Speech v1.6 is deprecated as of February 28, 2026
* **speech-1.5** - Fish Speech v1.5 is deprecated as of February 28, 2026
We strongly recommend using **Fish Audio S2** for all new projects to access the latest capabilities and performance improvements.
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# Models Overview
Source: https://docs.fish.audio/developer-guide/models-pricing/models-overview
Explore Fish Audio's voice generation models and their capabilities
## Available Models
Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements.
### Recommended Model
**Fish Audio S2-Pro** - Our next-generation TTS model with best-in-class performance
* Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`)
* Multi-speaker dialogue support **(S2-Pro exclusive)**
* 80+ languages
* 100ms time-to-first-audio
* Full SGLang-based serving stack
* Open-source
We recommend using `s2-pro` for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations.
### Previous Model
**Fish Audio S1** - High-quality voice generation
* 4 billion parameters
* 0.008 WER (0.8% word error rate)
* Full emotional control capabilities with `(parenthesis)` syntax
## Model Specifications
### Fish Audio S1 Performance Metrics
* **Word Error Rate (WER)**: 0.008 (0.8%)
* **Character Error Rate (CER)**: 0.004 (0.4%)
* **Real-time Factor**: \~1:7 on standard hardware
* **TTS-Arena2 Ranking**: #1 worldwide
## Supported Languages
### S2-Pro
S2-Pro supports 80+ languages with automatic language detection and inline emotion and paralinguistic cue support.
Language detection is automatic: simply provide text in your target language.
### S1
S1 supports text-to-speech generation in 13 languages with full emotional expression capabilities.
```
English, Chinese, Japanese, German,
French, Spanish, Korean, Arabic,
Russian, Dutch, Italian, Polish, Portuguese
```
## Voice Styles and Emotions
Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.
### S2-Pro Natural Language Control
S2-Pro treats `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`.
Common examples include:
```
[whisper] [laugh] [emphasis] [sigh] [gasp] [pause]
[angry] [excited] [sad] [surprised] [inhale] [exhale]
```
S2-Pro cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"`
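Because S2-Pro treats bracket cues as ordinary text, you can inspect a script for them before sending it. A minimal sketch in plain Python (no Fish Audio dependency; the helper name is ours):

```python theme={null}
import re

def extract_cues(text: str) -> list[str]:
    """Return all [bracket] cues in a TTS script, in order of appearance."""
    return re.findall(r"\[([^\[\]]+)\]", text)

script = "I can't believe it [gasp] you actually did it [laugh]"
print(extract_cues(script))  # ['gasp', 'laugh']
```

This is useful for logging which cues a generated script contains, or for stripping cues when estimating the spoken word count.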
### S1 Voice Styles and Emotions
S1 supports 64+ emotional expressions using `(parenthesis)` syntax.
### Basic Emotions (24 expressions)
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
### Advanced Emotions (25 expressions)
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
### Tone Markers (5 expressions)
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
### Audio Effects (10 expressions)
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing)
(panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
```
You can also use natural expressions like "Ha,ha,ha" for laughter. Experiment with combinations to achieve the perfect emotional tone for your application.
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# Pricing & Rate Limits
Source: https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits
Understand Fish Audio pricing plans, usage costs, and API rate limits
## API Pricing
The Fish Audio API uses pay-as-you-go pricing based on actual usage. There are no subscription fees or monthly minimums for API access.
### Text-to-Speech (TTS) Models
TTS pricing is based on the size of input text, measured in millions of UTF-8 bytes.
| Model Name | Price (USD) |
| ---------- | ----------------------- |
| `s2-pro` | \$15.00 / M UTF-8 bytes |
| `s1` | \$15.00 / M UTF-8 bytes |
1M UTF-8 bytes is approximately 180,000 English words, or about 12 hours of speech.
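Because billing counts UTF-8 bytes rather than characters, multi-byte scripts (CJK, Cyrillic, etc.) cost more per character than ASCII text. A rough cost estimator, using the \$15.00 per million bytes rate from the table above (the helper name is ours):

```python theme={null}
def estimate_tts_cost(text: str, usd_per_million_bytes: float = 15.00) -> float:
    """Estimate TTS cost in USD from the UTF-8 byte length of the input text."""
    n_bytes = len(text.encode("utf-8"))
    return n_bytes / 1_000_000 * usd_per_million_bytes

print(len("hello".encode("utf-8")))       # 5 bytes (1 byte per ASCII character)
print(len("こんにちは".encode("utf-8")))  # 15 bytes (3 bytes per character)
```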
### Automatic Speech Recognition (ASR) Models
| Model Name | Price (USD) |
| -------------- | ------------------- |
| `transcribe-1` | \$0.36 / audio hour |
**How ASR billing works:**
* Charges are based on the duration of audio processed
* Duration is rounded up to the nearest second
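The two billing rules above translate directly into a cost estimate. A sketch (the helper name is ours; the rate comes from the table above):

```python theme={null}
import math

def estimate_asr_cost(duration_seconds: float, usd_per_hour: float = 0.36) -> float:
    """Estimate ASR cost: round duration up to the nearest second, bill per audio hour."""
    billable_seconds = math.ceil(duration_seconds)
    return billable_seconds / 3600 * usd_per_hour

print(estimate_asr_cost(61.2))   # billed as 62 seconds
print(estimate_asr_cost(3600))   # one audio hour at the full hourly rate
```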
## Rate Limits
These limits help us ensure fair usage and maintain service quality for all users.
### Concurrent Request Limits
| Tier | Spending Threshold | Concurrent Requests |
| ----------- | ------------------ | ------------------- |
| Starter | \< \$100 paid | 5 requests |
| Elevated | ≥ \$100 paid | 15 requests |
| High Volume | ≥ \$1,000 paid | 50 requests |
| Enterprise | Custom | Custom limits |
Concurrency tiers unlock as soon as your total prepaid amount reaches the threshold. You do not need to spend the full balance first. If your workload needs a higher concurrency tier, you can top up in advance to unlock the next tier immediately.
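The tier table reduces to a simple threshold lookup, which is handy when sizing client-side concurrency (a sketch; the Enterprise tier is negotiated individually and not modeled here):

```python theme={null}
def concurrency_limit(total_paid_usd: float) -> int:
    """Map total prepaid spend to the concurrent-request limit for that tier."""
    if total_paid_usd >= 1_000:
        return 50  # High Volume
    if total_paid_usd >= 100:
        return 15  # Elevated
    return 5       # Starter

print(concurrency_limit(250))  # 15
```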
Please reach out to our team for enterprise volume pricing, custom rate limits, and billing arrangements.
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# Story Studio
Source: https://docs.fish.audio/developer-guide/products/story-studio
Build immersive audio stories and narratives
Coming soon! We're preparing comprehensive documentation for Story Studio.
In the meantime, you can:
* Visit the [Fish Audio Playground](https://fish.audio) to explore our storytelling features
* Check back soon for detailed guides and tutorials
Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates.
# Text to Speech
Source: https://docs.fish.audio/developer-guide/products/tts
Convert text into natural-sounding speech with Fish Audio's AI voices
Coming soon! We're preparing comprehensive documentation for our Text-to-Speech web interface.
In the meantime, you can:
* Visit the [Fish Audio Playground](https://fish.audio) to try our TTS features
* Check our [API documentation](/api-reference/endpoint/openapi-v1/text-to-speech) for programmatic access
* Read our [TTS Guide and Best Practices](/developer-guide/core-features/text-to-speech)
Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates.
# Voice Cloning
Source: https://docs.fish.audio/developer-guide/products/voice-cloning
Create custom voice models from audio samples
Coming soon! We're preparing comprehensive documentation for our Voice Cloning web interface.
In the meantime, you can:
* Visit the [Fish Audio Playground](https://fish.audio) to try voice cloning
* View our [Python SDK voice cloning guide](/developer-guide/sdk-guide/python/voice-cloning)
* Read our [voice cloning best practices](/developer-guide/best-practices/voice-cloning)
Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates.
# Agent Quickstart
Source: https://docs.fish.audio/developer-guide/resources/agent-quickstart
Low-noise entry points and canonical URLs for AI agents using Fish Audio documentation
## Purpose
This page is the recommended starting point for AI agents, RAG pipelines, and documentation crawlers that need accurate Fish Audio references with minimal markup noise.
## Built-In Agent Indexes
This documentation site already provides built-in LLM-friendly indexes:
* [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index
* [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context
In most cases, agents should read `llms.txt` first and only fetch `llms-full.txt` when they need wider context across the whole documentation set.
## Install the Agent Skill
For coding agents that support [Agent Skills](https://github.com/vercel-labs/skills) (Claude Code, Cursor, Windsurf, Codex, and others), install the ready-made raw-API skill with a single command:
```bash theme={null}
npx skills add https://docs.fish.audio --skill fish-audio-api
```
The skill teaches the agent how to call the Fish Audio REST and WebSocket APIs directly from `curl`, Python, Node.js, or any HTTP client — no SDK required. It covers authentication, every endpoint in our [OpenAPI schema](https://docs.fish.audio/api-reference/openapi.json), MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol.
Discovery endpoint: [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json). Run `npx skills add https://docs.fish.audio` (without `--skill`) to install every skill published here, including the auto-generated product overview skill.
## Retrieval Order
1. Read [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index.
2. Read [llms-full.txt](https://docs.fish.audio/llms-full.txt) when broad site context is needed.
3. Read [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) for REST schemas, parameters, and examples.
4. Read [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) for the WebSocket streaming protocol.
5. Fetch individual `.md` pages only after narrowing to a specific task.
## Canonical API Facts
* Base API URL: `https://api.fish.audio`
* Authentication: `Authorization: Bearer <api-key>` header
* TTS model selection: send a required `model` header. Recommended default: `s2-pro`
* Main REST endpoints:
* `POST /v1/tts`
* `POST /v1/asr`
* `GET /model`
* `POST /model`
* `GET /model/{id}`
* `PATCH /model/{id}`
* `DELETE /model/{id}`
* Real-time streaming endpoint: `wss://api.fish.audio/v1/tts/live`
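These facts are enough to assemble a raw TTS request without an SDK. A sketch that only builds the request pieces (the helper name is ours; consult the OpenAPI schema for the full request body):

```python theme={null}
def build_tts_request(api_key: str, text: str, model: str = "s2-pro"):
    """Assemble the URL, headers, and a minimal JSON body for POST /v1/tts."""
    url = "https://api.fish.audio/v1/tts"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "model": model,  # required model-selection header
        "Content-Type": "application/json",
    }
    body = {"text": text}
    return url, headers, body

url, headers, body = build_tts_request("your_api_key", "Hello from Fish Audio")
```

Send the result with any HTTP client (`requests`, `curl`, `fetch`); the response is the audio payload.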
## High-Value URLs
### Start Here
* [Agent Quickstart](https://docs.fish.audio/developer-guide/resources/agent-quickstart.md)
* [Quick Start](https://docs.fish.audio/developer-guide/getting-started/quickstart.md)
* [AI Coding Agents](https://docs.fish.audio/developer-guide/resources/coding-agents.md)
### API Specs
* [OpenAPI](https://docs.fish.audio/api-reference/openapi.json)
* [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml)
* [API Introduction](https://docs.fish.audio/api-reference/introduction.md)
### Authentication And SDK Setup
* [Python Authentication](https://docs.fish.audio/developer-guide/sdk-guide/python/authentication.md)
* [JavaScript Authentication](https://docs.fish.audio/developer-guide/sdk-guide/javascript/authentication.md)
* [Python SDK Overview](https://docs.fish.audio/developer-guide/sdk-guide/python/overview.md)
* [JavaScript Installation](https://docs.fish.audio/developer-guide/sdk-guide/javascript/installation.md)
### Core Product Tasks
* [Text to Speech Guide](https://docs.fish.audio/developer-guide/core-features/text-to-speech.md)
* [Speech to Text Guide](https://docs.fish.audio/developer-guide/core-features/speech-to-text.md)
* [Creating Voice Models](https://docs.fish.audio/developer-guide/core-features/creating-models.md)
* [Emotion Control](https://docs.fish.audio/developer-guide/core-features/emotions.md)
* [Fine-grained Control](https://docs.fish.audio/developer-guide/core-features/fine-grained-control.md)
### Real-Time And Integrations
* [WebSocket TTS Streaming](https://docs.fish.audio/api-reference/endpoint/websocket/tts-live.md)
* [Real-time Voice Streaming Best Practices](https://docs.fish.audio/developer-guide/best-practices/real-time-streaming.md)
* [Python WebSocket Streaming](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket.md)
* [JavaScript WebSocket](https://docs.fish.audio/developer-guide/sdk-guide/javascript/websocket.md)
* [LiveKit Integration](https://docs.fish.audio/developer-guide/integrations/livekit.md)
* [Pipecat Integration](https://docs.fish.audio/developer-guide/integrations/pipecat.md)
### Models, Pricing, And Lifecycle
* [Models Overview](https://docs.fish.audio/developer-guide/models-pricing/models-overview.md)
* [Choosing a Model](https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model.md)
* [Pricing And Rate Limits](https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits.md)
* [Model Deprecations](https://docs.fish.audio/developer-guide/models-pricing/deprecations.md)
## Task Routing
* If the task is "generate speech", start with Quick Start, the Text to Speech guide, and `POST /v1/tts`.
* If the task is "transcribe audio", start with the Speech to Text guide and `POST /v1/asr`.
* If the task is "clone or manage voices", start with Creating Voice Models and the `/model` endpoints.
* If the task is "stream audio in real time", start with AsyncAPI, WebSocket TTS Streaming, and the WebSocket SDK guides.
* If the task is "pick the right model or estimate cost", start with Models Overview and Pricing And Rate Limits.
## Notes For Agents
* Prefer `openapi.json` and `asyncapi.yml` for machine-readable schemas.
* Prefer `.md` URLs when you need a single human-authored page in Markdown form.
* Some richer pages use interactive MDX widgets. If a fetched page contains UI or component noise, fall back to this page, `llms.txt`, `llms-full.txt`, or the API spec files first.
* Treat this page as the canonical low-noise entry point for Fish Audio documentation retrieval.
# Brand Guidelines
Source: https://docs.fish.audio/developer-guide/resources/brand
Design guidelines for using Fish Audio brand assets
## Logo
### Wordmark
Our preferred logo format combines the [Fish Audio Icon](#icon) with the wordmark side by side.
This is the primary version of our logo and should be used whenever possible for maximum brand recognition and clarity.
### Icon
Our icon features a whale composed of audio bars and sound waves, symbolizing the fusion of marine life with audio technology. This design represents our brand's commitment to natural, flowing, and powerful voice generation.
The Fish Audio icon should only be used when space constraints or context make it impractical to display the full wordmark. Always prefer the wordmark with icon combination when possible.
### Avoid
To maintain the integrity of our brand identity, please do not alter our logo in any of the following ways:
## Colors
Our official brand colors consist of black and white for primary logo applications, complemented by secondary grays for subtle variations and an accent purple for visual highlights in marketing materials.
## Typography
Our brand uses **Onest Semibold** in the logo wordmark. This documentation is also set in Onest, so you're experiencing our brand typography right now.
[Download Onest on Google Fonts](https://fonts.google.com/specimen/Onest)
## Usage Guidelines
The Fish Audio name and logos are trademarks of Hanabi AI Inc. You may freely use and redistribute our brand assets when referencing Fish Audio. By using our brand assets, you agree that we own them and that any goodwill generated by your use benefits Fish Audio.
### Do
* Use our brand assets freely in your projects, applications, and content
* Share our brand assets in blog posts, tutorials, documentation, and educational materials
* Follow the visual guidelines shown above (spacing, colors, sizing)
* Link to fish.audio when using our brand online
### Don't
* Use our logo as part of your own product name or branding
* Imply partnership, sponsorship, or endorsement without permission
* Feature our logo more prominently than your own brand
### Questions?
If you're unsure whether your use case is appropriate or need special permission, please contact us at [support@fish.audio](mailto:support@fish.audio).
## Download Assets
# AI Coding Agents
Source: https://docs.fish.audio/developer-guide/resources/coding-agents
Connect AI coding assistants to Fish Audio documentation via MCP for real-time API guidance
## Overview
Integrate Fish Audio's comprehensive documentation directly into your AI coding assistants. Using MCP (Model Context Protocol), coding agents like Claude Code, Cursor, and Windsurf can access our latest API references, guides, and examples in real-time.
The Fish Audio MCP server provides instant access to:
* Complete API documentation
* SDK usage examples
* Best practices and implementation patterns
* Troubleshooting guides
Connect once and get accurate, up-to-date Fish Audio knowledge in your coding environment.
This documentation site also exposes built-in LLM-friendly indexes:
* [llms.txt](https://docs.fish.audio/llms.txt) for the curated page index
* [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context
If your coding agent supports direct document fetching, start with `llms.txt` before pulling individual pages.
## Install as an Agent Skill
Fish Audio publishes a ready-made [Agent Skill](https://github.com/vercel-labs/skills) that teaches your coding agent how to call the Fish Audio REST and WebSocket APIs directly, without an SDK. It covers authentication, every endpoint in our OpenAPI schema, MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol.
```bash theme={null}
npx skills add https://docs.fish.audio --skill fish-audio-api
```
This installs the skill into your agent's local skill directory (for example `~/.claude/skills/fish-audio-api/`). Once installed, ask your agent to "call the Fish Audio TTS API with curl" or "stream TTS over WebSocket in Python" and it will follow the skill's conventions.
```bash theme={null}
npx skills add https://docs.fish.audio
```
Installs every skill advertised at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json), including the auto-generated product overview skill and the raw-API skill.
The discovery index lives at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json), and each skill's raw markdown is served at `/.well-known/agent-skills/{skill}/SKILL.md` (for example, [fish-audio-api/SKILL.md](https://docs.fish.audio/.well-known/agent-skills/fish-audio-api/SKILL.md)). Review the skill content first, then install with:
```bash theme={null}
npx skills add https://docs.fish.audio --list # show available skills
npx skills add https://docs.fish.audio --skill fish-audio-api
```
The `skills` CLI works with any agent that uses `SKILL.md` conventions — Claude Code, Cursor, Windsurf, Codex, and others. See [`npx skills --help`](https://github.com/vercel-labs/skills) for agent-specific install flags such as `-a claude-code` or `-a cursor`.
Prefer MCP if you want live documentation search inside your editor. Prefer the Agent Skill if you want a self-contained instruction file that works offline after install and doesn't rely on a running MCP server.
## Why Use MCP Integration?
Access the latest API documentation without leaving your editor
Generate working code based on current API specifications
Get context-aware help for debugging and optimization
## Setup
Open your terminal in your project directory and run:
```bash theme={null}
claude mcp add --transport http fish-audio --scope project https://docs.fish.audio/mcp
```
This creates a `.mcp.json` file in your project root with the Fish Audio documentation server configuration.
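For reference, the generated `.mcp.json` typically resembles the following (exact field names may vary between Claude Code versions):

```json theme={null}
{
  "mcpServers": {
    "fish-audio": {
      "type": "http",
      "url": "https://docs.fish.audio/mcp"
    }
  }
}
```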
Claude Code supports three installation scopes:
* **`--scope project`** (recommended): Stores configuration in `.mcp.json` at project root. Version-controlled and shared with your team.
* **`--scope user`**: Available globally across all your projects, but private to your account.
* **`--scope local`** (default): Project-specific but private to you only. Good for experimentation.
For team collaboration, use project scope and commit the `.mcp.json` file to git.
Check that the server is connected:
```bash theme={null}
claude mcp list
```
You should see `fish-audio` in the list of configured servers.
Ask Claude Code: "What Fish Audio models are available?" or "How do I use Fish Audio's TTS API?"
Use `Cmd+Shift+P` (Mac) or `Ctrl+Shift+P` (Windows/Linux) to open the command palette, then search for "Open MCP settings".
Select "Add custom MCP" to open the `mcp.json` configuration file.
Add the Fish Audio documentation server:
```json theme={null}
{
  "mcpServers": {
    "fish-audio": {
      "url": "https://docs.fish.audio/mcp"
    }
  }
}
```
Save the configuration file and reload Cursor to apply changes.
In Cursor's chat, ask: "What tools do you have available?" You should see the Fish Audio MCP server listed. Then try: "What Fish Audio TTS models are available?"
Cursor's MCP support was added in early 2025. Ensure you're running the latest version for full functionality.
Go to `File > Preferences > Windsurf Settings`, then navigate to `Cascade > Model Context Protocol (MCP) Servers`.
Click "Add custom server +" or "View raw config" to edit the configuration file at `~/.codeium/windsurf/mcp_config.json`.
Add the Fish Audio documentation server:
```json theme={null}
{
  "mcpServers": {
    "fish-audio": {
      "url": "https://docs.fish.audio/mcp"
    }
  }
}
```
Save the configuration and click the refresh button in Windsurf to apply changes.
Open Cascade chat (Ctrl+L) and ask: "Search Fish Audio docs for TTS API usage" or "What emotion parameters does Fish Audio support?"
Windsurf's MCP support was introduced in Wave 3 (February 2025). Ensure you're running the latest version.
## Using the Integration
### Example Queries
Once connected, ask your coding agent questions naturally:
"How do I authenticate with Fish Audio API?"
"Show me Python code for text-to-speech"
"What emotion parameters are available?"
"Help me implement real-time streaming"
### Code Generation Examples
Ask: "Generate a Python function for text-to-speech with Fish Audio"
```python theme={null}
from fish_audio import FishAudioClient

def text_to_speech(text: str, voice_id: str, output_file: str):
    """Convert text to speech using Fish Audio API"""
    client = FishAudioClient(api_key="your-api-key")
    response = client.tts.create(
        text=text,
        model_id=voice_id,
        format="mp3"
    )
    with open(output_file, "wb") as f:
        f.write(response.audio_data)
    return output_file
```
Ask: "Create a voice cloning pipeline with error handling"
```python theme={null}
import logging

from fish_audio import FishAudioClient

def clone_voice(audio_path: str, name: str):
    """Clone a voice from audio sample"""
    client = FishAudioClient(api_key="your-api-key")
    try:
        # Upload audio sample
        with open(audio_path, "rb") as f:
            model = client.models.create(
                name=name,
                audio_data=f.read(),
                description="Custom cloned voice"
            )
        logging.info(f"Voice cloned: {model.id}")
        return model.id
    except Exception as e:
        logging.error(f"Cloning failed: {e}")
        raise
```
Ask: "Implement real-time TTS streaming"
```python theme={null}
from fish_audio import FishAudioClient

async def stream_tts(text: str, voice_id: str):
    """Stream TTS audio in real-time"""
    client = FishAudioClient(api_key="your-api-key")
    async for chunk in client.tts.stream(
        text=text,
        model_id=voice_id,
        chunk_size=1024
    ):
        # Process each audio chunk as it arrives
        yield chunk
```
## Available Documentation
Your coding agent can access:
Complete endpoint documentation with parameters
Python SDK usage and examples
Optimization patterns and tips
Available models and rate limits
Custom voice creation guides
Common issues and solutions
## Advanced Usage
### Custom Commands
Create agent workflows for common tasks:
```text Voice Pipeline theme={null}
"Create a complete voice generation pipeline with:
- Authentication
- Voice selection
- Emotion control
- Error handling
- Audio export"
```
```text Batch Processing theme={null}
"Build a batch TTS processor that:
- Reads from CSV
- Handles rate limits
- Retries on failure
- Tracks progress"
```
```text WebSocket Client theme={null}
"Implement a WebSocket client for:
- Real-time streaming
- Auto-reconnection
- Buffer management
- Error recovery"
```
### Context-Aware Features
With MCP integration, your agent can:
* Suggest appropriate models based on use case
* Handle rate limiting automatically
* Provide inline documentation
* Validate API calls against specifications
* Recommend optimization strategies
## Troubleshooting
If the MCP server isn't connecting:
1. Verify internet connectivity
2. Check `https://docs.fish.audio/mcp` is accessible
3. Ensure your agent supports MCP protocol
4. Restart your coding environment
5. Clear any cached configurations
The MCP server always serves the latest documentation:
1. Refresh the MCP connection in settings
2. Clear documentation cache if available
3. Report persistent issues to [support@fish.audio](mailto:support@fish.audio)
If certain features aren't available:
1. Verify you're using the latest agent version
2. Check MCP protocol compatibility
3. Ensure proper server configuration
4. Contact support for assistance
## Security
**Your data is safe:**

* MCP provides read-only access to public documentation
* No API keys are transmitted through MCP
* All connections use HTTPS encryption
* No user queries or usage data is stored
## Next Steps
Start with Fish Audio API basics
Install and configure the Python SDK
Learn text-to-speech optimization
Create custom voice models
## Support
Need help with MCP integration?
* **Technical Support**: [support@fish.audio](mailto:support@fish.audio)
* **Documentation Issues**: [GitHub](https://github.com/fishaudio)
* **Community**: [Discord](https://discord.gg/dF9Db2Tt3Y)
# Migration Guide
Source: https://docs.fish.audio/developer-guide/resources/migration
Switch from ElevenLabs, OpenAI, or other TTS providers to Fish Audio
Coming soon! We're preparing comprehensive migration guides to help you seamlessly switch to Fish Audio.
We're working on detailed migration guides for:
* ElevenLabs
* OpenAI TTS
* Google Cloud Text-to-Speech
* Amazon Polly
* Other TTS providers
Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates.
# Roadmap
Source: https://docs.fish.audio/developer-guide/resources/roadmap
Upcoming features and improvements for Fish Audio
## Roadmap
Explore what's coming next for Fish Audio. Our roadmap reflects our current priorities and vision for the platform.
This roadmap is subject to change based on user feedback and technical considerations. Features may be added, modified, or removed as we continue to develop the platform.
### Coming Soon
Details about our upcoming features and improvements will be published here.
## Feature Requests
Have a feature request or want to vote on priorities? We'd love to hear from you:
* **Email**: [support@fish.audio](mailto:support@fish.audio)
* **Discord**: Join our [community Discord](https://discord.gg/dF9Db2Tt3Y)
* **GitHub**: Open an issue on our [GitHub repository](https://github.com/fishaudio)
## Stay Updated
Subscribe to our [changelog](/developer-guide/getting-started/changelog) RSS feed to get notified when new features are released.
# Authentication
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/authentication
Manage API keys and client setup in the Fish Audio JavaScript SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Client Initialization
Initialize a `FishAudioClient` with your API key to start using the SDK:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
// Initialize with your API key
const fishAudio = new FishAudioClient({ apiKey: "your_api_key" });
```
### Using Environment Variables
For better security, store your API key in environment variables:
Set the environment variable in your shell:
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
Then initialize the client without passing a key:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
const fishAudio = new FishAudioClient();
```
```typescript theme={null}
import { config } from "dotenv";
import { FishAudioClient } from "fish-audio";
// Load environment variables from .env file
config();
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
```
Create a `.env` file in your project root:
```bash theme={null}
FISH_API_KEY=your_api_key_here
```
### Custom Endpoints
If you need to use a proxy or custom endpoint:
```typescript theme={null}
const fishAudio = new FishAudioClient({
apiKey: "your_api_key",
baseUrl: "https://your-proxy-domain.com",
});
```
# Installation
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/installation
Install and set up the Fish Audio JavaScript SDK
To use the Fish Audio API in server-side JavaScript environments like Node.js, Deno, or Bun,
you can use the official [Fish Audio SDK for TypeScript and JavaScript](https://www.npmjs.com/package/fish-audio).
## Requirements
* Node.js 18 or higher
## Install
Install the JavaScript SDK from npm. Choose your preferred package manager:
```bash theme={null}
npm install fish-audio
```
```bash theme={null}
yarn add fish-audio
```
```bash theme={null}
pnpm add fish-audio
```
## Support
Need help? Check out these resources:
* [API Reference](/api-reference/introduction) - Complete API documentation
* [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model
* [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech
* [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming
* [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community
* [Support Email](mailto:support@fish.audio) - Contact our support team
# Speech to Text
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/speech-to-text
Convert audio to text with Fish Audio JavaScript SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Basic Usage
Transcribe audio to text:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
import { createReadStream } from "fs";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
const result = await fishAudio.speechToText.convert({
audio: createReadStream("audio.mp3"),
});
console.log(result.text);
console.log("Duration (s):", result.duration);
```
## Language Specification
Improve accuracy by specifying the language:
```typescript theme={null}
// English transcription
await fishAudio.speechToText.convert({
audio: createReadStream("audio.mp3"),
language: "en"
});
// Chinese transcription
await fishAudio.speechToText.convert({
audio: createReadStream("audio.mp3"),
language: "zh"
});
```
Common language codes: `en` (English), `zh` (Chinese), `es` (Spanish), `fr` (French), `de` (German), `ja` (Japanese), `ko` (Korean), `pt` (Portuguese)
Automatic language detection works well, but specifying the language improves accuracy and speed.
## Working with Segments
Get detailed timing for each segment:
```typescript theme={null}
const response = await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") });
// Full transcription
console.log(response.text);
// Segment details
for (const seg of response.segments ?? []) {
console.log(`[${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s] ${seg.text}`);
}
```
## Timestamps Control
Control timestamp generation:
```typescript theme={null}
// Include timestamps (default)
await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: false });
// Skip timestamp processing for faster results
await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: true });
```
`ignore_timestamps: false` (default) includes segment timestamps. Set to `true` to skip timestamp processing for faster transcription when you only need the text.
## Audio Formats
Supported audio formats:
* MP3 (recommended)
* WAV
* M4A
* OGG
* FLAC
* AAC
File requirements:
* Maximum size: 20MB
* Maximum duration: 60 minutes
* Sample rate: 16kHz or higher recommended
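These limits can be checked client-side before uploading. A minimal sketch — the 20MB cap comes from the list above; the constant and helper names are ours, not SDK exports:

```typescript
// 20MB upload cap from the file requirements above.
const MAX_UPLOAD_BYTES = 20 * 1024 * 1024;

// Returns true when a file of this size fits under the ASR upload limit.
function withinUploadLimit(sizeBytes: number): boolean {
  return sizeBytes <= MAX_UPLOAD_BYTES;
}
```

Pair this with `fs.promises.stat` to read a file's size before calling `speechToText.convert`, so oversized files fail fast instead of triggering a 413 from the API.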
## Transcribing TTS Output
Transcribe generated speech:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
const fishAudio = new FishAudioClient();
// Generate speech
const ttsAudio = await fishAudio.textToSpeech.convert({ text: "Hello, this is a test" });
// Transcribe it
const asr = await fishAudio.speechToText.convert({ audio: ttsAudio });
console.log(asr.text);
```
## Error Handling
Handle common errors:
```typescript theme={null}
try {
await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") });
} catch (e: any) {
const status = e?.status || e?.response?.status;
if (status === 413) console.error("Audio file too large (max 20MB)");
else if (status === 400) console.error("Invalid audio format");
else throw e;
}
```
## Response Structure
The ASR response includes:
| Field | Type | Description |
| ---------- | ------------- | ------------------------- |
| `text` | string | Complete transcription |
| `duration` | number | Audio duration (seconds) |
| `segments` | ASRSegment\[] | Timestamped text segments |
Segment structure:
| Field | Type | Description |
| ------- | ------ | -------------------- |
| `text` | string | Segment text |
| `start` | number | Start time (seconds) |
| `end` | number | End time (seconds) |
Note the timing units: `duration` and segment times are in seconds.
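The two tables above can be summarized as TypeScript shapes. This is a sketch for orientation — the field names follow the docs, but the interface names are ours, not SDK exports:

```typescript
// Shape of one timestamped segment (times in seconds).
interface ASRSegment {
  text: string;  // segment text
  start: number; // start time in seconds
  end: number;   // end time in seconds
}

// Shape of the full ASR response.
interface ASRResponse {
  text: string;           // complete transcription
  duration: number;       // audio duration in seconds
  segments?: ASRSegment[]; // omitted when ignore_timestamps is true
}
```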
## Request Parameters
| Parameter           | Type                              | Description                | Default            |
| ------------------- | --------------------------------- | -------------------------- | ------------------ |
| `audio`             | File \| Buffer \| Readable stream | Audio to transcribe        | Required           |
| `language`          | string                            | Language code (e.g., "en") | None (auto-detect) |
| `ignore_timestamps` | boolean                           | Skip timestamp processing  | false              |
# Text to Speech
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/text-to-speech
Convert text to natural speech with Fish Audio JavaScript SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Basic Usage
Generate speech from text:
```typescript theme={null}
import { FishAudioClient, play } from "fish-audio";
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
const audio = await fishAudio.textToSpeech.convert({
text: "Hello, world!",
});
await play(audio);
```
## Using Voice Models
Specify a voice model for consistent voice generation:
```typescript theme={null}
import { FishAudioClient, play } from "fish-audio";
const fishAudio = new FishAudioClient();
const audio = await fishAudio.textToSpeech.convert({
text: "This is my custom voice",
reference_id: "your_model_id", // Your model ID from fish.audio
});
await play(audio);
```
### Getting Model IDs
The `reference_id` is the model ID from the URL when viewing a model on Fish Audio:
* Model URL: `https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89`
* Reference ID: `802e3bc2b27e49c2995d23ef70e6ac89`
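Extracting the ID from a URL can be automated with a small helper. A sketch — the function name and the regular expression are ours, not part of the SDK:

```typescript
// Hypothetical helper: pull the reference ID out of a fish.audio model URL.
// Returns null when the URL is not a model page.
function modelIdFromUrl(url: string): string | null {
  const match = url.match(/fish\.audio\/m\/([0-9a-f]+)/);
  return match ? match[1] : null;
}
```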
You can also get model IDs programmatically:
```typescript theme={null}
// List your models
const results = await fishAudio.voices.search({ self: true });
for (const model of results.items ?? []) {
console.log(`${model.title}: ${model._id}`);
}
// Get specific model details
const model = await fishAudio.voices.get("your_model_id");
console.log(`Model: ${model.title}, ID: ${model._id}`);
```
## Emotions
The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
Add emotional expressions to your text:
```typescript theme={null}
import type { TTSRequest } from "fish-audio";
const text = `
(happy) I'm excited to share this!
(sad) Unfortunately, it didn't work out.
(whispering) This is a secret.
`;
const request: TTSRequest = { text, reference_id: "model_id" };
```
Common emotions: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)`, `(surprised)`, `(whispering)`, `(shouting)`, `(laughing)`, `(sighing)`
For more advanced control over speech generation, including phoneme-level control and additional paralanguage features, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control).
## Audio Formats
Choose output format based on your needs:
```typescript theme={null}
// MP3 (default)
await fishAudio.textToSpeech.convert({ text: "...", format: "mp3", mp3_bitrate: 192 });
// WAV - uncompressed
await fishAudio.textToSpeech.convert({ text: "...", format: "wav", sample_rate: 44100 });
// Opus - efficient for streaming
await fishAudio.textToSpeech.convert({ text: "...", format: "opus", opus_bitrate: 48 });
// PCM - raw audio data
await fishAudio.textToSpeech.convert({ text: "...", format: "pcm", sample_rate: 16000 });
```
## Prosody Control
Adjust speech speed and volume:
```typescript theme={null}
const audio = await fishAudio.textToSpeech.convert({
text: "Adjusted speech",
prosody: {
speed: 1.2, // 0.5 - 2.0
volume: 5, // -20 - 20
},
});
```
## Advanced Parameters
Fine-tune generation:
```typescript theme={null}
const audio = await fishAudio.textToSpeech.convert({
text: "Your text here",
chunk_length: 200, // Characters per chunk (100-300)
normalize: true, // Normalize text
latency: "balanced", // "normal" or "balanced"
temperature: 0.7, // Randomness (0.0-1.0)
top_p: 0.7, // Token selection (0.0-1.0)
});
```
## Choosing Backend
Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) is the default backend model for TTS. You can optionally select a model by passing a backend name (type `Backends`) as the second argument:
```typescript theme={null}
const audio = await fishAudio.textToSpeech.convert({
text: "Hello, world!",
}, "s2-pro");
```
## Streaming
For real-time streaming, see the [WebSocket guide](/developer-guide/sdk-guide/javascript/websocket).
## Error Handling
Handle common errors:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
import type { TTSRequest } from "fish-audio";
async function generateWithRetry(request: TTSRequest, maxRetries = 3) {
const fishAudio = new FishAudioClient();
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
return await fishAudio.textToSpeech.convert(request);
} catch (e: any) {
const status = e?.status || e?.response?.status;
if (status === 429) await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
else if (status === 401) throw new Error("Invalid API key");
else throw e;
}
}
throw new Error("Max retries exceeded");
}
```
## Request Parameters
| Parameter | Type | Description | Default |
| -------------- | --------- | -------------------- | ---------- |
| `text` | string | Text to convert | Required |
| `reference_id` | string | Voice model ID | None |
| `references` | object\[] | Reference audio | \[] |
| `format` | string | Audio format | "mp3" |
| `chunk_length` | number | Chunk size (100-300) | 200 |
| `normalize` | boolean | Normalize text | true |
| `latency` | string | Speed vs quality | "balanced" |
| `prosody` | object | Speed/volume | None |
| `temperature` | number | Randomness | 0.7 |
| `top_p` | number | Token selection | 0.7 |
## Next Steps
* [Fine-grained control](/developer-guide/core-features/fine-grained-control) for phoneme-level control and paralanguage
* [Voice cloning](/developer-guide/sdk-guide/javascript/voice-cloning) for custom voices
* [WebSocket streaming](/developer-guide/sdk-guide/javascript/websocket) for real-time apps
* [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for production use
* [API reference](/api-reference/endpoint/openapi-v1/text-to-speech) for direct API calls
# Voice Cloning
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/voice-cloning
Clone voices using reference audio with Fish Audio JavaScript SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Overview
Voice cloning allows you to generate speech that matches a specific voice using reference audio. Fish Audio supports two approaches:
* Using pre-trained voice models (reference\_id)
* Providing reference audio directly in your request
Use `reference_id` when you'll reuse a voice multiple times - it's faster and more efficient. Use `references` for one-off voice cloning or testing different voices without creating models.
## Using Reference Audio
Clone a voice by providing reference audio directly:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
import type { TTSRequest, ReferenceAudio } from "fish-audio";
import { readFile } from "fs/promises";
const fishAudio = new FishAudioClient();
const audioBuffer = await readFile("voice_sample.wav");
const referenceFile = new File([audioBuffer], "voice_sample.wav");
const referenceAudio: ReferenceAudio = {
audio: referenceFile,
text: "Text spoken in the reference audio"
};
const request: TTSRequest = {
text: "Hello, world!",
references: [referenceAudio]
};
const audio = await fishAudio.textToSpeech.convert(request);
```
## Multiple References
Improve voice quality by providing multiple reference samples:
```typescript theme={null}
import type { TTSRequest, ReferenceAudio } from "fish-audio";
import { readFile } from "fs/promises";
const references = [] as ReferenceAudio[];
for (const i of [0, 1, 2]) {
const buf = await readFile(`sample_${i}.wav`);
references.push({ audio: new File([buf], `sample_${i}.wav`), text: `Text from sample ${i}` });
}
const request: TTSRequest = {
text: "Better voice quality with multiple references",
references,
};
```
## Creating Voice Models
For repeated use, create a persistent voice model:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
import { createReadStream } from "fs";
const fishAudio = new FishAudioClient();
// Create a voice model from samples
const response = await fishAudio.voices.ivc.create({
title: "My Custom Voice",
voices: [
createReadStream("voice_0.wav"),
createReadStream("voice_1.wav"),
createReadStream("voice_2.wav"),
],
cover_image: createReadStream("cover.png"),
});
console.log("Created model:", response._id);
// Use the model
const audio = await fishAudio.textToSpeech.convert({
text: "Using my saved voice model",
reference_id: response._id,
});
```
## Best Practices
### Audio Quality
For best results, reference audio should:
* Be 10-30 seconds long per sample
* Have clear speech without background noise
* Match the language you'll generate
* Include varied intonation and emotion
### Sample Text
The text parameter in ReferenceAudio should:
* Match exactly what's spoken in the audio
* Include punctuation for proper prosody
* Be in the same language as generation
### Performance Tips
1. **Pre-upload models** for frequently used voices
2. **Use 2-3 reference samples** for optimal quality
3. **Keep samples under 30 seconds** each
4. **Normalize audio levels** before uploading
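The duration and sample-count guidance above can be enforced programmatically. A sketch, assuming you already know each candidate sample's length in seconds — the helper names are ours:

```typescript
// Reference samples should be 10-30 seconds each (see Audio Quality above).
function isGoodSampleDuration(seconds: number): boolean {
  return seconds >= 10 && seconds <= 30;
}

// Keep at most three in-range samples, per the performance tips.
function pickReferenceSamples(durations: number[]): number[] {
  return durations.filter(isGoodSampleDuration).slice(0, 3);
}
```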
## Audio Format Requirements
Supported formats for reference audio:
* WAV (recommended)
* MP3
* M4A
* Other common audio formats
Sample rates:
* 16kHz minimum
* 44.1kHz recommended
* Mono or stereo (converted to mono)
## Example: Voice Bank
Build a library of cloned voices:
```typescript theme={null}
import { FishAudioClient } from "fish-audio";
const fishAudio = new FishAudioClient();
async function createVoiceBank() {
const voiceBank: Record<string, string> = {};
const models = await fishAudio.voices.search();
for (const m of models.items ?? []) voiceBank[m.title] = m._id as string;
return voiceBank;
}
async function generateWithVoice(text: string, voiceName: string) {
const bank = await createVoiceBank();
const modelId = bank[voiceName];
if (!modelId) throw new Error(`Voice '${voiceName}' not found`);
return fishAudio.textToSpeech.convert({ text, reference_id: modelId });
}
```
## Combining with Emotions
Add emotions to cloned voices:
```typescript theme={null}
// With a saved model
await fishAudio.textToSpeech.convert({
text: "(happy) This is exciting news! (calm) Let me explain the details.",
reference_id: "your_model_id",
});
// Or with direct references
await fishAudio.textToSpeech.convert({
text: "(excited) Amazing discovery!",
references: [referenceAudio],
});
```
## Error Handling
Common issues and solutions:
```typescript theme={null}
try {
await fishAudio.textToSpeech.convert({ text: "Test speech", references: [referenceAudio] });
} catch (e: any) {
const msg = String(e?.message || e);
if (msg.includes("Invalid audio format")) console.error("Check audio format - use WAV or MP3");
else if (msg.includes("Audio too short")) console.error("Reference audio should be at least 10 seconds");
else throw e;
}
```
# WebSocket
Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/websocket
Real-time streaming with Fish Audio JavaScript SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Overview
WebSocket streaming enables real-time text-to-speech generation, perfect for conversational AI, live captioning, and streaming applications.
## Basic Streaming
Stream text and receive audio in real-time:
```typescript theme={null}
import { FishAudioClient, RealtimeEvents } from "fish-audio";
import { writeFile } from "fs/promises";
import path from "path";
// Simple async generator that yields text chunks
async function* makeTextStream() {
const chunks = [
"Hello from Fish Audio! ",
"This is a realtime text-to-speech test. ",
"We are streaming multiple chunks over WebSocket.",
];
for (const chunk of chunks) {
yield chunk;
}
}
const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
// For realtime, set text to "" and stream the content via makeTextStream
const request = { text: "" };
const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream());
// Collect audio and write to a file when the stream ends
const chunks: Buffer[] = [];
connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened"));
connection.on(RealtimeEvents.AUDIO_CHUNK, (audio: unknown): void => {
if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) {
chunks.push(Buffer.from(audio));
}
});
connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err));
connection.on(RealtimeEvents.CLOSE, async () => {
const outPath = path.resolve(process.cwd(), "out.mp3");
await writeFile(outPath, Buffer.concat(chunks));
console.log("Saved to", outPath);
});
```
Set `text: ""` in the request when streaming. The actual text comes from your text stream generator.
## Using Voice Models
Stream with a specific voice:
```typescript theme={null}
const request = {
text: "", // Empty for streaming
reference_id: "your_model_id",
format: "mp3",
};
const conn = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream());
conn.on(RealtimeEvents.AUDIO_CHUNK, () => { /* handle audio */ });
```
## Dynamic Text Generation
Stream text as it's generated:
```typescript theme={null}
async function* generateText() {
const responses = [
"Processing your request...",
"Here's what I found:",
"The answer is 42.",
];
for (const response of responses) {
for (const word of response.split(" ")) {
yield word + " ";
await new Promise(r => setTimeout(r, 20));
}
}
}
await fishAudio.textToSpeech.convertRealtime({ text: "" }, generateText());
```
## Line-by-Line Processing
Stream text line by line:
```typescript theme={null}
import { createReadStream } from "fs";
import readline from "readline";
async function* readFileLines(filepath: string) {
const rl = readline.createInterface({ input: createReadStream(filepath) });
for await (const line of rl) {
yield line.trim() + " ";
}
}
await fishAudio.textToSpeech.convertRealtime({ text: "" }, readFileLines("story.txt"));
```
## Errors
Handle connection errors via event listeners:
```typescript theme={null}
connection.on(RealtimeEvents.ERROR, (err) => {
console.error("WebSocket error:", err);
// Fallback to regular TTS or retry
});
```
## Configuration/Choosing Backend
Customize WebSocket behavior by configuring the client, and optionally select the backend model. Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) is the default:
```typescript theme={null}
// Custom endpoint
const fishAudio = new FishAudioClient({
apiKey: process.env.FISH_API_KEY,
baseUrl: "https://api.fish.audio", // Use a proxy/custom endpoint if needed
});
// Select backend model
const conn = await fishAudio.textToSpeech.convertRealtime(
request,
makeTextStream(),
"s2-pro" // backend model
);
```
## Best Practices
1. **Chunk Size**: Yield text in natural phrases for best prosody
2. **Buffer Management**: Process audio chunks immediately to avoid memory buildup
3. **Connection Reuse**: Keep WebSocket sessions alive for multiple streams
4. **Error Recovery**: Implement retry logic for connection failures
5. **Format Selection**: Use PCM for real-time playback, MP3 for storage
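The chunk-size tip can be applied by splitting free text into sentence-level phrases before yielding them into the stream. A minimal sketch — the splitting rule is our assumption, not an SDK feature:

```typescript
// Split text into natural phrases (sentence-ish chunks) for realtime streaming.
function* phraseChunks(text: string): Generator<string> {
  // Match runs of non-terminators, followed by optional terminators and spaces.
  for (const phrase of text.match(/[^.!?]+[.!?]*\s*/g) ?? []) {
    yield phrase;
  }
}
```

Pass the generator directly as the text stream, e.g. `convertRealtime({ text: "" }, phraseChunks(longText))`, so each yielded chunk is a complete phrase and prosody stays natural.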
## Events
The connection emits these events:
| Event | Description |
| ------------- | --------------------------------- |
| `OPEN` | WebSocket connection established |
| `AUDIO_CHUNK` | Audio chunk received (Uint8Array) |
| `ERROR` | Error occurred on the connection |
| `CLOSE` | Connection closed |
# Authentication
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/authentication
Configure API authentication for the Fish Audio Python SDK
## Get Your API Key
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Client Initialization
Initialize the [`FishAudio`](/api-reference/sdk/python/client#fishaudio-objects) client with your API key:
The most secure approach is using environment variables:
```python theme={null}
from fishaudio import FishAudio
# Automatically reads from FISH_API_KEY environment variable
client = FishAudio()
```
Set the environment variable in your shell:
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
Or create a `.env` file in your project root:
```bash theme={null}
FISH_API_KEY=your_api_key_here
```
Then load it using `python-dotenv`:
```python theme={null}
from dotenv import load_dotenv
from fishaudio import FishAudio
# Load environment variables from .env file
load_dotenv()
client = FishAudio()
```
Using environment variables keeps your API key out of your codebase and makes it easy to use different keys for development and production.
Provide the API key directly when initializing the client:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(api_key="your_api_key_here")
```
This approach is less secure. Never commit code containing your actual API key. Use this only for quick testing or when loading the key from a secure secrets manager.
If you're using a proxy or custom endpoint:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(
api_key="your_api_key",
base_url="https://your-proxy-domain.com"
)
```
This is useful for:
* Corporate proxies
* Development/staging environments
* Self-hosted deployments
## Verifying Authentication
Test your authentication by making a simple API call to check your account credits:
```python focus={7-9} theme={null}
from fishaudio import FishAudio
from fishaudio.exceptions import AuthenticationError
try:
client = FishAudio()
# Check account credits (requires valid authentication)
credits = client.account.get_credits()
print(f"Authentication successful! Credits: {credits.credit}")
except AuthenticationError:
print("Authentication failed. Check your API key.")
```
Handle [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticationerror-objects) when verifying authentication. The example uses [`get_credits()`](/api-reference/sdk/python/resources#get_credits) to verify the authentication works.
## Next Steps
Generate speech with the authenticated client
Clone voices and create custom models
Check credits and manage your account
Handle authentication errors properly
# Overview
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/overview
The official Python library for the Fish Audio API
This guide will walk you through installation, authentication, and core features.
If you're using the legacy Session-based API (`fish_audio_sdk`), see the [migration guide](/archive/python-sdk-legacy/migration-guide) to upgrade to the new SDK.
## Installation
Install via pip (Python 3.9 or higher required):
```bash theme={null}
pip install fish-audio-sdk
```
For audio playback utilities, install with the `utils` extra:
```bash theme={null}
pip install fish-audio-sdk[utils]
```
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
Configure your API key using environment variables:
```bash theme={null}
export FISH_API_KEY=your_api_key_here
```
Or create a `.env` file in your project root:
```bash theme={null}
FISH_API_KEY=your_api_key_here
```
## Quick Start
Get started with the [`FishAudio`](/api-reference/sdk/python/client#fishaudio-objects) client in less than a minute:
```python Synchronous theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play, save
# Initialize client (reads from FISH_API_KEY environment variable)
client = FishAudio()
# Generate and play audio
audio = client.tts.convert(text="Hello, playing from Fish Audio!")
play(audio)
# Generate and save audio
audio = client.tts.convert(text="Saving this audio to a file!")
save(audio, "output.mp3")
```
```python Asynchronous theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play, save
async def main():
# Initialize async client
client = AsyncFishAudio()
# Generate and play audio
audio = await client.tts.convert(text="Hello, playing from Fish Audio!")
play(audio)
# Generate and save audio
audio = await client.tts.convert(text="Saving this audio to a file!")
save(audio, "output.mp3")
asyncio.run(main())
```
## Core Features
### Text-to-Speech
Fully customizable text-to-speech generation:
```python Synchronous focus={6-10} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# With a specific voice
audio = client.tts.convert(
text="Custom voice",
reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian
)
play(audio)
```
```python Asynchronous focus={8-12} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# With a specific voice
audio = await client.tts.convert(
text="Custom voice",
reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian
)
play(audio)
asyncio.run(main())
```
```python Synchronous focus={6-10} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# With speed control
audio = client.tts.convert(
text="I'm talking pretty fast, is this still too slow?",
speed=1.5 # 1.5x speed
)
play(audio)
```
```python Asynchronous focus={8-12} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# With speed control
audio = await client.tts.convert(
text="I'm talking pretty fast, is this still too slow?",
speed=1.5 # 1.5x speed
)
play(audio)
asyncio.run(main())
```
Create reusable configurations with [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects). [`Prosody`](/api-reference/sdk/python/types#prosody-objects) controls speech characteristics like speed and volume:
```python Synchronous focus={7-18} theme={null}
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play
client = FishAudio()
# Define config once
my_config = TTSConfig(
prosody=Prosody(speed=1.2, volume=-5),
reference_id="933563129e564b19a115bedd57b7406a", # Sarah
format="wav",
latency="balanced"
)
# Reuse across multiple generations
audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config)
audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)
play(audio1)
play(audio2)
play(audio3)
```
```python Asynchronous focus={9-20} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Define config once
    my_config = TTSConfig(
        prosody=Prosody(speed=1.2, volume=-5),
        reference_id="933563129e564b19a115bedd57b7406a",  # Sarah
        format="wav",
        latency="balanced"
    )
    # Reuse across multiple generations
    audio1 = await client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
    audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config)
    audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)
    play(audio1)
    play(audio2)
    play(audio3)
asyncio.run(main())
```
For chunk-by-chunk processing, use [`stream()`](/api-reference/sdk/python/resources#stream) which returns an `AudioStream` (iterable). For real-time streaming with dynamic text, see [Real-time Streaming](#real-time-streaming) below.
Learn more in the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech).
### Speech-to-Text
Transcribe audio to text for various use cases:
```python Synchronous focus={5-16} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Transcribe audio
with open("audio.wav", "rb") as f:
    result = client.asr.transcribe(
        audio=f.read(),
        language="en"  # Optional: specify language
    )
print(result.text)
# Access segments
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
```
```python Asynchronous focus={7-18} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Transcribe audio
    with open("audio.wav", "rb") as f:
        result = await client.asr.transcribe(
            audio=f.read(),
            language="en"  # Optional: specify language
        )
    print(result.text)
    # Access segments
    for segment in result.segments:
        print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
asyncio.run(main())
```
Learn more in the [Speech-to-Text guide](/developer-guide/sdk-guide/python/speech-to-text).
### Real-time Streaming
Stream dynamically generated text for conversational AI and live applications. Perfect for integrating with LLM streaming responses, live captions, and chatbot interactions:
```python Synchronous focus={7-15} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Stream dynamically generated text (e.g., from LLM)
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming text!"
audio_stream = client.tts.stream_websocket(
    text_chunks(),
    latency="balanced"
)
play(audio_stream)
```
```python Asynchronous focus={9-17} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Stream dynamically generated text
    async def text_chunks():
        yield "Hello, "
        yield "this is "
        yield "streaming text!"
    audio_stream = await client.tts.stream_websocket(
        text_chunks(),
        latency="balanced"
    )
    play(audio_stream)
asyncio.run(main())
```
Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket).
### Voice Cloning
**Instant voice cloning** - Clone a voice on-the-fly using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects):
```python Synchronous focus={6-12} theme={null}
from fishaudio import FishAudio
from fishaudio.types import ReferenceAudio
client = FishAudio()
# Instant voice cloning
with open("reference.wav", "rb") as f:
    audio = client.tts.convert(
        text="This will sound like the reference voice",
        references=[ReferenceAudio(
            audio=f.read(),
            text="Text spoken in the reference audio"
        )]
    )
```
```python Asynchronous focus={8-14} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import ReferenceAudio
async def main():
    client = AsyncFishAudio()
    # Instant voice cloning
    with open("reference.wav", "rb") as f:
        audio = await client.tts.convert(
            text="This will sound like the reference voice",
            references=[ReferenceAudio(
                audio=f.read(),
                text="Text spoken in the reference audio"
            )]
        )
asyncio.run(main())
```
**Voice models** - Create persistent voice models for repeated use:
```python Synchronous focus={6-11} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Create persistent voice model
with open("voice_sample.wav", "rb") as f:
    voice = client.voices.create(
        title="My Custom Voice",
        voices=[f.read()],
        description="Custom voice clone"
    )
print(f"Created voice: {voice.id}")
```
```python Asynchronous focus={8-13} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Create persistent voice model
    with open("voice_sample.wav", "rb") as f:
        voice = await client.voices.create(
            title="My Custom Voice",
            voices=[f.read()],
            description="Custom voice clone"
        )
    print(f"Created voice: {voice.id}")
asyncio.run(main())
```
Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning).
## Client Initialization
The recommended approach using environment variables:
```python theme={null}
from fishaudio import FishAudio
# Automatically reads from FISH_API_KEY environment variable
client = FishAudio()
```
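For local development, you can set the variable in your current shell before running your script. The variable name `FISH_API_KEY` comes from the comment above; the value below is a placeholder:

```bash theme={null}
# Set the key for the current shell session (placeholder value)
export FISH_API_KEY="your_api_key"
```

For anything beyond a quick test, a shell profile or a secrets manager is a better home for the real key than an inline `export`.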
Provide the API key directly:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(api_key="your_api_key")
```
Never commit API keys to version control. Use environment variables or secret management systems.
Configure a custom base URL:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio(
    api_key="your_api_key",
    base_url="https://your-proxy-domain.com"
)
```
## Sync vs Async
The SDK provides both synchronous and asynchronous clients:
```python Synchronous theme={null}
from fishaudio import FishAudio
# For typical applications
client = FishAudio()
audio = client.tts.convert(text="Hello!")
```
```python Asynchronous theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    # For async applications (web servers, concurrent tasks)
    client = AsyncFishAudio()
    audio = await client.tts.convert(text="Hello!")
asyncio.run(main())
```
Use [`AsyncFishAudio`](/api-reference/sdk/python/client#asyncfishaudio-objects) when:
* Building async web applications (FastAPI, Sanic, etc.)
* Processing multiple requests concurrently
* Integrating with other async libraries
* You need maximum performance
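To illustrate the concurrency benefit, here is a runnable sketch of fanning out several generations with `asyncio.gather`. The `fake_convert` coroutine is a stand-in for `await client.tts.convert(...)` (it is not part of the SDK) so the pattern runs without credentials:

```python theme={null}
import asyncio

async def fake_convert(text: str) -> bytes:
    # Stand-in for `await client.tts.convert(text=...)`:
    # simulate network latency, then return dummy audio bytes.
    await asyncio.sleep(0.01)
    return f"audio:{text}".encode()

async def main() -> list:
    texts = ["First clip.", "Second clip.", "Third clip."]
    # All three requests are in flight at once instead of back-to-back;
    # gather() preserves the input order in its results.
    return await asyncio.gather(*(fake_convert(t) for t in texts))

results = asyncio.run(main())
print(len(results), "clips generated")
```

With the real client, each `fake_convert(t)` call becomes `client.tts.convert(text=t)` and the total wall time approaches that of the slowest single request.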
## Resource Clients
The SDK organizes functionality into resource clients:
| Resource | Description | Key Methods |
| ----------------------------------------------------------------------------- | ------------------ | ----------------------------------------------------- |
| [`client.tts`](/api-reference/sdk/python/resources#ttsclient-objects) | Text-to-speech | `convert()`, `stream()`, `stream_websocket()` |
| [`client.asr`](/api-reference/sdk/python/resources#asrclient-objects) | Speech recognition | `transcribe()` |
| [`client.voices`](/api-reference/sdk/python/resources#voicesclient-objects) | Voice management | `list()`, `get()`, `create()`, `update()`, `delete()` |
| [`client.account`](/api-reference/sdk/python/resources#accountclient-objects) | Account info | `get_credits()`, `get_package()` |
## Utility Functions
The SDK includes helpful utilities (requires `utils` extra):
```python theme={null}
from fishaudio.utils import save, play, stream
# Save audio to file
save(audio, "output.mp3")
# Play audio (automatically detects environment)
play(audio) # Works in Jupyter, regular Python, etc.
# Stream audio in real-time (requires mpv)
stream(audio_iterator)
```
Use [`play()`](/api-reference/sdk/python/utils#play) for playback and [`save()`](/api-reference/sdk/python/utils#save) for writing audio files.
Learn more in the [API Reference - Utils](/api-reference/sdk/python/utils).
## Error Handling
The SDK provides a comprehensive exception hierarchy:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.exceptions import (
    FishAudioError,
    AuthenticationError,
    RateLimitError,
    ValidationError
)
client = FishAudio()
try:
    audio = client.tts.convert(text="Hello!")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limit exceeded. Please wait before retrying.")
except ValidationError as e:
    print(f"Invalid request: {e}")
except FishAudioError as e:
    print(f"API error: {e}")
```
The SDK includes exceptions for [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticationerror-objects), [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects) for common error scenarios.
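One subtlety in the snippet above: `except` clauses are checked top to bottom, so the specific exceptions must come before the base class or they will never match. A stand-alone sketch with stub classes (mirroring, not importing, the SDK hierarchy — assumed here to have `FishAudioError` as the base, as the catch-all clause above suggests):

```python theme={null}
# Stub classes for illustration only; the real ones live in fishaudio.exceptions
class FishAudioError(Exception): ...
class RateLimitError(FishAudioError): ...

def classify(exc: Exception) -> str:
    # Specific subclass first, base class last - same order as the SDK example
    try:
        raise exc
    except RateLimitError:
        return "rate-limited"
    except FishAudioError:
        return "other API error"

print(classify(RateLimitError()))  # matches the specific clause
print(classify(FishAudioError()))  # falls through to the base clause
```

If `except FishAudioError` were listed first, it would swallow `RateLimitError` too, since a subclass instance matches its base class.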
Learn more in the [API Reference - Exceptions](/api-reference/sdk/python/exceptions).
## Next Steps
* Set up API keys and client configuration
* Generate natural-sounding speech
* Clone voices and manage voice models
* Transcribe audio to text
* Real-time audio streaming
* Complete API documentation
## Resources
* [GitHub Repository](https://github.com/fishaudio/fish-audio-python)
* [PyPI Package](https://pypi.org/project/fish-audio-sdk/)
* [Migration Guide](/archive/python-sdk-legacy/migration-guide) - Upgrade from legacy SDK
* [Best Practices](/developer-guide/best-practices/) - Production-ready tips
* [API Reference](/api-reference/sdk/python/) - Detailed documentation
# Speech-to-Text
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/speech-to-text
Transcribe audio to text with the Fish Audio Python SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account and complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Basic Transcription
Transcribe audio files to text with automatic language detection using [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe):
```python Synchronous focus={6-10} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Transcribe audio
with open("audio.mp3", "rb") as f:
    result = client.asr.transcribe(audio=f.read())
print(f"Transcription: {result.text}")
print(f"Duration: {result.duration}ms")
```
```python Asynchronous focus={8-12} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Transcribe audio
    with open("audio.mp3", "rb") as f:
        result = await client.asr.transcribe(audio=f.read())
    print(f"Transcription: {result.text}")
    print(f"Duration: {result.duration}ms")
asyncio.run(main())
```
The [`ASRResponse`](/api-reference/sdk/python/types#asrresponse-objects) object contains the full transcription and segment details.
## Language Specification
Specify the language for more accurate transcription:
```python Synchronous focus={5-11} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Specify language code
with open("chinese_audio.mp3", "rb") as f:
    result = client.asr.transcribe(
        audio=f.read(),
        language="zh"  # Chinese
    )
print(result.text)
```
```python Asynchronous focus={7-13} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Specify language code
    with open("chinese_audio.mp3", "rb") as f:
        result = await client.asr.transcribe(
            audio=f.read(),
            language="zh"  # Chinese
        )
    print(result.text)
asyncio.run(main())
```
Auto-detection works well for most cases, but specifying the language can improve accuracy, especially for languages with similar phonetics.
## Segment Timestamps
Access word-level or phrase-level timestamps:
```python Synchronous focus={5-14} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Transcribe with segments
with open("audio.mp3", "rb") as f:
    result = client.asr.transcribe(audio=f.read())
# Access full text
print(f"Full text: {result.text}")
# Iterate through segments
for segment in result.segments:
    print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}")
```
```python Asynchronous focus={7-16} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Transcribe with segments
    with open("audio.mp3", "rb") as f:
        result = await client.asr.transcribe(audio=f.read())
    # Access full text
    print(f"Full text: {result.text}")
    # Iterate through segments
    for segment in result.segments:
        print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}")
asyncio.run(main())
```
## Next Steps
* Convert transcribed text back to speech
* Use transcribed audio for voice cloning
* Complete ASR API documentation
* Production tips and optimization
## Related Resources
* [ASR Types Reference](/api-reference/sdk/python/types#asr) - ASR response data structures
* [Error Handling](/api-reference/sdk/python/exceptions) - Exception types and handling
# Text-to-Speech
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/text-to-speech
Generate natural-sounding speech with the Fish Audio Python SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account and complete the verification steps.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Understanding TTS Methods
The SDK provides three methods for text-to-speech generation, each optimized for different use cases:
| Method | Returns | Best For |
| ---------------------------------------------------------------------------- | -------------------- | ------------------------------------------------------------------------ |
| [`convert()`](/api-reference/sdk/python/resources#convert) | Complete audio bytes | Most use cases - simple, gets full audio at once |
| [`stream()`](/api-reference/sdk/python/resources#stream) | `AudioStream` | Chunk-by-chunk processing, memory-efficient transfer |
| [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) | Audio bytes iterator | Real-time streaming with dynamic text (LLM responses, conversational AI) |
Use `convert()` for most use cases. Use `stream()` for memory efficiency when handling large files. Use `stream_websocket()` when text is generated dynamically in real-time.
## Basic Usage
Generate speech from text with a single function call:
```python Synchronous focus={6-9} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import save, play
client = FishAudio()
# Generate speech (returns bytes)
audio = client.tts.convert(text="Hello, welcome to Fish Audio!")
# Play or save the audio
play(audio)
save(audio, "output.mp3")
```
```python Asynchronous focus={8-11} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import save, play
async def main():
    client = AsyncFishAudio()
    # Generate speech (returns bytes)
    audio = await client.tts.convert(text="Hello, welcome to Fish Audio!")
    # Play or save the audio
    play(audio)
    save(audio, "output.mp3")
asyncio.run(main())
```
## Using Voice Models
Specify a voice model for consistent voice characteristics:
```python Synchronous focus={6-10} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Use a specific voice
audio = client.tts.convert(
    text="This uses a specific voice model",
    reference_id="bf322df2096a46f18c579d0baa36f41d"  # Adrian
)
play(audio)
```
```python Asynchronous focus={8-12} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Use a specific voice
    audio = await client.tts.convert(
        text="This uses a specific voice model",
        reference_id="bf322df2096a46f18c579d0baa36f41d"  # Adrian
    )
    play(audio)
asyncio.run(main())
```
### Finding Voice Models
Get voice model IDs from the Fish Audio website or programmatically:
```python Synchronous focus={5-16} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# List available voices
voices = client.voices.list(language="en", tags="male")
for voice in voices.items:
    print(f"{voice.title}: {voice.id}")
# Use a voice from the list
audio = client.tts.convert(
    text="Generated with discovered voice",
    reference_id=voices.items[0].id
)
play(audio)
```
```python Asynchronous focus={7-18} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # List available voices
    voices = await client.voices.list(language="en", tags="male")
    for voice in voices.items:
        print(f"{voice.title}: {voice.id}")
    # Use a voice from the list
    audio = await client.tts.convert(
        text="Generated with discovered voice",
        reference_id=voices.items[0].id
    )
    play(audio)
asyncio.run(main())
```
Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning).
## Emotions and Expressions
The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.
Add emotional expressions to make speech more natural:
```python Synchronous focus={5-16} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
text = """
(happy) I'm excited to announce this!
(sad) Unfortunately, it didn't work out.
(angry) This is so frustrating!
(calm) Let me explain the details.
"""
audio = client.tts.convert(
    text=text,
    reference_id="933563129e564b19a115bedd57b7406a"  # Sarah
)
play(audio)
```
```python Asynchronous focus={7-18} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    text = """
    (happy) I'm excited to announce this!
    (sad) Unfortunately, it didn't work out.
    (angry) This is so frustrating!
    (calm) Let me explain the details.
    """
    audio = await client.tts.convert(
        text=text,
        reference_id="933563129e564b19a115bedd57b7406a"  # Sarah
    )
    play(audio)
asyncio.run(main())
```
See the [Emotion Reference](/api-reference/emotion-reference) for all available emotions and [Fine-grained Control](/developer-guide/core-features/fine-grained-control) for advanced usage.
## Audio Formats
Choose the output format based on your needs:
```python Synchronous focus={5-21} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# MP3 (default) - good balance of quality and size
audio = client.tts.convert(
    text="MP3 format",
    format="mp3"
)
# WAV - uncompressed, highest quality
audio = client.tts.convert(
    text="WAV format",
    format="wav"
)
# PCM - raw audio data for streaming
audio = client.tts.convert(
    text="PCM format",
    format="pcm"
)
```
```python Asynchronous focus={7-23} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # MP3 (default) - good balance of quality and size
    audio = await client.tts.convert(
        text="MP3 format",
        format="mp3"
    )
    # WAV - uncompressed, highest quality
    audio = await client.tts.convert(
        text="WAV format",
        format="wav"
    )
    # PCM - raw audio data for streaming
    audio = await client.tts.convert(
        text="PCM format",
        format="pcm"
    )
asyncio.run(main())
```
## Prosody Control
Adjust speech speed and volume for natural-sounding output:
```python Synchronous focus={6-10} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Simple speed adjustment
audio = client.tts.convert(
    text="This will be spoken faster",
    speed=1.5  # 1.5x speed (range: 0.5-2.0)
)
play(audio)
```
```python Asynchronous focus={8-12} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Simple speed adjustment
    audio = await client.tts.convert(
        text="This will be spoken faster",
        speed=1.5  # 1.5x speed (range: 0.5-2.0)
    )
    play(audio)
asyncio.run(main())
```
For combined speed and volume control, use [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects) with [`Prosody`](/api-reference/sdk/python/types#prosody-objects):
```python Synchronous focus={7-17} theme={null}
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play
client = FishAudio()
# Configure prosody with TTSConfig
audio = client.tts.convert(
    text="Adjusted speech with custom speed and volume",
    config=TTSConfig(
        prosody=Prosody(
            speed=1.2,  # 20% faster
            volume=5    # Louder (range: -20 to 20)
        )
    )
)
play(audio)
```
```python Asynchronous focus={9-19} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import TTSConfig, Prosody
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Configure prosody with TTSConfig
    audio = await client.tts.convert(
        text="Adjusted speech with custom speed and volume",
        config=TTSConfig(
            prosody=Prosody(
                speed=1.2,  # 20% faster
                volume=5    # Louder (range: -20 to 20)
            )
        )
    )
    play(audio)
asyncio.run(main())
```
## Reusable TTS Configuration
Create a configuration once and reuse it across multiple generations:
```python Synchronous focus={5-18} theme={null}
from fishaudio import FishAudio
from fishaudio.types import TTSConfig, Prosody
client = FishAudio()
# Define config once
my_config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    format="wav",
    latency="balanced"
)
# Reuse across multiple generations
audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config)
audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)
```
```python Asynchronous focus={7-20} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import TTSConfig, Prosody
async def main():
    client = AsyncFishAudio()
    # Define config once
    my_config = TTSConfig(
        prosody=Prosody(speed=1.2, volume=-5),
        reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
        format="wav",
        latency="balanced"
    )
    # Reuse across multiple generations
    audio1 = await client.tts.convert(text="Welcome to our product demonstration.", config=my_config)
    audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config)
    audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config)
asyncio.run(main())
```
## Chunk-by-Chunk Streaming
Use `stream()` for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments):
```python Synchronous focus={5-8} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Collect all chunks efficiently
audio_stream = client.tts.stream(text="Long text here")
audio = audio_stream.collect() # Returns complete audio as bytes
```
```python Asynchronous focus={7-10} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Collect all chunks efficiently
    audio_stream = await client.tts.stream(text="Long text here")
    audio = await audio_stream.collect()  # Returns complete audio as bytes
asyncio.run(main())
```
For streaming to files or network without buffering in memory:
```python Synchronous focus={5-9} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Stream directly to file (memory efficient for large audio)
audio_stream = client.tts.stream(text="Very long text...")
with open("output.mp3", "wb") as f:
    for chunk in audio_stream:
        f.write(chunk)  # Write each chunk as it arrives
```
```python Asynchronous focus={7-11} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
    client = AsyncFishAudio()
    # Stream directly to file (memory efficient for large audio)
    audio_stream = await client.tts.stream(text="Very long text...")
    with open("output.mp3", "wb") as f:
        async for chunk in audio_stream:
            f.write(chunk)  # Write each chunk as it arrives
asyncio.run(main())
```
Use `stream()` when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use `stream_websocket()` instead.
## Real-time WebSocket Streaming
For real-time applications where text is generated dynamically, use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket). This is perfect for LLM integrations, conversational AI, and live captions:
### Basic WebSocket Streaming
```python Synchronous focus={5-15} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Stream dynamically generated text
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming text!"
audio_stream = client.tts.stream_websocket(
    text_chunks(),
    latency="balanced"
)
play(audio_stream)
```
```python Asynchronous focus={7-16} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Stream dynamically generated text
    async def text_chunks():
        yield "Hello, "
        yield "this is "
        yield "streaming text!"
    audio_stream = await client.tts.stream_websocket(
        text_chunks(),
        latency="balanced"
    )
    play(audio_stream)
asyncio.run(main())
```
### Understanding `FlushEvent`
The [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects) forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn't reached the optimal chunk size.
```python Synchronous focus={6-18} theme={null}
from fishaudio import FishAudio
from fishaudio.types import FlushEvent
client = FishAudio()
# Use FlushEvent to force immediate generation
def text_with_flush():
    yield "This is the first sentence. "
    yield "This is the second sentence. "
    yield FlushEvent()  # Force audio generation NOW
    yield "This starts a new segment. "
    yield "And continues here."
    yield FlushEvent()  # Force final generation
audio_stream = client.tts.stream_websocket(text_with_flush())
# Process each audio chunk as it arrives
for chunk in audio_stream:
    print(f"Received audio chunk: {len(chunk)} bytes")
```
```python Asynchronous focus={8-20} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import FlushEvent
async def main():
    client = AsyncFishAudio()
    # Use FlushEvent to force immediate generation
    async def text_with_flush():
        yield "This is the first sentence. "
        yield "This is the second sentence. "
        yield FlushEvent()  # Force audio generation NOW
        yield "This starts a new segment. "
        yield "And continues here."
        yield FlushEvent()  # Force final generation
    audio_stream = await client.tts.stream_websocket(text_with_flush())
    # Process each audio chunk as it arrives
    async for chunk in audio_stream:
        print(f"Received audio chunk: {len(chunk)} bytes")
asyncio.run(main())
```
Without `FlushEvent`, the engine automatically generates audio when the buffer reaches an optimal size. Use `FlushEvent` to control exactly when audio should be generated, which can reduce perceived latency in interactive applications.
### `TextEvent` vs Plain Strings
You can yield plain strings (recommended for simplicity) or use [`TextEvent`](/api-reference/sdk/python/types#textevent-objects) for explicit control:
```python Synchronous focus={6-17} theme={null}
from fishaudio import FishAudio
from fishaudio.types import TextEvent
client = FishAudio()
# Both approaches are equivalent
def text_as_strings():
    yield "Hello, "
    yield "world!"
def text_as_events():
    yield TextEvent(text="Hello, ")
    yield TextEvent(text="world!")
# Use whichever style you prefer
audio1 = client.tts.stream_websocket(text_as_strings())
audio2 = client.tts.stream_websocket(text_as_events())
```
```python Asynchronous focus={8-19} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import TextEvent
async def main():
    client = AsyncFishAudio()
    # Both approaches are equivalent
    async def text_as_strings():
        yield "Hello, "
        yield "world!"
    async def text_as_events():
        yield TextEvent(text="Hello, ")
        yield TextEvent(text="world!")
    # Use whichever style you prefer
    audio1 = await client.tts.stream_websocket(text_as_strings())
    audio2 = await client.tts.stream_websocket(text_as_events())
asyncio.run(main())
```
### LLM Integration Pattern
WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio:
```python Synchronous focus={5-19} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Simulate streaming LLM response
def llm_stream():
    """Simulates text chunks from an LLM"""
    tokens = [
        "The ", "weather ", "today ", "is ", "sunny ",
        "with ", "clear ", "skies. ", "Perfect ",
        "for ", "outdoor ", "activities!"
    ]
    for token in tokens:
        yield token
# Stream to speech in real-time
audio_stream = client.tts.stream_websocket(llm_stream())
play(audio_stream)
```
```python Asynchronous focus={7-21} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
    client = AsyncFishAudio()
    # Simulate streaming LLM response
    async def llm_stream():
        """Simulates text chunks from an LLM"""
        tokens = [
            "The ", "weather ", "today ", "is ", "sunny ",
            "with ", "clear ", "skies. ", "Perfect ",
            "for ", "outdoor ", "activities!"
        ]
        for token in tokens:
            yield token
    # Stream to speech in real-time
    audio_stream = await client.tts.stream_websocket(llm_stream())
    play(audio_stream)
asyncio.run(main())
```
The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`.
Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket).
## Advanced Configuration
Comprehensive `TTSConfig` with all available parameters:
```python focus={3-24} theme={null}
from fishaudio.types import TTSConfig, Prosody
# All TTSConfig parameters
config = TTSConfig(
    # Audio output settings
    format="mp3",
    sample_rate=44100,  # Custom sample rate (optional)
    mp3_bitrate=192,    # 64, 128, or 192 kbps
    opus_bitrate=64,    # For Opus format: -1000, 24, 32, 48, or 64
    normalize=True,     # Normalize audio levels
    # Generation settings
    chunk_length=200,   # Characters per chunk (100-300)
    latency="balanced", # "normal" or "balanced"
    # Voice/style settings
    reference_id="bf322df2096a46f18c579d0baa36f41d",  # Adrian
    prosody=Prosody(speed=1.1, volume=0),
    # references=[ReferenceAudio(...)] # For instant cloning
    # Model parameters
    temperature=0.7,  # Randomness (0.0-1.0)
    top_p=0.7         # Token selection (0.0-1.0)
)
# Use with any client (assumes `client = FishAudio()` was created earlier)
audio = client.tts.convert(text="Your text here", config=config)
```
`TTSConfig` works the same for both sync and async clients. See [TTSConfig API Reference](/api-reference/sdk/python/types#ttsconfig-objects) for detailed documentation on each parameter and their defaults.
## Error Handling
Handle common TTS errors gracefully:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.exceptions import (
    RateLimitError,
    ValidationError,
    NotFoundError,
    FishAudioError
)
import time
client = FishAudio()
try:
    audio = client.tts.convert(
        text="Your text here",
        reference_id="voice_id"
    )
except RateLimitError:
    print("Rate limit exceeded. Please wait before retrying.")
    time.sleep(60)  # Wait before retry
except NotFoundError:
    print("Voice model not found. Check the reference_id")
except ValidationError as e:
    print(f"Invalid request: {e}")
except FishAudioError as e:
    print(f"API error: {e}")
```
Common exceptions include [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), [`NotFoundError`](/api-reference/sdk/python/exceptions#notfounderror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects).
## Best Practices
For long texts, adjust `chunk_length` in `TTSConfig`:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.types import TTSConfig
client = FishAudio()
audio = client.tts.convert(
text="Very long text...",
config=TTSConfig(chunk_length=250) # Larger chunks for efficiency
)
```
If you generate the same speech repeatedly, cache the results:
```python theme={null}
import os
from fishaudio import FishAudio
from fishaudio.utils import save
client = FishAudio()
def get_or_generate_speech(text, cache_file):
if os.path.exists(cache_file):
with open(cache_file, "rb") as f:
return f.read()
audio = client.tts.convert(text=text)
save(audio, cache_file)
return audio
```
Implement exponential backoff for rate limits:
```python theme={null}
from fishaudio import FishAudio
from fishaudio.exceptions import RateLimitError
import time
client = FishAudio()
def generate_with_retry(text, max_retries=3):
for attempt in range(max_retries):
try:
return client.tts.convert(text=text)
        except RateLimitError:
if attempt < max_retries - 1:
time.sleep(2 ** attempt) # Exponential backoff
else:
raise
```
Balance speed vs quality based on your use case:
```python theme={null}
from fishaudio import FishAudio
client = FishAudio()
# For real-time applications
audio = client.tts.convert(text="Fast response", latency="balanced")
# For highest quality
audio = client.tts.convert(text="Best quality", latency="normal")
```
## Next Steps
* Create custom voice models
* Real-time audio streaming
* Phoneme-level control and paralanguage
* Production tips and optimization
## Related Resources
* [TTS API Reference](/api-reference/sdk/python/resources#tts) - Complete API documentation
* [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Format comparison
* [Emotion Reference](/api-reference/emotion-reference) - All available emotions
* [Utils Reference](/api-reference/sdk/python/utils) - Audio utilities
# Voice Cloning
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/voice-cloning
Clone voices and create custom voice models with the Fish Audio Python SDK
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the steps to verify your account.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Instant Voice Cloning
Clone a voice on-the-fly without creating a persistent model using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects):
```python Synchronous focus={6-15} theme={null}
from fishaudio import FishAudio
from fishaudio.types import ReferenceAudio
from fishaudio.utils import play
client = FishAudio()
# Clone from reference audio
with open("reference_voice.wav", "rb") as f:
audio = client.tts.convert(
text="This will sound like the reference voice",
references=[ReferenceAudio(
audio=f.read(),
text="Text spoken in the reference audio"
)]
)
play(audio)
```
```python Asynchronous focus={8-17} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import ReferenceAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# Clone from reference audio
with open("reference_voice.wav", "rb") as f:
audio = await client.tts.convert(
text="This will sound like the reference voice",
references=[ReferenceAudio(
audio=f.read(),
text="Text spoken in the reference audio"
)]
)
play(audio)
asyncio.run(main())
```
Instant voice cloning is perfect for one-time use cases. For repeated use of the same voice, create a persistent voice model instead.
## Multiple Reference Samples
Improve voice quality by providing multiple reference samples:
```python Synchronous focus={6-21} theme={null}
from fishaudio import FishAudio
from fishaudio.types import ReferenceAudio
from fishaudio.utils import play
client = FishAudio()
# Load multiple reference samples
references = []
samples = [
("sample1.wav", "First sample transcript"),
("sample2.wav", "Second sample transcript"),
("sample3.wav", "Third sample transcript")
]
for audio_file, transcript in samples:
with open(audio_file, "rb") as f:
references.append(ReferenceAudio(
audio=f.read(),
text=transcript
))
# Generate with multiple references
audio = client.tts.convert(
text="This voice is trained on multiple samples",
references=references
)
play(audio)
```
```python Asynchronous focus={8-23} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import ReferenceAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# Load multiple reference samples
references = []
samples = [
("sample1.wav", "First sample transcript"),
("sample2.wav", "Second sample transcript"),
("sample3.wav", "Third sample transcript")
]
for audio_file, transcript in samples:
with open(audio_file, "rb") as f:
references.append(ReferenceAudio(
audio=f.read(),
text=transcript
))
# Generate with multiple references
audio = await client.tts.convert(
text="This voice is trained on multiple samples",
references=references
)
play(audio)
asyncio.run(main())
```
## Creating Persistent Voice Models
Create a reusable voice model for consistent voice characteristics using [`voices.create()`](/api-reference/sdk/python/resources#create):
```python Synchronous focus={5-20} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Prepare voice samples
voice_samples = []
with open("voice1.wav", "rb") as f1:
voice_samples.append(f1.read())
with open("voice2.wav", "rb") as f2:
voice_samples.append(f2.read())
# Create voice model
voice = client.voices.create(
title="My Custom Voice",
voices=voice_samples,
description="A custom voice for my project",
tags=["custom", "english"],
visibility="private"
)
print(f"Created voice: {voice.id}")
```
```python Asynchronous focus={7-22} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Prepare voice samples
voice_samples = []
with open("voice1.wav", "rb") as f1:
voice_samples.append(f1.read())
with open("voice2.wav", "rb") as f2:
voice_samples.append(f2.read())
# Create voice model
voice = await client.voices.create(
title="My Custom Voice",
voices=voice_samples,
description="A custom voice for my project",
tags=["custom", "english"],
visibility="private"
)
print(f"Created voice: {voice.id}")
asyncio.run(main())
```
### With Transcripts
Providing transcripts is faster and more accurate than automatic transcription. When you provide transcripts, the system skips running ASR (speech recognition), resulting in better performance and quality:
```python Synchronous focus={5-27} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Voice samples with transcripts
samples = [
("voice1.wav", "This is the first sample"),
("voice2.wav", "This is the second sample"),
("voice3.wav", "This is the third sample")
]
voices = []
texts = []
for audio_file, transcript in samples:
with open(audio_file, "rb") as f:
voices.append(f.read())
texts.append(transcript)
# Create voice with transcripts
voice = client.voices.create(
title="High Quality Voice",
voices=voices,
texts=texts,
description="Voice with accurate transcripts",
enhance_audio_quality=True
)
print(f"Created voice: {voice.id}")
```
```python Asynchronous focus={7-29} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Voice samples with transcripts
samples = [
("voice1.wav", "This is the first sample"),
("voice2.wav", "This is the second sample"),
("voice3.wav", "This is the third sample")
]
voices = []
texts = []
for audio_file, transcript in samples:
with open(audio_file, "rb") as f:
voices.append(f.read())
texts.append(transcript)
# Create voice with transcripts
voice = await client.voices.create(
title="High Quality Voice",
voices=voices,
texts=texts,
description="Voice with accurate transcripts",
enhance_audio_quality=True
)
print(f"Created voice: {voice.id}")
asyncio.run(main())
```
### Audio Quality Enhancement
Enable automatic audio enhancement to clean up noisy reference audio:
```python theme={null}
voice = client.voices.create(
title="Enhanced Voice",
voices=voice_samples,
enhance_audio_quality=True # Clean up background noise and normalize levels
)
```
Audio enhancement helps process noisy or lower-quality reference audio. If your audio is already clean and well-recorded, this may not provide additional benefit.
## Managing Voice Models
### List Voices
Discover available voices with filtering using [`voices.list()`](/api-reference/sdk/python/resources#list):
```python Synchronous focus={5-11} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# List all voices
voices = client.voices.list(page_size=20)
print(f"Total voices: {voices.total}")
for voice in voices.items:
print(f"{voice.title}: {voice.id}")
```
```python Asynchronous focus={7-13} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# List all voices
voices = await client.voices.list(page_size=20)
print(f"Total voices: {voices.total}")
for voice in voices.items:
print(f"{voice.title}: {voice.id}")
asyncio.run(main())
```
### Filter by Tags and Language
```python Synchronous focus={5-21} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Filter by tags
male_voices = client.voices.list(
tags=["male", "english"],
page_size=10
)
# Filter by language
chinese_voices = client.voices.list(
language="zh",
page_size=10
)
# Get only your own voices
my_voices = client.voices.list(
self_only=True,
page_size=20
)
```
```python Asynchronous focus={7-23} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Filter by tags
male_voices = await client.voices.list(
tags=["male", "english"],
page_size=10
)
# Filter by language
chinese_voices = await client.voices.list(
language="zh",
page_size=10
)
# Get only your own voices
my_voices = await client.voices.list(
self_only=True,
page_size=20
)
asyncio.run(main())
```
### Get Voice Details
Use [`voices.get()`](/api-reference/sdk/python/resources#get) to retrieve voice details:
```python Synchronous focus={5-11} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Get specific voice
voice = client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian
print(f"Title: {voice.title}")
print(f"Description: {voice.description}")
print(f"Tags: {voice.tags}")
print(f"Languages: {voice.languages}")
```
```python Asynchronous focus={7-13} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Get specific voice
voice = await client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian
print(f"Title: {voice.title}")
print(f"Description: {voice.description}")
print(f"Tags: {voice.tags}")
print(f"Languages: {voice.languages}")
asyncio.run(main())
```
### Update Voice Metadata
Update voice information using [`voices.update()`](/api-reference/sdk/python/resources#update):
```python Synchronous focus={5-11} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Update voice information
client.voices.update(
"bf322df2096a46f18c579d0baa36f41d", # Adrian
title="Updated Voice Name",
description="Updated description",
visibility="public", # "public", "unlist", or "private"
tags=["updated", "english", "male"]
)
```
```python Asynchronous focus={7-13} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Update voice information
await client.voices.update(
"bf322df2096a46f18c579d0baa36f41d", # Adrian
title="Updated Voice Name",
description="Updated description",
visibility="public", # "public", "unlist", or "private"
tags=["updated", "english", "male"]
)
asyncio.run(main())
```
### Delete Voice
Remove voice models using [`voices.delete()`](/api-reference/sdk/python/resources#delete):
```python Synchronous focus={5-7} theme={null}
from fishaudio import FishAudio
client = FishAudio()
# Delete a voice model
client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian
print("Voice deleted successfully")
```
```python Asynchronous focus={7-9} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
async def main():
client = AsyncFishAudio()
# Delete a voice model
await client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian
print("Voice deleted successfully")
asyncio.run(main())
```
Deleting a voice is permanent and cannot be undone. Make sure you have backups of any important voice models.
## Next Steps
* Use cloned voices for speech generation
* Stream audio with custom voices in real-time
* Complete voice management API documentation
* Production tips and optimization strategies
## Related Resources
* [Voice Types Reference](/api-reference/sdk/python/types#voices) - Voice model data structures
* [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Supported audio formats
* [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced voice customization
# WebSocket Streaming
Source: https://docs.fish.audio/developer-guide/sdk-guide/python/websocket
Stream text-to-speech in real-time with WebSocket connections
## Prerequisites
Sign up for a free Fish Audio account to get started with our API.
1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup)
2. Fill in your details to create an account, then complete the steps to verify your account.
3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys)
Once you have an account, you'll need an API key to authenticate your requests.
1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/)
2. Navigate to the API Keys section
3. Click "Create New Key", give it a descriptive name, and set an expiration if desired
4. Copy your key and store it securely
Keep your API key secret! Never commit it to version control or share it publicly.
## Overview
Use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) for real-time text streaming with LLMs and live captions. The connection automatically buffers incoming text and generates audio as it becomes available.
## Basic Usage
Stream text chunks and receive audio in real-time:
```python Synchronous focus={5-17} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Define text generator
def text_chunks():
yield "Hello, "
yield "this is "
yield "real-time "
yield "streaming!"
# Stream audio via WebSocket
audio_stream = client.tts.stream_websocket(
text_chunks(),
latency="balanced" # Use "balanced" for real-time, "normal" for quality
)
# Play streamed audio
play(audio_stream)
```
```python Asynchronous focus={8-20} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# Define async text generator
async def text_chunks():
yield "Hello, "
yield "this is "
yield "real-time "
yield "streaming!"
# Stream audio via WebSocket
audio_stream = await client.tts.stream_websocket(
text_chunks(),
latency="balanced" # Use "balanced" for real-time, "normal" for quality
)
# Play streamed audio
play(audio_stream)
asyncio.run(main())
```
For details on audio formats, voice selection, and advanced configuration options like `TTSConfig`, see the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech).
## Using FlushEvent
Force immediate audio generation to create pauses using [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects):
```python Synchronous focus={6-12} theme={null}
from fishaudio import FishAudio
from fishaudio.types import FlushEvent
client = FishAudio()
def text_with_flush():
yield "First sentence. "
yield "Second sentence. "
yield FlushEvent() # Forces generation NOW
yield "Third sentence."
audio_stream = client.tts.stream_websocket(text_with_flush())
```
```python Asynchronous focus={8-14} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.types import FlushEvent
async def main():
client = AsyncFishAudio()
async def text_with_flush():
yield "First sentence. "
yield "Second sentence. "
yield FlushEvent() # Forces generation NOW
yield "Third sentence."
audio_stream = await client.tts.stream_websocket(text_with_flush())
asyncio.run(main())
```
See [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech#understanding-flushevent) for detailed FlushEvent usage and advanced examples.
## LLM Integration
WebSocket streaming is designed for integrating with LLM streaming responses. The TTS engine automatically buffers incoming text chunks and generates audio when it has enough context for natural speech:
```python Synchronous focus={5-21} theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play
client = FishAudio()
# Simulate streaming LLM response
def llm_stream():
"""Simulates text chunks from an LLM."""
tokens = [
"The ", "weather ", "today ", "is ", "sunny ",
"with ", "clear ", "skies. ", "Perfect ",
"for ", "outdoor ", "activities!"
]
for token in tokens:
yield token
# Stream to speech in real-time
audio_stream = client.tts.stream_websocket(
llm_stream(),
latency="balanced"
)
play(audio_stream)
```
```python Asynchronous focus={7-23} theme={null}
import asyncio
from fishaudio import AsyncFishAudio
from fishaudio.utils import play
async def main():
client = AsyncFishAudio()
# Simulate streaming LLM response
async def llm_stream():
"""Simulates text chunks from an LLM."""
tokens = [
"The ", "weather ", "today ", "is ", "sunny ",
"with ", "clear ", "skies. ", "Perfect ",
"for ", "outdoor ", "activities!"
]
for token in tokens:
yield token
# Stream to speech in real-time
audio_stream = await client.tts.stream_websocket(
llm_stream(),
latency="balanced"
)
play(audio_stream)
asyncio.run(main())
```
The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`.
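As a mental model, this batching resembles holding text until a sentence boundary. The sketch below is purely illustrative: the `batch_at_sentences` helper is hypothetical, and the real buffering happens inside the service, not in your code.

```python theme={null}
def batch_at_sentences(chunks):
    """Accumulate streamed text chunks and emit them at sentence
    boundaries, roughly mimicking how the engine waits for context."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer
            buffer = ""
    if buffer:  # flush any trailing partial sentence
        yield buffer

tokens = ["The ", "weather ", "is ", "sunny. ", "Enjoy ", "it!"]
print(list(batch_at_sentences(tokens)))
# → ['The weather is sunny. ', 'Enjoy it!']
```

In practice you simply forward tokens as they arrive and let the connection decide when to generate, reaching for `FlushEvent` only when you need a boundary at an exact spot.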
## Next Steps
* Learn about non-streaming TTS options, audio formats, TextEvent vs plain strings, and advanced configuration
* Use custom voices in streams and learn about voice selection
* Complete streaming API documentation
* Production streaming optimization
## Related Resources
* [WebSocket Types](/api-reference/sdk/python/types#tts) - TextEvent, FlushEvent, and more
* [Utils Reference](/api-reference/sdk/python/utils) - Audio playback utilities
* [Error Handling](/api-reference/sdk/python/exceptions) - WebSocket exception handling
* [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced speech control
# Docker Deployment
Source: https://docs.fish.audio/developer-guide/self-hosting/docker-deployment
Deploy Fish Audio models using Docker containers
Fish Audio provides Docker images for both WebUI and API server deployments. You can use pre-built images from Docker Hub or build custom images locally.
## Prerequisites
Before deploying with Docker, ensure you have:
* **Docker** and **Docker Compose** installed
* **NVIDIA Docker runtime** (for GPU support)
* At least **12GB GPU memory** for CUDA inference
* Downloaded model weights (see [Running Inference](/developer-guide/self-hosting/running-inference#download-weights))
## Pre-built Images
Fish Audio provides ready-to-use Docker images on Docker Hub:
| Image | Description | Best For |
| ------------------------------------------ | ----------------------- | -------------------------------- |
| `fishaudio/fish-speech:latest-webui-cuda` | WebUI with CUDA support | Interactive development with GPU |
| `fishaudio/fish-speech:latest-webui-cpu` | WebUI CPU-only | Testing without GPU |
| `fishaudio/fish-speech:latest-server-cuda` | API server with CUDA | Production deployments with GPU |
| `fishaudio/fish-speech:latest-server-cpu` | API server CPU-only | Low-traffic CPU deployments |
For production use, we recommend using specific version tags instead of `latest` to ensure consistency across deployments.
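With Docker Compose, the pin can live in a small override file. The tag below is hypothetical, so substitute a published version from Docker Hub; the `webui` service name matches the Compose setup used later in this guide.

```yaml theme={null}
# docker-compose.override.yml - pin an explicit image tag
# (the tag shown is hypothetical; use a real published version)
services:
  webui:
    image: fishaudio/fish-speech:v1.5.0-webui-cuda
```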
## Quick Start with Docker Run
The fastest way to get started is using `docker run`:
### WebUI Deployment
```bash theme={null}
# Create directories for model weights and reference audio
mkdir -p checkpoints references
# Start WebUI with CUDA support (recommended)
docker run -d \
--name fish-speech-webui \
--gpus all \
-p 7860:7860 \
-v ./checkpoints:/app/checkpoints \
-v ./references:/app/references \
-e COMPILE=1 \
fishaudio/fish-speech:latest-webui-cuda
# For CPU-only deployment
docker run -d \
--name fish-speech-webui-cpu \
-p 7860:7860 \
-v ./checkpoints:/app/checkpoints \
-v ./references:/app/references \
fishaudio/fish-speech:latest-webui-cpu
```
Access the WebUI at `http://localhost:7860`
### API Server Deployment
```bash theme={null}
# Start API server with CUDA support
docker run -d \
--name fish-speech-server \
--gpus all \
-p 8080:8080 \
-v ./checkpoints:/app/checkpoints \
-v ./references:/app/references \
-e COMPILE=1 \
fishaudio/fish-speech:latest-server-cuda
# For CPU-only deployment
docker run -d \
--name fish-speech-server-cpu \
-p 8080:8080 \
-v ./checkpoints:/app/checkpoints \
-v ./references:/app/references \
fishaudio/fish-speech:latest-server-cpu
```
Access the API documentation at `http://localhost:8080`
Set the `COMPILE=1` environment variable for \~10x faster inference on CUDA deployments. This uses `torch.compile` to optimize the model.
## Docker Compose Deployment
For development or customization, Docker Compose provides easier configuration management:
### Setup
```bash theme={null}
# Clone the repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
```
### Start Services
```bash theme={null}
# Start WebUI with CUDA
docker compose --profile webui up
# Start WebUI with compile optimization
COMPILE=1 docker compose --profile webui up
# Start API server
docker compose --profile server up
# Start API server with compile optimization
COMPILE=1 docker compose --profile server up
# For CPU-only deployment
BACKEND=cpu docker compose --profile webui up
```
Run containers in detached mode by adding the `-d` flag: `docker compose --profile webui up -d`
### Environment Variables
Customize deployment using environment variables or a `.env` file:
```bash theme={null}
# .env file example
BACKEND=cuda # or cpu
COMPILE=1 # Enable compile optimization
GRADIO_PORT=7860 # WebUI port
API_PORT=8080 # API server port
UV_VERSION=0.8.15 # UV package manager version
```
## Manual Docker Build
For advanced users who need custom configurations:
### Build WebUI Image
```bash theme={null}
# Build with CUDA support
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--build-arg CUDA_VER=12.6.0 \
--build-arg UV_EXTRA=cu126 \
--target webui \
-t fish-speech-webui:cuda .
# Build CPU-only (supports multi-platform)
docker build \
--platform linux/amd64,linux/arm64 \
-f docker/Dockerfile \
--build-arg BACKEND=cpu \
--target webui \
-t fish-speech-webui:cpu .
```
### Build API Server Image
```bash theme={null}
# Build with CUDA support
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--build-arg CUDA_VER=12.6.0 \
--build-arg UV_EXTRA=cu126 \
--target server \
-t fish-speech-server:cuda .
```
### Build Development Image
```bash theme={null}
# Build development image with all tools
docker build \
--platform linux/amd64 \
-f docker/Dockerfile \
--build-arg BACKEND=cuda \
--target dev \
-t fish-speech-dev:cuda .
```
### Build Arguments
| Argument | Options | Default | Description |
| ------------ | ------------------------- | -------- | ------------------- |
| `BACKEND` | `cuda`, `cpu` | `cuda` | Compute backend |
| `CUDA_VER` | `12.6.0`, etc. | `12.6.0` | CUDA version |
| `UV_EXTRA` | `cu126`, `cu128`, `cu129` | `cu126` | UV extra for CUDA |
| `UBUNTU_VER` | `24.04`, etc. | `24.04` | Ubuntu base version |
| `PY_VER` | `3.12`, etc. | `3.12` | Python version |
## Volume Mounts
Both Docker run and Compose methods require these volume mounts:
| Host Path | Container Path | Purpose |
| --------------- | ------------------ | --------------------------------------- |
| `./checkpoints` | `/app/checkpoints` | Model weights directory |
| `./references` | `/app/references` | Reference audio files for voice cloning |
Ensure model weights are downloaded and placed in the `./checkpoints` directory before starting containers. See [Running Inference](/developer-guide/self-hosting/running-inference#download-weights) for download instructions.
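A quick pre-flight check can save a failed container start. This sketch tests for the decoder weights at the default `DECODER_CHECKPOINT_PATH` used by the images:

```shell theme={null}
# Verify the decoder weights exist before starting the container
if [ -e checkpoints/openaudio-s1-mini/codec.pth ]; then
  echo "weights found"
else
  echo "missing weights: run the download step first"
fi
```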
## Environment Variables Reference
### WebUI Configuration
| Variable | Default | Description |
| -------------------- | --------- | ---------------------------- |
| `GRADIO_SERVER_NAME` | `0.0.0.0` | WebUI server host |
| `GRADIO_SERVER_PORT` | `7860` | WebUI server port |
| `GRADIO_SHARE` | `false` | Enable Gradio public sharing |
### API Server Configuration
| Variable | Default | Description |
| ----------------- | --------- | --------------- |
| `API_SERVER_NAME` | `0.0.0.0` | API server host |
| `API_SERVER_PORT` | `8080` | API server port |
### Model Configuration
| Variable | Default | Description |
| ------------------------- | ----------------------------------------- | -------------------------- |
| `LLAMA_CHECKPOINT_PATH` | `checkpoints/openaudio-s1-mini` | Path to model weights |
| `DECODER_CHECKPOINT_PATH` | `checkpoints/openaudio-s1-mini/codec.pth` | Path to decoder weights |
| `DECODER_CONFIG_NAME` | `modded_dac_vq` | Decoder configuration name |
### Performance Optimization
| Variable | Default | Description |
| --------- | ------- | -------------------------------------------------- |
| `COMPILE` | `0` | Enable torch.compile for \~10x speedup (CUDA only) |
## Container Management
### View Logs
```bash theme={null}
# Docker run
docker logs fish-speech-webui
# Docker Compose
docker compose logs webui
```
### Stop Containers
```bash theme={null}
# Docker run
docker stop fish-speech-webui
# Docker Compose
docker compose down
```
### Update Images
```bash theme={null}
# Pull latest images
docker pull fishaudio/fish-speech:latest-webui-cuda
# Restart containers with new image
docker compose --profile webui up -d
```
## GPU Support
### Prerequisites
Install NVIDIA Container Toolkit:
```bash theme={null}
# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
### Verify GPU Access
```bash theme={null}
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
```
GPU support requires NVIDIA Docker runtime. For CPU-only deployment, remove the `--gpus all` flag and use CPU images.
## Troubleshooting
### Container Won't Start
Check logs for errors:
```bash theme={null}
docker logs fish-speech-webui
```
Common issues:
* Missing model weights in `./checkpoints`
* Port already in use (change port mapping)
* Insufficient GPU memory
### GPU Not Detected
Verify NVIDIA Docker runtime is installed:
```bash theme={null}
docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi
```
### Performance Issues
1. Enable compile optimization: `COMPILE=1`
2. Ensure GPU is being used (check with `nvidia-smi`)
3. Verify sufficient GPU memory is available
## Next Steps
* **[Run inference](/developer-guide/self-hosting/running-inference)** - Learn how to generate speech
* **[Download models](https://huggingface.co/fishaudio)** - Get pre-trained weights
* **[API documentation](/api-reference/introduction)** - Integrate with your applications
# Local Model Setup
Source: https://docs.fish.audio/developer-guide/self-hosting/local-setup
Install and configure Fish Audio models for local inference
This guide is for advanced users who want to self-host Fish Audio models. For most users, we recommend using the [Fish Audio API](https://fish.audio) for easier integration and automatic updates.
## Prerequisites
Before you begin, ensure you have:
* **GPU**: 12GB VRAM minimum (for inference)
* **OS**: Linux or WSL (Windows Subsystem for Linux)
* **System dependencies**: Audio processing libraries
Install required system packages:
```bash theme={null}
sudo apt install portaudio19-dev libsox-dev ffmpeg
```
## Installation Methods
Fish Audio supports multiple installation methods. Choose the one that best fits your development environment.
### Conda Installation
Conda provides a stable, isolated Python environment:
```bash theme={null}
# Create a new environment with Python 3.12
conda create -n fish-speech python=3.12
conda activate fish-speech
# GPU installation (choose your CUDA version: cu126, cu128, cu129)
pip install -e ".[cu129]"
# CPU-only installation (slower, not recommended for production)
pip install -e ".[cpu]"
# Default installation (uses PyTorch default index)
pip install -e .
```
For best performance, match your CUDA version with your GPU driver. Use `nvidia-smi` to check your CUDA version.
### UV Installation
[UV](https://github.com/astral-sh/uv) provides faster dependency resolution and installation:
```bash theme={null}
# GPU installation (choose your CUDA version: cu126, cu128, cu129)
uv sync --python 3.12 --extra cu129
# CPU-only installation
uv sync --python 3.12 --extra cpu
```
UV is recommended for faster setup times, especially when working with large dependency trees.
### Intel Arc XPU Support
For Intel Arc GPU users, install with XPU support:
```bash theme={null}
# Create environment
conda create -n fish-speech python=3.12
conda activate fish-speech
# Install required C++ standard library
conda install libstdcxx -c conda-forge
# Install PyTorch with Intel XPU support
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
# Install Fish Speech
pip install -e .
```
The `--compile` optimization flag is not supported on Windows and macOS. To use compile acceleration, you need to install Triton manually.
## Repository Setup
Clone the Fish Speech repository to get started:
```bash theme={null}
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech
```
Then follow one of the installation methods above.
## Next Steps
Once installation is complete, you can:
* **[Set up Docker deployment](/developer-guide/self-hosting/docker-deployment)** - Use containerized deployment for easier management
* **[Run inference](/developer-guide/self-hosting/running-inference)** - Start generating speech with your local models
* **Download models** - Get pre-trained weights from [Hugging Face](https://huggingface.co/fishaudio)
## Hardware Recommendations
For optimal performance:
| Use Case | Recommended GPU | VRAM | Expected Speed |
| ----------- | --------------- | ----- | ----------------------- |
| Development | RTX 3060 | 12GB | \~1:15 real-time factor |
| Production | RTX 4090 | 24GB | \~1:7 real-time factor |
| Enterprise | A100 | 40GB+ | \~1:5 real-time factor |
Real-time factor indicates how much faster than real-time the model can generate audio. For example, 1:7 means generating 1 minute of audio takes \~8.6 seconds (60 / 7).
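The arithmetic behind these figures is simple enough to sketch (the helper name is illustrative):

```python theme={null}
def generation_time_seconds(audio_seconds: float, rtf: float) -> float:
    """Seconds needed to generate `audio_seconds` of audio at a 1:rtf real-time factor."""
    return audio_seconds / rtf

# 1 minute of audio at a 1:7 real-time factor:
print(f"{generation_time_seconds(60, 7):.1f}s")  # 8.6s
```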
## Troubleshooting
### CUDA Out of Memory
If you encounter CUDA out of memory errors:
1. Reduce batch size in inference settings
2. Use `--half` flag for FP16 inference
3. Close other GPU-intensive applications
### Package Installation Errors
If you encounter dependency conflicts:
1. Try using UV instead of pip for better dependency resolution
2. Create a fresh conda environment
3. Ensure you're using Python 3.12 (other versions may have compatibility issues)
## Community Support
Need help with local setup?
* Join our [Discord community](https://discord.gg/dF9Db2Tt3Y) for community support
* Check [GitHub Issues](https://github.com/fishaudio/fish-speech/issues) for known problems
* Contact [enterprise support](mailto:support@fish.audio) for commercial deployments
# Running Inference
Source: https://docs.fish.audio/developer-guide/self-hosting/running-inference
Generate speech using self-hosted Fish Audio models
Fish Audio supports multiple inference methods: command line, HTTP API, WebUI, and GUI. Choose the method that best fits your workflow.
This guide assumes you have already [installed Fish Audio locally](/developer-guide/self-hosting/local-setup) or [set up Docker deployment](/developer-guide/self-hosting/docker-deployment).
## Download Weights
Before running inference, download the required model weights from Hugging Face:
```bash theme={null}
# Install Hugging Face CLI (if not already installed)
pip install "huggingface_hub[cli]"
# or
uv tool install "huggingface_hub[cli]"
# Download Fish Audio S1-mini weights
hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
```
**Fish Audio S1-mini** is the open-source distilled version (0.5B parameters) optimized for local deployment. The full **S1** model (4B parameters) is available exclusively on [Fish Audio cloud](https://fish.audio).
## Command Line Inference
Command line inference provides maximum control and is ideal for scripting and batch processing.
### Step 1: Extract VQ Tokens from Reference Audio
First, encode your reference audio to get voice characteristics:
```bash theme={null}
python fish_speech/models/dac/inference.py \
-i "reference_audio.wav" \
--checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
```
This generates two files:
* `fake.npy` - VQ tokens representing voice characteristics
* `fake.wav` - Reconstructed audio for verification
**Skip this step if you want random voice generation** - the model can generate speech without reference audio.
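To sanity-check the encoding before moving on, the saved token array can be inspected with NumPy (a sketch; the array layout itself is not documented here):

```python theme={null}
import numpy as np

def describe_tokens(path: str = "fake.npy") -> tuple:
    """Return the shape and dtype of a saved VQ token array."""
    tokens = np.load(path)
    return tokens.shape, tokens.dtype
```

Listening to `fake.wav` remains the quickest way to confirm the reference encoded cleanly.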
### Step 2: Generate Semantic Tokens from Text
Convert your text to semantic tokens using the language model:
```bash theme={null}
python fish_speech/models/text2semantic/inference.py \
--text "The text you want to convert to speech" \
--prompt-text "Transcription of your reference audio" \
--prompt-tokens "fake.npy" \
--compile
```
**Parameters:**
* `--text`: The text to synthesize
* `--prompt-text`: Transcription of the reference audio (for voice cloning)
* `--prompt-tokens`: Path to VQ tokens from Step 1 (for voice cloning)
* `--compile`: Enable kernel fusion for faster inference (\~10x speedup on RTX 4090)
For random voice generation, omit `--prompt-text` and `--prompt-tokens` parameters.
This creates a file named `codes_N.npy` (where N starts from 0) containing semantic tokens.
For GPUs that don't support bf16 (bfloat16), add the `--half` flag to use fp16 instead.
### Step 3: Generate Audio from Semantic Tokens
Finally, convert semantic tokens to audio:
```bash theme={null}
python fish_speech/models/dac/inference.py \
-i "codes_0.npy"
```
This generates the final audio file.
### Full Example
Here's a complete workflow for voice cloning:
```bash theme={null}
# 1. Encode reference audio
python fish_speech/models/dac/inference.py \
-i "my_voice.wav" \
--checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py \
--text "Hello, this is a test of voice cloning." \
--prompt-text "This is my reference voice recording." \
--prompt-tokens "fake.npy" \
--compile
# 3. Generate final audio
python fish_speech/models/dac/inference.py \
-i "codes_0.npy"
```
## HTTP API Inference
The HTTP API provides a programmatic interface for integrations and production deployments.
### Start API Server
```bash theme={null}
# With local installation
python -m tools.api_server \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
# With UV
uv run tools/api_server.py \
--listen 0.0.0.0:8080 \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
```
Add the `--compile` flag to enable torch.compile optimization for faster inference.
### Access API Documentation
Once the server is running, access the interactive API documentation at:
```
http://localhost:8080/docs
```
The API provides endpoints for:
* Text-to-speech synthesis
* Voice cloning with reference audio
* Batch processing
* Model information
### Example API Request
```bash theme={null}
curl -X POST "http://localhost:8080/v1/tts" \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, this is a test",
"reference_audio": "base64_encoded_audio",
"reference_text": "Reference transcription"
}'
```
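The same request can be assembled from Python (a sketch; the field names mirror the curl example above and may not cover the full request schema):

```python theme={null}
import base64
import json

def build_tts_payload(text: str, reference_audio: bytes, reference_text: str) -> str:
    """Build the JSON body for POST /v1/tts, mirroring the curl example's fields."""
    return json.dumps({
        "text": text,
        "reference_audio": base64.b64encode(reference_audio).decode("ascii"),
        "reference_text": reference_text,
    })

# POST the body to http://localhost:8080/v1/tts with Content-Type: application/json
body = build_tts_payload("Hello, this is a test", b"<raw wav bytes>", "Reference transcription")
```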
## WebUI Inference
The WebUI provides an intuitive interface for interactive testing and development.
### Start WebUI
```bash theme={null}
# With all parameters
python -m tools.run_webui \
--llama-checkpoint-path "checkpoints/openaudio-s1-mini" \
--decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \
--decoder-config-name modded_dac_vq
# Or use defaults (auto-detects models in checkpoints/)
python -m tools.run_webui
```
Add the `--compile` flag for faster inference during interactive sessions.
### Access WebUI
The WebUI starts on port 7860 by default. Access it at:
```
http://localhost:7860
```
### Configure with Environment Variables
Customize the WebUI using Gradio environment variables:
```bash theme={null}
# Enable public sharing
GRADIO_SHARE=1 python -m tools.run_webui
# Change server port
GRADIO_SERVER_PORT=8080 python -m tools.run_webui
# Change server name
GRADIO_SERVER_NAME=0.0.0.0 python -m tools.run_webui
```
### Using Reference Audio Library
For faster workflow, pre-save reference audio:
1. Create a `references/` directory in the project root
2. Create subdirectories named by voice ID: `references/<voice_id>/`
3. Place files in each subdirectory:
* `sample.wav` - Reference audio file
* `sample.lab` - Text transcription of the audio
Example structure:
```
references/
├── alice/
│ ├── sample.wav
│ └── sample.lab
└── bob/
├── sample.wav
└── sample.lab
```
These references will appear as selectable options in the WebUI.
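A quick standard-library check (hypothetical helper name) that every voice directory in the structure above is complete:

```python theme={null}
from pathlib import Path

def validate_references(root: str = "references") -> dict:
    """Map each voice directory to whether it holds both sample.wav and sample.lab."""
    base = Path(root)
    if not base.is_dir():
        return {}
    return {
        d.name: (d / "sample.wav").is_file() and (d / "sample.lab").is_file()
        for d in sorted(base.iterdir())
        if d.is_dir()
    }
```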
## GUI Inference
For users who prefer a native desktop application, a PyQt6-based GUI is available.
### Download GUI Client
Download the latest release from the [Fish Speech GUI repository](https://github.com/AnyaCoder/fish-speech-gui/releases).
**Supported platforms:**
* Linux
* Windows
* macOS
### Connect to API Server
The GUI client connects to a running API server (see [HTTP API Inference](#http-api-inference) above).
1. Start the API server
2. Launch the GUI client
3. Configure the API endpoint (default: `http://localhost:8080`)
## Docker Inference
If you're using Docker deployment, refer to the [Docker Deployment guide](/developer-guide/self-hosting/docker-deployment) for detailed instructions on:
* Running pre-built WebUI containers
* Running pre-built API server containers
* Customizing container configuration
* Volume mounts for models and references
Quick example:
```bash theme={null}
# Start WebUI with Docker
docker run -d \
--name fish-speech-webui \
--gpus all \
-p 7860:7860 \
-v ./checkpoints:/app/checkpoints \
-v ./references:/app/references \
-e COMPILE=1 \
fishaudio/fish-speech:latest-webui-cuda
```
## Performance Optimization
### Enable Compilation
Torch compilation provides \~10x speedup on compatible GPUs:
```bash theme={null}
# Add --compile flag to any inference command
python -m tools.api_server --compile ...
```
Compilation requires:
* A CUDA-compatible GPU
* The Triton library (not supported on Windows/macOS)
Note that the first run will be slow due to compilation overhead.
### Use Mixed Precision
For GPUs without bf16 support, use fp16:
```bash theme={null}
python fish_speech/models/text2semantic/inference.py --half ...
```
### Batch Processing
For multiple audio generations, use batch processing to amortize model loading overhead:
```python theme={null}
# Example batch processing script (illustrative API; adapt to your setup)
import fish_speech

# Load the model once and reuse it for every synthesis call
model = fish_speech.load_model("checkpoints/openaudio-s1-mini")
texts = ["First sentence", "Second sentence", "Third sentence"]
for i, text in enumerate(texts):
    audio = model.synthesize(text)
    audio.save(f"output_{i}.wav")
```
## Emotion Control
Fish Audio S1 supports emotional markers for expressive speech synthesis:
### Basic Emotions
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
### Advanced Emotions
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
### Tone Markers
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
### Special Effects
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
### Example Usage
```bash theme={null}
python fish_speech/models/text2semantic/inference.py \
--text "(excited)This is amazing! (laughing)Ha ha ha!" \
--compile
```
Emotion control is currently supported for English, Chinese, and Japanese. More languages coming soon!
For more details, see the [Emotion Reference](/api-reference/emotion-reference).
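When building prompts programmatically, tags are simply prepended to the text; a sketch (the helper and the tag subset shown are illustrative):

```python theme={null}
# Illustrative subset of the S1 tag list; see the Emotion Reference for the full set.
S1_TAGS = {"excited", "laughing", "sad", "angry", "whispering"}

def tagged(text: str, tag: str) -> str:
    """Prepend an S1 emotion/effect tag in (parenthesis) syntax."""
    if tag not in S1_TAGS:
        raise ValueError(f"unrecognized tag: {tag}")
    return f"({tag}){text}"

line = tagged("This is amazing! ", "excited") + tagged("Ha ha ha!", "laughing")
print(line)  # (excited)This is amazing! (laughing)Ha ha ha!
```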
## Troubleshooting
### Out of Memory Errors
If you encounter CUDA out of memory errors:
1. Reduce input text length
2. Use `--half` flag for fp16 inference
3. Close other GPU applications
4. Use a smaller batch size
### Slow Inference
To improve speed:
1. Enable `--compile` flag
2. Verify GPU is being used (check with `nvidia-smi`)
3. Ensure CUDA version matches PyTorch installation
4. Use fp16 instead of bf16 on older GPUs
### Poor Audio Quality
For better quality:
1. Use high-quality reference audio (clear, no background noise)
2. Ensure reference text accurately matches reference audio
3. Use 10-30 seconds of reference audio
4. See [Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)
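Point 3 is easy to verify with the standard library (hypothetical helper; plain WAV files only):

```python theme={null}
import wave

def reference_duration_seconds(path: str) -> float:
    """Duration of a WAV file, for checking the recommended 10-30 second range."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()
```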
### Model Loading Errors
If models fail to load:
1. Verify model weights are downloaded completely
2. Check checkpoint paths are correct
3. Ensure sufficient disk space
4. Re-download weights if corrupted
## Next Steps
* **[Emotion Control Best Practices](/developer-guide/best-practices/emotion-control)** - Master expressive speech
* **[Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)** - Optimize voice cloning quality
* **[API Reference](/api-reference/introduction)** - Integrate with your applications
* **[Cloud API](https://fish.audio)** - Compare with managed service performance
# Tutorials & Examples
Source: https://docs.fish.audio/developer-guide/tutorials/tutorials
Step-by-step guides and code examples for Fish Audio features
Coming soon! We're preparing comprehensive tutorials and examples to help you get the most out of Fish Audio.
We're working on tutorials for:
* Building your first TTS application
* Creating custom voice models
* Implementing real-time streaming
* Building interactive voice applications
* Advanced emotion and prosody control
* Multi-speaker conversations
In the meantime, check out:
* [Quickstart Guide](/developer-guide/getting-started/quickstart) for getting started
* [Python SDK Examples](/developer-guide/sdk-guide/python/text-to-speech) for code samples
* [JavaScript SDK Examples](/developer-guide/sdk-guide/javascript/text-to-speech) for code samples
* [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for optimization tips
Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates and community examples.