# Emotion Reference

Source: https://docs.fish.audio/api-reference/emotion-reference

Complete reference guide for all 64+ emotional expressions in Fish Audio

## Complete Emotion List

This reference guide provides a comprehensive list of all 64+ supported emotional expressions and voice styles available in Fish Audio's S1 TTS model.

The latest S2-Pro model supports free-form natural language emotion tags. The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details.

## Basic Emotions (24)

| Emotion | Tag | Description | Example Context |
| --- | --- | --- | --- |
| Happy | `(happy)` | Cheerful, upbeat tone | Good news, greetings |
| Sad | `(sad)` | Melancholic, downcast | Sympathy, bad news |
| Angry | `(angry)` | Frustrated, aggressive | Complaints, warnings |
| Excited | `(excited)` | Energetic, enthusiastic | Announcements, celebrations |
| Calm | `(calm)` | Peaceful, relaxed | Instructions, meditation |
| Nervous | `(nervous)` | Anxious, uncertain | Disclaimers, apologies |
| Confident | `(confident)` | Assertive, self-assured | Presentations, sales |
| Surprised | `(surprised)` | Shocked, amazed | Reactions, discoveries |
| Satisfied | `(satisfied)` | Content, pleased | Confirmations, reviews |
| Delighted | `(delighted)` | Very pleased, joyful | Celebrations, compliments |
| Scared | `(scared)` | Frightened, fearful | Warnings, horror stories |
| Worried | `(worried)` | Concerned, troubled | Concerns, questions |
| Upset | `(upset)` | Disturbed, distressed | Complaints, problems |
| Frustrated | `(frustrated)` | Annoyed, exasperated | Technical issues, delays |
| Depressed | `(depressed)` | Very sad, hopeless | Serious topics |
| Empathetic | `(empathetic)` | Understanding, caring | Support, counseling |
| Embarrassed | `(embarrassed)` | Ashamed, awkward | Apologies, mistakes |
| Disgusted | `(disgusted)` | Repelled, revolted | Negative reviews |
| Moved | `(moved)` | Emotionally touched | Heartfelt moments |
| Proud | `(proud)` | Accomplished, satisfied | Achievements, praise |
| Relaxed | `(relaxed)` | At ease, casual | Casual conversation |
| Grateful | `(grateful)` | Thankful, appreciative | Thanks, appreciation |
| Curious | `(curious)` | Inquisitive, interested | Questions, exploration |
| Sarcastic | `(sarcastic)` | Ironic, mocking | Humor, criticism |

## Advanced Emotions (25)

| Emotion | Tag | Description | Example Context |
| --- | --- | --- | --- |
| Disdainful | `(disdainful)` | Contemptuous, scornful | Criticism, rejection |
| Unhappy | `(unhappy)` | Discontent, dissatisfied | Complaints, feedback |
| Anxious | `(anxious)` | Very worried, uneasy | Urgent matters |
| Hysterical | `(hysterical)` | Uncontrollably emotional | Extreme reactions |
| Indifferent | `(indifferent)` | Uncaring, neutral | Neutral responses |
| Uncertain | `(uncertain)` | Doubtful, unsure | Speculation, questions |
| Doubtful | `(doubtful)` | Skeptical, questioning | Disbelief, questioning |
| Confused | `(confused)` | Puzzled, perplexed | Clarification requests |
| Disappointed | `(disappointed)` | Let down, dissatisfied | Unmet expectations |
| Regretful | `(regretful)` | Sorry, remorseful | Apologies, mistakes |
| Guilty | `(guilty)` | Culpable, responsible | Confessions, apologies |
| Ashamed | `(ashamed)` | Deeply embarrassed | Serious mistakes |
| Jealous | `(jealous)` | Envious, resentful | Comparisons |
| Envious | `(envious)` | Wanting what others have | Admiration with desire |
| Hopeful | `(hopeful)` | Optimistic about future | Future plans |
| Optimistic | `(optimistic)` | Positive outlook | Encouragement |
| Pessimistic | `(pessimistic)` | Negative outlook | Warnings, doubts |
| Nostalgic | `(nostalgic)` | Longing for the past | Memories, stories |
| Lonely | `(lonely)` | Isolated, alone | Emotional content |
| Bored | `(bored)` | Uninterested, weary | Disinterest |
| Contemptuous | `(contemptuous)` | Showing contempt | Strong criticism |
| Sympathetic | `(sympathetic)` | Showing sympathy | Condolences |
| Compassionate | `(compassionate)` | Showing deep care | Support, help |
| Determined | `(determined)` | Resolved, decided | Goals, commitments |
| Resigned | `(resigned)` | Accepting defeat | Giving up, acceptance |

## Tone Markers (5)

| Tone | Tag | Description | When to Use |
| --- | --- | --- | --- |
| Hurried | `(in a hurry tone)` | Rushed, urgent | Time-sensitive information |
| Shouting | `(shouting)` | Loud, calling out | Getting attention |
| Screaming | `(screaming)` | Very loud, panicked | Emergencies, fear |
| Whispering | `(whispering)` | Very soft, secretive | Secrets, quiet scenes |
| Soft | `(soft tone)` | Gentle, quiet | Comfort, lullabies |

## Audio Effects (10)

| Effect | Tag | Description | Suggested Text |
| --- | --- | --- | --- |
| Laughing | `(laughing)` | Full laughter | Ha, ha, ha |
| Chuckling | `(chuckling)` | Light laugh | Heh, heh |
| Sobbing | `(sobbing)` | Crying heavily | (optional) |
| Crying Loudly | `(crying loudly)` | Intense crying | (optional) |
| Sighing | `(sighing)` | Exhale of relief/frustration | sigh |
| Groaning | `(groaning)` | Sound of frustration | ugh |
| Panting | `(panting)` | Out of breath | huff, puff |
| Gasping | `(gasping)` | Sharp intake of breath | gasp |
| Yawning | `(yawning)` | Tired sound | yawn |
| Snoring | `(snoring)` | Sleep sound | zzz |

## Special Effects

| Effect | Tag | Description |
| --- | --- | --- |
| Audience Laughter | `(audience laughing)` | Crowd laughing sound |
| Background Laughter | `(background laughter)` | Ambient laughter |
| Crowd Laughter | `(crowd laughing)` | Large group laughing |
| Short Pause | `(break)` | Brief pause in speech |
| Long Pause | `(long-break)` | Extended pause in speech |

## Usage Examples

### Single Emotion

```
(happy) What a beautiful day!
(sad) I'm sorry for your loss.
(excited) We won the championship!
```

### Combined Effects

```
(sad)(whispering) I'll miss you so much.
(angry)(shouting) Get out of here now!
(excited)(laughing) We did it! Ha ha ha!
```

### Natural Expressions

```
That's hilarious! Ha ha ha! // Natural laughter
(sighing) Sigh... what a long day.
(panting) Huff... puff... almost there!
```

## Quick Selection Guide

### For Customer Service

* **Greetings**: `(friendly)`, `(cheerful)`, `(helpful)`
* **Understanding**: `(empathetic)`, `(concerned)`, `(sympathetic)`
* **Problem-solving**: `(confident)`, `(determined)`, `(professional)`
* **Apologies**: `(apologetic)`, `(regretful)`, `(sincere)`

### For Storytelling

* **Narration**: `(narrator)`, `(calm)`, `(mysterious)`
* **Character emotions**: Any from basic/advanced lists
* **Atmosphere**: `(whispering)`, `(dramatic)`, background effects
* **Action**: `(shouting)`, `(panting)`, `(struggling)`

### For Educational Content

* **Introduction**: `(enthusiastic)`, `(welcoming)`, `(friendly)`
* **Explanations**: `(calm)`, `(clear)`, `(patient)`
* **Questions**: `(curious)`, `(encouraging)`, `(thoughtful)`
* **Praise**: `(proud)`, `(delighted)`, `(impressed)`

### For Marketing

* **Excitement**: `(excited)`, `(enthusiastic)`, `(energetic)`
* **Trust**: `(confident)`, `(professional)`, `(sincere)`
* **Urgency**: `(urgent)`, `(in a hurry tone)`, `(important)`
* **Celebration**: `(celebrating)`, `(triumphant)`, `(joyful)`

## Emotion Categories

### Positive Emotions

`(happy)` `(excited)` `(delighted)` `(satisfied)` `(proud)` `(grateful)` `(confident)` `(relaxed)` `(hopeful)` `(optimistic)` `(moved)` `(compassionate)`

### Negative Emotions

`(sad)`
`(angry)` `(frustrated)` `(depressed)` `(upset)` `(worried)` `(scared)` `(nervous)` `(disappointed)` `(regretful)` `(guilty)` `(ashamed)` `(lonely)` `(bored)`

### Neutral/Complex Emotions

`(calm)` `(curious)` `(surprised)` `(confused)` `(uncertain)` `(doubtful)` `(indifferent)` `(nostalgic)` `(sarcastic)` `(determined)` `(resigned)`

### Social/Interpersonal Emotions

`(empathetic)` `(sympathetic)` `(embarrassed)` `(jealous)` `(envious)` `(disdainful)` `(contemptuous)` `(disgusted)`

## Model Support Matrix

| Model | Basic | Advanced | Tones | Effects | Intensity |
| --- | --- | --- | --- | --- | --- |
| Fish Speech 1.5 | ✓ | Limited | ✓ | 6/10 | No |
| Fish Audio S1 | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fish Audio S2-Pro | ✓ | ✓ | ✓ | ✓ | ✓ |

## Tips for Natural Speech

1. **Start Simple**: Begin with basic emotions before combining
2. **Test Variations**: Different voices handle emotions differently
3. **Context Matters**: Match emotions to content logically
4. **Less is More**: Avoid overusing emotions in short text
5. **Natural Flow**: Space out emotional changes
6. **Sound Effects**: Include appropriate text after audio tags
7. **Preview Often**: Test how emotions sound with your voice

## Common Mistakes to Avoid

* ❌ Placing emotion tags mid-sentence in English
* ❌ Forgetting parentheses around tags
* ❌ Using unsupported custom tags
* ❌ Mixing conflicting emotions
* ❌ Overusing effects in short text
* ❌ Missing text for sound effects
* ❌ Using wrong language placement rules

## See Also

* [Emotion Control Guide](/developer-guide/core-features/emotions) - Technical implementation
* [Text-to-Speech Best Practices](/developer-guide/core-features/text-to-speech)
* [API Reference](/api-reference/introduction)
* [Try it live](https://fish.audio) - Test emotions in the playground

# Create Model

Source: https://docs.fish.audio/api-reference/endpoint/model/create-model

post /model
Create a new voice model

Since this endpoint requires uploading files, it only accepts `multipart/form-data` and `application/msgpack`.

# Delete Model

Source: https://docs.fish.audio/api-reference/endpoint/model/delete-model

delete /model/{id}
Delete an existing model

# Get Model

Source: https://docs.fish.audio/api-reference/endpoint/model/get-model

get /model/{id}
Get details of a specific model

# List Models

Source: https://docs.fish.audio/api-reference/endpoint/model/list-models

get /model
Get a list of all models

# Update Model

Source: https://docs.fish.audio/api-reference/endpoint/model/update-model

patch /model/{id}
Update an existing model

# Speech to Text

Source: https://docs.fish.audio/api-reference/endpoint/openapi-v1/speech-to-text

post /v1/asr
Transcribe audio to text

This BETA endpoint only accepts `multipart/form-data` and `application/msgpack`.

# Text to Speech

Source: https://docs.fish.audio/api-reference/endpoint/openapi-v1/text-to-speech

post /v1/tts
Convert text to speech

This endpoint only accepts `application/json` and `application/msgpack`.

For best results, upload reference audio using the [create model](/api-reference/endpoint/model/create-model) endpoint before using this one.
This improves speech quality and reduces latency.

To upload audio clips directly, without pre-uploading, serialize the request body with MessagePack as per the [instructions](/developer-guide/core-features/text-to-speech#direct-api-usage).

Audio formats supported:

* WAV / PCM
  * Sample Rate: 8kHz, 16kHz, 24kHz, 32kHz, 44.1kHz
  * Default Sample Rate: 44.1kHz
  * 16-bit, mono
* MP3
  * Sample Rate: 32kHz, 44.1kHz
  * Default Sample Rate: 44.1kHz
  * mono
  * Bitrate: 64kbps, 128kbps (default), 192kbps
* Opus
  * Sample Rate: 48kHz
  * Default Sample Rate: 48kHz
  * mono
  * Bitrate: -1000 (auto), 24kbps, 32kbps (default), 48kbps, 64kbps

# Get API Credit

Source: https://docs.fish.audio/api-reference/endpoint/wallet/get-api-credit

get /wallet/{user_id}/api-credit
Get current API credit balance

# Get User Premium

Source: https://docs.fish.audio/api-reference/endpoint/wallet/get-user-package

get /wallet/{user_id}/package
Get current user premium information

# WebSocket TTS Streaming

Source: https://docs.fish.audio/api-reference/endpoint/websocket/tts-live

Real-time text-to-speech streaming via WebSocket

The WebSocket TTS endpoint enables bidirectional streaming for low-latency text-to-speech generation with MessagePack serialization.

The `request` payload inside `StartEvent` uses the same parameters as the HTTP [Text to Speech API](/api-reference/endpoint/openapi-v1/text-to-speech). For more detailed field guidance, model-specific behavior, and examples, see that page.

In WebSocket mode, `request.text` is typically empty in `StartEvent`, and the text content is sent through subsequent `TextEvent` messages.

# Introduction

Source: https://docs.fish.audio/api-reference/introduction

How to use the Fish Audio API

## Welcome

You can generate a new API key at [https://fish.audio/app/api-keys/](https://fish.audio/app/api-keys/).

## Quick Start

See our [Quick Start](/developer-guide/getting-started/quickstart) guide to generate audio in under 2 minutes.
## Create a Voice Clone

Use our [/model endpoint](/api-reference/endpoint/model/create-model) to create a voice clone model.

## Generate Speech

Use our [/v1/tts endpoint](/api-reference/endpoint/openapi-v1/text-to-speech) to generate speech.

## Real-time Streaming

Use our [Python SDK](/developer-guide/sdk-guide/python/websocket) or [JavaScript SDK](/developer-guide/sdk-guide/javascript/websocket) for real-time audio streaming with WebSocket.

## Rate Limits

You can find the rate limits for each endpoint in the [Rate Limits](/developer-guide/models-pricing/pricing-and-rate-limits) section.

# API Reference

Source: https://docs.fish.audio/api-reference/sdk/javascript/api-reference

Complete reference for Fish Audio JavaScript SDK

## Client

Import and initialize the client:

```typescript theme={null}
import { FishAudioClient } from "fish-audio";

const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });
```

## Text to Speech

### convert()

Generate speech from text.

```typescript theme={null}
const audio = await fishAudio.textToSpeech.convert({ text: "Hello" });
```

Parameters: `request` (TTSRequest), `model?` (Backends)

Returns: `Promise` of the audio data

### convertRealtime()

Realtime streaming TTS over WebSocket.

```typescript theme={null}
async function* textStream() {
  yield "Hello, ";
  yield "world!";
}

const conn = await fishAudio.textToSpeech.convertRealtime({ text: "" }, textStream());
```

Parameters: `request` (TTSRequest with `text: ""`), `textStream` (`AsyncIterable`), `backend?` (Backends)

Returns: `RealtimeConnection` (`EventEmitter`-like connection) emitting `RealtimeEvents`

## Speech to Text

### convert()

Transcribe audio to text.

```typescript theme={null}
const res = await fishAudio.speechToText.convert({ audio: myAudio });
console.log(res.text);
```

Parameters: `request` (STTRequest)

Returns: `STTResponse`

## Voices

### search()

List/search available voice models.

```typescript theme={null}
const results = await fishAudio.voices.search();
```

Parameters: `request?` (ModelListRequest)

Returns: `ModelListResponse`

### get()

Get model details.

```typescript theme={null}
const model = await fishAudio.voices.get("model_id");
```

Parameters: `voiceId` (string)

Returns: `ModelEntity`

### ivc.create()

Create a new voice model from audio samples.

```typescript theme={null}
const res = await fishAudio.voices.ivc.create({ title, voices: [file], cover_image: file });
```

Parameters: `request` (ModelCreateRequest)

Returns: `ModelEntity`

### update()

Update model metadata.

```typescript theme={null}
await fishAudio.voices.update("model_id", { title: "New Title" });
```

Parameters: `voiceId` (string), `request` (UpdateModelRequest)

Returns: `UpdateVoiceResponse`

### delete()

Delete a model.

```typescript theme={null}
await fishAudio.voices.delete("model_id");
```

Parameters: `voiceId` (string)

Returns: `DeleteVoiceResponse`

## User

### get\_api\_credit()

Check API credit balance.

```typescript theme={null}
await fishAudio.user.get_api_credit();
```

Returns: `APICreditResponse`

### get\_package()

Get subscription package details.

```typescript theme={null}
await fishAudio.user.get_package();
```

Returns: `PackageResponse`

## Request Classes

### TTSRequest

Text-to-speech parameters.

```typescript theme={null}
{
  text: "Hello",
  reference_id: "model_id",
  references: [{ audio: File, text: "sample" }],
  format: "mp3",
  prosody: { speed: 1.0, volume: 0 },
}
```

Fields: `text`, `reference_id`, `references`, `format`, `mp3_bitrate`, `opus_bitrate`, `sample_rate`, `prosody`, `latency`, `chunk_length`, `normalize`, `temperature`, `top_p`

### STTRequest

Speech-to-text parameters.

```typescript theme={null}
{ audio: File, language?: "en", ignore_timestamps?: boolean }
```

Fields: `audio`, `language?`, `ignore_timestamps?`

### ReferenceAudio

Reference audio for voice cloning.

```typescript theme={null}
{ audio: File, text: "spoken text" }
```

Fields: `audio`, `text`

### Prosody

Speed and volume control.

```typescript theme={null}
{ speed: 1.2, volume: 5 }
```

Fields: `speed` (0.5–2.0), `volume` (-20 to 20)

### Backends

The backend model to use.

```typescript theme={null}
Backends = 's1' | 's2-pro';
```

## Response Classes

### STTResponse

Transcription result.

```typescript theme={null}
response.text     // Complete transcription
response.duration // Duration in seconds
response.segments // ASRSegment[]
```

### ASRSegment

Timestamped text segment.

Fields: `text` (string), `start` (number, seconds), `end` (number, seconds)

### ModelEntity

Voice model information.

Fields: `_id`, `title`, `description`, `visibility`, `created_at`, `updated_at`, `tags`

### ModelListResponse

List response for voices.

Fields: `items` (`ModelEntity[]`), `total` (number)

### APICreditResponse

API credit information.
Fields: `_id` (string), `user_id` (string), `credit` (string), `created_at` (string), `updated_at` (string), `has_phone_sha256` (boolean), `has_free_credit?` (boolean)

### PackageResponse

Subscription package details.

Fields: `user_id` (string), `type` (string), `total` (number), `balance` (number), `created_at` (string), `updated_at` (string), `finished_at` (string)

## WebSocket Classes

### RealtimeEvents

Events emitted by `convertRealtime` connections.

| Event | Meaning |
| --- | --- |
| `OPEN` | Connection established |
| `AUDIO_CHUNK` | Audio chunk received |
| `ERROR` | Error occurred |
| `CLOSE` | Connection closed |

## Event Classes

### StartEvent

Stream start event.

Fields: `event` ("start"), `request` (TTSRequest)

### TextEvent

Text chunk event.

Fields: `event` ("text"), `text` (string)

### FlushEvent

Flush text chunks event.

Fields: `event` ("flush")

### CloseEvent

Stream close event.

Fields: `event` ("stop")

## Exceptions

### FishAudioError

Generic error with status code, body, rawResponse.

### FishAudioTimeoutError

Connection timeout error.

# Client

Source: https://docs.fish.audio/api-reference/sdk/python/client

# fishaudio.client

Main Fish Audio client classes.

## FishAudio Objects

```python theme={null}
class FishAudio()
```

Synchronous Fish Audio API client.

**Example**:

```python theme={null}
from fishaudio import FishAudio

client = FishAudio(api_key="your_api_key")

# Generate speech
audio = client.tts.convert(text="Hello world")
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)

# List voices
voices = client.voices.list(page_size=20)
print(f"Found {voices.total} voices")
```

#### \_\_init\_\_

```python theme={null}
def __init__(*,
             api_key: Optional[str] = None,
             base_url: str = "https://api.fish.audio",
             timeout: float = 240.0,
             httpx_client: Optional[httpx.Client] = None)
```

Initialize Fish Audio client.

**Arguments**:

* `api_key` - API key (can also use FISH\_API\_KEY env var)
* `base_url` - API base URL
* `timeout` - Request timeout in seconds
* `httpx_client` - Optional custom HTTP client

#### tts

```python theme={null}
@property
def tts() -> TTSClient
```

Access TTS (text-to-speech) operations.

#### asr

```python theme={null}
@property
def asr() -> ASRClient
```

Access ASR (speech-to-text) operations.

#### voices

```python theme={null}
@property
def voices() -> VoicesClient
```

Access voice management operations.

#### account

```python theme={null}
@property
def account() -> AccountClient
```

Access account/billing operations.

#### close

```python theme={null}
def close() -> None
```

Close the HTTP client.

## AsyncFishAudio Objects

```python theme={null}
class AsyncFishAudio()
```

Asynchronous Fish Audio API client.

**Example**:

```python theme={null}
import asyncio

import aiofiles

from fishaudio import AsyncFishAudio

async def main():
    client = AsyncFishAudio(api_key="your_api_key")

    # Generate speech
    audio = await client.tts.convert(text="Hello world")
    async with aiofiles.open("output.mp3", "wb") as f:
        async for chunk in audio:
            await f.write(chunk)

    # List voices
    voices = await client.voices.list(page_size=20)
    print(f"Found {voices.total} voices")

asyncio.run(main())
```

#### \_\_init\_\_

```python theme={null}
def __init__(*,
             api_key: Optional[str] = None,
             base_url: str = "https://api.fish.audio",
             timeout: float = 240.0,
             httpx_client: Optional[httpx.AsyncClient] = None)
```

Initialize async Fish Audio client.

**Arguments**:

* `api_key` - API key (can also use FISH\_API\_KEY env var)
* `base_url` - API base URL
* `timeout` - Request timeout in seconds
* `httpx_client` - Optional custom async HTTP client

#### tts

```python theme={null}
@property
def tts() -> AsyncTTSClient
```

Access TTS (text-to-speech) operations.

#### asr

```python theme={null}
@property
def asr() -> AsyncASRClient
```

Access ASR (speech-to-text) operations.
#### voices

```python theme={null}
@property
def voices() -> AsyncVoicesClient
```

Access voice management operations.

#### account

```python theme={null}
@property
def account() -> AsyncAccountClient
```

Access account/billing operations.

#### close

```python theme={null}
async def close() -> None
```

Close the HTTP client.

# Core

Source: https://docs.fish.audio/api-reference/sdk/python/core

# fishaudio.core.client\_wrapper

HTTP client wrapper for managing requests and authentication.

## BaseClientWrapper Objects

```python theme={null}
class BaseClientWrapper()
```

Base wrapper with shared logic for sync/async clients.

#### get\_headers

```python theme={null}
def get_headers(
        additional_headers: Optional[dict[str, str]] = None) -> dict[str, str]
```

Build headers including authentication and user agent.

## ClientWrapper Objects

```python theme={null}
class ClientWrapper(BaseClientWrapper)
```

Wrapper for httpx.Client that handles authentication and error handling.

#### request

```python theme={null}
def request(method: str,
            path: str,
            *,
            request_options: Optional[RequestOptions] = None,
            **kwargs: Any) -> httpx.Response
```

Make an HTTP request with error handling.

**Arguments**:

* `method` - HTTP method (GET, POST, etc.)
* `path` - API endpoint path
* `request_options` - Optional request-level overrides
* `**kwargs` - Additional arguments to pass to httpx.request

**Returns**:

httpx.Response object

**Raises**:

* `APIError` - On non-2xx responses

#### client

```python theme={null}
@property
def client() -> httpx.Client
```

Get underlying httpx.Client for advanced usage (e.g., WebSockets).

#### close

```python theme={null}
def close() -> None
```

Close the HTTP client.

## AsyncClientWrapper Objects

```python theme={null}
class AsyncClientWrapper(BaseClientWrapper)
```

Wrapper for httpx.AsyncClient that handles authentication and error handling.

#### request

```python theme={null}
async def request(method: str,
                  path: str,
                  *,
                  request_options: Optional[RequestOptions] = None,
                  **kwargs: Any) -> httpx.Response
```

Make an async HTTP request with error handling.

**Arguments**:

* `method` - HTTP method (GET, POST, etc.)
* `path` - API endpoint path
* `request_options` - Optional request-level overrides
* `**kwargs` - Additional arguments to pass to httpx.request

**Returns**:

httpx.Response object

**Raises**:

* `APIError` - On non-2xx responses

#### client

```python theme={null}
@property
def client() -> httpx.AsyncClient
```

Get underlying httpx.AsyncClient for advanced usage (e.g., WebSockets).

#### close

```python theme={null}
async def close() -> None
```

Close the HTTP client.

# fishaudio.core.request\_options

Request-level options for API calls.

## RequestOptions Objects

```python theme={null}
class RequestOptions()
```

Options that can be provided on a per-request basis to override client defaults.

**Attributes**:

* `timeout` - Override the client's default timeout (in seconds)
* `max_retries` - Override the client's default max retries
* `additional_headers` - Additional headers to include in the request
* `additional_query_params` - Additional query parameters to include

#### get\_timeout

```python theme={null}
def get_timeout() -> Optional[httpx.Timeout]
```

Convert timeout to httpx.Timeout if set.

# fishaudio.core.iterators

Audio stream wrappers with collection utilities.

## AudioStream Objects

```python theme={null}
class AudioStream()
```

Wrapper for sync audio byte streams with collection utilities.

This class wraps an iterator of audio bytes and provides a convenient `.collect()` method to gather all chunks into a single bytes object.
**Examples**:

```python theme={null}
from fishaudio import FishAudio

client = FishAudio(api_key="...")

# Collect all audio at once
audio = client.tts.stream(text="Hello!").collect()

# Or stream chunks manually
for chunk in client.tts.stream(text="Hello!"):
    process_chunk(chunk)
```

#### \_\_init\_\_

```python theme={null}
def __init__(iterator: Iterator[bytes])
```

Initialize the audio iterator wrapper.

**Arguments**:

* `iterator` - The underlying iterator of audio bytes

#### \_\_iter\_\_

```python theme={null}
def __iter__() -> Iterator[bytes]
```

Allow direct iteration over audio chunks.

#### collect

```python theme={null}
def collect() -> bytes
```

Collect all audio chunks into a single bytes object.

This consumes the iterator and returns all audio data as bytes. After calling this method, the iterator cannot be used again.

**Returns**:

Complete audio data as bytes

**Examples**:

```python theme={null}
audio = client.tts.stream(text="Hello!").collect()
with open("output.mp3", "wb") as f:
    f.write(audio)
```

## AsyncAudioStream Objects

```python theme={null}
class AsyncAudioStream()
```

Wrapper for async audio byte streams with collection utilities.

This class wraps an async iterator of audio bytes and provides a convenient `.collect()` method to gather all chunks into a single bytes object.

**Examples**:

```python theme={null}
from fishaudio import AsyncFishAudio

client = AsyncFishAudio(api_key="...")

# Collect all audio at once
stream = await client.tts.stream(text="Hello!")
audio = await stream.collect()

# Or stream chunks manually
async for chunk in await client.tts.stream(text="Hello!"):
    await process_chunk(chunk)
```

#### \_\_init\_\_

```python theme={null}
def __init__(async_iterator: AsyncIterator[bytes])
```

Initialize the async audio iterator wrapper.

**Arguments**:

* `async_iterator` - The underlying async iterator of audio bytes

#### \_\_aiter\_\_

```python theme={null}
def __aiter__() -> AsyncIterator[bytes]
```

Allow direct async iteration over audio chunks.

#### collect

```python theme={null}
async def collect() -> bytes
```

Collect all audio chunks into a single bytes object.

This consumes the async iterator and returns all audio data as bytes. After calling this method, the iterator cannot be used again.

**Returns**:

Complete audio data as bytes

**Examples**:

```python theme={null}
stream = await client.tts.stream(text="Hello!")
audio = await stream.collect()
with open("output.mp3", "wb") as f:
    f.write(audio)
```

# fishaudio.core.websocket\_options

WebSocket-level options for WebSocket connections.

## WebSocketOptions Objects

```python theme={null}
class WebSocketOptions()
```

Options for configuring WebSocket connections.

These options are passed directly to httpx\_ws's connect\_ws/aconnect\_ws functions. For complete documentation, see [https://frankie567.github.io/httpx-ws/reference/httpx\_ws/](https://frankie567.github.io/httpx-ws/reference/httpx_ws/)

**Attributes**:

* `keepalive_ping_timeout_seconds` - Maximum delay the client will wait for an answer to its Ping event. If the delay is exceeded, WebSocketNetworkError will be raised and the connection closed. Default: 20 seconds.
* `keepalive_ping_interval_seconds` - Interval at which the client will automatically send a Ping event to keep the connection alive. Set to None to disable this mechanism. Default: 20 seconds.
* `max_message_size_bytes` - Message size in bytes to receive from the server. Default: 65536 bytes (64 KiB).
* `queue_size` - Size of the queue where received messages will be held until they are consumed. If the queue is full, the client will stop receiving messages from the server until the queue has room available. Default: 512.

**Notes**:

Parameter descriptions adapted from httpx\_ws documentation.
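To make the attribute-to-kwargs mapping concrete, here is an illustrative, self-contained mirror of `WebSocketOptions` (attribute names and defaults copied from the list above; this is a sketch of the documented behavior, not the SDK's actual implementation):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class WebSocketOptionsSketch:
    """Illustrative stand-in for fishaudio's WebSocketOptions."""

    keepalive_ping_timeout_seconds: Optional[float] = 20.0
    keepalive_ping_interval_seconds: Optional[float] = 20.0
    max_message_size_bytes: int = 65536  # 64 KiB
    queue_size: int = 512

    def to_httpx_ws_kwargs(self) -> dict[str, Any]:
        # httpx_ws's connect_ws/aconnect_ws accept these keyword names directly.
        return {
            "keepalive_ping_timeout_seconds": self.keepalive_ping_timeout_seconds,
            "keepalive_ping_interval_seconds": self.keepalive_ping_interval_seconds,
            "max_message_size_bytes": self.max_message_size_bytes,
            "queue_size": self.queue_size,
        }
```

A larger `queue_size` lets the client buffer more incoming messages before back-pressure pauses receipt from the server; setting `keepalive_ping_interval_seconds` to `None` disables keepalive pings entirely.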
#### to\_httpx\_ws\_kwargs

```python theme={null}
def to_httpx_ws_kwargs() -> dict[str, Any]
```

Convert to kwargs dict for httpx\_ws aconnect\_ws/connect\_ws.

# fishaudio.core.omit

OMIT sentinel for distinguishing None from not-provided parameters.

# Exceptions

Source: https://docs.fish.audio/api-reference/sdk/python/exceptions

# fishaudio.exceptions

Custom exceptions for the Fish Audio SDK.

## FishAudioError Objects

```python theme={null}
class FishAudioError(Exception)
```

Base exception for all Fish Audio SDK errors.

## APIError Objects

```python theme={null}
class APIError(FishAudioError)
```

Raised when the API returns an error response.

## AuthenticationError Objects

```python theme={null}
class AuthenticationError(APIError)
```

Raised when authentication fails (401).

## PermissionError Objects

```python theme={null}
class PermissionError(APIError)
```

Raised when permission is denied (403).

## NotFoundError Objects

```python theme={null}
class NotFoundError(APIError)
```

Raised when a resource is not found (404).

## RateLimitError Objects

```python theme={null}
class RateLimitError(APIError)
```

Raised when rate limit is exceeded (429).

## ServerError Objects

```python theme={null}
class ServerError(APIError)
```

Raised when the server encounters an error (5xx).

## WebSocketError Objects

```python theme={null}
class WebSocketError(FishAudioError)
```

Raised when WebSocket connection or streaming fails.

## ValidationError Objects

```python theme={null}
class ValidationError(FishAudioError)
```

Raised when request validation fails.

## DependencyError Objects

```python theme={null}
class DependencyError(FishAudioError)
```

Raised when a required dependency is missing.
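The hierarchy above maps cleanly onto HTTP status codes. A minimal, self-contained sketch of that dispatch (class names and status codes come from this page; the `error_for_status` helper itself is an illustration, not the SDK's internal code):

```python
class FishAudioError(Exception):
    """Base exception for all Fish Audio SDK errors."""


class APIError(FishAudioError):
    """The API returned an error response."""


class AuthenticationError(APIError):
    """Authentication failed (401)."""


class PermissionError(APIError):  # shadows the builtin, matching the documented name
    """Permission denied (403)."""


class NotFoundError(APIError):
    """Resource not found (404)."""


class RateLimitError(APIError):
    """Rate limit exceeded (429)."""


class ServerError(APIError):
    """Server-side error (5xx)."""


def error_for_status(status: int) -> type[APIError]:
    """Pick the documented exception class for a non-2xx status code."""
    specific = {
        401: AuthenticationError,
        403: PermissionError,
        404: NotFoundError,
        429: RateLimitError,
    }
    if status in specific:
        return specific[status]
    if 500 <= status < 600:
        return ServerError
    return APIError
```

Because everything derives from `FishAudioError`, callers can catch the base class for blanket handling, or catch `RateLimitError` specifically to back off and retry.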
# Overview

Source: https://docs.fish.audio/api-reference/sdk/python/overview

Fish Audio Python SDK for text-to-speech and voice cloning

![python.png](https://raw.githubusercontent.com/fishaudio/fish-audio-python/refs/heads/main/.github/assets/python.png)

# Fish Audio Python SDK

[![PyPI version](https://img.shields.io/pypi/v/fish-audio-sdk.svg)](https://badge.fury.io/py/fish-audio-sdk)
[![Python Version](https://img.shields.io/badge/python-3.9+-blue)](https://pypi.org/project/fish-audio-sdk/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/fish-audio-sdk)](https://pypi.org/project/fish-audio-sdk/)
[![codecov](https://img.shields.io/codecov/c/github/fishaudio/fish-audio-python)](https://codecov.io/gh/fishaudio/fish-audio-python)
[![License](https://img.shields.io/github/license/fishaudio/fish-audio-python)](https://github.com/fishaudio/fish-audio-python/blob/main/LICENSE)

The official Python library for the Fish Audio API

**Documentation:** [Python SDK Guide](https://docs.fish.audio/developer-guide/sdk-guide/python/) | [API Reference](https://docs.fish.audio/api-reference/sdk/python/)

> \[!IMPORTANT]
>
> ## Changes to PyPI Versioning
>
> For existing users of the Fish Audio Python SDK, please note that the starting version is now `1.0.0`. The last version before this was `2025.6.3`. You may need to adjust your version constraints accordingly.
>
> The original API in the `fish_audio_sdk` package has NOT been removed, but you will not receive any updates if you continue using the old versioning scheme.
>
> The simplest fix is to update your dependency to `fish-audio-sdk>=1.0.0` to continue receiving updates, or to pin a specific version like `fish-audio-sdk==1.0.0` when installing via your package manager. There are no changes to the API itself in this transition.
>
> If you're using the legacy `fish_audio_sdk` and would like to switch to the newer, more robust `fishaudio` package, see the [migration guide](https://docs.fish.audio/archive/python-sdk-legacy/migration-guide) to upgrade.

## Installation

```bash theme={null}
pip install fish-audio-sdk

# With audio playback utilities
pip install fish-audio-sdk[utils]
```

## Authentication

Get your API key from [fish.audio/app/api-keys](https://fish.audio/app/api-keys):

```bash theme={null}
export FISH_API_KEY=your_api_key_here
```

Or provide it directly:

```python theme={null}
from fishaudio import FishAudio

client = FishAudio(api_key="your_api_key")
```

## Quick Start

**Synchronous:**

```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import play, save

client = FishAudio()

# Generate audio
audio = client.tts.convert(text="Hello, world!")

# Play or save
play(audio)
save(audio, "output.mp3")
```

**Asynchronous:**

```python theme={null}
import asyncio

from fishaudio import AsyncFishAudio
from fishaudio.utils import play, save

async def main():
    client = AsyncFishAudio()
    audio = await client.tts.convert(text="Hello, world!")
    play(audio)
    save(audio, "output.mp3")

asyncio.run(main())
```

## Core Features

### Text-to-Speech

**With custom voice:**

```python theme={null}
# Use a specific voice by ID
audio = client.tts.convert(
    text="Custom voice",
    reference_id="802e3bc2b27e49c2995d23ef70e6ac89"
)
```

**With speed control:**

```python theme={null}
audio = client.tts.convert(
    text="Speaking faster!",
    speed=1.5  # 1.5x speed
)
```

**Reusable configuration:**

```python theme={null}
from fishaudio.types import TTSConfig, Prosody

config = TTSConfig(
    prosody=Prosody(speed=1.2, volume=-5),
    reference_id="933563129e564b19a115bedd57b7406a",
    format="wav",
    latency="balanced"
)

# Reuse across generations
audio1 = client.tts.convert(text="First message", config=config)
audio2 = client.tts.convert(text="Second message", config=config)
```

**Chunk-by-chunk processing:**

```python theme={null}
# Stream and process chunks as they arrive
for chunk in client.tts.stream(text="Long content..."):
    send_to_websocket(chunk)

# Or collect all chunks
audio = client.tts.stream(text="Hello!").collect()
```

[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/text-to-speech)

### Speech-to-Text

```python theme={null}
# Transcribe audio
with open("audio.wav", "rb") as f:
    result = client.asr.transcribe(audio=f.read(), language="en")

print(result.text)

# Access timestamped segments
for segment in result.segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
```

[Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/speech-to-text)

### Real-time Streaming

Stream dynamically generated text for conversational AI and live applications:

**Synchronous:**

```python theme={null}
def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming!"

audio_stream = client.tts.stream_websocket(text_chunks(), latency="balanced")
play(audio_stream)
```

**Asynchronous:**

```python theme={null}
async def text_chunks():
    yield "Hello, "
    yield "this is "
    yield "streaming!"
audio_stream = await client.tts.stream_websocket(text_chunks(), latency="balanced") play(audio_stream) ``` [Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket) ### Voice Cloning **Instant cloning:** ```python theme={null} from fishaudio.types import ReferenceAudio # Clone voice on-the-fly with open("reference.wav", "rb") as f: audio = client.tts.convert( text="Cloned voice speaking", references=[ReferenceAudio( audio=f.read(), text="Text spoken in reference" )] ) ``` **Persistent voice models:** ```python theme={null} # Create voice model for reuse with open("voice_sample.wav", "rb") as f: voice = client.voices.create( title="My Voice", voices=[f.read()], description="Custom voice clone" ) # Use the created model audio = client.tts.convert( text="Using my saved voice", reference_id=voice.id ) ``` [Learn more](https://docs.fish.audio/developer-guide/sdk-guide/python/voice-cloning) ## Resource Clients | Resource | Description | Key Methods | | ---------------- | ------------------ | ----------------------------------------------------- | | `client.tts` | Text-to-speech | `convert()`, `stream()`, `stream_websocket()` | | `client.asr` | Speech recognition | `transcribe()` | | `client.voices` | Voice management | `list()`, `get()`, `create()`, `update()`, `delete()` | | `client.account` | Account info | `get_credits()`, `get_package()` | ## Error Handling ```python theme={null} from fishaudio.exceptions import ( AuthenticationError, RateLimitError, ValidationError, FishAudioError ) try: audio = client.tts.convert(text="Hello!") except AuthenticationError: print("Invalid API key") except RateLimitError: print("Rate limit exceeded") except ValidationError as e: print(f"Invalid request: {e}") except FishAudioError as e: print(f"API error: {e}") ``` ## Resources * **Documentation:** [SDK Guide](https://docs.fish.audio/developer-guide/sdk-guide/python/) | [API Reference](https://docs.fish.audio/api-reference/sdk/python/) * **Package:** 
[PyPI](https://pypi.org/project/fish-audio-sdk/) | [GitHub](https://github.com/fishaudio/fish-audio-python) * **Legacy SDK:** [Documentation](https://docs.fish.audio/archive/python-sdk-legacy) | [Migration Guide](https://docs.fish.audio/archive/python-sdk-legacy/migration-guide) ## License This project is licensed under the Apache-2.0 License - see the [LICENSE](LICENSE) file for details. # Resources Source: https://docs.fish.audio/api-reference/sdk/python/resources # fishaudio.resources.voices Voice management namespace client. ## VoicesClient Objects ```python theme={null} class VoicesClient() ``` Synchronous voice management operations. #### list ```python theme={null} def list( *, page_size: int = 10, page_number: int = 1, title: Optional[str] = OMIT, tags: Optional[Union[list[str], str]] = OMIT, self_only: bool = False, author_id: Optional[str] = OMIT, language: Optional[Union[list[str], str]] = OMIT, title_language: Optional[Union[list[str], str]] = OMIT, sort_by: str = "task_count", request_options: Optional[RequestOptions] = None ) -> PaginatedResponse[Voice] ``` List available voices/models. 
**Arguments**: * `page_size` - Number of results per page * `page_number` - Page number (1-indexed) * `title` - Filter by title * `tags` - Filter by tags (single tag or list) * `self_only` - Only return user's own voices * `author_id` - Filter by author ID * `language` - Filter by language(s) * `title_language` - Filter by title language(s) * `sort_by` - Sort field ("task\_count" or "created\_at") * `request_options` - Request-level overrides **Returns**: Paginated response with total count and voice items **Example**: ```python theme={null} client = FishAudio(api_key="...") # List all voices voices = client.voices.list(page_size=20) print(f"Total: {voices.total}") for voice in voices.items: print(f"{voice.title}: {voice.id}") # Filter by tags tagged = client.voices.list(tags=["male", "english"]) ``` #### get ```python theme={null} def get(voice_id: str, *, request_options: Optional[RequestOptions] = None) -> Voice ``` Get voice by ID. **Arguments**: * `voice_id` - Voice model ID * `request_options` - Request-level overrides **Returns**: Voice model details **Example**: ```python theme={null} client = FishAudio(api_key="...") voice = client.voices.get("voice_id_here") print(voice.title, voice.description) ``` #### create ```python theme={null} def create(*, title: str, voices: builtins.list[bytes], description: Optional[str] = OMIT, texts: Optional[builtins.list[str]] = OMIT, tags: Optional[builtins.list[str]] = OMIT, cover_image: Optional[bytes] = OMIT, visibility: Visibility = "private", train_mode: str = "fast", enhance_audio_quality: bool = True, request_options: Optional[RequestOptions] = None) -> Voice ``` Create/clone a new voice. 
**Arguments**: * `title` - Voice model name * `voices` - List of audio file bytes for training * `description` - Voice description * `texts` - Transcripts for voice samples * `tags` - Tags for categorization * `cover_image` - Cover image bytes * `visibility` - Visibility setting (public, unlist, private) * `train_mode` - Training mode (currently only "fast" supported) * `enhance_audio_quality` - Whether to enhance audio quality * `request_options` - Request-level overrides **Returns**: Created voice model **Example**: ```python theme={null} client = FishAudio(api_key="...") with open("voice1.wav", "rb") as f1, open("voice2.wav", "rb") as f2: voice = client.voices.create( title="My Voice", voices=[f1.read(), f2.read()], description="Custom voice clone", tags=["custom", "english"] ) print(f"Created: {voice.id}") ``` #### update ```python theme={null} def update(voice_id: str, *, title: Optional[str] = OMIT, description: Optional[str] = OMIT, cover_image: Optional[bytes] = OMIT, visibility: Optional[Visibility] = OMIT, tags: Optional[builtins.list[str]] = OMIT, request_options: Optional[RequestOptions] = None) -> None ``` Update voice metadata. **Arguments**: * `voice_id` - Voice model ID * `title` - New title * `description` - New description * `cover_image` - New cover image bytes * `visibility` - New visibility setting * `tags` - New tags * `request_options` - Request-level overrides **Example**: ```python theme={null} client = FishAudio(api_key="...") client.voices.update( "voice_id_here", title="Updated Title", visibility="public" ) ``` #### delete ```python theme={null} def delete(voice_id: str, *, request_options: Optional[RequestOptions] = None) -> None ``` Delete a voice. 
**Arguments**: * `voice_id` - Voice model ID * `request_options` - Request-level overrides **Example**: ```python theme={null} client = FishAudio(api_key="...") client.voices.delete("voice_id_here") ``` ## AsyncVoicesClient Objects ```python theme={null} class AsyncVoicesClient() ``` Asynchronous voice management operations. #### list ```python theme={null} async def list( *, page_size: int = 10, page_number: int = 1, title: Optional[str] = OMIT, tags: Optional[Union[list[str], str]] = OMIT, self_only: bool = False, author_id: Optional[str] = OMIT, language: Optional[Union[list[str], str]] = OMIT, title_language: Optional[Union[list[str], str]] = OMIT, sort_by: str = "task_count", request_options: Optional[RequestOptions] = None ) -> PaginatedResponse[Voice] ``` List available voices/models (async). See sync version for details. #### get ```python theme={null} async def get(voice_id: str, *, request_options: Optional[RequestOptions] = None) -> Voice ``` Get voice by ID (async). See sync version for details. #### create ```python theme={null} async def create(*, title: str, voices: builtins.list[bytes], description: Optional[str] = OMIT, texts: Optional[builtins.list[str]] = OMIT, tags: Optional[builtins.list[str]] = OMIT, cover_image: Optional[bytes] = OMIT, visibility: Visibility = "private", train_mode: str = "fast", enhance_audio_quality: bool = True, request_options: Optional[RequestOptions] = None) -> Voice ``` Create/clone a new voice (async). See sync version for details. #### update ```python theme={null} async def update(voice_id: str, *, title: Optional[str] = OMIT, description: Optional[str] = OMIT, cover_image: Optional[bytes] = OMIT, visibility: Optional[Visibility] = OMIT, tags: Optional[builtins.list[str]] = OMIT, request_options: Optional[RequestOptions] = None) -> None ``` Update voice metadata (async). See sync version for details. 
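The `Optional[...] = OMIT` defaults in the update signatures above rely on the `OMIT` sentinel from `fishaudio.core.omit`, which lets the client distinguish "leave this field untouched" from an explicit `None`. A minimal, SDK-independent sketch of the pattern (the `build_update_payload` helper and its fields are hypothetical, for illustration only):

```python
from typing import Any, Optional

class _Omit:
    """Sentinel distinguishing 'not provided' from an explicit None."""
    def __repr__(self) -> str:
        return "OMIT"

OMIT: Any = _Omit()

def build_update_payload(*,
                         title: Optional[str] = OMIT,
                         description: Optional[str] = OMIT) -> dict:
    """Include only the fields the caller actually passed.

    With a plain `None` default, passing `description=None` (clear the
    description) would be indistinguishable from not passing it at all.
    """
    payload = {}
    if title is not OMIT:
        payload["title"] = title
    if description is not OMIT:
        payload["description"] = description
    return payload
```

This is why partial updates are safe: `update(voice_id, title="New")` touches only the title, and omitted fields never appear in the request.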
#### delete ```python theme={null} async def delete(voice_id: str, *, request_options: Optional[RequestOptions] = None) -> None ``` Delete a voice (async). See sync version for details. # fishaudio.resources.account Account namespace client for billing and credits. ## AccountClient Objects ```python theme={null} class AccountClient() ``` Synchronous account operations. #### get\_credits ```python theme={null} def get_credits(*, check_free_credit: Optional[bool] = OMIT, request_options: Optional[RequestOptions] = None) -> Credits ``` Get API credit balance. **Arguments**: * `check_free_credit` - Whether to check free credit availability * `request_options` - Request-level overrides **Returns**: Credits information **Example**: ```python theme={null} client = FishAudio(api_key="...") credits = client.account.get_credits() print(f"Available credits: {float(credits.credit)}") # Check free credit availability credits = client.account.get_credits(check_free_credit=True) if credits.has_free_credit: print("Free credits available!") ``` #### get\_package ```python theme={null} def get_package(*, request_options: Optional[RequestOptions] = None) -> Package ``` Get package information. **Arguments**: * `request_options` - Request-level overrides **Returns**: Package information **Example**: ```python theme={null} client = FishAudio(api_key="...") package = client.account.get_package() print(f"Balance: {package.balance}/{package.total}") ``` ## AsyncAccountClient Objects ```python theme={null} class AsyncAccountClient() ``` Asynchronous account operations. #### get\_credits ```python theme={null} async def get_credits( *, check_free_credit: Optional[bool] = OMIT, request_options: Optional[RequestOptions] = None) -> Credits ``` Get API credit balance (async). 
**Arguments**: * `check_free_credit` - Whether to check free credit availability * `request_options` - Request-level overrides **Returns**: Credits information **Example**: ```python theme={null} client = AsyncFishAudio(api_key="...") credits = await client.account.get_credits() print(f"Available credits: {float(credits.credit)}") # Check free credit availability credits = await client.account.get_credits(check_free_credit=True) if credits.has_free_credit: print("Free credits available!") ``` #### get\_package ```python theme={null} async def get_package(*, request_options: Optional[RequestOptions] = None ) -> Package ``` Get package information (async). **Arguments**: * `request_options` - Request-level overrides **Returns**: Package information **Example**: ```python theme={null} client = AsyncFishAudio(api_key="...") package = await client.account.get_package() print(f"Balance: {package.balance}/{package.total}") ``` # fishaudio.resources.tts TTS (Text-to-Speech) namespace client. ## TTSClient Objects ```python theme={null} class TTSClient() ``` Synchronous TTS operations. #### stream ```python theme={null} def stream(*, text: str, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", request_options: Optional[RequestOptions] = None) -> AudioStream ``` Stream text-to-speech audio chunks. **Arguments**: * `text` - Text to synthesize * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 
1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `request_options` - Request-level overrides **Returns**: AudioStream object that can be iterated for audio chunks **Example**: ```python theme={null} from fishaudio import FishAudio client = FishAudio(api_key="...") # Stream and process chunks for chunk in client.tts.stream(text="Hello world"): process_audio_chunk(chunk) # Or collect all at once audio = client.tts.stream(text="Hello world").collect() ``` #### convert ```python theme={null} def convert(*, text: str, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", request_options: Optional[RequestOptions] = None) -> bytes ``` Convert text to speech and return complete audio as bytes. This is a convenience method that streams all audio chunks and combines them. For chunk-by-chunk processing, use stream() instead. **Arguments**: * `text` - Text to synthesize * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 
1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `request_options` - Request-level overrides **Returns**: Complete audio as bytes **Example**: ```python theme={null} from fishaudio import FishAudio from fishaudio.utils import play, save client = FishAudio(api_key="...") # Get complete audio audio = client.tts.convert(text="Hello world") # Play it play(audio) # Or save it save(audio, "output.mp3") ``` #### stream\_websocket ```python theme={null} def stream_websocket( text_stream: Iterable[Union[str, TextEvent, FlushEvent]], *, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", max_workers: int = 10, ws_options: Optional[WebSocketOptions] = None) -> Iterator[bytes] ``` Stream text and receive audio in real-time via WebSocket. Perfect for conversational AI, live captioning, and streaming applications. **Arguments**: * `text_stream` - Iterator of text chunks to stream * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `max_workers` - ThreadPoolExecutor workers for concurrent sender * `ws_options` - WebSocket connection options for configuring timeouts, message size limits, etc. 
Useful for long-running generations that may exceed default timeout values. See WebSocketOptions class for available parameters. **Returns**: Iterator of audio bytes **Example**: ```python theme={null} from fishaudio import FishAudio, TTSConfig, ReferenceAudio, WebSocketOptions client = FishAudio(api_key="...") def text_generator(): yield "Hello, " yield "this is " yield "streaming text!" # Simple usage with defaults with open("output.mp3", "wb") as f: for audio_chunk in client.tts.stream_websocket(text_generator()): f.write(audio_chunk) # With format and speed parameters with open("output.wav", "wb") as f: for audio_chunk in client.tts.stream_websocket( text_generator(), format="wav", speed=1.3 ): f.write(audio_chunk) # With reference_id parameter with open("output.mp3", "wb") as f: for audio_chunk in client.tts.stream_websocket(text_generator(), reference_id="your_model_id"): f.write(audio_chunk) # With references parameter with open("output.mp3", "wb") as f: for audio_chunk in client.tts.stream_websocket( text_generator(), references=[ReferenceAudio(audio=audio_bytes, text="sample")] ): f.write(audio_chunk) # With WebSocket options for long-running generations # Useful if you're generating very long responses that may take >20 seconds ws_options = WebSocketOptions(keepalive_ping_timeout_seconds=60.0) with open("output.mp3", "wb") as f: for audio_chunk in client.tts.stream_websocket( text_generator(), ws_options=ws_options ): f.write(audio_chunk) # Parameters override config values config = TTSConfig(format="mp3", latency="balanced") with open("output.wav", "wb") as f: for audio_chunk in client.tts.stream_websocket( text_generator(), format="wav", # Parameter wins config=config ): f.write(audio_chunk) ``` ## AsyncTTSClient Objects ```python theme={null} class AsyncTTSClient() ``` Asynchronous TTS operations. 
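The async methods that follow mirror the sync API above; the main difference is that audio is consumed with `async for` rather than a plain loop. A minimal, SDK-independent sketch of that pattern, with a stub generator standing in for a real audio stream (the names here are illustrative only):

```python
import asyncio
from typing import AsyncIterator

async def fake_audio_stream() -> AsyncIterator[bytes]:
    """Stand-in for the chunks an async TTS stream would yield."""
    for chunk in (b"riff", b"data", b"!"):
        await asyncio.sleep(0)  # simulate awaiting the network
        yield chunk

async def collect(stream: AsyncIterator[bytes]) -> bytes:
    """Drain an async byte stream into one buffer, like AsyncAudioStream.collect()."""
    buf = bytearray()
    async for chunk in stream:
        buf.extend(chunk)
    return bytes(buf)

audio = asyncio.run(collect(fake_audio_stream()))
```

The same shape applies whether you drain the stream into memory, write each chunk to a file, or forward chunks to a client connection.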
#### stream ```python theme={null} async def stream( *, text: str, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", request_options: Optional[RequestOptions] = None) -> AsyncAudioStream ``` Stream text-to-speech audio chunks (async). **Arguments**: * `text` - Text to synthesize * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `request_options` - Request-level overrides **Returns**: AsyncAudioStream object that can be iterated for audio chunks **Example**: ```python theme={null} from fishaudio import AsyncFishAudio client = AsyncFishAudio(api_key="...") # Stream and process chunks async for chunk in await client.tts.stream(text="Hello world"): await process_audio_chunk(chunk) # Or collect all at once stream = await client.tts.stream(text="Hello world") audio = await stream.collect() ``` #### convert ```python theme={null} async def convert(*, text: str, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", request_options: Optional[RequestOptions] = None) -> bytes ``` Convert text to speech and return complete audio as bytes (async). 
This is a convenience method that streams all audio chunks and combines them. For chunk-by-chunk processing, use stream() instead. **Arguments**: * `text` - Text to synthesize * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `request_options` - Request-level overrides **Returns**: Complete audio as bytes **Example**: ```python theme={null} from fishaudio import AsyncFishAudio from fishaudio.utils import play, save client = AsyncFishAudio(api_key="...") # Get complete audio audio = await client.tts.convert(text="Hello world") # Play it play(audio) # Or save it save(audio, "output.mp3") ``` #### stream\_websocket ```python theme={null} async def stream_websocket(text_stream: AsyncIterable[Union[str, TextEvent, FlushEvent]], *, reference_id: Optional[str] = None, references: Optional[list[ReferenceAudio]] = None, format: Optional[AudioFormat] = None, latency: Optional[LatencyMode] = None, speed: Optional[float] = None, config: TTSConfig = TTSConfig(), model: Model = "s2-pro", ws_options: Optional[WebSocketOptions] = None) ``` Stream text and receive audio in real-time via WebSocket (async). Perfect for conversational AI, live captioning, and streaming applications. 
**Arguments**: * `text_stream` - Async iterator of text chunks to stream * `reference_id` - Voice reference ID (overrides config.reference\_id if provided) * `references` - Reference audio samples (overrides config.references if provided) * `format` - Audio format - "mp3", "wav", "pcm", or "opus" (overrides config.format if provided) * `latency` - Latency mode - "normal" or "balanced" (overrides config.latency if provided) * `speed` - Speech speed multiplier, e.g. 1.5 for 1.5x speed (overrides config.prosody.speed if provided) * `config` - TTS configuration (audio settings, voice, model parameters) * `model` - TTS model to use * `ws_options` - WebSocket connection options for configuring timeouts, message size limits, etc. Useful for long-running generations that may exceed default timeout values. See WebSocketOptions class for available parameters. **Returns**: Async iterator of audio bytes **Example**: ```python theme={null} import aiofiles from fishaudio import AsyncFishAudio, TTSConfig, ReferenceAudio, WebSocketOptions client = AsyncFishAudio(api_key="...") async def text_generator(): yield "Hello, " yield "this is " yield "async streaming!" 
# Simple usage with defaults async with aiofiles.open("output.mp3", "wb") as f: async for audio_chunk in client.tts.stream_websocket(text_generator()): await f.write(audio_chunk) # With format and speed parameters async with aiofiles.open("output.wav", "wb") as f: async for audio_chunk in client.tts.stream_websocket( text_generator(), format="wav", speed=1.3 ): await f.write(audio_chunk) # With reference_id parameter async with aiofiles.open("output.mp3", "wb") as f: async for audio_chunk in client.tts.stream_websocket(text_generator(), reference_id="your_model_id"): await f.write(audio_chunk) # With references parameter async with aiofiles.open("output.mp3", "wb") as f: async for audio_chunk in client.tts.stream_websocket( text_generator(), references=[ReferenceAudio(audio=audio_bytes, text="sample")] ): await f.write(audio_chunk) # With WebSocket options for long-running generations # Useful if you're generating very long responses that may take >20 seconds ws_options = WebSocketOptions(keepalive_ping_timeout_seconds=60.0) async with aiofiles.open("output.mp3", "wb") as f: async for audio_chunk in client.tts.stream_websocket( text_generator(), ws_options=ws_options ): await f.write(audio_chunk) # Parameters override config values config = TTSConfig(format="mp3", latency="balanced") async with aiofiles.open("output.wav", "wb") as f: async for audio_chunk in client.tts.stream_websocket( text_generator(), format="wav", # Parameter wins config=config ): await f.write(audio_chunk) ``` # fishaudio.resources.realtime Real-time WebSocket streaming helpers. #### iter\_websocket\_audio ```python theme={null} def iter_websocket_audio(ws) -> Iterator[bytes] ``` Process WebSocket audio messages (sync). Receives messages from WebSocket, yields audio chunks, handles errors. Unknown events are ignored and iteration continues. 
**Arguments**: * `ws` - WebSocket connection from httpx\_ws.connect\_ws **Yields**: Audio bytes **Raises**: * `WebSocketError` - On disconnect or error finish event #### aiter\_websocket\_audio ```python theme={null} async def aiter_websocket_audio(ws) -> AsyncIterator[bytes] ``` Process WebSocket audio messages (async). Receives messages from WebSocket, yields audio chunks, handles errors. Unknown events are ignored and iteration continues. **Arguments**: * `ws` - WebSocket connection from httpx\_ws.aconnect\_ws **Yields**: Audio bytes **Raises**: * `WebSocketError` - On disconnect or error finish event # fishaudio.resources.asr ASR (Automatic Speech Recognition) namespace client. ## ASRClient Objects ```python theme={null} class ASRClient() ``` Synchronous ASR operations. #### transcribe ```python theme={null} def transcribe( *, audio: bytes, language: Optional[str] = OMIT, include_timestamps: bool = True, request_options: Optional[RequestOptions] = None) -> ASRResponse ``` Transcribe audio to text. **Arguments**: * `audio` - Audio file bytes * `language` - Language code (e.g., "en", "zh"). Auto-detected if not provided. * `include_timestamps` - Whether to include timestamp information for segments * `request_options` - Request-level overrides **Returns**: ASRResponse with transcription text, duration, and segments **Example**: ```python theme={null} client = FishAudio(api_key="...") with open("audio.mp3", "rb") as f: audio_bytes = f.read() result = client.asr.transcribe(audio=audio_bytes, language="en") print(result.text) for segment in result.segments: print(f"{segment.start}-{segment.end}: {segment.text}") ``` ## AsyncASRClient Objects ```python theme={null} class AsyncASRClient() ``` Asynchronous ASR operations. #### transcribe ```python theme={null} async def transcribe( *, audio: bytes, language: Optional[str] = OMIT, include_timestamps: bool = True, request_options: Optional[RequestOptions] = None) -> ASRResponse ``` Transcribe audio to text (async). 
**Arguments**: * `audio` - Audio file bytes * `language` - Language code (e.g., "en", "zh"). Auto-detected if not provided. * `include_timestamps` - Whether to include timestamp information for segments * `request_options` - Request-level overrides **Returns**: ASRResponse with transcription text, duration, and segments **Example**: ```python theme={null} client = AsyncFishAudio(api_key="...") async with aiofiles.open("audio.mp3", "rb") as f: audio_bytes = await f.read() result = await client.asr.transcribe(audio=audio_bytes, language="en") print(result.text) for segment in result.segments: print(f"{segment.start}-{segment.end}: {segment.text}") ``` # Types Source: https://docs.fish.audio/api-reference/sdk/python/types # fishaudio.types.voices Voice and model management types. ## Sample Objects ```python theme={null} class Sample(BaseModel) ``` A sample audio for a voice model. **Attributes**: * `title` - Title/name of the audio sample * `text` - Transcription of the spoken content in the sample * `task_id` - Unique identifier for the sample task * `audio` - URL or path to the audio file ## Author Objects ```python theme={null} class Author(BaseModel) ``` Voice model author information. **Attributes**: * `id` - Unique author identifier * `nickname` - Author's display name * `avatar` - URL to author's avatar image ## Voice Objects ```python theme={null} class Voice(BaseModel) ``` A voice model. Represents a TTS voice that can be used for synthesis. **Attributes**: * `id` - Unique voice model identifier (use as reference\_id in TTS) * `type` - Model type. Options: "svc" (singing voice conversion), "tts" (text-to-speech) * `title` - Voice model title/name * `description` - Detailed description of the voice model * `cover_image` - URL to the voice model's cover image * `train_mode` - Training mode used. 
Options: "fast" * `state` - Current model state (e.g., "ready", "training", "failed") * `tags` - List of tags for categorization (e.g., \["male", "english", "young"]) * `samples` - List of audio samples demonstrating the voice * `created_at` - Timestamp when the model was created * `updated_at` - Timestamp when the model was last updated * `languages` - List of supported language codes (e.g., \["en", "zh"]) * `visibility` - Model visibility. Options: "public", "private", "unlist" * `lock_visibility` - Whether visibility setting is locked * `like_count` - Number of likes the model has received * `mark_count` - Number of bookmarks/favorites * `shared_count` - Number of times the model has been shared * `task_count` - Number of times the model has been used for generation * `liked` - Whether the current user has liked this model. Default: False * `marked` - Whether the current user has bookmarked this model. Default: False * `author` - Information about the voice model's creator # fishaudio.types.account Account-related types (credits, packages, etc.). ## Credits Objects ```python theme={null} class Credits(BaseModel) ``` User's API credit balance. **Attributes**: * `id` - Unique credits record identifier * `user_id` - User identifier * `credit` - Current credit balance (decimal for precise accounting) * `created_at` - Timestamp when the credits record was created * `updated_at` - Timestamp when the credits were last updated * `has_phone_sha256` - Whether the user has a verified phone number. Optional * `has_free_credit` - Whether the user has received free credits. Optional ## Package Objects ```python theme={null} class Package(BaseModel) ``` User's prepaid package information. 
**Attributes**:

* `id` - Unique package identifier
* `user_id` - User identifier
* `type` - Package type identifier
* `total` - Total units in the package
* `balance` - Remaining units in the package
* `created_at` - Timestamp when the package was purchased
* `updated_at` - Timestamp when the package was last updated
* `finished_at` - Timestamp when the package was fully consumed. None if still active

# fishaudio.types.tts

TTS-related types.

## ReferenceAudio Objects

```python theme={null}
class ReferenceAudio(BaseModel)
```

Reference audio for voice cloning/style.

**Attributes**:

* `audio` - Audio file bytes for the reference sample
* `text` - Transcription of what is spoken in the reference audio. Should match exactly what's spoken and include punctuation for proper prosody.

## Prosody Objects

```python theme={null}
class Prosody(BaseModel)
```

Speech prosody settings (speed and volume).

**Attributes**:

* `speed` - Speech speed multiplier. Range: 0.5-2.0. Default: 1.0. Examples: 1.5 = 50% faster, 0.8 = 20% slower
* `volume` - Volume adjustment in decibels. Range: -20.0 to 20.0. Default: 0.0 (no change). Positive values increase volume, negative values decrease it.

#### from\_speed\_override

```python theme={null}
@classmethod
def from_speed_override(cls, speed: float, base: Optional["Prosody"] = None) -> "Prosody"
```

Create Prosody with speed override, preserving volume from base.

**Arguments**:

* `speed` - Speed value to use
* `base` - Base prosody to preserve volume from (if any)

**Returns**:

New Prosody instance with overridden speed

## TTSConfig Objects

```python theme={null}
class TTSConfig(BaseModel)
```

TTS generation configuration. Reusable configuration for text-to-speech requests. Create once, use multiple times. All parameters have sensible defaults.

**Attributes**:

* `format` - Audio output format. Options: "mp3", "wav", "pcm", "opus". Default: "mp3"
* `sample_rate` - Audio sample rate in Hz. If None, uses format-specific default.
* `mp3_bitrate` - MP3 bitrate in kbps. Options: 64, 128, 192. Default: 128 * `opus_bitrate` - Opus bitrate in kbps. Options: -1000, 24, 32, 48, 64. Default: 32 * `normalize` - Whether to normalize/clean the input text. Default: True * `chunk_length` - Characters per generation chunk. Range: 100-300. Default: 200. Lower values = faster initial response, higher values = better quality * `latency` - Generation mode. Options: "normal" (higher quality), "balanced" (faster). Default: "balanced" * `reference_id` - Voice model ID from fish.audio (e.g., "802e3bc2b27e49c2995d23ef70e6ac89"). Find IDs in voice URLs or via voices.list() * `references` - List of reference audio samples for instant voice cloning. Default: \[] * `prosody` - Speech speed and volume settings. Default: None (uses natural prosody) * `top_p` - Nucleus sampling parameter for token selection. Range: 0.0-1.0. Default: 0.7 * `temperature` - Randomness in generation. Range: 0.0-1.0. Default: 0.7. Higher = more varied, lower = more consistent * `max_new_tokens` - Maximum number of tokens to generate. Default: 1024 * `repetition_penalty` - Penalty for repeated tokens. Default: 1.2 * `min_chunk_length` - Minimum chunk length for generation. Default: 50 * `condition_on_previous_chunks` - Whether to condition generation on previous chunks. Default: True * `early_stop_threshold` - Threshold for early stopping. Default: 1.0 ## TTSRequest Objects ```python theme={null} class TTSRequest(BaseModel) ``` Request parameters for text-to-speech generation. This model is used internally for WebSocket streaming. For the HTTP API, parameters are passed directly to methods. **Attributes**: * `text` - Text to synthesize into speech * `chunk_length` - Characters per generation chunk. Range: 100-300. Default: 200 * `format` - Audio output format. Options: "mp3", "wav", "pcm", "opus". Default: "mp3" * `sample_rate` - Audio sample rate in Hz. If None, uses format-specific default * `mp3_bitrate` - MP3 bitrate in kbps. 
Options: 64, 128, 192. Default: 128 * `opus_bitrate` - Opus bitrate in kbps. Options: -1000, 24, 32, 48, 64. Default: 32 * `references` - List of reference audio samples for voice cloning. Default: \[] * `reference_id` - Voice model ID for using a specific voice. Default: None * `normalize` - Whether to normalize/clean the input text. Default: True * `latency` - Generation mode. Options: "normal", "balanced". Default: "balanced" * `prosody` - Speech speed and volume settings. Default: None * `top_p` - Nucleus sampling for token selection. Range: 0.0-1.0. Default: 0.7 * `temperature` - Randomness in generation. Range: 0.0-1.0. Default: 0.7 * `max_new_tokens` - Maximum number of tokens to generate. Default: 1024 * `repetition_penalty` - Penalty for repeated tokens. Default: 1.2 * `min_chunk_length` - Minimum chunk length for generation. Default: 50 * `condition_on_previous_chunks` - Whether to condition generation on previous chunks. Default: True * `early_stop_threshold` - Threshold for early stopping. Default: 1.0 ## StartEvent Objects ```python theme={null} class StartEvent(BaseModel) ``` WebSocket start event to initiate TTS streaming. **Attributes**: * `event` - Event type identifier, always "start" * `request` - TTS configuration for the streaming session ## TextEvent Objects ```python theme={null} class TextEvent(BaseModel) ``` WebSocket event to send a text chunk for synthesis. **Attributes**: * `event` - Event type identifier, always "text" * `text` - Text chunk to synthesize ## FlushEvent Objects ```python theme={null} class FlushEvent(BaseModel) ``` WebSocket event to force immediate audio generation from buffered text. Use this to ensure all buffered text is synthesized without waiting for more input. **Attributes**: * `event` - Event type identifier, always "flush" ## CloseEvent Objects ```python theme={null} class CloseEvent(BaseModel) ``` WebSocket event to end the streaming session. 
**Attributes**: * `event` - Event type identifier, always "stop" # fishaudio.types.shared Shared types used across the SDK. ## PaginatedResponse Objects ```python theme={null} class PaginatedResponse(BaseModel, Generic[T]) ``` Generic paginated response. **Attributes**: * `total` - Total number of items across all pages * `items` - List of items on the current page #### warn\_if\_deprecated\_model ```python theme={null} def warn_if_deprecated_model(model: str) -> None ``` Emit a deprecation warning if a legacy model is used. # fishaudio.types.asr ASR (Automatic Speech Recognition) related types. ## ASRSegment Objects ```python theme={null} class ASRSegment(BaseModel) ``` A timestamped segment of transcribed text. **Attributes**: * `text` - The transcribed text for this segment * `start` - Segment start time in seconds * `end` - Segment end time in seconds ## ASRResponse Objects ```python theme={null} class ASRResponse(BaseModel) ``` Response from speech-to-text transcription. **Attributes**: * `text` - Complete transcription of the entire audio * `duration` - Total audio duration in milliseconds * `segments` - List of timestamped text segments. Empty if include\_timestamps=False #### duration Duration in milliseconds # Utils Source: https://docs.fish.audio/api-reference/sdk/python/utils # fishaudio.utils.play Audio playback utility. #### play ```python theme={null} def play(audio: Union[bytes, Iterable[bytes]], *, notebook: bool = False, use_ffmpeg: bool = True) -> None ``` Play audio using various playback methods. 
**Arguments**: * `audio` - Audio bytes or iterable of bytes * `notebook` - Use Jupyter notebook playback (IPython.display.Audio) * `use_ffmpeg` - Use ffplay for playback (default, falls back to sounddevice) **Raises**: * `DependencyError` - If required playback tool is not installed **Examples**: ```python theme={null} from fishaudio import FishAudio, play client = FishAudio(api_key="...") audio = client.tts.convert(text="Hello world") # Play directly play(audio) # In Jupyter notebook play(audio, notebook=True) # Force sounddevice fallback play(audio, use_ffmpeg=False) ``` # fishaudio.utils.save Audio saving utility. #### save ```python theme={null} def save(audio: Union[bytes, Iterable[bytes]], filename: str) -> None ``` Save audio to a file. **Arguments**: * `audio` - Audio bytes or iterable of bytes * `filename` - Path to save the audio file **Examples**: ```python theme={null} from fishaudio import FishAudio, save client = FishAudio(api_key="...") audio = client.tts.convert(text="Hello world") # Save to file save(audio, "output.mp3") # Works with iterators too audio_stream = client.tts.convert(text="Another example") save(audio_stream, "another.mp3") ``` # fishaudio.utils.stream Audio streaming utility. #### stream ```python theme={null} def stream(audio_stream: Iterator[bytes]) -> bytes ``` Stream audio in real-time while playing it with mpv. This function plays the audio as it's being generated and simultaneously captures it to return the complete audio buffer. 
**Arguments**: * `audio_stream` - Iterator of audio byte chunks **Returns**: Complete audio bytes after streaming finishes **Raises**: * `DependencyError` - If mpv is not installed **Examples**: ```python theme={null} from fishaudio import FishAudio, stream client = FishAudio(api_key="...") audio_stream = client.tts.convert(text="Hello world") # Stream and play in real-time, get complete audio complete_audio = stream(audio_stream) # Save the captured audio with open("output.mp3", "wb") as f: f.write(complete_audio) ``` # Legacy Source: https://docs.fish.audio/archive/python-sdk-legacy/index Archived documentation for the legacy Session-based Python SDK This documentation is for the legacy Python SDK using the Session-based API. This API is deprecated. **Please migrate to the [new Python SDK](/developer-guide/sdk-guide/python)** which uses a modern client-based architecture. See the [migration guide](/archive/python-sdk-legacy/migration-guide) for help upgrading. ## About the Legacy SDK This archive contains documentation for the `fish_audio_sdk` module using the Session-based API. While this API still functions, it is no longer actively maintained and lacks the modern features available in the new SDK. ### What's Different in the New SDK The new Python SDK (`fishaudio` module) offers: * **Modern client-based architecture** - More intuitive and consistent with modern Python libraries * **Full async support** - Native asyncio integration for better performance * **Better type safety** - Comprehensive type hints and better IDE support * **Improved error handling** - More detailed error messages and exception hierarchy * **Enhanced utilities** - Built-in audio playback, streaming, and file management * **Active maintenance** - Regular updates and new features ### Migration Path We strongly recommend migrating to the new SDK. 
The [migration guide](/archive/python-sdk-legacy/migration-guide) provides:

* Side-by-side code comparisons
* Complete list of breaking changes
* Common migration patterns
* Troubleshooting tips

## Migration

Complete guide to upgrading from the legacy SDK to the new client-based API.

## Legacy Documentation Pages

* How to install the legacy SDK
* Session initialization and API keys
* TTS with the Session-based API
* Reference audio and voice models
* ASR transcription with legacy API
* Real-time streaming with WebSocketSession

# Contributing

Source: https://docs.fish.audio/contributing

Help improve Fish Audio and contribute to our open source projects.

# Contributing to Fish Audio

First off, thanks for taking the time to contribute! All types of contributions are encouraged and valued. See the sections below for different ways to help and details about how this project handles them. Please make sure to read the relevant section before making your contribution. It will make it a lot easier for us maintainers and smooth out the experience for all involved. The community looks forward to your contributions.

If you like the project but don't have time to contribute, there are other easy ways to support Fish Audio:

* Star our repositories
* Tweet about it
* Reference Fish Audio in your project's readme
* Mention the project at local meetups and tell your friends/colleagues

## Code of Conduct

This project and everyone participating in it is governed by the Fish Audio Code of Conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to our community team.

## I Have a Question

Before you ask a question, please read the available [Documentation](https://docs.fish.audio). It's best to search for existing [Issues](https://github.com/fishaudio) that might help you. In case you have found a suitable issue and still need clarification, you can write your question in that issue. It is also advisable to search the internet for answers first.
If you still need to ask a question: 1. Open an [Issue](https://github.com/fishaudio) in the relevant repository 2. Provide as much context as you can about what you're running into 3. Provide project and platform versions (Node.js, Python, OS, etc.), depending on what seems relevant We will take care of the issue as soon as possible. ## I Want To Contribute **Legal Notice** When contributing to this project, you must agree that you have authored 100% of the content, that you have the necessary rights to the content, and that the content you contribute may be provided under the project license. ### Reporting Bugs #### Before Submitting a Bug Report A good bug report shouldn't leave others needing to chase you up for more information. Please investigate carefully, collect information, and describe the issue in detail: * Make sure you are using the latest version * Determine if your bug is really a bug and not an error on your side (e.g., incompatible environment components/versions) * Check if there is already a bug report for your issue in the bug tracker * Search the internet (including Stack Overflow) to see if others have discussed the issue * Collect information about the bug: * Stack trace (Traceback) * OS, Platform and Version (Windows, Linux, macOS, x86, ARM) * Version of the interpreter, compiler, SDK, runtime environment, package manager * Your input and the output * Can you reliably reproduce the issue? Can you reproduce it with older versions? #### How Do I Submit a Good Bug Report? You must never report security-related issues, vulnerabilities, or bugs including sensitive information to the issue tracker. Instead, sensitive bugs must be sent by email to our security team. We use GitHub issues to track bugs and errors. If you run into an issue: 1. Open an [Issue](https://github.com/fishaudio) in the relevant repository 2. Explain the behavior you would expect and the actual behavior 3. 
Provide as much context as possible and describe the **reproduction steps** that someone else can follow to recreate the issue on their own 4. Provide the information you collected in the previous section Once filed: * The project team will label the issue accordingly * A team member will try to reproduce the issue with your provided steps * If there are no reproduction steps, the team will ask for them and mark the issue as `needs-repro` * If the team reproduces the issue, it will be marked `needs-fix` and left to be implemented ### Suggesting Enhancements This section guides you through submitting an enhancement suggestion for Fish Audio, including completely new features and minor improvements to existing functionality. #### Before Submitting an Enhancement * Make sure you are using the latest version * Read the [documentation](https://docs.fish.audio) carefully to see if the functionality already exists * Perform a [search](https://github.com/fishaudio) to see if the enhancement has already been suggested * Consider whether your idea fits with the scope and aims of the project #### How Do I Submit a Good Enhancement Suggestion? Enhancement suggestions are tracked as GitHub issues: * Use a **clear and descriptive title** for the issue * Provide a **step-by-step description** of the suggested enhancement in as many details as possible * **Describe the current behavior** and **explain which behavior you expected to see instead** and why * Include **screenshots or screen recordings** if applicable * **Explain why this enhancement would be useful** to most Fish Audio users ### Your First Code Contribution We welcome first-time contributors! Here's how to get started: 1. **Fork the repository** you want to contribute to 2. **Clone your fork** locally 3. **Create a new branch** for your changes 4. **Make your changes** following our styleguides 5. **Test your changes** thoroughly 6. **Commit your changes** with clear commit messages 7. 
**Push to your fork** and submit a pull request Look for issues labeled `good first issue` or `help wanted` for beginner-friendly tasks. ### Improving The Documentation Documentation improvements are always welcome! This includes: * Fixing typos and grammatical errors * Adding missing information or clarifications * Improving code examples * Adding new guides or tutorials * Translating documentation See our [documentation repository](https://github.com/fishaudio/fish-docs) to get started. ## Styleguides ### Commit Messages * Use clear and meaningful commit messages * Start with a verb in the present tense (e.g., "Add", "Fix", "Update", "Remove") * Keep the first line under 72 characters * Reference issues and pull requests when relevant * Provide additional context in the commit body if needed Example: ``` Add voice cloning support for Python SDK - Implement VoiceCloneClient class - Add comprehensive error handling - Include usage examples in docstrings Closes #123 ``` ### Code Style * Follow the existing code style in each repository * Use meaningful variable and function names * Add comments for complex logic * Write tests for new features * Ensure all tests pass before submitting ## Attribution This contribution guide is based on the **contributing.md** generator. Fish Audio is committed to open source and welcomes contributions from developers worldwide. # Emotion & Expression Control Source: https://docs.fish.audio/developer-guide/best-practices/emotion-control Make your AI voices express emotions naturally ## Overview Control how your AI voice expresses emotions, from happy and excited to sad and contemplative. Add natural pauses, laughter, and other human-like elements to make speech more engaging. The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. 
See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. ## How to Use Simply wrap emotion tags in parentheses before your text: ``` (happy) What a beautiful day! (sad) I'm sorry to hear that. (excited) This is amazing news! ``` Include tone markers or audio effects: ``` (whispering) Let me tell you something. (laughing) Ha ha ha, wow that's so funny! ``` ## Important Rules ### Placement Matters **For all languages:** * Emotion tags MUST go at the beginning of sentences * Tone controls can go anywhere in the text * Sound effects can go anywhere in the text **Correct:** ``` (happy) What a wonderful day! ``` **Incorrect:** ``` What a (happy) wonderful day! ``` ## Best Practices **Do:** * Use one emotion per sentence * Add sounds after relevant words * Keep tags simple and clear * Test different combinations **Don't:** * Overuse tags in short text * Mix conflicting emotions * Create custom tags * Forget the parentheses ## Available Emotions See the [Emotion Reference](/api-reference/emotion-reference) for the full list of supported emotions. ## Scene Examples **Customer Service:** ``` (friendly) Hello! How can I help you today? (empathetic) I understand your frustration. (confident) I'll resolve this for you right away. ``` **Storytelling:** ``` (mysterious)(whispering) Once upon a midnight dreary... (excited) Suddenly, the door burst open! (scared)(shouting) Run for your lives! ``` **Educational Content:** ``` (enthusiastic) Welcome to today's lesson! (curious) Have you ever wondered why the sky is blue? (proud) Great job! You got it right! ``` ## Real-World Examples ### Virtual Assistant ``` (friendly) Good morning! (helpful) I've prepared your schedule for today. (concerned) You have three urgent emails. (encouraging) Let's tackle them together! ``` ### Audiobook Narration ``` (narrator) Chapter One: The Beginning (mysterious) The old house stood silent in the fog. (scared)(whispering) "Is anyone there?" 
she asked. (relieved)(sighing) No one answered. Phew. ``` ### Game Character ``` (brave) I'll defeat the dragon! (struggling)(panting) This is... harder than... I thought! (triumphant)(shouting) Victory is mine! (laughing) Ha ha ha! ``` ## Advanced Techniques ### Emotion Transitions Gradually change emotions: ``` (happy) I got the promotion! (uncertain) But... it means moving away. (sad) I'll miss everyone here. ``` ### Background Effects Add atmosphere: ``` The comedy show was amazing (audience laughing) Everyone was having fun (background laughter) The crowd loved it (crowd laughing) ``` ## Troubleshooting ### Emotion Not Working? 1. Check tag placement (beginning of sentence for emotions) 2. Verify spelling exactly matches the list 3. Don't use quotes around tags 4. Include parentheses ### Unnatural Sound? * Add appropriate text after sound tags * Don't overuse in short sentences * Space out emotional changes * Test with different voices ### Tips for Success 1. **Start simple** - Use basic emotions first 2. **Preview often** - Test how it sounds 3. **Be consistent** - Keep character emotions logical 4. **Less is more** - Don't overuse tags ## Get Creative Experiment with combinations to create unique character voices and engaging narratives. The key is finding the right balance between emotional expression and natural speech flow. ## Support Need help with emotions? * **Try it live:** [fish.audio](https://fish.audio) * **Community:** [Discord](https://discord.gg/fish-audio) * **Email:** [support@fish.audio](mailto:support@fish.audio) # Real-time Voice Streaming Source: https://docs.fish.audio/developer-guide/best-practices/real-time-streaming Stream voice generation in real-time for interactive applications ## Overview Real-time streaming lets you generate speech as you type or speak, perfect for chatbots, virtual assistants, and live applications. 
## When to Use Streaming **Perfect for:** * Live chat applications * Virtual assistants * Interactive storytelling * Real-time translations * Gaming dialogue **Not ideal for:** * Pre-recorded content * Batch processing ## Getting Started ### Web Playground Try real-time streaming instantly: 1. Visit [fish.audio](https://fish.audio) 2. Enable "Streaming Mode" 3. Start typing and hear voice generation in real-time ### Using the SDK Stream text as it's being written: ```python theme={null} from fishaudio import FishAudio # Initialize client client = FishAudio(api_key="your_api_key") # Stream text word by word def stream_text(): text = "Hello, this is being generated in real time" for word in text.split(): yield word + " " # Generate speech as text streams audio_stream = client.tts.stream_websocket( stream_text(), reference_id="your_voice_model_id", temperature=0.7, # Controls variation top_p=0.7, # Controls diversity latency="balanced" ) with open("output.mp3", "wb") as f: for audio_chunk in audio_stream: f.write(audio_chunk) ``` ```javascript theme={null} import { FishAudioClient, RealtimeEvents } from "fish-audio"; import { writeFile } from "fs/promises"; import path from "path"; const apiKey = "your_api_key"; const referenceId = "your_voice_model_id"; async function* makeTextStream() { const chunks = [ "Hello from Fish Audio! ", "This is a realtime text-to-speech test. 
", "We are streaming multiple chunks over WebSocket.", ]; for (const chunk of chunks) { yield chunk; await new Promise((r) => setTimeout(r, 200)); } } async function main() { const client = new FishAudioClient({ apiKey }); // For realtime, set text to "" and stream content via makeTextStream const request = { text: "", reference_id: referenceId, }; const connection = await client.textToSpeech.convertRealtime( request, makeTextStream() ); // Collect audio and write to a file when the stream ends const chunks = []; connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened")); connection.on(RealtimeEvents.AUDIO_CHUNK, (audio) => { if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) { chunks.push(Buffer.from(audio)); } }); connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err) ); connection.on(RealtimeEvents.CLOSE, async () => { const outPath = path.resolve(process.cwd(), "out.mp3"); await writeFile(outPath, Buffer.concat(chunks)); console.log("Saved to", outPath); }); } main().catch((err) => { console.error(err); process.exit(1); }); ``` ## Configuration Options ### Speed vs Quality **Latency Modes:** * **Normal:** Best quality, \~500ms latency * **Balanced:** Good quality, \~300ms latency ```python theme={null} # Use latency parameter with stream_websocket audio_stream = client.tts.stream_websocket( text_chunks(), reference_id="model_id", latency="balanced" # For faster response ) ``` ```javascript theme={null} const request = { text: "", reference_id: "model_id", latency: "balanced", // For faster response }; ``` ### Voice Control **Temperature** (0.1 - 1.0): * Lower: More consistent, predictable * Higher: More varied, expressive **Top-p** (0.1 - 1.0): * Lower: More focused * Higher: More diverse ## Real-time Applications ### Chatbot Integration Stream responses as they're generated: ```python theme={null} def chatbot_response(user_input): # Get AI response (streaming) ai_text = get_ai_response(user_input) # Convert 
to speech in real-time audio_stream = client.tts.stream_websocket(ai_text) for audio_chunk in audio_stream: play_audio(audio_chunk) ``` ```javascript theme={null} async function chatbotResponse(userInput) { // Get AI response (streaming) const aiTextStream = getAiResponse(userInput); // async iterable of strings // Convert to speech in real-time for await (const textChunk of aiTextStream) { for await (const audioChunk of ttsStream(textChunk)) { playAudio(audioChunk); } } } ``` ### Live Translation Translate and speak simultaneously: ```python theme={null} def live_translate(source_audio): # Transcribe source audio text = transcribe(source_audio) # Translate text translated = translate(text, target_language) # Stream translated speech for chunk in stream_text(translated): generate_speech(chunk) ``` ```javascript theme={null} async function liveTranslate(sourceAudio) { // Transcribe source audio const text = await transcribe(sourceAudio); // Translate text const translated = await translate(text, targetLanguage); // Stream translated speech for await (const chunk of streamText(translated)) { generateSpeech(chunk); } } ``` ## Best Practices ### Text Buffering **Do:** * Send complete words with spaces * Use punctuation for natural pauses * Buffer 5-10 words for smoothness **Don't:** * Send individual characters * Forget spaces between words * Send huge chunks at once ### Connection Management 1. **Keep connections alive** for multiple generations 2. **Handle disconnections** gracefully 3. **Implement retry logic** for reliability ### Audio Playback For smooth playback: * Buffer 2-3 audio chunks * Use cross-fading between chunks * Handle network delays gracefully ## Common Use Cases ### Interactive Story ```python theme={null} def interactive_story(): story_parts = [ "Once upon a time,", "in a land far away,", "there lived a brave knight..." 
] for part in story_parts: # Generate and play each part stream_speech(part) # Wait for user input user_choice = get_user_input() # Continue based on choice ``` ```javascript theme={null} function interactiveStory() { const storyParts = [ "Once upon a time,", "in a land far away,", "there lived a brave knight...", ]; for (const part of storyParts) { // Generate and play each part streamSpeech(part); // Wait for user input const userChoice = getUserInput(); // Continue based on choice } } ``` ### Virtual Assistant ```python theme={null} def virtual_assistant(): while True: # Listen for wake word if detect_wake_word(): # Start streaming response response = process_command() stream_speech(response) ``` ```javascript theme={null} async function virtualAssistant() { while (true) { // Listen for wake word if (detectWakeWord()) { // Start streaming response const response = processCommand(); streamSpeech(response); } } } ``` ### Live Commentary ```python theme={null} def live_commentary(event_stream): for event in event_stream: # Generate commentary commentary = generate_commentary(event) # Stream immediately stream_speech(commentary) ``` ```javascript theme={null} async function liveCommentary(eventStream) { for await (const event of eventStream) { // Generate commentary const commentary = generateCommentary(event); // Stream immediately streamSpeech(commentary); } } ``` ## Troubleshooting ### Audio Gaps **Problem:** Gaps between audio chunks
**Solution:** * Increase buffer size * Use balanced latency mode * Check network connection ### Delayed Response **Problem:** Long wait before audio starts
**Solution:** * Use balanced latency mode * Send initial text immediately * Reduce chunk size ### Choppy Playback **Problem:** Audio cuts in and out
**Solution:** * Buffer more chunks before playing * Check network stability * Use consistent chunk sizes ## Advanced Features ### Dynamic Voice Switching Change voices mid-stream: ```python theme={null} # Start with one voice def text1(): yield "Hello from voice one." audio1 = client.tts.stream_websocket(text1(), reference_id="voice1") for chunk in audio1: play_audio(chunk) # Switch to another def text2(): yield "And now voice two!" audio2 = client.tts.stream_websocket(text2(), reference_id="voice2") for chunk in audio2: play_audio(chunk) ``` ```javascript theme={null} // Start with one voice const request1 = { reference_id: "voice1" }; streamSpeech("Hello from voice one.", request1); // Switch to another const request2 = { reference_id: "voice2" }; streamSpeech("And now voice two!", request2); ``` ### Emotion Injection Add emotions dynamically: ```python theme={null} def emotional_speech(text, emotion): emotional_text = f"({emotion}) {text}" stream_speech(emotional_text) ``` ```javascript theme={null} function emotionalSpeech(text, emotion) { const emotionalText = `(${emotion}) ${text}`; streamSpeech(emotionalText); } ``` ### Speed Control Adjust speaking speed: ```python theme={null} from fishaudio.types import Prosody # Use speed and volume with stream_websocket audio_stream = client.tts.stream_websocket( text_chunks(), speed=1.5 # 1.5x speed ) # Note: For full prosody control including volume, use TTSConfig ``` ```javascript theme={null} const request = { text: "", prosody: { speed: 1.5, // 1.5x speed volume: 0, // Normal volume }, }; ``` ## Performance Tips 1. **Pre-load voices** for instant start 2. **Use connection pooling** for multiple streams 3. **Monitor latency** and adjust settings 4. **Cache common phrases** for instant playback ## Get Support Need help with streaming? 
* **Discord Community:** [Join our Discord](https://discord.gg/fish-audio) * **Email Support:** [support@fish.audio](mailto:support@fish.audio) * **Status Page:** [status.fish.audio](https://status.fish.audio) # Voice Cloning Best Practices Source: https://docs.fish.audio/developer-guide/best-practices/voice-cloning Simple tips to get the best voice cloning results with Fish Audio ## Getting Started Voice cloning lets you create a digital version of any voice. Use at least 10 seconds of audio recording for studio-quality results right in the Playground or via the API. ## Recording Your Voice ### Find a Quiet Space **Good places to record:** * A bedroom with curtains and carpet * Inside a parked car * A quiet office or study room * Any room with soft furniture **Avoid recording near:** * Open windows with traffic noise * Running appliances (AC, fans, refrigerators) * Other people talking * TVs or music playing ### Use What You Have **Best options:** * USB microphone or gaming headset * Phone voice recorder app (place it on a stable surface) * Earbuds with microphone (hold them steady) **Quick tip:** Keep the microphone about a hand's width from your mouth and speak normally. ## What to Say **Best approach:** Record 2-3 clips of 15-20 seconds each that form a complete paragraph. Here's a sample script you can read naturally: ``` "Hello, my name is Alex, and I enjoy reading books about technology and science. Yesterday, I walked through the park, observing the beautiful autumn leaves. The weather was quite pleasant, with a gentle breeze and warm sunshine. I often think about how amazing our world is, full of interesting discoveries waiting to be made." ``` ### Recording Tips **Must Have:** * Only one person speaking * Steady volume throughout * Consistent tone and emotion * Small pauses between sentences (about half a second) **Nice to Have:** * No background noise * No room echo * Professional mic (but phone is fine too!) 
**Avoid:**

* Multiple speakers in one recording
* Big changes in volume or emotion
* Background music or TV
* Rushing through without pauses

## Troubleshooting

### Common Problems

**Voice sounds robotic?**

* Try recording for longer, 30-60 seconds
* Speak more naturally and add pauses

**Voice doesn't sound like you?**

* Make sure you're the only person speaking in the recording
* Check that there's no background music or TV

**Poor audio quality?**

* Find a quieter room to record
* Move closer to your microphone
* Try using a different recording device

## Important: Getting Permission

Only clone voices you have permission to use:

* Your own voice
* Someone who gave you written permission
* Never use voices from the internet without permission
* Never use celebrity or public figure voices without permission

## How to Upload Your Recording

1. Visit [fish.audio](https://fish.audio) and log in
2. Find the voice creation button in your dashboard
3. Select your recorded file and give your voice a name
4. Wait for processing - it usually takes just a few seconds
5. Type some text and hear your cloned voice speak!

## Making Different Voices

Want to create character voices or different styles? Try these:

### Different Emotions

Record the same text with different feelings:

* Happy and energetic
* Calm and relaxed
* Serious and professional

### Different Characters

Create unique voices for:

* Storytelling and audiobooks
* Game characters
* Educational content
* Podcast intros

## Get Help

Need assistance? We're here to help:

* **Community Forum**: [Join our Discord](https://discord.gg/fish-audio)
* **Email Support**: [support@fish.audio](mailto:support@fish.audio)
* **Video Tutorials**: Coming soon!

# Creating Voice Models

Source: https://docs.fish.audio/developer-guide/core-features/creating-models

Learn how to create custom voice models with Fish Audio

## Overview

Create custom voice models to generate consistent, high-quality speech. You can create models through our web interface or programmatically via API.
## Web Interface

The easiest way to create a voice model:

1. Visit [fish.audio](https://fish.audio) and log in
2. Click on "Models" in your dashboard
3. Select "Create New Model"
4. Add 1 or more voice samples (at least 10 seconds each)
5. Choose privacy settings and training options
6. Click "Create" and wait for processing

## Using the API

### Using the SDK

Create models with the Python or JavaScript SDK:

First, install the SDK:

```bash theme={null}
pip install fish-audio-sdk
```

Then create a model:

```python theme={null}
from fish_audio_sdk import Session

# Initialize session with your API key
session = Session("your_api_key")

# Read the audio samples (and an optional cover image), then create the model
with open("sample1.mp3", "rb") as voice_file1, \
     open("sample2.wav", "rb") as voice_file2, \
     open("cover.png", "rb") as image_file:
    model = session.create_model(
        title="My Voice Model",
        description="Custom voice for storytelling",
        voices=[
            voice_file1.read(),
            voice_file2.read()
        ],
        cover_image=image_file.read()  # Optional
    )

print(f"Model created: {model.id}")
```

First, install the SDK:

```bash theme={null}
npm install fish-audio
```

Then create a model:

```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { createReadStream } from "fs";

const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY });

const title = "My Voice Model";
const audioFile1 = createReadStream("sample1.mp3");
// Optionally add more samples:
// const audioFile2 = createReadStream("sample2.wav");
const coverImageFile = createReadStream("cover.png"); // optional

try {
  const response = await fishAudio.voices.ivc.create({
    title,
    voices: [audioFile1],
    cover_image: coverImageFile,
    description: "Custom voice for storytelling",
    visibility: "private",
  });
  console.log("Voice created:", {
    id: response._id,
    title: response.title,
    state: response.state,
  });
} catch (err) {
  console.error("Create voice request failed:", err);
}
```

### Direct API

Create models directly using the REST API:

```python theme={null}
import requests

response = requests.post(
    "https://api.fish.audio/model",
    files=[
        ("voices", open("sample1.mp3", "rb")),
        ("voices", open("sample2.wav", "rb"))
    ],
data=[ ("title", "My Voice Model"), ("description", "Custom voice model"), ("visibility", "private"), ("type", "tts"), ("train_mode", "fast"), ("enhance_audio_quality", "true") ], headers={ "Authorization": "Bearer YOUR_API_KEY" } ) result = response.json() print(f"Model ID: {result['id']}") ``` ```javascript theme={null} import { readFile } from "fs/promises"; const form = new FormData(); form.append("title", "My Voice Model"); form.append("description", "Custom voice model"); form.append("visibility", "private"); form.append("type", "tts"); form.append("train_mode", "fast"); form.append("enhance_audio_quality", "true"); const v1 = await readFile("sample1.mp3"); const v2 = await readFile("sample2.wav"); form.append("voices", new File([v1], "sample1.mp3")); form.append("voices", new File([v2], "sample2.wav")); const res = await fetch("https://api.fish.audio/model", { method: "POST", headers: { Authorization: "Bearer " }, body: form, }); const result = await res.json(); console.log("Model ID:", result.id); ``` ## Model Settings ### Required Parameters | Parameter | Description | Type | Options | | ----------------- | --------------------------------------------------------------------- | -------------- | ----------------------- | | **title** | Name of your model | `string` | Any text | | **voices** | Audio samples | `Array` | .mp3, .wav, .m4a, .opus | | **type**\* | Model type | `enum` | `tts` | | **train\_mode**\* | Model train mode, fast means model instantly available after creation | `enum` | `fast` | \*Automatically set by Python and JavaScript SDKs ### Optional Parameters | Parameter | Description | Type | Options | | --------------------------- | -------------------------------------------------- | --------------- | ---------------------------------------------------- | | **visibility** | Who can use your model | `enum` | `private`, `public`, `unlist`
`default: public` | | **description** | Model description | `string` | Any text | | **cover\_image** | Model cover image, required if the model is public | `File` | .jpg, .png | | **texts** | Transcripts of audio samples | `Array` | Must match number of audio files | | **tags** | Tags for your model | `string[]` | Any text | | **enhance\_audio\_quality** | Remove background noise | `boolean` | `true`, `false`
`default: false` | For detailed explanations view our [API reference](/api-reference/endpoint/model/create-model). ## Audio Requirements ### Quality Guidelines **Minimum Requirements:** * At least 1 audio sample * 10+ seconds per sample **Best Practices:** * Use multiple diverse samples * 1 consistent speaker throughout * Include different emotions and tones * Record in a quiet environment * Maintain steady volume ## Adding Transcripts Including text transcripts improves model quality: ```python theme={null} response = requests.post( "https://api.fish.audio/model", files=[ ("voices", open("hello.mp3", "rb")), ("voices", open("world.wav", "rb")) ], data=[ ("title", "Enhanced Model"), ("texts", "Hello, this is my first recording."), ("texts", "Welcome to the world of AI voices."), # ... other parameters ], headers={"Authorization": "Bearer YOUR_API_KEY"} ) ``` ```javascript theme={null} import { FishAudioClient } from "fish-audio"; import { createReadStream } from "fs"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const response = await fishAudio.voices.ivc.create({ title: "Enhanced Model", voices: [ createReadStream("hello.mp3"), createReadStream("world.wav"), ], texts: [ "Hello, this is my first recording.", "Welcome to the world of AI voices.", ], // other optional fields: // visibility: "private", // enhance_audio_quality: true, }); console.log("Model ID:", response._id); ``` Text transcripts must match the exact number of audio files. If you provide 3 audio files, you must provide exactly 3 text transcripts. 
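Because the counts must match exactly, it can be worth verifying them locally before building the upload request. A minimal sketch (the file names and transcripts are just placeholders mirroring the examples above):

```python
# Placeholder sample files and their transcripts
voices = ["hello.mp3", "world.wav"]
texts = [
    "Hello, this is my first recording.",
    "Welcome to the world of AI voices.",
]

# The API expects exactly one transcript per audio file,
# so fail fast on a mismatch before uploading.
if len(texts) != len(voices):
    raise ValueError(
        f"Expected {len(voices)} transcripts, got {len(texts)}"
    )
print(f"OK: {len(voices)} samples, {len(texts)} transcripts")
```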
## Using Your Model Once training is complete: ```python theme={null} # Generate speech with your model response = requests.post( "https://api.fish.audio/v1/tts", json={ "text": "Hello from my custom voice!", "model_id": model_id, "format": "mp3" }, headers={"Authorization": "Bearer YOUR_API_KEY"} ) # Save the audio with open("output.mp3", "wb") as f: f.write(response.content) ``` ```javascript theme={null} import { FishAudioClient } from "fish-audio"; import { writeFile } from "fs/promises"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const audio = await fishAudio.textToSpeech.convert({ text: "Hello from my custom voice!", model_id: "your_model_id_here", format: "mp3", }); const buffer = Buffer.from(await new Response(audio).arrayBuffer()); await writeFile("output.mp3", buffer); console.log("✓ Audio saved to output.mp3"); ``` ## Troubleshooting ### Common Issues **Model training fails:** * Check audio quality and format * Ensure single speaker in all samples * Verify files are not corrupted **Poor voice quality:** * Add more diverse audio samples * Enable audio enhancement * Use higher quality recording ## Best Practices 1. **Start Simple:** Begin with 2-3 samples in fast mode to test 2. **Iterate:** Refine with more samples and quality mode 3. **Document:** Keep track of which samples work best 4. **Test Thoroughly:** Try different texts and emotions 5. **Privacy First:** Keep personal models private ## Support Need help creating models? 
* **API Documentation:** [Full API Reference](/api-reference/introduction) * **Discord Community:** [Join our Discord](https://discord.gg/fish-audio) * **Email Support:** [support@fish.audio](mailto:support@fish.audio) # Emotion Control Source: https://docs.fish.audio/developer-guide/core-features/emotions Add natural emotions and expressions to your AI-generated speech ## Overview Fish Audio models support 64+ emotional expressions and voice styles that can be controlled through text markers in your input. Add natural pauses, laughter, and other human-like elements to make speech more engaging and realistic. The `(parenthesis)` syntax on this page applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. ## How It Works Simply wrap emotion tags in parentheses within your text: ``` (happy) What a beautiful day! (sad) I'm sorry to hear that. (excited) This is amazing news! ``` The TTS models will automatically recognize these markers and adjust the voice accordingly. 
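Since an S1 emotion marker is just text at the start of a sentence, adding one is plain string formatting. A minimal sketch (the `with_emotion` helper is ours for illustration, not part of the SDK):

```python
def with_emotion(emotion: str, sentence: str) -> str:
    """Prefix a sentence with an S1 emotion tag, e.g. '(happy) Hello!'."""
    return f"({emotion}) {sentence}"

# Build a short multi-sentence script, one emotion per sentence
script = "\n".join([
    with_emotion("happy", "What a beautiful day!"),
    with_emotion("sad", "I'm sorry to hear that."),
    with_emotion("excited", "This is amazing news!"),
])
print(script)
```

The resulting `script` can then be passed as the `text` of any TTS request.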
## Complete Emotion Reference ### Basic Emotions (24 expressions) | Emotion | Tag | Description | Example Context | | ----------- | --------------- | ----------------------- | --------------------------- | | Happy | `(happy)` | Cheerful, upbeat tone | Good news, greetings | | Sad | `(sad)` | Melancholic, downcast | Sympathy, bad news | | Angry | `(angry)` | Frustrated, aggressive | Complaints, warnings | | Excited | `(excited)` | Energetic, enthusiastic | Announcements, celebrations | | Calm | `(calm)` | Peaceful, relaxed | Instructions, meditation | | Nervous | `(nervous)` | Anxious, uncertain | Disclaimers, apologies | | Confident | `(confident)` | Assertive, self-assured | Presentations, sales | | Surprised | `(surprised)` | Shocked, amazed | Reactions, discoveries | | Satisfied | `(satisfied)` | Content, pleased | Confirmations, reviews | | Delighted | `(delighted)` | Very pleased, joyful | Celebrations, compliments | | Scared | `(scared)` | Frightened, fearful | Warnings, horror stories | | Worried | `(worried)` | Concerned, troubled | Concerns, questions | | Upset | `(upset)` | Disturbed, distressed | Complaints, problems | | Frustrated | `(frustrated)` | Annoyed, exasperated | Technical issues, delays | | Depressed | `(depressed)` | Very sad, hopeless | Serious topics | | Empathetic | `(empathetic)` | Understanding, caring | Support, counseling | | Embarrassed | `(embarrassed)` | Ashamed, awkward | Apologies, mistakes | | Disgusted | `(disgusted)` | Repelled, revolted | Negative reviews | | Moved | `(moved)` | Emotionally touched | Heartfelt moments | | Proud | `(proud)` | Accomplished, satisfied | Achievements, praise | | Relaxed | `(relaxed)` | At ease, casual | Casual conversation | | Grateful | `(grateful)` | Thankful, appreciative | Thanks, appreciation | | Curious | `(curious)` | Inquisitive, interested | Questions, exploration | | Sarcastic | `(sarcastic)` | Ironic, mocking | Humor, criticism | ### Advanced Emotions (25 expressions) | Emotion | Tag | 
Description | Example Context | | ------------- | ----------------- | ------------------------ | ---------------------- | | Disdainful | `(disdainful)` | Contemptuous, scornful | Criticism, rejection | | Unhappy | `(unhappy)` | Discontent, dissatisfied | Complaints, feedback | | Anxious | `(anxious)` | Very worried, uneasy | Urgent matters | | Hysterical | `(hysterical)` | Uncontrollably emotional | Extreme reactions | | Indifferent | `(indifferent)` | Uncaring, neutral | Neutral responses | | Uncertain | `(uncertain)` | Doubtful, unsure | Speculation, questions | | Doubtful | `(doubtful)` | Skeptical, questioning | Disbelief, questioning | | Confused | `(confused)` | Puzzled, perplexed | Clarification requests | | Disappointed | `(disappointed)` | Let down, dissatisfied | Unmet expectations | | Regretful | `(regretful)` | Sorry, remorseful | Apologies, mistakes | | Guilty | `(guilty)` | Culpable, responsible | Confessions, apologies | | Ashamed | `(ashamed)` | Deeply embarrassed | Serious mistakes | | Jealous | `(jealous)` | Envious, resentful | Comparisons | | Envious | `(envious)` | Wanting what others have | Admiration with desire | | Hopeful | `(hopeful)` | Optimistic about future | Future plans | | Optimistic | `(optimistic)` | Positive outlook | Encouragement | | Pessimistic | `(pessimistic)` | Negative outlook | Warnings, doubts | | Nostalgic | `(nostalgic)` | Longing for the past | Memories, stories | | Lonely | `(lonely)` | Isolated, alone | Emotional content | | Bored | `(bored)` | Uninterested, weary | Disinterest | | Contemptuous | `(contemptuous)` | Showing contempt | Strong criticism | | Sympathetic | `(sympathetic)` | Showing sympathy | Condolences | | Compassionate | `(compassionate)` | Showing deep care | Support, help | | Determined | `(determined)` | Resolved, decided | Goals, commitments | | Resigned | `(resigned)` | Accepting defeat | Giving up, acceptance | ### Tone Markers (5 expressions) Control volume and intensity: | Tone | Tag | 
Description | When to Use | | ---------- | ------------------- | -------------------- | -------------------------- | | Hurried | `(in a hurry tone)` | Rushed, urgent | Time-sensitive information | | Shouting | `(shouting)` | Loud, calling out | Getting attention | | Screaming | `(screaming)` | Very loud, panicked | Emergencies, fear | | Whispering | `(whispering)` | Very soft, secretive | Secrets, quiet scenes | | Soft | `(soft tone)` | Gentle, quiet | Comfort, lullabies | ### Audio Effects (10 expressions) Add natural human sounds: | Effect | Tag | Description | Suggested Text | | ------------- | ----------------- | ---------------------------- | -------------- | | Laughing | `(laughing)` | Full laughter | Ha, ha, ha | | Chuckling | `(chuckling)` | Light laugh | Heh, heh | | Sobbing | `(sobbing)` | Crying heavily | (optional) | | Crying Loudly | `(crying loudly)` | Intense crying | (optional) | | Sighing | `(sighing)` | Exhale of relief/frustration | sigh | | Groaning | `(groaning)` | Sound of frustration | ugh | | Panting | `(panting)` | Out of breath | huff, puff | | Gasping | `(gasping)` | Sharp intake of breath | gasp | | Yawning | `(yawning)` | Tired sound | yawn | | Snoring | `(snoring)` | Sleep sound | zzz | ### Special Effects Additional markers for atmosphere and context: | Effect | Tag | Description | | ------------------- | ----------------------- | ------------------------ | | Audience Laughter | `(audience laughing)` | Crowd laughing sound | | Background Laughter | `(background laughter)` | Ambient laughter | | Crowd Laughter | `(crowd laughing)` | Large group laughing | | Short Pause | `(break)` | Brief pause in speech | | Long Pause | `(long-break)` | Extended pause in speech | You can also use natural expressions like "Ha,ha,ha" for laughter without tags. 
## Usage Guidelines ### Placement Rules **For English and Most Languages:** * Emotion tags MUST go at the beginning of sentences * Tone controls can go anywhere in the text * Sound effects can go anywhere in the text **Correct:** ``` (happy) What a wonderful day! ``` **Incorrect:** ``` What a (happy) wonderful day! ``` ## Advanced Techniques ### Combining Effects You can layer multiple emotions for complex expressions: ``` (sad)(whispering) I miss you so much. (angry)(shouting) Get out of here now! (excited)(laughing) We won! Ha ha! ``` ### Emotion Transitions Create natural emotional progressions: ``` (happy) I got the promotion! (uncertain) But... it means relocating. (sad) I'll miss everyone here. (hopeful) Though it's a great opportunity. (determined) I'm going to make it work! ``` ### Background Effects Add atmospheric sounds: ``` The comedy show was amazing (audience laughing) Everyone was having fun (background laughter) The crowd loved it (crowd laughing) ``` ### Intensity Modifiers Fine-tune emotional intensity with descriptive modifiers: ``` (slightly sad) I'm a bit disappointed. (very excited) This is absolutely amazing! (extremely angry) This is unacceptable! ``` ## Language Support All 13 supported languages can use emotion markers. 
Emotions must be at sentence start for these languages: * **English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese** ## Best Practices ### Do's * Use one primary emotion per sentence * Test different emotion combinations * Match emotions to context logically * Add appropriate text after sound effects (e.g., "Ha ha" after laughing) * Use natural expressions when possible * Space out emotional changes for realism ### Don'ts * Don't overuse emotion tags in short text * Don't mix conflicting emotions * Don't create custom tags - use only supported ones * Don't forget parentheses * Don't place emotion tags mid-sentence in English ## Common Use Cases ### Customer Service ``` (friendly) Hello! How can I help you today? (empathetic) I understand your frustration. (confident) I'll resolve this for you right away. (grateful) Thank you for your patience! ``` ### Storytelling ``` (narrator) Once upon a time... (mysterious)(whispering) The old house stood silent. (scared) "Is anyone there?" she called out. (relieved)(sighing) No one answered. Phew. ``` ### Educational Content ``` (enthusiastic) Welcome to today's lesson! (curious) Have you ever wondered why? (encouraging) That's a great question! (proud) Excellent work! ``` ### Marketing & Sales ``` (excited) Introducing our newest product! (confident) You won't find better quality anywhere. (urgent) Limited time offer! (satisfied) Join thousands of happy customers! ``` ## Troubleshooting ### Emotion Not Working? 1. **Check placement** - Emotions must be at the beginning of sentences for English 2. **Verify spelling** - Tags must match exactly as listed 3. **Include parentheses** - Tags must be wrapped in parentheses ### Unnatural Sound? 
* Space out emotional changes * Use appropriate intensity * Test with different voices * Add context text after sound effects ### Performance Notes * Emotion markers don't count toward token limits * No additional latency for emotion processing * All emotions available on all pricing tiers * Maximum of 3 combined emotions per sentence recommended ## Quick Reference Tables ### Emotion Intensity Scale | Base Emotion | Mild | Moderate | Intense | | ------------ | ------------ | -------- | --------- | | Happy | satisfied | happy | delighted | | Sad | disappointed | sad | depressed | | Angry | frustrated | angry | furious | | Scared | nervous | scared | terrified | | Excited | interested | excited | ecstatic | ### Common Combinations | Scenario | Emotion Combo | Example | | ---------------- | ------------------------ | ------------------------------------- | | Whispered Secret | (mysterious)(whispering) | "I have something to tell you..." | | Angry Shout | (angry)(shouting) | "Stop right there!" | | Sad Sigh | (sad)(sighing) | "I wish things were different. Sigh." | | Excited Laugh | (excited)(laughing) | "We did it! Ha ha!" | | Nervous Question | (nervous)(uncertain) | "Are you sure about this?" | ## See Also * [Emotion Reference Guide](/api-reference/emotion-reference) - Complete emotion list with examples * [API Reference](/api-reference/introduction) - Implementation details * [Text-to-Speech Guide and Best Practices](/developer-guide/core-features/text-to-speech) # Fine-grained Control Source: https://docs.fish.audio/developer-guide/core-features/fine-grained-control Advanced control over speech generation ## Getting Started To use fine-grained control, you can use either our SDK, API, or Playground. SDK/API: We recommend disabling normalization by setting `"normalize": false` in the request body. This ensures that the API doesn't alter the intonation of control tags. Playground: You can use V1.6 Control Model, without setting any other options. 
Disabling normalization may reduce the stability of reading numbers, dates, and URLs. You'll need to handle these cases manually for best results.

## Phoneme Control

Phoneme control allows you to specify exact pronunciations for words or characters. Currently, we support:

* CMU Arpabet (for English)
* Pinyin (for Chinese)

To use phoneme control, wrap the desired pronunciation in `<|phoneme_start|>` and `<|phoneme_end|>` tags. Each tag pair should contain a single word or character.

### English Example

Standard: "I am an engineer."

With phoneme control: "I am an `<|phoneme_start|>EH N JH AH N IH R<|phoneme_end|>`."

### Chinese Example

Standard: "我是一个工程师。"

With phoneme control: "我是一个`<|phoneme_start|>gong1<|phoneme_end|><|phoneme_start|>cheng2<|phoneme_end|><|phoneme_start|>shi1<|phoneme_end|>`。"

## Paralanguage

Paralanguage controls allow you to add natural speech elements and pauses to make the generated speech sound more human-like. There are two main types of controls:

### Pause Words

You can use common pause words like "um", "uh", "嗯", "啊" to control the rhythm of the speech.

### Special Effects

The following special effects can be added using parentheses:

| Effect           | Description        | First Available | Stage        |
| ---------------- | ------------------ | --------------- | ------------ |
| `(break)`        | Short pause        | V1.6            | Experimental |
| `(long-break)`   | Extended pause     | V1.6            | Experimental |
| `(breath)`       | Breathing sound    | V1.6            | Experimental |
| `(laugh)`        | Laughter sound     | V1.6            | Experimental |
| `(cough)`        | Coughing sound     | V1.6            | Experimental |
| `(lip-smacking)` | Lip smacking sound | V1.6            | Experimental |
| `(sigh)`         | Sighing sound      | V1.6            | Experimental |

The effects `(laugh)`, `(cough)`, `(lip-smacking)`, and `(sigh)` are still under development. You may need to repeat them multiple times for better results.

Example:

Standard: "I am an engineer."

With paralanguage: "I am, um, an (break) engineer."
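The phoneme tag pairs are easy to get wrong by hand; a small helper can wrap each pronunciation in its own pair, one word or character at a time, as the rule above requires. A sketch (the `phoneme` helper is ours for illustration, not part of the SDK):

```python
def phoneme(pronunciation: str) -> str:
    """Wrap one word or character's pronunciation in a phoneme tag pair."""
    return f"<|phoneme_start|>{pronunciation}<|phoneme_end|>"

# English: CMU Arpabet for "engineer"
english = f"I am an {phoneme('EH N JH AH N IH R')}."

# Chinese: one tag pair per character, in Pinyin with tone numbers
chinese = "我是一个" + "".join(phoneme(p) for p in ["gong1", "cheng2", "shi1"]) + "。"

print(english)
print(chinese)
```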
# Speech to Text Guide Source: https://docs.fish.audio/developer-guide/core-features/speech-to-text Convert audio recordings into accurate text transcriptions ## Overview Transform any audio recording into text with Fish Audio's speech recognition. Perfect for transcriptions, subtitles, and voice commands. ## Getting Started ### Web Interface Transcribe audio instantly: Go to [fish.audio](https://fish.audio) and log in Click on "Speech to Text" in your dashboard Select your audio file (MP3, WAV, M4A) Click "Transcribe" and copy your text ## Supported Formats ### Audio Files **Accepted formats:** * MP3 (recommended) * WAV * M4A * OGG * FLAC * AAC **File requirements:** * Maximum size: 20MB * Maximum duration: 60 minutes * Minimum duration: 1 second ## Language Support ### Automatic Detection The system automatically detects the language spoken in your audio. No configuration needed! ### Manual Selection For better accuracy, specify the language: **Major Languages:** * English (en) * Chinese (zh) * Japanese (ja) With **additional languages** to be supported soon! ## Audio Quality Tips ### For Best Results **Recording Environment:** * Quiet room with minimal echo * No background music * Clear, consistent speaking voice * One speaker at a time **Audio Settings:** * Sample rate: 16kHz or higher * Bit rate: 128kbps or higher * Mono or stereo (mono preferred) ### Common Issues **Poor transcription quality?** * Remove background noise * Increase microphone volume * Speak clearly and not too fast * Avoid multiple speakers talking over each other ## Use Cases ### Meeting Transcription Convert recorded meetings into searchable text: 1. Record your meeting (Zoom, Teams, etc.) 2. Export the audio file 3. Upload to Fish Audio 4. 
Get formatted transcription with timestamps ### Podcast Transcripts Create written versions of your podcasts: * Generate show notes automatically * Create searchable content * Improve accessibility * Enable translations ### Video Subtitles Generate subtitles for your videos: 1. Extract audio from video 2. Transcribe with Fish Audio 3. Get timestamped text 4. Import into video editor ### Voice Notes Convert voice memos to text: * Dictate ideas quickly * Transcribe later for editing * Search through voice notes * Share as text documents ## Advanced Features ### Timestamps Get precise timing for each spoken segment: ``` [00:00:00] Welcome to our podcast. [00:00:03] Today we're discussing AI technology. [00:00:07] Let's dive right in. ``` Perfect for: * Creating subtitles * Navigating long recordings * Synchronizing with video * Building searchable archives ### Speaker Detection Identify different speakers in conversations: ``` Speaker 1: "What do you think about the proposal?" Speaker 2: "I think it has potential." Speaker 1: "Let's discuss the details." 
``` ### Punctuation & Formatting Automatic formatting includes: * Sentence capitalization * Punctuation marks * Paragraph breaks * Number formatting ## Tips for Different Content ### Interviews **Best practices:** * Use a good microphone for each speaker * Record in a quiet environment * Speak one at a time * Keep consistent volume levels ### Lectures & Presentations **Optimize for:** * Clear articulation of technical terms * Pause between topics * Repeat important points * Avoid reading too fast ### Phone Calls **Considerations:** * Phone audio is lower quality * Expect slightly lower accuracy * Speak clearly and slowly * Avoid speakerphone if possible ## Accuracy Expectations ### What Affects Accuracy **Positive factors:** * Clear audio quality * Native speaker accent * Common vocabulary * Single speaker **Challenging factors:** * Heavy accents * Technical jargon * Multiple speakers * Background noise ### Typical Accuracy Rates * **Professional recording:** 95-98% * **Clean amateur recording:** 90-95% * **Phone/video calls:** 85-90% * **Noisy environments:** 75-85% ## Post-Processing Tips ### Editing Transcriptions After transcription: 1. **Review for accuracy** - Check names and technical terms 2. **Add formatting** - Break into paragraphs 3. **Correct errors** - Fix any misheard words 4. 
**Add context** - Include speaker names ### Export Options Save your transcriptions as: * Plain text (.txt) * Word document (.docx) * Subtitle file (.srt) * PDF document ## Common Applications ### Business * Meeting minutes * Interview transcripts * Call recordings * Training materials ### Education * Lecture notes * Research interviews * Student recordings * Language learning ### Content Creation * Video scripts * Podcast show notes * Social media captions * Blog post drafts ### Accessibility * Hearing impaired support * Multi-language content * Searchable archives * Documentation ## Troubleshooting ### No Text Output **Check:** * Audio file isn't corrupted * File format is supported * Audio contains speech * Volume is audible ### Incorrect Language **Solutions:** * Manually select the correct language * Ensure majority of audio is in one language * Separate multi-language content ### Missing Words **Common causes:** * Speaking too fast * Mumbling or unclear speech * Technical terms not recognized * Very quiet sections ## Privacy & Security ### Your Data * Audio files are processed securely * Transcriptions are private to your account * Files are not used for training * Delete anytime from your account ### Sensitive Content For confidential audio: * Use on-premise solutions if available * Review privacy policy * Consider redacting sensitive information * Download and delete after processing ## Best Practices Summary 1. **Start with quality audio** - Good input = good output 2. **Choose the right environment** - Quiet spaces work best 3. **Speak clearly** - Articulate and consistent pace 4. **Review and edit** - All transcriptions benefit from review 5. **Use appropriate tools** - Different content needs different approaches ## Get Support Need help with transcription? 
* **Try it free:** [fish.audio](https://fish.audio)
* **Community:** [Discord](https://discord.gg/fish-audio)
* **Email:** [support@fish.audio](mailto:support@fish.audio)
* **Status:** [status.fish.audio](https://status.fish.audio)

# Text to Speech

Source: https://docs.fish.audio/developer-guide/core-features/text-to-speech

Convert text to natural-sounding speech with Fish Audio

## Overview

Transform any text into natural, expressive speech using Fish Audio's advanced TTS models. Choose from pre-made voices or use your own cloned voices. Discover the world's best cloned voice models on our [Discovery](https://fish.audio/discovery) page.

## Quick Start

### Web Interface

The easiest way to generate speech:

1. Go to [fish.audio](https://fish.audio) and log in
2. Type or paste the text you want to convert
3. Select from available voices or use your own
4. Click "Generate" and download your audio

## Using the SDK

```bash theme={null}
pip install fish-audio-sdk
```

Generate speech with just a few lines of code:

```python theme={null}
from fishaudio import FishAudio
from fishaudio.utils import save

# Initialize client
client = FishAudio(api_key="your_api_key_here")

# Generate speech
audio = client.tts.convert(
    text="Hello, world!",
    reference_id="your_voice_model_id"
)

save(audio, "output.mp3")
print("✓ Audio saved to output.mp3")
```

```bash theme={null}
npm install fish-audio
```

Generate speech with just a few lines of code:

```javascript theme={null}
import { FishAudioClient } from "fish-audio";
import { writeFile } from "fs/promises";

// Initialize client
const fishAudio = new FishAudioClient({ apiKey: "your_api_key_here" });

const audio = await fishAudio.textToSpeech.convert({
  text: "Hello, world!",
  reference_id: "your_voice_model_id",
});

const buffer = Buffer.from(await new Response(audio).arrayBuffer());
await writeFile("output.mp3", buffer);
console.log("✓ Audio saved to output.mp3");
```

## Voice Options

### Using Pre-made Voices

Browse and select voices from the
playground:

```python theme={null}
# Use a voice from the playground
audio = client.tts.convert(
    text="Welcome to Fish Audio!",
    reference_id="7f92f8afb8ec43bf81429cc1c9199cb1"
)
```

```javascript theme={null}
// Use a voice from the playground
const audio = await fishAudio.textToSpeech.convert({
  text: "Welcome to Fish Audio!",
  reference_id: "7f92f8afb8ec43bf81429cc1c9199cb1",
});
```

### Using Your Cloned Voice

Use voices you've created:

```python theme={null}
# Use your own cloned voice
audio = client.tts.convert(
    text="This is my custom voice speaking",
    reference_id="your_model_id"
)
```

```javascript theme={null}
// Use your own cloned voice
const audio = await fishAudio.textToSpeech.convert({
  text: "This is my custom voice speaking",
  reference_id: "your_model_id",
});
```

### Using Reference Audio

Provide reference audio directly:

```python theme={null}
from fishaudio.types import ReferenceAudio

# Use reference audio on-the-fly
with open("voice_sample.wav", "rb") as f:
    audio = client.tts.convert(
        text="Hello from reference audio",
        references=[
            ReferenceAudio(
                audio=f.read(),
                text="Sample text from the audio"
            )
        ]
    )
```

```javascript theme={null}
import { readFile } from "fs/promises";

// Use reference audio on-the-fly
const fileBuffer = await readFile("voice_sample.wav");
const voiceFile = new File([fileBuffer], "voice_sample.wav");

const audio = await fishAudio.textToSpeech.convert({
  text: "Hello from reference audio",
  references: [
    { audio: voiceFile, text: "Sample text from the audio" }
  ]
});
```

## Model Selection

Choose the right model for your needs:

| Model      | Best For        | Quality   | Speed   |
| ---------- | --------------- | --------- | ------- |
| **s1**     | Prototyping     | Excellent | Fast    |
| **s2-pro** | Latest features | Excellent | Fastest |

Specify a model in your request:

```python theme={null}
# Using the latest model (default)
audio = client.tts.convert(text="Hello world")
```

```javascript theme={null}
// Using the latest S2-Pro model
const audio = await fishAudio.textToSpeech.convert(
  { text:
"Hello world" }, "s2-pro" ); ``` ## Advanced Options ### Audio Formats Choose your output format: ```python theme={null} audio = client.tts.convert( text="Your text here", format="mp3", # Options: "mp3", "wav", "pcm", "opus" mp3_bitrate=128 # For MP3: 64, 128, or 192 ) ``` ```javascript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Your text here", format: "mp3", // Options: "mp3", "wav", "pcm", "opus" mp3_bitrate: 128, // For MP3: 64, 128, or 192 }); ``` ### Chunk Length Control text processing chunks: ```python theme={null} audio = client.tts.convert( text="Long text content...", chunk_length=200 # 100-300 characters per chunk ) ``` ```javascript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Long text content...", chunk_length: 200, // 100-300 characters per chunk }); ``` ### Latency Mode Optimize for speed or quality: ```python theme={null} audio = client.tts.convert( text="Quick response needed", latency="balanced" # "normal" or "balanced" ) ``` ```javascript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Quick response needed", latency: "balanced", // "normal" or "balanced" }); ``` Balanced mode reduces latency to \~300ms but may slightly decrease stability. 
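If your source text is much longer than a single chunk, you may prefer to split it client-side before sending requests, so each call stays within the recommended 100-300 character range. The sketch below is plain Python with no SDK dependency; `split_for_tts` is a hypothetical helper name, not part of the Fish Audio SDK:

```python
import re

def split_for_tts(text, max_chars=200):
    """Split text into sentence-aligned chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole rather
    than cut mid-word.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Current chunk is full; start a new one at a sentence boundary
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be passed to `client.tts.convert()` in turn; keeping chunks sentence-aligned preserves natural pauses at chunk boundaries.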
## Direct API Usage For direct API calls without the SDK: ```python theme={null} import httpx import ormsgpack # Prepare request request_data = { "text": "Hello, world!", "reference_id": "your_model_id", "format": "mp3" } # Make API call with httpx.Client() as client: response = client.post( "https://api.fish.audio/v1/tts", content=ormsgpack.packb(request_data), headers={ "authorization": "Bearer YOUR_API_KEY", "content-type": "application/msgpack", "model": "s2-pro" } ) # Save audio with open("output.mp3", "wb") as f: f.write(response.content) ``` ```javascript theme={null} import { encode } from "@msgpack/msgpack"; import { writeFile } from "fs/promises"; const body = encode({ text: "Hello, world!", reference_id: "your_model_id", format: "mp3", }); const res = await fetch("https://api.fish.audio/v1/tts", { method: "POST", headers: { Authorization: "Bearer YOUR_API_KEY", "Content-Type": "application/msgpack", model: "s2-pro", }, body, }); const buffer = Buffer.from(await res.arrayBuffer()); await writeFile("output.mp3", buffer); ``` ## Streaming Audio Stream audio for real-time applications: ```python theme={null} # Stream audio chunks audio_stream = client.tts.stream( text="Streaming this text in real-time", reference_id="model_id" ) with open("stream_output.mp3", "wb") as f: for chunk in audio_stream: f.write(chunk) # Process chunk immediately for real-time playback ``` ```javascript theme={null} // Use a WebSocket to stream real-time audio import { FishAudioClient, RealtimeEvents } from "fish-audio"; import { writeFile } from "fs/promises"; import path from "path"; // Simple async generator that yields text chunks async function* makeTextStream() { const chunks = [ "Hello from Fish Audio! ", "This is a realtime text-to-speech test. 
", "We are streaming multiple chunks over WebSocket.", ]; for (const chunk of chunks) { yield chunk; } } const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); // For realtime, set text to "" and stream the content via makeTextStream const request = { text: "" }; const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); // Collect audio and write to a file when the stream ends const chunks = []; connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened")); connection.on(RealtimeEvents.AUDIO_CHUNK, (audio) => { if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) { chunks.push(Buffer.from(audio)); } }); connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err)); connection.on(RealtimeEvents.CLOSE, async () => { const outPath = path.resolve(process.cwd(), "out.mp3"); await writeFile(outPath, Buffer.concat(chunks)); console.log("Saved to", outPath); }); ``` ## Adding Emotions The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. Make your speech more expressive: ```python theme={null} # Add emotion markers to your text emotional_text = """ (excited) I just won the lottery! (sad) But then I lost the ticket. (laughing) Just kidding, I found it! """ audio = client.tts.convert( text=emotional_text, reference_id="model_id" ) ``` ```javascript theme={null} // Add emotion markers to your text const emotionalText = `(excited) I just won the lottery! (sad) But then I lost the ticket. 
(laughing) Just kidding, I found it!`; const audio = await fishAudio.textToSpeech.convert({ text: emotionalText, reference_id: "model_id", }); ``` Available emotions: * Basic: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)` * Tones: `(shouting)`, `(whispering)`, `(soft tone)` * Effects: `(laughing)`, `(sighing)`, `(crying)` For more precise control over pronunciation and additional paralanguage features like pauses and breathing, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control). ## Best Practices ### Text Preparation **Do:** * Use proper punctuation for natural pauses * Add emotion markers for expression * Break long texts into paragraphs * Use consistent formatting **Don't:** * Use ALL CAPS (unless shouting) * Mix multiple languages randomly * Include special characters unnecessarily * Forget punctuation ### Performance Tips 1. **Batch Processing:** Process multiple texts efficiently 2. **Cache Models:** Store frequently used model IDs 3. **Optimize Chunk Size:** Use 200 characters for best balance 4. 
**Handle Errors:** Implement retry logic for network issues ### Quality Optimization For best results: * Use high-quality reference audio for cloning * Choose appropriate emotion markers * Test different latency modes * Monitor API rate limits ## Troubleshooting ### Common Issues **No audio output:** * Check API key validity * Verify model ID exists * Ensure proper audio format **Poor quality:** * Use better reference audio * Try normal latency mode * Check text formatting **Slow generation:** * Use balanced latency mode * Reduce chunk length * Check network connection ## Code Examples ### Batch Processing ```python theme={null} from fishaudio.utils import save texts = [ "First announcement", "Second announcement", "Third announcement" ] for i, text in enumerate(texts): audio = client.tts.convert( text=text, reference_id="model_id" ) save(audio, f"output_{i}.mp3") ``` ```javascript theme={null} const texts = [ "First announcement", "Second announcement", "Third announcement", ]; for (let i = 0; i < texts.length; i++) { const audio = await fishAudio.textToSpeech.convert({ text: texts[i], reference_id: "model_id", }); const buffer = Buffer.from(await new Response(audio).arrayBuffer()); await writeFile(`output_${i}.mp3`, buffer); } ``` ### Error Handling ```python theme={null} import time from fishaudio.exceptions import FishAudioError def generate_with_retry(text, max_retries=3): for attempt in range(max_retries): try: audio = client.tts.convert( text=text, reference_id="model_id" ) return audio except FishAudioError as e: if attempt < max_retries - 1: time.sleep(2 ** attempt) # Exponential backoff else: raise e ``` ```javascript theme={null} async function generateWithRetry(text, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { const audio = await fishAudio.textToSpeech.convert({ text, reference_id: "model_id", }); const buffer = Buffer.from(await new Response(audio).arrayBuffer()); return buffer; } catch (err) { if (attempt < 
maxRetries - 1) { const delayMs = 2 ** attempt * 1000; await new Promise((r) => setTimeout(r, delayMs)); } else { throw err; } } } } const buffer = await generateWithRetry("Hello with retry"); await writeFile("retry_output.mp3", buffer); ``` ## API Reference ### Request Parameters | Parameter | Type | Description | Default | | ----------------- | ------- | -------------------- | -------- | | **text** | string | Text to convert | Required | | **reference\_id** | string | Model/voice ID | None | | **format** | string | Audio format | "mp3" | | **chunk\_length** | integer | Characters per chunk | 200 | | **normalize** | boolean | Normalize text | true | | **latency** | string | Speed vs quality | "normal" | ### Response Returns audio data in the specified format as a binary stream. ## Get Support Need help with text-to-speech? * [API Reference](/api-reference/introduction) * **Discord Community:** [Join our Discord](https://discord.gg/fish-audio) * **Email Support:** [support@fish.audio](mailto:support@fish.audio) # Changelog Source: https://docs.fish.audio/developer-guide/getting-started/changelog Complete release history and version updates for all Fish Audio products ## Fish Audio S2 Next-generation text-to-speech model with inline emotion cues, multi-speaker dialogue support, and 80+ languages. S2 introduces `[bracket]` syntax for natural language control over emotion and paralinguistic cues (e.g., `[whisper]`, `[laugh]`, `[emphasis]`). Tags are treated as standard text rather than dedicated control tokens, so you are not limited to a fixed set of expressions. Built on the Qwen3-4B backbone and fully open-source. Use model ID `s2-pro` in the API. S1 remains supported for existing integrations. [GitHub](https://github.com/fishaudio/fish-speech) | [HuggingFace](https://huggingface.co/fishaudio) ## Fish Audio S1 Historic rebrand from Fish Speech to Fish Audio. #1 ranking on TTS-Arena2 with industry-leading performance. 
S1 (4B params): 0.008 WER, 0.004 CER - Available on Fish Audio Playground S1-mini (0.5B params): 0.011 WER, 0.005 CER - Open source on Hugging Face 64+ emotional expressions with RLHF integration and multilingual support for English, Chinese, Japanese, and more. [Read More about S1](https://fish.audio/blog/introducing-s1/) ## v1.5.1 Fixed critical PyTorch security settings and improved inference speed significantly. Added ONNX export support for better deployment options and enhanced text processing for Arabic and Hebrew languages. Includes bug fixes for Apple Silicon (MPS) compatibility and reorganized library structure for cleaner codebase. ## v1.5.0 Introduced v1.5 model architecture with improved dataset handling and bearer token authentication for APIs. Added reference audio caching by hash for faster performance and better Apple Silicon support. Includes OpenAPI documentation refactoring and base64 reference data support in JSON format. ## v1.4.3 Introduced Fish Agent for conversational AI with streaming capabilities and real-time interactions. Added comprehensive Korean language documentation and fixed critical non-English speech issues. Improved WebUI streaming functionality and PyTorch version compatibility. ## v1.4.2 Documentation-focused release with comprehensive updates for v1.4, macOS support, and multiple language translations. Improved Docker support and API enhancements for JSON format handling. Added audio selection to WebUI and fixed various stability issues including cache handling and backend performance. ## v1.4.1 Infrastructure improvements focused on Docker optimization and multi-platform builds. Updated PyTorch version and replaced audio backend from sox for better performance. Enhanced CI/CD pipeline with buildx support and fixed various Docker-related issues. ## v1.4.0 Major release with new VQGAN architecture for improved audio quality and faster inference. Updated WebUI with enhanced interface and better language switching. 
Added Japanese documentation translation and fixed inference warmup issues for better performance. ## v1.2.1 Replaced Whisper with SenseVoice for better ASR and added native Apple Silicon support. Includes Portuguese (Brazil) localization, streaming audio functionality, and CPU-only inference improvements. Pinned PyTorch to 2.3.1 to fix inference speed issues and aligned API with official closed-source version. ## v1.2 Introduced auto-reranking system for better results along with bilingual support and model quantization. Replaced standard Whisper with Faster Whisper for improved speed and added Japanese documentation. Enhanced model stability and inference performance with optimized v1.2 architecture. ## v1.1.2 Minor release adding Chinese text normalization support and a streaming audio download button in the WebUI. Fixed LoRA merging issues and improved Firefly performance. ## v1.1.1 Breaking changes: Replaced zibai with uvicorn for API server, new text-splitter with byte-based length calculation, and license change to CC-BY-NC-SA 4.0. Added Apple Silicon (MPS) support, Windows one-click installation, and automatic model downloading with resume capability. Improved WebUI with better file selection and download progress indicators. ## v1.1.0 Added VITS decoder integration with full streaming support and queue management for real-time audio generation. Introduced internationalization (i18n) with Spanish translation and improved Windows packaging. Optimized GPU memory usage and CPU-only inference performance while adding LoRA support to the Gradio UI. ## v1.0.0 Major milestone release introducing new VQ-GAN architecture with VITS decoder support, LoRA fine-tuning, and streaming inference capabilities. Breaking changes include removal of the Rust-based data server, new tokenizer replacing phonemizer, and updated model architecture (VQ + DiT + Reflow). Achieved 4x memory reduction during loading and added WebUI for training and annotation. 
## v0.2.0 First public release of Fish Speech featuring a complete text-to-speech pipeline with VQ-GAN audio codec and LLAMA-based language model. Includes multi-language support (Chinese, English, Japanese), Gradio WebUI for inference, HTTP API server, and Docker support. Added special optimizations for Chinese users including mirror downloads and localized documentation. # Overview of Fish Audio Source: https://docs.fish.audio/developer-guide/getting-started/introduction Discover Fish Audio's powerful voice generation platform and what you can build ## What is Fish Audio? Fish Audio is a cutting-edge AI platform for voice generation, voice cloning, and audio storytelling. Our technology brings dynamic, natural-sounding voices to your applications, enabling immersive experiences across industries. Introducing our latest generation voice models: **Fish Audio S2-Pro:** Our latest model delivers unparalleled naturalness and emotion, setting a new standard for AI-generated speech. [Learn more about our models →](/developer-guide/models-pricing/models-overview) ## Core Capabilities Generate natural, expressive speech from text in multiple languages and styles Create custom voice models from as little as 15 seconds of audio Build multi-character narratives with emotion and dynamic voice switching ## Try It Now Test our voices in the interactive playground - no code required Browse available voice models and their capabilities ## Ready to Start? Get your API key and make your first API call in minutes. Generate your first AI voice in under 5 minutes ## Platform Capabilities Fish Audio empowers developers to create innovative voice experiences across diverse industries. Whether you're building consumer apps, enterprise solutions, or creative tools, our platform provides the flexibility and power you need. 
### What You Can Build Automate podcast production, YouTube narration, and audiobook generation Create dynamic NPC dialogue and real-time character voices Build interactive language learning tools and accessible educational content Deploy natural-sounding IVR systems and support agents Develop screen readers and voice restoration tools Generate ASMR content, music vocals, interactive stories, and adult content ### Key Features Stream audio in real-time for live applications Industry-leading naturalness and clarity Generate speech in 30+ languages Fine-tune prosody, emotion, and speaking style RESTful API with SDKs for Python, Node.js, and more Handle everything from prototypes to production workloads ## Learn More * [Models & Pricing](/developer-guide/models-pricing/models-overview) - Explore voice models and pricing options * [Core Features](/developer-guide/core-features/text-to-speech) - Deep dive into TTS and voice cloning * [SDKs & Tools](/developer-guide/sdk-guide/python/installation) - Install language-specific libraries * [Best Practices](/developer-guide/best-practices/voice-cloning) - Production-ready tips and optimization for voice cloning, emotion and expression control, and real-time voice streaming # Quick Start Source: https://docs.fish.audio/developer-guide/getting-started/quickstart Generate your first AI voice with Fish Audio in under 5 minutes ## Overview This guide will walk you through generating your first text-to-speech audio with Fish Audio. By the end, you'll have converted text into natural-sounding speech using our API. ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account and complete the verification steps. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. 
Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. ## Your First TTS Request Choose your preferred method to generate speech: Store your API key as an environment variable (recommended approach): ```bash theme={null} export FISH_API_KEY="replace_me" ``` Run this [cURL](https://curl.se/) command to generate your first speech: ```bash theme={null} curl -X POST https://api.fish.audio/v1/tts \ -H "Authorization: Bearer $FISH_API_KEY" \ -H "Content-Type: application/json" \ -H "model: s2-pro" \ -d '{ "text": "Hello! Welcome to Fish Audio. This is my first AI-generated voice.", "format": "mp3" }' \ --output welcome.mp3 ``` The audio has been saved as `welcome.mp3`. You can play it by: * Double-clicking the file or opening it in any media player * Or using the command line: ```bash theme={null} # On macOS afplay welcome.mp3 # On Linux mpg123 welcome.mp3 # On Windows start welcome.mp3 ``` ```bash theme={null} pip install fish-audio-sdk ``` Create a Python script: ```python theme={null} from fishaudio import FishAudio from fishaudio.utils import save # Initialize with your API key client = FishAudio(api_key="your_api_key_here") # Generate speech audio = client.tts.convert(text="Hello! Welcome to Fish Audio.") save(audio, "welcome.mp3") print("✓ Audio saved to welcome.mp3") ``` ```bash theme={null} python generate_speech.py ``` The audio has been saved as `welcome.mp3`. 
You can play it by: * Double-clicking the file or opening it in any media player * Or using the command line: ```bash theme={null} # On macOS afplay welcome.mp3 # On Linux mpg123 welcome.mp3 # On Windows start welcome.mp3 ``` ```bash theme={null} npm install fish-audio ``` Create a JavaScript script: ```javascript theme={null} import { FishAudioClient } from "fish-audio"; import { writeFile } from "fs/promises"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const audio = await fishAudio.textToSpeech.convert({ text: "Hello, world!", }); const buffer = Buffer.from(await new Response(audio).arrayBuffer()); await writeFile("welcome.mp3", buffer); console.log("✓ Audio saved to welcome.mp3"); ``` ```bash theme={null} node generate_speech.mjs ``` The audio has been saved as `welcome.mp3`. You can play it by: * Double-clicking the file or opening it in any media player * Or using the command line: ```bash theme={null} # On macOS afplay welcome.mp3 # On Linux mpg123 welcome.mp3 # On Windows start welcome.mp3 ``` ## Customizing Your Voice The examples above use the default voice. To use a different voice, add the `reference_id` parameter with a model ID from [fish.audio](https://fish.audio). You can find the model ID in the URL or use the copy button when viewing any voice. Choose a voice to try: From: [https://fish.audio/m/8ef4a238714b45718ce04243307c57a7](https://fish.audio/m/8ef4a238714b45718ce04243307c57a7) ```bash theme={null} export REFERENCE_ID="8ef4a238714b45718ce04243307c57a7" ``` From: [https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89](https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89) ```bash theme={null} export REFERENCE_ID="802e3bc2b27e49c2995d23ef70e6ac89" ``` Then generate speech with your chosen voice: ```bash theme={null} curl -X POST https://api.fish.audio/v1/tts \ -H "Authorization: Bearer $FISH_API_KEY" \ -H "Content-Type: application/json" \ -H "model: s2" \ -d '{ "text": "This is a custom voice from Fish Audio! 
You can explore hundreds of different voices on the platform, or even create your own.", "reference_id": "'"$REFERENCE_ID"'", "format": "mp3" }' \ --output custom_voice.mp3 ``` ```python theme={null} import os from fishaudio import FishAudio from fishaudio.utils import save client = FishAudio(api_key="your_api_key_here") # Generate speech with custom voice audio = client.tts.convert( text="This is a custom voice from Fish Audio! You can explore hundreds of different voices on the platform, or even create your own.", reference_id=os.environ.get("REFERENCE_ID") ) save(audio, "custom_voice.mp3") ``` ```javascript theme={null} import { FishAudioClient } from "fish-audio"; import { writeFile } from "fs/promises"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const audio = await fishAudio.textToSpeech.convert({ text: "This is a custom voice from Fish Audio! You can explore hundreds of different voices on the platform, or even create your own.", reference_id: process.env.REFERENCE_ID, }); const buffer = Buffer.from(await new Response(audio).arrayBuffer()); await writeFile("custom_voice.mp3", buffer); console.log("✓ Audio saved to custom_voice.mp3"); ``` ## Support Need help? 
Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # LiveKit Source: https://docs.fish.audio/developer-guide/integrations/livekit Build real-time voice AI agents with Fish Audio and LiveKit [LiveKit Agents](https://github.com/livekit/agents) is an open source framework for building real-time voice and multimodal AI agents. It handles streaming audio pipelines, turn detection, interruptions, and LLM orchestration so you can focus on your agent's behavior. Fish Audio integrates with LiveKit through the `fishaudio` plugin, providing text-to-speech synthesis with support for both chunked and real-time WebSocket streaming modes. 
## Prerequisites * A [Fish Audio account](https://fish.audio) with an API key * Python 3.9 or higher ## Installation Install LiveKit Agents with Fish Audio support: ```bash theme={null} pip install "livekit-agents[fishaudio]" ``` ## Configuration Set your Fish Audio API key as an environment variable: ```bash theme={null} export FISH_API_KEY=your_api_key_here ``` ## Basic usage Add Fish Audio TTS to your LiveKit agent: ```python theme={null} from livekit.plugins.fishaudio import TTS tts = TTS( reference_id="your_voice_model_id", # Optional: use a specific voice model="s1", sample_rate=24000, latency_mode="balanced" ) ``` ### Key parameters | Parameter | Description | | --------------- | ------------------------------------------------------------------------- | | `api_key` | Your Fish Audio API key (or use `FISH_API_KEY` env var) | | `model` | TTS model/backend to use (default: `s1`) | | `reference_id` | Voice model ID from the [Fish Audio library](https://fish.audio/discover) | | `output_format` | Audio format: `pcm`, `mp3`, `wav`, or `opus` (default: `pcm`) | | `sample_rate` | Audio sample rate in Hz (default: `24000`) | | `num_channels` | Number of audio channels (default: `1`) | | `base_url` | Custom API endpoint (default: `https://api.fish.audio`) | | `latency_mode` | `normal` (\~500ms) or `balanced` (\~300ms, default) | ### Streaming modes The plugin supports two synthesis modes: ```python theme={null} # Chunked (non-streaming) synthesis stream = tts.synthesize("Hello, world!") # Real-time WebSocket streaming stream = tts.stream() ``` ## Resources * [LiveKit Agents Documentation](https://docs.livekit.io/agents/) * [LiveKit GitHub](https://github.com/livekit/agents) * [Fish Audio Plugin Reference](https://docs.livekit.io/reference/python/v1/livekit/plugins/fishaudio/index.html) * [Fish Audio Voice Library](https://fish.audio/discovery) # n8n Source: https://docs.fish.audio/developer-guide/integrations/n8n Automate workflows with Fish Audio and n8n 
[n8n](https://n8n.io/) is a fair-code licensed workflow automation platform. The Fish Audio community node brings text-to-speech, speech-to-text, and voice cloning capabilities to your n8n workflows. ## Installation Install from n8n community nodes: 1. Go to **Settings** > **Community Nodes** 2. Select **Install** 3. Enter `n8n-nodes-fishaudio` 4. Accept the risks and install See the [n8n community nodes guide](https://docs.n8n.io/integrations/community-nodes/installation/) for details. ## Configuration 1. Go to **Credentials** > **Add Credential** 2. Search for "Fish Audio API" 3. Enter your API key from [fish.audio/app/api-keys](https://fish.audio/app/api-keys) ## Features The node supports: * **Text-to-Speech** — Generate audio from text using any voice model * **Speech-to-Text** — Transcribe audio files * **Voice Models** — List, create, and manage custom voices * **Account** — Check credit balance The node is also available as an AI tool for use with n8n's AI Agent nodes. ## Resources * [npm package](https://www.npmjs.com/package/n8n-nodes-fishaudio) * [GitHub](https://github.com/fishaudio/fish-audio-n8n) * [n8n Community Nodes](https://docs.n8n.io/integrations/community-nodes/) # Pipecat Source: https://docs.fish.audio/developer-guide/integrations/pipecat Build voice AI agents with Fish Audio and Pipecat [Pipecat](https://github.com/pipecat-ai/pipecat) is an open source framework for building voice and multimodal conversational AI. It handles the orchestration of audio, AI services, and conversation pipelines so you can focus on what makes your agent unique. Fish Audio integrates with Pipecat through `FishAudioTTSService`, which provides real-time text-to-speech synthesis using WebSocket streaming for low-latency conversational applications. 
## Prerequisites * A [Fish Audio account](https://fish.audio) with an API key * Python 3.9 or higher ## Installation Install Pipecat with Fish Audio support: ```bash theme={null} pip install "pipecat-ai[fish]" ``` ## Configuration Set your Fish Audio API key as an environment variable: ```bash theme={null} export FISH_API_KEY=your_api_key_here ``` ## Basic usage Add `FishAudioTTSService` to your Pipecat pipeline: ```python theme={null} import os from pipecat.services.fish import FishAudioTTSService tts = FishAudioTTSService( api_key=os.getenv("FISH_API_KEY"), reference_id="your_voice_model_id", # Optional: use a specific voice model_id="s1", params=FishAudioTTSService.InputParams( latency="normal", prosody_speed=1.0 ) ) ``` ### Key parameters | Parameter | Description | | --------------- | ------------------------------------------------------------------------- | | `api_key` | Your Fish Audio API key | | `reference_id` | Voice model ID from the [Fish Audio library](https://fish.audio/discover) | | `model_id` | TTS model version (default: `s1`) | | `output_format` | Audio format: `pcm`, `mp3`, `wav`, or `opus` | ### Prosody controls Customize speech characteristics with `InputParams`: ```python theme={null} params=FishAudioTTSService.InputParams( latency="balanced", # "normal" or "balanced" prosody_speed=1.2, # 0.5 to 2.0 prosody_volume=0, # Volume adjustment in dB normalize=True # Audio normalization ) ``` ## Resources * [Pipecat Documentation](https://docs.pipecat.ai/server/services/tts/fish) * [Pipecat GitHub](https://github.com/pipecat-ai/pipecat) * [Fish Audio Voice Library](https://fish.audio/discovery) # Choosing a Model Source: https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model Select the right Fish Audio model for your use case and requirements We recommend using **Fish Audio S2-Pro** for all projects - our flagship model with industry-leading quality and performance. ## Support Need help? 
Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # Model Deprecations Source: https://docs.fish.audio/developer-guide/models-pricing/deprecations Track deprecated models and migration timelines for Fish Audio services ## Available Models Currently available models: * **Fish Audio S2** (Recommended) - Latest generation with best performance * **Fish Audio S1** - Highly expressive and natural sounding ## Deprecated Models * **speech-1.6** - Fish Speech v1.6 was deprecated on February 28, 2026 * **speech-1.5** - Fish Speech v1.5 was deprecated on February 28, 2026 We strongly recommend using **Fish Audio S2** for all new projects to access the latest capabilities and performance improvements. ## Support Need help? 
Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # Models Overview Source: https://docs.fish.audio/developer-guide/models-pricing/models-overview Explore Fish Audio's voice generation models and their capabilities ## Available Models Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements. ### Recommended Model **Fish Audio S2-Pro** - Our next-generation TTS model with best-in-class performance * Natural language control with `[bracket]` syntax — not limited to a fixed set (e.g., `[whispers sweetly]`, `[laughing nervously]`) * Multi-speaker dialogue support **(S2-Pro exclusive)** * 80+ languages * 100ms time-to-first-audio * Full SGLang-based serving stack * Open-source We recommend using `s2-pro` for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations. 
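Because S2-Pro treats bracket cues as ordinary text, composing expressive input is plain string manipulation; no special API is needed. A minimal illustrative sketch (the `with_cues` helper is hypothetical, not part of any Fish Audio SDK):

```python
def with_cues(*parts):
    """Join (cue, text) pairs into S2-Pro-style input text.

    Each part is a (cue, text) tuple; pass cue=None for plain narration.
    """
    segments = []
    for cue, text in parts:
        # Free-form cues are wrapped in brackets inline with the text
        segments.append(f"[{cue}] {text}" if cue else text)
    return " ".join(segments)

line = with_cues(
    ("whispers sweetly", "I have a secret."),
    (None, "But you have to promise not to tell."),
    ("laughing nervously", "Okay, here it goes."),
)
```

The resulting string can be sent as the `text` field of any TTS request against an S2-Pro model; the model interprets the bracketed descriptions at the positions where they appear.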
### Previous Model **Fish Audio S1** - High-quality voice generation * 4 billion parameters * 0.008 WER (0.8% word error rate) * Full emotional control capabilities with `(parenthesis)` syntax ## Model Specifications ### Fish Audio S1 Performance Metrics * **Word Error Rate (WER)**: 0.008 (0.8%) * **Character Error Rate (CER)**: 0.004 (0.4%) * **Real-time Factor**: \~1:7 on standard hardware * **TTS-Arena2 Ranking**: #1 worldwide ## Supported Languages ### S2-Pro S2-Pro supports 80+ languages with automatic language detection and inline emotion and paralinguistic cue support. Language detection is automatic - simply provide text in your target language. ### S1 S1 supports text-to-speech generation in 13 languages with full emotional expression capabilities. ``` English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese ``` ## Voice Styles and Emotions Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input. ### S2-Pro Natural Language Control S2-Pro treats `[bracket]` tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as `[whispers sweetly]` or `[laughing nervously]`. Common examples include: ``` [whisper] [laugh] [emphasis] [sigh] [gasp] [pause] [angry] [excited] [sad] [surprised] [inhale] [exhale] ``` S2-Pro cues can be placed anywhere in your text to control emotion at specific positions. For example: `"I can't believe it [gasp] you actually did it [laugh]"` ### S1 Voice Styles and Emotions S1 supports 64+ emotional expressions using `(parenthesis)` syntax. 
### Basic Emotions (24 expressions) ``` (angry) (sad) (excited) (surprised) (satisfied) (delighted) (scared) (worried) (upset) (nervous) (frustrated) (depressed) (empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) ``` ### Advanced Emotions (27 expressions) ``` (disdainful) (unhappy) (anxious) (hysterical) (indifferent) (impatient) (guilty) (scornful) (panicked) (furious) (reluctant) (keen) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused) ``` ### Tone Markers (5 expressions) ``` (in a hurry tone) (shouting) (screaming) (whispering) (soft tone) ``` ### Audio Effects (10 expressions) ``` (laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing) ``` You can also use natural expressions like "Ha,ha,ha" for laughter. Experiment with combinations to achieve the perfect emotional tone for your application. ## Support Need help? Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # Pricing & Rate Limits Source: https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits Understand Fish Audio pricing plans, usage costs, and API rate limits ## API Pricing The Fish Audio API uses pay-as-you-go pricing based on actual usage.
There are no subscription fees or monthly minimums for API access. ### Text-to-Speech (TTS) Models TTS pricing is based on the size of input text, measured in millions of UTF-8 bytes. | Model Name | Price (USD) | | ---------- | ----------------------- | | `s2-pro` | \$15.00 / M UTF-8 bytes | | `s1` | \$15.00 / M UTF-8 bytes | 1M UTF-8 bytes is approximately 180,000 English words, or about 12 hours of speech ### Automatic Speech Recognition (ASR) Models | Model Name | Price (USD) | | -------------- | ------------------- | | `transcribe-1` | \$0.36 / audio hour | **How ASR billing works:** * Charges are based on the duration of audio processed * Duration is rounded up to the nearest second ## Rate Limits These limits help us ensure fair usage and maintain service quality for all users. ### Concurrent Request Limits | Tier | Spending Threshold | Concurrent Requests | | ----------- | ------------------ | ------------------- | | Starter | \< \$100 paid | 5 requests | | Elevated | ≥ \$100 paid | 15 requests | | High Volume | ≥ \$1,000 paid | 50 requests | | Enterprise | Custom | Custom limits | Concurrency tiers unlock as soon as your total prepaid amount reaches the threshold. You do not need to spend the full balance first. If your workload needs a higher concurrency tier, you can top up in advance to unlock the next tier immediately. Please reach out to our team to enable enterprise volume pricing, rate limits, and billing. ## Support Need help? 
Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # Story Studio Source: https://docs.fish.audio/developer-guide/products/story-studio Build immersive audio stories and narratives Coming soon! We're preparing comprehensive documentation for Story Studio. In the meantime, you can: * Visit the [Fish Audio Playground](https://fish.audio) to explore our storytelling features * Check back soon for detailed guides and tutorials Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. # Text to Speech Source: https://docs.fish.audio/developer-guide/products/tts Convert text into natural-sounding speech with Fish Audio's AI voices Coming soon! We're preparing comprehensive documentation for our Text-to-Speech web interface. In the meantime, you can: * Visit the [Fish Audio Playground](https://fish.audio) to try our TTS features * Check our [API documentation](/api-reference/endpoint/openapi-v1/text-to-speech) for programmatic access * Read our [TTS Guide and Best Practices](/developer-guide/core-features/text-to-speech) Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. # Voice Cloning Source: https://docs.fish.audio/developer-guide/products/voice-cloning Create custom voice models from audio samples Coming soon! We're preparing comprehensive documentation for our Voice Cloning web interface. 
In the meantime, you can: * Visit the [Fish Audio Playground](https://fish.audio) to try voice cloning * View our [Python SDK voice cloning guide](/developer-guide/sdk-guide/python/voice-cloning) * Read our [voice cloning best practices](/developer-guide/best-practices/voice-cloning) Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. # Agent Quickstart Source: https://docs.fish.audio/developer-guide/resources/agent-quickstart Low-noise entry points and canonical URLs for AI agents using Fish Audio documentation ## Purpose This page is the recommended starting point for AI agents, RAG pipelines, and documentation crawlers that need accurate Fish Audio references with minimal markup noise. ## Built-In Agent Indexes This documentation site already provides built-in LLM-friendly indexes: * [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index * [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context In most cases, agents should read `llms.txt` first and only fetch `llms-full.txt` when they need wider context across the whole documentation set. ## Install the Agent Skill For coding agents that support [Agent Skills](https://github.com/vercel-labs/skills) (Claude Code, Cursor, Windsurf, Codex, and others), install the ready-made raw-API skill with a single command: ```bash theme={null} npx skills add https://docs.fish.audio --skill fish-audio-api ``` The skill teaches the agent how to call the Fish Audio REST and WebSocket APIs directly from `curl`, Python, Node.js, or any HTTP client — no SDK required. It covers authentication, every endpoint in our [OpenAPI schema](https://docs.fish.audio/api-reference/openapi.json), MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol. Discovery endpoint: [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json). 
Run `npx skills add https://docs.fish.audio` (without `--skill`) to install every skill published here, including the auto-generated product overview skill. ## Retrieval Order 1. Read [llms.txt](https://docs.fish.audio/llms.txt) for the curated documentation index. 2. Read [llms-full.txt](https://docs.fish.audio/llms-full.txt) when broad site context is needed. 3. Read [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) for REST schemas, parameters, and examples. 4. Read [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) for the WebSocket streaming protocol. 5. Fetch individual `.md` pages only after narrowing to a specific task. ## Canonical API Facts * Base API URL: `https://api.fish.audio` * Authentication: `Authorization: Bearer <api-key>` * TTS model selection: send a required `model` header. Recommended default: `s2-pro` * Main REST endpoints: * `POST /v1/tts` * `POST /v1/asr` * `GET /model` * `POST /model` * `GET /model/{id}` * `PATCH /model/{id}` * `DELETE /model/{id}` * Real-time streaming endpoint: `wss://api.fish.audio/v1/tts/live` ## High-Value URLs ### Start Here * [Agent Quickstart](https://docs.fish.audio/developer-guide/resources/agent-quickstart.md) * [Quick Start](https://docs.fish.audio/developer-guide/getting-started/quickstart.md) * [AI Coding Agents](https://docs.fish.audio/developer-guide/resources/coding-agents.md) ### API Specs * [OpenAPI](https://docs.fish.audio/api-reference/openapi.json) * [AsyncAPI](https://docs.fish.audio/api-reference/asyncapi.yml) * [API Introduction](https://docs.fish.audio/api-reference/introduction.md) ### Authentication And SDK Setup * [Python Authentication](https://docs.fish.audio/developer-guide/sdk-guide/python/authentication.md) * [JavaScript Authentication](https://docs.fish.audio/developer-guide/sdk-guide/javascript/authentication.md) * [Python SDK Overview](https://docs.fish.audio/developer-guide/sdk-guide/python/overview.md) * [JavaScript 
Installation](https://docs.fish.audio/developer-guide/sdk-guide/javascript/installation.md) ### Core Product Tasks * [Text to Speech Guide](https://docs.fish.audio/developer-guide/core-features/text-to-speech.md) * [Speech to Text Guide](https://docs.fish.audio/developer-guide/core-features/speech-to-text.md) * [Creating Voice Models](https://docs.fish.audio/developer-guide/core-features/creating-models.md) * [Emotion Control](https://docs.fish.audio/developer-guide/core-features/emotions.md) * [Fine-grained Control](https://docs.fish.audio/developer-guide/core-features/fine-grained-control.md) ### Real-Time And Integrations * [WebSocket TTS Streaming](https://docs.fish.audio/api-reference/endpoint/websocket/tts-live.md) * [Real-time Voice Streaming Best Practices](https://docs.fish.audio/developer-guide/best-practices/real-time-streaming.md) * [Python WebSocket Streaming](https://docs.fish.audio/developer-guide/sdk-guide/python/websocket.md) * [JavaScript WebSocket](https://docs.fish.audio/developer-guide/sdk-guide/javascript/websocket.md) * [LiveKit Integration](https://docs.fish.audio/developer-guide/integrations/livekit.md) * [Pipecat Integration](https://docs.fish.audio/developer-guide/integrations/pipecat.md) ### Models, Pricing, And Lifecycle * [Models Overview](https://docs.fish.audio/developer-guide/models-pricing/models-overview.md) * [Choosing a Model](https://docs.fish.audio/developer-guide/models-pricing/choosing-a-model.md) * [Pricing And Rate Limits](https://docs.fish.audio/developer-guide/models-pricing/pricing-and-rate-limits.md) * [Model Deprecations](https://docs.fish.audio/developer-guide/models-pricing/deprecations.md) ## Task Routing * If the task is "generate speech", start with Quick Start, the Text to Speech guide, and `POST /v1/tts`. * If the task is "transcribe audio", start with the Speech to Text guide and `POST /v1/asr`. * If the task is "clone or manage voices", start with Creating Voice Models and the `/model` endpoints. 
* If the task is "stream audio in real time", start with AsyncAPI, WebSocket TTS Streaming, and the WebSocket SDK guides. * If the task is "pick the right model or estimate cost", start with Models Overview and Pricing And Rate Limits. ## Notes For Agents * Prefer `openapi.json` and `asyncapi.yml` for machine-readable schemas. * Prefer `.md` URLs when you need a single human-authored page in Markdown form. * Some richer pages use interactive MDX widgets. If a fetched page contains UI or component noise, fall back to this page, `llms.txt`, `llms-full.txt`, or the API spec files first. * Treat this page as the canonical low-noise entry point for Fish Audio documentation retrieval. # Brand Guidelines Source: https://docs.fish.audio/developer-guide/resources/brand Design guidelines for using Fish Audio brand assets ## Logo ### Wordmark Our preferred logo format combines the [Fish Audio Icon](#icon) with the wordmark side by side. This is the primary version of our logo and should be used whenever possible for maximum brand recognition and clarity. Fish Audio Clearspace Wordmark ### Icon Our icon features a whale composed of audio bars and sound waves, symbolizing the fusion of marine life with audio technology. This design represents our brand's commitment to natural, flowing, and powerful voice generation. The Fish Audio icon should only be used when space constraints or context make it impractical to display the full wordmark. Always prefer the wordmark with icon combination when possible.
Fish Audio Clearspace Logo ### Avoid To maintain the integrity of our brand identity, please do not alter our logo in any of the following ways: Incorrect logo usage - distorted Incorrect logo usage - rotated Incorrect logo usage - wrong colors Incorrect logo usage - effects ## Colors Our official brand colors consist of black and white for primary logo applications, complemented by secondary grays for subtle variations and an accent purple for visual highlights in marketing materials.
## Typography Our brand uses **Onest Semibold** in the logo wordmark. This documentation is also set in Onest, so you're experiencing our brand typography right now. [Download Onest on Google Fonts](https://fonts.google.com/specimen/Onest) ## Usage Guidelines The Fish Audio name and logos are trademarks of Hanabi AI Inc. You may freely use and redistribute our brand assets when referencing Fish Audio. By using our brand assets, you agree that we own them and that any goodwill generated by your use benefits Fish Audio. ### Do * Use our brand assets freely in your projects, applications, and content * Share our brand assets in blog posts, tutorials, documentation, and educational materials * Follow the visual guidelines shown above (spacing, colors, sizing) * Link to fish.audio when using our brand online ### Don't * Use our logo as part of your own product name or branding * Imply partnership, sponsorship, or endorsement without permission * Feature our logo more prominently than your own brand ### Questions? If you're unsure whether your use case is appropriate or need special permission, please contact us at [support@fish.audio](mailto:support@fish.audio). ## Download Assets # AI Coding Agents Source: https://docs.fish.audio/developer-guide/resources/coding-agents Connect AI coding assistants to Fish Audio documentation via MCP for real-time API guidance ## Overview Integrate Fish Audio's comprehensive documentation directly into your AI coding assistants. Using MCP (Model Context Protocol), coding agents like Claude Code, Cursor, and Windsurf can access our latest API references, guides, and examples in real-time. The Fish Audio MCP server provides instant access to: * Complete API documentation * SDK usage examples * Best practices and implementation patterns * Troubleshooting guides Connect once and get accurate, up-to-date Fish Audio knowledge in your coding environment. 
This documentation site also exposes built-in LLM-friendly indexes: * [llms.txt](https://docs.fish.audio/llms.txt) for the curated page index * [llms-full.txt](https://docs.fish.audio/llms-full.txt) for broader site context If your coding agent supports direct document fetching, start with `llms.txt` before pulling individual pages. ## Install as an Agent Skill Fish Audio publishes a ready-made [Agent Skill](https://github.com/vercel-labs/skills) that teaches your coding agent how to call the Fish Audio REST and WebSocket APIs directly, without an SDK. It covers authentication, every endpoint in our OpenAPI schema, MessagePack vs JSON vs multipart encoding rules, multi-speaker dialogue, and the WebSocket streaming protocol. ```bash theme={null} npx skills add https://docs.fish.audio --skill fish-audio-api ``` This installs the skill into your agent's local skill directory (for example `~/.claude/skills/fish-audio-api/`). Once installed, ask your agent to "call the Fish Audio TTS API with curl" or "stream TTS over WebSocket in Python" and it will follow the skill's conventions. ```bash theme={null} npx skills add https://docs.fish.audio ``` Installs every skill advertised at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json), including the auto-generated product overview skill and the raw-API skill. The discovery index lives at [/.well-known/agent-skills/index.json](https://docs.fish.audio/.well-known/agent-skills/index.json) and each skill's raw markdown is served at [/.well-known/agent-skills/{skill}/SKILL.md](https://docs.fish.audio/.well-known/agent-skills/fish-audio-api/SKILL.md). Review the skill content first, then install with: ```bash theme={null} npx skills add https://docs.fish.audio --list # show available skills npx skills add https://docs.fish.audio --skill fish-audio-api ``` The `skills` CLI works with any agent that uses `SKILL.md` conventions — Claude Code, Cursor, Windsurf, Codex, and others.
See [`npx skills --help`](https://github.com/vercel-labs/skills) for agent-specific install flags such as `-a claude-code` or `-a cursor`. Prefer MCP if you want live documentation search inside your editor. Prefer the Agent Skill if you want a self-contained instruction file that works offline after install and doesn't rely on a running MCP server. ## Why Use MCP Integration? Access the latest API documentation without leaving your editor Generate working code based on current API specifications Get context-aware help for debugging and optimization ## Setup Open your terminal in your project directory and run: ```bash theme={null} claude mcp add --transport http fish-audio --scope project https://docs.fish.audio/mcp ``` This creates a `.mcp.json` file in your project root with the Fish Audio documentation server configuration. Claude Code supports three installation scopes: * **`--scope project`** (recommended): Stores configuration in `.mcp.json` at project root. Version-controlled and shared with your team. * **`--scope user`**: Available globally across all your projects, but private to your account. * **`--scope local`** (default): Project-specific but private to you only. Good for experimentation. For team collaboration, use project scope and commit the `.mcp.json` file to git. Check that the server is connected: ```bash theme={null} claude mcp list ``` You should see `fish-audio` in the list of configured servers. Ask Claude Code: "What Fish Audio models are available?" or "How do I use Fish Audio's TTS API?" Use `Cmd+Shift+P` (Mac) or `Ctrl+Shift+P` (Windows/Linux) to open the command palette, then search for "Open MCP settings". Select "Add custom MCP" to open the `mcp.json` configuration file. Add the Fish Audio documentation server: ```json theme={null} { "mcpServers": { "fish-audio": { "url": "https://docs.fish.audio/mcp" } } } ``` Save the configuration file and reload Cursor to apply changes. In Cursor's chat, ask: "What tools do you have available?" 
You should see the Fish Audio MCP server listed. Then try: "What Fish Audio TTS models are available?" Cursor's MCP support was added in early 2025. Ensure you're running the latest version for full functionality. Go to `File > Preferences > Windsurf Settings`, then navigate to `Cascade > Model Context Protocol (MCP) Servers`. Click "Add custom server +" or "View raw config" to edit the configuration file at `~/.codeium/windsurf/mcp_config.json`. Add the Fish Audio documentation server: ```json theme={null} { "mcpServers": { "fish-audio": { "url": "https://docs.fish.audio/mcp" } } } ``` Save the configuration and click the refresh button in Windsurf to apply changes. Open Cascade chat (Ctrl+L) and ask: "Search Fish Audio docs for TTS API usage" or "What emotion parameters does Fish Audio support?" Windsurf's MCP support was introduced in Wave 3 (February 2025). Ensure you're running the latest version. ## Using the Integration ### Example Queries Once connected, ask your coding agent questions naturally: "How do I authenticate with Fish Audio API?" "Show me Python code for text-to-speech" "What emotion parameters are available?" 
"Help me implement real-time streaming" ### Code Generation Examples Ask: "Generate a Python function for text-to-speech with Fish Audio"

```python theme={null}
from fish_audio import FishAudioClient


def text_to_speech(text: str, voice_id: str, output_file: str):
    """Convert text to speech using Fish Audio API"""
    client = FishAudioClient(api_key="your-api-key")
    response = client.tts.create(
        text=text,
        model_id=voice_id,
        format="mp3"
    )
    with open(output_file, "wb") as f:
        f.write(response.audio_data)
    return output_file
```

Ask: "Create a voice cloning pipeline with error handling"

```python theme={null}
import logging

from fish_audio import FishAudioClient


def clone_voice(audio_path: str, name: str):
    """Clone a voice from audio sample"""
    client = FishAudioClient(api_key="your-api-key")
    try:
        # Upload audio sample
        with open(audio_path, "rb") as f:
            model = client.models.create(
                name=name,
                audio_data=f.read(),
                description="Custom cloned voice"
            )
        logging.info(f"Voice cloned: {model.id}")
        return model.id
    except Exception as e:
        logging.error(f"Cloning failed: {e}")
        raise
```

Ask: "Implement real-time TTS streaming"

```python theme={null}
from fish_audio import FishAudioClient


async def stream_tts(text: str, voice_id: str):
    """Stream TTS audio in real-time"""
    client = FishAudioClient(api_key="your-api-key")
    async for chunk in client.tts.stream(
        text=text,
        model_id=voice_id,
        chunk_size=1024
    ):
        # Process audio chunk
        yield chunk
```

## Available Documentation Your coding agent can access: Complete endpoint documentation with parameters Python SDK usage and examples Optimization patterns and tips Available models and rate limits Custom voice creation guides Common issues and solutions ## Advanced Usage ### Custom Commands Create agent workflows for common tasks:

```text Voice Pipeline theme={null}
"Create a complete voice generation pipeline with:
- Authentication
- Voice selection
- Emotion control
- Error handling
- Audio export"
```

```text Batch Processing
theme={null}
"Build a batch TTS processor that:
- Reads from CSV
- Handles rate limits
- Retries on failure
- Tracks progress"
```

```text WebSocket Client theme={null}
"Implement a WebSocket client for:
- Real-time streaming
- Auto-reconnection
- Buffer management
- Error recovery"
```

### Context-Aware Features With MCP integration, your agent can: * Suggest appropriate models based on use case * Handle rate limiting automatically * Provide inline documentation * Validate API calls against specifications * Recommend optimization strategies ## Troubleshooting If the MCP server isn't connecting: 1. Verify internet connectivity 2. Check `https://docs.fish.audio/mcp` is accessible 3. Ensure your agent supports MCP protocol 4. Restart your coding environment 5. Clear any cached configurations The MCP server always serves the latest documentation: 1. Refresh the MCP connection in settings 2. Clear documentation cache if available 3. Report persistent issues to [support@fish.audio](mailto:support@fish.audio) If certain features aren't available: 1. Verify you're using the latest agent version 2. Check MCP protocol compatibility 3. Ensure proper server configuration 4. Contact support for assistance ## Security **Your data is safe:** * MCP provides read-only access to public documentation * No API keys are transmitted through MCP * All connections use HTTPS encryption * No user queries or usage data is stored ## Next Steps Start with Fish Audio API basics Install and configure the Python SDK Learn text-to-speech optimization Create custom voice models ## Support Need help with MCP integration? * **Technical Support**: [support@fish.audio](mailto:support@fish.audio) * **Documentation Issues**: [GitHub](https://github.com/fishaudio) * **Community**: [Discord](https://discord.gg/dF9Db2Tt3Y) # Migration Guide Source: https://docs.fish.audio/developer-guide/resources/migration Switch from ElevenLabs, OpenAI, or other TTS providers to Fish Audio Coming soon! 
We're preparing comprehensive migration guides to help you seamlessly switch to Fish Audio. We're working on detailed migration guides for: * ElevenLabs * OpenAI TTS * Google Cloud Text-to-Speech * Amazon Polly * Other TTS providers Check back soon or join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates. # Roadmap Source: https://docs.fish.audio/developer-guide/resources/roadmap Upcoming features and improvements for Fish Audio ## Roadmap Explore what's coming next for Fish Audio. Our roadmap reflects our current priorities and vision for the platform. This roadmap is subject to change based on user feedback and technical considerations. Features may be added, modified, or removed as we continue to develop the platform. ### Coming Soon Details about our upcoming features and improvements will be published here. ## Feature Requests Have a feature request or want to vote on priorities? We'd love to hear from you: * **Email**: [support@fish.audio](mailto:support@fish.audio) * **Discord**: Join our [community Discord](https://discord.gg/dF9Db2Tt3Y) * **GitHub**: Open an issue on our [GitHub repository](https://github.com/fishaudio) ## Stay Updated Subscribe to our [changelog](/developer-guide/getting-started/changelog) RSS feed to get notified when new features are released. # Authentication Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/authentication Manage API keys and client setup in the Fish Audio JavaScript SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. 
Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely. Keep your API key secret! Never commit it to version control or share it publicly. ## Client Initialization Initialize a `FishAudioClient` with your API key to start using the SDK: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; // Initialize with your API key const fishAudio = new FishAudioClient({ apiKey: "your_api_key" }); ``` ### Using Environment Variables For better security, store your API key in environment variables: Set the environment variable in your shell: ```bash theme={null} export FISH_API_KEY=your_api_key_here ``` Then initialize immediately: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; const fishAudio = new FishAudioClient(); ``` ```typescript theme={null} import { config } from "dotenv"; import { FishAudioClient } from "fish-audio"; // Load environment variables from .env file config(); const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); ``` Create a `.env` file in your project root: ```bash theme={null} FISH_API_KEY=your_api_key_here ``` ### Custom Endpoints If you need to use a proxy or custom endpoint: ```typescript theme={null} const fishAudio = new FishAudioClient({ apiKey: "your_api_key", baseUrl: "https://your-proxy-domain.com", }); ``` # Installation Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/installation Install and set up the Fish Audio JavaScript SDK To use the Fish Audio API in server-side JavaScript environments like Node.js, Deno, or Bun, you can use the official [Fish Audio SDK for TypeScript and JavaScript](https://www.npmjs.com/package/fish-audio). ## Requirements * Node.js 18 or higher ## Install Install the JavaScript SDK from npm.
Choose your preferred package manager: ```bash theme={null} npm install fish-audio ``` ```bash theme={null} yarn add fish-audio ``` ```bash theme={null} pnpm add fish-audio ``` ## Support Need help? Check out these resources: * [API Reference](/api-reference/introduction) - Complete API documentation * [Create a Voice Clone](/api-reference/endpoint/model/create-model) - Create a voice clone model * [Generate Speech](/api-reference/endpoint/openapi-v1/text-to-speech) - Generate realistic speech * [Real-time Streaming](/developer-guide/sdk-guide/python/websocket) - WebSocket for real-time streaming * [Discord Community](https://discord.com/invite/dF9Db2Tt3Y) - Get help from the community * [Support Email](mailto:support@fish.audio) - Contact our support team # Speech to Text Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/speech-to-text Convert audio to text with Fish Audio JavaScript SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely. Keep your API key secret! Never commit it to version control or share it publicly. 
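Following the warning above, one way to keep the key out of source code entirely is to read it from the environment and fail fast when it is missing. A minimal sketch; `requireApiKey` is an illustrative helper, not part of the SDK:

```typescript
// Illustrative helper (not part of the fish-audio SDK): read the API key
// from the environment and fail fast with a clear message when it is unset.
function requireApiKey(
  env: Record<string, string | undefined> = process.env
): string {
  const key = env.FISH_API_KEY;
  if (!key) {
    throw new Error("FISH_API_KEY is not set; export it before starting the app.");
  }
  return key;
}
```

The client can then be constructed as `new FishAudioClient({ apiKey: requireApiKey() })`, so a missing key surfaces at startup rather than as a failed request later.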
## Basic Usage Transcribe audio to text: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; import { createReadStream } from "fs"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const result = await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), }); console.log(result.text); console.log("Duration (s):", result.duration); ``` ## Language Specification Improve accuracy by specifying the language: ```typescript theme={null} // English transcription await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), language: "en" }); // Chinese transcription await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), language: "zh" }); ``` Common language codes: `en` (English), `zh` (Chinese), `es` (Spanish), `fr` (French), `de` (German), `ja` (Japanese), `ko` (Korean), `pt` (Portuguese) Automatic language detection works well, but specifying the language improves accuracy and speed. ## Working with Segments Get detailed timing for each segment: ```typescript theme={null} const response = await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") }); // Full transcription console.log(response.text); // Segment details for (const seg of response.segments ?? []) { console.log(`[${seg.start.toFixed(2)}s - ${seg.end.toFixed(2)}s] ${seg.text}`); } ``` ## Timestamps Control Control timestamp generation: ```typescript theme={null} // Include timestamps (default) await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: false }); // Skip timestamp processing for faster results await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3"), ignore_timestamps: true }); ``` `ignore_timestamps: false` (default) includes segment timestamps. Set to `true` to skip timestamp processing for faster transcription when you only need the text. 
## Audio Formats Supported audio formats: * MP3 (recommended) * WAV * M4A * OGG * FLAC * AAC File requirements: * Maximum size: 20MB * Maximum duration: 60 minutes * Sample rate: 16kHz or higher recommended ## Transcribing TTS Output Transcribe generated speech: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; const fishAudio = new FishAudioClient(); // Generate speech const ttsAudio = await fishAudio.textToSpeech.convert({ text: "Hello, this is a test" }); // Transcribe it const asr = await fishAudio.speechToText.convert({ audio: ttsAudio }); console.log(asr.text); ``` ## Error Handling Handle common errors: ```typescript theme={null} try { await fishAudio.speechToText.convert({ audio: createReadStream("audio.mp3") }); } catch (e: any) { const status = e?.status || e?.response?.status; if (status === 413) console.error("Audio file too large (max 20MB)"); else if (status === 400) console.error("Invalid audio format"); else throw e; } ``` ## Response Structure The ASR response includes: | Field | Type | Description | | ---------- | ------------- | ------------------------- | | `text` | string | Complete transcription | | `duration` | number | Audio duration (seconds) | | `segments` | ASRSegment\[] | Timestamped text segments | Segment structure: | Field | Type | Description | | ------- | ------ | -------------------- | | `text` | string | Segment text | | `start` | number | Start time (seconds) | | `end` | number | End time (seconds) | Note the timing units: `duration` and segment times are in seconds. 
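Because segment times are plain seconds, turning a response into SubRip (`.srt`) subtitles is a short step. A minimal sketch, assuming only the segment fields documented above; `srtTime` and `toSrt` are illustrative helpers, not SDK functions:

```typescript
interface AsrSegment {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Render numbered SRT cues from ASR segments.
function toSrt(segments: AsrSegment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}`)
    .join("\n\n");
}
```

Passing `response.segments` from `speechToText.convert` through `toSrt` should yield a subtitle file body ready to write to disk.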
## Request Parameters | Parameter | Type | Description | Default | | ------------------- | --------------------------------- | -------------------------- | ------------------ | | `audio` | File \| Buffer \| Readable stream | Audio to transcribe | Required | | `language` | string | Language code (e.g., "en") | None (auto-detect) | | `ignore_timestamps` | boolean | Skip timestamp processing | false | # Text to Speech Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/text-to-speech Convert text to natural speech with Fish Audio JavaScript SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. 
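One way to follow the advice above is to fail fast when the key is missing, instead of discovering it through a 401 later. This tiny guard is our own sketch, not part of the SDK:

```typescript
// Illustrative helper, not an SDK API: read the API key from the
// environment and throw a clear error if it is missing.
function requireApiKey(
  env: Record<string, string | undefined> = process.env
): string {
  const key = env.FISH_API_KEY;
  if (!key) {
    throw new Error(
      "FISH_API_KEY is not set; export it or add it to your .env file"
    );
  }
  return key;
}

// Usage sketch:
// const fishAudio = new FishAudioClient({ apiKey: requireApiKey() });
```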
## Basic Usage Generate speech from text: ```typescript theme={null} import { FishAudioClient, play } from "fish-audio"; const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); const audio = await fishAudio.textToSpeech.convert({ text: "Hello, world!", }); await play(audio); ``` ## Using Voice Models Specify a voice model for consistent voice generation: ```typescript theme={null} import { FishAudioClient, play } from "fish-audio"; const fishAudio = new FishAudioClient(); const audio = await fishAudio.textToSpeech.convert({ text: "This is my custom voice", reference_id: "your_model_id", // Your model ID from fish.audio }); await play(audio); ``` ### Getting Model IDs The `reference_id` is the model ID from the URL when viewing a model on Fish Audio: * Model URL: `https://fish.audio/m/802e3bc2b27e49c2995d23ef70e6ac89` * Reference ID: `802e3bc2b27e49c2995d23ef70e6ac89` You can also get model IDs programmatically: ```typescript theme={null} // List your models const results = await fishAudio.voices.search({ self: true }); for (const model of results.items ?? []) { console.log(`${model.title}: ${model._id}`); } // Get specific model details const model = await fishAudio.voices.get("your_model_id"); console.log(`Model: ${model.title}, ID: ${model._id}`); ``` ## Emotions The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. Add emotional expressions to your text: ```typescript theme={null} import type { TTSRequest } from "fish-audio"; const text = ` (happy) I'm excited to share this! (sad) Unfortunately, it didn't work out. (whispering) This is a secret. 
`; const request: TTSRequest = { text, reference_id: "model_id" }; ``` Common emotions: `(happy)`, `(sad)`, `(angry)`, `(excited)`, `(calm)`, `(surprised)`, `(whispering)`, `(shouting)`, `(laughing)`, `(sighing)` For more advanced control over speech generation, including phoneme-level control and additional paralanguage features, see [Fine-grained Control](/developer-guide/core-features/fine-grained-control). ## Audio Formats Choose output format based on your needs: ```typescript theme={null} // MP3 (default) await fishAudio.textToSpeech.convert({ text: "...", format: "mp3", mp3_bitrate: 192 }); // WAV - uncompressed await fishAudio.textToSpeech.convert({ text: "...", format: "wav", sample_rate: 44100 }); // Opus - efficient for streaming await fishAudio.textToSpeech.convert({ text: "...", format: "opus", opus_bitrate: 48 }); // PCM - raw audio data await fishAudio.textToSpeech.convert({ text: "...", format: "pcm", sample_rate: 16000 }); ``` ## Prosody Control Adjust speech speed and volume: ```typescript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Adjusted speech", prosody: { speed: 1.2, // 0.5 - 2.0 volume: 5, // -20 - 20 }, }); ``` ## Advanced Parameters Fine-tune generation: ```typescript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Your text here", chunk_length: 200, // Characters per chunk (100-300) normalize: true, // Normalize text latency: "balanced", // "normal" or "balanced" temperature: 0.7, // Randomness (0.0-1.0) top_p: 0.7, // Token selection (0.0-1.0) }); ``` ## Choosing Backend Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) is the default backend model for TTS. Optionally specify the model via the second argument (`backend: Backends`). 
```typescript theme={null} const audio = await fishAudio.textToSpeech.convert({ text: "Hello, world!", }, "s2-pro"); ``` ## Streaming For real-time streaming, see the [WebSocket guide](/developer-guide/sdk-guide/javascript/websocket). ## Error Handling Handle common errors: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; import type { TTSRequest } from "fish-audio"; async function generateWithRetry(request: TTSRequest, maxRetries = 3) { const fishAudio = new FishAudioClient(); for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await fishAudio.textToSpeech.convert(request); } catch (e: any) { const status = e?.status || e?.response?.status; if (status === 429) await new Promise(r => setTimeout(r, 2 ** attempt * 1000)); else if (status === 401) throw new Error("Invalid API key"); else throw e; } } throw new Error("Max retries exceeded"); } ``` ## Request Parameters | Parameter | Type | Description | Default | | -------------- | --------- | -------------------- | ---------- | | `text` | string | Text to convert | Required | | `reference_id` | string | Voice model ID | None | | `references` | object\[] | Reference audio | \[] | | `format` | string | Audio format | "mp3" | | `chunk_length` | number | Chunk size (100-300) | 200 | | `normalize` | boolean | Normalize text | true | | `latency` | string | Speed vs quality | "balanced" | | `prosody` | object | Speed/volume | None | | `temperature` | number | Randomness | 0.7 | | `top_p` | number | Token selection | 0.7 | ## Next Steps * [Fine-grained control](/developer-guide/core-features/fine-grained-control) for phoneme-level control and paralanguage * [Voice cloning](/developer-guide/sdk-guide/javascript/voice-cloning) for custom voices * [WebSocket streaming](/developer-guide/sdk-guide/javascript/websocket) for real-time apps * [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for production use * [API reference](/api-reference/endpoint/openapi-v1/text-to-speech) for direct API calls # Voice Cloning Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/voice-cloning Clone 
voices using reference audio with Fish Audio JavaScript SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. ## Overview Voice cloning allows you to generate speech that matches a specific voice using reference audio. Fish Audio supports two approaches: * Using pre-trained voice models (reference\_id) * Providing reference audio directly in your request Use `reference_id` when you'll reuse a voice multiple times - it's faster and more efficient. Use `references` for one-off voice cloning or testing different voices without creating models. 
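That choice can be encoded in a small helper. The request shapes mirror the examples on this page, but the function itself is illustrative, not an SDK API:

```typescript
// Illustrative helper, not an SDK API. The request shapes mirror
// the TTSRequest examples on this page.
interface ReferenceAudioLike {
  audio: unknown; // File/Buffer in the real SDK
  text: string;
}

// Either a model ID (string) or one-off reference clips
type VoiceSpec = string | ReferenceAudioLike[];

function buildTTSRequest(text: string, voice: VoiceSpec) {
  return typeof voice === "string"
    ? { text, reference_id: voice } // reusable, pre-trained model
    : { text, references: voice };  // one-off instant cloning
}

const req = buildTTSRequest("Hi", "802e3bc2b27e49c2995d23ef70e6ac89");
// -> { text: "Hi", reference_id: "802e3bc2b27e49c2995d23ef70e6ac89" }
```

Callers then pass the result straight to `textToSpeech.convert()` without caring which cloning mode is in play.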
## Using Reference Audio Clone a voice by providing reference audio directly: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; import type { TTSRequest, ReferenceAudio } from "fish-audio"; import { readFile } from "fs/promises"; const fishAudio = new FishAudioClient(); const audioBuffer = await readFile("voice_sample.wav"); const referenceFile = new File([audioBuffer], "voice_sample.wav"); const referenceAudio: ReferenceAudio = { audio: referenceFile, text: "Text spoken in the reference audio" }; const request: TTSRequest = { text: "Hello, world!", references: [referenceAudio] }; const audio = await fishAudio.textToSpeech.convert(request); ``` ## Multiple References Improve voice quality by providing multiple reference samples: ```typescript theme={null} import type { TTSRequest, ReferenceAudio } from "fish-audio"; import { readFile } from "fs/promises"; const references = [] as ReferenceAudio[]; for (const i of [0, 1, 2]) { const buf = await readFile(`sample_${i}.wav`); references.push({ audio: new File([buf], `sample_${i}.wav`), text: `Text from sample ${i}` }); } const request: TTSRequest = { text: "Better voice quality with multiple references", references, }; ``` ## Creating Voice Models For repeated use, create a persistent voice model: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; import { createReadStream } from "fs"; const fishAudio = new FishAudioClient(); // Create a voice model from samples const response = await fishAudio.voices.ivc.create({ title: "My Custom Voice", voices: [ createReadStream("voice_0.wav"), createReadStream("voice_1.wav"), createReadStream("voice_2.wav"), ], cover_image: createReadStream("cover.png"), }); console.log("Created model:", response._id); // Use the model const audio = await fishAudio.textToSpeech.convert({ text: "Using my saved voice model", reference_id: response._id, }); ``` ## Best Practices ### Audio Quality For best results, reference audio should: * Be 10-30 seconds 
long per sample * Have clear speech without background noise * Match the language you'll generate * Include varied intonation and emotion ### Sample Text The text parameter in ReferenceAudio should: * Match exactly what's spoken in the audio * Include punctuation for proper prosody * Be in the same language as generation ### Performance Tips 1. **Pre-upload models** for frequently used voices 2. **Use 2-3 reference samples** for optimal quality 3. **Keep samples under 30 seconds** each 4. **Normalize audio levels** before uploading ## Audio Format Requirements Supported formats for reference audio: * WAV (recommended) * MP3 * M4A * Other common audio formats Sample rates: * 16kHz minimum * 44.1kHz recommended * Mono or stereo (converted to mono) ## Example: Voice Bank Build a library of cloned voices: ```typescript theme={null} import { FishAudioClient } from "fish-audio"; const fishAudio = new FishAudioClient(); async function createVoiceBank() { const voiceBank: Record<string, string> = {}; const models = await fishAudio.voices.search(); for (const m of models.items ?? []) voiceBank[m.title] = m._id as string; return voiceBank; } async function generateWithVoice(text: string, voiceName: string) { const bank = await createVoiceBank(); const modelId = bank[voiceName]; if (!modelId) throw new Error(`Voice '${voiceName}' not found`); return fishAudio.textToSpeech.convert({ text, reference_id: modelId }); } ``` ## Combining with Emotions Add emotions to cloned voices: ```typescript theme={null} // With a saved model await fishAudio.textToSpeech.convert({ text: "(happy) This is exciting news! 
(calm) Let me explain the details.", reference_id: "your_model_id", }); // Or with direct references await fishAudio.textToSpeech.convert({ text: "(excited) Amazing discovery!", references: [referenceAudio], }); ``` ## Error Handling Common issues and solutions: ```typescript theme={null} try { await fishAudio.textToSpeech.convert({ text: "Test speech", references: [referenceAudio] }); } catch (e: any) { const msg = String(e?.message || e); if (msg.includes("Invalid audio format")) console.error("Check audio format - use WAV or MP3"); else if (msg.includes("Audio too short")) console.error("Reference audio should be at least 10 seconds"); else throw e; } ``` # WebSocket Source: https://docs.fish.audio/developer-guide/sdk-guide/javascript/websocket Real-time streaming with Fish Audio JavaScript SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. ## Overview WebSocket streaming enables real-time text-to-speech generation, perfect for conversational AI, live captioning, and streaming applications. 
## Basic Streaming Stream text and receive audio in real-time: ```typescript theme={null} import { FishAudioClient, RealtimeEvents } from "fish-audio"; import { writeFile } from "fs/promises"; import path from "path"; // Simple async generator that yields text chunks async function* makeTextStream() { const chunks = [ "Hello from Fish Audio! ", "This is a realtime text-to-speech test. ", "We are streaming multiple chunks over WebSocket.", ]; for (const chunk of chunks) { yield chunk; } } const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY }); // For realtime, set text to "" and stream the content via makeTextStream const request = { text: "" }; const connection = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); // Collect audio and write to a file when the stream ends const chunks: Buffer[] = []; connection.on(RealtimeEvents.OPEN, () => console.log("WebSocket opened")); connection.on(RealtimeEvents.AUDIO_CHUNK, (audio: unknown): void => { if (audio instanceof Uint8Array || Buffer.isBuffer(audio)) { chunks.push(Buffer.from(audio)); } }); connection.on(RealtimeEvents.ERROR, (err) => console.error("WebSocket error:", err)); connection.on(RealtimeEvents.CLOSE, async () => { const outPath = path.resolve(process.cwd(), "out.mp3"); await writeFile(outPath, Buffer.concat(chunks)); console.log("Saved to", outPath); }); ``` Set `text: ""` in the request when streaming. The actual text comes from your text stream generator. 
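The collect-and-write pattern above generalizes to any event-emitting connection. The sketch below is SDK-independent: it uses Node's `EventEmitter` with plain string event names standing in for the `RealtimeEvents` constants:

```typescript
import { EventEmitter } from "node:events";

// Generic sketch, not an SDK API: resolve with all audio bytes once
// the source emits "close"; reject on "error". The string event
// names here stand in for the SDK's RealtimeEvents.* constants.
function collectAudio(source: EventEmitter): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    source.on("audioChunk", (c: Uint8Array) => chunks.push(Buffer.from(c)));
    source.once("error", reject);
    source.once("close", () => resolve(Buffer.concat(chunks)));
  });
}

// Usage sketch with any EventEmitter-like connection:
const fake = new EventEmitter();
const done = collectAudio(fake);
fake.emit("audioChunk", Buffer.from("ab"));
fake.emit("audioChunk", Buffer.from("cd"));
fake.emit("close");
done.then((buf) => console.log(buf.toString())); // "abcd"
```

Wrapping the events in a promise keeps the happy path linear (`const audio = await collectAudio(conn)`) and centralizes error handling in one place.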
## Using Voice Models Stream with a specific voice: ```typescript theme={null} const request = { text: "", // Empty for streaming reference_id: "your_model_id", format: "mp3", }; const conn = await fishAudio.textToSpeech.convertRealtime(request, makeTextStream()); conn.on(RealtimeEvents.AUDIO_CHUNK, () => { /* handle audio */ }); ``` ## Dynamic Text Generation Stream text as it's generated: ```typescript theme={null} async function* generateText() { const responses = [ "Processing your request...", "Here's what I found:", "The answer is 42.", ]; for (const response of responses) { for (const word of response.split(" ")) { yield word + " "; await new Promise(r => setTimeout(r, 20)); } } } await fishAudio.textToSpeech.convertRealtime({ text: "" }, generateText()); ``` ## Line-by-Line Processing Stream text line by line: ```typescript theme={null} import { createReadStream } from "fs"; import readline from "readline"; async function* readFileLines(filepath: string) { const rl = readline.createInterface({ input: createReadStream(filepath) }); for await (const line of rl) { yield line.trim() + " "; } } await fishAudio.textToSpeech.convertRealtime({ text: "" }, readFileLines("story.txt")); ``` ## Errors Handle connection errors via event listeners: ```typescript theme={null} connection.on(RealtimeEvents.ERROR, (err) => { console.error("WebSocket error:", err); // Fallback to regular TTS or retry }); ``` ## Configuration/Choosing Backend Customize WebSocket behavior by configuring the client.
Optionally specify the backend model to use. Our state-of-the-art [S2-Pro model](/developer-guide/models-pricing/models-overview) is the default: ```typescript theme={null} // Custom endpoint const fishAudio = new FishAudioClient({ apiKey: process.env.FISH_API_KEY, baseUrl: "https://api.fish.audio", // Use a proxy/custom endpoint if needed }); // Select backend model via the third argument const conn = await fishAudio.textToSpeech.convertRealtime( request, makeTextStream(), "s2-pro" ); ``` ## Best Practices 1. **Chunk Size**: Yield text in natural phrases for best prosody 2. **Buffer Management**: Process audio chunks immediately to avoid memory buildup 3. **Connection Reuse**: Keep WebSocket sessions alive for multiple streams 4. **Error Recovery**: Implement retry logic for connection failures 5. **Format Selection**: Use PCM for real-time playback, MP3 for storage ## Events The connection emits these events: | Event | Description | | ------------- | --------------------------------- | | `OPEN` | WebSocket connection established | | `AUDIO_CHUNK` | Audio chunk received (Uint8Array) | | `ERROR` | Error occurred on the connection | | `CLOSE` | Connection closed | # Authentication Source: https://docs.fish.audio/developer-guide/sdk-guide/python/authentication Configure API authentication for the Fish Audio Python SDK ## Get Your API Key Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! 
Never commit it to version control or share it publicly. ## Client Initialization Initialize the [`FishAudio`](/api-reference/sdk/python/client#fishaudio-objects) client with your API key: The most secure approach is using environment variables: ```python theme={null} from fishaudio import FishAudio # Automatically reads from FISH_API_KEY environment variable client = FishAudio() ``` Set the environment variable in your shell: ```bash theme={null} export FISH_API_KEY=your_api_key_here ``` Or create a `.env` file in your project root: ```bash theme={null} FISH_API_KEY=your_api_key_here ``` Then load it using `python-dotenv`: ```python theme={null} from dotenv import load_dotenv from fishaudio import FishAudio # Load environment variables from .env file load_dotenv() client = FishAudio() ``` Using environment variables keeps your API key out of your codebase and makes it easy to use different keys for development and production. Provide the API key directly when initializing the client: ```python theme={null} from fishaudio import FishAudio client = FishAudio(api_key="your_api_key_here") ``` This approach is less secure. Never commit code containing your actual API key. Use this only for quick testing or when loading the key from a secure secrets manager. If you're using a proxy or custom endpoint: ```python theme={null} from fishaudio import FishAudio client = FishAudio( api_key="your_api_key", base_url="https://your-proxy-domain.com" ) ``` This is useful for: * Corporate proxies * Development/staging environments * Self-hosted deployments ## Verifying Authentication Test your authentication by making a simple API call to check your account credits: ```python focus={7-9} theme={null} from fishaudio import FishAudio from fishaudio.exceptions import AuthenticationError try: client = FishAudio() # Check account credits (requires valid authentication) credits = client.account.get_credits() print(f"Authentication successful! 
Credits: {credits.credit}") except AuthenticationError: print("Authentication failed. Check your API key.") ``` Handle [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticationerror-objects) when verifying authentication. The example uses [`get_credits()`](/api-reference/sdk/python/resources#get_credits) to verify that authentication works. ## Next Steps Generate speech with the authenticated client Clone voices and create custom models Check credits and manage your account Handle authentication errors properly # Overview Source: https://docs.fish.audio/developer-guide/sdk-guide/python/overview The official Python library for the Fish Audio API This guide will walk you through installation, authentication, and core features. If you're using the legacy Session-based API (`fish_audio_sdk`), see the [migration guide](/archive/python-sdk-legacy/migration-guide) to upgrade to the new SDK. ## Installation Install via pip (Python 3.9 or higher required): ```bash theme={null} pip install fish-audio-sdk ``` For audio playback utilities, install with the `utils` extra: ```bash theme={null} pip install fish-audio-sdk[utils] ``` Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. 
Configure your API key using environment variables: ```bash theme={null} export FISH_API_KEY=your_api_key_here ``` Or create a `.env` file in your project root: ```bash theme={null} FISH_API_KEY=your_api_key_here ``` ## Quick Start Get started with the [`FishAudio`](/api-reference/sdk/python/client#fishaudio-objects) client in less than a minute: ```python Synchronous theme={null} from fishaudio import FishAudio from fishaudio.utils import play, save # Initialize client (reads from FISH_API_KEY environment variable) client = FishAudio() # Generate and play audio audio = client.tts.convert(text="Hello, playing from Fish Audio!") play(audio) # Generate and save audio audio = client.tts.convert(text="Saving this audio to a file!") save(audio, "output.mp3") ``` ```python Asynchronous theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play, save async def main(): # Initialize async client client = AsyncFishAudio() # Generate and play audio audio = await client.tts.convert(text="Hello, playing from Fish Audio!") play(audio) # Generate and save audio audio = await client.tts.convert(text="Saving this audio to a file!") save(audio, "output.mp3") asyncio.run(main()) ``` ## Core Features ### Text-to-Speech Fully customizable text-to-speech generation: ```python Synchronous focus={6-10} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # With a specific voice audio = client.tts.convert( text="Custom voice", reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian ) play(audio) ``` ```python Asynchronous focus={8-12} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # With a specific voice audio = await client.tts.convert( text="Custom voice", reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian ) play(audio) asyncio.run(main()) ``` ```python Synchronous focus={6-10} theme={null} from 
fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # With speed control audio = client.tts.convert( text="I'm talking pretty fast, is this still too slow?", speed=1.5 # 1.5x speed ) play(audio) ``` ```python Asynchronous focus={8-12} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # With speed control audio = await client.tts.convert( text="I'm talking pretty fast, is this still too slow?", speed=1.5 # 1.5x speed ) play(audio) asyncio.run(main()) ``` Create reusable configurations with [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects). [`Prosody`](/api-reference/sdk/python/types#prosody-objects) controls speech characteristics like speed and volume: ```python Synchronous focus={7-18} theme={null} from fishaudio import FishAudio from fishaudio.types import TTSConfig, Prosody from fishaudio.utils import play client = FishAudio() # Define config once my_config = TTSConfig( prosody=Prosody(speed=1.2, volume=-5), reference_id="933563129e564b19a115bedd57b7406a", # Sarah format="wav", latency="balanced" ) # Reuse across multiple generations audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config) audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config) audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) play(audio1) play(audio2) play(audio3) ``` ```python Asynchronous focus={9-20} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import TTSConfig, Prosody from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Define config once my_config = TTSConfig( prosody=Prosody(speed=1.2, volume=-5), reference_id="933563129e564b19a115bedd57b7406a", # Sarah format="wav", latency="balanced" ) # Reuse across multiple generations audio1 = await client.tts.convert(text="Welcome to our 
product demonstration.", config=my_config) audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config) audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) play(audio1) play(audio2) play(audio3) asyncio.run(main()) ``` For chunk-by-chunk processing, use [`stream()`](/api-reference/sdk/python/resources#stream) which returns an `AudioStream` (iterable). For real-time streaming with dynamic text, see [Real-time Streaming](#real-time-streaming) below. Learn more in the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech). ### Speech-to-Text Transcribe audio to text for various use cases: ```python Synchronous focus={5-16} theme={null} from fishaudio import FishAudio client = FishAudio() # Transcribe audio with open("audio.wav", "rb") as f: result = client.asr.transcribe( audio=f.read(), language="en" # Optional: specify language ) print(result.text) # Access segments for segment in result.segments: print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}") ``` ```python Asynchronous focus={7-18} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Transcribe audio with open("audio.wav", "rb") as f: result = await client.asr.transcribe( audio=f.read(), language="en" # Optional: specify language ) print(result.text) # Access segments for segment in result.segments: print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}") asyncio.run(main()) ``` Learn more in the [Speech-to-Text guide](/developer-guide/sdk-guide/python/speech-to-text). ### Real-time Streaming Stream dynamically generated text for conversational AI and live applications. 
Perfect for integrating with LLM streaming responses, live captions, and chatbot interactions: ```python Synchronous focus={7-15} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Stream dynamically generated text (e.g., from LLM) def text_chunks(): yield "Hello, " yield "this is " yield "streaming text!" audio_stream = client.tts.stream_websocket( text_chunks(), latency="balanced" ) play(audio_stream) ``` ```python Asynchronous focus={9-17} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Stream dynamically generated text async def text_chunks(): yield "Hello, " yield "this is " yield "streaming text!" audio_stream = await client.tts.stream_websocket( text_chunks(), latency="balanced" ) play(audio_stream) asyncio.run(main()) ``` Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket). ### Voice Cloning **Instant voice cloning** - Clone a voice on-the-fly using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects): ```python Synchronous focus={6-12} theme={null} from fishaudio import FishAudio from fishaudio.types import ReferenceAudio client = FishAudio() # Instant voice cloning with open("reference.wav", "rb") as f: audio = client.tts.convert( text="This will sound like the reference voice", references=[ReferenceAudio( audio=f.read(), text="Text spoken in the reference audio" )] ) ``` ```python Asynchronous focus={8-14} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import ReferenceAudio async def main(): client = AsyncFishAudio() # Instant voice cloning with open("reference.wav", "rb") as f: audio = await client.tts.convert( text="This will sound like the reference voice", references=[ReferenceAudio( audio=f.read(), text="Text spoken in the reference audio" )] ) asyncio.run(main()) ``` **Voice models** - Create persistent 
voice models for repeated use: ```python Synchronous focus={6-11} theme={null} from fishaudio import FishAudio client = FishAudio() # Create persistent voice model with open("voice_sample.wav", "rb") as f: voice = client.voices.create( title="My Custom Voice", voices=[f.read()], description="Custom voice clone" ) print(f"Created voice: {voice.id}") ``` ```python Asynchronous focus={8-13} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Create persistent voice model with open("voice_sample.wav", "rb") as f: voice = await client.voices.create( title="My Custom Voice", voices=[f.read()], description="Custom voice clone" ) print(f"Created voice: {voice.id}") asyncio.run(main()) ``` Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning). ## Client Initialization The recommended approach using environment variables: ```python theme={null} from fishaudio import FishAudio # Automatically reads from FISH_API_KEY environment variable client = FishAudio() ``` Provide the API key directly: ```python theme={null} from fishaudio import FishAudio client = FishAudio(api_key="your_api_key") ``` Never commit API keys to version control. Use environment variables or secret management systems. 
Configure a custom base URL: ```python theme={null} from fishaudio import FishAudio client = FishAudio( api_key="your_api_key", base_url="https://your-proxy-domain.com" ) ``` ## Sync vs Async The SDK provides both synchronous and asynchronous clients: ```python Synchronous theme={null} from fishaudio import FishAudio # For typical applications client = FishAudio() audio = client.tts.convert(text="Hello!") ``` ```python Asynchronous theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): # For async applications (web servers, concurrent tasks) client = AsyncFishAudio() audio = await client.tts.convert(text="Hello!") asyncio.run(main()) ``` Use [`AsyncFishAudio`](/api-reference/sdk/python/client#asyncfishaudio-objects) when: * Building async web applications (FastAPI, Sanic, etc.) * Processing multiple requests concurrently * Integrating with other async libraries * You need maximum performance ## Resource Clients The SDK organizes functionality into resource clients: | Resource | Description | Key Methods | | ----------------------------------------------------------------------------- | ------------------ | ----------------------------------------------------- | | [`client.tts`](/api-reference/sdk/python/resources#ttsclient-objects) | Text-to-speech | `convert()`, `stream()`, `stream_websocket()` | | [`client.asr`](/api-reference/sdk/python/resources#asrclient-objects) | Speech recognition | `transcribe()` | | [`client.voices`](/api-reference/sdk/python/resources#voicesclient-objects) | Voice management | `list()`, `get()`, `create()`, `update()`, `delete()` | | [`client.account`](/api-reference/sdk/python/resources#accountclient-objects) | Account info | `get_credits()`, `get_package()` | ## Utility Functions The SDK includes helpful utilities (requires `utils` extra): ```python theme={null} from fishaudio.utils import save, play, stream # Save audio to file save(audio, "output.mp3") # Play audio (automatically detects environment) 
play(audio) # Works in Jupyter, regular Python, etc. # Stream audio in real-time (requires mpv) stream(audio_iterator) ``` Use [`play()`](/api-reference/sdk/python/utils#play) for playback and [`save()`](/api-reference/sdk/python/utils#save) for writing audio files. Learn more in the [API Reference - Utils](/api-reference/sdk/python/utils). ## Error Handling The SDK provides a comprehensive exception hierarchy: ```python theme={null} from fishaudio import FishAudio from fishaudio.exceptions import ( FishAudioError, AuthenticationError, RateLimitError, ValidationError ) client = FishAudio() try: audio = client.tts.convert(text="Hello!") except AuthenticationError: print("Invalid API key") except RateLimitError: print("Rate limit exceeded. Please wait before retrying.") except ValidationError as e: print(f"Invalid request: {e}") except FishAudioError as e: print(f"API error: {e}") ``` The SDK includes exceptions for [`AuthenticationError`](/api-reference/sdk/python/exceptions#authenticationerror-objects), [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects) for common error scenarios. Learn more in the [API Reference - Exceptions](/api-reference/sdk/python/exceptions). 
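When a `RateLimitError` surfaces, retrying with exponential backoff is the usual recovery. The delay schedule itself is SDK-independent and can be sketched as a generator (`backoff_delays` is our name; adjust the base and cap to your rate limits):

```python
def backoff_delays(max_retries: int = 3, base: float = 2.0, cap: float = 60.0):
    """Yield wait times in seconds: 1, base, base**2, ... capped at `cap`."""
    for attempt in range(max_retries):
        yield min(base ** attempt, cap)
```

Pair it with the `except RateLimitError` branch above, sleeping for each yielded delay between attempts.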
## Next Steps * Set up API keys and client configuration * Generate natural-sounding speech * Clone voices and manage voice models * Transcribe audio to text * Real-time audio streaming * Complete API documentation ## Resources * [GitHub Repository](https://github.com/fishaudio/fish-audio-python) * [PyPI Package](https://pypi.org/project/fish-audio-sdk/) * [Migration Guide](/archive/python-sdk-legacy/migration-guide) - Upgrade from legacy SDK * [Best Practices](/developer-guide/best-practices/) - Production-ready tips * [API Reference](/api-reference/sdk/python/) - Detailed documentation # Speech-to-Text Source: https://docs.fish.audio/developer-guide/sdk-guide/python/speech-to-text Transcribe audio to text with the Fish Audio Python SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account and complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. 
## Basic Transcription Transcribe audio files to text with automatic language detection using [`asr.transcribe()`](/api-reference/sdk/python/resources#transcribe): ```python Synchronous focus={6-10} theme={null} from fishaudio import FishAudio client = FishAudio() # Transcribe audio with open("audio.mp3", "rb") as f: result = client.asr.transcribe(audio=f.read()) print(f"Transcription: {result.text}") print(f"Duration: {result.duration}ms") ``` ```python Asynchronous focus={8-12} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Transcribe audio with open("audio.mp3", "rb") as f: result = await client.asr.transcribe(audio=f.read()) print(f"Transcription: {result.text}") print(f"Duration: {result.duration}ms") asyncio.run(main()) ``` The [`ASRResponse`](/api-reference/sdk/python/types#asrresponse-objects) object contains the full transcription and segment details. ## Language Specification Specify the language for more accurate transcription: ```python Synchronous focus={5-11} theme={null} from fishaudio import FishAudio client = FishAudio() # Specify language code with open("chinese_audio.mp3", "rb") as f: result = client.asr.transcribe( audio=f.read(), language="zh" # Chinese ) print(result.text) ``` ```python Asynchronous focus={7-13} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Specify language code with open("chinese_audio.mp3", "rb") as f: result = await client.asr.transcribe( audio=f.read(), language="zh" # Chinese ) print(result.text) asyncio.run(main()) ``` Auto-detection works well for most cases, but specifying the language can improve accuracy, especially for languages with similar phonetics. 
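The `language` parameter takes ISO 639-1 codes such as `"zh"` above. If your application works with human-readable names, a small lookup table keeps the mapping in one place; this is an illustrative sketch with a handful of entries (confirm which codes the ASR endpoint supports before relying on them):

```python
# Illustrative subset of ISO 639-1 codes; extend as needed.
LANGUAGE_CODES = {
    "english": "en",
    "chinese": "zh",
    "japanese": "ja",
    "spanish": "es",
}

def language_code(name):
    """Map a human-readable language name to an ISO 639-1 code.

    Returns None for unknown names, which callers can treat as
    "omit the parameter and let the SDK auto-detect".
    """
    return LANGUAGE_CODES.get(name.strip().lower())
```

For example, `language_code("Chinese")` yields the `"zh"` used in the snippet above.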
## Segment Timestamps Access word-level or phrase-level timestamps: ```python Synchronous focus={5-14} theme={null} from fishaudio import FishAudio client = FishAudio() # Transcribe with segments with open("audio.mp3", "rb") as f: result = client.asr.transcribe(audio=f.read()) # Access full text print(f"Full text: {result.text}") # Iterate through segments for segment in result.segments: print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}") ``` ```python Asynchronous focus={7-16} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Transcribe with segments with open("audio.mp3", "rb") as f: result = await client.asr.transcribe(audio=f.read()) # Access full text print(f"Full text: {result.text}") # Iterate through segments for segment in result.segments: print(f"[{segment.start}ms - {segment.end}ms]: {segment.text}") asyncio.run(main()) ``` ## Next Steps * Convert transcribed text back to speech * Use transcribed audio for voice cloning * Complete ASR API documentation * Production tips and optimization ## Related Resources * [ASR Types Reference](/api-reference/sdk/python/types#asr) - ASR response data structures * [Error Handling](/api-reference/sdk/python/exceptions) - Exception types and handling # Text-to-Speech Source: https://docs.fish.audio/developer-guide/sdk-guide/python/text-to-speech Generate natural-sounding speech with the Fish Audio Python SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account and complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. 
Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. ## Understanding TTS Methods The SDK provides three methods for text-to-speech generation, each optimized for different use cases: | Method | Returns | Best For | | --- | --- | --- | | [`convert()`](/api-reference/sdk/python/resources#convert) | Complete audio bytes | Most use cases - simple, gets full audio at once | | [`stream()`](/api-reference/sdk/python/resources#stream) | `AudioStream` | Chunk-by-chunk processing, memory-efficient transfer | | [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) | Audio bytes iterator | Real-time streaming with dynamic text (LLM responses, conversational AI) | Use `convert()` for most use cases. Use `stream()` for memory efficiency when handling large files. Use `stream_websocket()` when text is generated dynamically in real-time. 
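The guidance above can be condensed into a small selection helper, useful when one code path serves several features. Purely illustrative; the function and flag names are ours:

```python
def pick_tts_method(text_is_dynamic: bool, large_output: bool = False) -> str:
    """Mirror the guidance table: websocket streaming for dynamically
    generated text, chunked streaming for large outputs, convert() otherwise."""
    if text_is_dynamic:
        return "stream_websocket"
    if large_output:
        return "stream"
    return "convert"
```

For instance, an LLM chat feature would land on `"stream_websocket"`, while a one-shot narration of known text would land on `"convert"`.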
## Basic Usage Generate speech from text with a single function call: ```python Synchronous focus={6-9} theme={null} from fishaudio import FishAudio from fishaudio.utils import save, play client = FishAudio() # Generate speech (returns bytes) audio = client.tts.convert(text="Hello, welcome to Fish Audio!") # Play or save the audio play(audio) save(audio, "output.mp3") ``` ```python Asynchronous focus={8-11} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import save, play async def main(): client = AsyncFishAudio() # Generate speech (returns bytes) audio = await client.tts.convert(text="Hello, welcome to Fish Audio!") # Play or save the audio play(audio) save(audio, "output.mp3") asyncio.run(main()) ``` ## Using Voice Models Specify a voice model for consistent voice characteristics: ```python Synchronous focus={6-10} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Use a specific voice audio = client.tts.convert( text="This uses a specific voice model", reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian ) play(audio) ``` ```python Asynchronous focus={8-12} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Use a specific voice audio = await client.tts.convert( text="This uses a specific voice model", reference_id="bf322df2096a46f18c579d0baa36f41d" # Adrian ) play(audio) asyncio.run(main()) ``` ### Finding Voice Models Get voice model IDs from the Fish Audio website or programmatically: ```python Synchronous focus={5-16} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # List available voices voices = client.voices.list(language="en", tags="male") for voice in voices.items: print(f"{voice.title}: {voice.id}") # Use a voice from the list audio = client.tts.convert( text="Generated with discovered voice", 
reference_id=voices.items[0].id ) play(audio) ``` ```python Asynchronous focus={7-18} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # List available voices voices = await client.voices.list(language="en", tags="male") for voice in voices.items: print(f"{voice.title}: {voice.id}") # Use a voice from the list audio = await client.tts.convert( text="Generated with discovered voice", reference_id=voices.items[0].id ) play(audio) asyncio.run(main()) ``` Learn more in the [Voice Cloning guide](/developer-guide/sdk-guide/python/voice-cloning). ## Emotions and Expressions The `(parenthesis)` syntax below applies to the S1 model. S2 uses `[bracket]` syntax with natural language descriptions and is not limited to a fixed set of tags. See the [Models Overview](/developer-guide/models-pricing/models-overview#s2-natural-language-control) for details. Add emotional expressions to make speech more natural: ```python Synchronous focus={5-16} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() text = """ (happy) I'm excited to announce this! (sad) Unfortunately, it didn't work out. (angry) This is so frustrating! (calm) Let me explain the details. """ audio = client.tts.convert( text=text, reference_id="933563129e564b19a115bedd57b7406a" # Sarah ) play(audio) ``` ```python Asynchronous focus={7-18} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() text = """ (happy) I'm excited to announce this! (sad) Unfortunately, it didn't work out. (angry) This is so frustrating! (calm) Let me explain the details. 
""" audio = await client.tts.convert( text=text, reference_id="933563129e564b19a115bedd57b7406a" # Sarah ) play(audio) asyncio.run(main()) ``` See the [Emotion Reference](/api-reference/emotion-reference) for all available emotions and [Fine-grained Control](/developer-guide/core-features/fine-grained-control) for advanced usage. ## Audio Formats Choose the output format based on your needs: ```python Synchronous focus={5-21} theme={null} from fishaudio import FishAudio client = FishAudio() # MP3 (default) - good balance of quality and size audio = client.tts.convert( text="MP3 format", format="mp3" ) # WAV - uncompressed, highest quality audio = client.tts.convert( text="WAV format", format="wav" ) # PCM - raw audio data for streaming audio = client.tts.convert( text="PCM format", format="pcm" ) ``` ```python Asynchronous focus={7-23} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # MP3 (default) - good balance of quality and size audio = await client.tts.convert( text="MP3 format", format="mp3" ) # WAV - uncompressed, highest quality audio = await client.tts.convert( text="WAV format", format="wav" ) # PCM - raw audio data for streaming audio = await client.tts.convert( text="PCM format", format="pcm" ) asyncio.run(main()) ``` ## Prosody Control Adjust speech speed and volume for natural-sounding output: ```python Synchronous focus={6-10} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Simple speed adjustment audio = client.tts.convert( text="This will be spoken faster", speed=1.5 # 1.5x speed (range: 0.5-2.0) ) play(audio) ``` ```python Asynchronous focus={8-12} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Simple speed adjustment audio = await client.tts.convert( text="This will be spoken faster", speed=1.5 # 1.5x speed (range: 0.5-2.0) ) play(audio) 
asyncio.run(main()) ``` For combined speed and volume control, use [`TTSConfig`](/api-reference/sdk/python/types#ttsconfig-objects) with [`Prosody`](/api-reference/sdk/python/types#prosody-objects): ```python Synchronous focus={7-17} theme={null} from fishaudio import FishAudio from fishaudio.types import TTSConfig, Prosody from fishaudio.utils import play client = FishAudio() # Configure prosody with TTSConfig audio = client.tts.convert( text="Adjusted speech with custom speed and volume", config=TTSConfig( prosody=Prosody( speed=1.2, # 20% faster volume=5 # Louder (range: -20 to 20) ) ) ) play(audio) ``` ```python Asynchronous focus={9-19} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import TTSConfig, Prosody from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Configure prosody with TTSConfig audio = await client.tts.convert( text="Adjusted speech with custom speed and volume", config=TTSConfig( prosody=Prosody( speed=1.2, # 20% faster volume=5 # Louder (range: -20 to 20) ) ) ) play(audio) asyncio.run(main()) ``` ## Reusable TTS Configuration Create a configuration once and reuse it across multiple generations: ```python Synchronous focus={5-18} theme={null} from fishaudio import FishAudio from fishaudio.types import TTSConfig, Prosody client = FishAudio() # Define config once my_config = TTSConfig( prosody=Prosody(speed=1.2, volume=-5), reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian format="wav", latency="balanced" ) # Reuse across multiple generations audio1 = client.tts.convert(text="Welcome to our product demonstration.", config=my_config) audio2 = client.tts.convert(text="Let me show you the key features.", config=my_config) audio3 = client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) ``` ```python Asynchronous focus={7-20} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import TTSConfig, Prosody async def 
main(): client = AsyncFishAudio() # Define config once my_config = TTSConfig( prosody=Prosody(speed=1.2, volume=-5), reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian format="wav", latency="balanced" ) # Reuse across multiple generations audio1 = await client.tts.convert(text="Welcome to our product demonstration.", config=my_config) audio2 = await client.tts.convert(text="Let me show you the key features.", config=my_config) audio3 = await client.tts.convert(text="Thank you for watching this tutorial.", config=my_config) asyncio.run(main()) ``` ## Chunk-by-Chunk Streaming Use `stream()` for memory-efficient transfer and progressive download. Chunks are network transmission units (not semantic audio segments): ```python Synchronous focus={5-8} theme={null} from fishaudio import FishAudio client = FishAudio() # Collect all chunks efficiently audio_stream = client.tts.stream(text="Long text here") audio = audio_stream.collect() # Returns complete audio as bytes ``` ```python Asynchronous focus={7-10} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Collect all chunks efficiently audio_stream = await client.tts.stream(text="Long text here") audio = await audio_stream.collect() # Returns complete audio as bytes asyncio.run(main()) ``` For streaming to files or network without buffering in memory: ```python Synchronous focus={5-9} theme={null} from fishaudio import FishAudio client = FishAudio() # Stream directly to file (memory efficient for large audio) audio_stream = client.tts.stream(text="Very long text...") with open("output.mp3", "wb") as f: for chunk in audio_stream: f.write(chunk) # Write each chunk as it arrives ``` ```python Asynchronous focus={7-11} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Stream directly to file (memory efficient for large audio) audio_stream = await client.tts.stream(text="Very long text...") with 
open("output.mp3", "wb") as f: async for chunk in audio_stream: f.write(chunk) # Write each chunk as it arrives asyncio.run(main()) ``` Use `stream()` when you have complete text upfront. For real-time streaming with dynamically generated text (LLMs, live captions), use `stream_websocket()` instead. ## Real-time WebSocket Streaming For real-time applications where text is generated dynamically, use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket). This is perfect for LLM integrations, conversational AI, and live captions: ### Basic WebSocket Streaming ```python Synchronous focus={5-15} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Stream dynamically generated text def text_chunks(): yield "Hello, " yield "this is " yield "streaming text!" audio_stream = client.tts.stream_websocket( text_chunks(), latency="balanced" ) play(audio_stream) ``` ```python Asynchronous focus={7-16} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Stream dynamically generated text async def text_chunks(): yield "Hello, " yield "this is " yield "streaming text!" audio_stream = await client.tts.stream_websocket( text_chunks(), latency="balanced" ) play(audio_stream) asyncio.run(main()) ``` ### Understanding `FlushEvent` The [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects) forces the TTS engine to immediately generate audio from the accumulated text buffer. This is useful when you want to ensure audio is generated at specific points, even if the buffer hasn't reached the optimal chunk size. ```python Synchronous focus={6-18} theme={null} from fishaudio import FishAudio from fishaudio.types import FlushEvent client = FishAudio() # Use FlushEvent to force immediate generation def text_with_flush(): yield "This is the first sentence. " yield "This is the second sentence. 
" yield FlushEvent() # Force audio generation NOW yield "This starts a new segment. " yield "And continues here." yield FlushEvent() # Force final generation audio_stream = client.tts.stream_websocket(text_with_flush()) # Process each audio chunk as it arrives for chunk in audio_stream: print(f"Received audio chunk: {len(chunk)} bytes") ``` ```python Asynchronous focus={8-20} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import FlushEvent async def main(): client = AsyncFishAudio() # Use FlushEvent to force immediate generation async def text_with_flush(): yield "This is the first sentence. " yield "This is the second sentence. " yield FlushEvent() # Force audio generation NOW yield "This starts a new segment. " yield "And continues here." yield FlushEvent() # Force final generation audio_stream = await client.tts.stream_websocket(text_with_flush()) # Process each audio chunk as it arrives async for chunk in audio_stream: print(f"Received audio chunk: {len(chunk)} bytes") asyncio.run(main()) ``` Without `FlushEvent`, the engine automatically generates audio when the buffer reaches an optimal size. Use `FlushEvent` to control exactly when audio should be generated, which can reduce perceived latency in interactive applications. ### `TextEvent` vs Plain Strings You can yield plain strings (recommended for simplicity) or use [`TextEvent`](/api-reference/sdk/python/types#textevent-objects) for explicit control: ```python Synchronous focus={6-17} theme={null} from fishaudio import FishAudio from fishaudio.types import TextEvent client = FishAudio() # Both approaches are equivalent def text_as_strings(): yield "Hello, " yield "world!" 
def text_as_events(): yield TextEvent(text="Hello, ") yield TextEvent(text="world!") # Use whichever style you prefer audio1 = client.tts.stream_websocket(text_as_strings()) audio2 = client.tts.stream_websocket(text_as_events()) ``` ```python Asynchronous focus={8-19} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import TextEvent async def main(): client = AsyncFishAudio() # Both approaches are equivalent async def text_as_strings(): yield "Hello, " yield "world!" async def text_as_events(): yield TextEvent(text="Hello, ") yield TextEvent(text="world!") # Use whichever style you prefer audio1 = await client.tts.stream_websocket(text_as_strings()) audio2 = await client.tts.stream_websocket(text_as_events()) asyncio.run(main()) ``` ### LLM Integration Pattern WebSocket streaming shines when integrating with LLM streaming responses. The TTS engine acts as an accumulator, buffering text until it has enough to generate natural-sounding audio: ```python Synchronous focus={5-19} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Simulate streaming LLM response def llm_stream(): """Simulates text chunks from an LLM""" tokens = [ "The ", "weather ", "today ", "is ", "sunny ", "with ", "clear ", "skies. ", "Perfect ", "for ", "outdoor ", "activities!" ] for token in tokens: yield token # Stream to speech in real-time audio_stream = client.tts.stream_websocket(llm_stream()) play(audio_stream) ``` ```python Asynchronous focus={7-21} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Simulate streaming LLM response async def llm_stream(): """Simulates text chunks from an LLM""" tokens = [ "The ", "weather ", "today ", "is ", "sunny ", "with ", "clear ", "skies. ", "Perfect ", "for ", "outdoor ", "activities!" 
] for token in tokens: yield token # Stream to speech in real-time audio_stream = await client.tts.stream_websocket(llm_stream()) play(audio_stream) asyncio.run(main()) ``` The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`. Learn more in the [WebSocket Streaming guide](/developer-guide/sdk-guide/python/websocket). ## Advanced Configuration Comprehensive `TTSConfig` with all available parameters: ```python focus={3-24} theme={null} from fishaudio.types import TTSConfig, Prosody # All TTSConfig parameters config = TTSConfig( # Audio output settings format="mp3", sample_rate=44100, # Custom sample rate (optional) mp3_bitrate=192, # 64, 128, or 192 kbps opus_bitrate=64, # For Opus format: -1000, 24, 32, 48, or 64 normalize=True, # Normalize audio levels # Generation settings chunk_length=200, # Characters per chunk (100-300) latency="balanced", # "normal" or "balanced" # Voice/style settings reference_id="bf322df2096a46f18c579d0baa36f41d", # Adrian prosody=Prosody(speed=1.1, volume=0), # references=[ReferenceAudio(...)] # For instant cloning # Model parameters temperature=0.7, # Randomness (0.0-1.0) top_p=0.7 # Token selection (0.0-1.0) ) # Use with any client audio = client.tts.convert(text="Your text here", config=config) ``` `TTSConfig` works the same for both sync and async clients. See [TTSConfig API Reference](/api-reference/sdk/python/types#ttsconfig-objects) for detailed documentation on each parameter and their defaults. 
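Several `TTSConfig` fields have documented ranges (`mp3_bitrate` of 64/128/192, `chunk_length` 100-300, `temperature` and `top_p` 0.0-1.0, `speed` 0.5-2.0, `volume` -20 to 20). A client-side sanity check can surface mistakes before a request is rejected; this validator is our sketch, not an SDK feature, and only covers the ranges quoted above:

```python
def validate_tts_params(**params):
    """Return a list of problems for documented TTSConfig ranges (empty if OK)."""
    rules = {
        "mp3_bitrate":  lambda v: v in (64, 128, 192),
        "chunk_length": lambda v: 100 <= v <= 300,
        "temperature":  lambda v: 0.0 <= v <= 1.0,
        "top_p":        lambda v: 0.0 <= v <= 1.0,
        "speed":        lambda v: 0.5 <= v <= 2.0,
        "volume":       lambda v: -20 <= v <= 20,
    }
    problems = []
    for name, value in params.items():
        check = rules.get(name)  # unknown names are passed through unchecked
        if check is not None and not check(value):
            problems.append(f"{name}={value!r} is outside the documented range")
    return problems
```

Run it on the keyword arguments you are about to put into `TTSConfig` and raise or log if the returned list is non-empty.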
## Error Handling Handle common TTS errors gracefully: ```python theme={null} from fishaudio import FishAudio from fishaudio.exceptions import ( RateLimitError, ValidationError, NotFoundError, FishAudioError ) import time client = FishAudio() try: audio = client.tts.convert( text="Your text here", reference_id="voice_id" ) except RateLimitError: print("Rate limit exceeded. Please wait before retrying.") time.sleep(60) # Wait before retry except NotFoundError: print("Voice model not found. Check the reference_id") except ValidationError as e: print(f"Invalid request: {e}") except FishAudioError as e: print(f"API error: {e}") ``` Common exceptions include [`RateLimitError`](/api-reference/sdk/python/exceptions#ratelimiterror-objects), [`ValidationError`](/api-reference/sdk/python/exceptions#validationerror-objects), [`NotFoundError`](/api-reference/sdk/python/exceptions#notfounderror-objects), and [`FishAudioError`](/api-reference/sdk/python/exceptions#fishaudioerror-objects). ## Best Practices For long texts, adjust `chunk_length` in `TTSConfig`: ```python theme={null} from fishaudio import FishAudio from fishaudio.types import TTSConfig client = FishAudio() audio = client.tts.convert( text="Very long text...", config=TTSConfig(chunk_length=250) # Larger chunks for efficiency ) ``` If you generate the same speech repeatedly, cache the results: ```python theme={null} import os from fishaudio import FishAudio from fishaudio.utils import save client = FishAudio() def get_or_generate_speech(text, cache_file): if os.path.exists(cache_file): with open(cache_file, "rb") as f: return f.read() audio = client.tts.convert(text=text) save(audio, cache_file) return audio ``` Implement exponential backoff for rate limits: ```python theme={null} from fishaudio import FishAudio from fishaudio.exceptions import RateLimitError import time client = FishAudio() def generate_with_retry(text, max_retries=3): for attempt in range(max_retries): try: return client.tts.convert(text=text) 
except RateLimitError: if attempt < max_retries - 1: time.sleep(2 ** attempt) # Exponential backoff else: raise ``` Balance speed vs quality based on your use case: ```python theme={null} from fishaudio import FishAudio client = FishAudio() # For real-time applications audio = client.tts.convert(text="Fast response", latency="balanced") # For highest quality audio = client.tts.convert(text="Best quality", latency="normal") ``` ## Next Steps * Create custom voice models * Real-time audio streaming * Phoneme-level control and paralanguage * Production tips and optimization ## Related Resources * [TTS API Reference](/api-reference/sdk/python/resources#tts) - Complete API documentation * [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Format comparison * [Emotion Reference](/api-reference/emotion-reference) - All available emotions * [Utils Reference](/api-reference/sdk/python/utils) - Audio utilities # Voice Cloning Source: https://docs.fish.audio/developer-guide/sdk-guide/python/voice-cloning Clone voices and create custom voice models with the Fish Audio Python SDK ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account and complete the steps to verify your account. 3. Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. 
## Instant Voice Cloning Clone a voice on-the-fly without creating a persistent model using [`ReferenceAudio`](/api-reference/sdk/python/types#referenceaudio-objects): ```python Synchronous focus={6-15} theme={null} from fishaudio import FishAudio from fishaudio.types import ReferenceAudio from fishaudio.utils import play client = FishAudio() # Clone from reference audio with open("reference_voice.wav", "rb") as f: audio = client.tts.convert( text="This will sound like the reference voice", references=[ReferenceAudio( audio=f.read(), text="Text spoken in the reference audio" )] ) play(audio) ``` ```python Asynchronous focus={8-17} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import ReferenceAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Clone from reference audio with open("reference_voice.wav", "rb") as f: audio = await client.tts.convert( text="This will sound like the reference voice", references=[ReferenceAudio( audio=f.read(), text="Text spoken in the reference audio" )] ) play(audio) asyncio.run(main()) ``` Instant voice cloning is perfect for one-time use cases. For repeated use of the same voice, create a persistent voice model instead. 
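Both instant cloning (`references=[...]`) and persistent model creation (parallel `voices`/`texts` lists) pair each clip with its transcript. Keeping samples as `(audio_bytes, transcript)` tuples and splitting them at the call site avoids misaligning a transcript with the wrong clip; a minimal sketch (`split_pairs` is our helper, not an SDK function):

```python
def split_pairs(samples):
    """Split (audio_bytes, transcript) pairs into parallel lists.

    The returned lists line up index-for-index, which is what keeps
    each transcript attached to the right clip when passing
    voices=... and texts=... together.
    """
    voices, texts = [], []
    for audio, transcript in samples:
        voices.append(audio)
        texts.append(transcript)
    return voices, texts
```

Then `voices, texts = split_pairs(samples)` feeds straight into `client.voices.create(title=..., voices=voices, texts=texts)`.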
## Multiple Reference Samples Improve voice quality by providing multiple reference samples: ```python Synchronous focus={6-21} theme={null} from fishaudio import FishAudio from fishaudio.types import ReferenceAudio from fishaudio.utils import play client = FishAudio() # Load multiple reference samples references = [] samples = [ ("sample1.wav", "First sample transcript"), ("sample2.wav", "Second sample transcript"), ("sample3.wav", "Third sample transcript") ] for audio_file, transcript in samples: with open(audio_file, "rb") as f: references.append(ReferenceAudio( audio=f.read(), text=transcript )) # Generate with multiple references audio = client.tts.convert( text="This voice is trained on multiple samples", references=references ) play(audio) ``` ```python Asynchronous focus={8-23} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import ReferenceAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Load multiple reference samples references = [] samples = [ ("sample1.wav", "First sample transcript"), ("sample2.wav", "Second sample transcript"), ("sample3.wav", "Third sample transcript") ] for audio_file, transcript in samples: with open(audio_file, "rb") as f: references.append(ReferenceAudio( audio=f.read(), text=transcript )) # Generate with multiple references audio = await client.tts.convert( text="This voice is trained on multiple samples", references=references ) play(audio) asyncio.run(main()) ``` ## Creating Persistent Voice Models Create a reusable voice model for consistent voice characteristics using [`voices.create()`](/api-reference/sdk/python/resources#create): ```python Synchronous focus={5-20} theme={null} from fishaudio import FishAudio client = FishAudio() # Prepare voice samples voice_samples = [] with open("voice1.wav", "rb") as f1: voice_samples.append(f1.read()) with open("voice2.wav", "rb") as f2: voice_samples.append(f2.read()) # Create voice model voice = 
client.voices.create( title="My Custom Voice", voices=voice_samples, description="A custom voice for my project", tags=["custom", "english"], visibility="private" ) print(f"Created voice: {voice.id}") ``` ```python Asynchronous focus={7-22} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Prepare voice samples voice_samples = [] with open("voice1.wav", "rb") as f1: voice_samples.append(f1.read()) with open("voice2.wav", "rb") as f2: voice_samples.append(f2.read()) # Create voice model voice = await client.voices.create( title="My Custom Voice", voices=voice_samples, description="A custom voice for my project", tags=["custom", "english"], visibility="private" ) print(f"Created voice: {voice.id}") asyncio.run(main()) ``` ### With Transcripts Providing transcripts is faster and more accurate than automatic transcription. When you provide transcripts, the system skips running ASR (speech recognition), resulting in better performance and quality: ```python Synchronous focus={5-27} theme={null} from fishaudio import FishAudio client = FishAudio() # Voice samples with transcripts samples = [ ("voice1.wav", "This is the first sample"), ("voice2.wav", "This is the second sample"), ("voice3.wav", "This is the third sample") ] voices = [] texts = [] for audio_file, transcript in samples: with open(audio_file, "rb") as f: voices.append(f.read()) texts.append(transcript) # Create voice with transcripts voice = client.voices.create( title="High Quality Voice", voices=voices, texts=texts, description="Voice with accurate transcripts", enhance_audio_quality=True ) print(f"Created voice: {voice.id}") ``` ```python Asynchronous focus={7-29} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Voice samples with transcripts samples = [ ("voice1.wav", "This is the first sample"), ("voice2.wav", "This is the second sample"), ("voice3.wav", "This is the third sample") 
] voices = [] texts = [] for audio_file, transcript in samples: with open(audio_file, "rb") as f: voices.append(f.read()) texts.append(transcript) # Create voice with transcripts voice = await client.voices.create( title="High Quality Voice", voices=voices, texts=texts, description="Voice with accurate transcripts", enhance_audio_quality=True ) print(f"Created voice: {voice.id}") asyncio.run(main()) ``` ### Audio Quality Enhancement Enable automatic audio enhancement to clean up noisy reference audio: ```python theme={null} voice = client.voices.create( title="Enhanced Voice", voices=voice_samples, enhance_audio_quality=True # Clean up background noise and normalize levels ) ``` Audio enhancement helps process noisy or lower-quality reference audio. If your audio is already clean and well-recorded, this may not provide additional benefit. ## Managing Voice Models ### List Voices Discover available voices with filtering using [`voices.list()`](/api-reference/sdk/python/resources#list): ```python Synchronous focus={5-11} theme={null} from fishaudio import FishAudio client = FishAudio() # List all voices voices = client.voices.list(page_size=20) print(f"Total voices: {voices.total}") for voice in voices.items: print(f"{voice.title}: {voice.id}") ``` ```python Asynchronous focus={7-13} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # List all voices voices = await client.voices.list(page_size=20) print(f"Total voices: {voices.total}") for voice in voices.items: print(f"{voice.title}: {voice.id}") asyncio.run(main()) ``` ### Filter by Tags and Language ```python Synchronous focus={5-21} theme={null} from fishaudio import FishAudio client = FishAudio() # Filter by tags male_voices = client.voices.list( tags=["male", "english"], page_size=10 ) # Filter by language chinese_voices = client.voices.list( language="zh", page_size=10 ) # Get only your own voices my_voices = client.voices.list( self_only=True, 
page_size=20 ) ``` ```python Asynchronous focus={7-23} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Filter by tags male_voices = await client.voices.list( tags=["male", "english"], page_size=10 ) # Filter by language chinese_voices = await client.voices.list( language="zh", page_size=10 ) # Get only your own voices my_voices = await client.voices.list( self_only=True, page_size=20 ) asyncio.run(main()) ``` ### Get Voice Details Use [`voices.get()`](/api-reference/sdk/python/resources#get) to retrieve voice details: ```python Synchronous focus={5-11} theme={null} from fishaudio import FishAudio client = FishAudio() # Get specific voice voice = client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian print(f"Title: {voice.title}") print(f"Description: {voice.description}") print(f"Tags: {voice.tags}") print(f"Languages: {voice.languages}") ``` ```python Asynchronous focus={7-13} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Get specific voice voice = await client.voices.get("bf322df2096a46f18c579d0baa36f41d") # Adrian print(f"Title: {voice.title}") print(f"Description: {voice.description}") print(f"Tags: {voice.tags}") print(f"Languages: {voice.languages}") asyncio.run(main()) ``` ### Update Voice Metadata Update voice information using [`voices.update()`](/api-reference/sdk/python/resources#update): ```python Synchronous focus={5-11} theme={null} from fishaudio import FishAudio client = FishAudio() # Update voice information client.voices.update( "bf322df2096a46f18c579d0baa36f41d", # Adrian title="Updated Voice Name", description="Updated description", visibility="public", # "public", "unlist", or "private" tags=["updated", "english", "male"] ) ``` ```python Asynchronous focus={7-13} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Update voice information await 
client.voices.update( "bf322df2096a46f18c579d0baa36f41d", # Adrian title="Updated Voice Name", description="Updated description", visibility="public", # "public", "unlist", or "private" tags=["updated", "english", "male"] ) asyncio.run(main()) ``` ### Delete Voice Remove voice models using [`voices.delete()`](/api-reference/sdk/python/resources#delete): ```python Synchronous focus={5-7} theme={null} from fishaudio import FishAudio client = FishAudio() # Delete a voice model client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian print("Voice deleted successfully") ``` ```python Asynchronous focus={7-9} theme={null} import asyncio from fishaudio import AsyncFishAudio async def main(): client = AsyncFishAudio() # Delete a voice model await client.voices.delete("bf322df2096a46f18c579d0baa36f41d") # Adrian print("Voice deleted successfully") asyncio.run(main()) ``` Deleting a voice is permanent and cannot be undone. Make sure you have backups of any important voice models. ## Next Steps Use cloned voices for speech generation Stream audio with custom voices in real-time Complete voice management API documentation Production tips and optimization strategies ## Related Resources * [Voice Types Reference](/api-reference/sdk/python/types#voices) - Voice model data structures * [Audio Formats Guide](/developer-guide/core-features/text-to-speech#audio-formats) - Supported audio formats * [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced voice customization # WebSocket Streaming Source: https://docs.fish.audio/developer-guide/sdk-guide/python/websocket Stream text-to-speech in real-time with WebSocket connections ## Prerequisites Sign up for a free Fish Audio account to get started with our API. 1. Go to [fish.audio/auth/signup](https://fish.audio/auth/signup) 2. Fill in your details to create an account, then complete the verification steps. 3.
Log in to your account and navigate to the [API section](https://fish.audio/app/api-keys) Once you have an account, you'll need an API key to authenticate your requests. 1. Log in to your [Fish Audio Dashboard](https://fish.audio/app/api-keys/) 2. Navigate to the API Keys section 3. Click "Create New Key", give it a descriptive name, and set an expiration if desired 4. Copy your key and store it securely Keep your API key secret! Never commit it to version control or share it publicly. ## Overview Use [`stream_websocket()`](/api-reference/sdk/python/resources#stream_websocket) for real-time text streaming with LLMs and live captions. The connection automatically buffers incoming text and generates audio as it becomes available. ## Basic Usage Stream text chunks and receive audio in real-time: ```python Synchronous focus={5-17} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Define text generator def text_chunks(): yield "Hello, " yield "this is " yield "real-time " yield "streaming!" # Stream audio via WebSocket audio_stream = client.tts.stream_websocket( text_chunks(), latency="balanced" # Use "balanced" for real-time, "normal" for quality ) # Play streamed audio play(audio_stream) ``` ```python Asynchronous focus={8-20} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Define async text generator async def text_chunks(): yield "Hello, " yield "this is " yield "real-time " yield "streaming!" # Stream audio via WebSocket audio_stream = await client.tts.stream_websocket( text_chunks(), latency="balanced" # Use "balanced" for real-time, "normal" for quality ) # Play streamed audio play(audio_stream) asyncio.run(main()) ``` For details on audio formats, voice selection, and advanced configuration options like `TTSConfig`, see the [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech).
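The connection buffers incoming text for you, so chunk size rarely matters, but the generator you pass in can apply any chunking policy you like. As an illustrative sketch, a hypothetical `sentence_chunks` helper (not part of the SDK) that regroups tiny token-level chunks into whole sentences before they are sent:

```python theme={null}
def sentence_chunks(tokens, terminators=".!?"):
    """Accumulate raw tokens and yield whole sentences.

    Yields a chunk whenever a sentence-ending character is seen,
    then flushes whatever text remains at the end of the stream.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.rstrip()
        if stripped and stripped[-1] in terminators:
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer

chunks = list(sentence_chunks(["Hello ", "world. ", "How ", "are ", "you?"]))
# chunks == ["Hello world. ", "How are you?"]
```

The resulting generator can be passed to `stream_websocket()` in place of `text_chunks()` above.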
## Using FlushEvent Force immediate audio generation to create pauses using [`FlushEvent`](/api-reference/sdk/python/types#flushevent-objects): ```python Synchronous focus={6-12} theme={null} from fishaudio import FishAudio from fishaudio.types import FlushEvent client = FishAudio() def text_with_flush(): yield "First sentence. " yield "Second sentence. " yield FlushEvent() # Forces generation NOW yield "Third sentence." audio_stream = client.tts.stream_websocket(text_with_flush()) ``` ```python Asynchronous focus={8-14} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.types import FlushEvent async def main(): client = AsyncFishAudio() async def text_with_flush(): yield "First sentence. " yield "Second sentence. " yield FlushEvent() # Forces generation NOW yield "Third sentence." audio_stream = await client.tts.stream_websocket(text_with_flush()) asyncio.run(main()) ``` See [Text-to-Speech guide](/developer-guide/sdk-guide/python/text-to-speech#understanding-flushevent) for detailed FlushEvent usage and advanced examples. ## LLM Integration WebSocket streaming is designed for integrating with LLM streaming responses. The TTS engine automatically buffers incoming text chunks and generates audio when it has enough context for natural speech: ```python Synchronous focus={5-21} theme={null} from fishaudio import FishAudio from fishaudio.utils import play client = FishAudio() # Simulate streaming LLM response def llm_stream(): """Simulates text chunks from an LLM.""" tokens = [ "The ", "weather ", "today ", "is ", "sunny ", "with ", "clear ", "skies. ", "Perfect ", "for ", "outdoor ", "activities!" 
] for token in tokens: yield token # Stream to speech in real-time audio_stream = client.tts.stream_websocket( llm_stream(), latency="balanced" ) play(audio_stream) ``` ```python Asynchronous focus={7-23} theme={null} import asyncio from fishaudio import AsyncFishAudio from fishaudio.utils import play async def main(): client = AsyncFishAudio() # Simulate streaming LLM response async def llm_stream(): """Simulates text chunks from an LLM.""" tokens = [ "The ", "weather ", "today ", "is ", "sunny ", "with ", "clear ", "skies. ", "Perfect ", "for ", "outdoor ", "activities!" ] for token in tokens: yield token # Stream to speech in real-time audio_stream = await client.tts.stream_websocket( llm_stream(), latency="balanced" ) play(audio_stream) asyncio.run(main()) ``` The WebSocket connection automatically buffers incoming text and generates audio when it has accumulated enough context for natural-sounding speech. You don't need to manually batch tokens unless you want to force generation at specific points using `FlushEvent`. ## Next Steps Learn about non-streaming TTS options, audio formats, TextEvent vs plain strings, and advanced configuration Use custom voices in streams and learn about voice selection Complete streaming API documentation Production streaming optimization ## Related Resources * [WebSocket Types](/api-reference/sdk/python/types#tts) - TextEvent, FlushEvent, and more * [Utils Reference](/api-reference/sdk/python/utils) - Audio playback utilities * [Error Handling](/api-reference/sdk/python/exceptions) - WebSocket exception handling * [Fine-grained Control](/developer-guide/core-features/fine-grained-control) - Advanced speech control # Docker Deployment Source: https://docs.fish.audio/developer-guide/self-hosting/docker-deployment Deploy Fish Audio models using Docker containers Fish Audio provides Docker images for both WebUI and API server deployments. You can use pre-built images from Docker Hub or build custom images locally. 
## Prerequisites Before deploying with Docker, ensure you have: * **Docker** and **Docker Compose** installed * **NVIDIA Docker runtime** (for GPU support) * At least **12GB GPU memory** for CUDA inference * Downloaded model weights (see [Running Inference](/developer-guide/self-hosting/running-inference#download-weights)) ## Pre-built Images Fish Audio provides ready-to-use Docker images on Docker Hub: | Image | Description | Best For | | ------------------------------------------ | ----------------------- | -------------------------------- | | `fishaudio/fish-speech:latest-webui-cuda` | WebUI with CUDA support | Interactive development with GPU | | `fishaudio/fish-speech:latest-webui-cpu` | WebUI CPU-only | Testing without GPU | | `fishaudio/fish-speech:latest-server-cuda` | API server with CUDA | Production deployments with GPU | | `fishaudio/fish-speech:latest-server-cpu` | API server CPU-only | Low-traffic CPU deployments | For production use, we recommend using specific version tags instead of `latest` to ensure consistency across deployments. 
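The image names above follow a `<version>-<component>-<backend>` pattern. If you select images programmatically, a small helper can compose the reference. This is a sketch based only on the tags listed in the table; whether pinned version tags use the exact same layout is an assumption you should verify against the repository on Docker Hub:

```python theme={null}
def fish_speech_image(component, backend, version="latest"):
    """Compose a fishaudio/fish-speech image reference.

    Tag layout inferred from the table above; confirm pinned-version
    tags on Docker Hub before relying on this for production.
    """
    if component not in ("webui", "server"):
        raise ValueError(f"unknown component: {component}")
    if backend not in ("cuda", "cpu"):
        raise ValueError(f"unknown backend: {backend}")
    return f"fishaudio/fish-speech:{version}-{component}-{backend}"

tag = fish_speech_image("server", "cuda")
# tag == "fishaudio/fish-speech:latest-server-cuda"
```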
## Quick Start with Docker Run The fastest way to get started is using `docker run`: ### WebUI Deployment ```bash theme={null} # Create directories for model weights and reference audio mkdir -p checkpoints references # Start WebUI with CUDA support (recommended) docker run -d \ --name fish-speech-webui \ --gpus all \ -p 7860:7860 \ -v ./checkpoints:/app/checkpoints \ -v ./references:/app/references \ -e COMPILE=1 \ fishaudio/fish-speech:latest-webui-cuda # For CPU-only deployment docker run -d \ --name fish-speech-webui-cpu \ -p 7860:7860 \ -v ./checkpoints:/app/checkpoints \ -v ./references:/app/references \ fishaudio/fish-speech:latest-webui-cpu ``` Access the WebUI at `http://localhost:7860` ### API Server Deployment ```bash theme={null} # Start API server with CUDA support docker run -d \ --name fish-speech-server \ --gpus all \ -p 8080:8080 \ -v ./checkpoints:/app/checkpoints \ -v ./references:/app/references \ -e COMPILE=1 \ fishaudio/fish-speech:latest-server-cuda # For CPU-only deployment docker run -d \ --name fish-speech-server-cpu \ -p 8080:8080 \ -v ./checkpoints:/app/checkpoints \ -v ./references:/app/references \ fishaudio/fish-speech:latest-server-cpu ``` Access the API documentation at `http://localhost:8080` Enable the `COMPILE=1` environment variable for \~10x faster inference on CUDA deployments. This uses `torch.compile` to optimize the model. 
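Containers may take a while to load model weights before the port starts accepting connections, especially on the first run with `COMPILE=1`. If you script deployments, it can help to wait for readiness before sending traffic; a minimal sketch (the `wait_for_port` helper is our own, not part of Fish Audio):

```python theme={null}
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=0.5):
    """Poll until a TCP port accepts connections; True if it came up in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False

# Demo against a local listener standing in for the container's port 7860:
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
ready = wait_for_port("127.0.0.1", port, timeout=5.0)
server.close()
```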
## Docker Compose Deployment For development or customization, Docker Compose provides easier configuration management: ### Setup ```bash theme={null} # Clone the repository git clone https://github.com/fishaudio/fish-speech.git cd fish-speech ``` ### Start Services ```bash theme={null} # Start WebUI with CUDA docker compose --profile webui up # Start WebUI with compile optimization COMPILE=1 docker compose --profile webui up # Start API server docker compose --profile server up # Start API server with compile optimization COMPILE=1 docker compose --profile server up # For CPU-only deployment BACKEND=cpu docker compose --profile webui up ``` Run containers in detached mode by adding the `-d` flag: `docker compose --profile webui up -d` ### Environment Variables Customize deployment using environment variables or a `.env` file: ```bash theme={null} # .env file example BACKEND=cuda # or cpu COMPILE=1 # Enable compile optimization GRADIO_PORT=7860 # WebUI port API_PORT=8080 # API server port UV_VERSION=0.8.15 # UV package manager version ``` ## Manual Docker Build For advanced users who need custom configurations: ### Build WebUI Image ```bash theme={null} # Build with CUDA support docker build \ --platform linux/amd64 \ -f docker/Dockerfile \ --build-arg BACKEND=cuda \ --build-arg CUDA_VER=12.6.0 \ --build-arg UV_EXTRA=cu126 \ --target webui \ -t fish-speech-webui:cuda . # Build CPU-only (supports multi-platform) docker build \ --platform linux/amd64,linux/arm64 \ -f docker/Dockerfile \ --build-arg BACKEND=cpu \ --target webui \ -t fish-speech-webui:cpu . ``` ### Build API Server Image ```bash theme={null} # Build with CUDA support docker build \ --platform linux/amd64 \ -f docker/Dockerfile \ --build-arg BACKEND=cuda \ --build-arg CUDA_VER=12.6.0 \ --build-arg UV_EXTRA=cu126 \ --target server \ -t fish-speech-server:cuda . 
``` ### Build Development Image ```bash theme={null} # Build development image with all tools docker build \ --platform linux/amd64 \ -f docker/Dockerfile \ --build-arg BACKEND=cuda \ --target dev \ -t fish-speech-dev:cuda . ``` ### Build Arguments | Argument | Options | Default | Description | | ------------ | ------------------------- | -------- | ------------------- | | `BACKEND` | `cuda`, `cpu` | `cuda` | Compute backend | | `CUDA_VER` | `12.6.0`, etc. | `12.6.0` | CUDA version | | `UV_EXTRA` | `cu126`, `cu128`, `cu129` | `cu126` | UV extra for CUDA | | `UBUNTU_VER` | `24.04`, etc. | `24.04` | Ubuntu base version | | `PY_VER` | `3.12`, etc. | `3.12` | Python version | ## Volume Mounts Both Docker run and Compose methods require these volume mounts: | Host Path | Container Path | Purpose | | --------------- | ------------------ | --------------------------------------- | | `./checkpoints` | `/app/checkpoints` | Model weights directory | | `./references` | `/app/references` | Reference audio files for voice cloning | Ensure model weights are downloaded and placed in the `./checkpoints` directory before starting containers. See [Running Inference](/developer-guide/self-hosting/running-inference#download-weights) for download instructions. 
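A missing or incomplete `./checkpoints` mount is a common reason a container exits at startup, so it can be worth checking the directory before launching. A minimal pre-flight sketch, assuming the default S1-mini layout used elsewhere in this guide:

```python theme={null}
from pathlib import Path
import tempfile

def missing_model_files(checkpoints_dir):
    """List expected weight files absent from the checkpoints directory.

    Assumes the default openaudio-s1-mini layout from this guide;
    adjust `required` if you deploy different weights.
    """
    root = Path(checkpoints_dir)
    required = [root / "openaudio-s1-mini" / "codec.pth"]
    return [str(p) for p in required if not p.is_file()]

# Demo: a correctly populated directory reports nothing missing.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "openaudio-s1-mini"
    model_dir.mkdir()
    (model_dir / "codec.pth").write_bytes(b"")
    missing = missing_model_files(tmp)
# missing == []
```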
## Environment Variables Reference ### WebUI Configuration | Variable | Default | Description | | -------------------- | --------- | ---------------------------- | | `GRADIO_SERVER_NAME` | `0.0.0.0` | WebUI server host | | `GRADIO_SERVER_PORT` | `7860` | WebUI server port | | `GRADIO_SHARE` | `false` | Enable Gradio public sharing | ### API Server Configuration | Variable | Default | Description | | ----------------- | --------- | --------------- | | `API_SERVER_NAME` | `0.0.0.0` | API server host | | `API_SERVER_PORT` | `8080` | API server port | ### Model Configuration | Variable | Default | Description | | ------------------------- | ----------------------------------------- | -------------------------- | | `LLAMA_CHECKPOINT_PATH` | `checkpoints/openaudio-s1-mini` | Path to model weights | | `DECODER_CHECKPOINT_PATH` | `checkpoints/openaudio-s1-mini/codec.pth` | Path to decoder weights | | `DECODER_CONFIG_NAME` | `modded_dac_vq` | Decoder configuration name | ### Performance Optimization | Variable | Default | Description | | --------- | ------- | -------------------------------------------------- | | `COMPILE` | `0` | Enable torch.compile for \~10x speedup (CUDA only) | ## Container Management ### View Logs ```bash theme={null} # Docker run docker logs fish-speech-webui # Docker Compose docker compose logs webui ``` ### Stop Containers ```bash theme={null} # Docker run docker stop fish-speech-webui # Docker Compose docker compose down ``` ### Update Images ```bash theme={null} # Pull latest images docker pull fishaudio/fish-speech:latest-webui-cuda # Restart containers with new image docker compose --profile webui up -d ``` ## GPU Support ### Prerequisites Install NVIDIA Container Toolkit: ```bash theme={null} # Ubuntu/Debian distribution=$(. 
/etc/os-release;echo $ID$VERSION_ID) curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \ sudo tee /etc/apt/sources.list.d/nvidia-docker.list sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit sudo systemctl restart docker ``` ### Verify GPU Access ```bash theme={null} docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi ``` GPU support requires NVIDIA Docker runtime. For CPU-only deployment, remove the `--gpus all` flag and use CPU images. ## Troubleshooting ### Container Won't Start Check logs for errors: ```bash theme={null} docker logs fish-speech-webui ``` Common issues: * Missing model weights in `./checkpoints` * Port already in use (change port mapping) * Insufficient GPU memory ### GPU Not Detected Verify NVIDIA Docker runtime is installed: ```bash theme={null} docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu24.04 nvidia-smi ``` ### Performance Issues 1. Enable compile optimization: `COMPILE=1` 2. Ensure GPU is being used (check with `nvidia-smi`) 3. Verify sufficient GPU memory is available ## Next Steps * **[Run inference](/developer-guide/self-hosting/running-inference)** - Learn how to generate speech * **[Download models](https://huggingface.co/fishaudio)** - Get pre-trained weights * **[API documentation](/api-reference/introduction)** - Integrate with your applications # Local Model Setup Source: https://docs.fish.audio/developer-guide/self-hosting/local-setup Install and configure Fish Audio models for local inference This guide is for advanced users who want to self-host Fish Audio models. For most users, we recommend using the [Fish Audio API](https://fish.audio) for easier integration and automatic updates. 
## Prerequisites Before you begin, ensure you have: * **GPU**: 12GB VRAM minimum (for inference) * **OS**: Linux or WSL (Windows Subsystem for Linux) * **System dependencies**: Audio processing libraries Install required system packages: ```bash theme={null} apt install portaudio19-dev libsox-dev ffmpeg ``` ## Installation Methods Fish Audio supports multiple installation methods. Choose the one that best fits your development environment. ### Conda Installation Conda provides a stable, isolated Python environment: ```bash theme={null} # Create a new environment with Python 3.12 conda create -n fish-speech python=3.12 conda activate fish-speech # GPU installation (choose your CUDA version: cu126, cu128, cu129) pip install -e .[cu129] # CPU-only installation (slower, not recommended for production) pip install -e .[cpu] # Default installation (uses PyTorch default index) pip install -e . ``` For best performance, match your CUDA version with your GPU driver. Use `nvidia-smi` to check your CUDA version. ### UV Installation [UV](https://github.com/astral-sh/uv) provides faster dependency resolution and installation: ```bash theme={null} # GPU installation (choose your CUDA version: cu126, cu128, cu129) uv sync --python 3.12 --extra cu129 # CPU-only installation uv sync --python 3.12 --extra cpu ``` UV is recommended for faster setup times, especially when working with large dependency trees. ### Intel Arc XPU Support For Intel Arc GPU users, install with XPU support: ```bash theme={null} # Create environment conda create -n fish-speech python=3.12 conda activate fish-speech # Install required C++ standard library conda install libstdcxx -c conda-forge # Install PyTorch with Intel XPU support pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu # Install Fish Speech pip install -e . ``` The `--compile` optimization flag is not supported on Windows and macOS. 
To use compile acceleration, you need to install Triton manually. ## Repository Setup Clone the Fish Speech repository to get started: ```bash theme={null} git clone https://github.com/fishaudio/fish-speech.git cd fish-speech ``` Then follow one of the installation methods above. ## Next Steps Once installation is complete, you can: * **[Set up Docker deployment](/developer-guide/self-hosting/docker-deployment)** - Use containerized deployment for easier management * **[Run inference](/developer-guide/self-hosting/running-inference)** - Start generating speech with your local models * **Download models** - Get pre-trained weights from [Hugging Face](https://huggingface.co/fishaudio) ## Hardware Recommendations For optimal performance: | Use Case | Recommended GPU | VRAM | Expected Speed | | ----------- | --------------- | ----- | ----------------------- | | Development | RTX 3060 | 12GB | \~1:15 real-time factor | | Production | RTX 4090 | 24GB | \~1:7 real-time factor | | Enterprise | A100 | 40GB+ | \~1:5 real-time factor | Real-time factor indicates how much faster than real-time the model can generate audio. For example, 1:7 means generating 1 minute of audio takes \~8.5 seconds. ## Troubleshooting ### CUDA Out of Memory If you encounter CUDA out of memory errors: 1. Reduce batch size in inference settings 2. Use `--half` flag for FP16 inference 3. Close other GPU-intensive applications ### Package Installation Errors If you encounter dependency conflicts: 1. Try using UV instead of pip for better dependency resolution 2. Create a fresh conda environment 3. Ensure you're using Python 3.12 (other versions may have compatibility issues) ## Community Support Need help with local setup? 
* Join our [Discord community](https://discord.gg/dF9Db2Tt3Y) for community support * Check [GitHub Issues](https://github.com/fishaudio/fish-speech/issues) for known problems * Contact [enterprise support](mailto:support@fish.audio) for commercial deployments # Running Inference Source: https://docs.fish.audio/developer-guide/self-hosting/running-inference Generate speech using self-hosted Fish Audio models Fish Audio supports multiple inference methods: command line, HTTP API, WebUI, and GUI. Choose the method that best fits your workflow. This guide assumes you have already [installed Fish Audio locally](/developer-guide/self-hosting/local-setup) or [set up Docker deployment](/developer-guide/self-hosting/docker-deployment). ## Download Weights Before running inference, download the required model weights from Hugging Face: ```bash theme={null} # Install Hugging Face CLI (if not already installed) pip install huggingface_hub[cli] # or uv tool install huggingface_hub[cli] # Download Fish Audio S1-mini weights hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini ``` **Fish Audio S1-mini** is the open-source distilled version (0.5B parameters) optimized for local deployment. The full **S1** model (4B parameters) is available exclusively on [Fish Audio cloud](https://fish.audio). ## Command Line Inference Command line inference provides maximum control and is ideal for scripting and batch processing. ### Step 1: Extract VQ Tokens from Reference Audio First, encode your reference audio to get voice characteristics: ```bash theme={null} python fish_speech/models/dac/inference.py \ -i "reference_audio.wav" \ --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" ``` This generates two files: * `fake.npy` - VQ tokens representing voice characteristics * `fake.wav` - Reconstructed audio for verification **Skip this step if you want random voice generation** - the model can generate speech without reference audio. 
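If you batch-encode many reference clips, the Step 1 invocation can be assembled in a script. A sketch (the `extract_vq_cmd` helper is hypothetical), assuming it runs from the fish-speech repository root:

```python theme={null}
def extract_vq_cmd(reference_wav,
                   checkpoint="checkpoints/openaudio-s1-mini/codec.pth"):
    """Build the argv for the reference-audio encoder invocation shown above."""
    return [
        "python", "fish_speech/models/dac/inference.py",
        "-i", reference_wav,
        "--checkpoint-path", checkpoint,
    ]

cmd = extract_vq_cmd("my_voice.wav")
# Run with: subprocess.run(cmd, check=True)
```

Because each run produces files named `fake.npy` and `fake.wav`, rename the outputs between clips to avoid overwriting them.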
### Step 2: Generate Semantic Tokens from Text Convert your text to semantic tokens using the language model: ```bash theme={null} python fish_speech/models/text2semantic/inference.py \ --text "The text you want to convert to speech" \ --prompt-text "Transcription of your reference audio" \ --prompt-tokens "fake.npy" \ --compile ``` **Parameters:** * `--text`: The text to synthesize * `--prompt-text`: Transcription of the reference audio (for voice cloning) * `--prompt-tokens`: Path to VQ tokens from Step 1 (for voice cloning) * `--compile`: Enable kernel fusion for faster inference (\~10x speedup on RTX 4090) For random voice generation, omit `--prompt-text` and `--prompt-tokens` parameters. This creates a file named `codes_N.npy` (where N starts from 0) containing semantic tokens. For GPUs that don't support bf16 (bfloat16), add the `--half` flag to use fp16 instead. ### Step 3: Generate Audio from Semantic Tokens Finally, convert semantic tokens to audio: ```bash theme={null} python fish_speech/models/dac/inference.py \ -i "codes_0.npy" ``` This generates the final audio file. ### Full Example Here's a complete workflow for voice cloning: ```bash theme={null} # 1. Encode reference audio python fish_speech/models/dac/inference.py \ -i "my_voice.wav" \ --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" # 2. Generate semantic tokens python fish_speech/models/text2semantic/inference.py \ --text "Hello, this is a test of voice cloning." \ --prompt-text "This is my reference voice recording." \ --prompt-tokens "fake.npy" \ --compile # 3. Generate final audio python fish_speech/models/dac/inference.py \ -i "codes_0.npy" ``` ## HTTP API Inference The HTTP API provides a programmatic interface for integrations and production deployments. 
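The request body for `/v1/tts` can also be assembled in Python. Field names here mirror the curl example in this guide; confirm the authoritative schema against the server's interactive `/docs` page:

```python theme={null}
import base64
import os
import tempfile

def tts_payload(text, reference_audio_path=None, reference_text=None):
    """Build a JSON-serializable body for the self-hosted /v1/tts endpoint.

    Field names follow the curl example in this guide; check the
    server's /docs page for the full schema.
    """
    payload = {"text": text}
    if reference_audio_path is not None:
        with open(reference_audio_path, "rb") as f:
            payload["reference_audio"] = base64.b64encode(f.read()).decode("ascii")
    if reference_text is not None:
        payload["reference_text"] = reference_text
    return payload

# Demo with a stand-in "audio" file:
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    f.write(b"fake-wav-bytes")
    path = f.name
payload = tts_payload("Hello, this is a test", path, "Reference transcription")
os.unlink(path)
```

POST the result with any HTTP client, e.g. `requests.post("http://localhost:8080/v1/tts", json=payload)`.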
### Start API Server ```bash theme={null} # With local installation python -m tools.api_server \ --listen 0.0.0.0:8080 \ --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \ --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \ --decoder-config-name modded_dac_vq # With UV uv run tools/api_server.py \ --listen 0.0.0.0:8080 \ --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \ --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \ --decoder-config-name modded_dac_vq ``` Add the `--compile` flag to enable torch.compile optimization for faster inference. ### Access API Documentation Once the server is running, access the interactive API documentation at: ``` http://localhost:8080/docs ``` The API provides endpoints for: * Text-to-speech synthesis * Voice cloning with reference audio * Batch processing * Model information ### Example API Request ```bash theme={null} curl -X POST "http://localhost:8080/v1/tts" \ -H "Content-Type: application/json" \ -d '{ "text": "Hello, this is a test", "reference_audio": "base64_encoded_audio", "reference_text": "Reference transcription" }' ``` ## WebUI Inference The WebUI provides an intuitive interface for interactive testing and development. ### Start WebUI ```bash theme={null} # With all parameters python -m tools.run_webui \ --llama-checkpoint-path "checkpoints/openaudio-s1-mini" \ --decoder-checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth" \ --decoder-config-name modded_dac_vq # Or use defaults (auto-detects models in checkpoints/) python -m tools.run_webui ``` Add the `--compile` flag for faster inference during interactive sessions. ### Access WebUI The WebUI starts on port 7860 by default. 
Access it at: ``` http://localhost:7860 ``` ### Configure with Environment Variables Customize the WebUI using Gradio environment variables: ```bash theme={null} # Enable public sharing GRADIO_SHARE=1 python -m tools.run_webui # Change server port GRADIO_SERVER_PORT=8080 python -m tools.run_webui # Change server name GRADIO_SERVER_NAME=0.0.0.0 python -m tools.run_webui ``` ### Using Reference Audio Library For faster workflow, pre-save reference audio: 1. Create a `references/` directory in the project root 2. Create subdirectories named by voice ID: `references/<voice_id>/` 3. Place files in each subdirectory: * `sample.wav` - Reference audio file * `sample.lab` - Text transcription of the audio Example structure: ``` references/ ├── alice/ │ ├── sample.wav │ └── sample.lab └── bob/ ├── sample.wav └── sample.lab ``` These references will appear as selectable options in the WebUI. ## GUI Inference For users who prefer a native desktop application, a PyQt6-based GUI is available. ### Download GUI Client Download the latest release from the [Fish Speech GUI repository](https://github.com/AnyaCoder/fish-speech-gui/releases). **Supported platforms:** * Linux * Windows * macOS ### Connect to API Server The GUI client connects to a running API server (see [HTTP API Inference](#http-api-inference) above). 1. Start the API server 2. Launch the GUI client 3.
Configure the API endpoint (default: `http://localhost:8080`) ## Docker Inference If you're using Docker deployment, refer to the [Docker Deployment guide](/developer-guide/self-hosting/docker-deployment) for detailed instructions on: * Running pre-built WebUI containers * Running pre-built API server containers * Customizing container configuration * Volume mounts for models and references Quick example: ```bash theme={null} # Start WebUI with Docker docker run -d \ --name fish-speech-webui \ --gpus all \ -p 7860:7860 \ -v ./checkpoints:/app/checkpoints \ -v ./references:/app/references \ -e COMPILE=1 \ fishaudio/fish-speech:latest-webui-cuda ``` ## Performance Optimization ### Enable Compilation Torch compilation provides \~10x speedup on compatible GPUs: ```bash theme={null} # Add --compile flag to any inference command python -m tools.api_server --compile ... ``` Compilation requires: * CUDA-compatible GPU * Triton library (not supported on Windows/macOS) * First run will be slow due to compilation overhead ### Use Mixed Precision For GPUs without bf16 support, use fp16: ```bash theme={null} python fish_speech/models/text2semantic/inference.py --half ... 
```

### Batch Processing

For multiple audio generations, use batch processing to amortize the model loading overhead:

```python
# Example batch processing script (illustrative API)
import fish_speech

model = fish_speech.load_model("checkpoints/openaudio-s1-mini")

texts = ["First sentence", "Second sentence", "Third sentence"]
for i, text in enumerate(texts):
    audio = model.synthesize(text)
    audio.save(f"output_{i}.wav")
```

## Emotion Control

Fish Audio S1 supports emotional markers for expressive speech synthesis:

### Basic Emotions

```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```

### Advanced Emotions

```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent) (impatient)
(guilty) (scornful) (panicked) (furious) (reluctant) (keen)
(disapproving) (negative) (denying) (astonished) (serious) (sarcastic)
(conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding)
(painful) (awkward) (amused)
```

### Tone Markers

```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```

### Special Effects

```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```

### Example Usage

```bash
python fish_speech/models/text2semantic/inference.py \
  --text "(excited)This is amazing! (laughing)Ha ha ha!" \
  --compile
```

Emotion control is currently supported for English, Chinese, and Japanese, with more languages coming soon. For more details, see the [Emotion Reference](/api-reference/emotion-reference).

## Troubleshooting

### Out of Memory Errors

If you encounter CUDA out-of-memory errors:

1. Reduce the input text length
2. Use the `--half` flag for fp16 inference
3. Close other GPU applications
4. Use a smaller batch size

### Slow Inference

To improve speed:

1.
Enable the `--compile` flag
2. Verify the GPU is being used (check with `nvidia-smi`)
3. Ensure your CUDA version matches your PyTorch installation
4. Use fp16 instead of bf16 on older GPUs

### Poor Audio Quality

For better quality:

1. Use high-quality reference audio (clear, with no background noise)
2. Ensure the reference text accurately matches the reference audio
3. Use 10-30 seconds of reference audio
4. See [Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)

### Model Loading Errors

If models fail to load:

1. Verify the model weights downloaded completely
2. Check that the checkpoint paths are correct
3. Ensure there is sufficient disk space
4. Re-download the weights if they are corrupted

## Next Steps

* **[Emotion Control Best Practices](/developer-guide/best-practices/emotion-control)** - Master expressive speech
* **[Voice Cloning Best Practices](/developer-guide/best-practices/voice-cloning)** - Optimize voice cloning quality
* **[API Reference](/api-reference/introduction)** - Integrate with your applications
* **[Cloud API](https://fish.audio)** - Compare with managed service performance

# Tutorials & Examples

Source: https://docs.fish.audio/developer-guide/tutorials/tutorials

Step-by-step guides and code examples for Fish Audio features

Coming soon! We're preparing comprehensive tutorials and examples to help you get the most out of Fish Audio.
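Until those tutorials are ready, here is a minimal sketch of a "first TTS application" that talks to a self-hosted API server like the one described in this guide. The endpoint path (`/v1/tts`), the payload field names, and the helper names (`build_payload`, `synthesize`) are assumptions for illustration — verify them against the API of your fish-speech version:

```python
import json
import urllib.request
from typing import Optional

# Assumed endpoint for a locally running fish-speech API server
# (see "HTTP API Inference" above); adjust host and port to your setup.
API_URL = "http://localhost:8080/v1/tts"


def build_payload(text: str, emotion: Optional[str] = None, fmt: str = "wav") -> dict:
    """Build a TTS request body, optionally prefixing an (emotion) marker."""
    if emotion:
        text = f"({emotion}){text}"
    return {"text": text, "format": fmt}


def synthesize(text: str, emotion: Optional[str] = None, out_path: str = "output.wav") -> None:
    """POST the request and save the returned audio bytes to out_path."""
    body = json.dumps(build_payload(text, emotion)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())


# Usage (requires a running API server):
#   synthesize("This is amazing!", emotion="excited")
```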
We're working on tutorials for:

* Building your first TTS application
* Creating custom voice models
* Implementing real-time streaming
* Building interactive voice applications
* Advanced emotion and prosody control
* Multi-speaker conversations

In the meantime, check out:

* [Quickstart Guide](/developer-guide/getting-started/quickstart) for getting started
* [Python SDK Examples](/developer-guide/sdk-guide/python/text-to-speech) for code samples
* [JavaScript SDK Examples](/developer-guide/sdk-guide/javascript/text-to-speech) for code samples
* [Guide and Best Practices](/developer-guide/core-features/text-to-speech) for optimization tips

Join our [Discord](https://discord.gg/dF9Db2Tt3Y) for updates and community examples.
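As a small taste of programmatic emotion control ahead of the dedicated tutorial, the markers listed in the Emotion Control section above are plain parenthesized prefixes, so they compose easily in code. The helper below is purely illustrative — `tag` and the marker subsets are not part of any Fish Audio SDK; the marker names are excerpted from this guide:

```python
# A few marker names excerpted from the Emotion Control section above.
EMOTIONS = {"angry", "sad", "excited", "surprised", "curious", "confident"}
TONES = {"shouting", "whispering", "soft tone", "in a hurry tone"}


def tag(text: str, *markers: str) -> str:
    """Prefix text with validated (marker) tags, e.g. '(excited)Hello!'."""
    for m in markers:
        if m not in EMOTIONS and m not in TONES:
            raise ValueError(f"unknown marker: {m!r}")
    return "".join(f"({m})" for m in markers) + text


print(tag("This is amazing!", "excited"))  # (excited)This is amazing!
```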