Voice Cloning Best Practices
Tips for optimal audio samples in voice cloning.
Audio Quality Guidelines
Single speaker only
Steady volume, tone, and emotion
Brief pauses (0.5s recommended)
Ideally: No background noise
Ideally: Professional recording quality
Ideally: No room echo
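One way to get a rough read on the "steady volume" guideline is to compare loudness across short windows of a recording. The sketch below is only an illustration, not a Fish Audio tool: the function names, the window size, and the -50 dBFS silence floor are all assumptions, and it expects samples already decoded to floats between -1.0 and 1.0.

```python
import math


def window_rms_db(samples, window_size):
    """Return the RMS level (dBFS) of each full window of `samples`.

    `samples` are floats in [-1.0, 1.0]. Fully silent windows are
    reported as negative infinity.
    """
    levels = []
    for start in range(0, len(samples) - window_size + 1, window_size):
        window = samples[start:start + window_size]
        rms = math.sqrt(sum(s * s for s in window) / window_size)
        levels.append(20 * math.log10(rms) if rms > 0 else float("-inf"))
    return levels


def volume_spread_db(samples, window_size, silence_floor_db=-50.0):
    """Return max minus min window level, skipping near-silent windows.

    The silence floor (an assumed value) keeps deliberate pauses from
    being counted as volume swings.
    """
    voiced = [
        level
        for level in window_rms_db(samples, window_size)
        if level > silence_floor_db
    ]
    if not voiced:
        return 0.0
    return max(voiced) - min(voiced)
```

A small spread suggests the speaker held a consistent level; what counts as "small" is a judgment call that this page does not specify.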
Instant Voice Cloning (Playground)
30-45 seconds of quality audio
Best: 2-3 clips of 15-20 seconds each that together form a complete paragraph
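The duration guidance above can be turned into a quick pre-upload check. This is a minimal sketch that assumes you already know each clip's length in seconds; `check_instant_clone_clips` is a hypothetical helper, not part of any Fish Audio SDK, and the wording of its warnings is illustrative.

```python
def check_instant_clone_clips(clip_seconds):
    """Return a list of warnings for a set of clip durations in seconds.

    An empty list means the clips match the guidance on this page:
    30-45 seconds in total, ideally as 2-3 clips of 15-20 seconds each.
    """
    warnings = []

    total = sum(clip_seconds)
    if not 30 <= total <= 45:
        warnings.append(f"total duration {total:.1f}s is outside 30-45s")

    if not 2 <= len(clip_seconds) <= 3:
        warnings.append(f"{len(clip_seconds)} clip(s) supplied; 2-3 is suggested")

    for index, seconds in enumerate(clip_seconds, start=1):
        if not 15 <= seconds <= 20:
            warnings.append(f"clip {index} is {seconds:.1f}s; 15-20s is suggested")

    return warnings
```

Note that three 20-second clips add up to 60 seconds, which exceeds the 45-second total, so three clips only fit the guidance when each is close to 15 seconds.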
Premium Voice Cloning (Let’s Talk)
30-180 minutes of high-quality audio
Optional: Multiple languages and emotions
File Formats
Various audio types accepted
Recommended: MP3 at 192 kbps or higher to avoid quality loss
Uncompressed formats (e.g., WAV) offer minimal benefit
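If a recording needs to be converted to the recommended format, one common route is ffmpeg with the LAME encoder. The sketch below only builds the command; it assumes ffmpeg is installed with `libmp3lame` support, and `mp3_encode_command` is a hypothetical helper name rather than anything provided by Fish Audio.

```python
from pathlib import Path


def mp3_encode_command(source, bitrate_kbps=192):
    """Build an ffmpeg command that encodes `source` to MP3.

    The output file sits next to the source with an .mp3 suffix.
    Bitrates below 192 kbps are rejected to follow the recommendation.
    """
    if bitrate_kbps < 192:
        raise ValueError("use 192 kbps or higher to avoid quality loss")

    source = Path(source)
    if source.suffix.lower() == ".mp3":
        # Avoid overwriting the input with itself.
        raise ValueError("source is already an MP3 file")

    target = source.with_suffix(".mp3")
    return [
        "ffmpeg",
        "-i", str(source),
        "-codec:a", "libmp3lame",   # MP3 encoder
        "-b:a", f"{bitrate_kbps}k",  # constant audio bitrate
        str(target),
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`. Re-encoding an already compressed file will not restore lost quality, so it is best to start from the original recording.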