Available Models
Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements.
Recommended Model
s2.1-pro
Fish Audio S2.1-Pro - Our recommended production TTS model and an improved version of S2-Pro
- Natural language control with [bracket] syntax, not limited to a fixed set (e.g., [whispers sweetly], [laughing nervously])
- Multi-speaker dialogue support
- 83 languages
- Improved quality, latency, and throughput over S2-Pro
- Production option for workloads that need TTFA and DPA guarantees
Free Development Model
s2.1-pro-free
Fish Audio S2.1-Pro Free - The same model as S2.1-Pro, available at $0 for development and testing
- Use the s2.1-pro-free model string with the same TTS API endpoint
- Same model quality and language coverage as s2.1-pro
- Free to use under fair-use limits
- No TTFA or DPA guarantees
- Best for testing, prototyping, development, and smaller businesses
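As a minimal sketch of switching between the free and production models, the snippet below builds (but does not send) two requests that differ only in the model string. The endpoint URL, the `model` header, the bearer-token header, and the JSON body shape are assumptions made for illustration and are not stated on this page; check the API Reference for the actual request format.

```python
import json
import urllib.request

# ASSUMPTION: endpoint URL, header names, and body fields are illustrative only.
TTS_URL = "https://api.fish.audio/v1/tts"


def build_tts_request(text: str, model: str, api_key: str) -> urllib.request.Request:
    """Build a TTS request object without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
        "model": model,  # assumed: model string passed as a header
    }
    return urllib.request.Request(TTS_URL, data=body, headers=headers, method="POST")


# Same endpoint and payload; only the model string changes.
dev_request = build_tts_request("Hello from Fish Audio.", "s2.1-pro-free", "YOUR_API_KEY")
prod_request = build_tts_request("Hello from Fish Audio.", "s2.1-pro", "YOUR_API_KEY")
```

Keeping the model string as a single parameter means moving from development to production is a one-value configuration change.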
Previous S2 Model
s2-pro
Fish Audio S2-Pro - Previous-generation S2 TTS model
- Natural language control with [bracket] syntax, not limited to a fixed set (e.g., [whispers sweetly], [laughing nervously])
- Multi-speaker dialogue support
- 80+ languages
- 100ms time-to-first-audio
- Full SGLang-based serving stack
- Open-source
We recommend s2.1-pro for production projects. Use s2.1-pro-free when you want the same model for evaluation, prototyping, development, or smaller-business workloads that do not need TTFA or DPA guarantees. S1 remains available for existing integrations.
Previous Model
s1
Fish Audio S1 - High-quality voice generation
- 4 billion parameters
- 0.008 WER (0.8% word error rate)
- Full emotional control capabilities with (parenthesis) syntax
Model Specifications
Fish Audio S1 Performance Metrics
- Word Error Rate (WER): 0.008 (0.8%)
- Character Error Rate (CER): 0.004 (0.4%)
- Real-time Factor: ~1:7 on standard hardware
- TTS-Arena2 Ranking: #1 worldwide
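To make the error-rate figures concrete: WER is conventionally computed as the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The sketch below uses made-up counts chosen to reproduce a 0.008 rate; they are not Fish Audio's evaluation data.

```python
def word_error_rate(substitutions: int, deletions: int, insertions: int,
                    reference_words: int) -> float:
    """Standard WER definition: (S + D + I) / N."""
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return (substitutions + deletions + insertions) / reference_words


# Hypothetical counts: 8 word errors in a 1,000-word reference transcript.
example_wer = word_error_rate(substitutions=5, deletions=2, insertions=1,
                              reference_words=1000)
example_wer_percent = example_wer * 100  # 0.008 expressed as a percentage
```

Character Error Rate follows the same formula with characters in place of words.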
Supported Languages
S2.1-Pro and S2-Pro
S2.1-Pro supports 83 languages, while S2-Pro supports 80+ languages. Both support inline emotion and paralinguistic cues. Language detection is automatic: simply provide text in your target language.
S1
S1 supports text-to-speech generation in 13 languages with full emotional expression capabilities.
Voice Styles and Emotions
Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.
S2.1-Pro and S2-Pro Natural Language Control
S2.1-Pro and S2-Pro treat [bracket] tags as standard text rather than dedicated control tokens. Through training on massive datasets, the models learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags: you can use any descriptive expression, such as [whispers sweetly] or [laughing nervously], and the model will interpret it.
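Because the tags are ordinary text, adding one is plain string formatting. The helper below is a hypothetical convenience function, not part of any Fish Audio SDK, and placing the cue directly before the text it should affect is an assumption for illustration.

```python
def with_cue(cue: str, text: str) -> str:
    """Prefix text with a free-form [bracket] cue (hypothetical helper)."""
    # Accept the cue with or without surrounding brackets.
    cleaned = cue.strip().strip("[]").strip()
    return f"[{cleaned}] {text}"


# The two cues named in the documentation above.
lines = [
    with_cue("whispers sweetly", "Good night, see you tomorrow."),
    with_cue("[laughing nervously]", "I did not expect that."),
]
tts_input = " ".join(lines)
```

The resulting `tts_input` string is what would be sent as the text of a TTS request.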
Common examples include:
S1 Voice Styles and Emotions
S1 supports 64+ emotional expressions using (parenthesis) syntax.
Basic Emotions (24 expressions)
Advanced Emotions (25 expressions)
Tone Markers (5 expressions)
Audio Effects (10 expressions)
Support
Need help? Check out these resources:
- API Reference - Complete API documentation
- Create a Voice Clone - Create a voice clone model
- Generate Speech - Generate realistic speech
- Real-time Streaming - WebSocket for real-time streaming
- Discord Community - Get help from the community
- Support Email - Contact our support team

