Available Models
Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements.
Recommended Model
s2-pro
Fish Audio S2-Pro - Our next-generation TTS model with best-in-class performance
- Natural language control with [bracket] syntax, not limited to a fixed set (e.g., [whispers sweetly], [laughing nervously])
- Multi-speaker dialogue support
- 80+ languages
- 100ms time-to-first-audio
- Full SGLang-based serving stack
- Open-source
We recommend using s2-pro for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations.
Previous Model
s1
Fish Audio S1 - High-quality voice generation
- 4 billion parameters
- 0.008 WER (0.8% word error rate)
- Full emotional control capabilities with (parenthesis) syntax
Model Specifications
Fish Audio S1 Performance Metrics
- Word Error Rate (WER): 0.008 (0.8%)
- Character Error Rate (CER): 0.004 (0.4%)
- Real-time Factor: ~1:7 on standard hardware
- TTS-Arena2 Ranking: #1 worldwide
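To make the error-rate figures concrete: a WER of 0.008 corresponds to roughly 8 word-level errors per 1,000 reference words, and a CER of 0.004 to roughly 4 character-level errors per 1,000 reference characters. The sketch below shows the conventional edit-distance definition of both metrics. It is an illustration only; this page does not describe the evaluation pipeline (transcription model, text normalization, test set) behind the published numbers, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
# One substituted character out of five reference characters.
cer = char_error_rate("hello", "hallo")
```

In practice the hypothesis would be a transcript of the generated audio, and both strings are usually normalized (case, punctuation) before scoring.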
Supported Languages
S2-Pro
S2-Pro supports 80+ languages, with inline emotion and paralinguistic cue support. Language detection is automatic: simply provide text in your target language.
S1
S1 supports text-to-speech generation in 13 languages with full emotional expression capabilities.
Voice Styles and Emotions
Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.
S2-Pro Natural Language Control
S2-Pro treats [bracket] tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags: you can use any descriptive expression, such as [whispers sweetly] or [laughing nervously], and the model will interpret it.
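Because the tags are ordinary text, input for S2-Pro can be assembled with plain string handling. The following is a minimal sketch, assuming each cue is placed directly before the text it should affect; that placement and the helper names cue and build_text are illustrative assumptions, not part of any Fish Audio SDK.

```python
from typing import Iterable, Optional, Tuple


def cue(description: str) -> str:
    """Wrap a free-form description in S2-Pro style [bracket] syntax."""
    description = description.strip()
    if not description or "[" in description or "]" in description:
        raise ValueError("cue must be non-empty and contain no brackets")
    return f"[{description}]"


def build_text(parts: Iterable[Tuple[Optional[str], str]]) -> str:
    """Join (cue description or None, spoken text) pairs into one input string."""
    pieces = []
    for description, text in parts:
        if description:
            pieces.append(cue(description))
        pieces.append(text.strip())
    return " ".join(pieces)


text = build_text([
    ("whispers sweetly", "Good night."),
    (None, "See you tomorrow."),
    ("laughing nervously", "I hope."),
])
# text == "[whispers sweetly] Good night. See you tomorrow. [laughing nervously] I hope."
```

The resulting string is what would be sent as the text field of a speech request; how that request is made is covered by the API Reference rather than this sketch.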
Common examples include:
S1 Voice Styles and Emotions
S1 supports 64+ emotional expressions using (parenthesis) syntax.
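Since the two models use different delimiters, it can help to check which markers an input string contains before sending it to a given model. The sketch below is one way to do that with regular expressions. The model identifiers s1 and s2-pro come from this page; the function name extract_markers and the (excited) marker in the example are hypothetical placeholders, not confirmed entries from the S1 expression list.

```python
import re

# Delimiter patterns per model: (parenthesis) for S1, [bracket] for S2-Pro.
MARKER_PATTERNS = {
    "s1": re.compile(r"\(([^()]+)\)"),
    "s2-pro": re.compile(r"\[([^\[\]]+)\]"),
}


def extract_markers(text: str, model: str) -> list:
    """Return the marker contents found in text for the given model."""
    try:
        pattern = MARKER_PATTERNS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None
    return [match.strip() for match in pattern.findall(text)]


# "(excited)" is a placeholder marker used only for illustration.
s1_markers = extract_markers("(excited) We won the match!", "s1")
s2_markers = extract_markers("[whispers sweetly] Hi (not a tag)", "s2-pro")
# s1_markers == ["excited"]
# s2_markers == ["whispers sweetly"]
```

Note that the S1 pattern matches any parenthesized text, including ordinary parenthetical remarks, so the result is a list of candidates that still has to be compared against the supported expressions.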
Basic Emotions (24 expressions)
Advanced Emotions (25 expressions)
Tone Markers (5 expressions)
Audio Effects (10 expressions)
Support
Need help? Check out these resources:
- API Reference - Complete API documentation
- Create a Voice Clone - Create a voice clone model
- Generate Speech - Generate realistic speech
- Real-time Streaming - WebSocket for real-time streaming
- Discord Community - Get help from the community
- Support Email - Contact our support team




