Available Models
Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements.
Current Model
s1
OpenAudio S1 - Our flagship model with industry-leading quality
- 4 billion parameters
- 0.008 WER (0.8% word error rate)
- Best performance and naturalness
- Full emotional control capabilities
Legacy Models
speech-1.6
Fish Speech 1.6 - Previous generation model
- Stable and production-tested
- Good performance for standard use cases
- Basic emotional control
speech-1.5
Fish Speech 1.5 - Earlier generation model
- Reliable performance
- Limited to basic emotions
- Lower resource requirements
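The model identifiers listed above can be kept in a small lookup so that client code defaults to the current model and can tell when a legacy one is in use. This is only a sketch: the `MODELS` table, `DEFAULT_MODEL`, and `resolve_model` are hypothetical names chosen for illustration, not part of any Fish Audio SDK. Only the three identifiers and their display names come from this page.

```python
# Hypothetical helper. Only the identifiers "s1", "speech-1.6", and
# "speech-1.5" (and their display names) are taken from the model list.
MODELS = {
    "s1": {"name": "OpenAudio S1", "legacy": False},
    "speech-1.6": {"name": "Fish Speech 1.6", "legacy": True},
    "speech-1.5": {"name": "Fish Speech 1.5", "legacy": True},
}

DEFAULT_MODEL = "s1"


def resolve_model(requested=None):
    """Return (model_id, is_legacy), falling back to the current model."""
    model_id = requested or DEFAULT_MODEL
    if model_id not in MODELS:
        raise ValueError(f"unknown model: {model_id!r}")
    return model_id, MODELS[model_id]["legacy"]
```

An existing integration could call `resolve_model("speech-1.6")` and log a notice when the second element of the result is `True`.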
We recommend using `s1` for all new projects to access the latest capabilities and performance improvements. Legacy models remain available for existing integrations.
Model Specifications
OpenAudio S1
Specification | Value |
---|---|
Parameters | 4B |
Context Length | 8,192 tokens |
Languages | 30+ languages |
Emotions | 48+ expressions |
Latency | ~200ms |
Max Audio Length | 30 minutes |
Streaming | ✅ Yes |
Performance Metrics
- Word Error Rate (WER): 0.008 (0.8%)
- Character Error Rate (CER): 0.004 (0.4%)
- Real-time Factor: ~1:7 on standard hardware
- TTS-Arena2 Ranking: #1 worldwide
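For readers unfamiliar with these metrics, WER and CER are conventionally defined as the edit distance between a reference transcript and a hypothesis transcript, divided by the reference length in words or characters. The sketch below implements that standard definition. It is not Fish Audio's evaluation pipeline, whose test set and text normalization are not described on this page, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a WER of 0.008 corresponds to roughly 8 word errors per 1,000 reference words.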
Supported Languages
Fish Audio models support text-to-speech generation in 30+ languages with full emotional expression capabilities.
Primary Languages
Additional Languages
Language detection is automatic; simply provide text in your target language. You can also specify the language explicitly using the `language` parameter in your API request.
Voice Styles and Emotions
OpenAudio models support 64+ emotional expressions and voice styles that can be controlled through text markers in your input.
Basic Emotions (24 expressions)
Advanced Emotions (25 expressions)
Tone Markers (5 expressions)
Audio Effects (10 expressions)
You can also use natural expressions like “Ha,ha,ha” for laughter. Experiment with combinations to achieve the perfect emotional tone for your application.
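As a rough illustration of how the options on this page might fit together, the sketch below assembles a request body with a model identifier, an optional explicit `language`, and an emotion marker placed in the text. Several details are assumptions to verify against the API reference: the parenthesized marker syntax `(happy)`, the marker name itself, the `model` and `text` field names, and the `build_tts_payload` helper are all hypothetical. Only the `language` parameter and the `s1` identifier are named on this page. No request is sent.

```python
import json


def build_tts_payload(text, model="s1", language=None, emotion=None):
    """Assemble a JSON-serializable request body (illustrative field names)."""
    if not text.strip():
        raise ValueError("text must not be empty")
    # Assumed marker syntax: a parenthesized tag placed before the text.
    marked_text = f"({emotion}) {text}" if emotion else text
    payload = {"model": model, "text": marked_text}
    # Leave `language` out to rely on automatic language detection.
    if language is not None:
        payload["language"] = language
    return payload


if __name__ == "__main__":
    body = build_tts_payload("Great to see you!", language="en", emotion="happy")
    print(json.dumps(body, ensure_ascii=False))
```

Keeping the marker logic in one helper makes it easy to swap in the documented marker syntax once it has been confirmed.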