
Available Models

Fish Audio offers state-of-the-art text-to-speech models optimized for different use cases and performance requirements.

s2-pro

Fish Audio S2-Pro - Our next-generation TTS model with best-in-class performance
  • Natural language control with [bracket] syntax — not limited to a fixed set (e.g., [whispers sweetly], [laughing nervously])
  • Multi-speaker dialogue support
  • 80+ languages
  • 100ms time-to-first-audio
  • Full SGLang-based serving stack
  • Open-source
We recommend using s2-pro for all new projects to access the latest capabilities and performance improvements. S1 remains available for existing integrations.

Previous Model

s1

Fish Audio S1 - High-quality voice generation
  • 4 billion parameters
  • 0.008 WER (0.8% word error rate)
  • Full emotional control capabilities with (parenthesis) syntax

Model Specifications

Fish Audio S1 Performance Metrics

  • Word Error Rate (WER): 0.008 (0.8%)
  • Character Error Rate (CER): 0.004 (0.4%)
  • Real-time Factor: ~1:7 on standard hardware
  • TTS-Arena2 Ranking: #1 worldwide
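To make the error-rate figures concrete: WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference text and a transcript, divided by the number of reference words. A WER of 0.008 therefore corresponds to roughly 8 errors per 1,000 reference words. The source does not say how these figures were measured, so the sketch below only illustrates the standard definition. The function names are illustrative and not part of any Fish Audio API.

```python
def edit_distance(reference: list, hypothesis: list) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, comparing "the cat sat on the mat" with "the cat sat on a mat" gives one substitution over six reference words, a WER of about 0.167.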

Supported Languages

S2-Pro

S2-Pro supports 80+ languages and accepts inline emotion and paralinguistic cues.
Language detection is automatic: simply provide text in your target language.

S1

S1 supports text-to-speech generation in 13 languages with full emotional expression capabilities:
English, Chinese, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese

Voice Styles and Emotions

Fish Audio models support emotional expressions and voice styles that can be controlled through text markers in your input.

S2-Pro Natural Language Control

S2-Pro treats [bracket] tags as standard text rather than dedicated control tokens. Through training on massive datasets, the model learned implicit mappings between natural language descriptions and acoustic variations. This means you are not limited to a predefined set of tags — you can use any descriptive expression and the model will interpret it, such as [whispers sweetly] or [laughing nervously]. Common examples include:
[whisper] [laugh] [emphasis] [sigh] [gasp] [pause]
[angry] [excited] [sad] [surprised] [inhale] [exhale]
S2-Pro cues can be placed anywhere in your text to control emotion at specific positions. For example: "I can't believe it [gasp] you actually did it [laugh]"
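Because cues are ordinary text inside square brackets, a script can be inspected or assembled with plain string handling before it is sent for synthesis. The sketch below is a minimal illustration, not part of any Fish Audio SDK: the names `split_cues` and `add_cue` are hypothetical, and it assumes cues never contain nested brackets.

```python
import re

# Matches one [bracket] cue; the description inside is free-form text.
CUE_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_cues(script: str) -> list[tuple[str, str]]:
    """Split a script into ("text", ...) and ("cue", ...) segments, in order."""
    segments = []
    position = 0
    for match in CUE_PATTERN.finditer(script):
        spoken = script[position:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("cue", match.group(1).strip()))
        position = match.end()
    tail = script[position:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def add_cue(sentence: str, cue: str) -> str:
    """Append a free-form cue after a sentence."""
    return f"{sentence} [{cue}]"


if __name__ == "__main__":
    line = add_cue("I can't believe it", "gasp") + " " + add_cue("you actually did it", "laugh")
    for kind, value in split_cues(line):
        print(kind, "->", value)
```

Listing the segments this way is a quick check that each cue sits at the position where the change in delivery is wanted.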

S1 Voice Styles and Emotions

S1 supports 64+ emotional expressions using (parenthesis) syntax.

Basic Emotions (24 expressions)

(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)

Advanced Emotions (27 expressions)

(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)

Tone Markers (5 expressions)

(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)

Audio Effects (10 expressions)

(laughing) (chuckling) (sobbing) (crying loudly) (sighing)
(panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
You can also use natural expressions like “Ha,ha,ha” for laughter. Experiment with combinations to achieve the perfect emotional tone for your application.
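Since S1 works from a fixed list of markers, it can help to check a script for misspelled or unsupported markers before synthesis. The sketch below copies the markers listed above into a set and reports any parenthesized phrase that is not among them. The helper name `unknown_markers` is hypothetical, and how S1 itself handles an unlisted marker is not stated in this document.

```python
import re

# Markers copied from the S1 lists in this section.
S1_MARKERS = {
    # Basic emotions
    "angry", "sad", "excited", "surprised", "satisfied", "delighted",
    "scared", "worried", "upset", "nervous", "frustrated", "depressed",
    "empathetic", "embarrassed", "disgusted", "moved", "proud", "relaxed",
    "grateful", "confident", "interested", "curious", "confused", "joyful",
    # Advanced emotions
    "disdainful", "unhappy", "anxious", "hysterical", "indifferent",
    "impatient", "guilty", "scornful", "panicked", "furious", "reluctant",
    "keen", "disapproving", "negative", "denying", "astonished", "serious",
    "sarcastic", "conciliative", "comforting", "sincere", "sneering",
    "hesitating", "yielding", "painful", "awkward", "amused",
    # Tone markers
    "in a hurry tone", "shouting", "screaming", "whispering", "soft tone",
    # Audio effects
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing",
    "panting", "groaning", "crowd laughing", "background laughter",
    "audience laughing",
}

# Matches one (parenthesis) marker without nested parentheses.
MARKER_PATTERN = re.compile(r"\(([^()]+)\)")


def unknown_markers(text: str) -> list[str]:
    """Return parenthesized phrases that are not in the documented S1 list."""
    found = MARKER_PATTERN.findall(text)
    return [m for m in found if m.strip().lower() not in S1_MARKERS]


if __name__ == "__main__":
    print(unknown_markers("(excited) We won! (laughing)"))
    print(unknown_markers("(sleepy) Good night (whispering)"))
```

Note that ordinary parenthetical prose, such as "(see above)", is also reported by this check, so the result is a list of candidates to review rather than a list of definite errors.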
