Fish Audio S2
Next-generation text-to-speech model with inline emotion cues, multi-speaker dialogue support, and 80+ languages.

S2 introduces [bracket] syntax for natural language control over emotion and paralinguistic cues (e.g., [whisper], [laugh], [emphasis]). Tags are treated as standard text rather than dedicated control tokens, so you are not limited to a fixed set of expressions. Built on the Qwen3-4B backbone and fully open-source.

Use model ID s2-pro in the API. S1 remains supported for existing integrations.

GitHub | HuggingFace
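Because the bracket cues are ordinary text, a script can be assembled with plain string handling. The following is a minimal sketch, not the official client: the helper names `with_cue` and `build_request`, the placement of each cue in front of its sentence, and the payload field names `model` and `text` are all assumptions made for illustration. Only the bracket syntax and the model ID s2-pro come from the description above; check the API reference for the actual request schema.

```python
import json


def with_cue(cue: str, text: str) -> str:
    """Prefix text with an inline bracket cue such as [whisper].

    Hypothetical helper: placing the cue before the sentence is an assumption.
    """
    cue = cue.strip()
    if not cue or "[" in cue or "]" in cue:
        raise ValueError("cue must be non-empty and must not contain brackets")
    return f"[{cue}] {text}"


def build_request(text: str, model: str = "s2-pro") -> dict:
    """Build an illustrative request body.

    The field names here are assumed, not taken from the real API schema.
    """
    return {"model": model, "text": text}


# Cues are free-form text, so any descriptive phrase can be used.
script = " ".join([
    with_cue("whisper", "I have a secret."),
    with_cue("laugh", "Just kidding!"),
])
payload = build_request(script)
print(json.dumps(payload, indent=2))
```

Since tags are not drawn from a fixed vocabulary, the helper validates only the bracket structure and leaves the cue wording open.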



