Introduction

Welcome to the advanced AI voice technology provided by Fish Audio. We believe that “how you say it” is just as important as “what you say.” To ensure that every synthesized voice is not only “realistic” but also “emotionally resonant,” we have introduced a powerful real-time emotion and tone control tag system.

This system is a core component of Fish Audio S1. It lets you precisely inject emotion into a voice and control its speech rate and tone. This guide explains how to use these tags, the rules that govern them, and best practices, to help you turn your ideas into expressive voice work.

1. Core Usage: Tag Syntax

All control tags must be enclosed in parentheses (). This syntax is universal.

Basic Format: (tag)Text to be read

Scope: A tag affects all subsequent text until a new tag is encountered. Placement rules for English tags are stricter than those for other languages, as detailed below.
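The scope rule can be made concrete with a small parser. The Python sketch below splits tagged text into segments and pairs each segment with the tag group that governs it.

It rests on these assumptions:
- `parse_segments` is a hypothetical helper for illustration, not part of any Fish Audio API.
- Tags never nest.
- Consecutive tags such as (sad)(soft tone) form one group.
- A new tag group replaces the previous one, following the scope rule literally.

```python
import re

# A tag is any text in parentheses, e.g. "(angry)" or "(soft tone)".
TAG_PATTERN = re.compile(r"\(([^()]+)\)")


def parse_segments(text: str) -> list[tuple[tuple[str, ...], str]]:
    """Split tagged text into (tags, text) pairs (hypothetical helper).

    Each group of consecutive tags governs the text that follows it, up to the
    next tag. Text before the first tag is paired with an empty tuple.
    """
    # With a capturing group, re.split alternates: text, tag, text, tag, ...
    parts = TAG_PATTERN.split(text)
    segments: list[tuple[tuple[str, ...], str]] = []
    pending: list[str] = []  # tags seen since the last piece of text
    for index, part in enumerate(parts):
        if index % 2 == 1:
            pending.append(part.strip())  # odd index: a tag name
        elif part.strip():
            segments.append((tuple(pending), part.strip()))
            pending = []  # the next tag group starts a new scope
    if pending:
        segments.append((tuple(pending), ""))  # trailing tags with no text
    return segments


print(parse_segments("(angry)How dare you betray me! (shouting) I trusted you so much."))
# [(('angry',), 'How dare you betray me!'), (('shouting',), 'I trusted you so much.')]
```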

2. Tag Categories & Rules

Tags are divided into three main categories: Emotion Tags, Tone Control Tags, and Paralinguistic Tags.

2.1 Emotion Tags

Emotion tags set the emotional tone for a sentence or phrase.

Rule: In English, emotion tags MUST be placed at the very beginning of a sentence, which makes them less flexible than in other languages.

Examples:

  • Correct usage (sentence-initial): (angry)How could you repay me like this?
  • Incorrect usage (mid-sentence): I trusted you so much, (angry)how could you repay me like this?
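The sentence-initial rule can be checked mechanically before text is sent for synthesis. The sketch below is a hypothetical helper, not a Fish Audio feature. It assumes the following:
- ".", "!" and "?" mark sentence boundaries.
- Only a few emotion tags from the complete list below are included, for brevity.

```python
import re
from collections.abc import Container

# Sample of emotion tags only; extend with the complete English list as needed.
EMOTION_TAGS = frozenset({"angry", "sad", "excited", "surprised", "nervous", "sarcastic"})

TAG_PATTERN = re.compile(r"\(([^()]+)\)")


def misplaced_emotion_tags(text: str, emotion_tags: Container[str] = EMOTION_TAGS) -> list[str]:
    """Return emotion tags that do not start a sentence (hypothetical check).

    A tag counts as sentence-initial if, once earlier tags are removed, nothing
    but whitespace precedes it since the start of the text or the last '.', '!' or '?'.
    """
    misplaced = []
    for match in TAG_PATTERN.finditer(text):
        name = match.group(1).strip()
        if name not in emotion_tags:
            continue  # tone and special markers may appear anywhere
        before = TAG_PATTERN.sub("", text[: match.start()]).rstrip()
        if before and before[-1] not in ".!?":
            misplaced.append(name)
    return misplaced
```

Applied to the two examples above, the correct sentence returns an empty list. The incorrect one returns ["angry"].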

Complete English Tag List:

1. Emotion Markers (must be placed at the beginning of a sentence, and only there):

(angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)

2. Tone Markers (can be at any position):

(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)

3. Special Markers (can be at any position):

(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)
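Scripts that generate tagged text may want these lists in machine-readable form. The sketch below copies the three English lists above into Python sets and classifies a tag by category. `categorize` and `TagCategory` are hypothetical names chosen for illustration. The sketch also assumes that tag names are matched exactly as written in the lists, with no case folding.

```python
from enum import Enum

# Tag names copied from the complete English tag list above.
# Every emotion tag is a single word, so splitting on whitespace is safe.
EMOTION_TAGS = frozenset("""
    angry sad disdainful excited surprised satisfied unhappy anxious hysterical
    delighted scared worried indifferent upset impatient nervous guilty scornful
    frustrated depressed panicked furious empathetic embarrassed reluctant
    disgusted keen moved proud relaxed grateful confident interested curious
    confused joyful disapproving negative denying astonished serious sarcastic
    conciliative comforting sincere sneering hesitating yielding painful awkward
    amused
""".split())

TONE_TAGS = frozenset({"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"})

SPECIAL_TAGS = frozenset({
    "laughing", "chuckling", "sobbing", "crying loudly", "sighing", "panting",
    "groaning", "crowd laughing", "background laughter", "audience laughing",
})


class TagCategory(Enum):
    EMOTION = "emotion"  # English: sentence-initial only
    TONE = "tone"        # any position
    SPECIAL = "special"  # paralinguistic sound; any position
    UNKNOWN = "unknown"  # not in the official list; may be read aloud


def categorize(tag: str) -> TagCategory:
    """Classify a tag written as "(angry)" or "angry" (hypothetical helper)."""
    name = tag.strip().strip("()").strip()
    if name in EMOTION_TAGS:
        return TagCategory.EMOTION
    if name in TONE_TAGS:
        return TagCategory.TONE
    if name in SPECIAL_TAGS:
        return TagCategory.SPECIAL
    return TagCategory.UNKNOWN
```

For example, categorize("(whispering)") returns TagCategory.TONE. A descriptive tag such as "(in a sad and quiet voice)" falls into UNKNOWN.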

2.2 Tone Control Tags

These tags can be placed anywhere in a sentence to modify vocal delivery.

  • (in a hurry tone): Used to create a tense or urgent atmosphere.

    • Example: Go now! The door is closing, (in a hurry tone) we don't have much time!
  • (shouting): Used to simulate loud yelling or for strong emphasis.

    • Example: (shouting)Hey! Can anyone hear me?
  • (screaming): Used for intense shouting or expressing extreme emotions.

    • Example: Help me! (screaming) Someone please help!
  • (whispering): Used to simulate quiet, secretive speech.

    • Example: Come closer, (whispering) I have a secret to tell you.
  • (soft tone): Used for gentle, quiet delivery.

    • Example: He leaned close to my ear, (soft tone) quietly telling me a secret.

2.3 Special Markers (Paralinguistic Tags)

These tags simulate non-verbal sounds and can be placed at any position. Some MUST be followed by corresponding onomatopoeia.

  • (laughing): Used to express hearty laughter.

    • Example: When he heard the punchline, he couldn't help it, (laughing) Ha,ha,ha!
  • (chuckling): Used for quiet, subdued laughter.

    • Example: That's quite amusing, (chuckling) Hmm,hmm.
  • (sobbing): Used to express sad weeping.

    • Example: She covered her face, (sobbing) and couldn't speak another word.
  • (crying loudly): Used for intense crying or wailing.

    • Example: The child was inconsolable, (crying loudly) waah waah!
  • (sighing): Used to express disappointment, helplessness, or fatigue.

    • Example: How did things turn out this way... (sighing) sigh.
  • (panting): Used to simulate heavy breathing from exertion.

    • Example: After running up the stairs, (panting) he could barely speak.
  • (groaning): Used to express pain, frustration, or annoyance.

    • Example: When he saw the mess, (groaning) he shook his head.
  • (crowd laughing): Used to simulate multiple people laughing.

    • Example: The comedian's joke had everyone (crowd laughing) in stitches.
  • (background laughter): Used for ambient laughter sounds.

    • Example: Despite the serious topic, (background laughter) the audience couldn't help laughing.
  • (audience laughing): Used specifically for audience reaction.

    • Example: The performer paused as (audience laughing) laughter filled the theater.
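The guide says that some special markers must be followed by onomatopoeia, but it does not list which ones. The sketch below pairs a few markers with the sounds used in the examples above. `with_sound` and `EXAMPLE_SOUNDS` are hypothetical, and the mapping is illustrative rather than an official list.

```python
from typing import Optional

# Onomatopoeia taken from the examples above; other sounds may work as well.
EXAMPLE_SOUNDS = {
    "laughing": "Ha,ha,ha!",
    "chuckling": "Hmm,hmm.",
    "crying loudly": "waah waah!",
    "sighing": "sigh.",
}


def with_sound(tag: str, sound: Optional[str] = None) -> str:
    """Return a special marker followed by its onomatopoeia (hypothetical helper)."""
    if sound is None:
        sound = EXAMPLE_SOUNDS.get(tag)
    if sound is None:
        raise KeyError(f"no example onomatopoeia for ({tag}); pass one explicitly")
    return f"({tag}) {sound}"


line = "That's quite amusing, " + with_sound("chuckling")
# line == "That's quite amusing, (chuckling) Hmm,hmm."
```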

3. Advanced Usage & Combined Examples

Combine different tags to create layered and dynamic vocal effects.

English Example (demonstrating tag combination):

(angry)How dare you betray me! (shouting) I trusted you so much, how could you repay me like this?
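When scripts are generated programmatically, layered tags can be assembled from (tags, text) pairs. This is a minimal sketch built on the following assumptions:
- `compose` is a hypothetical helper.
- Each pair starts a new sentence, so any emotion tag it carries stays sentence-initial.
- The source uses both "(tag)Text" and "(tag) Text" spacing; this sketch writes tags directly against the text.

```python
from collections.abc import Sequence


def compose(segments: Sequence[tuple[Sequence[str], str]]) -> str:
    """Join (tags, text) pairs into one tagged script (hypothetical helper).

    Each pair becomes its tags written back-to-back, e.g. "(sad)(soft tone)",
    followed immediately by its text; pairs are separated by a single space.
    """
    pieces = []
    for tags, text in segments:
        prefix = "".join(f"({tag})" for tag in tags)
        pieces.append(prefix + text)
    return " ".join(pieces)


script = compose([
    (["angry"], "How dare you betray me!"),
    (["shouting"], "I trusted you so much, how could you repay me like this?"),
])
# script == "(angry)How dare you betray me! (shouting)I trusted you so much, how could you repay me like this?"
```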

4. Important Notes & Best Practices

  1. Follow Rules Strictly: English placement rules are less flexible than those for other languages, so always place emotion tags at the beginning of a sentence or emotional unit; this usually yields the clearest effect.

  2. Prioritize Standard Tags: The official tags listed above have the highest accuracy rates.

  3. Use Descriptive Tags with Caution: Avoid creating tags like (in a sad and quiet voice). The model will likely read it aloud instead of executing the command. Use a combination of standard tags instead, such as (sad)(soft tone).

  4. Avoid Tag Overuse: Too many tags in a short sentence may interfere with the model. Use them purposefully.

  5. Be Aware of Known Issues: The pronunciation of some onomatopoeia (especially laughter or crying) may occasionally sound unnatural. This is a known issue we are actively working to improve.
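Best practices 2 to 4 can be approximated with a simple pre-flight check before text is sent for synthesis. This is a sketch under stated assumptions:
- `lint_script` is a hypothetical helper.
- `KNOWN_TAGS` holds only a sample of the official tags.
- The default limit of 0.5 tags per spoken word is an arbitrary choice, not a documented model limit.

```python
import re
from collections.abc import Container

TAG_PATTERN = re.compile(r"\(([^()]+)\)")

# Hypothetical sample of official tags; in practice, use the complete list from section 2.
KNOWN_TAGS = frozenset({"angry", "sad", "shouting", "soft tone", "whispering", "laughing"})


def lint_script(text: str, known: Container[str] = KNOWN_TAGS,
                max_tags_per_word: float = 0.5) -> list[str]:
    """Return warnings for non-standard tags and tag overuse (hypothetical check)."""
    warnings = []
    tags = [name.strip() for name in TAG_PATTERN.findall(text)]
    for name in tags:
        if name not in known:
            # Descriptive, non-standard tags risk being read aloud.
            warnings.append(f"non-standard tag ({name}) may be read aloud")
    words = TAG_PATTERN.sub(" ", text).split()  # spoken words, tags excluded
    if tags and len(tags) / max(len(words), 1) > max_tags_per_word:
        warnings.append(f"{len(tags)} tag(s) for {len(words)} word(s); consider using fewer")
    return warnings
```

With these defaults, "(sad)(soft tone)I lost it again today." passes cleanly. "(in a sad and quiet voice)I lost it." is flagged as non-standard.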