Overview

Real-time streaming lets you generate speech as you type or speak, perfect for chatbots, virtual assistants, and live applications.

When to Use Streaming

Perfect for:
  • Live chat applications
  • Virtual assistants
  • Interactive storytelling
  • Real-time translations
  • Gaming dialogue
Not ideal for:
  • Pre-recorded content
  • Batch processing
  • When perfect quality is critical

Getting Started

Web Playground

Try real-time streaming instantly:
  1. Visit fish.audio
  2. Enable “Streaming Mode”
  3. Start typing and hear voice generation in real-time

Using the SDK

Stream text as it’s being written:
from fish_audio_sdk import WebSocketSession, TTSRequest

# Initialize WebSocket session
session = WebSocketSession("your_api_key")

# Stream text word by word
def stream_text():
    text = "Hello, this is being generated in real time"
    for word in text.split():
        yield word + " "

# Generate speech as text streams
request = TTSRequest(
    text="",  # Left empty because the text is supplied by the stream below
    reference_id="your_voice_model_id",
    temperature=0.7,  # Controls variation
    top_p=0.7  # Controls diversity
)

with open("output.mp3", "wb") as f:
    for audio_chunk in session.tts(request, stream_text()):
        f.write(audio_chunk)

Configuration Options

Speed vs Quality

Latency Modes:
  • Normal: Best quality, ~500ms latency
  • Balanced: Good quality, ~300ms latency
request = TTSRequest(
    text="",
    reference_id="model_id",
    latency="balanced"  # For faster response
)

Voice Control

Temperature (0.1 - 1.0):
  • Lower: More consistent, predictable
  • Higher: More varied, expressive
Top-p (0.1 - 1.0):
  • Lower: More focused
  • Higher: More diverse
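
For example, the same voice model can be tuned toward either end of these ranges (the exact values below are illustrative):
# More consistent, predictable delivery
stable_request = TTSRequest(
    text="",
    reference_id="your_voice_model_id",
    temperature=0.2,
    top_p=0.3
)

# More varied, expressive delivery
expressive_request = TTSRequest(
    text="",
    reference_id="your_voice_model_id",
    temperature=0.9,
    top_p=0.9
)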

Real-time Applications

Chatbot Integration

Stream responses as they’re generated:
def chatbot_response(user_input):
    # Get the AI response as a stream of text chunks
    ai_text = get_ai_response(user_input)

    # Feed the text stream into a single TTS call and play audio as it arrives
    for audio_chunk in session.tts(request, ai_text):
        play_audio(audio_chunk)

Live Translation

Translate and speak simultaneously:
def live_translate(source_audio):
    # Transcribe source audio
    text = transcribe(source_audio)
    
    # Translate text
    translated = translate(text, target_language)
    
    # Stream the translated speech word by word
    for word in translated.split():
        generate_speech(word + " ")

Best Practices

Text Buffering

Do:
  • Send complete words with spaces
  • Use punctuation for natural pauses
  • Buffer 5-10 words for smoothness (see the sketch after this list)
Don’t:
  • Send individual characters
  • Forget spaces between words
  • Send huge chunks at once
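
A minimal sketch of the buffering guideline above, grouping an incoming stream of text fragments (for example, tokens from a language model) into chunks of several words before sending them to the TTS session; the chunk size of 8 words is an illustrative middle ground:
def buffered_stream(fragments, min_words=8):
    """Accumulate incoming text fragments and emit chunks of several words."""
    pending = ""
    for fragment in fragments:
        pending += fragment
        words = pending.split(" ")
        # Emit full chunks, keeping the last (possibly incomplete) word buffered
        while len(words) > min_words:
            yield " ".join(words[:min_words]) + " "
            words = words[min_words:]
        pending = " ".join(words)
    if pending.strip():
        yield pending + " "

# Example usage with text arriving in arbitrary pieces
fragments = ["Hello, this ", "is being gen", "erated piece ", "by piece right now."]
with open("buffered_output.mp3", "wb") as f:
    for audio_chunk in session.tts(request, buffered_stream(fragments)):
        f.write(audio_chunk)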

Connection Management

  1. Keep connections alive for multiple generations
  2. Handle disconnections gracefully
  3. Implement retry logic for reliability (see the sketch below)
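
A rough sketch of the retry idea, reusing the imports and TTSRequest from the first example; the retry count and back-off are arbitrary, and a real application would catch the SDK's specific connection errors rather than a bare Exception:
import time

def tts_with_retry(text, tts_request, api_key="your_api_key", retries=3):
    """Retry a streaming TTS call, opening a fresh session if the connection drops."""
    for attempt in range(retries):
        ws_session = WebSocketSession(api_key)
        try:
            # Collect the audio for this text; a real app would play chunks as they arrive
            return b"".join(ws_session.tts(tts_request, iter([text])))
        except Exception:
            # Back off briefly before reconnecting and trying again
            time.sleep(2 ** attempt)
    raise RuntimeError("TTS failed after several retries")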

Audio Playback

For smooth playback:
  • Buffer 2-3 audio chunks
  • Use cross-fading between chunks
  • Handle network delays gracefully
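
Putting these points together, the use-case sketches below assume a small stream_speech helper along the following lines, reusing the session, request, and play_audio callback from the chatbot example and holding back a few chunks to absorb short network stalls:
def stream_speech(text, tts_request=None):
    """Generate speech for `text` over the WebSocket session and play it back."""
    tts_request = tts_request or request  # Fall back to the request defined earlier
    pending = []
    for audio_chunk in session.tts(tts_request, iter([text])):
        pending.append(audio_chunk)
        # Keep 2-3 chunks buffered so brief network delays don't cause audible gaps
        if len(pending) > 3:
            play_audio(pending.pop(0))
    # Flush the remaining chunks once generation finishes
    for audio_chunk in pending:
        play_audio(audio_chunk)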

Common Use Cases

Interactive Story

def interactive_story():
    story_parts = [
        "Once upon a time,",
        "in a land far away,",
        "there lived a brave knight..."
    ]
    
    for part in story_parts:
        # Generate and play each part
        stream_speech(part)
        # Wait for user input
        user_choice = get_user_input()
        # Continue based on choice

Virtual Assistant

def virtual_assistant():
    while True:
        # Listen for wake word
        if detect_wake_word():
            # Start streaming response
            response = process_command()
            stream_speech(response)

Live Commentary

def live_commentary(event_stream):
    for event in event_stream:
        # Generate commentary
        commentary = generate_commentary(event)
        # Stream immediately
        stream_speech(commentary)

Troubleshooting

Audio Gaps

Problem: Gaps between audio chunks
Solution:
  • Increase buffer size
  • Use balanced latency mode
  • Check network connection

Delayed Response

Problem: Long wait before audio starts
Solution:
  • Use balanced latency mode
  • Send initial text immediately
  • Reduce chunk size
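
Building on the buffered_stream sketch from Best Practices, one rough way to shorten the wait is to combine the balanced latency mode with forwarding the first text fragment immediately instead of waiting to fill a buffer (the eager_stream helper is illustrative):
fast_request = TTSRequest(
    text="",
    reference_id="your_voice_model_id",
    latency="balanced"  # Trades a little quality for a faster first chunk
)

def eager_stream(fragments):
    fragments = iter(fragments)
    first = next(fragments, None)
    if first is not None:
        # Forward the first fragment right away so audio can start sooner
        yield first
    # Buffer the rest as usual
    yield from buffered_stream(fragments)

# fragments is the incoming text stream from the buffering example
for audio_chunk in session.tts(fast_request, eager_stream(fragments)):
    play_audio(audio_chunk)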

Choppy Playback

Problem: Audio cuts in and out
Solution:
  • Buffer more chunks before playing
  • Check network stability
  • Use consistent chunk sizes

Advanced Features

Dynamic Voice Switching

Change voices mid-stream:
# Start with one voice
request1 = TTSRequest(text="", reference_id="voice1")
stream_speech("Hello from voice one.", request1)

# Switch to another
request2 = TTSRequest(text="", reference_id="voice2")
stream_speech("And now voice two!", request2)

Emotion Injection

Add emotions dynamically:
def emotional_speech(text, emotion):
    emotional_text = f"({emotion}) {text}"
    stream_speech(emotional_text)

Speed Control

Adjust speaking speed:
request = TTSRequest(
    text="",
    prosody={
        "speed": 1.5,  # 1.5x speed
        "volume": 0    # Normal volume
    }
)

Performance Tips

  1. Pre-load voices for instant start
  2. Use connection pooling for multiple streams
  3. Monitor latency and adjust settings
  4. Cache common phrases for instant playback (see the sketch below)
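
For the last tip, a simple sketch of a phrase cache, synthesizing each common phrase once over the existing session and replaying the stored audio instantly afterwards (the helper names are illustrative):
# Cache mapping a phrase to its already-generated audio bytes
phrase_cache = {}

def speak_cached(phrase, tts_request):
    if phrase not in phrase_cache:
        # First use: generate the audio once and store it
        phrase_cache[phrase] = b"".join(session.tts(tts_request, iter([phrase])))
    play_audio(phrase_cache[phrase])

# Warm the cache at startup with phrases the assistant says often
for phrase in ["One moment, please.", "I'm sorry, I didn't catch that."]:
    phrase_cache[phrase] = b"".join(session.tts(request, iter([phrase])))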

Get Support

Need help with streaming?