Speech-to-Text Streaming

Real-time Audio Transcription with WebSocket

The Speechall API provides a WebSocket endpoint for real-time speech-to-text transcription, enabling you to stream audio and receive transcription results as they become available. This is ideal for live applications, voice assistants, and real-time communication tools.

WebSocket Endpoint

The streaming endpoint mirrors the functionality of the REST /transcribe endpoint but uses the WebSocket protocol for bidirectional communication:

wss://api.speechall.com/v1/transcribe

Note: Use wss:// (WebSocket Secure) instead of https:// for the streaming endpoint.

Supported Providers

The streaming service currently supports four speech-to-text providers:

  • AssemblyAI - High-accuracy transcription with speaker diarization
  • Deepgram - Low-latency streaming optimized for real-time applications
  • OpenAI - Whisper and GPT-4o models for transcription with multilingual support
  • Gladia - Advanced AI transcription with custom vocabulary support

Audio Requirements

For optimal performance and compatibility, ensure your audio meets these specifications:

  • Sample Rate: 16,000 Hz (16 kHz)
  • Channels: Mono (single channel)
  • Encoding: 16-bit PCM
  • Format: Raw binary audio chunks
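
The specification above fixes the byte rate at 32,000 bytes per second (16,000 samples × 2 bytes). The helper below shows how chunk size and duration relate under that spec; the function names are illustrative, not part of the Speechall API.

```javascript
// Sizing helpers for the 16 kHz / mono / 16-bit PCM spec above.
const SAMPLE_RATE = 16000;   // Hz
const BYTES_PER_SAMPLE = 2;  // 16-bit PCM

// Bytes needed to hold `ms` milliseconds of audio.
function chunkBytes(ms) {
  return Math.round((ms / 1000) * SAMPLE_RATE) * BYTES_PER_SAMPLE;
}

// Duration in milliseconds of a buffer of `sampleCount` samples.
function chunkDurationMs(sampleCount) {
  return (sampleCount / SAMPLE_RATE) * 1000;
}

console.log(chunkBytes(256));        // 8192 bytes for a 256 ms chunk
console.log(chunkDurationMs(4096));  // 256 ms for a 4096-sample chunk
```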

Connection Parameters

When establishing a WebSocket connection, provide transcription configuration as query parameters, just like the REST endpoint:

const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('model', 'assemblyai.best');
wsUrl.searchParams.set('language', 'en');
wsUrl.searchParams.set('output_format', 'json');
wsUrl.searchParams.set('punctuation', 'true');

const ws = new WebSocket(wsUrl.toString());

Authentication

Important for Browser Applications: Browsers do not allow custom headers (including Authorization headers) to be set on WebSocket requests. If you’re making WebSocket requests from a browser, you must include your API key as a query parameter:

const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('api_key', apiKey);
wsUrl.searchParams.set('model', 'deepgram.nova-2');
wsUrl.searchParams.set('language', 'en');

const ws = new WebSocket(wsUrl.toString());

For server-side applications (Node.js, etc.), you can use either method:

  • Query parameter: api_key=YOUR_API_KEY
  • Authorization header: Authorization: Bearer YOUR_API_KEY
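
For a server-side connection with the Authorization header, a minimal sketch is shown below. It assumes the popular `ws` npm package (`npm install ws`), whose client constructor accepts a `headers` option; that option belongs to the `ws` package, not to the Speechall API itself.

```javascript
// Build the connection URL and header options for a server-side client.
function buildConnection(apiKey) {
  const url = new URL('wss://api.speechall.com/v1/transcribe');
  url.searchParams.set('model', 'deepgram.nova-2');
  url.searchParams.set('language', 'en');
  return {
    url: url.toString(),
    // Passed as the options argument to the `ws` package's WebSocket client.
    options: { headers: { Authorization: `Bearer ${apiKey}` } },
  };
}

// Usage with the `ws` package (uncomment in a Node.js project):
// const WebSocket = require('ws');
// const { url, options } = buildConnection(process.env.SPEECHALL_API_KEY);
// const ws = new WebSocket(url, options);
```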

Available Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| api_key | string | Your API key (required for browser requests) | Required |
| model | string | Provider and model identifier (e.g., assemblyai.best, deepgram.nova-2) | Required |
| language | string | Language code in ISO 639-1 format (e.g., en, es) or auto for detection | en |
| output_format | string | Response format: text, json, or verbose_json | text |
| punctuation | boolean | Enable automatic punctuation | true |
| diarization | boolean | Enable speaker diarization | false |
| temperature | number | Controls randomness (0-1) | - |
| initial_prompt | string | Text prompt to guide the model | - |
| speakers_expected | integer | Expected number of speakers (1-10) | - |
| custom_vocabulary | array | List of custom words/phrases for better recognition | - |
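
As a sketch, the parameters can be assembled into a connection URL with a small helper. The helper name is illustrative; only scalar parameters from the table are shown, since the wire encoding of array parameters such as custom_vocabulary is not covered here.

```javascript
// Illustrative helper: turn a parameter object into the connection URL.
function buildTranscribeUrl(params) {
  const url = new URL('wss://api.speechall.com/v1/transcribe');
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) url.searchParams.set(key, String(value));
  }
  return url.toString();
}

const url = buildTranscribeUrl({
  model: 'assemblyai.best',
  language: 'en',
  output_format: 'verbose_json',
  diarization: true,
  speakers_expected: 2,
});
console.log(url);
```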

Basic Usage Example

Here’s a complete example of how to use the WebSocket streaming API in a browser:

// Establish WebSocket connection with parameters (browser-compatible)
const apiKey = 'YOUR_API_KEY';
const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('api_key', apiKey);
wsUrl.searchParams.set('model', 'deepgram.nova-2');
wsUrl.searchParams.set('language', 'en');
wsUrl.searchParams.set('output_format', 'json');
wsUrl.searchParams.set('punctuation', 'true');

const ws = new WebSocket(wsUrl.toString());

// Handle connection events
ws.onopen = () => {
    console.log('WebSocket connected');
    startAudioCapture();
};

ws.onmessage = (event) => {
    if (wsUrl.searchParams.get('output_format') === 'json') {
        const transcription = JSON.parse(event.data);
        console.log('Transcription:', transcription.text);
    } else {
        // Plain text response
        console.log('Transcription:', event.data);
    }
};

ws.onerror = (error) => {
    console.error('WebSocket error:', error);
};

ws.onclose = () => {
    console.log('WebSocket connection closed');
};

// Send audio chunks
function sendAudioChunk(audioBuffer) {
    if (ws.readyState === WebSocket.OPEN) {
        ws.send(audioBuffer);
    }
}

Audio Capture Example

Here’s how to capture audio from the microphone and send it to the WebSocket:

async function startAudioCapture() {
    try {
        const stream = await navigator.mediaDevices.getUserMedia({
            audio: {
                sampleRate: 16000,
                channelCount: 1,
                echoCancellation: true,
                noiseSuppression: true
            }
        });

        const audioContext = new AudioContext({ sampleRate: 16000 });
        const source = audioContext.createMediaStreamSource(stream);
        // Note: ScriptProcessorNode is deprecated in favor of AudioWorklet,
        // but it remains widely supported and keeps this example simple.
        const processor = audioContext.createScriptProcessor(4096, 1, 1);

        processor.onaudioprocess = (event) => {
            const inputBuffer = event.inputBuffer;
            const inputData = inputBuffer.getChannelData(0);
            
            // Convert float32 to int16
            const int16Array = new Int16Array(inputData.length);
            for (let i = 0; i < inputData.length; i++) {
                int16Array[i] = Math.max(-32768, Math.min(32767, inputData[i] * 32768));
            }
            
            // Send audio chunk to WebSocket
            sendAudioChunk(int16Array.buffer);
        };

        source.connect(processor);
        processor.connect(audioContext.destination);
        
    } catch (error) {
        console.error('Error accessing microphone:', error);
    }
}
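
One caveat: the sampleRate values passed to getUserMedia and AudioContext are hints, and some browsers run the context at the hardware rate (often 44,100 or 48,000 Hz) regardless. If audioContext.sampleRate does not equal 16000, resample before converting to 16-bit PCM. The helper below is an illustrative linear-interpolation sketch, not part of the Speechall API.

```javascript
// Resample a Float32Array down to 16 kHz using linear interpolation.
function resampleTo16k(input, inputRate) {
  const TARGET_RATE = 16000;
  if (inputRate === TARGET_RATE) return input; // already at target rate
  const ratio = inputRate / TARGET_RATE;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const next = Math.min(idx + 1, input.length - 1);
    // Interpolate between the two nearest input samples.
    output[i] = input[idx] * (1 - frac) + input[next] * frac;
  }
  return output;
}
```

In the onaudioprocess handler, call resampleTo16k(inputData, audioContext.sampleRate) before the Float32-to-Int16 conversion.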

Response Formats

JSON Format (output_format: 'json')

When using JSON output format, you’ll receive structured responses:

{
    "id": "123e4567-e89b-12d3-a456-426614174000",
    "text": "Hello, this is a test transcription.",
    "language": "en",
    "words": [
        {
            "text": "Hello",
            "start": 0.0,
            "end": 0.5,
            "confidence": 0.98
        }
    ]
}

Text Format (output_format: 'text')

With text format, you’ll receive plain text chunks:

Hello, this is a test transcription.
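
A handler can normalize both formats into one shape before the rest of the application sees them. This is an illustrative sketch; the function name and the normalized shape are not part of the API.

```javascript
// Normalize an incoming WebSocket message into { text, words }.
function parseMessage(data, outputFormat) {
  if (outputFormat === 'json' || outputFormat === 'verbose_json') {
    const result = JSON.parse(data);
    return { text: result.text, words: result.words || [] };
  }
  // With the text format, the payload is the transcript itself.
  return { text: data, words: [] };
}
```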

Best Practices

  1. Buffer Management: Implement proper audio buffering to handle network latency
  2. Error Handling: Always handle WebSocket errors and implement reconnection logic
  3. Audio Quality: Ensure good microphone quality and minimize background noise
  4. Chunk Size: Send audio chunks of 4096 samples (256ms at 16kHz) for optimal performance
  5. Connection Management: Close WebSocket connections properly when done
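
Best practice 1 can be sketched as a small queue that buffers chunks while the socket is still connecting and flushes them once it opens. The class name and API are illustrative; in real code you would also compose with any existing onopen handler rather than overwrite it.

```javascript
const WS_OPEN = 1; // Numeric value of WebSocket.OPEN

// Buffer audio chunks until the socket is ready, then flush in order.
class BufferedSender {
  constructor(ws) {
    this.ws = ws;
    this.queue = [];
    ws.onopen = () => this.flush(); // overwrites any prior onopen handler
  }

  send(chunk) {
    if (this.ws.readyState === WS_OPEN) {
      this.ws.send(chunk);
    } else {
      this.queue.push(chunk); // hold the chunk until the connection opens
    }
  }

  flush() {
    while (this.queue.length > 0) {
      this.ws.send(this.queue.shift());
    }
  }
}
```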

Error Handling

ws.onerror = (error) => {
    console.error('WebSocket error:', error);
    // Implement reconnection logic
    setTimeout(() => {
        reconnectWebSocket();
    }, 1000);
};

ws.onclose = (event) => {
    if (event.code !== 1000) {
        console.log('Connection closed unexpectedly:', event.code, event.reason);
        // Implement reconnection logic
    }
};
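
The reconnectWebSocket call above is left to you. A minimal sketch with exponential backoff follows; connect() stands in for whatever function creates your WebSocket, and all names here are illustrative.

```javascript
let reconnectAttempts = 0;
const MAX_ATTEMPTS = 5;

// Delay before the n-th retry: 1 s, 2 s, 4 s, ... capped at 30 s.
function backoffDelay(attempt) {
  return Math.min(1000 * 2 ** attempt, 30000);
}

function reconnectWebSocket() {
  if (reconnectAttempts >= MAX_ATTEMPTS) {
    console.error('Giving up after', MAX_ATTEMPTS, 'attempts');
    return;
  }
  setTimeout(() => {
    reconnectAttempts += 1;
    // connect() should rebuild the WebSocket and reset reconnectAttempts
    // back to 0 in its onopen handler.
    connect();
  }, backoffDelay(reconnectAttempts));
}
```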

SDK Support

WebSocket streaming support is coming soon to our official SDKs:

  • TypeScript SDK - Real-time streaming with TypeScript support
  • Python SDK - Async WebSocket client for Python applications

Stay tuned for updates on SDK availability.

Rate Limits and Quotas

The same rate limits and quotas that apply to the REST API also apply to the WebSocket streaming endpoint. Monitor your usage through the console dashboard.

Troubleshooting

Common Issues:

  • Audio Format: Ensure audio is 16kHz, mono, 16-bit PCM
  • Authentication:
    • For browser applications: Use api_key query parameter (headers are not supported)
    • For server applications: Use either Authorization header or api_key query parameter
  • Network: Check for firewall restrictions on WebSocket connections
  • Browser Support: Ensure WebSocket and MediaDevices API support

Need Help?

If you encounter issues with the streaming API, please contact our support team with details about your implementation and any error messages.