# Speech-to-Text Streaming

## Real-time Audio Transcription with WebSocket
The Speechall API provides a WebSocket endpoint for real-time speech-to-text transcription, enabling you to stream audio and receive transcription results as they become available. This is ideal for live applications, voice assistants, and real-time communication tools.
## WebSocket Endpoint

The streaming endpoint mirrors the functionality of the REST `/transcribe` endpoint but uses the WebSocket protocol for bidirectional communication:

```
wss://api.speechall.com/v1/transcribe
```

**Note:** Use `wss://` (WebSocket Secure) instead of `https://` for the streaming endpoint.
## Supported Providers
The streaming service currently supports four speech-to-text providers:
- **AssemblyAI** - High-accuracy transcription with speaker diarization
- **Deepgram** - Low-latency streaming optimized for real-time applications
- **OpenAI** - Whisper and GPT-4o models for transcription with multilingual support
- **Gladia** - Advanced AI transcription with custom vocabulary support
## Audio Requirements

For optimal performance and compatibility, ensure your audio meets these specifications:

- **Sample Rate**: 16,000 Hz (16 kHz)
- **Channels**: Mono (single channel)
- **Encoding**: 16-bit PCM
- **Format**: Raw binary audio chunks
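A quick way to sanity-check your buffer sizes against these specifications (plain arithmetic helpers for illustration, not part of the API):

```javascript
// Bytes of audio produced per second of capture for raw PCM.
function pcmByteRate(sampleRateHz, bitsPerSample, channels) {
  return sampleRateHz * (bitsPerSample / 8) * channels;
}

// How much audio time (in milliseconds) one chunk of samples represents.
function chunkDurationMs(samplesPerChunk, sampleRateHz) {
  return (samplesPerChunk * 1000) / sampleRateHz;
}

// 16 kHz, 16-bit, mono → 32,000 bytes per second.
console.log(pcmByteRate(16000, 16, 1)); // 32000
// A 4096-sample chunk at 16 kHz covers 256 ms of audio.
console.log(chunkDurationMs(4096, 16000)); // 256
```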
## Connection Parameters

When establishing a WebSocket connection, provide transcription configuration as query parameters, just like the REST endpoint:

```javascript
const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('model', 'assemblyai.best');
wsUrl.searchParams.set('language', 'en');
wsUrl.searchParams.set('output_format', 'json');
wsUrl.searchParams.set('punctuation', 'true');

const ws = new WebSocket(wsUrl.toString());
```
## Authentication

**Important for Browser Applications:** Browsers do not allow custom headers (including the `Authorization` header) to be set on WebSocket requests. If you’re making WebSocket requests from a browser, you must include your API key as a query parameter:
```javascript
const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('api_key', apiKey);
wsUrl.searchParams.set('model', 'deepgram.nova-2');
wsUrl.searchParams.set('language', 'en');

const ws = new WebSocket(wsUrl.toString());
```
For server-side applications (Node.js, etc.), you can use either method:

- Query parameter: `api_key=YOUR_API_KEY`
- Authorization header: `Authorization: Bearer YOUR_API_KEY`
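The choice between the two methods can be captured in a small helper that builds connection settings per environment (`buildWsConnection` is a hypothetical helper for illustration, not part of any SDK):

```javascript
// Build a WebSocket URL plus headers for browser or server use.
// Hypothetical helper — the endpoint and parameter names come from
// the documentation above; the function itself is illustrative.
function buildWsConnection(apiKey, params, { browser }) {
  const url = new URL('wss://api.speechall.com/v1/transcribe');
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  if (browser) {
    // Browsers cannot set headers on WebSocket requests,
    // so the key must travel in the query string.
    url.searchParams.set('api_key', apiKey);
    return { url: url.toString(), headers: {} };
  }
  // Server-side clients (e.g. the `ws` package) accept custom headers,
  // which keeps the key out of the URL and out of access logs.
  return {
    url: url.toString(),
    headers: { Authorization: `Bearer ${apiKey}` }
  };
}
```

Keeping the key in a header on the server side avoids leaking it into URL-based logging.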
## Available Parameters

| Parameter | Type | Description | Default |
|---|---|---|---|
| `api_key` | string | Your API key (required for browser requests) | Required |
| `model` | string | Provider and model identifier (e.g., `assemblyai.best`, `deepgram.nova-2`) | Required |
| `language` | string | Language code in ISO 639-1 format (e.g., `en`, `es`) or `auto` for detection | `en` |
| `output_format` | string | Response format: `text`, `json`, or `verbose_json` | `text` |
| `punctuation` | boolean | Enable automatic punctuation | `true` |
| `diarization` | boolean | Enable speaker diarization | `false` |
| `temperature` | number | Controls randomness (0-1) | - |
| `initial_prompt` | string | Text prompt to guide the model | - |
| `speakers_expected` | integer | Expected number of speakers (1-10) | - |
| `custom_vocabulary` | array | List of custom words/phrases for better recognition | - |
## Basic Usage Example

Here’s a complete example of how to use the WebSocket streaming API in a browser:
```javascript
// Establish WebSocket connection with parameters (browser-compatible)
const apiKey = 'YOUR_API_KEY';
const wsUrl = new URL('wss://api.speechall.com/v1/transcribe');
wsUrl.searchParams.set('api_key', apiKey);
wsUrl.searchParams.set('model', 'deepgram.nova-2');
wsUrl.searchParams.set('language', 'en');
wsUrl.searchParams.set('output_format', 'json');
wsUrl.searchParams.set('punctuation', 'true');

const ws = new WebSocket(wsUrl.toString());

// Handle connection events
ws.onopen = () => {
  console.log('WebSocket connected');
  startAudioCapture();
};

ws.onmessage = (event) => {
  if (wsUrl.searchParams.get('output_format') === 'json') {
    const transcription = JSON.parse(event.data);
    console.log('Transcription:', transcription.text);
  } else {
    // Plain text response
    console.log('Transcription:', event.data);
  }
};

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};

ws.onclose = () => {
  console.log('WebSocket connection closed');
};

// Send audio chunks
function sendAudioChunk(audioBuffer) {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(audioBuffer);
  }
}
```
## Audio Capture Example

Here’s how to capture audio from the microphone and send it to the WebSocket:
```javascript
async function startAudioCapture() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000,
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true
      }
    });

    // The AudioContext resamples to 16 kHz even if the device
    // captures at a different rate.
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(stream);

    // Note: ScriptProcessorNode is deprecated in favor of AudioWorkletNode,
    // but it remains widely supported and is simpler for a short example.
    const processor = audioContext.createScriptProcessor(4096, 1, 1);

    processor.onaudioprocess = (event) => {
      const inputBuffer = event.inputBuffer;
      const inputData = inputBuffer.getChannelData(0);

      // Convert float32 samples (-1.0..1.0) to 16-bit PCM
      const int16Array = new Int16Array(inputData.length);
      for (let i = 0; i < inputData.length; i++) {
        int16Array[i] = Math.max(-32768, Math.min(32767, inputData[i] * 32768));
      }

      // Send audio chunk to WebSocket
      sendAudioChunk(int16Array.buffer);
    };

    source.connect(processor);
    processor.connect(audioContext.destination);
  } catch (error) {
    console.error('Error accessing microphone:', error);
  }
}
```
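The float-to-int16 conversion inside the capture callback can be factored into a pure function, which makes it easy to unit-test without a microphone (a small refactoring sketch, identical in behavior to the inline loop above):

```javascript
// Convert Web Audio float samples (-1.0..1.0) to 16-bit PCM,
// clamping out-of-range values. Mirrors the inline loop in the
// capture example above.
function floatTo16BitPCM(float32Samples) {
  const int16Array = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    int16Array[i] = Math.max(-32768, Math.min(32767, float32Samples[i] * 32768));
  }
  return int16Array;
}

// Full-scale positive input clamps to 32767; negative to -32768.
console.log(floatTo16BitPCM(Float32Array.from([1.0, -1.0, 0.5])));
```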
## Response Formats

### JSON Format (`output_format: 'json'`)

When using JSON output format, you’ll receive structured responses:

```json
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "text": "Hello, this is a test transcription.",
  "language": "en",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 0.5,
      "confidence": 0.98
    }
  ]
}
```
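A minimal sketch of handling such a message on the client, using only the field names shown in the example above (`summarizeTranscription` is an illustrative helper, not part of the API):

```javascript
// Parse a JSON transcription message and summarize it.
// Field names (text, words, start, end) follow the example
// response shown above.
function summarizeTranscription(raw) {
  const msg = JSON.parse(raw);
  const words = msg.words || [];
  return {
    text: msg.text,
    wordCount: words.length,
    // Time span covered by the word timestamps, in seconds.
    durationSec: words.length
      ? words[words.length - 1].end - words[0].start
      : 0
  };
}
```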
### Text Format (`output_format: 'text'`)

With text format, you’ll receive plain text chunks:

```
Hello, this is a test transcription.
```
## Best Practices

- **Buffer Management**: Implement proper audio buffering to handle network latency
- **Error Handling**: Always handle WebSocket errors and implement reconnection logic
- **Audio Quality**: Ensure good microphone quality and minimize background noise
- **Chunk Size**: Send audio chunks of 4096 samples (256 ms at 16 kHz) for optimal performance
- **Connection Management**: Close WebSocket connections properly when done
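The buffer-management practice above can be sketched as a small accumulator that collects incoming samples and emits fixed-size chunks for sending (an illustrative class, not part of any SDK):

```javascript
// Accumulate samples and emit fixed-size chunks (e.g. 4096 samples).
// Leftover samples are held until the next push.
class ChunkBuffer {
  constructor(chunkSize, onChunk) {
    this.chunkSize = chunkSize;
    this.onChunk = onChunk;
    this.pending = new Int16Array(0);
  }

  push(samples) {
    // Append the new samples to whatever was left over last time.
    const merged = new Int16Array(this.pending.length + samples.length);
    merged.set(this.pending, 0);
    merged.set(samples, this.pending.length);

    // Emit as many full chunks as we can.
    let offset = 0;
    while (merged.length - offset >= this.chunkSize) {
      this.onChunk(merged.subarray(offset, offset + this.chunkSize));
      offset += this.chunkSize;
    }
    // Keep the remainder for the next push.
    this.pending = merged.slice(offset);
  }
}
```

Wiring `onChunk` to `sendAudioChunk` decouples capture buffer sizes from the chunk size sent over the wire.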
## Error Handling

```javascript
ws.onerror = (error) => {
  console.error('WebSocket error:', error);
  // Implement reconnection logic
  setTimeout(() => {
    reconnectWebSocket();
  }, 1000);
};

ws.onclose = (event) => {
  if (event.code !== 1000) {
    console.log('Connection closed unexpectedly:', event.code, event.reason);
    // Implement reconnection logic
  }
};
```
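The fixed one-second delay above can be replaced with exponential backoff so repeated failures don’t hammer the server. A minimal sketch (the base delay and cap are illustrative choices, not API requirements):

```javascript
// Exponential backoff with a cap: 1 s, 2 s, 4 s, ... up to maxDelayMs.
function backoffDelayMs(attempt, baseMs = 1000, maxDelayMs = 30000) {
  return Math.min(maxDelayMs, baseMs * 2 ** attempt);
}

console.log(backoffDelayMs(0)); // 1000
console.log(backoffDelayMs(3)); // 8000
console.log(backoffDelayMs(9)); // 30000 (capped)
```

Reset the attempt counter to zero once a connection is successfully re-established.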
## SDK Support

WebSocket streaming support is coming soon to our official SDKs:

- **TypeScript SDK** - Real-time streaming with TypeScript support
- **Python SDK** - Async WebSocket client for Python applications

Stay tuned for updates on SDK availability.
## Rate Limits and Quotas
The same rate limits and quotas that apply to the REST API also apply to the WebSocket streaming endpoint. Monitor your usage through the console dashboard.
## Troubleshooting

**Common Issues:**

- **Audio Format**: Ensure audio is 16 kHz, mono, 16-bit PCM
- **Authentication**:
  - For browser applications: use the `api_key` query parameter (headers are not supported)
  - For server applications: use either the `Authorization` header or the `api_key` query parameter
- **Network**: Check for firewall restrictions on WebSocket connections
- **Browser Support**: Ensure the browser supports the WebSocket and MediaDevices APIs
## Need Help?
If you encounter issues with the streaming API, please contact our support team with details about your implementation and any error messages.