A fully managed automatic speech recognition (ASR) service that converts speech into text using deep learning. It supports both batch and streaming transcriptions with features like punctuation, diarization, and language identification.
The Best tier model is optimized for accuracy, low latency, and ease of use. It is recommended for high-quality search experiences, complex audio files, and when high accuracy is necessary for Large Language Model processing.
The Nano tier model is a lightweight, lower cost model for a wide range of use cases. It is ideal for clear audio sources and is a budget-friendly alternative.
Slam-1 is a Speech Language Model that combines LLM architecture with ASR encoders for superior speech-to-text transcription. It is optimized for customization via prompting and delivers enhanced accuracy by understanding context and semantic meaning. It is currently in public beta.
Universal-2 is a state-of-the-art model built on Universal-1, offering enhanced multilingual support and improved performance in challenging audio conditions. It aims for best-in-class accuracy and is optimized for faster processing speeds.
Azure's default, general-purpose speech-to-text model, trained on a vast amount of Microsoft-owned data. It is suitable for conversational and dictation scenarios and supports both real-time and batch transcription.
A general-purpose speech recognition model based on OpenAI's Whisper, trained on a large dataset of diverse audio. It can perform multilingual speech recognition, speech translation, and language identification. Cloudflare Workers AI provides access to this model.
Cloudflare
$0.02700/hr
In progress
af, ar, hy +54 more
Punctuation, Word Timestamps, Language Detection
Deepgram - Base
Standard base model for speech recognition
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base ConversationalAI
Base model optimized for conversational AI applications
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base Finance
Base model optimized for finance terminology
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base General
Base model for general-purpose transcription
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base Meeting
Base model optimized for meetings and conferences
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base Phonecall
Base model optimized for phone conversations
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base Video
Base model optimized for video content
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Base Voicemail
Base model optimized for voicemail transcription
Deepgram
$0.75000/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Enhanced
Improved accuracy model for speech recognition
Deepgram
$0.87000/hr
In progress
Coming soon
Punctuation, Streaming, Word Timestamps
Deepgram - Enhanced Finance
Enhanced model optimized for finance terminology
Deepgram
$0.87000/hr
In progress
Coming soon
Punctuation, Streaming, Word Timestamps
Deepgram - Enhanced General
Enhanced model for general-purpose transcription
Deepgram
$0.87000/hr
In progress
Coming soon
Punctuation, Streaming, Word Timestamps
Deepgram - Enhanced Meeting
Enhanced model optimized for meetings and conferences
Deepgram
$0.87000/hr
In progress
Coming soon
Punctuation, Streaming, Word Timestamps
Deepgram - Enhanced Phonecall
Enhanced model optimized for phone conversations
Deepgram
$0.87000/hr
In progress
Coming soon
Punctuation, Streaming, Word Timestamps
Deepgram - Nova
Advanced, high-performance speech recognition model
Deepgram
$0.25800/hr
In progress
Coming soon
Punctuation, Streaming
Deepgram - Nova 2
High-accuracy, next-generation speech recognition model
Gladia's cutting-edge, next-generation ASR model, launched in April 2025. Designed for real-time multilingual communications, offering unmatched accuracy (94% WAR for common languages) and ultra-low latency (270ms) across 100 languages, including many previously underserved. Optimized for call centers and voice-first platforms.
Speech-to-text model powered by GPT-4o mini. Offers improvements in word error rate, language recognition, and accuracy compared to original Whisper models. A lighter, faster model with slightly reduced accuracy but lower latency compared to gpt-4o-transcribe.
OpenAI
$0.18000/hr
In progress
Coming soon
Punctuation, Streaming, Language Detection
GPT-4o Transcribe
Speech-to-text model powered by GPT-4o. Offers improvements in word error rate, language recognition, and accuracy compared to original Whisper models.
OpenAI
$0.36000/hr
In progress
Coming soon
Punctuation, Streaming, Language Detection
Groq - Distil Whisper Large V3 [EN]
A distilled version of Whisper Large V3, specifically engineered for faster and more efficient English speech recognition.
v3
Groq
$0.02000/hr
262x realtime
af, ar, hy +54 more
Punctuation
Groq - Whisper Large V3
A multilingual ASR model offering high accuracy and speed for transcription and translation in multiple languages.
v3
Groq
$0.11100/hr
300x realtime
af, ar, hy +54 more
Punctuation, Language Detection
Groq - Whisper Turbo Large V3
A pruned and fine-tuned version of Whisper Large v3, designed for faster and less expensive multilingual speech recognition while maintaining comparable accuracy.
v3 Turbo
Groq
$0.04000/hr
247x realtime
af, ar, hy +54 more
Punctuation, Language Detection
OpenAI - Whisper
General-purpose speech recognition model. Based on the open-source Whisper large-v2 model, offering faster performance than the open-source version.
large-v2
OpenAI
$0.36000/hr
In progress
af, ar, hy +54 more
Punctuation, Word Timestamps, Language Detection
Rev AI Enhanced
Rev AI's high-accuracy general-purpose speech-to-text model, trained on a diverse collection of voices. Offers best-in-class accuracy for various use cases including business intelligence, market research, and meeting transcription. The v2 version provides a significant accuracy increase over its predecessor.
Rev AI's open-source derived English Automatic Speech Recognition (ASR) model. Trained on 200,000 hours of human-transcribed English audio, known for high accuracy on English speech and allowing control over transcription verbatimicity.
A cloud-based speech recognition service from IBM Watson that converts audio into text using AI and machine learning. It provides accurate transcription and supports various use cases. It is designed for enterprise applications requiring robust, scalable, and customizable capabilities.
This comprehensive comparison of speech-to-text models helps you find the perfect solution for your audio transcription needs. Our table
showcases models from leading providers, with detailed information on pricing, processing speed, and supported features.
Models differ in the capabilities they support, such as punctuation, speaker diarization (speaker labels), real-time streaming, word timestamps, and
automatic language detection. Filter by language support to find models that match your linguistic requirements, whether you
need English, Spanish, French, German, Japanese, or dozens of other languages.
Click on any model to access its detailed card with comprehensive specifications, including maximum file size, accuracy ratings, supported
audio formats, and sample rates. Whether you're developing an application requiring real-time transcription, processing interviews with
multiple speakers, or needing highly accurate transcripts with confidence scores, you'll find the right speech recognition model for your
project.
Compare hourly costs and processing speeds to balance your budget against performance requirements. Our detailed feature badges make it
easy to identify models supporting critical functionality like speaker identification, word-level timestamps, or automatic punctuation
insertion.
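To make the cost-versus-speed comparison concrete, here is a minimal Python sketch that estimates the bill and the processing time for a batch of audio. It uses only prices and speeds copied from the listings above. The names `ModelListing`, `estimate`, and `cheapest` are hypothetical helpers, not part of any provider's API, and the sketch assumes that prices are billed per hour of audio and that "Nx realtime" means N seconds of audio are processed per second of wall-clock time.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ModelListing:
    """One row of the comparison table (hypothetical type, only the fields used here)."""
    name: str
    usd_per_audio_hour: float
    realtime_factor: Optional[float]  # None when the table shows "In progress"


# Values copied from the listings above.
LISTINGS = [
    ModelListing("Groq - Distil Whisper Large V3 [EN]", 0.02, 262.0),
    ModelListing("Groq - Whisper Large V3", 0.111, 300.0),
    ModelListing("Groq - Whisper Turbo Large V3", 0.04, 247.0),
    ModelListing("OpenAI - Whisper", 0.36, None),
]


def estimate(model: ModelListing, audio_hours: float) -> Tuple[float, Optional[float]]:
    """Return (cost in USD, processing seconds or None if the speed is not listed).

    Assumption: price is per hour of audio, and "Nx realtime" means
    N seconds of audio are processed per second of wall-clock time.
    """
    cost = model.usd_per_audio_hour * audio_hours
    if model.realtime_factor is None:
        return cost, None
    return cost, audio_hours * 3600.0 / model.realtime_factor


def cheapest(models: Iterable[ModelListing], audio_hours: float) -> ModelListing:
    """Pick the listing with the lowest estimated cost for the given amount of audio."""
    return min(models, key=lambda m: estimate(m, audio_hours)[0])


if __name__ == "__main__":
    for m in LISTINGS:
        cost, secs = estimate(m, 100)
        speed = "speed not listed" if secs is None else f"about {secs / 60:.1f} min"
        print(f"{m.name}: ${cost:.2f}, {speed}")
```

Because the estimate is linear in the amount of audio, the ranking by cost does not change with volume; the processing-time figure is the part worth checking against your own latency requirements.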