Speech-to-Text Models

Languages Features Details
Amazon Transcribe
A fully managed automatic speech recognition (ASR) service that converts speech into text using deep learning. It supports both batch and streaming transcriptions with features like punctuation, diarization, and language identification.
Amazon $1.44000/hr In progress
ab-GE, ast-ES, az-AZ +61 more
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Best
The Best tier model is optimized for accuracy, low latency, and ease of use. It is recommended for high-quality search experiences, complex audio files, and when high accuracy is necessary for Large Language Model processing.
AssemblyAI $0.37000/hr In progress
auto, en-US, en-GB +2 more
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Nano
The Nano tier model is a lightweight, lower cost model for a wide range of use cases. It is ideal for clear audio sources and is a budget-friendly alternative.
AssemblyAI $0.12000/hr In progress
auto
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Slam-1
Slam-1 is a Speech Language Model that combines LLM architecture with ASR encoders for superior speech-to-text transcription. It is optimized for customization via prompting and delivers enhanced accuracy by understanding context and semantic meaning. It is currently in public beta.
v1
AssemblyAI $0.37000/hr In progress
en-US
Punctuation Diarization Streaming Speaker Labels Word Timestamps Confidence
AssemblyAI Universal-2
Universal-2 is a state-of-the-art model built on Universal-1, offering enhanced multilingual support and improved performance in challenging audio conditions. It aims for best-in-class accuracy and is optimized for faster processing speeds.
v2
AssemblyAI $0.37000/hr In progress
auto
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Azure AI Speech-to-Text
Azure's default, general-purpose speech-to-text model, trained on a vast amount of Microsoft-owned data. It is suitable for conversational and dictation scenarios and supports both real-time and batch transcription.
Azure $0.18000/hr In progress
auto, af-ZA, am-ET +131 more
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Cloudflare - Whisper
A general-purpose speech recognition model based on OpenAI's Whisper, trained on a large dataset of diverse audio. It can perform multilingual speech recognition, speech translation, and language identification. Cloudflare Workers AI provides access to this model.
Cloudflare $0.02700/hr In progress
af, ar, hy +54 more
Punctuation Word Timestamps Language Detection
Deepgram - Base
Standard base model for speech recognition
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base ConversationalAI
Base model optimized for conversational AI applications
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base Finance
Base model optimized for finance terminology
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base General
Base model for general-purpose transcription
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base Meeting
Base model optimized for meetings and conferences
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base Phonecall
Base model optimized for phone conversations
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base Video
Base model optimized for video content
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Base Voicemail
Base model optimized for voicemail transcription
deepgram $0.75000/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Enhanced
Improved accuracy model for speech recognition
deepgram $0.87000/hr In progress
Coming soon
Punctuation Streaming Word Timestamps
Deepgram - Enhanced Finance
Enhanced model optimized for finance terminology
deepgram $0.87000/hr In progress
Coming soon
Punctuation Streaming Word Timestamps
Deepgram - Enhanced General
Enhanced model for general-purpose transcription
deepgram $0.87000/hr In progress
Coming soon
Punctuation Streaming Word Timestamps
Deepgram - Enhanced Meeting
Enhanced model optimized for meetings and conferences
deepgram $0.87000/hr In progress
Coming soon
Punctuation Streaming Word Timestamps
Deepgram - Enhanced Phonecall
Enhanced model optimized for phone conversations
deepgram $0.87000/hr In progress
Coming soon
Punctuation Streaming Word Timestamps
Deepgram - Nova
Advanced, high-performance speech recognition model
deepgram $0.25800/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Nova 2
High-accuracy, next-generation speech recognition model
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 ATC
Nova 2 model optimized for air traffic control
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Automotive
Nova 2 model optimized for automotive industry
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 ConversationalAI
Nova 2 model optimized for conversational AI
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Drivethru
Nova 2 model optimized for drive-through scenarios
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Finance
Nova 2 model optimized for finance terminology
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 General
Nova 2 model for general-purpose transcription
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Medical
Nova 2 model optimized for medical terminology
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Meeting
Nova 2 model optimized for meetings and conferences
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Phonecall
Nova 2 model optimized for phone conversations
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Video
Nova 2 model optimized for video content
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova 2 Voicemail
Nova 2 model optimized for voicemail transcription
deepgram $0.25800/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps
Deepgram - Nova General
Nova model for general-purpose transcription
deepgram $0.25800/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Nova Phonecall
Nova model optimized for phone conversations
deepgram $0.25800/hr In progress
Coming soon
Punctuation Streaming
Deepgram - Whisper
OpenAI Whisper model hosted by Deepgram
deepgram $0.28800/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram - Whisper Base
Base Whisper model hosted by Deepgram
deepgram $0.21000/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram - Whisper Large
Large Whisper model hosted by Deepgram
deepgram $0.28800/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram - Whisper Medium
Medium Whisper model hosted by Deepgram
deepgram $0.25200/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram - Whisper Small
Small Whisper model hosted by Deepgram
deepgram $0.22800/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram - Whisper Tiny
Tiny Whisper model hosted by Deepgram
deepgram $0.19800/hr In progress
af, ar, hy +54 more
Punctuation
Deepgram Nova 3
Nova 3 model offering high accuracy across a broader range of real-world enterprise use cases and challenging audio conditions.
deepgram $0.31200/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 General
Nova 3 model for general-purpose transcription
deepgram $0.31200/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 Medical
Nova 3 model optimized for medical terminology
deepgram $0.31200/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
FalAI - ElevenLabs Speech to Text
Generate text from speech using ElevenLabs' advanced speech-to-text model. Supports 99 languages with state-of-the-art accuracy.
FalAI $1.86486/hr In progress
Coming soon
Punctuation Diarization Speaker Labels Word Timestamps Language Detection
FalAI - Whisper
Whisper model hosted on FalAI platform
v3
FalAI $0.06900/hr In progress
af, ar, hy +54 more
Punctuation Diarization Speaker Labels Word Timestamps Language Detection
FalAI - Wizper
Optimized version of Whisper for improved performance
v3
FalAI $0.03000/hr In progress
af, am, ar +99 more
Punctuation Diarization Speaker Labels Word Timestamps Language Detection
FireworksAI - Whisper Turbo V3
Accelerated Whisper V3 model by FireworksAI
FireworksAI $0.05400/hr In progress
af, ar, hy +54 more
Punctuation
FireworksAI - Whisper V3
Whisper V3 model hosted by FireworksAI
FireworksAI $0.09000/hr In progress
af, ar, hy +54 more
Punctuation
Gladia Solaria
Gladia's next-generation ASR model, launched in April 2025. It is designed for real-time multilingual communications, with a stated 94% word accuracy rate (WAR) for common languages and 270 ms latency across 100 languages, including many previously underserved. It is optimized for call centers and voice-first platforms.
v1
Gladia $0.61200/hr In progress
auto
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Google - Enhanced
Enhanced speech recognition model by Google
google $0.96000/hr In progress
Coming soon
Punctuation
Google - Standard
Standard speech recognition model by Google
google $0.96000/hr In progress
Coming soon
Punctuation
GPT-4o mini Transcribe
Speech-to-text model powered by GPT-4o mini. Offers improvements in word error rate, language recognition, and accuracy compared to original Whisper models. A lighter, faster model with slightly reduced accuracy but lower latency compared to gpt-4o-transcribe.
OpenAI $0.18000/hr In progress
Coming soon
Punctuation Streaming Language Detection
GPT-4o Transcribe
Speech-to-text model powered by GPT-4o. Offers improvements in word error rate, language recognition, and accuracy compared to original Whisper models.
OpenAI $0.36000/hr In progress
Coming soon
Punctuation Streaming Language Detection
Groq - Distil Whisper Large V3 [EN]
A distilled version of Whisper Large V3, specifically engineered for faster and more efficient English speech recognition.
v3
groq $0.02000/hr 262x realtime
af, ar, hy +54 more
Punctuation
Groq - Whisper Large V3
A multilingual ASR model offering high accuracy and speed for transcription and translation in multiple languages.
v3
groq $0.11100/hr 300x realtime
af, ar, hy +54 more
Punctuation Language Detection
Groq - Whisper Turbo Large V3
A pruned and fine-tuned version of Whisper Large v3, designed for faster and less expensive multilingual speech recognition while maintaining comparable accuracy.
v3 Turbo
groq $0.04000/hr 247x realtime
af, ar, hy +54 more
Punctuation Language Detection
OpenAI - Whisper
General-purpose speech recognition model. Based on the open-source Whisper large-v2 model, offering faster performance than the open-source version.
large-v2
OpenAI $0.36000/hr In progress
af, ar, hy +54 more
Punctuation Word Timestamps Language Detection
Rev AI Enhanced
Rev AI's high-accuracy general-purpose speech-to-text model, trained on a diverse collection of voices. Offers best-in-class accuracy for various use cases including business intelligence, market research, and meeting transcription. The v2 version provides a significant accuracy increase over its predecessor.
v2.0
RevAI $0.30000/hr In progress
en, es, fr
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Rev AI Reverb ASR
Rev AI's open-source derived English Automatic Speech Recognition (ASR) model. Trained on 200,000 hours of human-transcribed English audio, known for high accuracy on English speech and allowing control over transcription verbatimicity.
v1.0
RevAI $0.30000/hr In progress
en
Punctuation Diarization Streaming Speaker Labels Word Timestamps Confidence
Speech-to-Text
Leverage the rapid processing capabilities of AI models to enable accurate and efficient real-time speech-to-text transcription.
FalAI $0.04973/hr In progress
Coming soon
Punctuation Streaming
Speechmatics Enhanced
Speechmatics' Enhanced ASR model offers very good accuracy, though processing is slower than the Standard model.
speechmatics $0.40000/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Speechmatics Standard
Speechmatics' Standard ASR model offers faster results with good accuracy.
speechmatics $0.24000/hr In progress
Coming soon
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Watson Speech to Text
A cloud-based speech recognition service from IBM Watson that converts audio into text using AI and machine learning. It provides accurate transcription and supports various use cases. It is designed for enterprise applications requiring robust, scalable, and customizable capabilities.
IBM $1.20000/hr In progress
ar-MS, zh-CN, cs-CZ +17 more
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence

Compare Speech-to-Text Models

This comparison of speech-to-text models helps you choose a solution for your audio transcription needs. The table lists models from leading providers, with information on pricing, processing speed, and supported features.

Each model offers unique capabilities such as punctuation, speaker diarization, real-time streaming, speaker labels, word timestamps, and automatic language detection. Filter by language support to find models that work with your specific linguistic requirements, whether you need English, Spanish, French, German, Japanese, or dozens of other languages.

Click on any model to open its detailed card with full specifications, including maximum file size, accuracy ratings, supported audio formats, and sample rates. Whether you're developing an application that requires real-time transcription, processing interviews with multiple speakers, or need highly accurate transcripts with confidence scores, you'll find a suitable speech recognition model for your project.

Compare hourly costs and processing speeds to balance your budget against performance requirements. Our detailed feature badges make it easy to identify models supporting critical functionality like speaker identification, word-level timestamps, or automatic punctuation insertion.
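
The cost-versus-feature comparison described above can be sketched in a few lines of Python. This is only an illustration: the five rows are copied from the table on this page, while the `Model` class, the `shortlist` function, and the 100-hour workload are hypothetical names and figures chosen for the example, not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    price_per_hour: float  # USD per hour of audio, as listed in the table
    features: frozenset    # feature badges shown for the model


FULL = frozenset({
    "Punctuation", "Diarization", "Streaming", "Speaker Labels",
    "Word Timestamps", "Language Detection", "Confidence",
})

# A small sample of rows from the table above (not the full catalog).
MODELS = [
    Model("Amazon Transcribe", 1.44, FULL),
    Model("AssemblyAI Nano", 0.12, FULL),
    Model("Cloudflare - Whisper", 0.027,
          frozenset({"Punctuation", "Word Timestamps", "Language Detection"})),
    Model("Deepgram - Nova 2", 0.258,
          frozenset({"Punctuation", "Diarization", "Streaming",
                     "Speaker Labels", "Word Timestamps"})),
    Model("Groq - Whisper Large V3", 0.111,
          frozenset({"Punctuation", "Language Detection"})),
]


def shortlist(models, required, audio_hours):
    """Return (name, estimated cost) pairs for models that carry every
    required feature badge, cheapest first."""
    required = frozenset(required)
    matches = [m for m in models if required <= m.features]
    matches.sort(key=lambda m: m.price_per_hour)
    return [(m.name, round(m.price_per_hour * audio_hours, 2)) for m in matches]


if __name__ == "__main__":
    # Hypothetical workload: 100 hours of multi-speaker audio.
    for name, cost in shortlist(MODELS, {"Diarization", "Word Timestamps"}, 100):
        print(f"{name}: ${cost:.2f}")
```

The cheapest model overall (Cloudflare - Whisper in this sample) drops out as soon as diarization is required, which is why filtering on features before sorting by price matters.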