Speech-to-Text Models

Model	Provider	Price	Subtitle Formats	Features
Amazon Transcribe A fully managed automatic speech recognition (ASR) service that converts speech ...	Amazon	$1.44000/hr	SRT VTT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Universal 2 Universal-2 is a state-of-the-art model built on Universal-1, offering enhanced ... v2	AssemblyAI	$0.37000/hr	SRT VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Universal 3 Pro Universal-3 Pro is the first production-quality speech model that adapts its beh... v3	AssemblyAI	$0.21000/hr	None	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Azure AI Speech-to-Text Azure's default, general-purpose speech-to-text model, trained on a vast amount ...	Azure	$0.18000/hr	SRT VTT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Cloudflare - Whisper A general-purpose speech recognition model based on OpenAI's Whisper, trained on...	Cloudflare	$0.02700/hr	VTT	Punctuation Language Detection
Cloudflare - Whisper Large V3 Turbo Whisper is a pre-trained model for automatic speech recognition (ASR) and speech...	Cloudflare	$0.03060/hr	VTT	Punctuation Word Timestamps Language Detection
Cloudflare - Whisper Tiny (EN) This is the English-only version of the Whisper Tiny model which was trained on ...	Cloudflare	$0.02700/hr	VTT	Punctuation Word Timestamps
Deepgram - Base Standard base model for speech recognition v2024-01-26.8851	Deepgram	$0.75000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Improved accuracy model for speech recognition	Deepgram	$0.87000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Finance Enhanced model optimized for finance terminology	Deepgram	$0.87000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced General Enhanced model for general-purpose transcription	Deepgram	$0.87000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Meeting Enhanced model optimized for meetings and conferences	Deepgram	$0.87000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Phonecall Enhanced model optimized for phone conversations	Deepgram	$0.87000/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova Advanced, high-performance speech recognition model	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova 2 High-accuracy, next-generation speech recognition model	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Automotive Nova 2 model optimized for automotive industry	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 ConversationalAI Nova 2 model optimized for conversational AI	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Drivethru Nova 2 model optimized for drive-through scenarios	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Finance Nova 2 model optimized for finance terminology	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 General Nova 2 model for general-purpose transcription	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Medical Nova 2 model optimized for medical terminology	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Meeting Nova 2 model optimized for meetings and conferences	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Video Nova 2 model optimized for video content	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Voicemail Nova 2 model optimized for voicemail transcription	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova General Nova model for general-purpose transcription	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova Phonecall Nova model optimized for phone conversations	Deepgram	$0.25800/hr	VTT	Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram Nova 3 Great accuracy in a broader range of real-world enterprise use cases and challen...	Deepgram	$0.31200/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 General Nova 3 model for general-purpose transcription	Deepgram	$0.31200/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 Medical Nova 3 model optimized for medical terminology	Deepgram	$0.31200/hr	VTT	Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
ElevenLabs Scribe Scribe is a speech-to-text model built for accuracy and handling real-world audi... v1	ElevenLabs	$0.40000/hr	SRT VTT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection
FalAI - Cohere Transcribe Cohere Transcribe turns your business audio into accurate text, ready for search...	FalAI	$0.25000/hr	None	Punctuation
FalAI - Whisper Whisper model hosted on FalAI platform v3	FalAI	$0.06900/hr	None	Punctuation Speaker Labels Word Timestamps
FalAI - Wizper Optimized version of Whisper for improved performance v3	FalAI	$0.03000/hr	None	Punctuation Speaker Labels
Gemini 2.5 Flash Best model for price-performance, ideal for high-throughput tasks like large-sca... v2.5	Gemini	$0.12222/hr	None	Punctuation Language Detection
Gemini 2.5 Flash-Lite Most cost-efficient and fastest model, optimized for high-volume, latency-sensit... v2.5	Gemini	$0.01215/hr	None	Punctuation Language Detection
Gemini 2.5 Pro Most advanced model for complex tasks, excelling at coding and complex prompts. v2.5	Gemini	$0.26100/hr	None	Punctuation Language Detection
Gladia Solaria Gladia's cutting-edge, next-generation ASR model, launched in April 2025. Design... v1	Gladia	$0.61200/hr	VTT	Punctuation Diarization Streaming Speaker Labels Language Detection
Google Cloud - Enhanced Enhanced speech recognition model by Google	Google	$0.96000/hr	SRT VTT	Punctuation Diarization Word Timestamps Language Detection
Google Cloud - Standard Standard speech recognition model by Google	Google	$0.96000/hr	SRT VTT	Punctuation Diarization Word Timestamps Language Detection
Groq - Whisper Large V3 A multilingual ASR model offering high accuracy and speed for transcription and ... v3	Groq	$0.11100/hr	None	Punctuation Word Timestamps Language Detection
Groq - Whisper Turbo Large V3 A pruned and fine-tuned version of Whisper Large v3, designed for faster and les... v3 Turbo	Groq	$0.04000/hr	None	Punctuation Word Timestamps Language Detection
IBM Watson Speech to Text A cloud-based speech recognition service from IBM Watson that converts audio int...	IBM	$1.20000/hr	None	Punctuation Diarization Speaker Labels Word Timestamps Confidence
OpenAI - GPT-4o mini Transcribe Speech-to-text model powered by GPT-4o mini. Offers improvements in word error r...	OpenAI	$0.18000/hr	SRT VTT	Punctuation Streaming Language Detection
OpenAI - GPT-4o Transcribe Speech-to-text model powered by GPT-4o. Offers improvements in word error rate, ...	OpenAI	$0.36000/hr	SRT VTT	Punctuation Streaming Language Detection
OpenAI - GPT-4o Transcribe Diarize Speech-to-text model powered by GPT-4o. Offers improvements in word error rate, ...	OpenAI	$0.36000/hr	SRT VTT	Punctuation Streaming Language Detection
OpenAI - Whisper General-purpose speech recognition model. Based on the open-source Whisper large... vlarge-v2	OpenAI	$0.36000/hr	SRT VTT	Punctuation Streaming Word Timestamps Language Detection
Rev AI Enhanced Rev AI's high-accuracy general-purpose speech-to-text model, trained on a divers... v2.0	RevAI	$0.30000/hr	SRT VTT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Rev AI Reverb ASR Rev AI's open-source derived English Automatic Speech Recognition (ASR) model. T... v1.0	RevAI	$0.30000/hr	SRT VTT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Speechmatics Enhanced Speechmatics' Enhanced ASR model offers very good accuracy, though processing is...	Speechmatics	$0.40000/hr	SRT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Speechmatics Standard Speechmatics' Standard ASR model offers faster results with good accuracy.	Speechmatics	$0.24000/hr	SRT	Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Voxtral Mini Transcribe Voxtral is Mistral’s audio model family designed for powerful speech understandi...	Mistral	$0.06000/hr	None	Punctuation Language Detection
Voxtral Mini Transcribe v2 Voxtral is Mistral’s audio model family designed for powerful speech understandi...	Mistral	$0.17999/hr	None	Punctuation Language Detection

Compare Speech-to-Text Models

This comparison of speech-to-text models helps you choose a model for your audio transcription needs. The table lists models from leading providers, with details on pricing, processing speed, and supported features.

Models differ in capabilities such as punctuation, speaker diarization, real-time streaming, speaker labels, word timestamps, automatic language detection, and confidence scores. Filter by language support to find models that match your linguistic requirements, whether you need English, Spanish, French, German, Japanese, or one of dozens of other languages.
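As a rough illustration of working with the feature badges, the Python sketch below parses one tab-separated table row into structured fields. The function names (`parse_features`, `parse_row`), the assumed column order (model, provider, price, subtitle formats, features), and the example row (with the description left out of the model column) are assumptions made for this example, not part of any provider's API.

```python
# Badge names as they appear in the Features column of the table.
KNOWN_FEATURES = [
    "Punctuation", "Diarization", "Streaming", "Speaker Labels",
    "Word Timestamps", "Language Detection", "Confidence",
]


def parse_features(badges: str) -> set[str]:
    """Split a space-separated badge string into feature names.

    Some badges contain a space ("Speaker Labels"), so a plain split()
    would break them apart. Instead, match known names from the left.
    """
    found = set()
    rest = badges.strip()
    while rest:
        for name in KNOWN_FEATURES:
            if rest.startswith(name):
                found.add(name)
                rest = rest[len(name):].lstrip()
                break
        else:
            # No known badge name matches the remaining text.
            raise ValueError(f"unrecognized badge text: {rest!r}")
    return found


def parse_row(line: str) -> dict:
    """Parse one tab-separated row (assumed to have exactly five columns)."""
    model, provider, price, formats, badges = line.split("\t")
    return {
        "model": model,
        "provider": provider,
        # "$0.24000/hr" -> 0.24
        "usd_per_hour": float(price.removeprefix("$").removesuffix("/hr")),
        "subtitle_formats": [] if formats == "None" else formats.split(),
        "features": parse_features(badges),
    }


# Example row taken from the table, with the description omitted.
ROW = ("Speechmatics Standard\tSpeechmatics\t$0.24000/hr\tSRT\t"
       "Punctuation Diarization Speaker Labels Word Timestamps "
       "Language Detection Confidence")

row = parse_row(ROW)
print(row["model"], row["usd_per_hour"], sorted(row["features"]))
```

Once rows are parsed this way, a feature filter is a set comparison, for example `{"Diarization", "Confidence"} <= row["features"]`.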

Click on any model to open its detail card, which lists specifications including maximum file size, accuracy ratings, supported audio formats, and sample rates. Whether you're building an application that requires real-time transcription, processing interviews with multiple speakers, or producing highly accurate transcripts with confidence scores, you can find a suitable speech recognition model for your project.

Compare hourly costs and processing speeds to balance your budget against performance requirements. The feature badges make it easy to identify models that support critical functionality such as speaker identification, word-level timestamps, or automatic punctuation.
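One way to make the cost comparison concrete is to filter models by required features and rank the matches by estimated cost. The Python sketch below does this for five rows copied from the table. The `Model` class and the `rank_by_cost` helper are hypothetical names introduced for illustration, and the estimate assumes cost scales linearly with audio hours at the listed rate, which may not hold for every provider's billing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_hour: float
    features: frozenset


# Prices and feature badges copied from five rows of the table.
MODELS = [
    Model("AssemblyAI Universal 2", 0.37, frozenset({
        "Punctuation", "Diarization", "Streaming", "Speaker Labels",
        "Word Timestamps", "Language Detection", "Confidence"})),
    Model("Azure AI Speech-to-Text", 0.18, frozenset({
        "Punctuation", "Diarization", "Speaker Labels",
        "Word Timestamps", "Language Detection", "Confidence"})),
    Model("Deepgram - Nova 2", 0.258, frozenset({
        "Punctuation", "Diarization", "Streaming", "Speaker Labels",
        "Word Timestamps", "Language Detection"})),
    Model("Groq - Whisper Turbo Large V3", 0.04, frozenset({
        "Punctuation", "Word Timestamps", "Language Detection"})),
    Model("Speechmatics Standard", 0.24, frozenset({
        "Punctuation", "Diarization", "Speaker Labels",
        "Word Timestamps", "Language Detection", "Confidence"})),
]


def rank_by_cost(models, required, hours):
    """Return (name, estimated USD) pairs for models that have every
    required feature, cheapest first."""
    required = frozenset(required)
    matches = [m for m in models if required <= m.features]
    matches.sort(key=lambda m: m.usd_per_hour)
    return [(m.name, round(m.usd_per_hour * hours, 2)) for m in matches]


# Estimated cost of 100 hours of audio with diarization and word timestamps.
for name, cost in rank_by_cost(MODELS, {"Diarization", "Word Timestamps"}, hours=100):
    print(f"{name}: ${cost:.2f}")
```

Adding `"Streaming"` to the required set narrows these five rows to Deepgram - Nova 2 and AssemblyAI Universal 2, which shows how a single feature requirement can change which model is cheapest.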