Speech-to-Text Models

Each entry lists: model name, description, version (where given), provider and price per hour, subtitle formats, features.
Amazon Transcribe
A fully managed automatic speech recognition (ASR) service that converts speech ...
Amazon $1.44000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Best
The Best tier model is optimized for accuracy, low latency, and ease of use. It ...
AssemblyAI $0.37000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Nano
The Nano tier model is a lightweight, lower cost model for a wide range of use c...
AssemblyAI $0.12000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Slam-1
Slam-1 is a Speech Language Model that combines LLM architecture with ASR encode...
v1
AssemblyAI $0.37000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
AssemblyAI Universal-2
Universal-2 is a state-of-the-art model built on Universal-1, offering enhanced ...
v2
AssemblyAI $0.37000/hr
SRT VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection Confidence
Azure AI Speech-to-Text
Azure's default, general-purpose speech-to-text model, trained on a vast amount ...
Azure $0.18000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Cloudflare - Whisper
A general-purpose speech recognition model based on OpenAI's Whisper, trained on...
Cloudflare $0.02700/hr
VTT
Punctuation Language Detection
Cloudflare - Whisper Large V3 Turbo
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech...
Cloudflare $0.03060/hr
VTT
Punctuation Word Timestamps Language Detection
Cloudflare - Whisper Tiny (EN)
This is the English-only version of the Whisper Tiny model which was trained on ...
Cloudflare $0.02700/hr
VTT
Punctuation Word Timestamps
Deepgram - Base
Standard base model for speech recognition
v2024-01-26.8851
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base ConversationalAI
Base model optimized for conversational AI applications
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base Finance
Base model optimized for finance terminology
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base General
Base model for general-purpose transcription
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base Meeting
Base model optimized for meetings and conferences
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base Phonecall
Base model optimized for phone conversations
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base Video
Base model optimized for video content
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Base Voicemail
Base model optimized for voicemail transcription
deepgram $0.75000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced
Improved accuracy model for speech recognition
deepgram $0.87000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Finance
Enhanced model optimized for finance terminology
deepgram $0.87000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced General
Enhanced model for general-purpose transcription
deepgram $0.87000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Meeting
Enhanced model optimized for meetings and conferences
deepgram $0.87000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Enhanced Phonecall
Enhanced model optimized for phone conversations
deepgram $0.87000/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova
Advanced, high-performance speech recognition model
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova 2
High-accuracy, next-generation speech recognition model
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 ATC
Nova 2 model optimized for air traffic control
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Automotive
Nova 2 model optimized for automotive industry
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 ConversationalAI
Nova 2 model optimized for conversational AI
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Drivethru
Nova 2 model optimized for drive-through scenarios
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Finance
Nova 2 model optimized for finance terminology
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 General
Nova 2 model for general-purpose transcription
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Medical
Nova 2 model optimized for medical terminology
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Meeting
Nova 2 model optimized for meetings and conferences
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Phonecall
Nova 2 model optimized for phone conversations
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Video
Nova 2 model optimized for video content
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova 2 Voicemail
Nova 2 model optimized for voicemail transcription
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram - Nova General
Nova model for general-purpose transcription
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Nova Phonecall
Nova model optimized for phone conversations
deepgram $0.25800/hr
VTT
Punctuation Diarization Streaming Word Timestamps Language Detection
Deepgram - Whisper
OpenAI Whisper model hosted by Deepgram
deepgram $0.28800/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram - Whisper Base
Base Whisper model hosted by Deepgram
deepgram $0.21000/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram - Whisper Large
Large Whisper model hosted by Deepgram
deepgram $0.28800/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram - Whisper Medium
Medium Whisper model hosted by Deepgram
deepgram $0.25200/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram - Whisper Small
Small Whisper model hosted by Deepgram
deepgram $0.22800/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram - Whisper Tiny
Tiny Whisper model hosted by Deepgram
deepgram $0.19800/hr
VTT
Punctuation Diarization Word Timestamps Language Detection
Deepgram Nova 3
Great accuracy in a broader range of real-world enterprise use cases and challen...
deepgram $0.31200/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 General
Nova 3 model for general-purpose transcription
deepgram $0.31200/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
Deepgram Nova 3 Medical
Nova 3 model optimized for medical terminology
deepgram $0.31200/hr
VTT
Punctuation Diarization Streaming Speaker Labels Word Timestamps Language Detection
FalAI - Whisper
Whisper model hosted on FalAI platform
v3
FalAI $0.06900/hr
None
Punctuation Speaker Labels Word Timestamps
FalAI - Wizper
Optimized version of Whisper for improved performance
v3
FalAI $0.03000/hr
None
Punctuation Speaker Labels
FireworksAI - Whisper Turbo V3
Accelerated Whisper V3 model by FireworksAI
FireworksAI $0.05400/hr
SRT VTT
Punctuation Diarization Language Detection
FireworksAI - Whisper V3
Whisper V3 model hosted by FireworksAI
FireworksAI $0.09000/hr
SRT VTT
Punctuation Diarization Language Detection
Gemini 2.0 Flash
Next generation features, speed, thinking, and realtime streaming.
Gemini $0.08532/hr
None
Punctuation Language Detection
Gemini 2.0 Flash-Lite
Cost efficiency and low latency
Gemini $0.01215/hr
None
Punctuation Language Detection
Gemini 2.5 Flash Preview 05-20
Adaptive thinking, cost efficiency
Gemini $0.12222/hr
None
Punctuation Language Detection
Gemini 2.5 Pro Preview
Enhanced thinking and reasoning, general understanding, advanced coding, and mor...
Gemini $0.26100/hr
None
Punctuation Language Detection
Gladia Solaria
Gladia's cutting-edge, next-generation ASR model, launched in April 2025. Design...
v1
Gladia $0.61200/hr
VTT
Punctuation Diarization Streaming Speaker Labels Language Detection
Google Cloud - Enhanced
Enhanced speech recognition model by Google
google $0.96000/hr
SRT VTT
Punctuation Diarization Word Timestamps Language Detection
Google Cloud - Standard
Standard speech recognition model by Google
google $0.96000/hr
SRT VTT
Punctuation Diarization Word Timestamps Language Detection
Groq - Whisper Large V3
A multilingual ASR model offering high accuracy and speed for transcription and ...
v3
groq $0.11100/hr
None
Punctuation Word Timestamps Language Detection
Groq - Whisper Turbo Large V3
A pruned and fine-tuned version of Whisper Large v3, designed for faster and les...
v3 Turbo
groq $0.04000/hr
None
Punctuation Word Timestamps Language Detection
IBM Watson Speech to Text
A cloud-based speech recognition service from IBM Watson that converts audio int...
IBM $1.20000/hr
None
Punctuation Diarization Speaker Labels Word Timestamps Confidence
OpenAI - GPT-4o mini Transcribe
Speech-to-text model powered by GPT-4o mini. Offers improvements in word error r...
OpenAI $0.18000/hr
SRT VTT
Punctuation Streaming Language Detection
OpenAI - GPT-4o Transcribe
Speech-to-text model powered by GPT-4o. Offers improvements in word error rate, ...
OpenAI $0.36000/hr
SRT VTT
Punctuation Streaming Language Detection
OpenAI - Whisper
General-purpose speech recognition model. Based on the open-source Whisper large...
large-v2
OpenAI $0.36000/hr
SRT VTT
Punctuation Streaming Word Timestamps Language Detection
Rev AI Enhanced
Rev AI's high-accuracy general-purpose speech-to-text model, trained on a divers...
v2.0
RevAI $0.30000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Rev AI Reverb ASR
Rev AI's open-source derived English Automatic Speech Recognition (ASR) model. T...
v1.0
RevAI $0.30000/hr
SRT VTT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Speechmatics Enhanced
Speechmatics' Enhanced ASR model offers very good accuracy, though processing is...
speechmatics $0.40000/hr
SRT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence
Speechmatics Standard
Speechmatics' Standard ASR model offers faster results with good accuracy.
speechmatics $0.24000/hr
SRT
Punctuation Diarization Speaker Labels Word Timestamps Language Detection Confidence

Compare Speech-to-Text Models

This comparison of speech-to-text models helps you choose a model for your audio transcription needs. The table lists models from leading providers, with information on pricing, processing speed, and supported features.

Models differ in which capabilities they support, such as punctuation, speaker diarization, real-time streaming, speaker labels, word timestamps, and automatic language detection. Filter by language support to find models that match your linguistic requirements, whether you need English, Spanish, French, German, Japanese, or dozens of other languages.
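As a rough illustration of feature-based filtering, the sketch below copies four rows from the table above into a small Python structure and keeps only the models that carry every required feature badge. The `Model` class, the `MODELS` list, and the `with_features` helper are hypothetical names introduced for this example; they are not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """One table row: name, price per hour of audio, and feature badges."""
    name: str
    usd_per_hour: float
    features: frozenset


# A few rows copied from the table above (illustrative subset only).
MODELS = [
    Model("AssemblyAI Universal-2", 0.37, frozenset({
        "Punctuation", "Diarization", "Streaming", "Speaker Labels",
        "Word Timestamps", "Language Detection", "Confidence"})),
    Model("Cloudflare - Whisper", 0.027, frozenset({
        "Punctuation", "Language Detection"})),
    Model("Deepgram - Nova 2", 0.258, frozenset({
        "Punctuation", "Diarization", "Streaming", "Speaker Labels",
        "Word Timestamps", "Language Detection"})),
    Model("OpenAI - Whisper", 0.36, frozenset({
        "Punctuation", "Streaming", "Word Timestamps",
        "Language Detection"})),
]


def with_features(models, required):
    """Return the models whose badges include every required feature."""
    required = frozenset(required)
    return [m for m in models if required <= m.features]


if __name__ == "__main__":
    for m in with_features(MODELS, {"Streaming", "Diarization"}):
        print(f"{m.name}: ${m.usd_per_hour:.3f}/hr")
```

Using sets for the badges keeps the check to a single subset comparison, so adding another required feature does not change the filtering code.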

Click on any model to open its detailed card with full specifications, including maximum file size, accuracy ratings, supported audio formats, and sample rates. Whether you are building an application that requires real-time transcription, processing interviews with multiple speakers, or producing highly accurate transcripts with confidence scores, you can find a suitable speech recognition model for your project.

Compare hourly costs and processing speeds to balance your budget against performance requirements. Our detailed feature badges make it easy to identify models supporting critical functionality like speaker identification, word-level timestamps, or automatic punctuation insertion.
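To make the budget comparison concrete, here is a minimal sketch that multiplies the per-hour prices listed in the table by a planned volume of audio and ranks the results from cheapest to most expensive. The `PRICES_USD_PER_HOUR` dictionary, the `estimate_costs` function, and the 250-hour workload are assumptions made for illustration. The sketch assumes billing is strictly proportional to audio duration, which may not match a provider's actual billing rules (minimum charges, rounding, or volume discounts).

```python
# Per-hour prices copied from the table above (illustrative subset only).
PRICES_USD_PER_HOUR = {
    "Amazon Transcribe": 1.44,
    "AssemblyAI Nano": 0.12,
    "Deepgram - Nova 2": 0.258,
    "Groq - Whisper Turbo Large V3": 0.04,
    "Gemini 2.0 Flash-Lite": 0.01215,
}


def estimate_costs(prices, audio_hours):
    """Return (model, estimated cost in USD) pairs, cheapest first.

    Assumes cost scales linearly with hours of audio.
    """
    costs = {name: rate * audio_hours for name, rate in prices.items()}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical workload: 250 hours of recorded audio.
    for name, cost in estimate_costs(PRICES_USD_PER_HOUR, 250):
        print(f"{name:32s} ${cost:8.2f}")
```

Price is only one axis. A cheaper model may lack features such as diarization or streaming, so the ranking is best read alongside the feature badges.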