Reviewing stated performance of speech-to-text providers

May 8, 2025 by Langston Liam Morrow, Deep Researcher

Introduction

The speed at which a speech-to-text (STT) model can transcribe audio is a critical factor for a wide range of applications. From real-time closed captioning and voice assistants to call center analytics and rapid content creation, the responsiveness and efficiency of the transcription process directly impact user experience and operational workflows. Two primary metrics are commonly used to evaluate the speed of STT models: Real-Time Factor (RTF) and Words Per Minute (WPM). Real-Time Factor is the ratio of the processing time required by the model to the duration of the input audio. An RTF of less than 1 indicates that the model can transcribe faster than the actual speech duration, a crucial attribute for applications demanding low latency. Words Per Minute, on the other hand, measures the throughput of the model by quantifying the number of words it can transcribe within a minute, which is particularly relevant for assessing the efficiency of batch processing and handling large volumes of audio data. This report analyzes and compares the speed performance of various STT models based on the provided research material, with a focus on RTF and closely related measures such as processing speed and latency.
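As a concrete illustration of the two metrics, here is a minimal Python sketch implementing the definitions above (the helper names are illustrative, not part of any provider's SDK):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: processing time divided by audio duration. Below 1 means
    the model transcribes faster than the audio plays."""
    return processing_seconds / audio_seconds


def words_per_minute(word_count: int, processing_seconds: float) -> float:
    """Throughput: words transcribed per minute of processing time."""
    return word_count * 60.0 / processing_seconds


# An hour of audio transcribed in 30 seconds of processing:
rtf = real_time_factor(30.0, 3600.0)   # ≈ 0.0083 → faster than real time
wpm = words_per_minute(9000, 30.0)     # 18000.0 words per minute of compute
```

Note that WPM as a throughput measure depends on both the model's speed and the word density of the audio, which is why the two metrics are not interchangeable.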

Amazon Transcribe Speed Research

Amazon Transcribe offers real-time transcription capabilities for streaming audio content. [1] This functionality is designed to process both pre-recorded media, such as movies and podcasts, and live media, like news broadcasts, by transcribing sequential data packets almost instantaneously. [1] A key aspect influencing the speed of this real-time transcription is the latency, which is directly dependent on the size of the audio chunks being processed. [1] For optimal efficiency and reduced latency, Amazon Transcribe recommends setting the audio chunk size between 50 ms and 200 ms when using PCM-encoded audio. [1] This focus on chunk size optimization indicates a strategy to minimize the delay inherent in streaming transcription, allowing for near real-time output. While Amazon Transcribe emphasizes its real-time capabilities and provides recommendations for managing latency, the provided material does not explicitly quantify the Real-Time Factor or the Words Per Minute achieved by its models. [2] One document does mention a processing speed of 30,000 to 60,000 tokens per second on specific hardware [3], but this metric is not directly convertible to standard measures like WPM or RTF in the context of typical audio transcription tasks. The availability of both streaming and batch transcription options [4] suggests that Amazon Transcribe can cater to different speed requirements, but specific performance metrics for these modes are not detailed in the provided information. Therefore, while real-time processing with controllable latency is a feature of Amazon Transcribe, the research material does not offer precise RTF or WPM figures for its standard offerings.
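To see what the 50-200 ms recommendation means in bytes per chunk, here is a back-of-envelope sketch assuming 16 kHz, 16-bit, mono PCM (a common streaming configuration; the source specifies only the duration range, so the sample rate and bit depth here are assumptions):

```python
def pcm_chunk_bytes(chunk_ms: float, sample_rate_hz: int = 16_000,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM audio chunk of the given duration."""
    samples = int(sample_rate_hz * chunk_ms / 1000)
    return samples * bytes_per_sample * channels


# The recommended 50-200 ms range for 16 kHz, 16-bit mono PCM:
low = pcm_chunk_bytes(50)     # 1600 bytes per chunk
high = pcm_chunk_bytes(200)   # 6400 bytes per chunk
```

Smaller chunks mean more frequent round-trips and lower latency per packet; larger chunks reduce request overhead at the cost of added delay.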

AssemblyAI Speed Research

AssemblyAI highlights its capabilities in both real-time and asynchronous speech-to-text processing. [7] For real-time transcription, AssemblyAI claims a latency of less than 600 milliseconds. [7] This direct measurement of delay provides a clear indication of the responsiveness of their real-time service, suggesting suitability for interactive applications where immediate feedback is important. In addition to real-time processing, AssemblyAI offers asynchronous transcription services where most audio files, regardless of duration, are processed in well under 45 seconds. [7] This rapid turnaround time for pre-recorded audio points to a highly efficient processing pipeline. Further analysis reveals that AssemblyAI has achieved a Real-Time Factor (RTF) as low as 0.008x for its asynchronous transcription. [8] This exceptionally low RTF signifies that their models can transcribe audio at a speed significantly faster than real-time, as evidenced by examples of hour-long meetings being transcribed in under 45 seconds. [8] These impressive speeds are attributed to continuous improvements in their API inference latency, driven by factors such as optimized model compilation, intelligent mini batching, hardware parallelization, and an optimized serving infrastructure. [9] While AssemblyAI’s real-time latency of under 1 second for its Universal-2 model [10] is within an acceptable range for many applications, the industry benchmark suggests that aiming for latencies under 100 ms is ideal for optimal real-time STT performance. [10] The provided material does not explicitly mention Words Per Minute (WPM) for any of the AssemblyAI models [11], indicating that RTF and latency are the primary speed metrics they emphasize.
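A low RTF translates directly into turnaround time, which a minimal sketch makes explicit using the reported 0.008x asynchronous figure:

```python
def turnaround_seconds(audio_seconds: float, rtf: float) -> float:
    """Expected wall-clock processing time for a file at a given RTF."""
    return audio_seconds * rtf


# An hour-long recording at the reported asynchronous RTF of 0.008x:
t = turnaround_seconds(3600.0, 0.008)   # 28.8 s, consistent with "under 45 seconds"
```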

Azure AI Speech-to-Text Speed Research

Azure AI Speech-to-Text defines Real-Time Factor (RTF) as the ratio of a device’s processing time to the length of the audio input. [19] For applications requiring real-time and interactive speech experiences, Azure AI recommends an RTF of 1 or lower. [19] This benchmark suggests a focus on ensuring that the transcription process keeps pace with or exceeds the speed of human speech. While Azure AI provides a general estimate of average human speech at 150 words per minute [20], this does not directly reflect the transcription speed of their models. A user report indicates an instance where Azure Speech-to-Text using continuousRecognitionAsync took close to 6 seconds to transcribe just 3 seconds of audio, resulting in an RTF of 2, which is slower than real-time. [21] This observation suggests that real-world performance can vary, and Microsoft recommends several optimizations to reduce latency, including deploying the speech resource closer to users and utilizing the Fast transcription API for quicker results with large files. [21] Azure AI also imposes quotas and limits on its services, such as concurrent request limits for both real-time and fast transcription, and maximum audio lengths for processing. [22] These limits can indirectly affect the overall speed and throughput achievable with the platform, especially under high load. While the Fast transcription API is mentioned as an option for faster processing of large files [21], the provided material does not offer specific RTF or WPM metrics for this or the standard real-time models. [23] Therefore, although Azure AI offers both real-time and fast transcription services, the research material lacks consistent, quantifiable speed metrics like RTF and WPM for a comprehensive performance analysis.

Cloudflare Whisper Speed Research

Cloudflare leverages the Whisper-large-v3-turbo model through its Workers AI platform to provide automatic speech recognition. [27] A key focus of this implementation is to create a scalable and low-latency transcription pipeline using the serverless environment of Cloudflare Workers. [27] The architecture is designed to handle large audio files by splitting them into smaller chunks for processing, a technique that can help improve perceived speed and overcome potential memory or execution time limitations. [27] While Cloudflare emphasizes the low-latency nature of its transcription service, it does impose a rate limit of 720 requests per minute for automatic speech recognition tasks. [28] This limit on the number of requests can influence the overall throughput and might be a consideration for applications with very high transcription demands. The provided research material does not offer specific, quantifiable metrics for Real-Time Factor or Words Per Minute achieved by Cloudflare Whisper. [29] The focus appears to be on the architectural aspects that support low latency and scalability within their serverless framework.
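The interaction between the rate limit and chunked processing can be estimated with a back-of-envelope calculation; the 30-second chunk length below is a hypothetical choice for illustration, not a documented Cloudflare value:

```python
def max_audio_hours_per_minute(requests_per_minute: int, chunk_seconds: float) -> float:
    """Upper bound on audio throughput implied by a request rate limit,
    assuming exactly one audio chunk per request."""
    return requests_per_minute * chunk_seconds / 3600.0


# Cloudflare's 720 requests/minute limit with a hypothetical 30-second chunk length:
hours = max_audio_hours_per_minute(720, 30.0)   # 6.0 hours of audio per minute
```

Under these assumptions the rate limit caps throughput at roughly six hours of audio per minute, so chunk size directly determines how much audio a single account can push through the pipeline.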

Deepgram Speed Research

Deepgram consistently presents itself as a leader in speech-to-text speed, claiming the “fastest real-time transcription in the biz with latency times of under 300 milliseconds”. [32] This sub-300ms latency is often highlighted with a direct comparison to competitors like Amazon Transcribe, which is stated to have a latency of 2-3 seconds. [32] Furthermore, Deepgram asserts that its technology is 10 times faster than Amazon. [33] For asynchronous batch processing, Deepgram reports speeds of up to 120 times normal audio speed, which translates to a very low Real-Time Factor of approximately 0.0083x, meaning an hour of audio can be transcribed in under 30 seconds. [35] Their Nova series of models, including Nova-2, further exemplifies this speed. Benchmarks show Nova-2 achieving a median inference time of just 29.8 seconds per hour of diarized audio, an RTF of roughly 0.0083x. [36] In specific use cases, such as medical transcription, Nova-2 is reported to handle 120-180 words per minute. [36] Artificial Analysis benchmarks also indicate a high “Speed Factor” (the inverse of RTF: higher means faster) for Deepgram’s Base model. [37] Additionally, Deepgram’s Nova-2 model is reported to offer real-time transcription with latency less than 300 milliseconds and the capability to transcribe an hour of audio in just 12 seconds, resulting in an RTF of approximately 0.0033x. [34] Across various models and use cases, Deepgram consistently emphasizes and demonstrates its high speed for both real-time and asynchronous speech-to-text processing. [38]
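These figures can be cross-checked by converting between “times faster than real time” speed factors and RTF, which are reciprocals of one another:

```python
def rtf_from_speed_factor(speed_factor: float) -> float:
    """RTF is the reciprocal of an 'N times faster than real time' speed factor."""
    return 1.0 / speed_factor


# Cross-checking the reported Deepgram figures:
batch = rtf_from_speed_factor(120)   # ≈ 0.0083, matches the 120x batch claim
nova2 = 29.8 / 3600.0                # ≈ 0.0083, 29.8 s per hour of audio
fast = 12.0 / 3600.0                 # ≈ 0.0033, an hour transcribed in 12 s
```

The three independently reported numbers are internally consistent, which lends some credibility to the vendor's claims.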

FalAI Speed Research

FalAI’s speed performance is primarily evident through its ElevenLabs Speech to Text offering and its focus on optimized Whisper implementations. ElevenLabs Speech to Text reports a latency (Time to First Audio - TTFA) of 150ms for its lower quality Flash Model and 300ms or more for its full model. [62] These figures indicate competitive low latency, particularly the Flash Model, which could be advantageous for real-time interactive applications. In the realm of Whisper model optimization, FalAI highlights Faster-whisper as a high-speed alternative to the original OpenAI Whisper, claiming it to be up to 4 times faster while using less memory. [64] The Real Time Factor (RTF) of the OpenAI Whisper API is reported as 0.14. [64] Evaluations of Faster Whisper on Dutch test data show it achieving RTFs lower (i.e., better) than 0.14, with performance varying based on the chosen model size and beam size, illustrating a trade-off between speed and accuracy. [64] Furthermore, Artificial Analysis benchmarks for speech-to-text models include Fal.ai’s Wizper (L, v3), which demonstrates a high “Speed Factor” (the inverse of RTF), indicating fast transcription. [65] This performance aligns with the general theme of FalAI providing faster Whisper-based solutions. [66] While latency is a key aspect of ElevenLabs’ speed, and RTF is emphasized for their Whisper optimizations, the provided material does not explicitly mention Words Per Minute (WPM) for FalAI’s speech-to-text offerings.

FireworksAI Whisper Speed Research

FireworksAI emphasizes the speed of its Whisper model implementations for both streaming and asynchronous transcription. [70] For streaming audio (16kHz mono PCM), they claim an end-to-end latency of 300ms. [70] This low latency suggests suitability for real-time applications. For asynchronous transcription, FireworksAI reports that their service can transcribe one hour of audio in just 4 seconds, resulting in an exceptionally low Real-Time Factor of approximately 0.0011x. [70] This indicates very high processing speed for pre-recorded audio. FireworksAI leverages a finetuned and pruned version of Whisper large-v3 called Whisper large-v3-turbo, which has significantly reduced decoding layers, making it “way faster” than the original model. [71] Relative speed comparisons show that smaller Whisper models generally offer faster processing times, with Whisper Turbo V3 being reported as about 8 times faster than the Large v3 model. [72] However, the processing speed can vary depending on the specific Whisper model size chosen and the underlying hardware used (CPU vs. GPU). [73] While FireworksAI consistently highlights the speed of its Whisper offerings, particularly the Turbo version, the provided material does not prominently feature Words Per Minute (WPM) as a key performance metric. [74]

Gladia Solaria Speed Research

Gladia introduces its Solaria model as a next-generation speech-to-text engine that delivers industry-leading accuracy and speed, along with native-level transcription in over 100 languages. [76] A key highlight of Solaria is its ultra-low latency, with a reported interruption latency of 270 ms. [76] This specific and competitive latency figure positions Solaria as a strong performer for real-time communication platforms and call centers where immediate responsiveness is crucial. Gladia also offers a Solaria-1 Mini version, specifically designed to be an ultra-fast variant tailored for low-latency environments such as AI-powered voice agents and automated call handling. [76] The consistent emphasis on low latency, often cited as under 300ms for their real-time streaming engine [78], underscores Gladia’s focus on speed for real-time transcription. While accuracy is also a key selling point for Solaria, the provided material does not explicitly mention Words Per Minute (WPM) as a primary speed metric.

Google Speech-to-Text Speed Research

The provided research material does not contain specific, quantifiable metrics for Real-Time Factor or Words Per Minute for Google’s Enhanced or Standard speech-to-text models. [81] The snippets are either irrelevant to STT performance or point to inaccessible websites. Therefore, based on the information available, a conclusive analysis of Google Speech-to-Text’s speed performance in terms of RTF or WPM cannot be made.

Groq Whisper Speed Research

Groq’s Automatic Speech Recognition (ASR) API, powered by their specialized LPU (Language Processing Unit) AI inference technology, is designed to provide ultra-low latency audio transcription and translation. [85] A significant aspect of Groq’s Whisper implementation is its exceptionally high speed factor (the inverse of RTF as defined earlier). For Whisper Large v3, Groq reports a speed factor of up to 300x, indicating that it can transcribe audio 300 times faster than the original duration, an RTF of roughly 0.0033. [85] Similarly, Whisper Large v3 Turbo offers up to 247x, and Distil-Whisper achieves up to 262x. [85] A user experience report corroborates these claims, noting that Groq’s Whisper Large V3 operates at approximately 172 times real-time speed. [86] These high speed factors are further supported by independent benchmarks from Artificial Analysis [37], which show median speed factors consistent with Groq’s claims. This focus on extremely fast asynchronous transcription makes Groq’s Whisper implementation particularly well-suited for applications where rapid processing of pre-recorded audio is paramount. The provided material does not prominently feature Words Per Minute (WPM) as a key metric for Groq’s Whisper offerings. [87]

OpenAI Speed Research

OpenAI’s speech-to-text offerings include several models with varying speed characteristics. GPT-4o mini Transcribe is described as having “Fast” speed. [91] While this is a qualitative description, the provision of rate limits in Requests Per Minute (RPM) and Tokens Per Minute (TPM) for different usage tiers offers some context regarding its throughput capabilities. [91] In contrast, the standard GPT-4o Transcribe is listed as having “Medium” speed, suggesting a potential trade-off with other factors like accuracy or features. [92] OpenAI’s foundational Whisper model has also been adapted for real-time applications. Salesforce reports using OpenAI Whisper for real-time transcriptions with a target latency of under 1 second, aiming for 500 milliseconds, demonstrating its capability for low-latency use cases. [93] Comparative analysis suggests that Deepgram might have a slightly higher Words Per Minute (WPM) rate (200-250) compared to OpenAI Whisper (150-200). [94] However, the Real-Time Factor (RTF) of the OpenAI Whisper API is reported as 0.14, indicating that it still offers faster-than-real-time processing for asynchronous tasks. [64] Furthermore, a newer version, Whisper turbo, is reported to achieve a latency of approximately 0.3 seconds for near real-time transcription. [67] OpenAI also highlights GPT-4.1 nano as its fastest and most cost-effective model for low-latency tasks. [95] Overall, OpenAI provides a range of speech-to-text models with varying speed profiles, catering to different needs from rapid asynchronous processing to low-latency real-time transcription. [40]

Rev AI Speed Research

Rev AI offers both asynchronous and streaming speech-to-text APIs, indicating a capability for different speed requirements. [97] For asynchronous processing, they provide a “Reverb Turbo Transcription” service, suggesting faster turnaround times compared to their standard offering. [97] Additionally, their support for streaming speech-to-text implies real-time capabilities, which typically involve lower latency. [97] However, the provided research material lacks specific, quantifiable metrics for Real-Time Factor or precise latency figures for Rev AI’s streaming service. [98] Information regarding the Asynchronous API suggests that it is not real-time and involves a waiting period, implying longer latency compared to streaming options. [99] While the existence of a “Turbo” option and a streaming API indicates a focus on speed, the absence of detailed performance data in the accessible snippets limits a definitive analysis of Rev AI’s speed in terms of RTF and WPM based solely on this research material.

Speechmatics Speed Research

Speechmatics offers real-time transcription capabilities, with its Ursa 2 model reported to have a real-time latency of under 1 second. [10] This latency places Speechmatics within the acceptable range for many applications requiring immediate transcription. Beyond this, the provided research material offers limited quantifiable data on the Real-Time Factor or Words Per Minute achieved by Speechmatics’ models. [101] Some snippets mention “Real-Time” as a feature, but specific performance details are not included. One snippet mentions the tagging speed of a part-of-speech tagger, TnT, which is not directly indicative of Speechmatics’ STT performance. [101] Therefore, while Speechmatics offers real-time transcription with a reported latency, the research material does not provide sufficient data for a comprehensive speed analysis in terms of RTF and WPM for their speech-to-text services.

Comparative Analysis Table

The following table summarizes the speed metrics found for the analyzed speech-to-text models:

| Model Name | Provider | RTF / speed factor (where available) | Latency (where available) | WPM (where available) | Notes |
| --- | --- | --- | --- | --- | --- |
| Amazon Transcribe | Amazon | Not available | Chunk size dependent | Not available | Real-time streaming |
| Universal-2 | AssemblyAI | Not available | <1 s | Not available | Real-time |
| Async | AssemblyAI | ≤ 0.008x | N/A | Not available | Asynchronous |
| Azure AI STT | Azure AI | ≤ 1 (recommended) | Varies | 150 (human avg.) | Real-time; Fast transcription API available |
| Whisper-large-v3-turbo | Cloudflare | Not available | Low latency | Not available | Real-time on Workers AI |
| Base | Deepgram | High speed factor | <300 ms | Not available | Real-time |
| Nova-2 | Deepgram | ≈ 0.0083x | <300 ms | 120-180 (medical) | Asynchronous; real-time |
| Nova-3 | Deepgram | Not available | Not available | Not available | |
| Whisper API | OpenAI | 0.14 | ~0.3 s (turbo) | 150-200 | Asynchronous; real-time (turbo) |
| GPT-4o mini Transcribe | OpenAI | Not available | “Fast” (qualitative) | Not available | |
| GPT-4o Transcribe | OpenAI | Not available | “Medium” (qualitative) | Not available | |
| Scribe (Flash Model) | ElevenLabs | Not available | 150 ms | Not available | |
| Scribe (Full Model) | ElevenLabs | Not available | 300 ms+ | Not available | |
| Faster-whisper | FalAI | < 0.14 | Not available | Not available | Asynchronous; varies by model size & beam size |
| Wizper (L, v3) | FalAI | High speed factor | Not available | Not available | |
| Whisper Large v3 | Groq | Up to 300x speed factor (RTF ≈ 0.0033) | Ultra-low latency | Not available | Asynchronous |
| Whisper Large v3 Turbo | Groq | Up to 247x speed factor | Ultra-low latency | Not available | Asynchronous |
| Distil-Whisper | Groq | Up to 262x speed factor | Ultra-low latency | Not available | Asynchronous |
| Ursa 2 | Speechmatics | Not available | <1 s | Not available | Real-time |
| Reverb Turbo | Rev AI | Faster than standard | Not available | Not available | Asynchronous |
| Streaming | Rev AI | Not available | Low latency | Not available | Real-time |
| Whisper Large v3 | FireworksAI | ≈ 0.0011x | 300 ms | Not available | Asynchronous; real-time |
| Whisper large-v3-turbo | FireworksAI | ≈8x faster than Large v3 | Not available | Not available | |
| Solaria | Gladia | Not available | 270 ms | Not available | Real-time |
| Solaria-1 Mini | Gladia | Not available | Ultra-fast | Not available | Real-time |

Conclusion

The analysis of the provided research material reveals a diverse landscape of speech-to-text model speed performance. Several providers, including Deepgram, AssemblyAI, Groq, and FireworksAI, demonstrate a strong emphasis on speed, achieving very low latency for real-time applications and exceptionally high Real-Time Factors for asynchronous transcription. Deepgram stands out with its claims of industry-leading low latency and very fast batch processing speeds, supported by benchmark data for its Nova series. AssemblyAI also offers impressive asynchronous speeds with a very low RTF. Groq’s Whisper implementations showcase remarkable RTFs, making them highly efficient for transcribing large volumes of pre-recorded audio. FireworksAI provides both low-latency streaming and extremely fast asynchronous processing for Whisper models. Gladia’s Solaria model boasts a very competitive low latency, positioning it well for real-time use cases.

In contrast, Amazon Transcribe and Azure AI Speech-to-Text focus on real-time capabilities and provide guidance on managing latency, but specific RTF and WPM metrics are not as readily available in the provided material. Cloudflare Whisper emphasizes low latency and scalability within its serverless architecture, while FalAI’s strength lies in optimized Whisper implementations like Faster-whisper, offering significant speed improvements over the original OpenAI model. OpenAI provides a range of models with varying speed profiles, with newer versions like Whisper turbo and GPT-4o mini aiming for faster performance. Rev AI indicates options for faster asynchronous processing and real-time capabilities but lacks detailed speed metrics in the provided material. Speechmatics reports a real-time latency of under 1 second for its Ursa 2 model, but comprehensive speed data is limited. Google Speech-to-Text’s speed performance could not be adequately assessed based on the provided research.

It is important to note that the level of detail in speed metrics varies significantly across different vendors’ documentation and marketing materials. While some providers offer specific RTF and latency figures, others focus on qualitative descriptions or architectural advantages. When selecting an STT solution, speed is a critical factor that should be considered alongside other important aspects such as accuracy, cost, language support, and specific features relevant to the intended application. The optimal choice will depend on the unique requirements of the use case, whether it prioritizes immediate responsiveness, high-throughput batch processing, or a balance of both.

Works cited

  1. Transcribing streaming audio - AWS Documentation, accessed on May 8, 2025, https://docs.aws.amazon.com/transcribe/latest/dg/streaming.html
  2. Post-call analytics - Amazon Transcribe - AWS Documentation, accessed on May 8, 2025, https://docs.aws.amazon.com/transcribe/latest/dg/call-analytics-batch.html
  3. Amazon Transcribe FAQs – Amazon Web Services (AWS), accessed on May 8, 2025, https://aws.amazon.com/transcribe/faqs/
  4. Amazon Transcribe Pricing – Amazon Web Services (AWS), accessed on May 8, 2025, https://aws.amazon.com/transcribe/pricing/
  5. https://docs.aws.amazon.com/transcribe/latest/dg/limits.html (inaccessible at time of research)
  6. https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-now-supports-real-time-transcription/ (inaccessible at time of research)
  7. AWS Marketplace: AssemblyAI - Amazon.com, accessed on May 8, 2025, https://aws.amazon.com/marketplace/pp/prodview-j45hta6jdej7c
  8. Newsletter - News, Tutorials, AI Research - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/tag/newsletter/
  9. Lower latency, lower cost, more possibilities - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/lower-latency-new-pricing
  10. How to Choose STT & TTS for AI Voice Agents in 2025: A Comprehensive Guide - Softcery, accessed on May 8, 2025, https://softcery.com/lab/how-to-choose-stt-tts-for-ai-voice-agents-in-2025-a-comprehensive-guide/
  11. Benchmarks - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/benchmarks
  12. Raising the Bar for Speech AI: Introducing Slam-1 & a New Streaming Model - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/speech-language-model-and-improved-streaming-model
  13. Introducing Slam-1 | AssemblyAI | Documentation, accessed on May 8, 2025, https://www.assemblyai.com/docs/getting-started/slam-1
  14. Universal-2-TF: Robust All-Neural Text Formatting for ASR - arXiv, accessed on May 8, 2025, https://arxiv.org/html/2501.05948v1
  15. https://www.assemblyai.com/docs/models#acoustic-models (inaccessible at time of research)
  16. Pricing | Production-ready AI Models - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/pricing
  17. Speech & Text | Blog from AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/
  18. https://www.assemblyai.com/blog/assemblyai-launches-faster-and-cheaper-real-time-transcription/ (inaccessible at time of research)
  19. Performance evaluations for Embedded Speech - Speech service - Azure AI services, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/ai-services/speech-service/embedded-speech-performance-evaluations
  20. Need to limit the audio time using Azure Text to Speech - Learn Microsoft, accessed on May 8, 2025, https://learn.microsoft.com/en-us/answers/questions/530697/need-to-limit-the-audio-time-using-azure-text-to-s
  21. Azure speech to text appears very slow - Microsoft Q&A, accessed on May 8, 2025, https://learn.microsoft.com/en-us/answers/questions/2152684/azure-speech-to-text-appears-very-slow
  22. Speech service quotas and limits - Azure AI services - Learn Microsoft, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits
  23. https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-to-text/ (inaccessible at time of research)
  24. What is the Speech service? - Azure AI services | Microsoft Learn, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/overview
  25. https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/concepts-real-time-speech-to-text (inaccessible at time of research)
  26. https://techcommunity.microsoft.com/t5/azure-ai/azure-speech-to-text-updates-faster-and-more-accurate-models/ba-p/3788857 (inaccessible at time of research)
  27. Whisper-large-v3-turbo with Cloudflare Workers AI, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/guides/tutorials/build-a-workers-ai-whisper-with-chunking/
  28. Limits · Cloudflare Workers AI docs, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/platform/limits/
  29. Models · Cloudflare Workers AI docs, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/models/speech-to-text/
  30. https://blog.cloudflare.com/workers-ai-speech-to-text-whisper/ (inaccessible at time of research)
  31. Cloudflare Community, accessed on May 8, 2025, https://community.cloudflare.com/
  32. All About Transcription for Real-Time (Live) Audio Streaming - Deepgram Blog ⚡️, accessed on May 8, 2025, https://deepgram.com/learn/all-about-transcription-for-real-time-audio-streaming
  33. Speech to Text API: Next-Gen AI Speech Recognition - Deepgram, accessed on May 8, 2025, https://deepgram.com/product/speech-to-text
  34. Voice AI Evolution With Twilio and Deepgram - Efficiently Connected, accessed on May 8, 2025, https://www.efficientlyconnected.com/voice-ai-evolution-with-twilio-and-deepgram/
  35. How Deepgram Works, accessed on May 8, 2025, https://6890003.fs1.hubspotusercontent-na1.net/hubfs/6890003/Collateral/Whitepaper%20How%20Deepgram%20Works-Updated.pdf?hsCtaTracking=e3c2fd70-fe62-48d5-815b-4546c5305a61%7C42cdfce0-5ebf-4b4c-92b7-9172beb24dbb
  36. Deepgram Nova-2 - One API 200+ AI Models, accessed on May 8, 2025, https://aimlapi.com/models/deepgram-nova-2
  37. Whisper (L v2), Deepgram - Artificial Analysis, accessed on May 8, 2025, https://artificialanalysis.ai/speech-to-text/models/deepgram
  38. Text to Speech Latency — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/text-to-speech-latency
  39. Model Options — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/model
  40. Seeking Speech-to-Text API Recommendations: Word-Level Probabilities, High Accuracy, Low Latency : r/AskProgramming - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/AskProgramming/comments/1cxo5ch/seeking_speechtotext_api_recommendations/
  41. What is Deepgram and How Does this Voice AI Work in 2025? - Bigly Sales, accessed on May 8, 2025, https://biglysales.com/what-is-deepgram-and-how-does-this-voice-ai-work/
  42. Using Interim Results — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/using-interim-results
  43. Deepgram’s product-market fit journey - Unusual Ventures, accessed on May 8, 2025, https://www.unusual.vc/post/deepgrams-product-market-fit-journey
  44. Why Deepgram’s Speech-to-Text API is #1 for Developers on G2, accessed on May 8, 2025, https://deepgram.com/learn/deepgrams-speech-to-text-api-number-1-for-developers-g2
  45. Everything you need to know about Voice AI Agents - Deepgram, accessed on May 8, 2025, https://deepgram.com/learn/everything-about-voice-ai-agents
  46. Introducing Deepgram Nova & Deepgram Whisper Cloud and On-Prem, accessed on May 8, 2025, https://deepgram.com/changelog/introducing-deepgram-nova-deepgram-whisper-cloud-and-on-prem
  47. Deepgram vs Whisper Large: A Comprehensive Comparison - BytePlus, accessed on May 8, 2025, https://www.byteplus.com/en/topic/409744
  48. Narakeet Vs Deepgram: Compare Samples, Price & Features - PlayHT, accessed on May 8, 2025, https://play.ht/blog/ai-apps/vs/narakeet-vs-deepgram/
  49. Deepgram’s Advanced Speech Recognition Technology 2024 - CDN Blog, accessed on May 8, 2025, https://blog.blazingcdn.com/en-us/deepgrams-advanced-speech-recognition-technology
  50. A Detailed Breakdown Of Deepgram Pricing & Features (2024) - Unreal Speech, accessed on May 8, 2025, https://blog.unrealspeech.com/deepgram-pricing/
  51. Real-time Speech To Text - Always Send Finalize Response · deepgram · Discussion #1035, accessed on May 8, 2025, https://github.com/orgs/deepgram/discussions/1035
  52. deepgram-docs/fern/docs/text-to-speech.mdx at main - GitHub, accessed on May 8, 2025, https://github.com/deepgram/deepgram-docs/blob/main/fern/docs/text-to-speech.mdx?plain=1
  53. migrating-from-google-speech-to-text-stt-to-deepgram.mdx - GitHub, accessed on May 8, 2025, https://github.com/deepgram/deepgram-docs/blob/main/fern/docs/migrating-from-google-speech-to-text-stt-to-deepgram.mdx?plain=1
  54. Trained on 100000+ Voices: Deepgram Unveils Next-Gen Speaker Diarization and Language Detection Models, accessed on May 8, 2025, https://deepgram.com/learn/nextgen-speaker-diarization-and-language-detection-models
  55. Streaming API is very slow, is it a bug or a user error? · deepgram · Discussion #1066, accessed on May 8, 2025, https://github.com/orgs/deepgram/discussions/1066
  56. Nova-2 Air Traffic Control Speech to Text - Deepgram, accessed on May 8, 2025, https://deepgram.com/changelog/nova-2-air-traffic-control
  57. Introducing Nova-3: Extending Deepgram’s Leadership in Voice AI for Enterprise Use Cases, accessed on May 8, 2025, https://www.reworked.co/the-wire/introducing-nova-3-extending-deepgrams-leadership-in-voice-ai-for-enterprise-use-cases/
  58. Using Google Colab with Deepgram to Transcribe Audio - YouTube, accessed on May 8, 2025, https://www.youtube.com/watch?v=ocvilXRtOGM
  59. accessed on January 1, 1970, https://deepgram.com/docs/model-comparison
  60. accessed on January 1, 1970, https://deepgram.com/blog/
  61. accessed on January 1, 1970, https://deepgram.com/blog/introducing-nova-3-the-most-accurate-fastest-and-most-cost-effective-speech-to-text-model-ever
  62. ElevenLabs vs OpenAI Speech to Text - Cartesia, accessed on May 8, 2025, https://cartesia.ai/vs/elevenlabs-vs-openai-tts
  63. ElevenLabs vs Google Speech to Text - Cartesia, accessed on May 8, 2025, https://cartesia.ai/vs/elevenlabs-vs-google-tts
  64. Speech-to-Text with Faster-Whisper - Mysoly, accessed on May 8, 2025, https://mysoly.nl/speech-to-text-with-faster-whisper-the-high-speed-alternative-to-openai-whisper/
  65. Whisper: API Provider Benchmarking & Analysis, accessed on May 8, 2025, https://artificialanalysis.ai/speech-to-text/models/whisper?trk=public_post_comment-text
  66. Whisper Meets FalAI: From Speech Recognition to End-to-End Spoken Language Understanding - ISCA Archive, accessed on May 8, 2025, https://www.isca-archive.org/iberspeech_2024/pineiromartin24_iberspeech.pdf
  67. Realtime Transcription using New OpenAI Whisper Turbo : r/LocalLLaMA - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1fubr8d/realtime_transcription_using_new_openai_whisper/
  68. Model Gallery | fal.ai, accessed on May 8, 2025, https://fal.ai/models
  69. Pricing | fal.ai, accessed on May 8, 2025, https://fal.ai/pricing
  70. Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality, accessed on May 8, 2025, https://fireworks.ai/blog/streaming-audio-launch
  71. openai/whisper-large-v3-turbo - Hugging Face, accessed on May 8, 2025, https://huggingface.co/openai/whisper-large-v3-turbo
  72. New Whisper model: “turbo” : r/LocalLLaMA - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1ft3wyb/new_whisper_model_turbo/
  73. Quantization for OpenAI’s Whisper Models: A Comparative Analysis - arXiv, accessed on May 8, 2025, https://arxiv.org/html/2503.09905v1
  74. Whisper V3 Large - Fireworks, accessed on May 8, 2025, https://fireworks.ai/models/fireworks/whisper-v3
  75. Pricing - Fireworks, accessed on May 8, 2025, https://fireworks.ai/pricing
  76. Solaria | The first truly universal speech-to-text model - Gladia, accessed on May 8, 2025, https://www.gladia.io/solaria
  77. Gladia Launches Solaria, the First Fully Multilingual, Next-Generation Speech-to-Text Model for Global Scalability - CMS Wire, accessed on May 8, 2025, https://www.cmswire.com/the-wire/gladia-launches-solaria-the-first-fully-multilingual-next-generation-speech-to-text-model-for-global-scalability/
  78. Pricing - Gladia, accessed on May 8, 2025, https://gladia.io/pricing
  79. Blog - Gladia, accessed on May 8, 2025, https://gladia.io/blog/
  80. accessed on January 1, 1970, https://gladia.io/docs/api-reference/speech-to-text
  81. USA Today Baseball Coaches Poll - AP News, accessed on May 8, 2025, https://apnews.com/80fe2cde1d8d9c092b1353459d977d0f
  82. Typing Speed: How to Set Your Words-Per-Minute (WPM) Goal, accessed on May 8, 2025, https://www.typing.com/blog/typing-speed/
  83. accessed on January 1, 1970, https://cloud.google.com/speech-to-text/docs/enhanced-models
  84. accessed on January 1, 1970, https://cloud.google.com/speech-to-text/docs/reference/rest
  85. Groq Automatic Speech Recognition (ASR) API, accessed on May 8, 2025, https://groq.com/GroqDocs/Groq%20ASR%20Model%20Guide.pdf
  86. Switching to Groq Whisper Large V3 for Cost-Effective and Faster Transcriptions #57, accessed on May 8, 2025, https://github.com/savbell/whisper-writer/discussions/57
  87. On-demand Pricing for Tokens-as-a-Service - Groq is Fast AI Inference, accessed on May 8, 2025, https://groq.com/pricing/
  88. groq tokens per minute (TPM): Limit 6000 · Issue #4040 · elizaOS/eliza - GitHub, accessed on May 8, 2025, https://github.com/elizaOS/eliza/issues/4040
  89. Groq (Whisper) - Pipecat Docs, accessed on May 8, 2025, https://docs.pipecat.ai/server/services/stt/groq
  90. Supported Models - GroqDocs - GroqCloud, accessed on May 8, 2025, https://console.groq.com/docs/models
  91. GPT-4o mini Transcribe - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/models/gpt-4o-mini-transcribe
  92. GPT-4o Transcribe - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/models/gpt-4o-transcribe
  93. How Salesforce’s New Speech-to-Text Service Uses OpenAI Whisper Models for Real-Time Transcriptions, accessed on May 8, 2025, https://engineering.salesforce.com/how-salesforces-new-speech-to-text-service-uses-openai-whisper-models-for-real-time-transcriptions/
  94. Benchmark Report: OpenAI Whisper vs. Deepgram - BytePlus, accessed on May 8, 2025, https://www.byteplus.com/en/topic/409746
  95. Pricing | OpenAI, accessed on May 8, 2025, https://openai.com/pricing
  96. Speech to text - OpenAI API - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/guides/speech-to-text
  97. Pricing | Rev AI, accessed on May 8, 2025, https://www.rev.ai/pricing
  98. accessed on January 1, 1970, https://www.rev.ai/features/speech-to-text-api
  99. Asynchronous Speech-to-Text API Get Started - Rev AI, accessed on May 8, 2025, https://www.rev.ai/docs
  100. accessed on January 1, 1970, https://www.rev.ai/blog/how-fast-is-rev-ai-speech-to-text/
  101. Pricing for our Speech Recognition API Services | Speechmatics, accessed on May 8, 2025, https://www.speechmatics.com/pricing/
  102. Blog & Latest Speech Recognition News | Speechmatics, accessed on May 8, 2025, https://www.speechmatics.com/blog/
  103. accessed on January 1, 1970, https://docs.speechmatics.com/v2.0/rt-considerations