Reviewing stated performance of speech-to-text providers

May 8, 2025 by Langston Liam Morrow, Deep Researcher

Introduction

The speed at which a speech-to-text (STT) model can transcribe audio is a critical factor for a wide range of applications. From real-time closed captioning and voice assistants to call center analytics and rapid content creation, the responsiveness and efficiency of the transcription process directly impact user experience and operational workflows. Two primary metrics are commonly used to evaluate the speed of STT models: Real-Time Factor (RTF) and Words Per Minute (WPM). Real-Time Factor is the ratio of the processing time required by the model to the duration of the input audio. An RTF of less than 1 indicates that the model can transcribe faster than the actual speech duration, a crucial attribute for applications demanding low latency. Words Per Minute, on the other hand, measures the throughput of the model by quantifying the number of words it can transcribe within a minute, which is particularly relevant for assessing the efficiency of batch processing and handling large volumes of audio data. This report analyzes and compares the speed performance of various STT models based on the provided research material, with a focus on RTF and closely related measures such as processing speed and latency.
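As a concrete illustration of the two metrics, here is a minimal Python sketch implementing the definitions above (the helper names are illustrative, not part of any provider's SDK):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF: processing time divided by audio duration. Below 1 means
    the model transcribes faster than the audio plays."""
    return processing_seconds / audio_seconds


def words_per_minute(word_count: int, processing_seconds: float) -> float:
    """Throughput: words transcribed per minute of processing time."""
    return word_count * 60.0 / processing_seconds


# An hour of audio transcribed in 30 seconds of processing:
rtf = real_time_factor(30.0, 3600.0)   # ≈ 0.0083 → faster than real time
wpm = words_per_minute(9000, 30.0)     # 18000.0 words per minute of compute
```

Note that WPM as a throughput measure depends on both the model's speed and the word density of the audio, which is why the two metrics are not interchangeable.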

Amazon Transcribe Speed Research

Amazon Transcribe offers real-time transcription capabilities for streaming audio content. [1] This functionality is designed to process both pre-recorded media, such as movies and podcasts, and live media, like news broadcasts, by transcribing sequential data packets almost instantaneously. [1] A key aspect influencing the speed of this real-time transcription is the latency, which is directly dependent on the size of the audio chunks being processed. [1] For optimal efficiency and reduced latency, Amazon Transcribe recommends setting the audio chunk size between 50 ms and 200 ms when using PCM-encoded audio. [1] This focus on chunk size optimization indicates a strategy to minimize the delay inherent in streaming transcription, allowing for near real-time output. While Amazon Transcribe emphasizes its real-time capabilities and provides recommendations for managing latency, the provided material does not explicitly quantify the Real-Time Factor or the Words Per Minute achieved by its models. [2] One document does mention a processing speed of 30,000 to 60,000 tokens per second on specific hardware [3], but this metric is not directly convertible to standard measures like WPM or RTF in the context of typical audio transcription tasks. The availability of both streaming and batch transcription options [4] suggests that Amazon Transcribe can cater to different speed requirements, but specific performance metrics for these modes are not detailed in the provided information. Therefore, while real-time processing with controllable latency is a feature of Amazon Transcribe, the research material does not offer precise RTF or WPM figures for its standard offerings.
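To see what the 50-200 ms recommendation means in bytes per chunk, here is a back-of-envelope sketch assuming 16 kHz, 16-bit, mono PCM (a common streaming configuration; the source specifies only the duration range, so the sample rate and bit depth here are assumptions):

```python
def pcm_chunk_bytes(chunk_ms: float, sample_rate_hz: int = 16_000,
                    bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM audio chunk of the given duration."""
    samples = int(sample_rate_hz * chunk_ms / 1000)
    return samples * bytes_per_sample * channels


# The recommended 50-200 ms range for 16 kHz, 16-bit mono PCM:
low = pcm_chunk_bytes(50)     # 1600 bytes per chunk
high = pcm_chunk_bytes(200)   # 6400 bytes per chunk
```

Smaller chunks mean more frequent round-trips and lower latency per packet; larger chunks reduce request overhead at the cost of added delay.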

AssemblyAI Speed Research

AssemblyAI highlights its capabilities in both real-time and asynchronous speech-to-text processing. [7] For real-time transcription, AssemblyAI claims a latency of less than 600 milliseconds. [7] This direct measurement of delay provides a clear indication of the responsiveness of their real-time service, suggesting suitability for interactive applications where immediate feedback is important. In addition to real-time processing, AssemblyAI offers asynchronous transcription services where most audio files, regardless of duration, are processed in well under 45 seconds. [7] This rapid turnaround time for pre-recorded audio points to a highly efficient processing pipeline. Further analysis reveals that AssemblyAI has achieved a Real-Time Factor (RTF) as low as 0.008x for its asynchronous transcription. [8] This exceptionally low RTF signifies that their models can transcribe audio at a speed significantly faster than real-time, as evidenced by examples of hour-long meetings being transcribed in under 45 seconds. [8] These impressive speeds are attributed to continuous improvements in their API inference latency, driven by factors such as optimized model compilation, intelligent mini batching, hardware parallelization, and an optimized serving infrastructure. [9] While AssemblyAI’s real-time latency of under 1 second for its Universal-2 model [10] is within an acceptable range for many applications, the industry benchmark suggests that aiming for latencies under 100 ms is ideal for optimal real-time STT performance. [10] The provided material does not explicitly mention Words Per Minute (WPM) for any of the AssemblyAI models [11], indicating that RTF and latency are the primary speed metrics they emphasize.
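A low RTF translates directly into turnaround time, which a minimal sketch makes explicit using the reported 0.008x asynchronous figure:

```python
def turnaround_seconds(audio_seconds: float, rtf: float) -> float:
    """Expected wall-clock processing time for a file at a given RTF."""
    return audio_seconds * rtf


# An hour-long recording at the reported asynchronous RTF of 0.008x:
t = turnaround_seconds(3600.0, 0.008)   # 28.8 s, consistent with "under 45 seconds"
```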

Azure AI Speech-to-Text Speed Research

Azure AI Speech-to-Text defines Real-Time Factor (RTF) as the ratio of a device’s processing time to the length of the audio input. [19] For applications requiring real-time and interactive speech experiences, Azure AI recommends an RTF of 1 or lower. [19] This benchmark suggests a focus on ensuring that the transcription process keeps pace with or exceeds the speed of human speech. While Azure AI provides a general estimate of average human speech at 150 words per minute [20], this does not directly reflect the transcription speed of their models. A user report indicates an instance where Azure Speech-to-Text using continuousRecognitionAsync took close to 6 seconds to transcribe just 3 seconds of audio, resulting in an RTF of 2, which is slower than real-time. [21] This observation suggests that real-world performance can vary, and Microsoft recommends several optimizations to reduce latency, including deploying the speech resource closer to users and utilizing the Fast transcription API for quicker results with large files. [21] Azure AI also imposes quotas and limits on its services, such as concurrent request limits for both real-time and fast transcription, and maximum audio lengths for processing. [22] These limits can indirectly affect the overall speed and throughput achievable with the platform, especially under high load. While the Fast transcription API is mentioned as an option for faster processing of large files [21], the provided material does not offer specific RTF or WPM metrics for this or the standard real-time models. [23] Therefore, although Azure AI offers both real-time and fast transcription services, the research material lacks consistent, quantifiable speed metrics like RTF and WPM for a comprehensive performance analysis.

Cloudflare Whisper Speed Research

Cloudflare leverages the Whisper-large-v3-turbo model through its Workers AI platform to provide automatic speech recognition. [27] A key focus of this implementation is to create a scalable and low-latency transcription pipeline using the serverless environment of Cloudflare Workers. [27] The architecture is designed to handle large audio files by splitting them into smaller chunks for processing, a technique that can help improve perceived speed and overcome potential memory or execution time limitations. [27] While Cloudflare emphasizes the low-latency nature of its transcription service, it does impose a rate limit of 720 requests per minute for automatic speech recognition tasks. [28] This limit on the number of requests can influence the overall throughput and might be a consideration for applications with very high transcription demands. The provided research material does not offer specific, quantifiable metrics for Real-Time Factor or Words Per Minute achieved by Cloudflare Whisper. [29] The focus appears to be on the architectural aspects that support low latency and scalability within their serverless framework.
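The interaction between the rate limit and chunked processing can be estimated with a back-of-envelope calculation; the 30-second chunk length below is a hypothetical choice for illustration, not a documented Cloudflare value:

```python
def max_audio_hours_per_minute(requests_per_minute: int, chunk_seconds: float) -> float:
    """Upper bound on audio throughput implied by a request rate limit,
    assuming exactly one audio chunk per request."""
    return requests_per_minute * chunk_seconds / 3600.0


# Cloudflare's 720 requests/minute limit with a hypothetical 30-second chunk length:
hours = max_audio_hours_per_minute(720, 30.0)   # 6.0 hours of audio per minute
```

Under these assumptions the rate limit caps throughput at roughly six hours of audio per minute, so chunk size directly determines how much audio a single account can push through the pipeline.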

Deepgram Speed Research

Deepgram consistently presents itself as a leader in speech-to-text speed, claiming the “fastest real-time transcription in the biz with latency times of under 300 milliseconds”. [32] This sub-300ms latency is often highlighted with a direct comparison to competitors like Amazon Transcribe, which is stated to have a latency of 2-3 seconds. [32] Furthermore, Deepgram asserts that its technology is 10 times faster than Amazon. [33] For asynchronous batch processing, Deepgram reports speeds of up to 120 times normal audio speed, which translates to a very low Real-Time Factor of approximately 0.0083x, meaning an hour of audio can be transcribed in under 30 seconds. [35] Their Nova series of models, including Nova-2, further exemplifies this speed. Benchmarks show Nova-2 achieving a median inference time of just 29.8 seconds per hour of diarized audio, an RTF of roughly 0.0083x. [36] In specific use cases, such as medical transcription, Nova-2 is reported to handle 120-180 words per minute. [36] Artificial Analysis benchmarks also indicate a high “Speed Factor” (the inverse of RTF: higher means faster) for Deepgram’s Base model. [37] Additionally, Deepgram’s Nova-2 model is reported to offer real-time transcription with latency less than 300 milliseconds and the capability to transcribe an hour of audio in just 12 seconds, resulting in an RTF of approximately 0.0033x. [34] Across various models and use cases, Deepgram consistently emphasizes and demonstrates its high speed for both real-time and asynchronous speech-to-text processing. [38]
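These figures can be cross-checked by converting between “times faster than real time” speed factors and RTF, which are reciprocals of one another:

```python
def rtf_from_speed_factor(speed_factor: float) -> float:
    """RTF is the reciprocal of an 'N times faster than real time' speed factor."""
    return 1.0 / speed_factor


# Cross-checking the reported Deepgram figures:
batch = rtf_from_speed_factor(120)   # ≈ 0.0083, matches the 120x batch claim
nova2 = 29.8 / 3600.0                # ≈ 0.0083, 29.8 s per hour of audio
fast = 12.0 / 3600.0                 # ≈ 0.0033, an hour transcribed in 12 s
```

The three independently reported numbers are internally consistent, which lends some credibility to the vendor's claims.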

FalAI Speed Research

FalAI’s speed performance is primarily evident through its ElevenLabs Speech to Text offering and its focus on optimized Whisper implementations. ElevenLabs Speech to Text reports a latency (Time to First Audio - TTFA) of 150ms for its lower quality Flash Model and 300ms or more for its full model. [62] These figures indicate competitive low latency, particularly the Flash Model, which could be advantageous for real-time interactive applications. In the realm of Whisper model optimization, FalAI highlights Faster-whisper as a high-speed alternative to the original OpenAI Whisper, claiming it to be up to 4 times faster while using less memory. [64] The Real Time Factor (RTF) of the OpenAI Whisper API is reported as 0.14. [64] Evaluations of Faster Whisper on Dutch test data show it achieving RTFs lower (i.e., better) than 0.14, with performance varying based on the chosen model size and beam size, illustrating a trade-off between speed and accuracy. [64] Furthermore, Artificial Analysis benchmarks for speech-to-text models include Fal.ai’s Wizper (L, v3), which demonstrates a high “Speed Factor” (the inverse of RTF), indicating fast transcription. [65] This performance aligns with the general theme of FalAI providing faster Whisper-based solutions. [66] While latency is a key aspect of ElevenLabs’ speed, and RTF is emphasized for their Whisper optimizations, the provided material does not explicitly mention Words Per Minute (WPM) for FalAI’s speech-to-text offerings.

FireworksAI Whisper Speed Research

FireworksAI emphasizes the speed of its Whisper model implementations for both streaming and asynchronous transcription. [70] For streaming audio (16kHz mono PCM), they claim an end-to-end latency of 300ms. [70] This low latency suggests suitability for real-time applications. For asynchronous transcription, FireworksAI reports that their service can transcribe one hour of audio in just 4 seconds, resulting in an exceptionally low Real-Time Factor of approximately 0.0011x. [70] This indicates very high processing speed for pre-recorded audio. FireworksAI leverages a finetuned and pruned version of Whisper large-v3 called Whisper large-v3-turbo, which has significantly reduced decoding layers, making it “way faster” than the original model. [71] Relative speed comparisons show that smaller Whisper models generally offer faster processing times, with Whisper Turbo V3 being reported as about 8 times faster than the Large v3 model. [72] However, the processing speed can vary depending on the specific Whisper model size chosen and the underlying hardware used (CPU vs. GPU). [73] While FireworksAI consistently highlights the speed of its Whisper offerings, particularly the Turbo version, the provided material does not prominently feature Words Per Minute (WPM) as a key performance metric. [74]

Gladia Solaria Speed Research

Gladia introduces its Solaria model as a next-generation speech-to-text engine that delivers industry-leading accuracy and speed, along with native-level transcription in over 100 languages. [76] A key highlight of Solaria is its ultra-low latency, with a reported interruption latency of 270 ms. [76] This specific and competitive latency figure positions Solaria as a strong performer for real-time communication platforms and call centers where immediate responsiveness is crucial. Gladia also offers a Solaria-1 Mini version, specifically designed to be an ultra-fast variant tailored for low-latency environments such as AI-powered voice agents and automated call handling. [76] The consistent emphasis on low latency, often cited as under 300ms for their real-time streaming engine [78], underscores Gladia’s focus on speed for real-time transcription. While accuracy is also a key selling point for Solaria, the provided material does not explicitly mention Words Per Minute (WPM) as a primary speed metric.

Google Speech-to-Text Speed Research

The provided research material does not contain specific, quantifiable metrics for Real-Time Factor or Words Per Minute for Google’s Enhanced or Standard speech-to-text models. [81] The snippets are either irrelevant to STT performance or point to inaccessible websites. Therefore, based on the information available, a conclusive analysis of Google Speech-to-Text’s speed performance in terms of RTF or WPM cannot be made.

Groq Whisper Speed Research

Groq’s Automatic Speech Recognition (ASR) API, powered by their specialized LPU (Language Processing Unit) AI inference technology, is designed to provide ultra-low latency audio transcription and translation. [85] A significant aspect of Groq’s Whisper implementation is its exceptionally high speed factor (the inverse of RTF as defined earlier). For Whisper Large v3, Groq reports a speed factor of up to 300x, indicating that it can transcribe audio 300 times faster than the original duration, an RTF of roughly 0.0033. [85] Similarly, Whisper Large v3 Turbo offers up to 247x, and Distil-Whisper achieves up to 262x. [85] A user experience report corroborates these claims, noting that Groq’s Whisper Large V3 operates at approximately 172 times real-time speed. [86] These high speed factors are further supported by independent benchmarks from Artificial Analysis [37], which show median speed factors consistent with Groq’s claims. This focus on extremely fast asynchronous transcription makes Groq’s Whisper implementation particularly well-suited for applications where rapid processing of pre-recorded audio is paramount. The provided material does not prominently feature Words Per Minute (WPM) as a key metric for Groq’s Whisper offerings. [87]

OpenAI Speed Research

OpenAI’s speech-to-text offerings include several models with varying speed characteristics. GPT-4o mini Transcribe is described as having “Fast” speed. [91] While this is a qualitative description, the provision of rate limits in Requests Per Minute (RPM) and Tokens Per Minute (TPM) for different usage tiers offers some context regarding its throughput capabilities. [91] In contrast, the standard GPT-4o Transcribe is listed as having “Medium” speed, suggesting a potential trade-off with other factors like accuracy or features. [92] OpenAI’s foundational Whisper model has also been adapted for real-time applications. Salesforce reports using OpenAI Whisper for real-time transcriptions with a target latency of under 1 second, aiming for 500 milliseconds, demonstrating its capability for low-latency use cases. [93] Comparative analysis suggests that Deepgram might have a slightly higher Words Per Minute (WPM) rate (200-250) compared to OpenAI Whisper (150-200). [94] However, the Real-Time Factor (RTF) of the OpenAI Whisper API is reported as 0.14, indicating that it still offers faster-than-real-time processing for asynchronous tasks. [64] Furthermore, a newer version, Whisper turbo, is reported to achieve a latency of approximately 0.3 seconds for near real-time transcription. [67] OpenAI also highlights GPT-4.1 nano as its fastest and most cost-effective model for low-latency tasks. [95] Overall, OpenAI provides a range of speech-to-text models with varying speed profiles, catering to different needs from rapid asynchronous processing to low-latency real-time transcription. [40]

Rev AI Speed Research

Rev AI offers both asynchronous and streaming speech-to-text APIs, indicating a capability for different speed requirements. [97] For asynchronous processing, they provide a “Reverb Turbo Transcription” service, suggesting faster turnaround times compared to their standard offering. [97] Additionally, their support for streaming speech-to-text implies real-time capabilities, which typically involve lower latency. [97] However, the provided research material lacks specific, quantifiable metrics for Real-Time Factor or precise latency figures for Rev AI’s streaming service. [98] Information regarding the Asynchronous API suggests that it is not real-time and involves a waiting period, implying longer latency compared to streaming options. [99] While the existence of a “Turbo” option and a streaming API indicates a focus on speed, the absence of detailed performance data in the accessible snippets limits a definitive analysis of Rev AI’s speed in terms of RTF and WPM based solely on this research material.

Speechmatics Speed Research

Speechmatics offers real-time transcription capabilities, with its Ursa 2 model reported to have a real-time latency of under 1 second. [10] This latency places Speechmatics within the acceptable range for many applications requiring immediate transcription. Beyond this, the provided research material offers limited quantifiable data on the Real-Time Factor or Words Per Minute achieved by Speechmatics’ models. [101] Some snippets mention “Real-Time” as a feature, but specific performance details are not included. One snippet mentions the tagging speed of a part-of-speech tagger, TnT, which is not directly indicative of Speechmatics’ STT performance. [101] Therefore, while Speechmatics offers real-time transcription with a reported latency, the research material does not provide sufficient data for a comprehensive speed analysis in terms of RTF and WPM for their speech-to-text services.

Comparative Analysis Table

The following table summarizes the speed metrics found for the analyzed speech-to-text models:

| Model Name | Provider | RTF / speed factor (where available) | Latency (where available) | WPM (where available) | Notes |
| --- | --- | --- | --- | --- | --- |
| Amazon Transcribe | Amazon | Not available | Chunk size dependent | Not available | Real-time streaming |
| Universal-2 | AssemblyAI | Not available | <1 s | Not available | Real-time |
| Async | AssemblyAI | ≤ 0.008x | N/A | Not available | Asynchronous |
| Azure AI STT | Azure AI | ≤ 1 (recommended) | Varies | 150 (human avg.) | Real-time; Fast transcription API available |
| Whisper-large-v3-turbo | Cloudflare | Not available | Low latency | Not available | Real-time on Workers AI |
| Base | Deepgram | High speed factor | <300 ms | Not available | Real-time |
| Nova-2 | Deepgram | ≈ 0.0083x | <300 ms | 120-180 (medical) | Asynchronous; real-time |
| Nova-3 | Deepgram | Not available | Not available | Not available | |
| Whisper API | OpenAI | 0.14 | ~0.3 s (turbo) | 150-200 | Asynchronous; real-time (turbo) |
| GPT-4o mini Transcribe | OpenAI | Not available | “Fast” (qualitative) | Not available | |
| GPT-4o Transcribe | OpenAI | Not available | “Medium” (qualitative) | Not available | |
| Scribe (Flash Model) | ElevenLabs | Not available | 150 ms | Not available | |
| Scribe (Full Model) | ElevenLabs | Not available | 300 ms+ | Not available | |
| Faster-whisper | FalAI | < 0.14 | Not available | Not available | Asynchronous; varies by model size & beam size |
| Wizper (L, v3) | FalAI | High speed factor | Not available | Not available | |
| Whisper Large v3 | Groq | Up to 300x speed factor (RTF ≈ 0.0033) | Ultra-low latency | Not available | Asynchronous |
| Whisper Large v3 Turbo | Groq | Up to 247x speed factor | Ultra-low latency | Not available | Asynchronous |
| Distil-Whisper | Groq | Up to 262x speed factor | Ultra-low latency | Not available | Asynchronous |
| Ursa 2 | Speechmatics | Not available | <1 s | Not available | Real-time |
| Reverb Turbo | Rev AI | Faster than standard | Not available | Not available | Asynchronous |
| Streaming | Rev AI | Not available | Low latency | Not available | Real-time |
| Whisper Large v3 | FireworksAI | ≈ 0.0011x | 300 ms | Not available | Asynchronous; real-time |
| Whisper large-v3-turbo | FireworksAI | ≈8x faster than Large v3 | Not available | Not available | |
| Solaria | Gladia | Not available | 270 ms | Not available | Real-time |
| Solaria-1 Mini | Gladia | Not available | Ultra-fast | Not available | Real-time |

Conclusion

The analysis of the provided research material reveals a diverse landscape of speech-to-text model speed performance. Several providers, including Deepgram, AssemblyAI, Groq, and FireworksAI, demonstrate a strong emphasis on speed, achieving very low latency for real-time applications and exceptionally high Real-Time Factors for asynchronous transcription. Deepgram stands out with its claims of industry-leading low latency and very fast batch processing speeds, supported by benchmark data for its Nova series. AssemblyAI also offers impressive asynchronous speeds with a very low RTF. Groq’s Whisper implementations showcase remarkable RTFs, making them highly efficient for transcribing large volumes of pre-recorded audio. FireworksAI provides both low-latency streaming and extremely fast asynchronous processing for Whisper models. Gladia’s Solaria model boasts a very competitive low latency, positioning it well for real-time use cases.

In contrast, Amazon Transcribe and Azure AI Speech-to-Text focus on real-time capabilities and provide guidance on managing latency, but specific RTF and WPM metrics are not as readily available in the provided material. Cloudflare Whisper emphasizes low latency and scalability within its serverless architecture, while FalAI’s strength lies in optimized Whisper implementations like Faster-whisper, offering significant speed improvements over the original OpenAI model. OpenAI provides a range of models with varying speed profiles, with newer versions like Whisper turbo and GPT-4o mini aiming for faster performance. Rev AI indicates options for faster asynchronous processing and real-time capabilities but lacks detailed speed metrics in the provided material. Speechmatics reports a real-time latency of under 1 second for its Ursa 2 model, but comprehensive speed data is limited. Google Speech-to-Text’s speed performance could not be adequately assessed based on the provided research.

It is important to note that the level of detail in speed metrics varies significantly across different vendors’ documentation and marketing materials. While some providers offer specific RTF and latency figures, others focus on qualitative descriptions or architectural advantages. When selecting an STT solution, speed is a critical factor that should be considered alongside other important aspects such as accuracy, cost, language support, and specific features relevant to the intended application. The optimal choice will depend on the unique requirements of the use case, whether it prioritizes immediate responsiveness, high-throughput batch processing, or a balance of both.

Works cited

  1. Transcribing streaming audio - AWS Documentation, accessed on May 8, 2025, https://docs.aws.amazon.com/transcribe/latest/dg/streaming.html
  2. Post-call analytics - Amazon Transcribe - AWS Documentation, accessed on May 8, 2025, https://docs.aws.amazon.com/transcribe/latest/dg/call-analytics-batch.html
  3. Amazon Transcribe FAQs – Amazon Web Services (AWS), accessed on May 8, 2025, https://aws.amazon.com/transcribe/faqs/
  4. Amazon Transcribe Pricing – Amazon Web Services (AWS), accessed on May 8, 2025, https://aws.amazon.com/transcribe/pricing/
  5. https://docs.aws.amazon.com/transcribe/latest/dg/limits.html (inaccessible at time of research)
  6. https://aws.amazon.com/blogs/machine-learning/amazon-transcribe-now-supports-real-time-transcription/ (inaccessible at time of research)
  7. AWS Marketplace: AssemblyAI - Amazon.com, accessed on May 8, 2025, https://aws.amazon.com/marketplace/pp/prodview-j45hta6jdej7c
  8. Newsletter - News, Tutorials, AI Research - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/tag/newsletter/
  9. Lower latency, lower cost, more possibilities - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/lower-latency-new-pricing
  10. How to Choose STT & TTS for AI Voice Agents in 2025: A Comprehensive Guide - Softcery, accessed on May 8, 2025, https://softcery.com/lab/how-to-choose-stt-tts-for-ai-voice-agents-in-2025-a-comprehensive-guide/
  11. Benchmarks - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/benchmarks
  12. Raising the Bar for Speech AI: Introducing Slam-1 & a New Streaming Model - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/speech-language-model-and-improved-streaming-model
  13. Introducing Slam-1 | AssemblyAI | Documentation, accessed on May 8, 2025, https://www.assemblyai.com/docs/getting-started/slam-1
  14. Universal-2-TF: Robust All-Neural Text Formatting for ASR - arXiv, accessed on May 8, 2025, https://arxiv.org/html/2501.05948v1
  15. https://www.assemblyai.com/docs/models#acoustic-models (inaccessible at time of research)
  16. Pricing | Production-ready AI Models - AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/pricing
  17. Speech & Text | Blog from AssemblyAI, accessed on May 8, 2025, https://www.assemblyai.com/blog/
  18. https://www.assemblyai.com/blog/assemblyai-launches-faster-and-cheaper-real-time-transcription/ (inaccessible at time of research)
  19. Performance evaluations for Embedded Speech - Speech service - Azure AI services, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/ai-services/speech-service/embedded-speech-performance-evaluations
  20. Need to limit the audio time using Azure Text to Speech - Learn Microsoft, accessed on May 8, 2025, https://learn.microsoft.com/en-us/answers/questions/530697/need-to-limit-the-audio-time-using-azure-text-to-s
  21. Azure speech to text appears very slow - Microsoft Q&A, accessed on May 8, 2025, https://learn.microsoft.com/en-us/answers/questions/2152684/azure-speech-to-text-appears-very-slow
  22. Speech service quotas and limits - Azure AI services - Learn Microsoft, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-services-quotas-and-limits
  23. https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-to-text/ (inaccessible at time of research)
  24. What is the Speech service? - Azure AI services | Microsoft Learn, accessed on May 8, 2025, https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/overview
  25. https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/concepts-real-time-speech-to-text (inaccessible at time of research)
  26. https://techcommunity.microsoft.com/t5/azure-ai/azure-speech-to-text-updates-faster-and-more-accurate-models/ba-p/3788857 (inaccessible at time of research)
  27. Whisper-large-v3-turbo with Cloudflare Workers AI, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/guides/tutorials/build-a-workers-ai-whisper-with-chunking/
  28. Limits · Cloudflare Workers AI docs, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/platform/limits/
  29. Models · Cloudflare Workers AI docs, accessed on May 8, 2025, https://developers.cloudflare.com/workers-ai/models/speech-to-text/
  30. https://blog.cloudflare.com/workers-ai-speech-to-text-whisper/ (inaccessible at time of research)
  31. Cloudflare Community, accessed on May 8, 2025, https://community.cloudflare.com/
  32. All About Transcription for Real-Time (Live) Audio Streaming - Deepgram Blog ⚡️, accessed on May 8, 2025, https://deepgram.com/learn/all-about-transcription-for-real-time-audio-streaming
  33. Speech to Text API: Next-Gen AI Speech Recognition - Deepgram, accessed on May 8, 2025, https://deepgram.com/product/speech-to-text
  34. Voice AI Evolution With Twilio and Deepgram - Efficiently Connected, accessed on May 8, 2025, https://www.efficientlyconnected.com/voice-ai-evolution-with-twilio-and-deepgram/
  35. How Deepgram Works, accessed on May 8, 2025, https://6890003.fs1.hubspotusercontent-na1.net/hubfs/6890003/Collateral/Whitepaper%20How%20Deepgram%20Works-Updated.pdf?hsCtaTracking=e3c2fd70-fe62-48d5-815b-4546c5305a61%7C42cdfce0-5ebf-4b4c-92b7-9172beb24dbb
  36. Deepgram Nova-2 - One API 200+ AI Models, accessed on May 8, 2025, https://aimlapi.com/models/deepgram-nova-2
  37. Whisper (L v2), Deepgram - Artificial Analysis, accessed on May 8, 2025, https://artificialanalysis.ai/speech-to-text/models/deepgram
  38. Text to Speech Latency — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/text-to-speech-latency
  39. Model Options — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/model
  40. Seeking Speech-to-Text API Recommendations: Word-Level Probabilities, High Accuracy, Low Latency : r/AskProgramming - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/AskProgramming/comments/1cxo5ch/seeking_speechtotext_api_recommendations/
  41. What is Deepgram and How Does this Voice AI Work in 2025? - Bigly Sales, accessed on May 8, 2025, https://biglysales.com/what-is-deepgram-and-how-does-this-voice-ai-work/
  42. Using Interim Results — Deepgram | Documentation, accessed on May 8, 2025, https://developers.deepgram.com/docs/using-interim-results
  43. Deepgram’s product-market fit journey - Unusual Ventures, accessed on May 8, 2025, https://www.unusual.vc/post/deepgrams-product-market-fit-journey
  44. Why Deepgram’s Speech-to-Text API is #1 for Developers on G2, accessed on May 8, 2025, https://deepgram.com/learn/deepgrams-speech-to-text-api-number-1-for-developers-g2
  45. Everything you need to know about Voice AI Agents - Deepgram, accessed on May 8, 2025, https://deepgram.com/learn/everything-about-voice-ai-agents
  46. Introducing Deepgram Nova & Deepgram Whisper Cloud and On-Prem, accessed on May 8, 2025, https://deepgram.com/changelog/introducing-deepgram-nova-deepgram-whisper-cloud-and-on-prem
  47. Deepgram vs Whisper Large: A Comprehensive Comparison - BytePlus, accessed on May 8, 2025, https://www.byteplus.com/en/topic/409744
  48. Narakeet Vs Deepgram: Compare Samples, Price & Features - PlayHT, accessed on May 8, 2025, https://play.ht/blog/ai-apps/vs/narakeet-vs-deepgram/
  49. Deepgram’s Advanced Speech Recognition Technology 2024 - CDN Blog, accessed on May 8, 2025, https://blog.blazingcdn.com/en-us/deepgrams-advanced-speech-recognition-technology
  50. A Detailed Breakdown Of Deepgram Pricing & Features (2024) - Unreal Speech, accessed on May 8, 2025, https://blog.unrealspeech.com/deepgram-pricing/
  51. Real-time Speech To Text - Always Send Finalize Response · deepgram · Discussion #1035, accessed on May 8, 2025, https://github.com/orgs/deepgram/discussions/1035
  52. deepgram-docs/fern/docs/text-to-speech.mdx at main - GitHub, accessed on May 8, 2025, https://github.com/deepgram/deepgram-docs/blob/main/fern/docs/text-to-speech.mdx?plain=1
  53. migrating-from-google-speech-to-text-stt-to-deepgram.mdx - GitHub, accessed on May 8, 2025, https://github.com/deepgram/deepgram-docs/blob/main/fern/docs/migrating-from-google-speech-to-text-stt-to-deepgram.mdx?plain=1
  54. Trained on 100000+ Voices: Deepgram Unveils Next-Gen Speaker Diarization and Language Detection Models, accessed on May 8, 2025, https://deepgram.com/learn/nextgen-speaker-diarization-and-language-detection-models
  55. Streaming API is very slow, is it a bug or a user error? · deepgram · Discussion #1066, accessed on May 8, 2025, https://github.com/orgs/deepgram/discussions/1066
  56. Nova-2 Air Traffic Control Speech to Text - Deepgram, accessed on May 8, 2025, https://deepgram.com/changelog/nova-2-air-traffic-control
  57. Introducing Nova-3: Extending Deepgram’s Leadership in Voice AI for Enterprise Use Cases, accessed on May 8, 2025, https://www.reworked.co/the-wire/introducing-nova-3-extending-deepgrams-leadership-in-voice-ai-for-enterprise-use-cases/
  58. Using Google Colab with Deepgram to Transcribe Audio - YouTube, accessed on May 8, 2025, https://www.youtube.com/watch?v=ocvilXRtOGM
  59. accessed on January 1, 1970, https://deepgram.com/docs/model-comparison
  60. accessed on January 1, 1970, https://deepgram.com/blog/
  61. accessed on January 1, 1970, https://deepgram.com/blog/introducing-nova-3-the-most-accurate-fastest-and-most-cost-effective-speech-to-text-model-ever
  62. ElevenLabs vs OpenAI Speech to Text - Cartesia, accessed on May 8, 2025, https://cartesia.ai/vs/elevenlabs-vs-openai-tts
  63. ElevenLabs vs Google Speech to Text - Cartesia, accessed on May 8, 2025, https://cartesia.ai/vs/elevenlabs-vs-google-tts
  64. Speech-to-Text with Faster-Whisper - Mysoly, accessed on May 8, 2025, https://mysoly.nl/speech-to-text-with-faster-whisper-the-high-speed-alternative-to-openai-whisper/
  65. Whisper: API Provider Benchmarking & Analysis, accessed on May 8, 2025, https://artificialanalysis.ai/speech-to-text/models/whisper?trk=public_post_comment-text
  66. Whisper Meets FalAI: From Speech Recognition to End-to-End Spoken Language Understanding - ISCA Archive, accessed on May 8, 2025, https://www.isca-archive.org/iberspeech_2024/pineiromartin24_iberspeech.pdf
  67. Realtime Transcription using New OpenAI Whisper Turbo : r/LocalLLaMA - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1fubr8d/realtime_transcription_using_new_openai_whisper/
  68. Model Gallery | fal.ai, accessed on May 8, 2025, https://fal.ai/models
  69. Pricing | fal.ai, accessed on May 8, 2025, https://fal.ai/pricing
  70. Fireworks Streaming Transcription: 300ms with Whisper-v3-large-quality, accessed on May 8, 2025, https://fireworks.ai/blog/streaming-audio-launch
  71. openai/whisper-large-v3-turbo - Hugging Face, accessed on May 8, 2025, https://huggingface.co/openai/whisper-large-v3-turbo
  72. New Whisper model: “turbo” : r/LocalLLaMA - Reddit, accessed on May 8, 2025, https://www.reddit.com/r/LocalLLaMA/comments/1ft3wyb/new_whisper_model_turbo/
  73. Quantization for OpenAI’s Whisper Models: A Comparative Analysis - arXiv, accessed on May 8, 2025, https://arxiv.org/html/2503.09905v1
  74. Whisper V3 Large - Fireworks, accessed on May 8, 2025, https://fireworks.ai/models/fireworks/whisper-v3
  75. Pricing - Fireworks, accessed on May 8, 2025, https://fireworks.ai/pricing
  76. Solaria | The first truly universal speech-to-text model - Gladia, accessed on May 8, 2025, https://www.gladia.io/solaria
  77. Gladia Launches Solaria, the First Fully Multilingual, Next-Generation Speech-to-Text Model for Global Scalability - CMS Wire, accessed on May 8, 2025, https://www.cmswire.com/the-wire/gladia-launches-solaria-the-first-fully-multilingual-next-generation-speech-to-text-model-for-global-scalability/
  78. Pricing - Gladia, accessed on May 8, 2025, https://gladia.io/pricing
  79. Blog - Gladia, accessed on May 8, 2025, https://gladia.io/blog/
  80. accessed on January 1, 1970, https://gladia.io/docs/api-reference/speech-to-text
  81. USA Today Baseball Coaches Poll - AP News, accessed on May 8, 2025, https://apnews.com/80fe2cde1d8d9c092b1353459d977d0f
  82. Typing Speed: How to Set Your Words-Per-Minute (WPM) Goal, accessed on May 8, 2025, https://www.typing.com/blog/typing-speed/
  83. accessed on January 1, 1970, https://cloud.google.com/speech-to-text/docs/enhanced-models
  84. accessed on January 1, 1970, https://cloud.google.com/speech-to-text/docs/reference/rest
  85. Groq Automatic Speech Recognition (ASR) API, accessed on May 8, 2025, https://groq.com/GroqDocs/Groq%20ASR%20Model%20Guide.pdf
  86. Switching to Groq Whisper Large V3 for Cost-Effective and Faster Transcriptions #57, accessed on May 8, 2025, https://github.com/savbell/whisper-writer/discussions/57
  87. On-demand Pricing for Tokens-as-a-Service - Groq is Fast AI Inference, accessed on May 8, 2025, https://groq.com/pricing/
  88. groq tokens per minute (TPM): Limit 6000 · Issue #4040 · elizaOS/eliza - GitHub, accessed on May 8, 2025, https://github.com/elizaOS/eliza/issues/4040
  89. Groq (Whisper) - Pipecat Docs, accessed on May 8, 2025, https://docs.pipecat.ai/server/services/stt/groq
  90. Supported Models - GroqDocs - GroqCloud, accessed on May 8, 2025, https://console.groq.com/docs/models
  91. GPT-4o mini Transcribe - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/models/gpt-4o-mini-transcribe
  92. GPT-4o Transcribe - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/models/gpt-4o-transcribe
  93. How Salesforce’s New Speech-to-Text Service Uses OpenAI Whisper Models for Real-Time Transcriptions, accessed on May 8, 2025, https://engineering.salesforce.com/how-salesforces-new-speech-to-text-service-uses-openai-whisper-models-for-real-time-transcriptions/
  94. Benchmark Report: OpenAI Whisper vs. Deepgram - BytePlus, accessed on May 8, 2025, https://www.byteplus.com/en/topic/409746
  95. Pricing | OpenAI, accessed on May 8, 2025, https://openai.com/pricing
  96. Speech to text - OpenAI API - OpenAI Platform, accessed on May 8, 2025, https://platform.openai.com/docs/guides/speech-to-text
  97. Pricing | Rev AI, accessed on May 8, 2025, https://www.rev.ai/pricing
  98. accessed on January 1, 1970, https://www.rev.ai/features/speech-to-text-api
  99. Asynchronous Speech-to-Text API Get Started - Rev AI, accessed on May 8, 2025, https://www.rev.ai/docs
  100. accessed on January 1, 1970, https://www.rev.ai/blog/how-fast-is-rev-ai-speech-to-text/
  101. Pricing for our Speech Recognition API Services | Speechmatics, accessed on May 8, 2025, https://www.speechmatics.com/pricing/
  102. Blog & Latest Speech Recognition News | Speechmatics, accessed on May 8, 2025, https://www.speechmatics.com/blog/
  103. accessed on January 1, 1970, https://docs.speechmatics.com/v2.0/rt-considerations