Introducing the Speechall CLI for terminal transcription
June 20, 2026 by Lena Lucille Meyer, Developer Advocate (AI-generated fictional author)
Speech transcription from the terminal
We recently introduced the Speechall CLI, a command-line tool for transcribing audio and video files through the unified Speechall API. It is meant for the cases where opening a browser, writing a one-off script, or wiring a vendor SDK is more work than the transcription task itself.
The CLI uses the same provider abstraction as the Speechall API. From one terminal interface you can use models from OpenAI, Deepgram, AssemblyAI, Cloudflare, Groq, ElevenLabs, Google, Gemini, and more. That makes it useful for quick local transcription, repeatable shell workflows, model comparison, subtitle generation, and AI agent tasks that need speech-to-text without custom integration code.
The tool is open source and available at github.com/Speechall/speechall-cli.
Installation
On macOS and Linux, the simplest installation path is Homebrew:
brew install Speechall/tap/speechall
After installation, verify that the binary is available on your PATH:
speechall --version
If you prefer not to use Homebrew, you can download a release binary from the GitHub repository or build from source with Swift. For day-to-day usage, we recommend Homebrew because it keeps updates straightforward.
API key setup
The CLI needs a Speechall API key. Create one in the Speechall API keys console, then export it in your shell:
export SPEECHALL_API_KEY="your-key-here"
With the environment variable set, every speechall command can authenticate without repeating the key. For temporary use in a script or a CI job, you can also pass the key directly:
speechall audio.wav --api-key "your-key-here"
For local work, the environment variable is usually cleaner because it keeps commands focused on the transcription parameters instead of credentials.
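Scripts that rely on the environment variable can check their prerequisites before doing any work. The following is a minimal sketch rather than anything the CLI provides: speechall_preflight is a hypothetical helper name, and it only tests that the binary is on PATH and that SPEECHALL_API_KEY is non-empty.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of the CLI) for scripts that
# authenticate through the SPEECHALL_API_KEY environment variable.
speechall_preflight() {
  # The binary must be reachable on PATH.
  if ! command -v speechall >/dev/null 2>&1; then
    echo "error: speechall not found on PATH" >&2
    return 1
  fi
  # The key must be set and non-empty. Its value is never printed.
  if [ -z "${SPEECHALL_API_KEY:-}" ]; then
    echo "error: SPEECHALL_API_KEY is not set" >&2
    return 1
  fi
  return 0
}
```

A script could call speechall_preflight || exit 1 as its first step so that a missing key fails fast with a readable message.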
Basic transcription
The default command transcribes a file:
speechall interview.mp3
The transcribe subcommand is optional, so the following command is equivalent:
speechall transcribe interview.mp3
Output is written to stdout. That means it composes naturally with standard shell tools:
speechall interview.mp3 > transcript.txt
Errors go to stderr, so redirecting stdout gives you the transcript without mixing it with diagnostics.
On macOS, video files such as .mp4 and .mov can be converted to audio automatically before upload. On Linux, pass audio formats such as .wav, .mp3, .m4a, or .flac directly.
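The stdout/stderr split makes batch jobs easy to keep tidy. Here is one possible sketch of a directory loop. transcribe_dir is a hypothetical helper name, and the sketch assumes that speechall exits with a non-zero status when a transcription fails, which this post does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: transcribe every .mp3 in a directory.
# The transcript (stdout) goes to NAME.txt, diagnostics (stderr) to NAME.log.
# Assumption: speechall returns a non-zero exit status on failure.
transcribe_dir() {
  dir=$1
  failed=0
  for f in "$dir"/*.mp3; do
    [ -e "$f" ] || continue                      # the glob matched nothing
    base=${f%.mp3}
    if speechall "$f" > "$base.txt" 2> "$base.log"; then
      [ -s "$base.log" ] || rm -f "$base.log"    # keep the log only if it has content
    else
      rm -f "$base.txt"                          # do not leave a partial transcript
      echo "failed: $f (see $base.log)" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                            # succeed only if every file succeeded
}
```

Running transcribe_dir ./recordings would leave one .txt file per recording and a .log file only where the CLI wrote something to stderr.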
Choosing a model
By default, the CLI uses openai.gpt-4o-mini-transcribe. You can select a specific provider and model with --model:
speechall call.wav --model deepgram.nova-2
You can also provide a language hint:
speechall call.wav --model openai.gpt-4o-mini-transcribe --language en
Model identifiers use a provider-prefixed format, such as openai.gpt-4o-mini-transcribe or deepgram.nova-2. The important part is that switching providers does not require switching tools. You can keep the same CLI command shape while testing different providers for accuracy, language support, formatting, latency, or cost.
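Because only the --model value changes between providers, a side-by-side comparison fits in a short loop. This is an illustrative sketch: compare_models is a hypothetical helper name, and the output file naming is our own convention, not something the CLI defines.

```shell
#!/bin/sh
# Hypothetical helper: transcribe one file with several models so the
# results can be reviewed side by side.
# Usage: compare_models FILE MODEL [MODEL...]
compare_models() {
  file=$1
  shift
  for model in "$@"; do
    out="${file%.*}.$model.txt"        # e.g. call.deepgram.nova-2.txt
    echo "== $model -> $out" >&2
    speechall "$file" --model "$model" > "$out" || echo "   $model failed" >&2
  done
}
```

Calling compare_models call.wav openai.gpt-4o-mini-transcribe deepgram.nova-2 would write one transcript per model next to the recording, ready for diff or manual review.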
Output formats
The CLI supports the common transcription output formats exposed by the Speechall API: text, json, verbose_json, srt, and vtt.
Use plain text when you want the transcript itself:
speechall meeting.wav --output-format text > meeting.txt
Use JSON when another program will consume the result:
speechall meeting.wav --output-format json > meeting.json
Use verbose_json when you need richer provider output, such as timestamps or segment-level details where supported:
speechall meeting.wav --output-format verbose_json > meeting.verbose.json
For subtitles, generate SRT or VTT directly:
speechall webinar.mp4 --output-format srt > webinar.srt
speechall webinar.mp4 --output-format vtt > webinar.vtt
This is one of the most practical uses of the CLI: take a recording, generate a subtitle file, and hand it to a video platform or post-processing pipeline without writing glue code.
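When subtitle files feed another pipeline, a failed run should not leave a truncated file behind. A minimal sketch of that pattern follows. make_subtitles is a hypothetical helper name, and the sketch again assumes a non-zero exit status on failure.

```shell
#!/bin/sh
# Hypothetical helper: write NAME.srt and NAME.vtt next to the source recording.
# Assumption: speechall returns a non-zero exit status on failure.
make_subtitles() {
  src=$1
  stem=${src%.*}
  for fmt in srt vtt; do
    # Write to a temporary file first, then rename, so consumers never
    # see a half-written subtitle file.
    if speechall "$src" --output-format "$fmt" > "$stem.$fmt.tmp"; then
      mv "$stem.$fmt.tmp" "$stem.$fmt"
    else
      rm -f "$stem.$fmt.tmp"
      return 1
    fi
  done
}
```

Note that this sketch sends the recording once per format, so it makes two transcription requests for each file.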
Speaker diarization
For meetings, interviews, podcasts, and customer calls, a single block of text is often not enough. Speaker diarization labels who spoke when, if the selected model supports it.
speechall meeting.wav --diarization --speakers-expected 3
Diarization can be combined with subtitle output:
speechall meeting.wav \
  --model deepgram.nova-2 \
  --diarization \
  --speakers-expected 3 \
  --output-format srt > meeting.srt
The --speakers-expected value is a hint, not a replacement for model capability. Use model discovery to confirm which models support diarization before building a workflow around it.
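The capability check can be scripted so that a workflow refuses to run with a model that is not listed as diarization-capable. The sketch below is one possible approach and requires jq. supports_diarization and diarized_srt are hypothetical helper names, and the sketch assumes that speechall models prints a JSON array whose id values match what --model accepts.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if MODEL appears in the
# diarization-capable part of the catalog.
# Assumption: the `id` field matches the value accepted by --model.
supports_diarization() {
  speechall models --diarization |
    jq -e --arg id "$1" 'any(.[]; .id == $id)' >/dev/null
}

# Hypothetical wrapper: produce diarized SRT only for a supported model.
# Usage: diarized_srt FILE MODEL SPEAKERS
diarized_srt() {
  file=$1
  model=$2
  speakers=$3
  if ! supports_diarization "$model"; then
    echo "error: $model is not listed as diarization-capable" >&2
    return 1
  fi
  speechall "$file" --model "$model" --diarization \
    --speakers-expected "$speakers" --output-format srt
}
```

For example, diarized_srt meeting.wav deepgram.nova-2 3 > meeting.srt would run the catalog check first and only then start the transcription.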
Custom vocabulary and prompts
Transcription quality often depends on domain words: product names, customer names, medical terms, acronyms, internal project names, or words from another language that appear inside mostly English audio.
For models that support vocabulary boosting, pass repeated --custom-vocabulary flags:
speechall medical-note.wav \
  --custom-vocabulary "myocardial" \
  --custom-vocabulary "infarction" \
  --custom-vocabulary "troponin"
You can also provide an initial prompt where the model supports it:
speechall product-demo.wav \
  --initial-prompt "This recording discusses Speechall, unified speech APIs, and provider routing."
These options are intentionally exposed at the CLI layer because the best vocabulary list is often discovered while working with real audio. You can iterate from the terminal before turning the same parameters into an application-level integration.
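As a vocabulary list grows, keeping it in a text file is easier than editing a long command. One way to turn such a file into repeated flags is sketched here. transcribe_with_vocab is a hypothetical helper name, and the one-term-per-line file format is our own convention rather than a CLI feature.

```shell
#!/bin/sh
# Hypothetical helper: transcribe FILE with one --custom-vocabulary flag
# per non-empty line of TERMS_FILE (plain text, one term per line).
# Usage: transcribe_with_vocab FILE TERMS_FILE
transcribe_with_vocab() {
  file=$1
  terms_file=$2
  set --                               # reuse the positional parameters as an argument list
  while IFS= read -r term || [ -n "$term" ]; do
    [ -n "$term" ] || continue         # skip blank lines
    set -- "$@" --custom-vocabulary "$term"
  done < "$terms_file"
  speechall "$file" "$@"               # quoting keeps multi-word terms as single arguments
}
```

With a terms.txt containing myocardial, infarction, and troponin on separate lines, transcribe_with_vocab medical-note.wav terms.txt builds the same flags as the medical-note.wav command shown earlier.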
Discovering and filtering models
The models command lists available speech-to-text models as JSON:
speechall models
You can filter by provider:
speechall models --provider deepgram
You can filter by language:
speechall models --language tr
Capability filters combine with AND logic, which makes them useful for narrowing a list to models that fit a real requirement:
speechall models --language tr --diarization
speechall models --provider openai --srt
speechall models --streamable --vocabulary
This is deliberately machine-readable. Instead of checking a static table, you can query the current model catalog and use the same command in scripts.
jq workflows
Because speechall models returns JSON, it works well with jq for inspection and automation.
Find the cheapest Deepgram model with SRT support:
speechall models --provider deepgram --srt \
  | jq 'sort_by(.cost_per_second_usd) | .[0].id'
List model IDs that support Turkish:
speechall models --language tr | jq '[.[].id]'
Compare diarization-capable models by price:
speechall models --diarization \
  | jq '[.[] | {id, cost: .cost_per_second_usd}] | sort_by(.cost)'
List the providers currently represented in the catalog:
speechall models | jq '[.[].provider] | unique'
Find models that support both streaming and custom vocabulary:
speechall models --streamable --vocabulary | jq '[.[].id]'
These small commands are useful during provider evaluation. You can answer questions like “which models support this language and subtitles?” without leaving the terminal.
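The same queries can feed a transcription command directly. The following sketch wraps the cheapest-model query in a function. cheapest_model is a hypothetical helper name, and it assumes that id and cost_per_second_usd behave as in the examples above and that id is accepted by --model. Entries without a price are skipped because jq sorts null before numbers.

```shell
#!/bin/sh
# Hypothetical helper: print the id of the cheapest model that matches the
# given `speechall models` filters. Exits non-zero if nothing matches.
# Usage: cheapest_model --language tr --srt
cheapest_model() {
  speechall models "$@" |
    jq -er 'map(select(.cost_per_second_usd != null))
            | sort_by(.cost_per_second_usd)
            | .[0].id // empty'
}

# Example use (kept as a comment so that defining the helper has no side effects):
#   model=$(cheapest_model --language tr --srt) &&
#     speechall interview.mp3 --model "$model" --language tr \
#       --output-format srt > interview.srt
```

The -e flag makes jq return a failure status when the filter produces no output, so the transcription step is skipped if the catalog has no matching model.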
Installing the AI agent skill
The CLI also ships with an AI agent skill. This lets compatible agents transcribe local recordings, create subtitle files, and inspect available speech models through the same command-line tool.
Install it with:
npx skills add https://github.com/Speechall/speechall-cli --skill speechall-cli
After installation, an agent can use the CLI for tasks such as transcribing a meeting recording into plain text, generating an SRT file from a video, or selecting a model that supports diarization and a specific language.
When to use it
Use the Speechall CLI when you need a fast, scriptable path from media file to transcript. It is especially useful for developers comparing providers, teams generating subtitles from recordings, and automation workflows where stdout, JSON, and shell composition matter.
It does not replace the API for production application code. Instead, it gives you a practical terminal interface to the same unified Speechall API, so you can test models, validate options, and build repeatable transcription workflows before deciding what belongs in your application.