Introducing the Speechall CLI for terminal transcription

June 20, 2026 by Lena Lucille Meyer, Developer Advocate (AI-generated fictional author)

Speech transcription from the terminal

We recently introduced the Speechall CLI, a command-line tool for transcribing audio and video files through the unified Speechall API. It is meant for the cases where opening a browser, writing a one-off script, or wiring up a vendor SDK is more work than the transcription task itself.

The CLI uses the same provider abstraction as the Speechall API. From one terminal interface you can use models from OpenAI, Deepgram, AssemblyAI, Cloudflare, Groq, ElevenLabs, Google, Gemini, and more. That makes it useful for quick local transcription, repeatable shell workflows, model comparison, subtitle generation, and AI agent tasks that need speech-to-text without custom integration code.

The tool is open source and available at github.com/Speechall/speechall-cli.

Installation

On macOS and Linux, the simplest installation path is Homebrew:

brew install Speechall/tap/speechall

After installation, verify that the binary is available on your PATH:

speechall --version

If you prefer not to use Homebrew, you can download a release binary from the GitHub repository or build from source with Swift. For day-to-day usage, Homebrew is the path we recommend because updates stay straightforward.

API key setup

The CLI needs a Speechall API key. Create one in the Speechall API keys console, then export it in your shell:

export SPEECHALL_API_KEY="your-key-here"

With the environment variable set, every speechall command can authenticate without repeating the key. For temporary use in a script or a CI job, you can also pass the key directly:

speechall audio.wav --api-key "your-key-here"

For local work, the environment variable is usually cleaner: commands stay focused on the transcription parameters, and the key stays out of your shell history.
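In a script or CI job, it can help to fail early when the setup is incomplete rather than partway through a batch. The following is a minimal sketch, not something the CLI ships: the function name `speechall_preflight` and its exit codes are hypothetical, and it only checks for the `speechall` binary and the `SPEECHALL_API_KEY` variable described above.

```shell
# Hypothetical helper (not part of the CLI): fail early if the setup is incomplete.
speechall_preflight() {
  # The CLI must be resolvable from the current shell.
  if ! command -v speechall >/dev/null 2>&1; then
    echo "speechall not found on PATH" >&2
    return 1
  fi
  # The API key must be set and non-empty.
  if [ -z "${SPEECHALL_API_KEY:-}" ]; then
    echo "SPEECHALL_API_KEY is not set" >&2
    return 2
  fi
}
```

Calling `speechall_preflight || exit 1` at the top of a script stops it before any audio is uploaded.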

Basic transcription

The default command transcribes a file:

speechall interview.mp3

The transcribe subcommand is optional, so the following command is equivalent:

speechall transcribe interview.mp3

Output is written to stdout. That means it composes naturally with standard shell tools:

speechall interview.mp3 > transcript.txt

Errors go to stderr, so redirecting stdout gives you the transcript without mixing it with diagnostics. On macOS, video files such as .mp4 and .mov can be converted to audio automatically before upload. On Linux, pass audio formats such as .wav, .mp3, .m4a, or .flac directly.
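Because transcripts and diagnostics travel on separate streams, a batch loop can capture each one on its own. The sketch below is one way to do that; the function name `transcribe_dir` and the `errors.log` file name are hypothetical, and it assumes the CLI exits with a non-zero status when a transcription fails, which this article does not state.

```shell
# Hypothetical helper: transcribe every .mp3 in a directory.
# Transcripts land next to the audio as .txt; diagnostics are appended to errors.log.
transcribe_dir() {
  dir=$1
  failed=0
  for f in "$dir"/*.mp3; do
    [ -e "$f" ] || continue            # no matches: the glob stays literal, so skip it
    out="${f%.mp3}.txt"
    # Assumption: speechall returns a non-zero exit status on failure.
    if ! speechall "$f" > "$out" 2>> "$dir/errors.log"; then
      echo "failed: $f" >&2
      rm -f "$out"                     # do not leave an empty or partial transcript behind
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]                  # non-zero status if any file failed
}
```

Running `transcribe_dir ./recordings` leaves one `.txt` file per successful recording and collects everything the CLI wrote to stderr in a single log.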

Choosing a model

By default, the CLI uses openai.gpt-4o-mini-transcribe. You can select a specific provider and model with --model:

speechall call.wav --model deepgram.nova-2

You can also provide a language hint:

speechall call.wav --model openai.gpt-4o-mini-transcribe --language en

Model identifiers use a provider-prefixed format, such as openai.gpt-4o-mini-transcribe or deepgram.nova-2. The important part is that switching providers does not require switching tools. You can keep the same CLI command shape while testing different providers for accuracy, language support, formatting, latency, or cost.
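Since only the `--model` value changes between providers, a side-by-side comparison fits in a short loop. This is a sketch rather than a built-in feature: the function name `compare_models` and the output naming scheme are hypothetical.

```shell
# Hypothetical helper: run one file through several models for side-by-side review.
compare_models() {
  audio=$1
  shift                                # remaining arguments are model identifiers
  for model in "$@"; do
    out="${audio%.*}.${model}.txt"     # e.g. call.deepgram.nova-2.txt
    speechall "$audio" --model "$model" > "$out" || echo "failed: $model" >&2
  done
}
```

For example, `compare_models call.wav openai.gpt-4o-mini-transcribe deepgram.nova-2` writes one transcript per model, which you can then inspect with `diff` or your editor.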

Output formats

The CLI supports the common transcription output formats exposed by the Speechall API: text, json, verbose_json, srt, and vtt.

Use plain text when you want the transcript itself:

speechall meeting.wav --output-format text > meeting.txt

Use JSON when another program will consume the result:

speechall meeting.wav --output-format json > meeting.json

Use verbose_json when you need richer provider output, such as timestamps or segment-level details where supported:

speechall meeting.wav --output-format verbose_json > meeting.verbose.json

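For subtitles, generate SRT or VTT directly. The example uses a video file, which relies on the automatic audio conversion available on macOS; on Linux, pass an audio file instead: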

speechall webinar.mp4 --output-format srt > webinar.srt
speechall webinar.mp4 --output-format vtt > webinar.vtt

This is one of the most practical uses of the CLI: take a recording, generate a subtitle file, and hand it to a video platform or post-processing pipeline without writing glue code.

Speaker diarization

For meetings, interviews, podcasts, and customer calls, a single block of text is often not enough. Speaker diarization labels who spoke when, provided the selected model supports it:

speechall meeting.wav --diarization --speakers-expected 3

Diarization can be combined with subtitle output:

speechall meeting.wav \
  --model deepgram.nova-2 \
  --diarization \
  --speakers-expected 3 \
  --output-format srt > meeting.srt

The --speakers-expected value is a hint, not a replacement for model capability. Use model discovery to confirm which models support diarization before building a workflow around it.
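That check can be scripted. The sketch below assumes `speechall models` prints a JSON array of objects with an `id` field, as the jq examples in this article suggest, and that `jq` is installed. The function names `supports_diarization` and `diarize_meeting` are hypothetical.

```shell
# Hypothetical helper: succeed only if the model is in the diarization-capable list.
# Assumes the models output is a JSON array of objects that each carry an "id".
supports_diarization() {
  model=$1
  speechall models --diarization \
    | jq -e --arg id "$model" 'any(.[]; .id == $id)' >/dev/null
}

# Hypothetical wrapper: refuse to run when the catalog does not list the model.
diarize_meeting() {
  audio=$1 model=$2 speakers=$3
  if supports_diarization "$model"; then
    speechall "$audio" --model "$model" --diarization --speakers-expected "$speakers"
  else
    echo "$model is not listed as diarization-capable" >&2
    return 1
  fi
}
```

A call such as `diarize_meeting meeting.wav deepgram.nova-2 3 > meeting.txt` then fails with a clear message instead of silently producing unlabeled text.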

Custom vocabulary and prompts

Transcription quality often depends on domain words: product names, customer names, medical terms, acronyms, internal project names, or words from another language that appear inside mostly English audio.

For models that support vocabulary boosting, pass repeated --custom-vocabulary flags:

speechall medical-note.wav \
  --custom-vocabulary "myocardial" \
  --custom-vocabulary "infarction" \
  --custom-vocabulary "troponin"

You can also provide an initial prompt where the model supports it:

speechall product-demo.wav \
  --initial-prompt "This recording discusses Speechall, unified speech APIs, and provider routing."

These options are intentionally exposed at the CLI layer because the best vocabulary list is often discovered while working with real audio. You can iterate from the terminal before turning the same parameters into an application-level integration.
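Once a vocabulary list grows past a few terms, keeping it in a file is easier than retyping flags. The following sketch turns a file with one term per line into repeated `--custom-vocabulary` flags. The function name `transcribe_with_vocab` and the file layout are hypothetical; the CLI itself is only documented here as accepting the repeated flag.

```shell
# Hypothetical helper: expand a one-term-per-line file into --custom-vocabulary flags.
transcribe_with_vocab() {
  audio=$1
  vocab_file=$2
  set --                               # reuse the positional parameters as an argument list
  while IFS= read -r term || [ -n "$term" ]; do
    [ -n "$term" ] || continue         # skip blank lines
    set -- "$@" --custom-vocabulary "$term"
  done < "$vocab_file"
  speechall "$audio" "$@"
}
```

With a `terms.txt` containing `myocardial`, `infarction`, and `troponin` on separate lines, `transcribe_with_vocab medical-note.wav terms.txt` is equivalent to passing the three flags by hand. Quoting `"$term"` keeps multi-word terms intact.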

Discovering and filtering models

The models command lists available speech-to-text models as JSON:

speechall models

You can filter by provider:

speechall models --provider deepgram

You can filter by language:

speechall models --language tr

Capability filters combine with AND logic, which makes them useful for narrowing a list to models that fit a real requirement:

speechall models --language tr --diarization
speechall models --provider openai --srt
speechall models --streamable --vocabulary

The output is deliberately machine-readable. Instead of checking a static table, you can query the current model catalog and reuse the same command in scripts.

jq workflows

Because speechall models returns JSON, it works well with jq for inspection and automation.

Find the cheapest Deepgram model with SRT support:

speechall models --provider deepgram --srt \
  | jq 'sort_by(.cost_per_second_usd) | .[0].id'

List model IDs that support Turkish:

speechall models --language tr | jq '[.[].id]'

Compare diarization-capable models by price:

speechall models --diarization \
  | jq '[.[] | {id, cost: .cost_per_second_usd}] | sort_by(.cost)'

List the providers currently represented in the catalog:

speechall models | jq '[.[].provider] | unique'

Find models that support both streaming and custom vocabulary:

speechall models --streamable --vocabulary | jq '[.[].id]'

These small commands are useful during provider evaluation. You can answer questions like “which models support this language and subtitles?” without leaving the terminal.
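The same queries can feed a transcription command directly. As a sketch, the functions below pick the lowest-cost model that supports a given language and SRT output, then use it to generate subtitles. The names `cheapest_srt_model` and `subtitle_cheaply` are hypothetical, and the `id` and `cost_per_second_usd` fields are taken from the jq examples above rather than from a documented schema.

```shell
# Hypothetical helper: print the lowest-cost model ID for a language with SRT support.
# Prints nothing if the filtered catalog is empty.
cheapest_srt_model() {
  lang=$1
  speechall models --language "$lang" --srt \
    | jq -r 'sort_by(.cost_per_second_usd) | .[0].id // empty'
}

# Hypothetical wrapper: generate SRT subtitles with whichever model was selected.
subtitle_cheaply() {
  media=$1 lang=$2
  model=$(cheapest_srt_model "$lang")
  if [ -z "$model" ]; then
    echo "no SRT-capable model found for language: $lang" >&2
    return 1
  fi
  speechall "$media" --model "$model" --language "$lang" --output-format srt
}
```

For example, `subtitle_cheaply talk.wav tr > talk.srt` selects a model at run time, so the script keeps working as the catalog changes. Lowest cost is only one possible selection rule; accuracy for your audio may matter more.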

Installing the AI agent skill

The CLI also ships with an AI agent skill. This lets compatible agents transcribe local recordings, create subtitle files, and inspect available speech models through the same command-line tool.

Install it with:

npx skills add https://github.com/Speechall/speechall-cli --skill speechall-cli

After installation, an agent can use the CLI for tasks such as transcribing a meeting recording into plain text, generating an SRT file from a video, or selecting a model that supports diarization and a specific language.

When to use it

Use the Speechall CLI when you need a fast, scriptable path from media file to transcript. It is especially useful for developers comparing providers, teams generating subtitles from recordings, and automation workflows where stdout, JSON, and shell composition matter.

It does not replace the API for production application code. Instead, it gives you a practical terminal interface to the same unified Speechall API, so you can test models, validate options, and build repeatable transcription workflows before deciding what belongs in your application.