Stop Paying for Transcription — Your Mac Does It Better, Free

AI Power User

Local Whisper on Apple Silicon transcribes an hour of audio in minutes without the file ever leaving your machine

I used to pay $17 a month for cloud transcription. Then I timed my Mac doing the same job: a 62-minute client call, transcribed locally in 4 minutes 50 seconds, with fewer errors on technical vocabulary than the paid service produced — and the recording never left my SSD. I cancelled the subscription that afternoon and haven’t looked back in over a year.

The engine behind this is OpenAI’s Whisper model, which is open source and runs beautifully on Apple Silicon. You can have it working in fifteen minutes via two routes: a free command-line path (whisper.cpp) and a polished GUI path (MacWhisper). I’ll walk through both, give you honest accuracy comparisons against Otter and Rev, and finish with the workflow that changed how I handle meetings: record → transcribe → summarize, 100% on-device.

Why local transcription beats the paid services

Speed. On an M-series Pro or better, whisper.cpp with Core ML acceleration transcribes at roughly 10-15x real time using the large-v3-turbo model — an hour of audio in 4-6 minutes. The medium model runs 20-30x real time if you want an hour done in two or three minutes. Cloud services are often slower than this once you count upload time for a 200MB recording.
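The arithmetic behind those figures is easy to check yourself. Here is a minimal sketch; the two function names are made up for illustration and are not part of any tool mentioned here.

```shell
# speed_factor <audio_seconds> <wall_clock_seconds>
# How many times faster than real time a transcription run was.
speed_factor() {
  awk -v a="$1" -v w="$2" 'BEGIN { printf "%.1f\n", a / w }'
}

# estimate_minutes <audio_minutes> <speed_factor>
# Expected wall-clock minutes for a recording at a given real-time factor.
estimate_minutes() {
  awk -v m="$1" -v x="$2" 'BEGIN { printf "%.1f\n", m / x }'
}

speed_factor 3720 290     # the 62-minute call done in 4 min 50 s -> 12.8
estimate_minutes 60 10    # slow end of the large-v3-turbo range  -> 6.0
estimate_minutes 60 15    # fast end of the range                 -> 4.0
```

Timing one of your own recordings and feeding the numbers to speed_factor is the quickest way to see where your machine lands in that range.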

Accuracy. Whisper large-v3 benchmarks at roughly 4-8% word error rate on clean English speech, equal to or better than what I measured from Otter and in the same league as Rev’s automated tier. (Rev’s human transcription is still more accurate, but at $1.99/minute that hour-long call would cost about $120.) In my side-by-side tests on three real meeting recordings, local Whisper beat Otter specifically on technical terms: it nailed “Kubernetes ingress” and “idempotency” where the cloud service produced inventive nonsense. Where paid services claw back value is diarization and meeting-bot integrations, which I’ll be honest about below.
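If you want to run your own comparison, word error rate is simply the word-level edit distance between a hand-corrected reference and the machine transcript, divided by the number of words in the reference. The sketch below is one way to compute it with awk. It is not the exact method behind the numbers above, the wer function name is invented for this example, and the normalization (lowercase, ASCII letters and digits only) is an assumption that suits English but would mangle Czech.

```shell
# wer <reference.txt> <hypothesis.txt>
# Prints word error rate in percent:
#   (substitutions + deletions + insertions) / words in the reference
wer() {
  awk '
    # Lowercase and drop punctuation so "Kubernetes," matches "kubernetes".
    function norm(s) { s = tolower(s); gsub(/[^a-z0-9 \t]/, "", s); return s }

    # First file: collect reference words. Second file: collect hypothesis words.
    FNR == NR { n = split(norm($0), w, " "); for (i = 1; i <= n; i++) ref[++r] = w[i]; next }
              { n = split(norm($0), w, " "); for (i = 1; i <= n; i++) hyp[++h] = w[i] }

    END {
      if (r == 0) { print "reference is empty" > "/dev/stderr"; exit 1 }
      # Standard two-row Levenshtein distance, counted in words.
      for (j = 0; j <= h; j++) prev[j] = j
      for (i = 1; i <= r; i++) {
        cur[0] = i
        for (j = 1; j <= h; j++) {
          cost = (ref[i] == hyp[j]) ? 0 : 1
          best = prev[j - 1] + cost                          # match or substitution
          if (prev[j] + 1 < best) best = prev[j] + 1         # deletion
          if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1   # insertion
          cur[j] = best
        }
        for (j = 0; j <= h; j++) prev[j] = cur[j]
      }
      printf "%.1f\n", 100 * prev[h] / r
    }
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   wer corrected-by-hand.txt meeting.wav.txt
```

A few minutes of hand-corrected transcript is enough for a rough figure, and running the same reference against each service’s output keeps the comparison fair.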

Multilingual. Whisper handles roughly 100 languages, and it copes with my bilingual reality (meetings that drift between Czech and English mid-sentence) better than any commercial service I tried. It can also translate any of those languages into English during transcription with a single flag.

Privacy. This is the unbeatable one. Lawyers recording client consultations, doctors dictating patient notes, journalists protecting sources, anyone under NDA: uploading those recordings to a cloud transcription service ranges from professionally questionable to outright malpractice. Local Whisper removes the question entirely. There is no server. There is no data processing agreement, because nothing is processed anywhere but the machine in front of you.

Path 1: whisper.cpp with Core ML (free, CLI)

whisper.cpp is the high-performance C++ port of Whisper, with a Core ML backend that pushes the encoder onto the Apple Neural Engine.

brew install whisper-cpp ffmpeg

# Download the best speed/quality model (~1.6GB)
mkdir -p ~/whisper-models
curl -L -o ~/whisper-models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Transcribe (Whisper wants 16kHz WAV; ffmpeg converts anything)
ffmpeg -i meeting.m4a -ar 16000 -ac 1 -c:a pcm_s16le meeting.wav
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin -f meeting.wav \
  -otxt -osrt -l auto

That produces meeting.wav.txt (plain text) and meeting.wav.srt (subtitles with timestamps). The flag -l auto detects the language; add --translate to get English output from foreign-language audio. For maximum Neural Engine acceleration, grab the Core ML encoder for the same model from the same Hugging Face repo (ggml-large-v3-turbo-encoder.mlmodelc; if it is offered as a .zip, unzip it first) and drop it next to the .bin. A whisper.cpp build with Core ML support enabled picks it up automatically, and the first run compiles it (this takes a few minutes once, then it’s cached).
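A quick way to confirm the Core ML encoder is where it needs to be is to derive its expected path from the model path. The sketch below assumes the naming shown above (the .bin suffix swapped for -encoder.mlmodelc, in the same folder); both helper names are invented for this example.

```shell
# coreml_encoder_path <model.bin>
# Prints the path where the matching Core ML encoder is expected to sit.
coreml_encoder_path() {
  printf '%s-encoder.mlmodelc\n' "${1%.bin}"
}

# has_coreml_encoder <model.bin>
# Succeeds if that encoder exists (a compiled .mlmodelc is a directory).
has_coreml_encoder() {
  [ -d "$(coreml_encoder_path "$1")" ]
}

model="$HOME/whisper-models/ggml-large-v3-turbo.bin"
if has_coreml_encoder "$model"; then
  echo "Core ML encoder found: $(coreml_encoder_path "$model")"
else
  echo "No Core ML encoder next to $model"
fi
```

This only checks the file layout. Whether your whisper-cli binary was actually built with Core ML support is a separate question, and its startup log is the place to look for that.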

Model picking guide: large-v3-turbo is the right default — near-large-v3 accuracy at several times the speed. Drop to medium (~1.5GB, faster) for clean podcast-quality audio, or small (~466MB) on an 8-16GB machine. Skip tiny and base for anything that matters.
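That guide fits in a few lines of shell. This is a sketch of my reading of it, not a rule: the 16GB cutoff and the optional "clean" argument are assumptions, and pick_model is a made-up name.

```shell
# pick_model <ram_gb> [clean]
# Prints the ggml model file name suggested by the guide above.
pick_model() {
  if [ "$1" -le 16 ]; then
    echo "ggml-small.bin"            # memory-constrained machine
  elif [ "${2:-}" = "clean" ]; then
    echo "ggml-medium.bin"           # clean, podcast-quality audio
  else
    echo "ggml-large-v3-turbo.bin"   # the default recommendation
  fi
}

pick_model 8          # -> ggml-small.bin
pick_model 32         # -> ggml-large-v3-turbo.bin
pick_model 32 clean   # -> ggml-medium.bin
```

On macOS, sysctl -n hw.memsize reports installed memory in bytes; divide by 1073741824 to get the gigabyte figure this function expects.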

Path 2: MacWhisper (GUI, drag and drop)

If the terminal isn’t your idea of a good time, MacWhisper wraps the same engine in a native Mac app. The free version includes the smaller models; the Pro version (a ~€59 one-time purchase, not a subscription) unlocks the large models, batch transcription, system-audio capture for transcribing calls directly, and — usefully — experimental speaker diarization.

The workflow is exactly what you’d hope: drag an audio or video file onto the window, pick a model, get a transcript with timestamps you can search, edit, and export as TXT, SRT, VTT, DOCX, or Markdown. It even transcribes the audio track of a screen recording directly. For most people this is the right answer: pay once, never think about the plumbing.

The honest limitations

Speaker diarization is Whisper’s weak spot. Vanilla Whisper produces one undifferentiated stream of text — no “Speaker 1 / Speaker 2” labels. Otter does this natively and it’s genuinely their best feature. Workarounds exist: MacWhisper Pro’s diarization is decent for two or three clearly distinct voices, and the open-source WhisperX project adds proper diarization via pyannote, but it’s a Python-environment project rather than a one-liner. For multi-speaker interviews where attribution is critical, this is the one scenario where I’d still consider a paid service.

No live meeting bot. Otter joins your Zoom calls and transcribes in real time. Local Whisper transcribes recordings after the fact. My fix: record the meeting (with consent — say it out loud, every time), then transcribe the file two minutes after the call ends. For me the delay costs nothing; for someone who needs live captions, it’s a real difference.

Hallucinations on silence. Whisper occasionally invents phantom sentences during long silent stretches. The --no-speech-thold flag mitigates it, and trimming dead air with ffmpeg first helps too. Worth knowing so a hallucinated sentence never makes it into meeting minutes unreviewed.
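For the ffmpeg route, the silenceremove audio filter can drop dead air during the same pass that converts to 16kHz WAV. This is a sketch with starting values I picked for illustration (silences longer than 2 seconds, quieter than -40dB); tune both for your recordings. The helper names are invented for this example.

```shell
# silence_filter <min_silence_seconds> <threshold_db>
# Builds an ffmpeg filter string that removes every silent stretch longer than
# the given duration (stop_periods=-1 applies it throughout the file).
silence_filter() {
  printf 'silenceremove=stop_periods=-1:stop_duration=%s:stop_threshold=%sdB\n' "$1" "$2"
}

# trim_silence <input> <output.wav>
# Converts to 16kHz mono WAV and strips long silences in one ffmpeg pass.
trim_silence() {
  ffmpeg -y -i "$1" -af "$(silence_filter 2 -40)" \
    -ar 16000 -ac 1 -c:a pcm_s16le "$2"
}

# Usage (file names are placeholders):
#   trim_silence meeting.m4a meeting-trimmed.wav
```

One trade-off to keep in mind: once silence is cut, the timestamps in the resulting .srt no longer line up with the original recording, so keep the untrimmed file if you need accurate subtitles.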

The full private workflow: meeting → transcript → summary

Here’s the pipeline that replaced my subscription, end to end on-device:

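#!/bin/zsh
# transcribe-and-summarize.sh, usage: ./transcribe-and-summarize.sh recording.m4a
set -euo pipefail   # abort if any step fails, so a broken transcript is never summarized
[[ $# -eq 1 ]] || { echo "usage: $0 recording.m4a" >&2; exit 1; }
# Convert to the 16kHz mono WAV that Whisper expects (-y overwrites the previous run)
ffmpeg -y -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le /tmp/audio.wav
# -of sets the output base name, so the transcript lands in /tmp/transcript.txt
whisper-cli -m ~/whisper-models/ggml-large-v3-turbo.bin \
  -f /tmp/audio.wav -l auto -otxt -of /tmp/transcript
# Hand the transcript to a local model; the summary is written next to the recording
ollama run qwen2.5:32b "Summarize this meeting transcript into: \
1) key decisions, 2) action items with owners, 3) open questions. \
$(cat /tmp/transcript.txt)" > "${1%.*}-summary.md"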

One command: audio in, Markdown minutes out. The transcript feeds a local LLM (see day one of this series for the Ollama setup), so even the summarization never touches a server. Total marginal cost: zero. Total time for a one-hour meeting: about six minutes, unattended.
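One thing worth checking before you trust the minutes: a local model only sees as much of the transcript as fits in its context window, and Ollama’s num_ctx setting may be smaller than an hour of conversation. The sketch below estimates transcript size using a rough rule of thumb of 1.3 tokens per English word. That ratio is an assumption, not a measurement, and both function names are invented for this example.

```shell
# approx_tokens <transcript.txt>
# Rough token estimate: word count times 1.3 (integer math, rounded down).
approx_tokens() {
  wc -w < "$1" | awk '{ printf "%d\n", int($1 * 13 / 10) }'
}

# fits_context <transcript.txt> <context_tokens>
# Succeeds if the estimate fits within the given context size.
fits_context() {
  [ "$(approx_tokens "$1")" -le "$2" ]
}

# Usage (the 8192 figure is only an example context size):
#   fits_context /tmp/transcript.txt 8192 || echo "Transcript may be truncated" >&2
```

If the estimate is over, either raise num_ctx for the model or summarize the transcript in chunks and then summarize the summaries.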

I record voice memos on walks, too, and transcribe them into my notes. The system’s only failure mode so far has been entirely analog: Mochi, my British lilac, has decided the warm spot behind the Mac Studio’s exhaust vent is hers, and a transcription run makes it warmer. One memo ends with ninety seconds of purring, which Whisper — correctly, if uselessly — declined to transcribe.

Fifteen-minute setup checklist

  1. (2 min) Decide your path: CLI (free, scriptable) or MacWhisper (GUI, one-time purchase for Pro).
  2. (5 min) CLI: brew install whisper-cpp ffmpeg, download large-v3-turbo. GUI: download MacWhisper, let it fetch a model.
  3. (3 min) Test on any voice memo. Check the timestamps and technical vocabulary.
  4. (5 min) Wire up the summarize step with Ollama if you run local models.
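
Before the first real run, a quick check that every tool in the checklist is actually on your PATH saves a confusing failure later. This is a minimal sketch; missing_tools is a made-up helper name.

```shell
# missing_tools <tool>...
# Prints each named command that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Empty output means the CLI path is ready to go.
missing_tools whisper-cli ffmpeg ollama
```

Leave ollama off the list if you are skipping the summarize step.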

Then do the satisfying part: open your transcription service’s billing page and hit cancel. Your Mac was the better transcription machine all along — it just never advertised.