Your Mac Can Summarize Any YouTube Video in 30 Seconds


AI Power User

A yt-dlp and local LLM pipeline that turns hour-long videos into key points before your coffee cools
yt-dlp, whisper, ollama, youtube-summarizer

My YouTube “Watch Later” list used to be where curiosity went to die: 140 videos, mostly hour-long podcasts and conference talks I would statistically never watch. Now I run a roughly 30-line shell script instead: it pulls the transcript, feeds it to a local model on my MacBook, and hands me a summary with key points (plus approximate timestamps, once the transcript keeps its timing data). For a typical 20-minute video, the whole thing takes about 30 seconds. I’ll show you the exact pipeline, give you the script to copy, and then break down that 30-second claim honestly, because it has conditions.

The insight that makes it fast: don’t transcribe, download

Everyone’s first instinct is: download audio, run Whisper, summarize. That works, but it’s the slow path. The fast path exploits a fact most people miss — almost every YouTube video already has a transcript, either uploader-provided or auto-generated by Google’s own speech recognition. yt-dlp can grab it directly in seconds, no audio download, no transcription compute:

brew install yt-dlp ollama jq
ollama pull llama3.1:8b
yt-dlp --skip-download --write-auto-subs --sub-langs en --sub-format json3 "VIDEO_URL"

That command fetches the subtitle track as JSON with word-level timing in 2–4 seconds on a normal connection. The transcription step — the expensive part — was already done by Google, for free, before you ever showed up.
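To see what that file holds, here is a minimal sketch built on a hand-written sample in the json3 shape. The sample text and the helper name flatten_json3 are illustrative assumptions, not output from a real video. Each event carries a start time in tStartMs and a list of segs whose utf8 fields hold the caption text, so flattening the track is a matter of joining those fields:

```shell
#!/bin/bash
# Hand-written sample in the json3 layout (illustrative, not from a real video).
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
{"events": [
  {"tStartMs": 0, "dDurationMs": 100, "id": 1},
  {"tStartMs": 120, "dDurationMs": 2400,
   "segs": [{"utf8": "welcome"}, {"utf8": " back", "tOffsetMs": 400}]},
  {"tStartMs": 2520, "dDurationMs": 10, "segs": [{"utf8": "\n"}]},
  {"tStartMs": 2530, "dDurationMs": 2100,
   "segs": [{"utf8": "to"}, {"utf8": " the", "tOffsetMs": 300},
            {"utf8": " show", "tOffsetMs": 600}]}
]}
EOF

# flatten_json3 FILE: print the caption text as a single line of plain words.
# Events without a "segs" array (window setup and similar) are skipped.
flatten_json3() {
  jq -r '[.events[] | select(.segs) | .segs[].utf8] | join("")' "$1" \
    | tr -s ' \n' ' '
}

flatten_json3 "$SAMPLE"
```

The per-segment tOffsetMs values are what give auto-generated captions their word-level timing; this flattening step throws them away and keeps only the words.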

The script: copy, paste, summarize

Save this as ytsum, run chmod +x ytsum, drop it in ~/bin or anywhere on your PATH:

#!/bin/bash
# ytsum - summarize a YouTube video with a local LLM
# usage: ytsum "https://youtube.com/watch?v=..." [model]
set -euo pipefail
URL="${1:?usage: ytsum URL [model]}"
MODEL="${2:-llama3.1:8b}"
TMP=$(mktemp -d)
trap 'rm -rf "$TMP"' EXIT   # remove temp files even when a step fails

# Fast path: fetch uploader-provided or auto-generated English subtitles.
# --quiet/--no-warnings hide progress output but still let real errors through.
yt-dlp --quiet --no-warnings --skip-download --write-auto-subs --write-subs \
  --sub-langs "en.*" --sub-format json3 \
  -o "$TMP/video" "$URL"

SUBFILE=$(find "$TMP" -name '*.json3' | head -n 1)
if [ -z "$SUBFILE" ]; then
  echo "No subtitles found, falling back to Whisper..." >&2
  yt-dlp --quiet --no-warnings -x --audio-format m4a -o "$TMP/audio.%(ext)s" "$URL"
  uvx mlx-whisper "$TMP/audio.m4a" --model mlx-community/whisper-large-v3-turbo \
    --output-format txt -o "$TMP" >/dev/null 2>&1
  TRANSCRIPT=$(cat "$TMP/audio.txt")
else
  # Join all caption segments, then collapse newlines and repeated spaces.
  TRANSCRIPT=$(jq -r '[.events[] | select(.segs) | .segs[].utf8] | join("")' "$SUBFILE" | tr -s ' \n' ' ')
fi

# Instructions and transcript go to the model on stdin.
ollama run "$MODEL" <<EOF
Summarize this video transcript. Give me:
1. A 3-sentence summary
2. 5-8 key points as bullets
3. Any specific numbers, tools, or recommendations mentioned
4. Verdict: who should actually watch the full video?

Transcript: $TRANSCRIPT
EOF

Run ytsum "https://youtube.com/watch?v=dQw4w9WgXcQ" and read the summary in your terminal. The script tries subtitles first and only falls back to downloading audio and transcribing with mlx-whisper (Whisper running on Apple’s MLX framework, dramatically faster than the original on Apple Silicon) when a video genuinely has no captions, which is rare on YouTube and common on raw screen recordings. The fallback needs two extras the subtitle path does not: uv (which provides the uvx command) and ffmpeg (which yt-dlp uses to extract audio).

The 30-second claim, audited

Here’s the honest timing breakdown for a typical 20-minute video (a ~3,200-word transcript) on my M3 Pro MacBook with llama3.1:8b:

  • Subtitle fetch via yt-dlp: 2–4 seconds
  • jq parsing: under 1 second
  • LLM prompt processing (~4,000 tokens in): ~8 seconds
  • Summary generation (~400 tokens out at ~28 tok/sec): ~14 seconds
  • Total: 25–28 seconds. Claim delivered, on the subtitle path, on an M-series Pro chip.
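The breakdown above can be turned into a rough estimator. This is a back-of-the-envelope sketch, not a benchmark: the helper name estimate_seconds is hypothetical, and the rates are assumptions read off the figures above (about 1.25 tokens per word, about 500 tokens/sec of prompt processing, 28 tok/sec of generation, 3 seconds to fetch, 1 second to parse).

```shell
#!/bin/bash
# estimate_seconds WORDS SUMMARY_TOKENS: rough end-to-end time in seconds.
# All rates are assumptions taken from the M3 Pro figures in the list above.
estimate_seconds() {
  awk -v words="$1" -v gen="$2" 'BEGIN {
    tokens_in = words * 1.25      # ~3,200 words is roughly 4,000 tokens
    prompt_s  = tokens_in / 500   # ~4,000 tokens processed in ~8 s
    gen_s     = gen / 28          # ~28 tok/sec generation
    fetch_s   = 3                 # midpoint of the 2-4 s subtitle fetch
    parse_s   = 1                 # upper bound for jq parsing
    printf "%.0f\n", fetch_s + parse_s + prompt_s + gen_s
  }'
}

estimate_seconds 3200 400   # prints 26
```

Linear scaling is optimistic for long inputs: for a ~9,000-word transcript this sketch predicts about 41 seconds, below what I actually measure for hour-long videos, so treat it as a lower bound.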

Conditions where it stretches: an hour-long podcast (~9,000-word transcript) takes 50–70 seconds on the same hardware, mostly in prompt processing — still a spectacular trade for an hour of your life. On a base M1 with the same 8B model, double the numbers. And if there are no subtitles at all, the Whisper fallback adds real time: whisper-large-v3-turbo under MLX transcribes about 10–15x realtime on an M3 Pro, so a 20-minute video costs ~90 extra seconds. Slowest case, maybe three minutes total — versus twenty of watching.
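One thing to check before trusting summaries of long videos: Ollama applies its own default context window, and depending on your version and settings it can be smaller than a long transcript. When that happens, the input beyond the window is dropped and the summary covers only part of the video. A minimal sketch of one way around it is a model variant with a larger num_ctx; the name llama3.1-16k and the 16384 value are assumptions to adjust for your memory budget.

```shell
#!/bin/bash
# write_modelfile DIR: write a Modelfile that raises the context window.
write_modelfile() {
  cat > "$1/Modelfile" <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 16384
EOF
}

WORKDIR=$(mktemp -d)
write_modelfile "$WORKDIR"

# Build the variant only where Ollama is installed.
# "llama3.1-16k" is a hypothetical name; pick any you like.
if command -v ollama >/dev/null 2>&1; then
  ollama create llama3.1-16k -f "$WORKDIR/Modelfile" \
    || echo "ollama create failed (is the Ollama server running?)" >&2
fi
```

Then pass the variant as the second argument: ytsum "VIDEO_URL" llama3.1-16k. A larger window uses more memory and slows prompt processing, so raise it only as far as your longest transcripts need.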

One quality note: auto-generated captions lack punctuation and mangle names (“Llama 3” becomes “lamb of three”). The 8B model shrugs this off for summaries. If you need quotable accuracy, take the Whisper path deliberately — uploader-grade transcription is exactly what it’s for.
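The script as written has no switch for choosing Whisper on purpose, so here is a minimal sketch of one. The environment variable YTSUM_FORCE_WHISPER and the helper should_use_whisper are hypothetical names introduced for illustration, not part of the original script.

```shell
#!/bin/bash
# should_use_whisper SUBFILE: succeed when the Whisper path should run.
# SUBFILE is the subtitle file path, or empty if none was found.
# YTSUM_FORCE_WHISPER is a hypothetical opt-in switch.
should_use_whisper() {
  [ -z "$1" ] || [ "${YTSUM_FORCE_WHISPER:-0}" = "1" ]
}

# In ytsum, the condition would become:
#   if should_use_whisper "$SUBFILE"; then ... Whisper path ... fi
```

With that in place, YTSUM_FORCE_WHISPER=1 ytsum "VIDEO_URL" skips the captions even when they exist.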

The no-code paths: Raycast and Shortcuts

Not everyone wants a shell script, and two native routes get you most of the way.

Raycast: the “YouTube Summarizer” extension in the Raycast Store does the subtitle-grab-and-summarize loop in a GUI — paste a URL (or just have it on your clipboard), hit Enter, read the summary in a Raycast window. With Raycast’s AI settings pointed at local Ollama models (Settings → AI → Ollama), the entire flow stays on-device. This is what I set up on my wife’s Mac; she uses it for deciding whether hour-long gardening videos contain the five minutes she needs.

Shortcuts: build a three-action shortcut: Receive URL from Share Sheet → Run Shell Script (the full path to wherever you saved the script, for example ~/bin/ytsum "$1", with “Pass Input: as arguments”) → Show Result. Now “Summarize Video” appears in Safari’s share menu. The Run Shell Script action exists only on the Mac, so for the iPhone build a sibling shortcut around the “Run Script Over SSH” action pointed at your Mac, which then does the work.
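One likely snag, stated as an assumption to verify on your machine: the Run Shell Script action tends to start with a minimal PATH that does not include Homebrew’s directory, so ytsum may launch but fail to find yt-dlp, jq, or ollama. A small wrapper for the action body can cover that. The function name summarize_from_shortcut and the directory list are hypothetical; adjust them to where your tools live.

```shell
#!/bin/bash
# summarize_from_shortcut URL: run ytsum with a PATH that includes the
# usual install locations (assumed; adjust to your setup).
summarize_from_shortcut() {
  PATH="$HOME/bin:/opt/homebrew/bin:/usr/local/bin:$PATH"
  export PATH
  if ! command -v ytsum >/dev/null 2>&1; then
    echo "ytsum not found on PATH: $PATH" >&2
    return 127
  fi
  ytsum "$1"
}

# In the Shortcuts action body, end with:
#   summarize_from_shortcut "$1"
```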

What I actually use it for

After six months, the pattern is clear. Triage is the big one: every long video gets summarized before it earns watch time, and roughly 70% of my Watch Later list turned out to be skippable, because the summary was the value. Research: when writing a post, I pull key points from five conference talks in three minutes instead of an afternoon. Extraction: cooking videos become ingredient lists and step-by-step instructions (“any specific numbers, tools, or recommendations” in the prompt catches oven temperatures and quantities reliably); DIY and repair videos become tool lists I can check before starting. Podcast filtering: a 2.5-hour interview gets compressed into a verdict line such as “watch 1:14–1:32 for the part about unified memory, skip the rest”. That took two changes: adding “include approximate timestamps for each key point” to the prompt, and keeping each caption’s start time in the transcript. The json3 subtitle format carries that timing data, but the jq filter in the script above extracts only the text, so without the second change the model has no real timestamps to cite.
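A minimal sketch of a timestamp-preserving transcript, assuming the json3 layout where each caption event has a tStartMs start time in milliseconds. The helper name json3_with_timestamps and the sample data are illustrative, not from a real video.

```shell
#!/bin/bash
# json3_with_timestamps FILE: print one "[mm:ss] text" line per caption event.
# Minutes keep counting past 59, so 1h14m shows up as [74:00].
json3_with_timestamps() {
  jq -r '
    def pad2: tostring | if length < 2 then "0" + . else . end;
    .events[]
    | select(.segs)
    | (.tStartMs / 1000 | floor) as $s
    | ([.segs[].utf8] | join("") | gsub("\\s+"; " ")
       | ltrimstr(" ") | rtrimstr(" ")) as $text
    | select($text != "")
    | "[\($s / 60 | floor | pad2):\($s % 60 | pad2)] \($text)"
  ' "$1"
}

# Hand-written sample (illustrative).
TS_SAMPLE=$(mktemp)
cat > "$TS_SAMPLE" <<'EOF'
{"events": [
  {"tStartMs": 0, "dDurationMs": 100, "id": 1},
  {"tStartMs": 120, "segs": [{"utf8": "welcome"}, {"utf8": " back"}]},
  {"tStartMs": 2520, "segs": [{"utf8": "\n"}]},
  {"tStartMs": 75500, "segs": [{"utf8": "unified"}, {"utf8": " memory"}]}
]}
EOF

json3_with_timestamps "$TS_SAMPLE"
```

Swapping this in for the script’s TRANSCRIPT line gives the model real start times to quote. The trade-off is a longer prompt, since every caption event now carries a prefix, so processing time goes up with it.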

What it’s not for: anything where delivery is the content. Comedy, cinematography, music — a summary of a great video essay is a menu, not a meal. The script tells you whether to watch, and that’s precisely the point.

Total setup cost: three brew installs, one 4.9 GB model download, one script. Everything runs locally — no API keys, no per-video fees, no transcript of your viewing interests sent to anyone. Run it on the next video YouTube’s algorithm insists you need, and enjoy the small, slightly illicit thrill of getting the point in 30 seconds flat.