Your Mac Is a Local AI Supercomputer You're Using at 10% Capacity
Here’s a number that surprises almost everyone I show it to: my Mac Studio runs a 70-billion-parameter language model (the same class of model that powered the first wave of ChatGPT-style assistants) entirely offline, at conversational speed, while drawing less power than a gaming console does under load.
Most Mac owners have no idea their machine can do this. They bought it for video editing or Xcode builds, and the AI capability sits there unused. If your Mac has Apple Silicon and 16GB of RAM or more, you’re sitting on a genuinely capable local AI workstation and using maybe 10% of what it can do. Let me fix that.
Why Apple Silicon is weirdly good at this
The secret isn’t raw GPU horsepower. An RTX 4090 demolishes any M-series chip in pure compute. The secret is unified memory.
On a PC, your GPU has its own dedicated VRAM — typically 8GB, 16GB, maybe 24GB on a flagship card. A language model has to fit inside that VRAM to run fast. The moment it doesn’t fit, performance falls off a cliff because data shuffles back and forth over the PCIe bus.
On Apple Silicon, there’s no separation. The CPU, GPU, and Neural Engine all share one pool of memory at full bandwidth. A Mac Studio with 128GB of unified memory effectively has 128GB of “VRAM” (macOS reserves a slice for the system, roughly 25-35% by default, but the rest is fair game for the GPU). To get that much usable VRAM on the PC side, you’re buying multiple NVIDIA cards, a workstation board to hold them, and a power supply that sounds like a hairdryer.
The second ingredient is memory bandwidth, which is the actual bottleneck for LLM inference — every token generated requires reading the entire model from memory. The numbers per chip family:
- M-series base chips: ~70-120 GB/s (the original M1 sits at the low end)
- Pro: ~150-270 GB/s depending on generation
- Max: ~400-550 GB/s
- Ultra: ~800+ GB/s
That Ultra figure is in the same neighborhood as dedicated GPUs. This is why a Mac Studio behaves like AI workstation hardware: not because Apple planned it, but because the architecture they built for video workflows happens to be exactly what LLM inference wants.
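To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes generation is purely memory-bound and that each token reads the whole weight file once, so it gives a ceiling, not a prediction. The function name and the 40 GB model size are my own illustrative choices.

```python
# Rough ceiling on generation speed for a memory-bound dense model:
# each generated token reads approximately the whole weight file once.

def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/sec if weight reads were the only cost."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb

# Approximate bandwidth figures from the list above. The 40 GB size is an
# assumed 4-bit 70B weight file, used only for illustration.
for chip, bandwidth in [("Max", 400.0), ("Ultra", 800.0)]:
    ceiling = max_tokens_per_sec(bandwidth, 40.0)
    print(f"{chip}: at most ~{ceiling:.0f} tokens/sec")
```

Real throughput lands below this ceiling because compute, the context cache, and system overhead all take their share.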
The five-minute install
The easiest on-ramp is Ollama. It bundles the model runtime, a model library, and an API server into one app.
brew install ollama
ollama serve
Or just download the Mac app, which adds a menu bar item and starts the server automatically. Then pull a model and talk to it:
ollama pull llama3.1:8b
ollama run llama3.1:8b "Explain unified memory like I'm a gamer."
That’s it. No CUDA drivers, no Python environments, no dependency hell. Ollama uses llama.cpp’s Metal backend under the hood, so the GPU acceleration just works.
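Once the server is running, you can also talk to it from a script. The following is a minimal sketch using only the Python standard library. It assumes Ollama is listening on its default local port (11434) and uses the streaming behavior of the `/api/generate` endpoint, where each line of the reply is a small JSON object. The helper names (`build_request`, `collect_text`, `ask`) are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def collect_text(lines) -> str:
    """Join the "response" fragments from a stream of JSON lines."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries stats, not more text
    return "".join(parts)

def ask(model: str, prompt: str) -> str:
    """Send one prompt to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return collect_text(resp)

# Usage (requires a running server and a pulled model):
# print(ask("llama3.1:8b", "Explain unified memory like I'm a gamer."))
```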
Match the model to your RAM tier
This is where most people go wrong: they either pull a model too big for their machine and conclude “local AI is slow,” or they run a tiny model and conclude “local AI is dumb.” The model size has to match your memory. My rule of thumb — the model’s quantized file size should stay under about 60-65% of your total RAM, leaving room for macOS, your browser, and the context window.
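The rule of thumb is easy to turn into a quick check. This is a small sketch: the function name is hypothetical, the default budget uses the conservative end of the 60-65% range, and the file sizes are round illustrative figures for 4-bit models.

```python
def fits_in_ram(model_file_gb: float, total_ram_gb: float, budget: float = 0.60) -> bool:
    """True if the quantized model file stays within the RAM budget."""
    return model_file_gb <= total_ram_gb * budget

# (total RAM in GB, model label, approximate 4-bit file size in GB)
candidates = [
    (16, "8B", 4.5),
    (16, "32B", 19.5),
    (48, "32B", 19.5),
    (128, "70B", 42.0),
]

for ram_gb, label, size_gb in candidates:
    verdict = "fits" if fits_in_ram(size_gb, ram_gb) else "too big"
    print(f"{label} model ({size_gb} GB) on {ram_gb} GB RAM: {verdict}")
```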
16GB (MacBook Air, base MacBook Pro): Run 7-8B parameter models at 4-bit quantization. llama3.1:8b, qwen2.5:7b, or mistral:7b all occupy roughly 4-5GB. These are genuinely useful for summarization, drafting, and basic coding questions. Don’t try anything bigger; you’ll hit swap and regret it.
36-48GB (MacBook Pro M-Pro/Max, base Mac Studio): This is the sweet spot tier. 32B-class models — qwen2.5:32b, qwen2.5-coder:32b — fit in about 19-20GB at 4-bit and are dramatically smarter than the 8B class. A 32B coder model is the first local model I’d trust to actually write code without supervision on every line.
64-128GB (Mac Studio Max/Ultra, top MacBook Pro): Welcome to 70B territory. llama3.3:70b at 4-bit needs roughly 40-43GB. On a 128GB machine you can keep a 70B model loaded and run your normal workload beside it. This is the class of model where local output starts feeling comparable to mainstream cloud assistants for everyday tasks.
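If you want to estimate the footprint of a model that isn't listed here, the arithmetic is simple: parameters times bits per weight, divided by eight. The sketch below assumes about 4.85 effective bits per weight, since 4-bit formats store scaling data alongside the weights. That figure is my own approximation, not an official number, so treat the output as a ballpark.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Approximate weight-file size in GB for a quantized model.

    bits_per_weight defaults to an assumed effective rate for common
    4-bit formats; real files vary by quantization scheme.
    """
    return params_billions * bits_per_weight / 8

for params in (8, 32, 70):
    print(f"{params}B at ~4-bit: about {quantized_size_gb(params):.1f} GB")
```

The results line up with the tier figures above: roughly 4-5GB, 19-20GB, and 40-43GB.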
Realistic speed expectations
Tokens per second is the number that matters — roughly, words generated per second (a token is ~0.75 of an English word). Reading speed is around 5-7 tokens/sec, so anything above 10 feels fluid. Realistic figures I’ve measured and cross-checked against community benchmarks for 4-bit quantized models:
| Chip | 8B model | 32B model | 70B model |
|---|---|---|---|
| M-series base | 10-20 t/s | too big | too big |
| Pro | 15-30 t/s | 5-9 t/s | too big |
| Max | 30-50 t/s | 10-18 t/s | 5-9 t/s |
| Ultra | 50-80 t/s | 20-30 t/s | 10-16 t/s |
Note the pattern: performance scales almost linearly with memory bandwidth, not GPU core count. An M-generation jump matters far less than moving from Pro to Max. If you’re speccing a machine for local AI, buy bandwidth and RAM, not CPU cores.
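To put the table's figures in reading-speed terms, here is a tiny conversion sketch. It uses the ~0.75 words-per-token approximation from above, which is a rough average for English text rather than an exact constant.

```python
WORDS_PER_TOKEN = 0.75  # rough average for English, as noted above

def words_per_minute(tokens_per_sec: float) -> float:
    """Convert a generation speed in tokens/sec to words per minute."""
    return tokens_per_sec * WORDS_PER_TOKEN * 60

for tps in (5, 7, 10, 30):
    print(f"{tps} tokens/sec is about {words_per_minute(tps):.0f} words/min")
```

By this measure, 5-7 tokens/sec works out to roughly 225-315 words per minute, and 10 tokens/sec to about 450.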
When local beats cloud — and when it doesn’t
I run both, and I’m not going to pretend local replaces everything. Here’s the honest split.
Local wins on:
- Privacy. Nothing leaves the machine. For contracts, medical notes, client code under NDA, unreleased work — this isn’t a nice-to-have, it’s the whole game.
- Offline. Planes, trains, conference Wi-Fi. My 32B model doesn’t care.
- No rate limits. Batch-process 500 documents overnight. Nobody throttles you, nobody charges per token.
- Cost at volume. Past the hardware, marginal cost is electricity — a Mac Studio under full inference load draws well under 200W, often much less.
- Latency consistency. No “high demand” slowdowns at 2pm on a Tuesday.
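For a feel of what an overnight batch job involves, here is a rough estimate. Every input is an assumption for illustration: 300 output tokens per document, 15 tokens/sec, and a 200W power draw as a deliberately pessimistic ceiling. It also ignores the time spent reading each prompt, so real runs will take longer.

```python
def batch_hours(num_docs: int, output_tokens_per_doc: int, tokens_per_sec: float) -> float:
    """Hours needed to generate output for a batch, ignoring prompt processing."""
    total_tokens = num_docs * output_tokens_per_doc
    return total_tokens / tokens_per_sec / 3600

def energy_kwh(hours: float, watts: float) -> float:
    """Energy used at a constant power draw."""
    return hours * watts / 1000

# Illustrative assumptions, not measurements.
hours = batch_hours(num_docs=500, output_tokens_per_doc=300, tokens_per_sec=15.0)
print(f"Generation time: about {hours:.1f} hours")
print(f"Energy at 200 W: about {energy_kwh(hours, 200):.2f} kWh")
```

Multiply the kWh figure by your local electricity rate to get the marginal cost of the run.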
Cloud wins on:
- Frontier reasoning. The hardest problems — subtle debugging, complex multi-step analysis — still go to the top cloud models. A 70B local model is roughly a generation behind the frontier, and for some tasks that gap is decisive.
- Massive context. Cloud models handle hundreds of thousands of tokens; local context is practical up to 16-32K before memory and speed suffer.
- Knowledge freshness and search. Local models know nothing after their training cutoff and can’t browse unless you wire that up yourself.
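The memory side of the context limit can be estimated too. The sketch below computes the size of the key/value cache a transformer keeps for each token of context. The layer and head counts are my assumptions for Llama-3-style 8B and 70B models, and the cache is assumed to hold 16-bit values; other models and cache settings will differ.

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate key/value cache size in GiB for a given context length."""
    # Factor of 2: one key vector and one value vector per layer per KV head.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 2**30

# Assumed Llama-3-style shapes (layers, KV heads, head dimension).
shapes = {"8B-class": (32, 8, 128), "70B-class": (80, 8, 128)}

for name, (layers, kv_heads, dim) in shapes.items():
    size = kv_cache_gib(32 * 1024, layers, kv_heads, dim)
    print(f"{name} at 32K context: about {size:.0f} GiB of cache")
```

That cache sits on top of the model weights, which is why long contexts eat into the memory budget so quickly.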
My actual workflow: local handles roughly 70% of my AI interactions by count — summaries, rewrites, quick code questions, transcription post-processing — and cloud handles the 30% that’s genuinely hard. The 70% used to cost me API fees and leak data. Now it costs nothing and leaks nothing.
Your 15-minute setup checklist
Do this today. Total time: about 15 minutes, most of it download.
- (1 min) Check your specs: Apple menu → About This Mac. Note the chip and RAM.
- (2 min) Install Ollama: `brew install ollama`, or grab the app from ollama.com.
- (5-8 min) Pull the right model for your tier: `llama3.1:8b` for 16GB, `qwen2.5:32b` for 36-48GB, `llama3.3:70b` for 64GB+. The download is the slow part: 5-40GB depending on tier.
- (1 min) Smoke test: `ollama run <model> "Summarize the plot of Hamlet in three sentences."` Watch the tokens stream.
- (2 min) Measure it: append `--verbose` to see your actual tokens/sec, or call the API: `curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"hi"}'`.
- (2 min) Open Activity Monitor and choose Window → GPU History while generating. Watch the GPU peg and the fans… do nothing. That silence is the unified memory architecture earning its keep.
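If you go the API route for measuring, the speed can be computed from the stats in the reply. This sketch assumes you added `"stream": false` to the request body so the server returns one JSON object, and it relies on the `eval_count` and `eval_duration` fields (the latter in nanoseconds). The sample values below are made up for illustration.

```python
def tokens_per_sec(final_response: dict) -> float:
    """Generation speed from the stats in an Ollama /api/generate reply."""
    eval_count = final_response["eval_count"]           # tokens generated
    eval_duration_ns = final_response["eval_duration"]  # time in nanoseconds
    return eval_count / (eval_duration_ns / 1e9)

# Hypothetical reply, trimmed to the fields used here.
sample = {"response": "...", "eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_sec(sample):.1f} tokens/sec")
```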
From there, the rabbit hole goes deep: web UIs, IDE integrations, fine-tuning, automation pipelines. I’ll cover all of it in this series. But step one is realizing the supercomputer was on your desk the whole time — and it takes a quarter of an hour to switch it on.
