Run DeepSeek-Class Models on a Mac — The Complete Guide

AI Power User

What reasoning models actually demand from Apple Silicon and which ones fit your RAM

When DeepSeek-R1 landed, half my feed announced they were “running DeepSeek on a MacBook.” Most of them weren’t — not the real one, anyway. The full R1 is a 671-billion-parameter mixture-of-experts model that wants hundreds of gigabytes of memory even aggressively quantized; that’s Mac Studio cluster territory, not laptop territory. What people actually run, and what this guide covers, are the distills — and the good news is that the distills are genuinely excellent, genuinely fit on consumer Macs, and bring something qualitatively new to local AI: visible reasoning. Here’s the complete picture, with honest numbers from my M2 Max Mac Studio (64 GB) and a 16 GB M1 Pro MacBook for the low end.

What a “distill” is and why it’s not a scam

DeepSeek took the reasoning behavior of the giant R1 and used it as a teacher: they generated hundreds of thousands of long chain-of-thought solutions with the big model, then fine-tuned existing smaller open models — Qwen and Llama bases at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters — on that output. The student models learn the habit of reasoning: decomposing problems, checking their own work, backtracking when a path fails.

So a “DeepSeek-R1 distill 14B” is not a shrunken R1. It’s a Qwen 14B that has been taught to think out loud like R1. That distinction sets correct expectations: you get a large fraction of the reasoning behavior on problems within the student’s knowledge, not the full model’s breadth. The same recipe now applies beyond DeepSeek — QwQ-32B and various R1-style fine-tunes work identically from a hardware standpoint, which is why I call this a guide to “DeepSeek-class” models.

The RAM math is the same as any local model: parameter count × bytes per parameter, plus context overhead. At 4-bit quantization, a 14B distill is ~9 GB, a 32B is ~20 GB, a 70B is ~40 GB. Apple Silicon’s unified memory means the GPU can use most of it (macOS holds back a share for the system by default), which is precisely why a 64 GB Mac runs models that a 24 GB-VRAM gaming PC cannot.
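The RAM formula above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the 4.8 bits per weight is my assumption for a Q4_K_M-style quant (4-bit formats store scaling data, so they land a little above 4 bits), and the function name and overhead parameter are made up for illustration.

```python
def model_ram_gb(params_billion: float,
                 bits_per_weight: float = 4.8,
                 context_overhead_gb: float = 0.0) -> float:
    """Estimate RAM in GB: parameter count x bytes per parameter, plus context overhead."""
    bytes_per_param = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    weights_gb = params_billion * bytes_per_param
    return weights_gb + context_overhead_gb


# Weights only, at the assumed 4.8 bits per weight.
for size in (14, 32, 70):
    print(f"{size}B: ~{model_ram_gb(size):.1f} GB of weights")
```

The results land in the same ballpark as the figures quoted above. Pass a `context_overhead_gb` of 1 to 2 to see what a long context adds on top.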

Reasoning tokens: why these models are slower and smarter

Run a distill and the first thing you’ll see is a <think> block — sometimes hundreds, sometimes thousands of tokens of the model arguing with itself before the answer appears. This is the entire trick. The model is spending inference-time compute to buy accuracy.

The practical consequence is brutal arithmetic. If your Mac generates 20 tokens/second and the model thinks for 1,500 tokens before answering in 200, you wait 75 seconds of thinking for 10 seconds of answer. A standard chat model would have answered in 10 seconds flat — possibly wrongly, but instantly. Reasoning models trade your time for their accuracy, every single query, whether the query deserves it or not. Ask one “what’s the capital of Moravia?” and it may spend 400 tokens deliberating before saying Brno (and then hedging, because technically Moravia isn’t an administrative region anymore — which, fine, is the kind of pedantry I’d produce too).
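The same arithmetic as a small Python helper, in case you want to plug in your own speed and typical thinking length. The function name is hypothetical, and it assumes generation speed stays constant across the whole response.

```python
def response_wait(tokens_per_second: float,
                  thinking_tokens: int,
                  answer_tokens: int) -> tuple[float, float, float]:
    """Return (thinking seconds, answer seconds, total seconds) at a constant speed."""
    thinking_s = thinking_tokens / tokens_per_second
    answer_s = answer_tokens / tokens_per_second
    return thinking_s, answer_s, thinking_s + answer_s


# The example from the text: 20 t/s, 1,500 thinking tokens, 200 answer tokens.
thinking_s, answer_s, total_s = response_wait(20, 1500, 200)
print(f"thinking {thinking_s:.0f} s, answer {answer_s:.0f} s, total {total_s:.0f} s")
```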

This is why tokens-per-second matters more for reasoning models than chat models, and why I recommend the largest distill your RAM allows only if the resulting speed stays above roughly 10 t/s. Below that, the thinking phase becomes a coffee break.

Quantization for reasoning models specifically

Quantization advice that’s fine for chat models needs one amendment here: reasoning chains compound errors. A slightly degraded model making a 2% sillier choice at each of 50 reasoning steps drifts off course in a way a single-shot answer never would. My rules after a few months of testing:

  • Q4_K_M is the floor. It’s the default on Ollama and it’s fine. I would not go to Q3 or Q2 on a reasoning model; the chains get visibly flakier — more backtracking, more “wait, let me reconsider” loops that never converge.
  • Q5_K_M or Q6_K if RAM allows. On math-heavy work I see fewer derailed chains at Q6 on the 14B than at Q4. The cost is ~25–40% more RAM.
  • Context is not optional. Reasoning output is enormous. Set num_ctx to at least 8192, ideally 16384, or the model will think itself out of its own context window mid-problem. Budget 1–2 GB extra RAM for that.

Start the model in a terminal, then set the context size at its interactive prompt:

  ollama run deepseek-r1:14b
  /set parameter num_ctx 16384
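To put a number on the compounding-error point that opens this section, here is a toy Python model. It assumes, purely for illustration, that each reasoning step independently goes wrong with a fixed probability and that one bad step derails the chain; real models can backtrack and recover, so treat the result as intuition rather than a prediction.

```python
def chain_survival(per_step_error: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return (1 - per_step_error) ** steps


# A 2% slip per step over 50 steps, versus a model that slips half as often.
print(f"2% per step: {chain_survival(0.02, 50):.2f}")
print(f"1% per step: {chain_survival(0.01, 50):.2f}")
```

Under these assumptions the 2% chain finishes cleanly only about a third of the time, which is why a small per-token quality loss shows up much more on long chains than on single-shot answers.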

Realistic performance by RAM tier

The numbers below come from my own machines plus corroborating community benchmarks. They are generation speeds at Q4_K_M on M-series Pro/Max-class chips; expect your own figures to vary by about ±20% with chip generation and memory bandwidth.

| Mac RAM | Sweet-spot distill | RAM used | Speed (approx.) | Verdict |
| --- | --- | --- | --- | --- |
| 16 GB | R1-distill 8B | ~6 GB | 25–35 t/s | Good daily driver; thinking phases ~30–60 s |
| 24 GB | R1-distill 14B | ~10 GB | 15–25 t/s | The best value tier |
| 32–48 GB | R1-distill 32B / QwQ-32B | ~21 GB | 10–15 t/s | Noticeably smarter; patience required |
| 64–128 GB | R1-distill 70B | ~42 GB | 5–9 t/s | Powerful but slow; thinking can take 3–5 min |
| 192 GB+ | Quantized full R1 (extreme quants) | 130 GB+ | Single digits | A stunt more than a workflow |
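If you script your setup, the table reduces to a small lookup. The sketch below is one way to write it in Python: the RAM thresholds and the lower-bound speeds come from the rows above, while the Ollama tag names and the function itself are my assumptions, so check the tags against the model library before relying on them. The full-R1 row is left out on purpose.

```python
# (minimum RAM in GB, assumed Ollama tag, low end of the speed range in t/s)
TIERS = [
    (64, "deepseek-r1:70b", 5),
    (32, "deepseek-r1:32b", 10),
    (24, "deepseek-r1:14b", 15),
    (16, "deepseek-r1:8b", 25),
]


def pick_distill(ram_gb: int, min_tps: float = 0):
    """Return the largest distill that fits ram_gb and meets min_tps, or None."""
    for min_ram, tag, low_tps in TIERS:
        if ram_gb >= min_ram and low_tps >= min_tps:
            return tag
    return None


print(pick_distill(64))              # largest model that fits
print(pick_distill(64, min_tps=10))  # largest model that also stays at 10 t/s or better
```

The `min_tps` argument encodes the earlier rule of thumb about staying above roughly 10 t/s: with it set, a 64 GB machine drops from the 70B to the 32B.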

A candid note on the 70B row: on my 64 GB Studio it runs, and for a hard code-review task the answers are the best I can get locally, but a five-minute end-to-end response changes how you use it. It becomes a “submit and switch tasks” tool, not a conversation. The 32B class at 10–15 t/s is, for most people with the RAM, the best intelligence-per-wait ratio in local AI right now. The 1.5B distill exists and runs on anything, but treat it as a curiosity: its chains look like reasoning without reliably being reasoning.

When reasoning models earn their wait — and when they’re overkill

After months of running both model types side by side, my routing rule is simple: reasoning models for problems with a right answer, chat models for everything else.

Where the distills clearly win: multi-step math and anything with units and edge cases; code review (“find the bug” — the 32B catches off-by-one and concurrency issues the same-size chat model breezes past); planning tasks with constraints (scheduling, capacity math, “what order do I migrate these services”); logic-heavy document analysis like contract clauses. The visible <think> block is also genuinely useful output — watching where the model second-guesses itself often points at the genuinely ambiguous part of your problem.

Where they’re overkill: summarization, translation (my Czech↔English work gains nothing from 1,000 tokens of deliberation), drafting emails, brainstorming, casual Q&A, and any latency-sensitive pipeline. For all of that, a plain Qwen2.5 or Llama 3.x at the same size is faster and equally good. I keep both installed and route by task; the reasoning model handles maybe 15% of my queries and earns its disk space on those alone.
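The routing rule is simple enough to write down. Here is a minimal Python sketch, assuming you label each task yourself; the category names, the function, and the two default model tags are hypothetical placeholders, not something the tools provide.

```python
# Task labels are illustrative; they mirror the two lists in the text above.
REASONING_TASKS = {"math", "code_review", "planning", "contract_analysis"}


def route(task: str,
          latency_sensitive: bool = False,
          reasoning_model: str = "deepseek-r1:32b",
          chat_model: str = "qwen2.5:32b") -> str:
    """Pick a model: reasoning for right-answer problems, chat for everything else."""
    if latency_sensitive:
        # Latency-sensitive pipelines never pay for the thinking phase.
        return chat_model
    return reasoning_model if task in REASONING_TASKS else chat_model


print(route("code_review"))
print(route("translation"))
print(route("math", latency_sensitive=True))
```

Anything not explicitly listed as a reasoning task falls through to the chat model, which matches the "chat models for everything else" default.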

Setup: the Ollama path and the LM Studio path

Ollama (the scriptable path): brew install ollama, then ollama run deepseek-r1:14b, picking the tag that matches your tier from the table. Set num_ctx as above, and set OLLAMA_KEEP_ALIVE=2h so you’re not reloading 20 GB every coffee break. The API at localhost:11434 returns the thinking inside <think> tags; if you only want the final answer, strip them in scripts with sed '/<think>/,/<\/think>/d'. That sed range deletes whole lines, so it assumes the opening and closing tags each sit on a line of their own.
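For scripts that outgrow sed, the same steps can be sketched in Python with only the standard library. This assumes an Ollama server is already listening on localhost:11434 and that the response text carries the <think> block inline; depending on your Ollama version the thinking may arrive in a separate field instead, in which case the stripping step simply does nothing. The helper names are my own.

```python
import json
import re
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def strip_think(text: str) -> str:
    """Remove any <think>...</think> block, even across lines, and trim whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def ask(prompt: str, model: str = "deepseek-r1:14b",
        num_ctx: int = 16384, keep_alive: str = "2h") -> str:
    """Send one non-streaming request and return only the final answer."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # context window, as recommended above
        "keep_alive": keep_alive,         # keep the model loaded between calls
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return strip_think(body["response"])
```

With the server running, print(ask("What is the capital of Moravia?")) prints just the answer. Unlike the sed one-liner, the regex also handles a closing tag that shares a line with the start of the answer.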

LM Studio (the visual path): better for experimentation. It shows the thinking phase live in a collapsible block, exposes quantization variants (that Q6 vs Q4 comparison takes two clicks), can run MLX builds on Apple’s MLX engine for a typical 10–20% speed bonus over the GGUF build of the same model, and warns you before you load something your RAM can’t hold. If you’re still choosing your tier, start in LM Studio to test, then move your winner to Ollama for daily scripted use.

The complete honest summary: you can’t run DeepSeek on your Mac — you can run something distilled from it that’s slower than a chat model, hungrier for context, pickier about quantization, and substantially smarter on exactly the problems where smart matters. Pick the row in the table that matches your RAM, accept the wait, and use it where right answers beat fast ones.