The M-Series Chip Buying Guide for AI — What Actually Matters
Apple’s configurator is designed to make you agonize over the wrong things. CPU core counts, the extra GPU cores, the Neural Engine TOPS number in the marketing slide — for local AI, almost none of it moves the needle. After buying three Apple Silicon machines for AI work (and watching friends buy the wrong ones), I can compress the entire decision into one sentence:
For local AI, buy memory bandwidth first, RAM size second, and ignore nearly everything else.
Here’s why that’s true, with the numbers to prove it, and exactly what to buy for your situation.
Why tokens per second is really gigabytes per second
LLM inference has a brutally simple inner loop: for every single token generated, the chip must read essentially all of the model’s weights from memory. A 4-bit quantized 8B model is ~4.5 GB of weights; generating 50 tokens/second means streaming ~225 GB/s from RAM to the GPU, every second, forever.
Compute is rarely the bottleneck — moving bytes is. Which gives you a back-of-napkin formula that predicts real-world performance shockingly well:
max tokens/sec ≈ memory bandwidth (GB/s) / model size in RAM (GB)
A 4.5 GB model on a 400 GB/s chip: ~89 tok/s theoretical, ~55–65 in practice (you never hit 100% bandwidth efficiency). The same model on a 120 GB/s base chip: ~27 tok/s theoretical, ~20 real-world. Same GPU architecture, same “Apple Intelligence” branding, and a roughly 3x difference, purely from bandwidth.
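The napkin formula is easy to turn into a few lines of Python. This is a minimal sketch, not a benchmark: the function names are my own, and the default `efficiency` of 0.65 is an assumption picked from the middle of the 55–65 out of ~89 range quoted above.

```python
def max_tokens_per_sec(bandwidth_gbs: float, model_gb: float) -> float:
    """Theoretical ceiling: each generated token streams all weights once."""
    return bandwidth_gbs / model_gb


def estimate_tokens_per_sec(bandwidth_gbs: float, model_gb: float,
                            efficiency: float = 0.65) -> float:
    """Ceiling scaled by an ASSUMED bandwidth efficiency in (0, 1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return efficiency * max_tokens_per_sec(bandwidth_gbs, model_gb)


def required_bandwidth_gbs(model_gb: float, tokens_per_sec: float) -> float:
    """Inverse view: bandwidth needed to sustain a target token rate."""
    return model_gb * tokens_per_sec


MODEL_GB = 4.5  # 4-bit 8B model, as in the text

for bandwidth in (120, 400):
    ceiling = max_tokens_per_sec(bandwidth, MODEL_GB)
    likely = estimate_tokens_per_sec(bandwidth, MODEL_GB)
    print(f"{bandwidth} GB/s: ceiling {ceiling:.0f} tok/s, estimate {likely:.0f} tok/s")

print(f"50 tok/s needs {required_bandwidth_gbs(MODEL_GB, 50):.0f} GB/s")
```

Swap in your own model size and the bandwidth of the chip you are considering; the efficiency factor is the only knob that needs real measurements to pin down.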
This is why the M-series tier ladder is the AI performance ladder:
| Chip tier | Memory bandwidth | Real-world 8B Q4 generation |
|---|---|---|
| M4 (base) | 120 GB/s | ~18–22 tok/s |
| M4 Pro | 273 GB/s | ~38–45 tok/s |
| M4 Max | 410–546 GB/s (binned) | ~60–75 tok/s |
| M3 Ultra | 819 GB/s | ~90–110 tok/s |
(Earlier generations follow the same shape: M1/M2 Pro ~200 GB/s, M1/M2 Max 400 GB/s, M1/M2 Ultra 800 GB/s. Careful buyers should note the trap in the Max tier: the cheaper binned M4 Max has 410 GB/s, the full chip 546 GB/s. That is 25% less bandwidth on the binned chip, or a third more on the full one, hiding behind one configurator click.)
The takeaway most buyers miss: an M2 Max from 2023 beats a brand-new base M4 for LLM inference by a wide margin, with 400 GB/s against 120 GB/s, or roughly 3x the bandwidth. Generation number is marketing; bandwidth is physics.
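The table above can be sanity-checked against the napkin formula by running it backwards: given an observed token rate, what share of peak bandwidth does it imply? The sketch below copies the table's figures and assumes the same ~4.5 GB model; pairing the slower Max figure with the binned chip and the faster one with the full chip is my assumption, not something the table states.

```python
MODEL_GB = 4.5  # 4-bit 8B model size used throughout the article

# (chip, bandwidth range in GB/s, observed tok/s range), copied from the table
TABLE = [
    ("M4 (base)", (120, 120), (18, 22)),
    ("M4 Pro", (273, 273), (38, 45)),
    ("M4 Max", (410, 546), (60, 75)),
    ("M3 Ultra", (819, 819), (90, 110)),
]


def implied_efficiency(tok_s: float, bandwidth_gbs: float,
                       model_gb: float = MODEL_GB) -> float:
    """Fraction of peak bandwidth implied by an observed token rate."""
    return tok_s * model_gb / bandwidth_gbs


def efficiency_range(bw_range, tok_range):
    """Pair low speed with low bandwidth and high with high, then sort."""
    low = implied_efficiency(tok_range[0], bw_range[0])
    high = implied_efficiency(tok_range[1], bw_range[1])
    return min(low, high), max(low, high)


for chip, bw_range, tok_range in TABLE:
    low, high = efficiency_range(bw_range, tok_range)
    print(f"{chip:<10} implies {low:.0%} to {high:.0%} of peak bandwidth")
```

On these numbers the base chip lands at roughly two thirds to four fifths of peak, while the Ultra lands at roughly half to three fifths. Scaling with bandwidth is close to linear but tapers at the top of the ladder, so treat the formula as a ceiling rather than a promise.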
RAM tiers: what each one unlocks
Bandwidth sets your speed; RAM sets your ceiling — the largest model you can run at all. Unified memory means the GPU can use most of system RAM (macOS reserves a chunk; plan on ~70–75% being safely available for models), and since Apple solders it, this is the decision you live with for the machine’s whole life.
16 GB — the 8B comfort zone. Llama 3.1 8B or Qwen3 8B at Q4 (~5 GB) runs comfortably alongside your browser and IDE. You can squeeze a 14B in if you close things. Fine for chat, summarization, the Shortcuts automations I’ve written about. Coding assistance at this tier is noticeably below cloud quality.
32–48 GB — the 32B class, the sweet spot. Qwen3 32B or similar at Q4 (~20 GB) fits with room to work. This is the tier where local coding models become genuinely useful rather than a party trick, and where most enthusiasts should land. 48 GB adds headroom for long contexts — KV cache grows with context length and people always forget to budget for it.
64–96 GB — the 70B class. Llama 3.3 70B at Q4 is ~40 GB; on 64 GB it fits with care, on 96 GB it fits with your life still running around it. 70B-class models are where local stops feeling like a compromise for reasoning-heavy work.
128–192 GB+ — frontier-scale open models. This is Mac Studio Ultra territory, and it’s the configuration where Macs do something no consumer NVIDIA setup can: run 100B+ models (and big MoE models like the 200B+ class at aggressive quantization) in a single silent box. The M3 Ultra goes to 512 GB — genuinely unique hardware for open-weight frontier models.
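The tiers above come down to a budget: weights plus KV cache have to fit in the share of RAM the GPU can safely use. Here is a rough sketch of that budget. The 4.5 bits per weight is an assumption chosen to match the ~4.5 GB figure for a 4-bit 8B model, the 70% usable fraction is the conservative end of the 70–75% range, and the layer, KV-head and head-dimension values are my reading of Llama 3.3 70B's published config, so check them against the model you actually run.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight footprint of a quantized model, in GB."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values (the factor 2) for every
    layer, KV head and token, stored at bytes_per_value (2 = fp16)."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


def fits(ram_gb: float, model_gb: float, kv_gb: float = 0.0,
         usable_fraction: float = 0.70) -> bool:
    """True if weights + KV cache fit in the RAM share assumed usable."""
    return model_gb + kv_gb <= ram_gb * usable_fraction


# ASSUMED architecture values for Llama 3.3 70B; verify before relying on them.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

model = weights_gb(70)
for ram in (64, 96):
    for ctx in (8_192, 32_768):
        kv = kv_cache_gb(context_tokens=ctx, **LLAMA_70B)
        verdict = "fits" if fits(ram, model, kv) else "does not fit"
        print(f"{ram} GB RAM, {ctx:>6} tokens: "
              f"{model:.1f} GB weights + {kv:.1f} GB KV cache -> {verdict}")
```

Under these assumptions the 70B model fits on 64 GB at an 8K context but not at 32K, while 96 GB handles both. That is the arithmetic behind "fits with care", and it leaves out whatever your other apps are holding.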
The Neural Engine doesn’t matter (much) — here’s why
The spec sheet shouts about Neural Engine TOPS, and for LLMs it’s nearly irrelevant. The frameworks that actually run local models — llama.cpp via Metal, MLX, anything Ollama or LM Studio wraps — execute on the GPU, not the ANE. The Neural Engine is optimized for small, fixed-shape models (camera processing, Face ID, Apple Intelligence’s on-device features) and isn’t directly programmable in the way LLM runtimes need; only Core ML-converted models touch it, and the LLM ecosystem has voted overwhelmingly for the GPU.
So when comparing chips for AI: read the bandwidth row, read the RAM row, glance at GPU core count (it mainly affects prompt processing speed — relevant for RAG and long documents), and skip the TOPS number entirely.
What to buy, by persona
The student / curious beginner — refurbished M-Pro, 32 GB minimum. Apple’s refurb store and the used market are full of M1 Pro/M2 Pro MacBook Pros. ~200–273 GB/s of bandwidth and 32 GB of RAM gets you the entire 8B–14B world at pleasant speeds and a taste of 32B models, typically for under half the price of a new equivalent. Do not buy 16 GB “to start” — you’ll hit the ceiling in your first month.
The developer — M-Max, 64 GB minimum. If a local coding model is going to sit in your workflow all day, you want 32B-class models at speed plus room for the IDE, containers, and a browser with 60 tabs. A MacBook Pro M4 Max 64 GB (take the 546 GB/s bin) or the equivalent Mac Studio is the configuration I recommend most often, and the one I’d replace my own machine with. 128 GB if the 70B class is part of your plans.
The AI homelab: Mac Studio Ultra, as much RAM as you can justify. 819 GB/s and up to 512 GB of unified memory make the Studio Ultra a one-box inference server: silent, ~90 W under load, running a 70B+ model behind Ollama for every device in your house (add Tailscale and it serves your phone from anywhere). Price out the NVIDIA equivalent in VRAM terms and the “expensive Mac” suddenly looks like the budget option.
The non-buy: a base-chip Mac “for AI.” The 120 GB/s tier runs small models acceptably and that’s the whole story. Buy it because you want a great everyday computer — not for this hobby.
The depreciation argument: buy more RAM than you need today
Here’s the uncomfortable economics of soldered memory. With a PC you can drop in more RAM in 2027 for a couple hundred euros. With a Mac, the RAM you order is the RAM you retire with — and the model ecosystem is moving in exactly one direction. Two years ago, 7B models were the local standard; today 32B-class models are the quality floor for serious work, and MoE architectures keep raising the RAM bar even as they lower the compute one.
The Apple tax on a RAM upgrade stings at checkout and disappears in hindsight: paying ~€500 more for the next tier amortizes to about €14/month over three years of ownership, far less than what under-specced buyers lose reselling a 16 GB machine into a market that wants 64. My rule, validated twice now by my own purchases: estimate the RAM you need, then buy one tier above it. I have never met anyone who regretted too much unified memory. I regularly meet people who regretted too little.
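Both halves of that argument are simple arithmetic, sketched below. The tier list is illustrative: it collects the RAM sizes mentioned in this guide, and which of them you can actually order depends on the chip. The function names are hypothetical helpers of my own.

```python
from bisect import bisect_left

# Illustrative tiers taken from this guide; availability varies by chip.
RAM_TIERS_GB = [16, 32, 48, 64, 96, 128, 192, 512]


def monthly_cost(upgrade_price: float, years: int = 3) -> float:
    """Spread an upgrade price evenly over the ownership period."""
    return upgrade_price / (years * 12)


def tier_to_buy(needed_gb: float, tiers=RAM_TIERS_GB) -> int:
    """Find the smallest tier covering the need, then go one tier above it
    (capped at the largest tier)."""
    idx = bisect_left(tiers, needed_gb)  # first tier >= needed_gb
    if idx >= len(tiers):
        raise ValueError(f"no tier covers {needed_gb} GB")
    return tiers[min(idx + 1, len(tiers) - 1)]


print(f"EUR 500 over 3 years: {monthly_cost(500):.2f} per month")
for need in (16, 40, 64):
    print(f"need {need} GB -> buy {tier_to_buy(need)} GB")
```

The rule is deliberately blunt. If the next tier up forces a jump to a more expensive chip, the monthly figure changes and is worth recomputing with the real price difference.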
The 30-second version
Bandwidth determines speed (base 120 → Pro 273 → Max 410–546 → Ultra 819 GB/s; tokens/sec scales almost linearly). RAM determines the ceiling (16 GB → 8B, 32–48 GB → 32B, 64 GB+ → 70B, 128 GB+ → frontier open models). The Neural Engine is for the camera, not your LLM. And because the RAM is forever: whatever tier you’ve talked yourself into — buy the next one up.