Photo: Unsplash
Apple Intelligence vs. Local LLMs — What Apple Doesn't Tell You
I run both. Apple Intelligence is on across my MacBook and iPhone, and an Ollama server with a rotating cast of 8B–32B models sits two keystrokes away on the same machines’ bigger sibling. After living with both stacks daily, I can tell you exactly where Apple’s marketing ends and the engineering reality begins. The short version: Apple Intelligence is a genuinely clever system wrapped around a deliberately small model, and the thing Apple doesn’t put on the keynote slide — the model’s actual size — explains every frustration you’ve had with it.
What actually runs where (Apple won’t draw you this diagram)
Apple Intelligence is not one AI. It’s a three-tier routing system, and your request silently lands in one of the tiers:
Tier 1 — on-device. A roughly 3-billion-parameter model (Apple’s own published research describes the ~3B on-device foundation model, quantized to about 2–4 bits per weight to fit in a couple of GB of memory) handles Writing Tools rewrites, message and mail summaries, notification prioritization, Smart Reply, and Genmoji. This is the tier Apple’s privacy story is built on, and it genuinely never leaves your hardware.
Tier 2 — Private Cloud Compute. When a request exceeds the small model’s capability, it’s shipped to Apple’s PCC: Apple Silicon servers running a larger server foundation model. More on why this tier is genuinely impressive below.
Tier 3 — third-party frontier models. Ask Siri something hard and it offers to hand the query to an external frontier model partner (ChatGPT integration shipped first; the architecture is explicitly built for multiple providers). This tier is opt-in per request, but it is also the quiet admission in the architecture: for real intelligence, even Apple outsources.
The tell is that you can’t see the routing. There’s no indicator showing which tier answered. For a company that prints “Privacy” on billboards, the absence of a “this request used PCC” badge is a curious omission — the system is designed so you don’t think about the fact that a meaningful slice of “Apple Intelligence” is not on your device at all.
Fair is fair: Private Cloud Compute is genuinely innovative
Before the criticism, the credit, because PCC deserves it. Apple built server nodes from Apple Silicon, with no persistent storage, where each request is cryptographically tied to a software image whose measurements are published in a transparency log that independent researchers can inspect — and your device refuses to send data to any node that can’t prove it’s running the audited build. No admin shells, no remote debugging in production, stateless computation that’s cryptographically attested rather than just promised in a policy document.
Nobody else in the industry runs consumer AI this way. When my wife’s iPhone summarizes a long email thread via PCC, the privacy posture is dramatically better than the same request hitting a default cloud API. If you’re going to use cloud AI, this is the most honest version of it anyone has built. That’s the fair reading, and Apple has earned it.
What Apple doesn’t tell you: the hard limits of 3B
Now the other side. Here’s what living with a ~3B-parameter, aggressively quantized model actually feels like, compared against even a mid-size model you run yourself:
Depth. Ask Writing Tools to rewrite a paragraph and it does fine. Ask it to restructure a 2,000-word argument and it produces grammatical mush, because a 3B model can polish sentences but cannot hold an argument’s logic. The same task given to a local qwen2.5:14b via Ollama on my Mac produces an actual restructuring with preserved reasoning. This isn’t Apple doing AI badly — it’s physics. Capability scales with parameters, and Apple chose battery life and instant latency over depth. A defensible choice; an undisclosed one.
Context. The on-device model works with a small context window (Apple’s documentation for the developer-facing Foundation Models framework has been candid that on-device sessions are limited to several thousand tokens). Mail summarizes a thread; it cannot ingest your 80-page PDF and answer questions about page 60. My local setup with a 32B model and a 32K context does exactly that, offline.
Customization. You cannot set a system prompt for Apple Intelligence. You can’t tell it “always answer in Czech,” can’t give it your terminology, can’t adjust its tone beyond preset Writing Tools styles, can’t swap the model. With Ollama, every one of those is one line in a Modelfile.
Model choice. Apple Intelligence is whatever Apple shipped this OS cycle. The local ecosystem ships meaningfully better open-weights models every few months, and you adopt them the day they hit Hugging Face with ollama pull.
The numbers that make this concrete
On my M3 Pro (36 GB), here’s what “running your own” actually delivers versus the built-in stack: llama3.1:8b at ~28 tokens/sec, qwen2.5:14b at ~15 tokens/sec, and on a Mac Studio-class machine, 32B–70B models that operate in a different universe of capability than the on-device 3B — while still being 100% local, matching Apple’s Tier-1 privacy with roughly 5–10x the parameters. The memory math is the story: Apple budgets ~2 GB for its model because it must run on every supported iPhone. Your 36 GB MacBook Pro can budget 20+ GB. Same silicon family, ten times the brain, and Apple will never ship that configuration as a default because the iPhone 16 in someone’s pocket can’t run it.
That’s the headline secret, plainly stated: Apple Intelligence’s ceiling is set by Apple’s smallest supported device, not by your Mac’s capability. Your M4 Max is idling while a 3B model writes your email summaries.
How I actually use both (they’re complements, not rivals)
After months of running the two stacks side by side, the division of labor settled naturally:
Apple Intelligence owns the ambient layer — the tasks where invocation cost matters more than intelligence. Notification triage (genuinely good), mail thread summaries in the inbox list, quick proofread via Writing Tools in any text field (select text → right-click → Writing Tools → Proofread), removing a photobombing stranger in Photos’ Clean Up. These work because they’re zero-friction and good enough, woven into the OS at a level no third-party tool can reach.
My local stack owns the serious work — anything needing depth, context, or control. Document analysis, code, long-form writing assistance, translation with my own glossary, summarizing transcripts. The glue is trivially simple: Ollama plus a Raycast hotkey means “real” local AI is as accessible as Siri, just two keystrokes instead of a wake word.
The honest buying advice that falls out of this: don’t buy more RAM “for Apple Intelligence” — it runs identically on every supported Mac, 8 GB included, by design. Buy RAM for the models you will run. 16 GB unlocks the 8B tier, 24–36 GB the 14B–32B tier where local AI starts beating last year’s cloud models.
Apple built the best privacy-preserving small AI system in the industry and talks about it as if size doesn’t matter. It does. Use Apple Intelligence for what a brilliant 3B model woven into the OS can do, run your own weights for everything that needs an actual brain — and now you know exactly where that line is, even if the keynote never draws it.
