The 7 AI Mistakes Mac Power Users Keep Making
I’ve spent a month of this series persuading you to run AI locally on your Mac. Today, the flip side: the seven mistakes I see constantly — in forums, in reader emails, and in my own history, because I’ve personally committed at least five of these. Each one produces the same outcome: a capable setup that feels broken, and a user who concludes local AI is overhyped when the actual problem is one config line.
Format is strict: symptom, cause, fix. Commands included where they exist.
1. Running models too big for your RAM, then blaming the Mac
Symptom. The model takes minutes to respond, the whole Mac stutters, the cursor beachballs, and generation crawls at a token every few seconds. “My M3 can’t handle local AI.”
Cause. The swap death spiral. A model that doesn’t fit in unified memory gets paged to SSD, and since every generated token requires reading essentially all of a dense model’s weights, you’re now running inference off your SSD instead of RAM. That is a bandwidth drop of one to two orders of magnitude, depending on the chip and the drive. Your M3 is fine; your model placement is not.
Fix. Open Activity Monitor → Memory tab and look at two things while the model runs: Memory Pressure (yellow or red = you’re in the spiral) and Swap Used (multiple GB climbing during inference = confirmed). The rule of thumb: the model’s file size — check with ollama list — should stay under roughly 60-65% of total RAM. 16GB Mac? Stay at or below ~9GB models, which means 8–12B at Q4. If you’re in the spiral right now: ollama ps to see what’s loaded, ollama stop <model>, pull a size that fits.
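The rule of thumb is simple enough to turn into a quick check. This is a minimal sketch, not anything Ollama enforces: the function names are hypothetical, and the 0.60 and 0.65 fractions are just the rule of thumb from the paragraph above.

```python
def model_budget_gb(total_ram_gb: float, fraction: float = 0.60) -> float:
    """Largest model file (in GB) the rule of thumb allows for this much RAM."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return total_ram_gb * fraction


def fits(model_size_gb: float, total_ram_gb: float, fraction: float = 0.60) -> bool:
    """True if the model file stays within the rule-of-thumb budget."""
    return model_size_gb <= model_budget_gb(total_ram_gb, fraction)


if __name__ == "__main__":
    # Print the 60% and 65% ceilings for a few common Mac configurations.
    for ram in (8, 16, 32, 64):
        low = model_budget_gb(ram, 0.60)
        high = model_budget_gb(ram, 0.65)
        print(f"{ram:>3} GB RAM: keep the model file under about {low:.1f} to {high:.1f} GB")
```

For 16GB this works out to roughly 9.6 to 10.4GB, so a ~9GB ceiling leaves a little margin for the context window and whatever else is open.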
2. Default context length, then wondering why the model “forgets”
Symptom. You paste a long document, ask a question about its beginning, and get hallucinated nonsense. Mid-conversation, the model loses the plot entirely. “Local models have goldfish memory.”
Cause. Historically Ollama defaulted to a small context window (2,048 tokens for years; newer versions default to 4,096) regardless of what the model supports. Everything beyond the window is silently truncated — silently being the operative scandal. The model never saw page one of your document.
Fix. Set the context explicitly. Per session:
ollama run llama3.1:8b
>>> /set parameter num_ctx 16384
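Or for the server as a whole, via the environment (restart the Ollama app afterwards; launchctl setenv does not survive a reboot, so reapply it after restarting the Mac):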
launchctl setenv OLLAMA_CONTEXT_LENGTH 16384
Or bake it into a Modelfile (see mistake 6). One caveat: context costs memory — the KV cache for a long window can add multiple GB — so this interacts with mistake 1. A 16k window on an 8B model is a comfortable default for 16GB+ machines.
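To see where "multiple GB" comes from, here is a back-of-the-envelope KV cache estimate. Treat it as a sketch under stated assumptions: a 16-bit cache with no cache quantization, memory allocated for the full window, and the commonly published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128). The function name is hypothetical, and other models have different shapes.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold keys and values for every layer and every token.

    The leading factor of 2 covers one key tensor plus one value tensor.
    bytes_per_value=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


# Assumed shape for Llama 3.1 8B; check the model card for other models.
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (2048, 4096, 16384, 32768):
        gib = kv_cache_bytes(ctx, **LLAMA31_8B) / 2**30
        print(f"num_ctx={ctx:>6}: about {gib:.2f} GiB of KV cache")
```

Under those assumptions each token costs 128 KiB, so a 16,384-token window adds about 2 GiB on top of the model weights, which is the interaction with mistake 1.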
3. Ignoring quantization, grabbing a Q2, concluding local AI is dumb
Symptom. The model’s output is weirdly degraded — repetitive, error-prone, occasionally incoherent — even though the benchmark scores for that model looked great.
Cause. Quantization is lossy compression of model weights, and the loss is not linear. Q8 is near-lossless; Q4_K_M is the sweet spot where quality loss is barely measurable; Q3 is noticeably degraded; Q2 frequently lobotomizes the model. People with limited RAM grab the smallest file of a big model — a Q2 of a 70B — and judge “the 70B” by it. You didn’t run the 70B. You ran its concussed cousin.
Fix. Know what you pulled: ollama show <model> displays the quantization level. The hierarchy that actually serves you: a smaller model at Q4/Q5 beats a bigger model at Q2 almost every time. On a 16GB Mac, gemma3:12b at Q4 will usually outperform whatever Q2 monster you can squeeze in. Pull explicit quants with tags: ollama pull llama3.1:8b-instruct-q8_0 when you have headroom, default Q4 tags when you don’t, Q2 essentially never.
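The temptation to grab a Q2 makes more sense once you estimate file sizes. The sketch below uses the bits per weight implied by two simple llama.cpp block formats (Q8_0 and Q4_0); the function name is hypothetical, the parameter counts are rounded, and real files differ somewhat because K-quants such as Q4_K_M mix precisions and land a bit above Q4_0.

```python
# Bits per weight implied by the block layouts:
#   Q8_0: 32 8-bit weights plus one 16-bit scale = 34 bytes per 32 weights
#   Q4_0: 32 4-bit weights plus one 16-bit scale = 18 bytes per 32 weights
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough model file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    # Rounded parameter counts, used only for illustration.
    for label, n_params in (("8B", 8e9), ("12B", 12e9), ("70B", 70e9)):
        sizes = ", ".join(
            f"{quant} about {approx_size_gb(n_params, quant):.1f} GB"
            for quant in BITS_PER_WEIGHT
        )
        print(f"{label}: {sizes}")
```

A 70B model at roughly 4.5 bits per weight is still close to 40GB, which is why it only squeezes into a small Mac at Q2, and why the 12B at Q4 is the better deal there.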
4. Leaving the GPU memory limit at default on a high-RAM Mac
Symptom. You bought 64 or 128GB precisely for big models, yet a model that should fit refuses to load fully onto the GPU, offloads layers to CPU, and runs at half the speed it should.
Cause. macOS caps the GPU’s share of unified memory at a default — roughly 65–75% of total RAM depending on configuration. On a 16GB machine that guardrail makes sense. On a 128GB Mac Studio it strands tens of gigabytes you bought specifically for inference.
Fix. Raise the limit with sysctl. To allow, say, 56GB of GPU memory on a 64GB machine (value is in MB):
sudo sysctl iogpu.wired_limit_mb=57344
This resets on reboot, which is also your safety net — if you overdo it and the system gets unstable, restart and you’re back to defaults. Leave macOS at least 8–12GB. On a 128GB Studio I run a 112GB limit, which is the difference between a 70B model fully GPU-resident at conversational speed and the same model limping with CPU-offloaded layers.
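The value passed to sysctl is just gigabytes times 1024, minus whatever you reserve for macOS. A small sketch of that arithmetic, with hypothetical helper names and the 8GB minimum reserve taken from the advice above:

```python
def wired_limit_mb(total_ram_gb: int, reserve_for_macos_gb: int = 8) -> int:
    """GPU wired-memory limit in MB that leaves the given reserve for macOS."""
    if reserve_for_macos_gb < 8:
        raise ValueError("leave macOS at least 8 GB")
    gpu_gb = total_ram_gb - reserve_for_macos_gb
    if gpu_gb <= 0:
        raise ValueError("reserve exceeds total RAM")
    return gpu_gb * 1024


def sysctl_command(total_ram_gb: int, reserve_for_macos_gb: int = 8) -> str:
    """Build the command string; it is only printed here, never executed."""
    limit = wired_limit_mb(total_ram_gb, reserve_for_macos_gb)
    return f"sudo sysctl iogpu.wired_limit_mb={limit}"


if __name__ == "__main__":
    print(sysctl_command(64))        # 56 GB for the GPU on a 64 GB machine
    print(sysctl_command(128, 16))   # 112 GB for the GPU on a 128 GB machine
```

The 64GB case reproduces the 57344 used above; the 128GB case with a 16GB reserve gives 114688, which is the 112GB limit.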
5. Cloud-or-local absolutism
Symptom. Two flavors. The local purist, “on principle,” spends 40 minutes coaxing a 13B model through a task Claude would nail in one shot. The cloud loyalist pastes confidential contracts into a chatbot because local “isn’t good enough,” having last tried it two years ago.
Cause. Identity-driven tool choice. Somewhere along the way “how I run inference” became a tribal affiliation rather than an engineering decision.
Fix. Route by task, not by ideology. My actual routing table: local for anything involving private data (mail, contracts, manuscripts, health), high-volume repetitive jobs (summarization pipelines, tagging), and offline work; cloud for frontier-grade reasoning, gnarly code architecture questions, and anything where a wrong answer costs more than the privacy is worth. The boring truth is that hybrid users get more from both — the cloud subscription gets reserved for problems worthy of it, and the local stack handles the volume. If you can’t articulate why a given task is running where it’s running, that’s the tell.
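If it helps to make the routing table explicit, it can be written down as a tiny decision function. Everything here is illustrative: the Task fields, the order of the checks, and the local default are assumptions layered on the table above, not a prescription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    private_data: bool = False
    high_volume: bool = False
    offline: bool = False
    needs_frontier_reasoning: bool = False


def route(task: Task) -> tuple[str, str]:
    """Return (destination, reason) so every choice comes with its justification."""
    if task.offline:
        return "local", "no network available"
    if task.private_data:
        return "local", "private data stays on the machine"
    if task.needs_frontier_reasoning:
        return "cloud", "needs frontier-grade reasoning"
    if task.high_volume:
        return "local", "high-volume repetitive job"
    # Assumed default: stay local unless something argues for the cloud.
    return "local", "nothing argues for the cloud"


if __name__ == "__main__":
    tasks = [
        Task("contract review", private_data=True),
        Task("tagging 5,000 notes", high_volume=True),
        Task("service architecture question", needs_frontier_reasoning=True),
    ]
    for task in tasks:
        destination, reason = route(task)
        print(f"{task.name}: {destination} ({reason})")
```

Returning the reason alongside the destination is the point: if the function cannot say why, neither can you.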
6. Retyping the same instructions instead of using system prompts and Modelfiles
Symptom. Every session starts with the same ritual paragraph: “Answer concisely, use metric units, respond in English, format code blocks as…” — and the moment you forget it, output quality drops.
Cause. Treating a configurable inference server like a goldfish chat toy. Standing instructions belong in the model’s configuration, not your clipboard.
Fix. Ollama’s Modelfile bakes your defaults into a named model. Create a file called Modelfile:
FROM qwen3:14b
PARAMETER num_ctx 16384
PARAMETER temperature 0.3
SYSTEM """You are a concise technical assistant. Metric units.
No preamble, no recap of the question. Code in fenced blocks
with the language tag. If uncertain, say so explicitly."""
Then build and use it:
ollama create work-assistant -f Modelfile
ollama run work-assistant
Now work-assistant is a first-class model with your context length, your temperature, and your standing orders — usable from every app that talks to Ollama. I keep four of these (work, writing-editor, summarizer, home-automation), and it also quietly fixes half of mistake 2, since num_ctx rides along.
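Once you keep several of these, generating the Modelfiles from one template keeps them consistent. A minimal sketch: render_modelfile and the profile contents are hypothetical, and it emits only the FROM, PARAMETER, and SYSTEM instructions already used above.

```python
def render_modelfile(base: str, system: str,
                     num_ctx: int = 16384, temperature: float = 0.3) -> str:
    """Return Modelfile text with a base model, two parameters, and a system prompt."""
    if '"""' in system:
        raise ValueError("system prompt must not contain triple quotes")
    lines = [
        f"FROM {base}",
        f"PARAMETER num_ctx {num_ctx}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system}"""',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical profiles; swap in your own base models and standing orders.
PROFILES = {
    "work-assistant": dict(
        base="qwen3:14b",
        system="You are a concise technical assistant. Metric units.",
    ),
    "summarizer": dict(
        base="qwen3:14b",
        system="Summarize the input in five bullet points. No preamble.",
        temperature=0.2,
    ),
}

if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(f"# Modelfile for {name}")
        print(render_modelfile(**profile))
```

Save each rendered text to its own file and build it with ollama create <name> -f <file>, exactly as above.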
7. Treating AI output as finished work
Symptom. The embarrassing email you sent without reading, which committed you to a deadline. The blog post with a confidently invented statistic. The script that worked in the happy path and deleted the wrong directory in the sad one.
Cause. Fluency bias — models produce polished-sounding output at a speed that lulls you into equating polish with correctness. This is the only mistake on the list with no command-line fix, and it’s the one that does actual damage in the world.
Fix. A hard personal rule: AI output is always a draft. Concretely: every generated email gets a full read before sending; every factual claim destined for publication gets verified at its source; every generated script runs once in a sandbox or with echo in place of the destructive command. In my experience the review pass costs maybe 10% of the time the AI saved, and it is the difference between “AI made me faster” and “AI made me confidently wrong at scale.” Power users aren’t the people who automate the most; they’re the people who know exactly which steps must never be automated.
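The "echo in place of the destructive command" habit carries over to generated Python scripts as a dry-run flag that defaults to on. A minimal sketch, with a hypothetical remove_tree helper standing in for whatever destructive step the script contains:

```python
import shutil
from pathlib import Path


def remove_tree(path: Path, dry_run: bool = True) -> str:
    """Delete a directory tree, or only describe the deletion when dry_run is True."""
    action = f"rm -rf {path}"
    if dry_run:
        # Nothing is touched; the caller just sees what would have happened.
        return f"[dry run] would run: {action}"
    shutil.rmtree(path)
    return f"ran: {action}"


if __name__ == "__main__":
    # Hypothetical target; the default dry run only prints the intended action.
    print(remove_tree(Path("build/cache")))
```

Making the safe mode the default means the destructive path has to be requested explicitly with dry_run=False, after you have read what the script intends to do.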
Tally your own score against these seven. Mine, historically, is five out of seven — number 1 cost me an evening of blaming Apple, and number 7 cost me an apology email. Every fix above is one config line, one sysctl, or one habit. The gap between “local AI is overhyped” and “my Mac quietly does the work of a subscription stack” is, in most cases I’ve debugged, exactly that wide.

