Why I Downgraded From the Biggest AI Model — and Got Better Results

AI Power User

A month on an 8-14B daily driver proved that speed changes behavior more than intelligence does

I spent most of last year chasing the biggest model my Mac Studio could hold. Llama 3.3 70B at 4-bit, all 40 GB of it, because obviously more parameters mean better answers, and better answers mean better work. Then in May I ran an experiment that I expected to fail: one month with an 8-14B class model (mostly Qwen 2.5 14B, sometimes Llama 3.1 8B on the MacBook) as my daily driver, with the 70B demoted to a tool I had to consciously invoke.

I got more done. Not “almost as much with acceptable quality loss” — measurably more, with outputs I was happier with. The reasons turned out to have very little to do with model intelligence and everything to do with my own behavior. Here’s the full experiment, where small won, where big still wins, and the two-tier setup I’ve kept since.

The experiment and the numbers

Setup: Mac Studio M2 Ultra for desk work, M3 Pro MacBook for everywhere else, Ollama serving the models on both. The rule: qwen2.5:14b (or the 8B on the MacBook) answers everything by default; escalation to llama3.3:70b requires me to deliberately re-run the prompt.

The raw performance gap that drives everything else, measured on the Studio:

  • Time to first token: ~0.4s for the 14B vs 8-15s for the 70B once context grows past a few thousand tokens (and up to 30s with long documents loaded). Prompt processing is where big models on Apple Silicon really hurt.
  • Generation speed: ~55 tok/s vs ~9 tok/s. A 300-word answer (roughly 400 tokens): about 7 seconds versus about 45.
  • On the MacBook: the 8B runs at ~35 tok/s on battery without spinning fans; the 70B doesn’t realistically run at all.
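
The 7-versus-45-second gap is plain arithmetic on the generation speeds. A rough sketch, assuming about four tokens for every three English words (a common rule of thumb, not something measured in this experiment) and ignoring time to first token; gen_seconds is a made-up helper name:

```shell
#!/bin/bash
# gen_seconds WORDS TOK_PER_SEC
# Rough generation time, in seconds, for an answer of WORDS words.
# Assumption (not from the measurements above): ~4 tokens per 3 words.
gen_seconds() {
  awk -v w="$1" -v s="$2" 'BEGIN { printf "%.1f\n", (w * 4 / 3) / s }'
}

gen_seconds 300 55   # 14B on the Studio
gen_seconds 300 9    # 70B on the Studio
```

This prints 7.3 and 44.4, which matches the rounded figures in the list. Add the time to first token on top and the gap for a long-context prompt gets wider still.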

So far that’s just “small model fast,” which everyone knows. The interesting part is what the speed did to me.

Where small won: the psychology of instant

You ask more questions when answers are instant. This was the biggest effect and I have the logs to prove it: in my last 70B month I averaged 31 model interactions per workday. In the 14B month: 74. The 45-second answer has a hidden cost — every question undergoes a subconscious “is this worth the wait?” test, and dozens of small questions fail it silently. You don’t notice the questions you don’t ask. At 0.4 seconds to first token, the model becomes like grep: you don’t deliberate about using it, you just use it. Half my new interactions were tiny — “what’s the pandas idiom for this,” “rewrite this sentence less defensively” — and those tiny ones are where an assistant actually compounds.

Three fast drafts beat one slow draft. With the 70B I’d craft a careful prompt, wait, and then — sunk-cost — try to salvage whatever came back. With the 14B, regeneration is so cheap that my default became: fire a rough prompt, see the failure, fix the prompt, fire again. Three iterations in under a minute. For emails, docs, refactoring suggestions, and naming things, the third fast draft from a small model beat the first slow draft from the big one almost every time, because iteration count matters more than per-shot quality for tasks with taste involved.

The discipline effect. Here’s the counterintuitive one. Small models made me a better prompter, because they punish vagueness immediately and obviously. Give a 14B a mushy prompt and you get visibly mushy output in five seconds — so you fix the prompt. Give a 70B the same mushy prompt and you get something plausible, well-structured, and subtly off — mush with good posture — which you’re tempted to accept because it cost 45 seconds and looks finished. The small model’s fast, honest failures trained me to specify format, audience, and constraints up front. Ironically, my escalated 70B prompts got dramatically better too, because they’d already survived three rounds against the small model.

Battery and heat, the unglamorous wins. A full day of 8B usage on the MacBook costs me roughly 15-20% extra battery. The 70B-class models are a desk-only, fans-audible affair. The best model is the one that’s with you on the train to Brno.

Where big still won: the escalation rule

The title says I got better results, and I did — but not because the small model is smarter. It isn’t, and pretending otherwise would make this post a lie. Three categories went to the 70B (or a frontier cloud model) every single time:

  1. Multi-step reasoning. Planning a data migration, debugging a race condition from symptoms, anything where step 4 depends on step 2 being right. The 14B loses the thread; the 70B holds it. No amount of prompt discipline fixes this.
  2. Long documents. Summarizing or interrogating a 60-page contract or a long spec. Small models technically accept the context window; effectively, recall and synthesis over it degrade noticeably. Big models earn their seconds here.
  3. Subtle code review. The 14B catches syntax-level and obvious-logic issues. The 70B catches “this works but breaks the invariant established in the other file.” That gap is real and occasionally expensive.

So the rule I run now: small by default, escalate on the second failure. If the 14B whiffs twice on the same task after a prompt fix, the task has demonstrated it’s a big-model task — escalate without ego. In practice about 1 in 10 tasks escalates. Which means roughly 90% of my AI usage was over-provisioned for a year.
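
The rule is small enough to write down as code. A minimal sketch, assuming you (or your client) keep a count of how many times the small model has failed on the current task; pick_model and ESCALATE_AFTER are illustrative names, not part of Ollama or of the setup described here:

```shell
#!/bin/bash
# pick_model FAILURES
# Prints the model tag to use, given how many times the small model
# has already failed on the current task.
SMALL_MODEL="qwen2.5:14b"
BIG_MODEL="llama3.3:70b"
ESCALATE_AFTER=2   # "escalate on the second failure"

pick_model() {
  local failures="${1:-0}"
  if [ "$failures" -ge "$ESCALATE_AFTER" ]; then
    echo "$BIG_MODEL"
  else
    echo "$SMALL_MODEL"
  fi
}

pick_model 0   # first attempt: small model
pick_model 1   # retry after a prompt fix: still small
pick_model 2   # second failure: escalate
```

Keeping the threshold in one variable makes it easy to try a stricter or looser rule for a week and compare how often you end up escalating.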

The deeper point: the bottleneck moved

The month convinced me of something bigger than a model recommendation. For routine knowledge work (drafting, rewriting, summarizing short texts, simple code, translation between Czech and English, where Qwen 14B is genuinely strong), model quality stopped being my bottleneck somewhere around the 8B mark, sometime in 2024-2025. For the tasks that make up most of a working day, today's 8-14B models feel at least as good to me as the GPT-4 we were all amazed by.

When quality stops being the constraint, optimizing it further is wasted effort. The constraint now is workflow: how fast you can get a thought to the model, how cheaply you can iterate, whether the model is available on battery, whether asking feels free. Every one of those is a latency-and-friction problem, and small models beat big ones on all of them. Upgrading from 70B to 405B-class improves my median workday by approximately nothing; cutting time-to-first-token from 15 seconds to 0.4 changed how often I think with the machine at all.

The setup I kept

The two-tier system, concretely:

ollama pull qwen2.5:14b      # the daily driver
ollama pull llama3.3:70b     # the escalation tier

The piece that makes it frictionless is a Raycast hotkey script that re-runs my last prompt on the big model — same prompt, zero retyping, one keystroke when the small model whiffs:

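#!/bin/bash
# escalate.sh: resend the last prompt to the 70B
set -euo pipefail
# Read the prompt saved by the chat wrapper (fails loudly if the file is missing)
PROMPT=$(cat ~/.last_prompt)
# Build the JSON body with jq so quotes and newlines in the prompt are escaped safely
PAYLOAD=$(jq -cn --arg prompt "$PROMPT" \
  '{model: "llama3.3:70b", prompt: $prompt, stream: false}')
# Ask the local Ollama server for a non-streamed answer and print only the text
curl -s http://127.0.0.1:11434/api/generate -d "$PAYLOAD" \
  | jq -r .response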

(My chat wrapper writes every prompt to ~/.last_prompt; any client that logs prompts can feed the same script.) Escalation costing one keystroke instead of a copy-paste ritual is what makes “small by default” sustainable — the big model is never more than a hotkey away, so defaulting small never feels like a sacrifice.
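
For anyone without such a wrapper, here is a minimal sketch of the small-model side. It is a hypothetical stand-in, not the actual client used above: ask, save_prompt, and LAST_PROMPT_FILE are invented names, and it assumes jq and curl are installed and Ollama is listening on its default local port.

```shell
#!/bin/bash
# ask.sh: hypothetical minimal wrapper around the small model.
LAST_PROMPT_FILE="${LAST_PROMPT_FILE:-$HOME/.last_prompt}"
SMALL_MODEL="qwen2.5:14b"

# Save the prompt first, so escalate.sh can replay it even if the request fails.
save_prompt() {
  printf '%s\n' "$1" > "$LAST_PROMPT_FILE"
}

# Send a prompt to the small model via Ollama's /api/generate endpoint
# and print only the response text.
ask() {
  local prompt="$1"
  save_prompt "$prompt"
  jq -cn --arg model "$SMALL_MODEL" --arg prompt "$prompt" \
    '{model: $model, prompt: $prompt, stream: false}' \
    | curl -s http://127.0.0.1:11434/api/generate --data-binary @- \
    | jq -r .response
}

# Usage, after sourcing this file:
#   ask "rewrite this sentence less defensively: ..."
```

Writing the prompt to disk before the request goes out is the one design choice that matters: the escalation hotkey then works even when the small model's answer never arrived.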

The uncomfortable summary: I spent a year and a lot of RAM assuming intelligence was the scarce resource, when for most of my actual work the scarce resource was my own willingness to ask. Downgrade the default, keep the big gun loaded behind a hotkey, and watch what happens to the number of questions you ask per day. That number — not the benchmark score — is what your results are made of.