Your MacBook Battery Can Handle 8 Hours of Local AI — Here's How
The standard assumption is that running an LLM on battery is like gaming on battery: a two-hour countdown to a dead laptop. On an x86 notebook with a discrete GPU, that assumption is correct — the GPU alone pulls 80–150 W under inference load. On Apple Silicon it’s wrong by a factor of four or more, and with the right model size and a few settings, eight hours of genuinely AI-assisted work on one charge is not a stunt — it’s my normal travel configuration. Here’s the methodology, the measured numbers, and the exact setup, tested on an M1 Pro 14” MacBook Pro (70 Wh battery) with corroborating runs on a friend’s M3 Pro.
Why Apple Silicon makes this possible at all
The honest physics first. Inference cost is roughly watts × seconds per answer. A desktop RTX-class GPU generates tokens very fast but at 100+ W; package power on my M1 Pro during 8B-model inference peaks around 18–22 W and idles back to under 2 W between requests. Apple Silicon is slower per token but radically cheaper per token — and crucially, an LLM workload is bursty. You prompt, the machine sprints for 10 seconds, then you read and type for three minutes while the SoC naps. Over a realistic work hour, the duty cycle of actual inference is maybe 5–10%.
That duty-cycle math is the entire trick. Continuous generation would still murder the battery (more below — I measured it). But nobody works that way. The question that matters is: what does an hour of normal, AI-assisted work cost? So that’s what I tested.
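To make the duty-cycle argument concrete, here is a minimal sketch of the average-power arithmetic. It assumes a two-state model (full inference power or idle power) and uses round figures from the paragraph above: 20 W as a mid-range peak and 2 W as an upper bound on idle. The function name `avg_watts` is made up for illustration, and the result covers SoC package power only, not the display or the rest of the system.

```shell
# avg_watts <duty_fraction> <peak_watts> <idle_watts>
# Average package power under a simple two-state model:
#   duty * peak + (1 - duty) * idle
avg_watts() {
  awk -v d="$1" -v p="$2" -v i="$3" \
    'BEGIN { printf "%.1f\n", d * p + (1 - d) * i }'
}

avg_watts 0.05 20 2   # 5% duty cycle   -> 2.9
avg_watts 0.10 20 2   # 10% duty cycle  -> 3.8
avg_watts 1.00 20 2   # continuous run  -> 20.0
```

Under these assumptions, bursty use averages a few watts of package power, while continuous generation sits at the full peak figure.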
The methodology and the numbers
Test conditions, so you can replicate or argue: M1 Pro 14”, battery health 91%, screen brightness at 50%, Wi-Fi on but workload fully offline, models served by Ollama at 4-bit (Q4_K_M) quantization. Two workload types:
- “Active use”: a scripted loop simulating real work, with one ~300-token completion every 3 minutes (writing assistance cadence), plus a small completion every 20 seconds for 10 minutes of each hour (code-completion cadence).
- “Hammer” — continuous back-to-back generation, no idle. The worst case nobody actually lives in.
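The "active use" cadence could be scripted roughly as follows. This is a sketch under stated assumptions, not the script used for the measurements: the prompts, the 40-token size of the small completions, and the `DRY_RUN` and `SLEEP_CMD` switches are all placeholders. It talks to Ollama's local HTTP API on the default port and ignores the time each generation takes, so the real cadence drifts slightly.

```shell
MODEL="${MODEL:-llama3.1}"
SLEEP_CMD="${SLEEP_CMD:-sleep}"   # set to ":" to skip the waits in a dry run
DRY_RUN="${DRY_RUN:-0}"           # set to 1 to print instead of calling Ollama

# generate <max_tokens> <prompt>: one non-streaming completion, output discarded.
# The prompt must not contain double quotes (no JSON escaping is done here).
generate() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "generate $1"
    return 0
  fi
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"$2\",\"stream\":false,\"options\":{\"num_predict\":$1}}" \
    > /dev/null
}

# One simulated hour, in 20-second ticks (180 ticks).
active_hour() {
  tick=0
  while [ "$tick" -lt 180 ]; do
    # First 10 minutes (30 ticks): a small completion every 20 seconds.
    if [ "$tick" -lt 30 ]; then
      generate 40 "Complete this line: for i in range("
    fi
    # Every 3 minutes (9 ticks): one ~300-token completion.
    if [ $((tick % 9)) -eq 0 ]; then
      generate 300 "Rewrite this paragraph more concisely: ..."
    fi
    $SLEEP_CMD 20
    tick=$((tick + 1))
  done
}
```

Calling `active_hour` in a loop approximates the "active use" profile; replacing the sleeps with back-to-back `generate` calls approximates "hammer".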
Battery drain measured per hour via pmset -g batt logging:
| Model | Size on disk | Active use drain/hr | Hammer drain/hr | Est. runtime (active) |
|---|---|---|---|---|
| Qwen2.5 3B | 1.9 GB | ~7%/hr | ~26%/hr | ~12 hrs |
| Llama 3.1 8B | 4.9 GB | ~11%/hr | ~38%/hr | ~8–9 hrs |
| Qwen2.5 14B | 9.0 GB | ~15%/hr | ~47%/hr | ~6 hrs |
| Qwen2.5 32B | 19 GB | ~22%/hr | ~60%+/hr | ~4 hrs |
(Baseline: the same hour of writing/coding with no AI costs ~5–6%/hr on this machine. A 3B assistant adds only a point or two per hour on top of that, so it is nearly free; an 8B adds about 5–6 points per hour, roughly doubling the drain.)
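Turning a drain rate into a runtime estimate is simple division. The sketch below assumes drain is linear and that only 90% of the charge is usable before you stop working. That 10% reserve is an assumption chosen for illustration, not a stated part of the test, though it lands close to the table's estimates. `est_hours` is a hypothetical helper name.

```shell
# est_hours <drain_pct_per_hour> [usable_pct, default 90]
est_hours() {
  awk -v r="$1" -v u="${2:-90}" 'BEGIN { printf "%.1f\n", u / r }'
}

est_hours 7        # 3B  -> 12.9
est_hours 11       # 8B  -> 8.2
est_hours 15       # 14B -> 6.0
est_hours 22       # 32B -> 4.1
est_hours 11 100   # 8B, draining to empty -> 9.1
```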
There’s the title’s claim, delivered: an 8B model under realistic use drains ~11%/hour, which is an 8–9 hour workday on a charge — and the 3B class barely registers. The 32B is the one that turns your MacBook into a 4-hour machine; on battery, big models are for moments, not for the whole flight. One thermal footnote: only the 32B spins the fans audibly, and the chassis warmth is real enough that Mochi, my British lilac cat, has learned to identify long inference runs and relocates onto the keyboard deck accordingly. The 3B produces no such cat.
OLLAMA_KEEP_ALIVE: the setting that matters most on battery
On mains power I tell everyone to set OLLAMA_KEEP_ALIVE=2h or even -1 so models stay resident and respond instantly. On battery, that advice inverts — partially. A loaded model sitting in RAM costs almost nothing in CPU, but it does keep memory pressure high, and macOS responds to memory pressure with compression and swap activity that shows up as a slow background drain (I measured idle drain with a resident 14B at ~1.5–2% per hour above baseline, mostly from the system, not Ollama itself).
The battery-optimal pattern is a middle setting:
# Battery profile: unload after 10 idle minutes
launchctl setenv OLLAMA_KEEP_ALIVE 10m
Ten minutes covers the natural rhythm of a working session (you rarely go >10 min between prompts while actively using AI), but lets the model fully unload during meetings, lunch, or reading. The tradeoff you’re accepting is cold starts: reloading the 8B from SSD takes ~4 seconds, the 14B ~7 seconds. That’s the right trade on battery — a handful of 5-second waits per day versus a continuous tax. On a plane I sometimes preload deliberately before takeoff (ollama run llama3.1 "" loads and exits) so the first real prompt of the flight is instant, then let keep-alive manage the rest.
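If you switch between mains and battery often, the two keep-alive settings can be chosen from the current power source. This is a minimal sketch: it relies on `pmset -g batt` reporting 'Battery Power' or 'AC Power' on its first line, and the helper names (`power_source`, `keep_alive_for`, `apply_profile`) are made up for illustration. The 2h and 10m values are the ones discussed above.

```shell
# Reads `pmset -g batt` output on stdin; prints "battery" or "ac".
power_source() {
  if grep -q "Battery Power"; then
    echo battery
  else
    echo ac
  fi
}

# Maps a power source to a keep-alive value.
keep_alive_for() {
  case "$1" in
    battery) echo 10m ;;
    *)       echo 2h ;;
  esac
}

# macOS only: set the variable for apps launched from now on.
apply_profile() {
  src=$(pmset -g batt | power_source)
  launchctl setenv OLLAMA_KEEP_ALIVE "$(keep_alive_for "$src")"
  echo "power=$src keep_alive=$(keep_alive_for "$src")"
}
```

Run `apply_profile` after plugging in or unplugging. As far as I know, the Ollama app only reads the variable at launch, so quit and reopen it for the new value to take effect.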
Low Power Mode (System Settings → Battery) interacts better than expected: it caps peak clocks, which slows generation ~20–30% (my 8B drops from ~30 to ~21 tokens/sec) but cuts the hammer-workload drain by more than that. For pure reading-and-writing days I leave it on; for code-heavy days the latency annoys me more than the watts save me. Try both for a day each — it’s genuinely a preference call.
The 3–4B class is the travel sweet spot
The table makes the case, but let me make it explicit: for battery work, small models punch way above their reputation. The 3–4B class of 2026 — Qwen2.5 3B, Llama 3.2 3B, Phi-class models, Gemma 3 4B — is roughly where 7B models were two years ago, and they cover an honest 80% of what I actually do on a plane:
- Writing assistant: grammar fixes, rephrasing, tightening paragraphs, drafting from bullet points. A 3B does this nearly as well as an 8B; prose editing is not a reasoning task.
- Code completion and small functions: boilerplate, regex, “write the test for this.” For architecture questions I wait until I land.
- Translation: my Czech↔English use case works surprisingly well at 3–4B for everyday register — formal documents I still route to the 14B, plugged in.
- Offline reference: “how does launchctl differ from cron again?” A 3B is a serviceable offline Stack Overflow for mainstream topics, with the standard hallucination caveat applied double at this size.
My actual travel loadout is two models: Qwen2.5 3B as the default workhorse, Llama 3.1 8B for when the small one’s answer smells off. Combined disk cost under 7 GB. On a Prague→San Francisco flight — eleven and a half hours, no power outlet that worked — this setup plus Low Power Mode got me through roughly nine hours of intermittent writing and coding with 14% remaining on landing. That flight is what convinced me to write this post.
Replicate the test yourself
Don’t trust my battery percentages — batteries age, chips differ, workloads differ. The measurement loop takes one terminal window:
# Log battery percentage every 5 minutes to a file
while true; do
  # One line per sample: "HH:MM NN%"
  echo "$(date +%H:%M) $(pmset -g batt | grep -o '[0-9][0-9]*%')" >> ~/batt-log.txt
  sleep 300
done
Run it in the background, work normally with your chosen model for two hours unplugged, then read the log. Two data points to collect: your drain per hour during real use, and your baseline drain doing the same work without AI. The delta is the true cost of your assistant, and for most people running an 8B or smaller it will be startlingly low — a handful of percent per hour, the price of a brighter screen.
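Reading the log by eye works, but the average drain can also be computed from the first and last samples. The sketch below assumes the log format written by the loop above (one `HH:MM NN%` pair per line) and a run shorter than 24 hours; `drain_per_hour` is a hypothetical helper name.

```shell
# drain_per_hour <logfile>
# Prints average drain in %/hr between the first and last sample.
drain_per_hour() {
  awk '
    NF < 2 { next }                      # skip malformed lines
    {
      split($1, t, ":")
      mins = t[1] * 60 + t[2]            # minutes since midnight
      pct = $2
      sub(/%/, "", pct)
      if (!seen) { first_m = mins; first_p = pct; prev = mins; seen = 1 }
      if (mins < prev) offset += 1440    # the log crossed midnight
      prev = mins
      last_m = mins + offset
      last_p = pct
    }
    END {
      elapsed = (last_m - first_m) / 60  # hours
      if (elapsed > 0) printf "%.1f\n", (first_p - last_p) / elapsed
    }
  ' "$1"
}
```

Run it once on a log from an AI-assisted session and once on a baseline log, for example `drain_per_hour ~/batt-log.txt`, and subtract the two results to get the delta.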
The takeaway in one paragraph: local AI on battery fails when you treat your MacBook like a server — big models, always loaded, continuous generation. It succeeds when you treat it like a laptop — a 3–8B model, OLLAMA_KEEP_ALIVE=10m, Low Power Mode on writing days, big models reserved for mains power. Configure it that way and the battery stops being the reason you can’t use AI offline. Eight hours is not the ceiling; it’s the default.

