The MacBook Pro Setting That Doubles Your AI Performance
There’s a single terminal command that took my 48GB MacBook Pro from “can’t load a 40GB model” to running it comfortably at full GPU speed. No hardware change, no third-party software, no hack of dubious provenance — just a documented macOS kernel parameter that almost nobody outside the local LLM community knows exists.
Here it is up front, then I’ll explain exactly what it does, what’s safe for your RAM tier, and how not to hurt yourself with it:
sudo sysctl iogpu.wired_limit_mb=40960
That’s the headline setting. Two supporting acts — High Power Mode and a thermal trick — round out the article. Combined, the difference between a default machine swapping a too-big model and a tuned machine running it in GPU memory isn’t 20%. It’s the difference between ~2 tokens/sec with disk thrash and 9-10 tokens/sec of clean inference. That’s not double; that’s 4-5x. But even in the milder case — a model that barely fit before, forcing a smaller context window and partial CPU offload — moving fully onto the GPU routinely doubles real throughput. The title promise is the conservative version.
What macOS is hiding from your GPU
Apple Silicon’s unified memory means the GPU can theoretically address all your RAM. In practice, macOS enforces a ceiling on how much memory the GPU may wire (lock for its own use). The default is roughly:
- ~65-70% of total RAM on machines with less than 36GB
- ~75% of total RAM on machines with 36GB or more
So on a 48GB MacBook Pro, the GPU sees about 36GB. On a 36GB machine, about 27GB. On 64GB, about 48GB. The reservation exists for a sensible reason: macOS needs guaranteed headroom so a runaway GPU allocation can’t starve the window server and kernel.
But the default is conservative — tuned for a machine doing twenty things at once, not a machine whose owner deliberately wants to dedicate it to inference for an hour. The iogpu.wired_limit_mb sysctl (on older macOS versions it was debug.iogpu.wired_limit) lets you move the ceiling yourself. It takes effect immediately, no reboot.
Why does this matter so much? Because model sizes cluster just above the default ceilings. A 4-bit 70B model wants ~40-43GB depending on the quantization variant and context size. None of that range fits in a 48GB machine’s default 36GB window, and whether it fits after the bump depends on keeping the total footprint under the limit you set, so check the download size first. A 4-bit 32B model with a fat context wants ~24-28GB, which is tight against a 36GB machine’s default 27GB and comfortable after the bump. Raising the limit effectively promotes your machine one full model tier.
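The fit-or-not arithmetic is easy to run before you download anything. This is a minimal sketch, not a measurement tool: `default_limit_mb` and `model_fits` are hypothetical helper names, and the flat 75% figure is an assumption taken from the 36GB, 48GB, and 64GB examples above.

```shell
#!/bin/sh
# Rough default GPU ceiling in MB for a given total RAM in GB.
# ASSUMPTION: a flat 75%, matching the 36/48/64GB examples; smaller machines get less.
default_limit_mb() (
  total_gb=$1
  echo $(( total_gb * 1024 * 75 / 100 ))
)

# Exit status 0 when a model of model_gb (weights plus context) fits under limit_mb.
model_fits() (
  model_gb=$1
  limit_mb=$2
  [ $(( model_gb * 1024 )) -le "$limit_mb" ]
)

# Example: a 40GB model on a 48GB machine, before and after raising the limit.
limit=$(default_limit_mb 48)
if model_fits 40 "$limit"; then
  echo "fits at default ($limit MB)"
else
  echo "does not fit at default ($limit MB)"
fi
if model_fits 40 40960; then
  echo "fits at 40960 MB"
else
  echo "does not fit at 40960 MB"
fi
```

Pass the model's full footprint, not just the file size on disk: the context cache needs room under the same ceiling.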
Safe values per RAM tier
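The golden rule: leave macOS about 8GB, and 12GB if you keep a browser with 40 tabs alive while inferring. The 36GB row below is the one exception: 30GB leaves only 6GB, so treat that value as the aggressive end and quit everything else before using it. My tested recommendations: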
| Total RAM | Default GPU limit | Safe raised limit | Command value (MB) |
|---|---|---|---|
| 36GB | ~27GB | 30GB | 30720 |
| 48GB | ~36GB | 40GB | 40960 |
| 64GB | ~48GB | 56GB | 57344 |
| 128GB | ~96GB | 116GB | 118784 |
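The command values in the table are just total RAM minus the headroom you leave for macOS, converted to MB. Here is a small sketch of that conversion for RAM sizes the table doesn't list; `raised_limit_mb` is a hypothetical helper name, and the 8GB default is the floor suggested in this section.

```shell
#!/bin/sh
# Raised wired limit in MB = (total RAM - headroom left for macOS) * 1024.
# headroom_gb defaults to 8, the floor suggested in this section.
raised_limit_mb() (
  total_gb=$1
  headroom_gb=${2:-8}
  echo $(( (total_gb - headroom_gb) * 1024 ))
)

raised_limit_mb 48      # 48GB machine, 8GB headroom
raised_limit_mb 128 12  # 128GB machine, 12GB headroom

# On a Mac, total RAM can be read instead of typed (hw.memsize is in bytes):
# total_gb=$(( $(sysctl -n hw.memsize) / 1073741824 ))
```

The two example calls reproduce the 48GB and 128GB rows of the table.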
Apply it:
# 48GB MacBook Pro example
sudo sysctl iogpu.wired_limit_mb=40960
To revert to defaults, set it to 0:
sudo sysctl iogpu.wired_limit_mb=0
The risks, stated plainly. If you set this too aggressively and the GPU actually wires that memory while the system needs it, macOS has nowhere to go: you’ll see beachballs, app terminations, and in the worst case a hard freeze requiring a forced reboot. I once set a 64GB machine to 62GB out of curiosity, loaded a model that used it all, and watched the UI die in slow motion until I held the power button. Nothing was damaged — the setting doesn’t survive a reboot, which is your built-in safety net — but unsaved work was gone. Stay inside the table above and quit memory-hungry apps before loading a model that needs the raised ceiling.
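If you script the change, a cheap guard can refuse values like my 62GB experiment before they reach the kernel. This is a sketch under stated assumptions: `check_wired_limit` is a hypothetical helper, and the 8GB minimum headroom is this article's rule of thumb, not an Apple-documented threshold.

```shell
#!/bin/sh
# check_wired_limit TOTAL_GB REQUESTED_MB [MIN_HEADROOM_GB]
# Exit status 0 if the requested limit leaves enough RAM for macOS, 1 otherwise.
# ASSUMPTION: 8GB minimum headroom (this article's rule of thumb).
check_wired_limit() (
  total_mb=$(( $1 * 1024 ))
  requested_mb=$2
  min_headroom_gb=${3:-8}
  min_headroom_mb=$(( min_headroom_gb * 1024 ))

  # 0 is always allowed: it restores the macOS default.
  if [ "$requested_mb" -eq 0 ]; then
    echo "ok: 0 restores the default limit"
    exit 0
  fi

  if [ $(( total_mb - requested_mb )) -lt "$min_headroom_mb" ]; then
    echo "refused: leaves less than ${min_headroom_gb}GB for macOS"
    exit 1
  fi
  echo "ok: ${requested_mb} MB leaves enough headroom"
)

# Usage on a Mac (commented out so the sketch runs anywhere):
# check_wired_limit 48 40960 && sudo sysctl iogpu.wired_limit_mb=40960
```

The check only protects against a bad number. It can't stop a model plus your open apps from using everything you allowed, so quitting memory-hungry apps first still applies.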
Making it persistent (deliberately)
The setting resets on every reboot. For most people that’s a feature. If you want it permanent, create a LaunchDaemon:
sudo tee /Library/LaunchDaemons/com.local.iogpu.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key><string>com.local.iogpu</string>
<key>ProgramArguments</key>
<array>
<string>/usr/sbin/sysctl</string>
<string>iogpu.wired_limit_mb=40960</string>
</array>
<key>RunAtLoad</key><true/>
</dict>
</plist>
EOF
sudo launchctl load /Library/LaunchDaemons/com.local.iogpu.plist
Swap 40960 for your tier’s value. To undo it later: sudo launchctl unload /Library/LaunchDaemons/com.local.iogpu.plist && sudo rm /Library/LaunchDaemons/com.local.iogpu.plist.
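To avoid hand-editing the XML when you swap the value, you can generate the same plist from a function. A minimal sketch: `make_iogpu_plist` is a hypothetical helper that prints the plist shown above with the number substituted, and rejects anything that isn't a whole number of MB.

```shell
#!/bin/sh
# Print the LaunchDaemon plist from this section with the given limit (in MB).
make_iogpu_plist() (
  limit_mb=$1
  # Reject empty or non-numeric input so a typo never lands in the plist.
  case $limit_mb in
    ''|*[!0-9]*)
      echo "limit must be a whole number of MB" >&2
      exit 1
      ;;
  esac
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
"http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key><string>com.local.iogpu</string>
<key>ProgramArguments</key>
<array>
<string>/usr/sbin/sysctl</string>
<string>iogpu.wired_limit_mb=${limit_mb}</string>
</array>
<key>RunAtLoad</key><true/>
</dict>
</plist>
EOF
)

# Usage on a Mac (commented out so the sketch runs anywhere):
# make_iogpu_plist 57344 | sudo tee /Library/LaunchDaemons/com.local.iogpu.plist
# plutil -lint /Library/LaunchDaemons/com.local.iogpu.plist
```

Running `plutil -lint` on the written file is a quick way to catch a malformed plist before you load it.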
Verifying it worked. Check the current value with sysctl iogpu.wired_limit_mb. Then load your model and open Activity Monitor → Window → GPU History (Cmd+4) to watch utilization, and the Memory tab to watch the wired memory figure climb. The number you care about: “Memory Used” should rise to your model size while “Swap Used” stays flat. If swap grows during model load, the model still isn’t fitting — drop a quantization level. In Ollama you can confirm placement with ollama ps: it should report 100% GPU, not a CPU/GPU split.
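The ollama ps check can be scripted if you want a pass/fail answer rather than eyeballing it. This sketch assumes the output has one header line followed by one row per loaded model, with the placement shown as text such as "100% GPU"; `fully_on_gpu` is a hypothetical helper and the sample rows in the comments are illustrative, not captured output.

```shell
#!/bin/sh
# fully_on_gpu "OUTPUT_OF_OLLAMA_PS"
# Exit status 0 only if at least one model is loaded and every row says "100% GPU".
# ASSUMPTION: first line is a header; each later line is one loaded model.
fully_on_gpu() (
  rows=$(printf '%s\n' "$1" | sed 1d)
  # No rows means nothing is loaded, so there is nothing to confirm.
  [ -n "$rows" ] || exit 1
  # Any row without "100% GPU" (for example a CPU/GPU split) fails the check.
  if printf '%s\n' "$rows" | grep -v '100% GPU' | grep -q .; then
    exit 1
  fi
  exit 0
)

# Usage on a Mac with a model loaded (commented out so the sketch runs anywhere):
# sysctl -n iogpu.wired_limit_mb          # prints the current limit in MB
# if fully_on_gpu "$(ollama ps)"; then echo "all on GPU"; else echo "split or not loaded"; fi
```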
Setting two: High Power Mode
On MacBook Pro models that support it (16-inch models with Max chips and the more recent 14-inch Max configurations; check Apple’s support list for yours), macOS ships a literal performance switch that many owners have never opened: System Settings → Battery → Energy Mode → High Power (set it for both “On power adapter” and, if you insist, battery).
High Power Mode raises the fan curve so the chip can hold maximum clocks longer instead of stair-stepping down as heat builds. For a 30-second prompt you won’t notice. For sustained inference — long generations, batch jobs, fine-tuning runs — it’s the difference between holding peak tokens/sec and watching throughput sag 15-25% as the chassis soaks. The fans become audible. That’s the point; they’re finally doing something.
While you’re in there: Automatic mode is fine for daily use, but Low Power Mode will absolutely strangle inference. I’ve seen people benchmark models in Low Power Mode and report half the expected speed. Check this before you check anything else.
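You can also check the energy mode from the terminal before a benchmark. Treat this as a sketch resting on assumptions I haven't seen in Apple's documentation: that `pmset -g` prints a `powermode` line on machines with High Power Mode (0 automatic, 1 low, 2 high) and a `lowpowermode` line on others (1 meaning low). `energy_mode` is a hypothetical helper that only parses text you hand it.

```shell
#!/bin/sh
# energy_mode "OUTPUT_OF_PMSET_G" -> prints high, low, automatic, or unknown.
# ASSUMPTION: "powermode N" uses 0=automatic, 1=low, 2=high;
#             "lowpowermode N" uses 1=low, 0=automatic.
energy_mode() (
  printf '%s\n' "$1" | awk '
    $1 == "powermode" {
      mode = ($2 == 2) ? "high" : (($2 == 1) ? "low" : "automatic")
    }
    $1 == "lowpowermode" && mode == "" {
      mode = ($2 == 1) ? "low" : "automatic"
    }
    END { print (mode == "" ? "unknown" : mode) }'
)

# Usage on a Mac (commented out so the sketch runs anywhere):
# [ "$(energy_mode "$(pmset -g)")" = "low" ] && echo "Low Power Mode is on; results will be slow"
```

If the function prints unknown, fall back to looking at System Settings.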
The thermal trick nobody mentions: open the lid
Here’s a genuinely obscure one. If you run your MacBook Pro in clamshell mode — lid closed, external display — you’ve disabled a chunk of its cooling. The keyboard deck is a heat-radiating surface, and the thermal management gets more conservative with the lid down because heat builds between the keyboard and the screen.
For sustained AI workloads on an external display, keep the lid open, even if the built-in screen just shows the desktop. On long fine-tuning runs I’ve measured the difference at several hundred MHz of sustained GPU clock, which translates to roughly 10% throughput on a hot day. Elevate the machine on a stand for airflow underneath and you’ve claimed every free degree available.
Putting it together: real numbers
My 48GB M-Max MacBook Pro, running llama3.3:70b at 4-bit (~42GB):
- Default everything: model doesn’t fit in the 36GB GPU window. Ollama splits layers between GPU and CPU. Result: ~2-3 tokens/sec, fans roaring at the CPU, of all things.
- Wired limit raised to 40GB, High Power Mode on, lid open: fully GPU-resident, ~9-10 tokens/sec sustained, system still responsive enough for Safari and Mail alongside.
Same laptop. Three settings. A model tier I “couldn’t run” became my daily driver. Meanwhile a qwen2.5:32b workload that previously just barely fit went from constrained 4K contexts to comfortable 16K contexts at full speed — which in practice doubled my effective throughput on long-document work, because I stopped chunking everything in half.
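To reproduce a before-and-after comparison on your own machine, you can pull the generation speed out of the stats that ollama run prints with --verbose and divide. A sketch under assumptions: the stats include a line of the form "eval rate: N tokens/s" alongside a separate "prompt eval rate" line, and `eval_rate` and `speedup` are hypothetical helpers.

```shell
#!/bin/sh
# eval_rate "VERBOSE_STATS" -> prints the generation rate in tokens/s.
# ASSUMPTION: stats contain a line like "eval rate:   9.87 tokens/s";
# the "prompt eval rate:" line is deliberately skipped.
eval_rate() (
  printf '%s\n' "$1" | awk '$1 == "eval" && $2 == "rate:" { print $3 }'
)

# speedup BEFORE AFTER -> prints AFTER/BEFORE to one decimal place.
speedup() (
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) exit 1   # avoid dividing by zero
    printf "%.1f\n", after / before
  }'
)

# The low and high ends of the numbers quoted in this section:
speedup 2 9
speedup 2 10
```

Run the same prompt for both measurements; generation speed drops as the context fills, so mismatched prompts will skew the ratio.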
One command, one settings toggle, one open lid. Run the sysctl before your next model download and check ollama ps — if you’ve been seeing anything other than 100% GPU, today is the day that changes.