AI Power User

The Truth About Mac vs. NVIDIA for AI — I Tested Both

Real tokens-per-second numbers from both camps and a verdict that depends entirely on what you actually do

By Jakub Jirák | Jun 20, 2026 | 5 min read

Tags: apple-silicon, nvidia, llm-inference, benchmarks

Few topics make the local-AI internet angrier than Mac vs. NVIDIA. The r/LocalLLaMA crowd will tell you a used RTX 3090 destroys any Mac. The Apple crowd will tell you a Mac Studio runs models that don’t even fit in an entire 4-GPU rig. Both are right, both are wrong, and almost nobody states the conditions under which each claim holds.

I run both: a Mac Studio M2 Max with 64 GB of unified memory on my desk, and a Linux box with an RTX 4090 (24 GB VRAM) that I built specifically to settle this argument for myself. Here’s what a year of using both actually taught me, with numbers.

The architectural difference that explains everything

Every result below follows from one fact: NVIDIA gives you enormous compute attached to small, very fast memory; Apple gives you moderate compute attached to large, fast-enough memory.

An RTX 4090 has roughly 1 TB/s of memory bandwidth and vastly more raw FP16 compute than any M-series chip — but only 24 GB of VRAM. My M2 Max has 400 GB/s of bandwidth (an M3 Ultra has 819 GB/s) and can dedicate well over 48 GB to a model, because the GPU shares the full unified memory pool with the CPU.

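LLM inference (generating tokens) is memory-bandwidth-bound: for a single user, every new token requires reading essentially all of the model's weights from memory. LLM training and prompt processing are compute-bound, because many tokens are processed against each weight that is read. Once you internalize that, every benchmark stops being surprising.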
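The bandwidth-bound claim can be turned into a back-of-envelope ceiling: if generating one token means reading all the weights once, generation speed cannot exceed memory bandwidth divided by model size. The sketch below uses the bandwidth and memory figures quoted above. The model sizes are rough assumptions for 4-bit quantized dense models, not measured file sizes, and the estimate ignores KV-cache reads, compute overhead, and the memory the operating system keeps for itself.

```python
# Memory bandwidth (GB/s) and total memory (GB), as quoted in the text.
# The 4090 figure is rounded to 1000 GB/s ("roughly 1 TB/s").
MACHINES = {
    "RTX 4090": {"bandwidth_gb_s": 1000, "memory_gb": 24},
    "M2 Max": {"bandwidth_gb_s": 400, "memory_gb": 64},
}

# ASSUMPTION: approximate in-memory size of 4-bit quantized weights, in GB.
MODEL_SIZE_GB = {"8B": 5, "32B": 20, "70B": 42}


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream generation speed.

    Each generated token reads every weight once, so the ceiling is
    bandwidth divided by the size of the weights.
    """
    return bandwidth_gb_s / model_size_gb


for model, size_gb in MODEL_SIZE_GB.items():
    for name, spec in MACHINES.items():
        if size_gb > spec["memory_gb"]:
            print(f"{model} on {name}: does not fit in {spec['memory_gb']} GB")
            continue
        ceiling = max_tokens_per_second(spec["bandwidth_gb_s"], size_gb)
        print(f"{model} on {name}: at most ~{ceiling:.0f} tok/s")
```

With these assumed sizes the ceilings come out near 200 and 80 tok/s for the 8B model, 50 and 20 tok/s for the 32B model, and just under 10 tok/s for the 70B model on the M2 Max. Real measurements should land below these bounds, and the measured numbers in this article do.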

The numbers from my desk

Both machines ran the same model files at Q4_K_M quantization: llama.cpp with Metal on the Mac, llama.cpp with CUDA on the Linux box. The table shows generation speed, prompt-processing speed, whole-system power draw, and noise:

Workload	RTX 4090 (24 GB)	M2 Max (64 GB)
Llama 3.1 8B Q4 — generation	~105 tok/s	~58 tok/s
Qwen 2.5 32B Q4 — generation	~38 tok/s	~16 tok/s
Llama 3.3 70B Q4 — generation	doesn’t fit (CPU offload: ~2 tok/s)	~8.5 tok/s
8B — prompt processing, 8k context	~4,800 tok/s	~620 tok/s
Power draw under load (whole system)	450–520 W	65–90 W
Noise under sustained load	clearly audible	effectively silent

Three things jump out. The 4090 is roughly 2x faster at generation when the model fits. It is nearly 8x faster at prompt processing, which matters enormously for RAG and long-context work — feeding a 30-page document takes seconds on CUDA and a noticeable coffee-sip pause on the Mac. And the moment a model exceeds 24 GB, the NVIDIA advantage doesn’t shrink — it collapses. A 70B model split between VRAM and system RAM crawls at ~2 tok/s, while the Mac runs it entirely in fast unified memory at a usable 8–9 tok/s.
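To put the prompt-processing gap in wall-clock terms, here is a minimal sketch that converts the measured rates from the table into waiting time before the first output token. The 500-tokens-per-page figure is a rough assumption for illustration, and the rates were measured on the 8B model at 8k context, so treating them as constant for a longer prompt is also an assumption.

```python
# Prompt-processing rates (tokens/s) for the 8B model, from the table above.
PREFILL_TOK_PER_S = {"RTX 4090": 4800, "M2 Max": 620}

# ASSUMPTION for illustration only: about 500 tokens per page of text.
TOKENS_PER_PAGE = 500


def prefill_seconds(prompt_tokens: int, tok_per_s: float) -> float:
    """Time spent processing the prompt before generation can start."""
    return prompt_tokens / tok_per_s


prompt_tokens = 30 * TOKENS_PER_PAGE  # the 30-page document from the text
for machine, rate in PREFILL_TOK_PER_S.items():
    print(f"{machine}: {prefill_seconds(prompt_tokens, rate):.1f} s")
```

Under these assumptions the 4090 needs about 3 seconds and the M2 Max about 24 seconds, which matches the "seconds versus a coffee-sip pause" experience described above.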

Where the Mac honestly wins

Large-model inference per dollar. A Mac Studio M3 Ultra-class machine with 128–192 GB of unified memory runs 70B models comfortably and 100B+ MoE models like Llama 4 Scout-class or Qwen3-235B (quantized) at usable speeds. To match that memory pool with NVIDIA hardware you’re stacking multiple 24 GB cards or buying datacenter GPUs — an A100 80 GB still costs more used than an entire Mac Studio, before you’ve bought the server to put it in.

Power and silence. My Mac Studio draws under 90 W flat-out and I cannot hear it. The 4090 box pulls half a kilowatt and sounds like intake fans on a regional jet. At Czech electricity prices (~7.5 CZK/kWh), running overnight batch jobs five nights a week costs me roughly 60 CZK/month on the Mac and over 400 CZK on the GPU rig. Not life-changing money, but the heat in a small office in July absolutely is.
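The monthly electricity figures follow from simple arithmetic on power draw, hours, and price. The sketch below reproduces the ballpark. The article does not state how many hours each overnight job runs, so the 5 hours per night used here is an assumption, as are the single wattage values picked from inside the measured ranges (75 W for the Mac, 500 W for the GPU box).

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_cost_czk(
    watts: float,
    hours_per_night: float,
    nights_per_week: int = 5,     # "five nights a week", from the text
    czk_per_kwh: float = 7.5,     # price quoted in the text
) -> float:
    """Monthly electricity cost in CZK for a recurring overnight job."""
    kwh_per_month = watts / 1000 * hours_per_night * nights_per_week * WEEKS_PER_MONTH
    return kwh_per_month * czk_per_kwh


HOURS_PER_NIGHT = 5  # ASSUMPTION: not stated in the article

print(f"Mac Studio at 75 W: {monthly_cost_czk(75, HOURS_PER_NIGHT):.0f} CZK/month")
print(f"4090 box at 500 W: {monthly_cost_czk(500, HOURS_PER_NIGHT):.0f} CZK/month")
```

With those inputs the result is about 61 CZK per month for the Mac and about 406 CZK for the GPU rig. Change HOURS_PER_NIGHT or the wattages to match your own workload.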

Single-box simplicity. The Mac is also my development machine, runs macOS apps, and required zero driver fiddling. I ran brew install ollama and was generating tokens four minutes later. The Linux box has cost me entire evenings: CUDA version mismatches, kernel updates breaking drivers, and one memorable fight between PyTorch and a too-new nvcc.

Where NVIDIA honestly wins

Training and fine-tuning. This isn’t close, and Apple fans should stop pretending otherwise. A QLoRA fine-tune of an 8B model on a few thousand examples takes me about an hour on the 4090 with Unsloth. On the Mac, MLX has made real progress — mlx_lm.lora works and is genuinely usable for small adapters — but the same job runs 4–6x slower, and anything beyond LoRA-scale is a non-starter. If “fine-tuning” is in your weekly vocabulary, you need CUDA, whether you own it or rent it.

Prompt processing and batch throughput. That 8x prompt-processing gap is the Mac’s real weakness, not generation speed. Serving multiple users with vLLM, evaluating models across thousands of test prompts, bulk-embedding a document corpus — continuous batching on CUDA gives the 4090 aggregate throughput the Mac can’t approach. vLLM doesn’t even run on Metal; that tells you who the ecosystem builds for.

The software ecosystem. Every research repo, every new inference engine, every quantization paper assumes CUDA on day one. Metal and MLX support arrives weeks or months later, community-contributed, if at all. Flash Attention variants, FP8 inference, TensorRT-LLM — CUDA-first, CUDA-only, or CUDA-best. Betting on Mac means accepting you live 3–6 months behind the frontier of tooling.

The verdict, by use case

You want to run big models privately and use them — chat, coding assistance, RAG over your documents: buy the Mac, and put your budget into RAM, not CPU cores. A 64 GB MacBook Pro or Mac Studio gives you the 70B class in a silent box that’s also a great computer. This is most people reading this, even the ones who think it isn’t.

You train, fine-tune, or serve models to multiple users: NVIDIA, no debate. A used RTX 3090 (24 GB, ~$700 on the used market) remains the best value entry; a 4090 or 5090 if budget allows. Accept the noise, the power bill, and the Linux maintenance as the cost of doing real work.

The surprising middle — and what I actually do: the best setup is both, asymmetrically. The Mac is the daily driver: it runs a 32B coding model all day at desk-silent power draw and handles every privacy-sensitive task. The CUDA capacity doesn’t even have to be a box you own — when I need a fine-tune, I rent an A100 on RunPod or Lambda for $1.50–2/hour, run the job for three hours, and download the adapter. Total cost per fine-tune: less than a pizza. The “Mac vs. NVIDIA” framing assumes you must pick one box for everything, and that assumption is the actual mistake. Inference wants to live where you live; training wants to live where the compute is cheap and temporary.
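The rental arithmetic is easy to check. The sketch below uses the hourly rates and the three-hour job length quoted above, then compares the result with the roughly $700 used RTX 3090 mentioned in the previous section. The comparison is a rough illustration only: it ignores electricity, storage fees, and setup time, and the function names are hypothetical.

```python
def rental_cost_usd(hours: float, rate_per_hour: float) -> float:
    """Cost of one rented GPU job."""
    return hours * rate_per_hour


def jobs_to_match_purchase(purchase_usd: float, cost_per_job_usd: float) -> float:
    """How many rented jobs add up to the purchase price of a card."""
    return purchase_usd / cost_per_job_usd


JOB_HOURS = 3            # job length from the text
USED_3090_USD = 700      # used-market price from the text

low = rental_cost_usd(JOB_HOURS, 1.50)
high = rental_cost_usd(JOB_HOURS, 2.00)
print(f"One fine-tune: ${low:.2f} to ${high:.2f}")
print(f"Jobs to equal a used 3090: "
      f"{jobs_to_match_purchase(USED_3090_USD, high):.0f} to "
      f"{jobs_to_match_purchase(USED_3090_USD, low):.0f}")
```

At these rates a fine-tune costs $4.50 to $6.00, so it takes more than a hundred rented jobs before the spend matches the price of the cheapest sensible card.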

The one-sentence truth

If the model fits in 24 GB and speed is everything, NVIDIA wins; if the model doesn’t fit or the machine has to live on your desk, the Mac wins; and if you’re serious about this hobby, you’ll eventually rent the NVIDIA part by the hour and keep the Mac.

That’s the whole war, settled. Now go argue about quantization formats instead.