Apple MLX Is the Most Underrated AI Framework Right Now
Everyone running local AI on a Mac knows Ollama. Almost nobody outside a small Hugging Face community knows that Apple quietly ships its own open-source machine learning framework — one designed from scratch for Apple Silicon’s unified memory, often faster than the alternatives, and capable of something most people still believe requires an NVIDIA cluster: fine-tuning a language model on a laptop.
It’s called MLX, it came out of Apple’s machine learning research group, and it’s the most underrated tool in the local AI stack right now. I’ve spent months with it. Here’s the case, and a hands-on walkthrough that ends with you fine-tuning a model on your own writing — this afternoon, on a MacBook.
What MLX actually is
MLX is a NumPy-like array framework with a PyTorch-flavored API, built by Apple specifically for Apple Silicon. Three design decisions make it special:
Unified memory as a first-class concept. In CUDA-world frameworks, you constantly shuttle tensors between CPU and GPU memory (tensor.to("cuda") and friends). MLX has no such concept — arrays live in unified memory, and operations run on CPU or GPU without the data moving anywhere. The architecture matches the hardware instead of fighting it.
Lazy evaluation. Computations build a graph and execute only when results are needed, letting MLX fuse and optimize operations. You notice this as efficiency you didn’t have to engineer.
It’s genuinely open. MIT-licensed, developed on GitHub in the open, with Python, Swift, C++, and C frontends. For a company famous for secrecy, MLX is a strange and wonderful artifact.
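To make the lazy-evaluation idea concrete, here is a toy sketch in plain Python. It is not MLX's implementation or its API: the `Lazy` class and every method on it are invented for illustration. It only shows the pattern, where arithmetic records a graph and nothing is computed until a result is explicitly requested.

```python
class Lazy:
    """A toy deferred value: records an operation instead of running it.

    Hypothetical illustration only; this is not how MLX is implemented.
    """

    evaluations = 0  # counts how many graph nodes have actually been computed

    def __init__(self, op, *inputs):
        self.op = op            # callable that produces this node's value
        self.inputs = inputs    # upstream Lazy nodes this node depends on
        self._value = None
        self._done = False

    @classmethod
    def constant(cls, value):
        return cls(lambda: value)

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

    def eval(self):
        """Walk the recorded graph and compute each node at most once."""
        if not self._done:
            args = [node.eval() for node in self.inputs]
            self._value = self.op(*args)
            self._done = True
            Lazy.evaluations += 1
        return self._value


x = Lazy.constant(3)
y = Lazy.constant(4)
z = (x + y) * x              # builds a graph; no arithmetic has run yet
pending = Lazy.evaluations   # still zero at this point
result = z.eval()            # the whole graph executes here
```

In MLX itself the analogous trigger is calling `mx.eval` on an array, or using its value (printing it, for example). Because the framework sees the whole graph before running it, it has room to optimize in ways an eager framework does not.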
The flagship use case is the mlx-lm package — LLM inference and training on top of MLX. That’s where the fun is, so let’s go there.
mlx-lm in five minutes
pip install mlx-lm
Generate text with one command — models download automatically from Hugging Face:
mlx_lm.generate \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--prompt "Explain LoRA fine-tuning in two paragraphs." \
--max-tokens 300
That mlx-community namespace is a quiet treasure: a Hugging Face organization with thousands of pre-converted, pre-quantized models in MLX format — Llama, Qwen, Mistral, Gemma, DeepSeek distills, vision models — typically in 4-bit and 8-bit variants, usually available within a day or two of any notable release. You’ll rarely need to convert anything yourself, but if you do: mlx_lm.convert --hf-path <repo> -q handles it.
Need an API? mlx-lm ships an OpenAI-compatible server:
mlx_lm.server --model mlx-community/Qwen2.5-32B-Instruct-4bit --port 8080
Now anything that speaks the OpenAI API — editor plugins, scripts, Open WebUI — can point at http://localhost:8080/v1 and use your local MLX model. This one command makes MLX a drop-in backend for your existing tooling.
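If you'd rather call the server from a script than from a plugin, a minimal standard-library client might look like the sketch below. The helper names `build_chat_request` and `ask_local_model` are mine, and I'm assuming the server is running on port 8080 as in the command above and returns the usual OpenAI-style chat completion shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the command above


def build_chat_request(prompt, max_tokens=200, base_url=BASE_URL):
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt):
    """Send the request and return the reply text. Needs the server running."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response: choices -> message -> content.
    return body["choices"][0]["message"]["content"]
```

With the server up, `ask_local_model("Explain unified memory in one sentence.")` should return the model's reply as a string. Splitting request construction from sending keeps the network call out of anything you want to unit-test.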
Is it actually faster than llama.cpp?
Sometimes, and it’s worth being precise about when, because “MLX is faster” and “llama.cpp is faster” are both true depending on the workload.
In my testing on an M-series Max with the same 4-bit 8B model, MLX consistently delivers roughly 13-21% higher generation throughput: where llama.cpp gives me ~38 tokens/sec, MLX lands around 43-46. Community benchmarks broadly agree. Prompt processing (chewing through a long input before generation starts) is also strong in MLX, which matters enormously for long-document and RAG workloads where you feed 8K tokens in and get 200 out.
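For anyone who wants to sanity-check the gain implied by those tokens-per-second figures, the arithmetic is short. This sketch uses only the throughput numbers quoted above; the function names are mine, and the timing deliberately ignores prompt processing.

```python
def speedup_percent(baseline_tps, candidate_tps):
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0


def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens, ignoring prompt processing."""
    return num_tokens / tokens_per_second


LLAMA_CPP_TPS = 38.0           # figure quoted above
MLX_TPS_RANGE = (43.0, 46.0)   # figures quoted above

low = speedup_percent(LLAMA_CPP_TPS, MLX_TPS_RANGE[0])
high = speedup_percent(LLAMA_CPP_TPS, MLX_TPS_RANGE[1])
print(f"MLX gain: {low:.1f}% to {high:.1f}%")

saved = generation_seconds(200, LLAMA_CPP_TPS) - generation_seconds(200, MLX_TPS_RANGE[0])
print(f"Time saved on a 200-token reply: about {saved:.2f} s")
```

On a single short reply the difference is well under a second, which is why the gap matters most when you run many generations back to back.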
Where llama.cpp still wins: maturity of the ecosystem (Ollama’s polish, grammar-constrained output, broader quantization menagerie like the K-quants and i-quants), partial CPU offload for models slightly too big for your GPU ceiling, and portability — llama.cpp runs anywhere; MLX runs on Apple Silicon, full stop.
My honest setup: Ollama/llama.cpp for everyday convenience, MLX when I care about peak throughput on long contexts, and for the thing llama.cpp doesn’t really do, which is the next section. That’s also the real reason MLX deserves the “underrated” label: it isn’t just an inference engine. It’s a training framework.
Fine-tune a model on your own writing — this afternoon
Most people believe fine-tuning requires renting A100s. Wrong. LoRA (Low-Rank Adaptation) freezes the base model and trains only small adapter matrices — a few million parameters instead of billions. A 4-bit 8B model plus LoRA adapters trains comfortably in under 16GB of unified memory. Any current MacBook Pro qualifies.
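To see where "a few million parameters" comes from, here is a back-of-the-envelope count. Everything specific in it is an assumption on my part rather than something mlx-lm reports: a hidden size of 4096, a key/value width of 1024, adapters on the query and value projections only, and LoRA rank 8. Which modules get adapters, and at what rank, depends on your mlx-lm version and config.

```python
def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) beside a frozen weight."""
    return rank * (d_in + d_out)


# Assumed shapes for an 8B Llama-style model (illustrative, not measured):
HIDDEN = 4096        # model width
KV_DIM = 1024        # key/value width under grouped-query attention
RANK = 8             # assumed LoRA rank
TUNED_LAYERS = 16    # same value as --num-layers 16 in the Step 2 command

per_layer = (
    lora_params(HIDDEN, HIDDEN, RANK)    # query projection
    + lora_params(HIDDEN, KV_DIM, RANK)  # value projection
)
trainable = per_layer * TUNED_LAYERS

BASE_PARAMS = 8_000_000_000                 # rounded parameter count
base_weights_gb = BASE_PARAMS * 0.5 / 1e9   # 4 bits = half a byte per weight

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {trainable / BASE_PARAMS:.4%}")
print(f"4-bit base weights: ~{base_weights_gb:.1f} GB before quantization overhead")
```

Under those assumptions the trainable count comes to about 1.7 million, and the frozen 4-bit weights take roughly 4 GB. The rest of the memory budget goes to activations, gradients for the adapters, and optimizer state, which is why batch size moves peak memory so much.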
Here’s the complete recipe I used to teach a model to draft blog posts in my voice.
Step 1: Build the dataset (60-90 minutes). mlx-lm wants JSONL files — train.jsonl and valid.jsonl in one directory — with chat-formatted examples:
{"messages": [{"role": "user", "content": "Write an intro paragraph about local Whisper transcription."}, {"role": "assistant", "content": "<an actual intro paragraph you wrote>"}]}
I scripted this: took 80 of my published posts, split them into ~300 instruction/response pairs (section headings became instructions, section bodies became responses), held out 30 pairs for valid.jsonl. Even 100-200 quality pairs produce a noticeable style shift. Quality beats quantity — garbage pairs teach garbage.
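The splitting script doesn't need to be clever. Here is a minimal sketch of the approach, not the exact script I ran. It assumes your posts are Markdown strings with `## ` section headings, and the function names and instruction wording are placeholders to adapt.

```python
import json
import random
import re
from pathlib import Path


def split_post(markdown_text):
    """Yield (heading, body) for each '## ' section with a non-empty body."""
    sections = re.split(r"^## +(.+)$", markdown_text, flags=re.MULTILINE)
    # With one capture group, re.split returns
    # [preamble, heading1, body1, heading2, body2, ...].
    for heading, body in zip(sections[1::2], sections[2::2]):
        body = body.strip()
        if body:
            yield heading.strip(), body


def to_example(heading, body):
    """Turn one section into a chat-formatted training example."""
    return {
        "messages": [
            {"role": "user", "content": f"Write a blog section titled: {heading}"},
            {"role": "assistant", "content": body},
        ]
    }


def build_dataset(posts, out_dir, valid_count=30, seed=0):
    """Write train.jsonl and valid.jsonl into out_dir; return their sizes."""
    examples = [to_example(h, b) for post in posts for h, b in split_post(post)]
    random.Random(seed).shuffle(examples)  # fixed seed keeps the split repeatable
    valid, train = examples[:valid_count], examples[valid_count:]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in (("train.jsonl", train), ("valid.jsonl", valid)):
        with open(out / name, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(train), len(valid)
```

Call it as `build_dataset(list_of_post_strings, "./data")` and you get the two files in the layout the trainer expects. Read through a sample of the output before training: a heading like "Conclusion" makes a poor instruction, and pruning those by hand is where the quality comes from.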
Step 2: Train (30-60 minutes).
mlx_lm.lora \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--train \
--data ./data \
--batch-size 2 \
--num-layers 16 \
--iters 600
On my M-series Max this run takes about 35 minutes, memory peaks around 12GB, and the laptop stays usable throughout (plug it in and enable High Power Mode). Watch the validation loss it prints every 200 iterations — when it stops falling, more iterations are just overfitting.
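If you'd rather not eyeball the log, the "stop when validation loss stops falling" rule can be written down in a few lines. This is a sketch of one simple stopping heuristic, not something mlx-lm does for you. The `best_checkpoint` helper and its `min_improvement` threshold are my own, and the loss values are made up for illustration.

```python
def best_checkpoint(val_losses, min_improvement=0.01):
    """Return the (iteration, loss) after which validation loss stopped improving.

    val_losses: list of (iteration, loss) pairs in the order they were logged.
    """
    best_iter, best_loss = val_losses[0]
    for iteration, loss in val_losses[1:]:
        if best_loss - loss >= min_improvement:
            best_iter, best_loss = iteration, loss
        else:
            break  # first report that fails to improve: stop here
    return best_iter, best_loss


# Made-up readings for illustration only, not results from a real run.
readings = [(1, 2.41), (200, 1.62), (400, 1.48), (600, 1.49)]
print(best_checkpoint(readings))
```

With these invented numbers the helper points at iteration 400, which would suggest the last 200 iterations bought nothing. If your mlx-lm version exposes a periodic adapter-saving option (I believe it is `--save-every`, but check `mlx_lm.lora --help`), you can go back to the checkpoint nearest that iteration instead of retraining.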
Step 3: Test and fuse (10 minutes).
# Generate with adapters applied
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters \
--prompt "Write an intro paragraph about MLX fine-tuning."
# Bake adapters into a standalone model
mlx_lm.fuse --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--adapter-path ./adapters --save-path ./my-style-llama
The result genuinely startled me: the fine-tuned model picked up my sentence rhythm, my habit of front-loading concrete numbers, even my em-dash addiction. It’s not me — it still needs editing — but as a first-draft generator it beats any system prompt I’ve ever written, because style lives in weights, not instructions.
Total cost of this afternoon: $0 and roughly 0.1 kWh. The same experiment on rented cloud GPUs would be trivially cheap too, granted — but the friction difference is the point. When fine-tuning is a local, free, 35-minute loop, you iterate. I’ve since trained adapters for commit-message style, for summarizing in my note-taking format, and one regrettable experiment in writing limericks.
Why “underrated” is the right word
MLX has no marketing budget and no keynote slide. It’s maintained by a small Apple research team and a passionate community, while the spotlight stays on Ollama (deservedly: it’s great). But the gap between capability and mindshare is absurd: millions of Apple Silicon Macs are fine-tuning-capable machines whose owners think training is something other people do.
Start tonight: pip install mlx-lm, run the generate command above, and watch your first MLX tokens stream. Then spend a weekend hour scripting your dataset. By Sunday afternoon you’ll have a model that writes a little bit like you — produced on a laptop, on your desk, by a framework most of the AI world hasn’t bothered to notice yet.