AI Power User

Fine-Tune Your Own AI Model on a MacBook — Yes, Really

A complete afternoon project using MLX LoRA to teach an 8B model your writing voice on Apple Silicon

By Jakub Jirák · Jun 15, 2026 · 5 min read

“Fine-tuning is for ML teams with GPU clusters” was true in 2022. Today, on the MacBook Pro you may already own, you can fine-tune an 8B-parameter model on your own data in roughly the time it takes to watch a football match — using MLX, Apple’s open-source ML framework built specifically for Apple Silicon’s unified memory.

I’ve now done this four times for real tasks, including a model that writes support replies in my product’s voice and one that formats messy meeting notes into my exact template. This post is the complete, reproducible afternoon project. Every command was tested on an M3 Max with 36GB of RAM; I’ll flag where smaller machines differ.

First, the honest part: what fine-tuning actually does

Fine-tuning, and specifically LoRA (Low-Rank Adaptation), which is what we’re doing here, is not a reliable way to teach a model new facts. If you fine-tune on your company wiki and ask the model a question from page 47, expect a confident hallucination. Knowledge injection is what RAG (retrieval-augmented generation) is for.

What LoRA is spectacularly good at:

Style and voice. Train on 300 of your email replies and the model writes like you — your sentence rhythm, your sign-offs, your level of formality.
Output format. If you need strict JSON, a specific Markdown template, or a house style for changelogs, fine-tuning usually beats even elaborate prompting, and the format is far less likely to drift mid-conversation because it lives in the weights rather than in the context window.
Domain vocabulary and conventions. A model tuned on radiology reports or legal memos uses the right register and structure without three paragraphs of system prompt.

The mental model: prompting is giving instructions; fine-tuning is hiring someone who already worked at your company. LoRA achieves this by training small adapter matrices (a few million parameters) on top of the frozen base model — which is why it fits in a laptop’s memory.
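To make "small adapter matrices" concrete, here is a rough parameter count, a sketch under stated assumptions. For a linear layer with a frozen weight of shape d_out × d_in, LoRA trains two thin matrices of shapes rank × d_in and d_out × rank. The layer widths, rank, and choice of two adapted projections per layer below are illustrative assumptions, not values read from any model config, so treat the result as an order-of-magnitude check only.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_in -> d_out linear layer.

    LoRA learns A (rank x d_in) and B (d_out x rank); the original
    d_out x d_in weight stays frozen.
    """
    return rank * (d_in + d_out)


# Illustrative shapes only (assumptions): a 4096-wide model, two adapted
# projections per layer, rank 8, adapters on 16 layers.
HIDDEN = 4096
RANK = 8
LAYERS = 16
PROJECTIONS = [(HIDDEN, HIDDEN), (HIDDEN, 1024)]  # hypothetical projection shapes

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS)
total = per_layer * LAYERS
frozen = sum(d_in * d_out for d_in, d_out in PROJECTIONS) * LAYERS

print(f"trainable adapter parameters: {total:,}")           # 1,703,936
print(f"frozen weights in the same projections: {frozen:,}")  # 335,544,320
```

Under these assumptions the adapters come to under two million parameters against hundreds of millions of frozen weights in just those projections. A higher rank or more adapted projections scales the trainable count linearly.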

Step 1: install MLX and pick a base model

pip install mlx-lm

That’s it. For the base model I recommend Mistral-7B-Instruct-v0.3 or Meta-Llama-3.1-8B-Instruct in 4-bit, both available pre-quantized from the mlx-community Hugging Face org. The 4-bit 8B model needs about 5GB of memory for inference and roughly 12-16GB during training. That is tight but workable on a 16GB Mac if you drop the batch size to 1 and close Chrome, and comfortable on 32GB+.

# quick smoke test — this downloads the model too
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Write a one-line haiku about unified memory."
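The "about 5GB" inference figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes group-wise quantization with 64 weights per group and a 16-bit scale plus a 16-bit bias per group; those numbers are assumptions about the quantization scheme, and the estimate ignores unquantized layers, the KV cache, and runtime overhead.

```python
def quantized_weight_gb(n_params: float, bits: int = 4, group_size: int = 64,
                        scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Approximate size in GB (10^9 bytes) of group-quantized weights.

    Each weight costs `bits`, and every group of `group_size` weights
    shares one scale and one bias (assumed layout, see lead-in).
    """
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


# 8 billion parameters at an effective 4.5 bits per weight
print(round(quantized_weight_gb(8.0e9), 2))  # 4.5
```

The weights alone land around 4.5GB, so a total near 5GB once cache and overhead are added is plausible.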

Step 2: build the dataset (this is 80% of the outcome)

MLX expects JSONL files: train.jsonl and valid.jsonl in one directory, one JSON object per line. The simplest format is chat:

{"messages": [{"role": "user", "content": "Customer asks: Can I get a refund after 30 days?"}, {"role": "assistant", "content": "Hi! Our standard window is 30 days, but if the issue is a bug on our side, we make exceptions — just reply with your order ID and I'll sort it out personally. — Jakub"}]}
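One malformed line can derail a training run, so it is worth checking the file before you start. The validator below is a minimal sketch for the chat format shown above. The accepted role names and the "last message must come from the assistant" rule are my own conservative assumptions, not a specification of what MLX accepts.

```python
import json


def validate_chat_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        # Assumed role vocabulary; adjust if your data uses other roles.
        if msg.get("role") not in {"system", "user", "assistant"}:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty content")
    # Assumed rule: each example should end with the reply we want to learn.
    if isinstance(messages[-1], dict) and messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems


good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(validate_chat_line(good))  # []
print(validate_chat_line(bad))   # ['last message is not from the assistant']
```

To check a real file, loop over its lines and print the line number whenever the returned list is non-empty.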

For my support-voice model I exported 340 real email replies, paired each with the customer message that prompted it, and spent two hours cleaning them — removing one-offs, fixing typos I didn’t want the model to learn, deduplicating near-identical answers. Put ~90% in train.jsonl and ~10% in valid.jsonl.
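The split itself is easy to script. This sketch drops exact duplicates, shuffles with a fixed seed so the split is reproducible, and holds out about 10% for validation. The function names and the synthetic records are hypothetical, and it only catches exact duplicates; near-identical answers still need a human eye.

```python
import json
import random
from pathlib import Path


def split_dataset(records, valid_fraction=0.1, seed=0):
    """Drop exact duplicates, shuffle, and return (train, valid)."""
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec, sort_keys=True)  # canonical form for comparison
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_valid = max(1, round(len(unique) * valid_fraction))
    return unique[n_valid:], unique[:n_valid]


def write_jsonl(path, records):
    """Write one JSON object per line, creating the directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Synthetic stand-in for 340 real pairs, plus 5 accidental duplicates.
records = [
    {"messages": [
        {"role": "user", "content": f"question {i}"},
        {"role": "assistant", "content": f"answer {i}"},
    ]}
    for i in range(340)
]
records += records[:5]

train, valid = split_dataset(records)
print(len(train), len(valid))  # 306 34
```

With real data, finish with `write_jsonl("data/train.jsonl", train)` and `write_jsonl("data/valid.jsonl", valid)` so both files sit in one directory.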

How much data do you need? For style and format, 100-500 high-quality examples are genuinely enough. Quality dominates quantity: in my experience, 200 carefully curated examples beat 2,000 noisy ones. If you can’t articulate what every example has in common, the model can’t learn it either.

Step 3: train

The actual command, with the flags that matter:

mlx_lm.lora \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --train \
  --data ./data \
  --batch-size 2 \
  --num-layers 16 \
  --iters 600 \
  --learning-rate 1e-5 \
  --adapter-path ./adapters \
  --val-batches 25 \
  --steps-per-report 10 \
  --steps-per-eval 100

What the knobs do: --num-layers 16 applies LoRA to the last 16 transformer layers, the ones closest to the output (more layers = more capacity = more memory); --iters 600 is the number of training steps, and each step consumes one batch, so for ~300 examples at batch size 2 that’s 600 × 2 / 300 ≈ 4 epochs; --batch-size 2 keeps memory in check (drop to 1 on a 16GB machine, raise it on a Max/Ultra).
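The epoch arithmetic is worth keeping in a tiny helper, because --iters has to change whenever the dataset size or batch size does. A minimal sketch, assuming every step consumes exactly one full batch (the function names are mine):

```python
import math


def epochs_for(iters: int, batch_size: int, n_train: int) -> float:
    """How many passes over the training set a run makes."""
    return iters * batch_size / n_train


def iters_for(epochs: float, batch_size: int, n_train: int) -> int:
    """Steps needed to reach a target number of epochs (rounded up)."""
    return math.ceil(epochs * n_train / batch_size)


print(epochs_for(600, 2, 300))            # 4.0
print(round(epochs_for(600, 2, 306), 2))  # 3.92
print(iters_for(3, 2, 150))               # 225
```

The last line covers the small-dataset case: 150 examples at batch size 2 need only 225 steps to reach 3 epochs, far fewer than 600.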

Training time, measured: on my M3 Max, 600 iterations on the 4-bit 8B model ran at ~0.6 seconds per step. That is 6 minutes of pure stepping, and about 7 minutes for the whole run. An M1/M2 Pro lands around 15-25 minutes for the same run. This is the part people refuse to believe: yes, minutes, not days. A few hundred examples is a tiny dataset, and LoRA only trains adapter parameters amounting to ~0.1% of the model’s weights; the base weights stay frozen.

Watch the logs. You want training loss and validation loss both descending. The moment validation loss flattens or starts climbing while training loss keeps falling, you’re overfitting. Note the iteration count and use a checkpoint from before that point: MLX writes periodic checkpoints into ./adapters, and the --save-every flag controls how often.
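Picking the checkpoint can be mechanical once you have copied the validation losses out of the log. The sketch below takes (iteration, validation loss) pairs and returns the last point that improved on the previous best by at least `min_delta`. The loss values are made up for illustration, and pulling the pairs out of the log is left to you, since the log line format can differ between versions.

```python
def best_eval_point(val_losses, min_delta=0.0):
    """Return the (iteration, loss) pair to roll back to.

    val_losses: list of (iteration, validation_loss) in training order.
    A later point only replaces the current best if it improves on it
    by more than min_delta, so tiny gains do not count as progress.
    """
    best = val_losses[0]
    for point in val_losses[1:]:
        if point[1] < best[1] - min_delta:
            best = point
    return best


# Hypothetical run: validation loss bottoms out, then climbs (overfitting).
history = [(100, 1.92), (200, 1.61), (300, 1.48),
           (400, 1.47), (500, 1.53), (600, 1.60)]

print(best_eval_point(history))                  # (400, 1.47)
print(best_eval_point(history, min_delta=0.05))  # (300, 1.48)
```

With `min_delta=0.05` the helper treats the 1.48 to 1.47 step as a plateau and points at iteration 300, which is the more conservative choice on a tiny dataset.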

Step 4: evaluate like you mean it

Test the adapter without any fusing:

mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --adapter-path ./adapters \
  --prompt "Customer asks: Do you offer student discounts?" \
  --max-tokens 200

My evaluation method is unsophisticated and effective: I keep 20 held-out prompts in a text file, generate answers from both the base model and the tuned model, and blind-compare them. For the support model, the tuned output matched my voice in 17 of 20 cases: correct sign-off, correct “warm but brief” register, and it stopped opening every reply with “Thank you for reaching out,” which alone was worth the afternoon.
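The blind part is the easiest to get wrong by hand, so here is one way to script it. This sketch assumes you have already generated the outputs and hold them as two lists in prompt order. It randomly assigns each pair to labels A and B and keeps the answer key separate. All names and sample strings are hypothetical.

```python
import random


def make_blind_pairs(prompts, base_outputs, tuned_outputs, seed=0):
    """Return (sheet, key).

    sheet: rows with the two outputs under neutral labels 'A' and 'B'.
    key:   for each row, which label hides the tuned model's output.
    """
    rng = random.Random(seed)
    sheet, key = [], []
    for prompt, base, tuned in zip(prompts, base_outputs, tuned_outputs):
        tuned_is_a = rng.random() < 0.5
        a, b = (tuned, base) if tuned_is_a else (base, tuned)
        sheet.append({"prompt": prompt, "A": a, "B": b})
        key.append("A" if tuned_is_a else "B")
    return sheet, key


def score(key, picks):
    """Count how often the reviewer preferred the tuned model."""
    return sum(k == p for k, p in zip(key, picks))


prompts = ["Do you offer student discounts?", "Can I pause my subscription?"]
base_outputs = ["Thank you for reaching out. We do offer discounts.",
                "Thank you for reaching out. Yes, you can pause."]
tuned_outputs = ["Hi! Yes, 30% off with a university email. Jakub",
                 "Hi! Sure, pause any time from Settings. Jakub"]

sheet, key = make_blind_pairs(prompts, base_outputs, tuned_outputs)
for row in sheet:
    print(row["prompt"], "| A:", row["A"], "| B:", row["B"])
```

Record your picks as a list of "A"/"B" choices without looking at `key`, then call `score(key, picks)` to get a count like the 17 of 20 above.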

Also probe for damage: ask the tuned model a few general questions (“explain DNS in two sentences”). Heavy overtraining causes catastrophic forgetting, where the model starts answering everything in customer-support voice. If you see that, retrain with fewer --iters or a lower --learning-rate.

Step 5: fuse and serve with Ollama

To make the model a regular citizen of your toolchain, fuse the adapter into the weights and hand it to Ollama:

mlx_lm.fuse \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --adapter-path ./adapters \
  --save-path ./my-support-model \
  --export-gguf

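The --export-gguf flag writes a GGUF file Ollama can ingest. One caveat: the mlx-lm documentation describes GGUF export as limited to Mistral, Mixtral, and Llama-style models in fp16 precision, so when fusing onto a 4-bit base you will likely need to dequantize as part of the fuse step; run mlx_lm.fuse --help to find the flag your version uses. Then create a Modelfile: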

FROM ./my-support-model/ggml-model-f16.gguf
SYSTEM "You are Jakub's support assistant. Answer in his established style."
PARAMETER temperature 0.4

ollama create support-jakub -f Modelfile
ollama run support-jakub

Now it’s available to every Ollama-aware app on your Mac — BoltAI, Raycast, your scripts — as just another model.
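For the "your scripts" case, Ollama exposes a local HTTP API. The sketch below builds a non-streaming request for the support-jakub model created above, assuming Ollama is listening on its default port 11434. The helper names are mine, and the call that actually hits the server is left commented out so the snippet runs without one.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default port


def build_request(prompt: str, model: str = "support-jakub") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "support-jakub", timeout: float = 120) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


req = build_request("Customer asks: Do you offer student discounts?")
print(req.get_method(), req.full_url)

# Requires a running Ollama server with the model created:
# print(ask("Customer asks: Do you offer student discounts?"))
```

Setting "stream" to false returns one JSON object instead of a stream of chunks, which keeps a quick script simple.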

The failure modes that eat afternoons

I’ve hit all of these so you don’t have to:

Overfitting on tiny datasets is failure mode number one. Symptom: the model regurgitates training examples verbatim, or answers unrelated questions with memorized replies. With under 200 examples, keep it to 2-3 epochs and consider --num-layers 8.
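Verbatim regurgitation is easy to detect automatically. This sketch compares a generated reply against every training reply using the standard library's difflib and reports the closest match. A similarity near 1.0 on a prompt that was not in the training set is a red flag. The sample replies are invented, and any cutoff you pick (say 0.9) is a judgment call rather than an established threshold.

```python
from difflib import SequenceMatcher


def closest_training_match(output, training_replies):
    """Return (similarity, reply) for the most similar training reply.

    Similarity is difflib's ratio: 1.0 means identical strings.
    """
    best_ratio, best_reply = 0.0, None
    for reply in training_replies:
        ratio = SequenceMatcher(None, output, reply).ratio()
        if ratio > best_ratio:
            best_ratio, best_reply = ratio, reply
    return best_ratio, best_reply


# Hypothetical training replies and two model outputs to check.
training_replies = [
    "Hi! Our standard window is 30 days, but we make exceptions for bugs on our side.",
    "Hi! Yes, we offer a student discount. Reply with your university email.",
]
copied = training_replies[0]
fresh = "DNS translates domain names into IP addresses so browsers can find servers."

print(closest_training_match(copied, training_replies)[0])  # 1.0
print(round(closest_training_match(fresh, training_replies)[0], 2))
```

Run it over the outputs for your held-out prompts; if many of them score close to 1.0, cut the epochs before blaming the data.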

Wrong learning rate is number two. Too high (1e-4 on a small dataset) and loss oscillates or the model degrades into incoherence; too low (1e-6) and 600 iterations change nothing and you’ll conclude fine-tuning “doesn’t work.” Start at 1e-5; change it only based on what the loss curves tell you.

Inconsistent training data is the silent one. If a third of your examples sign off “Best, Jakub” and the rest don’t, the model flips a weighted coin every time. Whatever behavior you want, it must be near-universal in the data.
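You can measure this before training instead of discovering it afterwards. The sketch below reads chat-format JSONL lines and reports what fraction of final assistant replies end with a given sign-off. The function name and sample lines are hypothetical, and it only checks one exact string, so variants like a different dash or trailing emoji count as misses.

```python
import json


def signoff_rate(jsonl_lines, signoff):
    """Fraction of examples whose last assistant message ends with `signoff`."""
    total = hits = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        messages = json.loads(line)["messages"]
        replies = [m["content"] for m in messages if m["role"] == "assistant"]
        if not replies:
            continue
        total += 1
        hits += replies[-1].rstrip().endswith(signoff)
    return hits / total if total else 0.0


# Three hypothetical examples; only the first uses the sign-off.
lines = [
    '{"messages": [{"role": "user", "content": "Refund?"}, {"role": "assistant", "content": "Sure thing. Best, Jakub"}]}',
    '{"messages": [{"role": "user", "content": "Discount?"}, {"role": "assistant", "content": "Yes, 30% off."}]}',
    '{"messages": [{"role": "user", "content": "Pause?"}, {"role": "assistant", "content": "Any time, from Settings."}]}',
]

print(round(signoff_rate(lines, "Best, Jakub"), 2))  # 0.33
```

On a real file, pass `open("data/train.jsonl", encoding="utf-8")` as the first argument. A rate near 0.33 is exactly the weighted coin described above; push it close to 1.0 or 0.0 before training.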

Expecting knowledge. Worth repeating: if the answer to a question isn’t derivable from style and format, fine-tuning won’t provide it. Combine the tuned model with RAG when you need both voice and facts.

The afternoon, totaled

Realistic time budget: 2 hours building and cleaning the dataset, 10 minutes setting up MLX, 7-25 minutes training depending on your chip, 30 minutes evaluating and one retrain, 15 minutes fusing and wiring into Ollama. Call it three to four hours, most of it data work, all of it on hardware you already own, at an electricity cost of approximately one espresso.

The title promised a fine-tuned model on a MacBook in an afternoon. You now have the exact commands, the real flags, the measured timings, and the failure modes. The only ingredient I can’t ship in a code block is the 300 good examples — that part is yours.