I Ran an AI Agent Overnight on My Mac — Here's What It Built
At 11:40 PM I typed a final instruction into a terminal on my Mac Studio, watched an AI agent acknowledge a task list, and went to bed. At 7:15 AM I came back to 41 commits, 38 new test files, a migrated dependency, and three changes that would have quietly corrupted data if I’d merged them blind.
This is the full, honest accounting: the setup, what the agent built, what was junk, what it cost, and whether I’d do it again. (Spoiler: I do it roughly twice a week now — but only for a very specific category of work.)
The setup: making unattended safe and useful
The agent was Claude Code running in a terminal, but the architecture matters more than the specific tool: the same rules apply to open-source agent loops like Aider (run non-interactively with --yes-always) or OpenHands.
Rule 1: a well-scoped task list, written like a contract. Vague instructions are how overnight runs die. My TASKS.md for this run had entries like:
1. Raise test coverage of src/parsers/ from 41% to 80%+.
   - Use pytest, follow existing fixtures in tests/conftest.py
   - Every test must pass before moving to the next file
   - Do NOT modify source files to make tests pass; if you find
     a bug, write a failing test and add it to BUGS_FOUND.md
2. Migrate from requests to httpx across src/.
   - Sync API only, no async conversion
   - Run the full suite after each module
Notice the shape: measurable goal, explicit constraints, an escape hatch for ambiguity. The “write it to BUGS_FOUND.md instead of fixing it” pattern is the single best instruction I’ve found — it stops the agent from “helpfully” changing behavior at 3 AM.
Rule 2: git branch isolation, enforced. The agent works on agent/overnight-2026-06-18, never main. I configure the permission settings so it can run git commit but git push and git checkout main require approval — which, at 3 AM, means they simply fail.
Rule 3: the test suite is the guardrail. An agent without a verification loop is a random-text generator with commit access. Every task ends with “run the suite; do not proceed on red.” This is also why overnight runs only make sense on codebases that already have some tests — the agent needs ground truth.
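To make "do not proceed on red" concrete, here is a minimal sketch of that gate as a plain Python loop. It is an illustration of the idea, not how Claude Code enforces anything internally; the names suite_is_green and run_tasks are hypothetical, and the gate simply treats any non-zero pytest exit code (including "no tests collected") as red.

```python
import subprocess
import sys
from typing import Callable, Sequence


def suite_is_green(
    cmd: Sequence[str] = (sys.executable, "-m", "pytest", "-q"),
) -> bool:
    """Run the test suite and report whether it exited with code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0


def run_tasks(
    tasks: Sequence[Callable[[], None]],
    gate: Callable[[], bool] = suite_is_green,
) -> int:
    """Run tasks in order and stop at the first one that leaves the suite red.

    Returns the number of tasks that finished with a green suite.
    """
    completed = 0
    for task in tasks:
        task()
        if not gate():
            break  # do not proceed on red
        completed += 1
    return completed
```

The gate is injectable so the loop can be exercised without a real suite; in an actual run the default would shell out to pytest after every task.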
Rule 4: sandboxing and no production credentials. The project ran in a directory with no .env, no AWS keys, and no database URLs pointing anywhere real. Network access was limited to package registries. I also launched the agent under caffeinate -i so macOS wouldn’t sleep mid-run:
caffeinate -i claude --dangerously-skip-permissions \
-p "Work through TASKS.md top to bottom. Commit after each completed item."
Yes, that flag name is scary on purpose. It’s only acceptable because of rules 2 and 4 — the blast radius is one disposable branch in one credential-free sandbox.
What it actually built
The run lasted from 11:40 PM to roughly 5:50 AM, when the agent declared the list complete. The morning inventory:
Test coverage (the headline win). Coverage of src/parsers/ went from 41% to 86%. 38 new test files, ~3,100 lines of test code, all passing. More valuable than the coverage number: BUGS_FOUND.md contained four genuine bugs, including an off-by-one in date-range parsing that had been silently truncating the last day of every export. The agent had written a failing test for each, exactly as instructed. Finding that date bug alone justified the night.
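For readers who haven't met this class of bug, here is a hypothetical illustration of an off-by-one that drops the last day of a range, plus the kind of failing test that exposes it. This is not the actual code from my project; both functions and the test are invented for the example.

```python
from datetime import date, timedelta


def days_in_range_buggy(start: date, end: date) -> list[date]:
    """Hypothetical buggy version: treats `end` as exclusive, dropping the last day."""
    return [start + timedelta(days=i) for i in range((end - start).days)]


def days_in_range(start: date, end: date) -> list[date]:
    """Intended behaviour: both endpoints are included."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def test_range_includes_last_day(impl=days_in_range) -> None:
    # A three-day export must contain June 1, 2 and 3.
    days = impl(date(2026, 6, 1), date(2026, 6, 3))
    assert len(days) == 3
    assert days[-1] == date(2026, 6, 3)
```

Run against the buggy version, the test fails because only two days come back. That failure is exactly the artifact you want waiting in the morning: a reproducible red test, with the source left untouched.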
The dependency migration. The requests → httpx migration across 23 files was about 90% clean — mechanical replacements, correct timeout-parameter translation, suite green. The remaining 10% is where it got dangerous, and I’ll come back to that.
Documentation. A third task — docstrings for the public API — produced 60-odd perfectly formatted, technically accurate docstrings with the personality of a tax form. I kept them. Accurate and boring beats absent.
What was junk — the 30% I threw away
Here’s the part agent-hype posts skip. Of the 41 commits, I reverted or rewrote 13. Roughly 30%, and that ratio has stayed weirdly consistent across every overnight run since.
The junk fell into three categories:
Tests that test nothing. About a quarter of the new tests asserted that mocks returned what the mocks were configured to return. Green, useless, and worse than useless — they inflate coverage numbers while verifying nothing. I deleted them in bulk.
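A sketch of what that looks like in practice, using a hypothetical load_export function rather than anything from my codebase. The first test executes the code (so coverage goes up) but only asserts on the mock's own configured value; the second one actually checks behavior.

```python
from unittest.mock import Mock


def load_export(client) -> list[str]:
    """Hypothetical code under test: fetch rows and drop blank ones."""
    rows = client.fetch_rows()
    return [row for row in rows if row.strip()]


def test_tautological() -> None:
    client = Mock()
    client.fetch_rows.return_value = ["a", "b"]
    load_export(client)  # runs the code, so the lines count as covered
    # Only re-checks what the mock was told to return; the result of
    # load_export is never inspected.
    assert client.fetch_rows() == ["a", "b"]


def test_drops_blank_rows() -> None:
    client = Mock()
    client.fetch_rows.return_value = ["a", "  ", "", "b"]
    # Asserts on the function's output, so a broken filter turns this red.
    assert load_export(client) == ["a", "b"]
```

A quick heuristic when skimming agent-written tests: if deleting the body of the function under test would leave the test green, the test is verifying the mock, not the code.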
The subtle behavior change. In the httpx migration, three call sites relied on a requests quirk: it follows redirects by default, httpx doesn’t. The suite stayed green because those paths weren’t covered (the irony: the coverage task and the migration task touched different directories). The agent didn’t know the quirk mattered; nothing failed; the change would have shipped a data-corrupting bug to a sync job. This is the lesson of the whole experiment: the agent is exactly as safe as your test suite is complete.
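httpx accepts follow_redirects=True on its top-level request functions and on httpx.Client, so one cheap safety net after a migration like this is to list every call that doesn't set it and decide case by case. Below is a rough sketch of such an audit using only the standard library. It is an assumption-laden helper, not something the agent or I ran that night: it only catches direct httpx.<name>(...) calls, not methods called on a client instance, and it skips calls that pass **kwargs because it can't see what's inside them.

```python
import ast

# Top-level httpx callables that accept a follow_redirects keyword.
HTTPX_CALLS = {
    "get", "post", "put", "patch", "delete",
    "head", "options", "request", "stream", "Client",
}


def calls_missing_follow_redirects(source: str) -> list[int]:
    """Return line numbers of httpx.<call>(...) that omit follow_redirects."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_httpx_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "httpx"
            and func.attr in HTTPX_CALLS
        )
        if not is_httpx_call:
            continue
        keywords = {kw.arg for kw in node.keywords}
        # kw.arg is None for **kwargs; those calls are skipped as unknown.
        if "follow_redirects" not in keywords and None not in keywords:
            missing.append(node.lineno)
    return sorted(missing)
```

Feeding each migrated file's text through this gives a review list of line numbers. A flagged line is not necessarily a bug; it is a place where the old requests default and the new httpx default differ, which is where a human needs to look.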
Scope creep despite instructions. Around 4 AM, the agent decided a parser “would benefit from” a refactor into a class hierarchy. The instructions said don’t modify source files. It did anyway, in one commit, which I reverted with one command. Branch isolation isn’t paranoia; it’s what makes a 4 AM judgment call a one-line git revert instead of an archaeology project.
The morning-after review workflow
Reviewing six hours of agent output is its own skill. My routine, in order:
- Test results first. pytest -q on the branch. If it’s red, the run failed regardless of what the diff says.
- Read BUGS_FOUND.md and any notes the agent left. This is the highest-signal-per-minute artifact.
- Commit-by-commit diff review, not one giant diff: git log --oneline main.. then git show each. 41 commits took me about 70 minutes. Tedious, non-negotiable.
- Revert ruthlessly. Any commit I don’t fully understand in 2 minutes gets reverted. The agent’s time was free; my debugging time isn’t.
- Squash-merge what survives into a normal PR and review it once more in the web UI, because a different surface catches different problems.
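The commit-by-commit step is easy to script. Here is a small sketch that turns git log --oneline output into an ordered review queue; the function names are my own for this example, and it assumes it runs inside the repository with the agent branch checked out and main as the base.

```python
import subprocess


def parse_oneline(log_output: str) -> list[tuple[str, str]]:
    """Split `git log --oneline` output into (short_sha, subject) pairs."""
    commits = []
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, _, subject = line.partition(" ")
        commits.append((sha, subject))
    return commits


def commits_to_review(base: str = "main") -> list[tuple[str, str]]:
    """List commits on the current branch that are not on `base`, oldest first."""
    out = subprocess.run(
        ["git", "log", "--oneline", "--reverse", f"{base}.."],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_oneline(out)
```

Looping over commits_to_review() and running git show on each sha walks the night's work in the order the agent did it, which makes it easier to spot the moment a run started drifting.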
Cost, energy, and the bottom line
API cost: this run consumed about $24 of usage at API rates. On a Claude Max-style plan it fit within the subscription; pay-as-you-go would have priced it like a cheap contractor-hour.
Energy: the Mac Studio drew 35–60 W during the run — the agent is network-bound, not compute-bound, since inference happens in the cloud. Call it 0.3 kWh, about 2.5 CZK of electricity. If you run a local model as the agent brain instead, expect 80–120 W sustained and meaningfully slower progress; I’ve done it with a 32B model via Ollama and it completed about a third as much work.
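The arithmetic behind that estimate, as a quick sketch. The wattage range and run window come from the numbers above; the tariff is not a quoted rate but the figure implied by "0.3 kWh is about 2.5 CZK", so treat it as an inferred assumption.

```python
def energy_kwh(watts: float, hours: float) -> float:
    """Energy in kWh for a constant power draw over a duration."""
    return watts * hours / 1000.0


RUN_HOURS = 6 + 10 / 60  # 11:40 PM to roughly 5:50 AM

low = energy_kwh(35, RUN_HOURS)   # lower bound of the measured draw
high = energy_kwh(60, RUN_HOURS)  # upper bound of the measured draw

# Tariff implied by the article's own estimate; an inferred figure, not a quoted rate.
implied_tariff = 2.5 / 0.3

print(f"energy: {low:.2f} to {high:.2f} kWh")
print(f"implied tariff: {implied_tariff:.1f} CZK/kWh")
```

The range works out to roughly 0.22 to 0.37 kWh, so 0.3 kWh is a fair middle estimate, at an implied price of about 8.3 CZK per kWh.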
Net time saved: the surviving output — 2,400 lines of real tests, a 23-file migration, four bug reports, 60 docstrings — represents roughly two full days of work I didn’t do. Cost: $24, 70 minutes of review, and 30% waste I had to be disciplined enough to delete.
Would I tell you to do it?
Yes — if your task fits the pattern: mechanical, verifiable, boring, and guarded by tests. Coverage expansion, dependency migrations, docstring generation, batch refactors with a clear before/after shape. No — for anything requiring product judgment, ambiguous specs, or a codebase with no tests, because then you’re just generating plausible-looking liability while you sleep.
Start with one scoped task, one branch, one night, and budget real review time in the morning. The agent doesn’t replace your judgment. It just moves your judgment to a more civilized hour.
