Photo: Unsplash
RAG on Your Own Notes — Build a Second Brain That Actually Answers
I have nine years of notes in Obsidian: 4,300 files of meeting notes, research clippings, project decisions, and half-finished ideas. The “second brain” promise was always that this archive would compound into something smarter than me. The reality was that it compounded into something I searched, badly, with keyword matching, and usually gave up.
What finally changed it: RAG over the vault, running entirely on my Mac. Now I type “what did we decide about the pricing tiers in the March meeting with Tomáš?” and get the answer, with links to the exact notes it came from, in about four seconds, from a model running on my own hardware. This post explains the concept in plain words, compares the three easiest paths on a Mac, and walks through the Obsidian one step by step.
RAG in plain words
RAG — retrieval-augmented generation — is three steps wearing a trench coat:
- Embed. Every note gets split into chunks, and each chunk is converted into a vector — a list of numbers that captures its meaning. Chunks about “quarterly pricing decisions” end up numerically near each other, even with zero shared keywords. The vectors go into an index on disk.
- Retrieve. When you ask a question, the question is embedded the same way, and the index returns the handful of chunks closest in meaning to it.
- Generate. Those chunks get pasted into the LLM’s context along with your question, with an instruction like “answer using only this material and cite your sources.”
That’s it. The model isn’t trained on your notes and doesn’t memorize them — it’s handed the relevant pages at question time, like a researcher with a very fast librarian. This is also why RAG is the right tool for personal knowledge where fine-tuning is the wrong one: RAG retrieves facts; fine-tuning teaches style.
The three easy paths on a Mac
All three options below run fully locally against Ollama. Prerequisite for each:
brew install ollama
ollama pull llama3.1:8b # the answering model
ollama pull nomic-embed-text # the embedding model (tiny, excellent)
Path 1: Obsidian + a Copilot-style community plugin. If your notes are already in Obsidian, this is the winner, and it’s the walkthrough below. Plugins like Copilot for Obsidian or Smart Connections index your vault and add a chat sidebar that answers with [[wikilinks]] to source notes. Friction: lowest. Your notes never leave their folder.
Path 2: AnythingLLM. A free desktop app where you create a workspace, drag in folders (Markdown, PDFs, Word docs — not just notes), pick Ollama as both LLM and embedder, and chat. Best choice if your knowledge is scattered across formats, or you want RAG over documents plus notes. Friction: low. Polish: highest.
Path 3: Open WebUI. A self-hosted web interface for Ollama with solid built-in document RAG — upload files or point it at a directory, then reference them with # in chat. Best if you already run Open WebUI as your local ChatGPT replacement, or want one interface shared across machines (pair it with Tailscale). Friction: medium — it’s a Docker container, not a double-clickable app.
The walkthrough: Obsidian + Copilot + Ollama
Fifteen minutes, start to finish.
Step 1 — serve Ollama with the right CORS setting. Obsidian plugins call Ollama from an app context, so it must allow that origin:
OLLAMA_ORIGINS="app://obsidian.md*" ollama serve
(To make it permanent: launchctl setenv OLLAMA_ORIGINS "app://obsidian.md*" and restart Ollama.)
Step 2 — install the plugin. In Obsidian: Settings → Community plugins → Browse → search “Copilot” → Install → Enable.
Step 3 — point it at local models. In the Copilot settings, add a custom model with provider Ollama: chat model llama3.1:8b, embedding model nomic-embed-text, base URL http://localhost:11434. Skip every cloud API key field — you don’t need any.
Step 4 — index the vault. Run the command “Copilot: Index vault for QA” from the command palette. This is the embedding pass: my 4,300-note vault took about 9 minutes on an M-series Max, one-time, with incremental updates afterwards as notes change. The index lives inside your vault folder as local files.
Step 5 — ask. Switch the chat sidebar to Vault QA mode and ask real questions:
“What were the open objections in the Q3 partner negotiations?” “Summarize everything I’ve noted about MLX fine-tuning, with links.” “When did I last change the home server backup strategy, and why?”
Answers arrive with wikilinks to the source notes — click through and verify. That verification loop is the whole game: a second brain you can’t audit is just a confident stranger.
What it’s great at — and where it disappoints
Two months of daily use, honestly scored.
Genuinely great: factual recall from your own material. “What did X say in the meeting on the 14th,” “what dosage did the vet recommend,” “which library did I pick for the PDF parsing and what was the alternative” — near-perfect, because the answer lives in one or two chunks and retrieval excels at finding them. Same for research summaries scoped to a topic: “summarize my notes on local LLM quantization” produces a tight digest with sources. It has also resurfaced notes I’d completely forgotten existed, which keyword search never did because I couldn’t remember the keywords.
Disappointing: reasoning across many documents. “How has my thinking about pricing evolved over two years” or “find contradictions between my project plans” produces shallow, confident-sounding mush. The mechanics explain why: retrieval returns the top handful of chunks, so any question whose true answer is spread across forty notes gets answered from five. RAG is a precision tool for finding and synthesizing a few passages — it is not a analyst that reads your whole archive. Knowing this boundary up front is the difference between loving the tool and rage-quitting it.
Also mundane but real: answer quality is capped by note quality. Untitled fragments and orphaned bullet points retrieve poorly. RAG rewards the structured note-taking you were already supposed to be doing.
Chunking and embeddings, in one approachable paragraph
The two knobs that matter, demystified. Chunking is how your notes get split before embedding: too small (a sentence) and chunks lose their context; too large (a whole note) and the retrieved material is mostly irrelevant padding. The sweet spot for notes is a few hundred tokens with some overlap, and splitting on Markdown headings beats splitting blindly — which is one reason well-structured notes retrieve better. The embedding model is what decides which chunks count as “similar,” and you don’t need a big one: nomic-embed-text (about 270MB) embeds an entire vault in minutes and punches far above its size; mxbai-embed-large is a fine alternative. One warning that saves real pain: if you ever switch embedding models, re-index everything — vectors from different models live in different mathematical spaces and comparing them produces silent garbage.
The privacy dividend
Worth stating plainly, because notes are the most intimate dataset you own — health, salary negotiations, complaints about colleagues, half-formed opinions you’d never publish. In this setup, every component runs on localhost: Ollama serves the models, the embedding index is files inside your vault, and the plugin talks to 127.0.0.1:11434. No accounts, no telemetry decisions to audit, no terms-of-service update that suddenly applies to your journal. I verified with a network monitor during indexing and querying: zero outbound connections. The second brain stays in your skull’s general vicinity.
Start with twenty notes
The mistake is treating this as a Grand Archive Project. Don’t. Install the plugin tonight, index whatever vault you have, and ask it five questions you actually want answered. Either it earns its place in your daily loop within a week — for me it took two days, when it answered a question about a 2021 decision I’d have sworn was undocumented — or you delete the plugin and you’re out fifteen minutes.
The promise of the title is the part that surprised me most: the second brain finally answers. Nine years of accumulated notes stopped being a write-only archive the day a 4GB model and a 270MB embedder started reading them back to me — with citations, on my own machine, for free.
