The Problem
I have an Obsidian vault full of notes representing what I know (or at least what I’ve bothered to write down). When I pick up a new technical book, I have no idea which chapters will be review, which will be genuinely new, and which will connect ideas I already have in ways I haven’t thought of. I end up reading linearly and either skimming stuff I know or missing the parts that would actually fill gaps.
What if I could “diff” a book against my knowledge base and get a personalized reading plan?
Why This Should Work
The core insight comes from how Embeddings work. When you embed text — whether it’s a paragraph from my notes or a section from a book — you get a vector in high-dimensional space where proximity encodes semantic similarity. Two chunks about the same topic will be near each other even if they use completely different words.
That means I can literally measure the “distance” between what a book talks about and what I’ve already written about. Chunks of the book that are far from anything in my vault are likely new territory. Chunks that are close to my notes are probably review. And the interesting stuff is in between — topics I’ve touched on but where the book goes deeper.
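To make that concrete, here's a minimal sketch of the distance measurement using a local sentence-transformers model (the model name and sample chunks are just placeholders, not a settled choice):

```python
# Minimal sketch: how far is a book chunk from anything in my notes?
# Assumes sentence-transformers is installed; the model is one option among several.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

vault_chunks = [
    "Raft elects a leader and replicates a log to followers...",
    "The CAP theorem trades consistency against availability during partitions...",
]
book_chunk = "Chapter 9 covers consensus: total order broadcast, Paxos, and Raft..."

# normalize_embeddings=True makes cosine similarity a plain dot product
vault_vecs = model.encode(vault_chunks, normalize_embeddings=True)
book_vec = model.encode([book_chunk], normalize_embeddings=True)[0]

similarities = vault_vecs @ book_vec
print(f"closest note similarity: {similarities.max():.3f}")  # high → likely review, low → new territory
```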
The Obsidian vault is a great starting point because it gives us more than just raw text. The [[wiki-links]] between notes form an explicit knowledge graph — a map of how I think concepts relate. That graph structure enables a type of gap detection that pure embeddings can’t do: finding novel relationships between concepts I already know individually.
Architecture
Data Sources
- My knowledge: Obsidian vault (markdown files + link graph)
- Reference material: Books, textbooks, docs (PDF/epub/markdown)
Core Components
- Vault Parser — Read all markdown, extract text content, extract the [[link]] graph, pull metadata (tags, frontmatter, folder structure); see the sketch after this list
- Embedding Pipeline — Chunk and embed both vault notes and reference material using the same model
- Knowledge Graph Builder — Construct graph from Obsidian links, optionally enrich with LLM-extracted concepts
- Diff Engine — Compare the two embedding clouds + graph structure to identify gaps
- Report Generator — Output a readable, actionable summary (ideally as a new Obsidian note with links back into the vault)
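Before wiring in obsidiantools, a hand-rolled version of the first two pieces is small enough to sketch (the regex only handles plain [[Note Name]] links, and the vault path is illustrative):

```python
# Sketch of the Vault Parser + Knowledge Graph Builder: walk the vault,
# collect note text, and turn [[wiki-links]] into a NetworkX directed graph.
import re
from pathlib import Path
import networkx as nx

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # capture link target, ignore aliases and anchors

def build_vault_graph(vault_path: str):
    graph = nx.DiGraph()
    texts = {}
    for md_file in Path(vault_path).expanduser().rglob("*.md"):
        name = md_file.stem
        text = md_file.read_text(encoding="utf-8")
        texts[name] = text
        graph.add_node(name)
        for target in WIKILINK.findall(text):
            graph.add_edge(name, target.strip())
    return graph, texts

graph, texts = build_vault_graph("~/vault")  # illustrative path
print(graph.number_of_nodes(), "notes,", graph.number_of_edges(), "links")
```

obsidiantools would replace most of this and handle edge cases (aliases, embeds, unresolved links) for free.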
Tech Stack (Initial Thinking)
| Component | Tool | Reasoning |
|---|---|---|
| Vault parsing | obsidiantools (Python) | Handles markdown + link extraction, gives NetworkX graph |
| Embeddings | OpenAI text-embedding-3-small or open-source bge-large | Good quality, cheap. Open-source option avoids API dependency |
| Vector storage | ChromaDB | Lightweight, local, good enough for thousands of chunks. No need for Pinecone at this scale |
| Graph analysis | NetworkX | Already a dependency via obsidiantools, solid for this use case |
| Concept extraction | Claude API or local LLM | For enriching the knowledge graph beyond just wiki-links |
| Book processing | PyMuPDF or ebooklib | PDF/epub → text extraction |
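To make the ChromaDB choice concrete, here's roughly what indexing and querying look like (the vectors are tiny dummies; in practice they come from the embedding pipeline, and the collection name is arbitrary):

```python
# Sketch: store vault-chunk embeddings in a local ChromaDB collection, then
# query with a book chunk's embedding to find its nearest vault neighbors.
import chromadb

client = chromadb.PersistentClient(path="./kb_diff_index")   # local, on-disk
vault_col = client.get_or_create_collection("vault_chunks")

vault_col.add(
    ids=["raft-note-0", "cap-note-0"],
    documents=["Raft leader election...", "CAP theorem trade-offs..."],
    embeddings=[[0.1, 0.9, 0.0], [0.8, 0.1, 0.2]],        # dummy vectors for illustration
    metadatas=[{"note": "Distributed Systems"}, {"note": "CAP Theorem"}],
)

hits = vault_col.query(query_embeddings=[[0.2, 0.8, 0.1]], n_results=2)
print(hits["distances"][0])   # small distance → likely review, large → likely new
```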
How the Diff Works
Three Layers of “New to Me”
Layer 1: Novel Concepts (easiest to detect)
The book discusses topics that have no nearby neighbors in my vault’s embedding space. These are completely new territory — I haven’t written about anything similar.
Detection: for each book chunk, find its nearest neighbor in the vault embeddings. If the distance exceeds a threshold → novel.
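A rough cut at that rule, assuming L2-normalized embeddings stacked into NumPy arrays (the threshold value is a guess; see the tuning question under Open Questions):

```python
import numpy as np

NOVELTY_THRESHOLD = 0.45  # assumed starting point, needs per-vault calibration

def novel_chunk_indices(book_vecs: np.ndarray, vault_vecs: np.ndarray,
                        threshold: float = NOVELTY_THRESHOLD) -> np.ndarray:
    """Indices of book chunks whose nearest vault neighbor is still far away.
    With L2-normalized rows, 1 - dot product is cosine distance."""
    sims = book_vecs @ vault_vecs.T        # (n_book, n_vault) cosine similarities
    nearest = sims.max(axis=1)             # best vault match per book chunk
    return np.where(1.0 - nearest > threshold)[0]
```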
Layer 2: Depth Gaps (moderately hard)
I have a note or two touching on a topic, but the book has an entire chapter. The embedding distance is small (I’m “near” the topic) but my coverage is thin.
Detection: compare embedding density — count how many vault chunks vs. book chunks fall in the same region. A ratio skewed heavily toward the book signals a depth gap.
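One way to sketch that density comparison (the neighborhood radius is an assumption, and the topic center could simply be the book chunk's own embedding):

```python
import numpy as np

def depth_gap_ratio(book_vecs: np.ndarray, vault_vecs: np.ndarray,
                    center_vec: np.ndarray, radius: float = 0.3) -> float:
    """Count book vs. vault chunks within `radius` cosine distance of a topic center
    (all vectors L2-normalized). A ratio well above 1 suggests the book goes deeper."""
    book_near = int(np.sum(1.0 - book_vecs @ center_vec < radius))
    vault_near = int(np.sum(1.0 - vault_vecs @ center_vec < radius))
    return book_near / max(vault_near, 1)
```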
Layer 3: Novel Relationships (hardest, most valuable)
The book draws a connection between two concepts I know independently but haven’t linked. My knowledge graph has both nodes but no edge.
Detection: when a book chunk is semantically close to two vault notes that aren’t linked to each other in the Obsidian graph, that’s a candidate. This is where the wiki-link graph really pays off.
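And a sketch of that candidate check, assuming per-note embeddings plus the NetworkX graph from the vault parser (the similarity cutoff is arbitrary):

```python
from itertools import combinations

def novel_relationship_candidates(book_vec, note_vecs: dict, graph, sim_threshold: float = 0.6):
    """Pairs of vault notes that are both close to the same book chunk
    but have no edge between them in the Obsidian link graph."""
    close = [name for name, vec in note_vecs.items() if float(book_vec @ vec) > sim_threshold]
    return [
        (a, b) for a, b in combinations(close, 2)
        if not graph.has_edge(a, b) and not graph.has_edge(b, a)
    ]
```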
The Output
For a given book, generate something like:
## Reading Plan: "Designing Data-Intensive Applications"
### High Priority (Likely New)
- Chapter 9: Consistency and Consensus — no coverage in vault
- Chapter 7: Transactions (sections on serializable isolation) — you have surface notes on transactions but nothing on isolation levels
### Medium Priority (Depth Gaps)
- Chapter 5: Replication — you have notes on leader-follower but nothing on leaderless or conflict resolution
- Chapter 3: Storage Engines — your LSM-tree notes are thin compared to the book's treatment
### Novel Connections
- The book links [[CAP Theorem]] to [[Linearizability]] in a way your vault doesn't — Chapter 9 draws this out explicitly
### Likely Review (Skim or Skip)
- Chapter 1: Foundations — high overlap with your existing [[System Design]] and [[Distributed Systems]] notes
Phases
Phase 1: MVP — Embedding-Only Diff
Get the basic pipeline working without the knowledge graph layer. Parse vault, embed everything, embed a book, compute nearest-neighbor distances, produce a ranked list of “most novel” sections. This alone is useful and validates the approach.
Deliverables: a Python script that takes a vault path and a book file and outputs a markdown report.
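The entry point could be as small as this skeleton; chunk_vault, chunk_book, and rank_by_novelty are hypothetical helpers standing in for the pieces sketched above:

```python
# Sketch of the Phase 1 CLI: vault path + book file in, markdown report out.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Diff a book against an Obsidian vault")
    parser.add_argument("vault", help="path to the Obsidian vault")
    parser.add_argument("book", help="path to the book (pdf/epub/markdown)")
    parser.add_argument("-o", "--out", default="reading-plan.md")
    args = parser.parse_args()

    vault_chunks = chunk_vault(args.vault)      # hypothetical helpers: parse, chunk, and embed,
    book_chunks = chunk_book(args.book)         # then rank book chunks by distance from the vault
    report = rank_by_novelty(book_chunks, vault_chunks)

    with open(args.out, "w", encoding="utf-8") as f:
        f.write(report)

if __name__ == "__main__":
    main()
```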
Phase 2: Add the Knowledge Graph
Incorporate Obsidian’s link structure. Build the graph, use it to detect novel-relationship candidates. Enrich the graph with LLM-extracted concepts from notes that don’t have many wiki-links.
Deliverables: Enhanced report with relationship gap detection. Output as an Obsidian note with [[links]] back into the vault.
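A rough sketch of the enrichment step using the Claude API option from the tech stack (the model name and prompt wording are illustrative, not settled):

```python
# Sketch: ask an LLM for the key concepts in a note that has few wiki-links,
# so the graph builder can propose candidate nodes and edges.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_concepts(note_text: str) -> list[str]:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "List the key technical concepts in this note, one per line:\n\n" + note_text,
        }],
    )
    lines = msg.content[0].text.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]
```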
Phase 3: RLM Exploration
Experiment with using a Recursive Language Model instead of (or alongside) the embedding pipeline. Instead of pre-computing all embeddings, give an RLM the vault contents and the book as variables, and let it programmatically compare them. This might handle nuance better than pure vector similarity — an RLM can understand why two passages are related, not just that they’re similar.
Deliverables: RLM-based diff implementation, comparison against embedding-only approach.
Phase 4: Polish and Iterate
Interactive mode (“tell me more about this gap”), better chunking strategies, support for multiple books/reference corpora, maybe a simple UI. Consider whether this is worth packaging as an Obsidian plugin.
Open Questions
- Chunking strategy: Should I chunk by heading (which respects document structure) or by fixed token count (which is simpler)? Heading-based is probably better for books with clear structure. Need to experiment; see the chunking sketch after this list.
- Embedding model choice: OpenAI’s models are easy but introduce API costs and dependency. Open-source models like bge-large-en-v1.5 run locally and are nearly as good. Worth benchmarking both.
- How to handle vault coverage bias: My vault represents what I’ve written about, not everything I know. Some topics I understand well but never made notes on. This will create false “gaps.” Possible mitigation: use an LLM to interview me about detected gaps before finalizing the report.
- Threshold tuning: What cosine similarity score separates “you know this” from “this is new”? Probably needs to be calibrated per-vault. Could bootstrap by having me label a few book sections as known/unknown and fitting the threshold.
- Multi-book support: Eventually I want to load multiple reference books and diff against all of them — “across these 5 distributed systems books, here’s your aggregate gap map.” Straightforward extension but adds complexity to the report.
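For the chunking question above, the heading-based option is only a few lines; this sketch splits markdown on heading lines (the minimum-length cutoff is arbitrary, and anything before the first heading is ignored):

```python
# Sketch: heading-based chunking for markdown sources (vault notes, or a book
# already converted to markdown). Each chunk keeps its heading for context.
import re

HEADING = re.compile(r"^#{1,6}\s+.+$", re.MULTILINE)

def chunk_by_heading(markdown_text: str, min_chars: int = 200) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(markdown_text)]
    bounds = starts + [len(markdown_text)]
    chunks = []
    for start, end in zip(bounds, bounds[1:]):
        section = markdown_text[start:end].strip()
        if len(section) >= min_chars:      # drop near-empty sections; cutoff is arbitrary
            chunks.append(section)
    return chunks
```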
Related Concepts
- Embeddings
- Retrieval-Augmented Generation
- Recursive Language Model
- Vector Databases
- Cosine Similarity
- Knowledge Graphs
- Out-of-Core Algorithms
References
- Recursive Language Models paper — Zhang, Kraska, Khattab (MIT, 2025)
- RLM GitHub repo — Drop-in inference library
- obsidiantools — Python library for parsing Obsidian vaults
- ChromaDB — Lightweight local vector database
- OpenAI Embeddings — API docs for text-embedding models