The Problem
I have an Obsidian vault full of notes representing what I know (or at least what I’ve bothered to write down). When I pick up a new technical book, I have no idea which chapters will be review, which will be genuinely new, and which will connect ideas I already have in ways I haven’t thought of. I end up reading linearly and either skimming stuff I know or missing the parts that would actually fill gaps.
What if I could “diff” a book against my knowledge base and get a personalized reading plan?
Why This Should Work
The core insight comes from how Embeddings work. When you embed text — whether it’s a paragraph from my notes or a section from a book — you get a vector in high-dimensional space where proximity encodes semantic similarity. Two chunks about the same topic will be near each other even if they use completely different words.
That means I can literally measure the “distance” between what a book talks about and what I’ve already written about. Chunks of the book that are far from anything in my vault are likely new territory. Chunks that are close to my notes are probably review. And the interesting stuff is in between — topics I’ve touched on but where the book goes deeper.
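To make that concrete, here's a minimal sketch of the distance measurement using a local sentence-transformers model (the model name and sample chunks are just placeholders, not a settled choice):

```python
# Minimal sketch: how far is a book chunk from anything in my notes?
# Assumes sentence-transformers is installed; the model is one option among several.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

vault_chunks = [
    "Raft elects a leader and replicates a log to followers...",
    "The CAP theorem trades consistency against availability during partitions...",
]
book_chunk = "Chapter 9 covers consensus: total order broadcast, Paxos, and Raft..."

# normalize_embeddings=True makes cosine similarity a plain dot product
vault_vecs = model.encode(vault_chunks, normalize_embeddings=True)
book_vec = model.encode([book_chunk], normalize_embeddings=True)[0]

similarities = vault_vecs @ book_vec
print(f"closest note similarity: {similarities.max():.3f}")  # high → likely review, low → new territory
```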
The Obsidian vault is a great starting point because it gives us more than just raw text. The [[wiki-links]] between notes form an explicit knowledge graph — a map of how I think concepts relate. That graph structure enables a type of gap detection that pure embeddings can’t do: finding novel relationships between concepts I already know individually.
Architecture
Data Sources
- My knowledge: Obsidian vault (markdown files + link graph)
- Reference material: Books, textbooks, docs (PDF/epub/markdown)
Core Components
- Vault Parser — Read all markdown, extract text content, extract the [[link]] graph, pull metadata (tags, frontmatter, folder structure); see the sketch after this list
- Embedding Pipeline — Chunk and embed both vault notes and reference material using the same model
- Knowledge Graph Builder — Construct graph from Obsidian links, optionally enrich with LLM-extracted concepts
- Diff Engine — Compare the two embedding clouds + graph structure to identify gaps
- Report Generator — Output a readable, actionable summary (ideally as a new Obsidian note with links back into the vault)
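Before wiring in obsidiantools, a hand-rolled version of the first two pieces is small enough to sketch (the regex only handles plain [[Note Name]] links, and the vault path is illustrative):

```python
# Sketch of the Vault Parser + Knowledge Graph Builder: walk the vault,
# collect note text, and turn [[wiki-links]] into a NetworkX directed graph.
import re
from pathlib import Path
import networkx as nx

WIKILINK = re.compile(r"\[\[([^\]|#]+)")  # capture link target, ignore aliases and anchors

def build_vault_graph(vault_path: str):
    graph = nx.DiGraph()
    texts = {}
    for md_file in Path(vault_path).expanduser().rglob("*.md"):
        name = md_file.stem
        text = md_file.read_text(encoding="utf-8")
        texts[name] = text
        graph.add_node(name)
        for target in WIKILINK.findall(text):
            graph.add_edge(name, target.strip())
    return graph, texts

graph, texts = build_vault_graph("~/vault")  # illustrative path
print(graph.number_of_nodes(), "notes,", graph.number_of_edges(), "links")
```

obsidiantools would replace most of this and handle edge cases (aliases, embeds, unresolved links) for free.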
Tech Stack (Initial Thinking)
| Component | Tool | Reasoning |
|---|---|---|
| Vault parsing | obsidiantools (Python) | Handles markdown + link extraction, gives NetworkX graph |
| Embeddings | OpenAI text-embedding-3-small or open-source bge-large | Good quality, cheap. Open-source option avoids API dependency |
| Vector storage | ChromaDB | Lightweight, local, good enough for thousands of chunks. No need for Pinecone at this scale |
| Graph analysis | NetworkX | Already a dependency via obsidiantools, solid for this use case |
| Concept extraction | Claude API or local LLM | For enriching the knowledge graph beyond just wiki-links |
| Book processing | PyMuPDF or ebooklib | PDF/epub → text extraction |
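To make the ChromaDB choice concrete, here's roughly what indexing and querying look like (the vectors are tiny dummies; in practice they come from the embedding pipeline, and the collection name is arbitrary):

```python
# Sketch: store vault-chunk embeddings in a local ChromaDB collection, then
# query with a book chunk's embedding to find its nearest vault neighbors.
import chromadb

client = chromadb.PersistentClient(path="./kb_diff_index")   # local, on-disk
vault_col = client.get_or_create_collection("vault_chunks")

vault_col.add(
    ids=["raft-note-0", "cap-note-0"],
    documents=["Raft leader election...", "CAP theorem trade-offs..."],
    embeddings=[[0.1, 0.9, 0.0], [0.8, 0.1, 0.2]],        # dummy vectors for illustration
    metadatas=[{"note": "Distributed Systems"}, {"note": "CAP Theorem"}],
)

hits = vault_col.query(query_embeddings=[[0.2, 0.8, 0.1]], n_results=2)
print(hits["distances"][0])   # small distance → likely review, large → likely new
```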
How the Diff Works
Three Layers of “New to Me”
Layer 1: Novel Concepts (easiest to detect)
The book discusses topics that have no nearby neighbors in my vault’s embedding space. These are completely new territory — I haven’t written about anything similar.
Detection: for each book chunk, find its nearest neighbor in the vault embeddings. If the distance exceeds a threshold → novel.
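A rough cut at that rule, assuming L2-normalized embeddings stacked into NumPy arrays (the threshold value is a guess; see the tuning question under Open Questions):

```python
import numpy as np

NOVELTY_THRESHOLD = 0.45  # assumed starting point, needs per-vault calibration

def novel_chunk_indices(book_vecs: np.ndarray, vault_vecs: np.ndarray,
                        threshold: float = NOVELTY_THRESHOLD) -> np.ndarray:
    """Indices of book chunks whose nearest vault neighbor is still far away.
    With L2-normalized rows, 1 - dot product is cosine distance."""
    sims = book_vecs @ vault_vecs.T        # (n_book, n_vault) cosine similarities
    nearest = sims.max(axis=1)             # best vault match per book chunk
    return np.where(1.0 - nearest > threshold)[0]
```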
Layer 2: Depth Gaps (moderately hard)
I have a note or two touching on a topic, but the book has an entire chapter. The embedding distance is small (I’m “near” the topic) but my coverage is thin.
Detection: compare embedding density — count how many vault chunks vs. book chunks fall in the same region. A ratio skewed heavily toward the book signals a depth gap.
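One way to sketch that density comparison (the neighborhood radius is an assumption, and the topic center could simply be the book chunk's own embedding):

```python
import numpy as np

def depth_gap_ratio(book_vecs: np.ndarray, vault_vecs: np.ndarray,
                    center_vec: np.ndarray, radius: float = 0.3) -> float:
    """Count book vs. vault chunks within `radius` cosine distance of a topic center
    (all vectors L2-normalized). A ratio well above 1 suggests the book goes deeper."""
    book_near = int(np.sum(1.0 - book_vecs @ center_vec < radius))
    vault_near = int(np.sum(1.0 - vault_vecs @ center_vec < radius))
    return book_near / max(vault_near, 1)
```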
Layer 3: Novel Relationships (hardest, most valuable)
The book draws a connection between two concepts I know independently but haven’t linked. My knowledge graph has both nodes but no edge.
Detection: when a book chunk is semantically close to two vault notes that aren’t linked to each other in the Obsidian graph, that’s a candidate. This is where the wiki-link graph really pays off.
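And a sketch of that candidate check, assuming per-note embeddings plus the NetworkX graph from the vault parser (the similarity cutoff is arbitrary):

```python
from itertools import combinations

def novel_relationship_candidates(book_vec, note_vecs: dict, graph, sim_threshold: float = 0.6):
    """Pairs of vault notes that are both close to the same book chunk
    but have no edge between them in the Obsidian link graph."""
    close = [name for name, vec in note_vecs.items() if float(book_vec @ vec) > sim_threshold]
    return [
        (a, b) for a, b in combinations(close, 2)
        if not graph.has_edge(a, b) and not graph.has_edge(b, a)
    ]
```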
The Output
For a given book, generate something like:
## Reading Plan: "Designing Data-Intensive Applications"
### High Priority (Likely New)
- Chapter 9: Consistency and Consensus — no coverage in vault
- Chapter 7: Transactions (sections on serializable isolation) — you have surface notes on transactions but nothing on isolation levels
### Medium Priority (Depth Gaps)
- Chapter 5: Replication — you have notes on leader-follower but nothing on leaderless or conflict resolution
- Chapter 3: Storage Engines — your LSM-tree notes are thin compared to the book's treatment
### Novel Connections
- The book links [[CAP Theorem]] to [[Linearizability]] in a way your vault doesn't — Chapter 9 draws this out explicitly
### Likely Review (Skim or Skip)
- Chapter 1: Foundations — high overlap with your existing [[System Design]] and [[Distributed Systems]] notes
Phases
Phase 1: MVP — Embedding-Only Diff
Get the basic pipeline working without the knowledge graph layer. Parse vault, embed everything, embed a book, compute nearest-neighbor distances, produce a ranked list of “most novel” sections. This alone is useful and validates the approach.
Deliverables: a Python script that takes a vault path and a book file and outputs a markdown report.
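The entry point could be as small as this skeleton; chunk_vault, chunk_book, and rank_by_novelty are hypothetical helpers standing in for the pieces sketched above:

```python
# Sketch of the Phase 1 CLI: vault path + book file in, markdown report out.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Diff a book against an Obsidian vault")
    parser.add_argument("vault", help="path to the Obsidian vault")
    parser.add_argument("book", help="path to the book (pdf/epub/markdown)")
    parser.add_argument("-o", "--out", default="reading-plan.md")
    args = parser.parse_args()

    vault_chunks = chunk_vault(args.vault)      # hypothetical helpers: parse, chunk, and embed,
    book_chunks = chunk_book(args.book)         # then rank book chunks by distance from the vault
    report = rank_by_novelty(book_chunks, vault_chunks)

    with open(args.out, "w", encoding="utf-8") as f:
        f.write(report)

if __name__ == "__main__":
    main()
```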
Phase 2: Add the Knowledge Graph
Incorporate Obsidian’s link structure. Build the graph, use it to detect novel-relationship candidates. Enrich the graph with LLM-extracted concepts from notes that don’t have many wiki-links.
Deliverables: Enhanced report with relationship gap detection. Output as an Obsidian note with [[links]] back into the vault.
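A rough sketch of the enrichment step using the Claude API option from the tech stack (the model name and prompt wording are illustrative, not settled):

```python
# Sketch: ask an LLM for the key concepts in a note that has few wiki-links,
# so the graph builder can propose candidate nodes and edges.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_concepts(note_text: str) -> list[str]:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",   # illustrative model name
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "List the key technical concepts in this note, one per line:\n\n" + note_text,
        }],
    )
    lines = msg.content[0].text.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]
```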
Phase 3: RLM Exploration
Experiment with using a Recursive Language Model instead of (or alongside) the embedding pipeline. Instead of pre-computing all embeddings, give an RLM the vault contents and the book as variables, and let it programmatically compare them. This might handle nuance better than pure vector similarity — an RLM can understand why two passages are related, not just that they’re similar.
Deliverables: RLM-based diff implementation, comparison against embedding-only approach.
Phase 4: Polish and Iterate
Interactive mode (“tell me more about this gap”), better chunking strategies, support for multiple books/reference corpora, maybe a simple UI. Consider whether this is worth packaging as an Obsidian plugin.
Open Questions
- Chunking strategy: Should I chunk by heading (which respects document structure) or by fixed token count (which is simpler)? Heading-based is probably better for books with clear structure. Need to experiment; see the chunking sketch after this list.
- Embedding model choice: OpenAI’s models are easy but introduce API costs and dependency. Open-source models like bge-large-en-v1.5 run locally and are nearly as good. Worth benchmarking both.
- How to handle vault coverage bias: My vault represents what I’ve written about, not everything I know. Some topics I understand well but never made notes on. This will create false “gaps.” Possible mitigation: use an LLM to interview me about detected gaps before finalizing the report.
- Threshold tuning: What cosine similarity score separates “you know this” from “this is new”? Probably needs to be calibrated per-vault. Could bootstrap by having me label a few book sections as known/unknown and fitting the threshold.
- Multi-book support: Eventually I want to load multiple reference books and diff against all of them — “across these 5 distributed systems books, here’s your aggregate gap map.” Straightforward extension but adds complexity to the report.
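For the chunking question above, the heading-based option is only a few lines; this sketch splits markdown on heading lines (the minimum-length cutoff is arbitrary, and anything before the first heading is ignored):

```python
# Sketch: heading-based chunking for markdown sources (vault notes, or a book
# already converted to markdown). Each chunk keeps its heading for context.
import re

HEADING = re.compile(r"^#{1,6}\s+.+$", re.MULTILINE)

def chunk_by_heading(markdown_text: str, min_chars: int = 200) -> list[str]:
    starts = [m.start() for m in HEADING.finditer(markdown_text)]
    bounds = starts + [len(markdown_text)]
    chunks = []
    for start, end in zip(bounds, bounds[1:]):
        section = markdown_text[start:end].strip()
        if len(section) >= min_chars:      # drop near-empty sections; cutoff is arbitrary
            chunks.append(section)
    return chunks
```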
Related Concepts
- Embeddings
- Retrieval-Augmented Generation
- Recursive Language Model
- Vector Databases
- Cosine Similarity
- Knowledge Graphs
- Out-of-Core Algorithms
References
- Recursive Language Models paper — Zhang, Kraska, Khattab (MIT, 2025)
- RLM GitHub repo — Drop-in inference library
- obsidiantools — Python library for parsing Obsidian vaults
- ChromaDB — Lightweight local vector database
- OpenAI Embeddings — API docs for text-embedding models