The Architecture of Ingestion and the Staging Layer

0:57 Miles: So, to really get our hands dirty with this, we have to start at the very beginning—the `raw/` directory. This is the bedrock of the entire system. Think of it as an immutable fortress for your source materials.
1:11 Lena: I love that word—immutable. It sounds so final. So, in this setup, once a file hits that `raw/` folder, it’s basically sacred? We don’t touch it, we don’t edit it, and the AI definitely doesn't write back to it?
1:24 Miles: Spot on. If you start editing your raw sources, the wiki becomes a lie. You want a clean separation of concerns. The `raw/` folder is where your PDFs, web clips, and meeting transcripts live in their original, messy glory. Karpathy’s workflow is very specific about this—the LLM reads from `raw/` but only ever writes to the `wiki/` directory.
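The separation Miles describes might look like this on disk (the exact layout beyond `raw/`, `assets/`, and `wiki/` is illustrative, not specified in the conversation):

```
vault/
├── raw/        # immutable sources: PDFs, web clips, transcripts (LLM reads only)
├── assets/     # images downloaded from clipped articles
└── wiki/       # compiled knowledge pages (the only place the LLM writes)
```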
1:47 Lena: Okay, so if I’m a researcher and I find a new paper on transformer scaling laws, I just drop it into `raw/`. But how does it get there effectively? I mean, copy-pasting text from a website is a nightmare for formatting.
2:00 Miles: That’s where the tooling comes in. One of the big recommendations from the community is the Obsidian Web Clipper. It’s a browser extension that takes a web page and turns it into clean Markdown. It even grabs the metadata—the URL, the author, the date—and puts it in a YAML frontmatter block at the top.
2:18 Lena: YAML frontmatter—that’s just the block of data at the top of a text file, right? Like a digital ID card for the document?
0:12 Miles: Exactly. And Karpathy takes it a step further. He suggests downloading images locally. There’s actually a hotkey setup in Obsidian—like `Ctrl+Shift+D`—that pulls every image from a clipped article and saves it to a local `assets/` folder.
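A clipped note's frontmatter might look like this. The field names are typical of web-clipper templates but configurable, and the values here are invented for illustration:

```yaml
---
title: "Scaling Laws for Neural Language Models"
source: "https://example.com/scaling-laws"
author: "Jane Doe"
clipped: 2024-05-01
tags: [clipping, research]
---
```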
2:39 Lena: Why go through the trouble of downloading them? Can’t the LLM just follow the URL?
2:43 Miles: Links break, Lena! If a site goes down, your wiki loses its visual context. By keeping the images in your `assets/` folder, the LLM can reference them directly. Now, a little technical hurdle here—most LLMs can’t "see" an image while they’re reading Markdown in one single pass. The workaround is to have the agent read the text first, then explicitly "look" at the specific images to get that extra layer of understanding.
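The two-pass workaround Miles describes could be sketched like this: a first pass over the Markdown collects the image references, and a second pass (not shown) would hand each local path to a vision-capable model call. The regex and note text are illustrative:

```python
import re

# Match both Obsidian-style ![[assets/foo.png]] embeds and
# standard Markdown ![alt](assets/foo.png) image links.
IMAGE_PATTERN = re.compile(r"!\[\[([^\]]+)\]\]|!\[[^\]]*\]\(([^)]+)\)")

def extract_image_refs(markdown: str) -> list[str]:
    """First pass: read the text, collect the images it references."""
    return [m.group(1) or m.group(2) for m in IMAGE_PATTERN.finditer(markdown)]

note = """Here is the loss curve:
![[assets/loss-curve.png]]
And the schematic: ![arch](assets/arch.png)
"""

# Second pass would feed each path to a vision-capable model call.
print(extract_image_refs(note))
# -> ['assets/loss-curve.png', 'assets/arch.png']
```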
3:10 Lena: That makes sense. It’s like a human researcher looking at a chart after reading the paragraph describing it. So we’ve got this staging area, but what about the messy reality of versioning? If I have three different versions of an HR policy or a research draft, doesn't the system get confused?
3:26 Miles: This is a massive pitfall. In enterprise settings, about 72% of these types of projects fail in the first year because they just "ingest everything" without a gatekeeper. You end up with the LLM retrieving a deprecated policy from 2022 because it was semantically similar to your question.
3:32 Lena: Oh, so the "librarian" needs to be a bit of a snob about what gets into the library.
3:32 Miles: Totally. You need a "Step Zero." Before anything gets embedded or compiled, you have to audit the sources. You designate a "system of record." If a document exists in three places, you pick the one authoritative version. You also want to tag files with sensitivity tiers—Public, Internal, Confidential. You don’t want your self-maintaining wiki accidentally surfacing the CEO’s salary to the summer intern just because they asked a question about "compensation trends."
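The gatekeeping step can be sketched as a filter over each note's parsed frontmatter. The tier names come from the conversation; the `sensitivity` key and the default-to-most-restrictive behavior are assumptions:

```python
# Sensitivity tiers, lowest to highest, as named in the conversation.
TIERS = ["Public", "Internal", "Confidential"]

def visible_to(docs: list[dict], clearance: str) -> list[dict]:
    """Keep only documents at or below the reader's clearance tier.

    Each doc is a parsed frontmatter dict; a missing `sensitivity`
    field defaults to the most restrictive tier.
    """
    max_rank = TIERS.index(clearance)
    return [
        d for d in docs
        if TIERS.index(d.get("sensitivity", "Confidential")) <= max_rank
    ]

docs = [
    {"title": "Holiday calendar", "sensitivity": "Public"},
    {"title": "Salary bands", "sensitivity": "Confidential"},
]
print([d["title"] for d in visible_to(docs, "Internal")])
# -> ['Holiday calendar']
```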
4:01 Lena: Right, so the `raw/` directory isn't just a trash can; it’s a curated gallery. And Karpathy mentions using a custom CLI or a tool like `qmd` for this discovery layer, right?
4:11 Miles: Yeah, Tobi Lütke from Shopify built `qmd`, which is a hybrid search tool. It uses BM25—which is basically traditional keyword matching—and vector search together. When your `raw/` folder starts getting huge, you need that hybrid power to help the agent find exactly which files it needs to "compile" next.
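One common way to merge a keyword ranking with a vector ranking (whether `qmd` fuses them this way is not something the conversation specifies) is reciprocal rank fusion. The file names below are invented:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)).

    k=60 is the conventional constant; it damps the influence of any
    single list's top ranks.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["policy-2024.md", "memo.md", "policy-2022.md"]
vector_hits = ["policy-2024.md", "policy-2022.md", "faq.md"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# -> ['policy-2024.md', 'policy-2022.md', 'memo.md', 'faq.md']
```

A document that ranks well on both keyword match and semantic similarity floats to the top, which is exactly the behavior you want when the agent is deciding which raw files to compile next.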
4:31 Lena: It sounds like we’re building a very disciplined pipeline. It’s not just "chat with your docs"—it’s more like "building a codebase of knowledge."
4:39 Miles: That’s the perfect analogy. Karpathy literally says: "Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase." And just like a codebase, you need a way to track what’s already been processed. Some implementations use a manifest file or a simple git hash system. If the hash of a file in `raw/` changes, the system knows it needs to re-compile that specific slice of the wiki.
5:04 Lena: So it’s incremental. It doesn't rebuild the whole world every time I add a single PDF.
5:09 Miles: Right, because that would be incredibly expensive and slow. We’re talking about "token throughput" here. You want the agent to be surgical—read the new stuff, see which 10 or 15 wiki pages it affects, update those, and then go back to sleep.
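The hash-manifest idea Miles mentions can be sketched with `hashlib` and a JSON manifest; the manifest location, format, and function names here are assumptions, not a description of any particular implementation:

```python
import hashlib
import json
from pathlib import Path

def file_hash(path: Path) -> str:
    """Content hash of one raw source file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_sources(raw_dir: Path, manifest: Path) -> list[Path]:
    """Return raw files whose hash differs from the recorded manifest,
    then rewrite the manifest with the current hashes."""
    old = json.loads(manifest.read_text()) if manifest.exists() else {}
    new: dict[str, str] = {}
    changed = []
    for path in sorted(raw_dir.rglob("*")):
        if path.is_file():
            digest = file_hash(path)
            new[str(path)] = digest
            if old.get(str(path)) != digest:
                changed.append(path)
    manifest.write_text(json.dumps(new, indent=2))
    return changed
```

Only the files returned by `changed_sources` need re-compiling, so adding a single PDF triggers a single, surgical update rather than a full rebuild.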