The Four Moves
Every RAG pipeline, no matter how sophisticated, does the same four things. Ingest. Retrieve. Generate. Log. Module 1 does each in its simplest form.
Ingest
Read files off disk. For Module 1, only .md and .txt. PDFs, DOCX, and HTML are more annoying to parse and are deferred to Module 2. The question at ingest time is: how do you cut the documents into pieces?
We cut on character count. 1800 characters per chunk, 200 characters of overlap between neighbours. Why character count and not tokens? Because token-aware chunking needs a tokenizer that matches your embedding model, and Module 1 would rather not take that dependency yet. Why 1800 and 200? Because they work. They are also wrong for roughly a third of the questions you will ask.
Fixed-size chunking is the simplest thing that works. Chunks might slice a sentence in half. A fact might land at the boundary between two chunks. That is fine for Module 1. Module 2 meets semantic and hierarchical chunking and shows you the price fixed-size charges.
Each chunk gets an id of the form source/slug#n — the filename, then a chunk index. The id matters because the model is going to be asked to cite sources by id, and the ship gate is going to grade on whether the cited id appears in the top-k retrieved chunks.
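A minimal sketch of that chunking step in Python, assuming the corpus sits in a local docs/ folder. The folder name, the function name, and the exact shape of the id are illustrative, not the track's actual code:

```python
from pathlib import Path

CHUNK_SIZE = 1800   # characters per chunk
OVERLAP = 200       # characters shared between neighbouring chunks

def chunk_file(path: Path) -> list[dict]:
    """Cut one .md or .txt file into fixed-size, overlapping chunks."""
    text = path.read_text(encoding="utf-8")
    step = CHUNK_SIZE - OVERLAP
    chunks = []
    for n, start in enumerate(range(0, len(text), step)):
        piece = text[start:start + CHUNK_SIZE]
        if piece.strip():
            # id of the form source/slug#n, e.g. "docs/chunking-notes#3"
            chunks.append({"id": f"{path.parent.name}/{path.stem}#{n}", "text": piece})
    return chunks

paths = [p for ext in ("*.md", "*.txt") for p in sorted(Path("docs").glob(ext))]
corpus = [chunk for p in paths for chunk in chunk_file(p)]
```

Chunk n + 1 starts 1600 characters after chunk n, so each chunk repeats the last 200 characters of its predecessor. That overlap is what usually keeps a fact that straddles a boundary fully visible in at least one chunk.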
Embed
For each chunk, call an embedding model and get back a vector of floats. Module 1 defaults to Voyage's voyage-3. OpenAI's text-embedding-3-small is the fallback if you only have an OpenAI key.
Which one is better? It depends on your corpus. That is the whole content of Module 2 — you will compare them against each other on your own documents and see the delta.
For now: one embedding model, one provider, one corpus. Write the vectors and the chunk text into a vector database keyed by chunk id. The track uses Chroma, which runs locally out of the box.
Embedding models turn text into points in high-dimensional space. Chunks about similar ideas land close together. A question gets embedded the same way as the chunks, and retrieval is just "which chunks are closest to the question?" If you remember this one mental model, every embedding decision later in the track makes sense.
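A sketch of the embed-and-store step, reusing the corpus list from the ingest sketch above. It assumes the voyageai and chromadb Python packages and a VOYAGE_API_KEY in the environment; the collection name and path are illustrative:

```python
import chromadb
import voyageai

vo = voyageai.Client()                           # picks up VOYAGE_API_KEY
db = chromadb.PersistentClient(path="chroma_db")
collection = db.get_or_create_collection(
    "corpus",
    metadata={"hnsw:space": "cosine"},           # cosine similarity, to match retrieval
)

texts = [c["text"] for c in corpus]
ids = [c["id"] for c in corpus]

# One vector per chunk. voyage-3 is the Module 1 default; a real run would
# batch this call for larger corpora.
result = vo.embed(texts, model="voyage-3", input_type="document")

collection.add(ids=ids, embeddings=result.embeddings, documents=texts)
```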
Retrieve
When a question comes in, embed it the same way you embedded the chunks. Ask the vector database for the top five nearest neighbours by cosine similarity. That is it. No reranking. No hybrid with BM25. No query rewriting. Five chunks come out, five chunks go into the prompt.
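In code, retrieval is one query against the Chroma collection from the embed sketch. The question here is made up for illustration:

```python
question = "What chunk size does the ingest step use?"   # illustrative question

# Embed the question exactly the way the chunks were embedded.
q_emb = vo.embed([question], model="voyage-3", input_type="query").embeddings[0]

# Top five nearest neighbours. Nothing else: no reranking, no rewriting.
hits = collection.query(query_embeddings=[q_emb], n_results=5)
top_ids = hits["ids"][0]
top_chunks = hits["documents"][0]
```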
This is the step that will bite you first. Watch for it on your own corpus: when the right chunk lands at position 6 instead of 5, the answer falls apart. When the question's embedding sits closer to a lexically similar but wrong chunk than to the right one, the answer falls apart. Modules 3 and 4 exist to make that happen less often.
Generate
Build a prompt that looks roughly like this:
System: You are a careful technical assistant answering from a provided corpus.
Use only the numbered sources. Cite sources inline by id.
If the sources do not contain the answer, say so.
User:
Question: <the question>
Sources:
[1] id=source-a
<chunk text>
[2] id=source-b
<chunk text>
...

Send this to Claude Sonnet 4.6. Read the response. Done.
No decomposition. No agentic loops. No iterative retrieval. One turn. This is called a "stuff" prompt because you stuff all the retrieved chunks in at once. It works when chunks fit in context and when the right chunk is in the retrieved set. When either of those assumptions fails, the answer is wrong or empty.
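A sketch of the generation call, reusing question, top_ids, and top_chunks from the retrieval sketch. It assumes the anthropic Python package and an ANTHROPIC_API_KEY in the environment; the model id string is an assumption, so check the current Sonnet id in the Anthropic docs:

```python
import anthropic

SYSTEM = (
    "You are a careful technical assistant answering from a provided corpus. "
    "Use only the numbered sources. Cite sources inline by id. "
    "If the sources do not contain the answer, say so."
)

# Stuff all five retrieved chunks into a single user turn.
sources = "\n".join(
    f"[{i}] id={cid}\n{text}"
    for i, (cid, text) in enumerate(zip(top_ids, top_chunks), start=1)
)
prompt = f"Question: {question}\n\nSources:\n{sources}"

client = anthropic.Anthropic()               # picks up ANTHROPIC_API_KEY
reply = client.messages.create(
    model="claude-sonnet-4-6",               # assumed id for Claude Sonnet 4.6
    max_tokens=1024,
    system=SYSTEM,
    messages=[{"role": "user", "content": prompt}],
)
answer = reply.content[0].text
```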
A stuff prompt passing five retrieved chunks to Claude Sonnet 4.6 with the citation rule above typically costs a fraction of a cent per query at Module 1 corpus sizes. You can run the full boss challenge (ten questions) for less than five cents. That is cheap enough that the baseline is not something you have to protect — run it as often as you want while you build.
Log Cost
Every embed call, every query embedding, every generation writes one line to a costs.jsonl file. Tokens in, tokens out, dollars spent. The RAG works without this; you need it anyway.
The reason: when Module 2 asks you to compare three embedding providers across five hundred documents, you want the cost answer to come from the tail of a log file, not from a guess. Same for Module 6's cost dashboard. Start logging now, read the log later.
It is tempting to skip cost logging because the baseline is cheap. Do not skip it. Module 2 and Module 3 run the same pipeline many times with different configurations, and without a log you will lose track of what each run cost. The discipline matters more than the number.
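A minimal sketch of what one log line can look like. The helper name and field names are illustrative, and the dollar figure is whatever estimate you compute from the provider's token counts and price sheet:

```python
import json
import time

def log_cost(event: str, tokens_in: int, tokens_out: int, usd: float,
             path: str = "costs.jsonl") -> None:
    """Append one JSON line per embed call, query embedding, or generation."""
    record = {
        "ts": time.time(),
        "event": event,            # e.g. "embed", "query_embed", "generate"
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "usd": usd,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# After the generation call above, the Anthropic response carries usage counts:
# log_cost("generate", reply.usage.input_tokens, reply.usage.output_tokens, usd=my_estimate)
```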
That Is The Whole Pipeline
Four moves. Ingest, retrieve, generate, log. If you understand each at the sketch level above, you understand every RAG system on the planet — the production ones just have smarter versions of each move.
Next lesson: the build task. You run it.
Exercises
Which of these is NOT one of the four moves of a RAG pipeline?
Why does Module 1 chunk by character count instead of token count?
What does a "stuff prompt" do?
The baseline retrieves top-5 chunks and stuffs them all. Describe one realistic scenario where this approach produces a wrong answer, even though the right information exists somewhere in the corpus.
Hint: Think about what could push the correct chunk to rank 6 or lower, or what could cause the LLM to ignore the correct chunk even when it is included.