95 min · intermediate

What the Next Five Modules Unlock

A preview of Modules 2 through 6 — and the specific failure each one is designed to solve.

You have a working baseline. It passes the ship gate. It probably failed 50-70% of the boss challenge. Each of the remaining five modules takes aim at a specific failure category and teaches the technique that fixes it.

Key Concept

Modules 2 through 6 are rolling out in the build environment. This lesson previews each one so you know what is coming. Check the repository and the school for new lessons as each module ships.

Module 2 — Embeddings and Chunking Strategy

The failure it solves: definitional drift and cross-chunk synthesis.

When your baseline fails because the corpus says "context caching" but the question asked about "prompt caching," that is an embeddings problem. When an answer spans two chunks that fixed-size chunking put far apart, that is a chunking problem. Module 2 makes you compare:

  • Voyage vs. OpenAI embeddings on your corpus
  • Fixed-size vs. semantic vs. hierarchical chunking
  • Retrieval-metric deltas across all combinations
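To see why fixed-size chunking loses cross-chunk answers, here is a minimal sketch of the baseline chunker (the sizes and overlap are illustrative defaults, not a recommendation):

```python
def chunk_fixed(text, size=500, overlap=50):
    """Naive fixed-size chunking by character count with overlap.

    This is the Module 1 baseline: it ignores document structure,
    which is exactly why an answer spanning a section boundary ends
    up split across two distant chunks.
    """
    chunks = []
    step = size - overlap  # each chunk repeats the last `overlap` chars
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks
```

Semantic and hierarchical chunkers replace the blind `range` walk with boundaries derived from the text itself (sentence embeddings, headings); Module 2 has you measure whether that actually moves retrieval metrics on your corpus.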

Ship gate: a documented comparison across three configurations, with a clear winner.

Module 3 — Hybrid Search and Reranking

The failure it solves: needle-in-haystack.

Pure vector retrieval ranks ten lexically similar chunks roughly equally. When the right one is at rank 6, your stuffed prompt never sees it. Module 3 adds:

  • BM25 keyword retrieval alongside vector retrieval
  • Reciprocal Rank Fusion (RRF) to combine the two
  • A cross-encoder reranker (Cohere Rerank 3) to reshuffle the top 20 into the top 5 you actually send
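RRF itself is only a few lines. A minimal sketch (the doc IDs are made up; `k=60` is the constant from the original RRF paper):

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Combine ranked lists of doc IDs with Reciprocal Rank Fusion.

    Each list is best-first. A doc scores 1/(k + rank) per list it
    appears in; k dampens the advantage of top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk buried at vector rank 6 but first for BM25 rises into the top 5.
vector_hits = ["a", "b", "c", "d", "e", "needle"]
bm25_hits = ["needle", "f", "a", "g"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

The fused list then goes to the reranker, which reorders the top 20 before you cut to 5.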

Ship gate: measurable Recall@5 improvement over Module 1 on your own corpus.

Module 4 — Query Transformation

The failure it solves: multi-hop questions.

"Which chapter governs public works solicitations on HIePRO?" requires retrieving what HIePRO is, then retrieving the chapter number. Baseline RAG cannot do two hops. Module 4 teaches:

  • HyDE (Hypothetical Document Embeddings)
  • Multi-query rewriting
  • Question decomposition
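Decomposition is the easiest of the three to sketch. Everything here is illustrative: `llm` is any callable that takes a prompt string and returns text, and the prompt wording and line-based parsing are assumptions, not a fixed recipe:

```python
def decompose(question, llm):
    """Break a multi-hop question into standalone sub-questions.

    `llm` is a stand-in for your model client. Each sub-question is
    then retrieved and answered independently before synthesis.
    """
    prompt = (
        "Break this question into the minimal sequence of standalone "
        "sub-questions needed to answer it, one per line:\n" + question
    )
    lines = llm(prompt).splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

# Usage with a stubbed model, for shape only:
fake_llm = lambda p: "- What is HIePRO?\n- Which chapter governs it?"
subs = decompose("Which chapter governs solicitations on HIePRO?", fake_llm)
```

HyDE and multi-query rewriting have the same shape: one LLM call that transforms the question before it ever hits the retriever.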

Ship gate: a query transformer with A/B comparison against the raw-question baseline.

Module 5 — Evaluation Framework

The failure it solves: you have no idea if changes help.

Up through Module 4, you have been eyeballing whether your RAG got better. That does not scale. Module 5 builds:

  • Retrieval metrics (Recall@k, MRR, nDCG)
  • LLM-as-judge for answer quality, using Claude Haiku 4.5 as the cheap judge
  • A regression harness that compares runs across module configurations
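The retrieval metrics are small enough to write yourself before reaching for a library. A sketch of the two simplest, assuming doc IDs as plain strings:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant docs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant doc; 0 if none retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

Run these over the 30-question eval set for each module's configuration and the regression harness is mostly a loop plus a table.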

Ship gate: a 30-question eval set scored across every prior module's configuration. You can finally answer "is Module 3 actually better than Module 1 on my corpus?"

Module 6 — Production Concerns

The failure it solves: "it works on my laptop."

Module 6 takes your best RAG and makes it production-shaped:

  • Prompt caching (Anthropic's cache_control) to cut costs
  • Structured logging for observability
  • Cost dashboards — reading the JSONL log you have been writing since Module 1
  • Prompt versioning so you can A/B test prompt changes
  • Failure-mode handling for empty retrievals, rate limits, and timeouts
  • Deploy behind a reverse proxy
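The failure-mode bullet usually reduces to one wrapper. A sketch of retry with exponential backoff and jitter; `TimeoutError` stands in for your SDK's rate-limit and timeout exceptions (e.g. the Anthropic client's `RateLimitError`):

```python
import random
import time

def with_retries(call, max_attempts=4, base_delay=1.0,
                 retryable=(TimeoutError,)):
    """Retry a flaky API call with exponential backoff plus jitter.

    `call` is a zero-arg callable; pass your SDK's transient exception
    types in `retryable`. The last failure is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retryable:
            if attempt == max_attempts:
                raise
            # delay doubles each attempt; jitter avoids thundering herds
            time.sleep(base_delay * (2 ** (attempt - 1) + random.random()))
```

Empty retrievals deserve a different path: do not retry, return an explicit "no relevant context found" answer instead of letting the model improvise.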

Ship gate: a production-deployed RAG with cost dashboards and prompt versioning, measurably better than Module 1 by at least 25% on your boss challenge.

Where the Track Ends

By the end of Module 6, you will have:

  • Six configurations of the same pipeline, all measured against the same boss
  • Documented cost and latency for each
  • A deployed production instance
  • A completion artifact (a signed PDF plus a /verify/<hash> URL) describing what you built

The completion artifact is portfolio-grade. It says "I built a RAG system, here is how it performs on my corpus, here is what I learned at each module." That is the output.

Pro Tip

The single biggest predictor of whether someone finishes this track is whether they finished Module 1 within 48 hours of starting. The momentum from shipping a working baseline carries through. Starting and stopping between modules burns energy. If you have done Module 1, do Module 2 this week.

Go ship.

Exercises

Quiz · +5 XP

Which module solves "the right chunk is in position 6, so my top-5 retrieval misses it"?

Quiz · +5 XP

Which module is the one that finally lets you quantitatively compare configurations against the same corpus?

Reflection · +15 XP

Looking at your own boss-challenge score from Module 1, which of the next five modules seems most likely to deliver the biggest lift against your specific failure pattern, and why?

Hint: Revisit the question categories your baseline failed most often on (drift, multi-hop, needle, negation, cross-chunk). Each category maps to a specific module.