Swiftbeard

RAG Explained Three Ways

Three mental models for understanding Retrieval-Augmented Generation — and when to reach for each one.

Tags: rag · ai · architecture · llm

Retrieval-Augmented Generation is one of those concepts that's easy to misunderstand because the name is both accurate and useless. "We retrieve things, then generate things" — sure, great. But that tells you nothing about how to reason about it in your architecture.

Here are three mental models I've found actually useful. Each one emphasizes different aspects of the system.

Model 1: The Librarian

Think of RAG as a librarian who doesn't know everything, but knows where to find things.

When you ask the librarian a question, they don't answer from memory alone. They go find the relevant books, read the relevant passages, and synthesize an answer from that material. The librarian's job is to retrieve, then reason.

This model is useful when you're thinking about what to index. The question becomes: what's in the library? What books are on the shelves? If the information isn't in the library, the librarian can't find it. Garbage in, garbage out — your retrieval is only as good as your corpus.

Use this model when designing your chunking and indexing strategy.
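As a concrete starting point, here's a minimal fixed-size chunking pass with overlap — the chunk size, overlap, and `chunk_text` name are illustrative defaults, not from any particular library:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character-based chunks.

    The overlap keeps a sentence that straddles a boundary
    retrievable from at least one chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Real pipelines usually chunk on semantic boundaries (paragraphs, headings) rather than raw characters, but the librarian question is the same either way: what ends up on the shelves?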

Model 2: The Search Engine

Think of RAG as a search engine bolted onto an LLM. The user submits a query, the search engine returns the top-k most relevant documents, and the LLM reads those documents and writes the answer.

query → embed → vector search → top-k chunks → LLM prompt → response
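That pipeline can be sketched end to end. This is a toy, not a real implementation: the bag-of-words `embed` stands in for a dense embedding model, and `build_prompt` stands in for the LLM call — every name here is illustrative:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system calls a dense
    # embedding model here.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # "Vector search": rank every chunk by similarity to the query.
    q = embed(query)
    ranked = sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: list[str]) -> str:
    # The final step hands retrieved context plus the question to the LLM.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The shape is the point: each arrow in the diagram is a separate function you can test and swap independently.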

This model is useful when you're debugging retrieval quality. When an answer is wrong, you can ask: did the search return the right chunks? You can log and inspect the retrieved context independently from the generation step.

# Debug your retrieval separately from your generation
chunks = retriever.retrieve(query, k=5)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: score={chunk.score:.3f} text={chunk.text[:100]!r}")

Use this model when you're iterating on your embeddings, chunking strategy, or similarity threshold.

Model 3: The Memory Module

Think of the LLM as a stateless function with a fixed context window. RAG is a mechanism for giving it dynamic, external memory — one that can grow beyond what fits in the context.

The model itself has no persistent state between calls. RAG is the system that says "before you answer, here's what you need to remember."

This model is useful when reasoning about system design and scaling. The LLM's "memory" is now your vector database. You control what it can remember by controlling what you index. You can update the memory without retraining the model. You can have different memories for different users or contexts.

Use this model when designing multi-tenant systems or when you need to update your knowledge base frequently without retraining.
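A minimal sketch of that idea — per-tenant memories as isolated namespaces in one store. The class and method names are made up for illustration, and naive substring matching stands in for vector search:

```python
from collections import defaultdict


class TenantMemory:
    """Per-tenant document store: each tenant gets an isolated
    namespace, so updating or deleting one tenant's memory never
    touches another's."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = defaultdict(list)

    def remember(self, tenant_id: str, document: str) -> None:
        self._store[tenant_id].append(document)

    def recall(self, tenant_id: str, query: str) -> list[str]:
        # Substring match stands in for embedding similarity search.
        q = query.lower()
        return [d for d in self._store[tenant_id] if q in d.lower()]

    def forget(self, tenant_id: str) -> None:
        # The "no retraining" update: just drop the namespace.
        self._store.pop(tenant_id, None)
```

Swapping a tenant's knowledge base is a data operation, not a model operation — that's the property this mental model makes obvious.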

Which Model to Use When

They're not mutually exclusive. I use all three at different times:

Situation                      Mental Model
Deciding what to index         Librarian
Debugging wrong answers        Search Engine
Designing the overall system   Memory Module

The Common Failure Modes

Each model also surfaces different failure modes:

Librarian failures: The information isn't in the corpus, or is buried in unindexed formats (PDFs, images, tables).

Search engine failures: The query doesn't match the chunks because they're embedded differently — a question about "customer churn" doesn't match a chunk that talks about "user retention."
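One common mitigation is query expansion before retrieval: augment the query with synonyms so the embedded query lands closer to the chunk's vocabulary. A minimal sketch with a hand-built synonym map (both the map and the function name are illustrative — production systems often use an LLM to rewrite the query instead):

```python
# Hand-built synonym map; illustrative only.
SYNONYMS = {
    "churn": ["retention", "attrition"],
    "customer": ["user", "client"],
}


def expand_query(query: str) -> str:
    """Append known synonyms so the expanded query overlaps chunks
    that use different vocabulary for the same concept."""
    extra = []
    for token in query.lower().split():
        extra.extend(SYNONYMS.get(token, []))
    return query if not extra else f"{query} {' '.join(extra)}"
```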

Memory module failures: The memory is stale, inconsistent, or contains contradictory information from different versions of your docs.

RAG is a powerful pattern, but it's also a system — with all the failure modes that implies. Picking the right mental model helps you debug the right part of that system when something goes wrong.