Retrieval-Augmented Generation is one of those concepts that's easy to misunderstand because the name is both accurate and useless. "We retrieve things, then generate things" — sure, great. But that tells you nothing about how to reason about it in your architecture.
Here are three mental models I've found actually useful. Each one emphasizes different aspects of the system.
Model 1: The Librarian
Think of RAG as a librarian who doesn't know everything, but knows where to find things.
When you ask the librarian a question, they don't answer from memory alone. They go find the relevant books, read the relevant passages, and synthesize an answer from that material. The librarian's job is to retrieve, then reason.
This model is useful when you're thinking about what to index. The question becomes: what's in the library? What books are on the shelves? If the information isn't in the library, the librarian can't find it. Garbage in, garbage out — your retrieval is only as good as your corpus.
Use this model when designing your chunking and indexing strategy.
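To make the "what's on the shelves" question concrete, here's a minimal chunking sketch: fixed-size character windows with overlap. The window and overlap sizes are illustrative only — real systems often split on semantic boundaries like paragraphs or headings instead.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    assert size > overlap, "window must be larger than the overlap"
    chunks = []
    step = size - overlap  # each window starts `step` characters after the last
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk — otherwise the librarian has books with pages torn down the middle.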
Model 2: The Search Engine
Think of RAG as a search engine bolted onto an LLM. The user submits a query, the search engine returns the top-k most relevant documents, and the LLM reads those documents and writes the answer.
query → embed → vector search → top-k chunks → LLM prompt → response
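The pipeline above can be sketched end-to-end with stand-in components. `embed`, `vector_search`, and `build_prompt` here are hypothetical stubs for illustration, not a real library's API — in particular, the "embedder" is a toy that just counts words.

```python
def embed(text: str) -> list[float]:
    # Stand-in embedder: a real system calls an embedding model here.
    words = text.lower().split()
    return [float(len(words)), float(sum(len(w) for w in words))]

def vector_search(query_vec: list[float], index: list[dict], k: int) -> list[dict]:
    # Brute-force nearest neighbors by squared Euclidean distance.
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(query_vec, item["vec"]))
    return sorted(index, key=dist)[:k]

def build_prompt(query: str, index: list[dict], k: int = 2) -> str:
    chunks = vector_search(embed(query), index, k)
    context = "\n".join(c["text"] for c in chunks)
    # A real system would send this prompt to the LLM for generation.
    return f"Context:\n{context}\n\nQuestion: {query}"
```

Each arrow in the diagram is one function boundary, which is exactly what makes the next point possible: you can log and test each stage on its own.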
This model is useful when you're debugging retrieval quality. When an answer is wrong, you can ask: did the search return the right chunks? You can log and inspect the retrieved context independently from the generation step.
# Debug your retrieval separately from your generation
chunks = retriever.retrieve(query, k=5)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.score:.3f} — {chunk.text[:100]}")
Use this model when you're iterating on your embeddings, chunking strategy, or similarity threshold.
Model 3: The Memory Module
Think of the LLM as a stateless function with a fixed context window. RAG is a mechanism for giving it dynamic, external memory — one that can grow beyond what fits in the context.
The model itself has no persistent state between calls. RAG is the system that says "before you answer, here's what you need to remember."
This model is useful when reasoning about system design and scaling. The LLM's "memory" is now your vector database. You control what it can remember by controlling what you index. You can update the memory without retraining the model. You can have different memories for different users or contexts.
Use this model when designing multi-tenant systems or when you need to update your knowledge base frequently without retraining.
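A sketch of per-tenant memory via metadata filtering: restrict the searchable "memory" to one tenant's documents before ranking. The store layout and field names (`tenant_id`, `text`) are hypothetical, and term overlap stands in for vector similarity — real vector databases expose equivalent metadata filters on their query APIs.

```python
def retrieve_for_tenant(store: list[dict], tenant_id: str,
                        query_terms: set[str], k: int = 3) -> list[dict]:
    # Filter first: this tenant's documents are its entire "memory".
    candidates = [d for d in store if d["tenant_id"] == tenant_id]
    # Rank by naive term overlap (a stand-in for vector similarity).
    def score(d):
        return len(query_terms & set(d["text"].lower().split()))
    return sorted(candidates, key=score, reverse=True)[:k]
```

The design point is that isolation happens in retrieval, not in the model: the same LLM serves every tenant, and what it "remembers" is decided entirely by the filter.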
Which Model to Use When
They're not mutually exclusive. I use all three at different times:
| Situation | Mental Model |
|---|---|
| Deciding what to index | Librarian |
| Debugging wrong answers | Search Engine |
| Designing the overall system | Memory Module |
The Common Failure Modes
Each model also surfaces different failure modes:
Librarian failures: The information isn't in the corpus, or is buried in unindexed formats (PDFs, images, tables).
Search engine failures: The query and the chunks use different vocabulary, so their embeddings land far apart — a question about "customer churn" doesn't retrieve a chunk that talks about "user retention," even though they describe the same phenomenon.
Memory module failures: The memory is stale, inconsistent, or contains contradictory information from different versions of your docs.
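The vocabulary mismatch above is easy to see in miniature. This toy uses exact word overlap rather than embeddings, but it shows the shape of the failure: two texts about the same phenomenon that share zero surface vocabulary, which a weak retriever will treat as unrelated.

```python
# A question and a chunk about the same thing, in different vocabularies.
query_terms = set("why is customer churn increasing".split())
chunk_terms = set("our user retention dropped last quarter".split())

# Zero lexical overlap — a purely lexical retriever scores this chunk 0,
# and a poorly tuned embedding model may do little better.
shared = query_terms & chunk_terms
```

Good embedding models close much of this gap, but not all of it — which is why query rewriting and hybrid (lexical + vector) search exist.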
RAG is a powerful pattern, but it's also a system — with all the failure modes that implies. Picking the right mental model helps you debug the right part of that system when something goes wrong.