What Is RAG? A Practical Guide to Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is the pattern of fetching relevant documents at query time and handing them to a language model as context, so its answer is grounded in your data instead of only its training data. That single idea — retrieve first, then generate — is what lets a general-purpose model answer questions about your internal docs, your codebase, or last night's support tickets without retraining anything.

If you've only seen RAG as a demo ("load PDFs into a vector database, ask a question"), it's easy to underestimate. The demo is five lines of code. The production system is a retrieval pipeline with real engineering at every stage, and this guide walks the whole thing end to end. Think of it as the map; each section links out to a deeper article on that stage.

Why RAG exists

A language model only knows what was in its training data, frozen at a cutoff date. It cannot cite your Q3 roadmap, your API's error codes, or a customer's contract, because it never saw them. You have two ways to fix that: put the knowledge into the model (fine-tuning) or put it into the prompt (retrieval). For facts that change — docs, tickets, inventory, policies — retrieval wins, because you update an index in seconds instead of retraining for hours.

The technique comes from the 2020 paper that named it, Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". Six years on, the architecture has gotten more sophisticated, but the thesis holds: grounding a model in retrieved evidence reduces hallucination and lets you cite sources.

"Why not just use a long-context model and paste everything in?" Long context and retrieval solve different problems. Quality still degrades in the middle of very long prompts — the well-documented lost-in-the-middle effect — and you pay for every token on every call. Retrieval keeps the prompt small, current, and citable.

The pipeline, stage by stage

A RAG system has two phases. Indexing happens offline: you split documents into chunks, embed them, and store the vectors. Querying happens per request: embed the question, retrieve candidate chunks, optionally rerank them, and generate an answer from the best ones.

INDEX (offline):   documents → chunk → embed → store vectors
QUERY (per call):  question → embed → retrieve → rerank → generate
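
To make the two phases concrete, here is a minimal sketch of both in one file. Everything in it is a stand-in chosen for illustration: `embed` is a toy that returns a set of word tokens instead of a real embedding, `similarity` is Jaccard overlap instead of vector distance, and chunks are just paragraphs. It only shows the shape of index-then-query, not a production design.

```python
import re

def embed(text: str) -> frozenset[str]:
    """Toy stand-in for an embedding model: the set of lowercase word tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard overlap; a real system would compare dense vectors instead."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_index(documents: list[str]) -> list[tuple[str, frozenset[str]]]:
    """INDEX phase: chunk (one chunk per paragraph here), embed, store."""
    chunks = [p.strip() for doc in documents for p in doc.split("\n\n") if p.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question: str, k: int = 2) -> list[str]:
    """QUERY phase: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

docs = [
    "Refunds are issued within 14 days.\n\nShipping takes 3 to 5 business days.",
    "Error E42 means the API key has expired.",
]
index = build_index(docs)                      # run once, offline
top_chunks = retrieve(index, "What does error E42 mean?", k=1)  # run per request
print(top_chunks)
```

The retrieved chunks would then go into the prompt for the generation step. Swapping the toy `embed` for a real model changes the quality, not the structure.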

Every stage is a place where quality is won or lost. Here is what each one does and where to go deeper.

1. Chunking

You can't embed a whole document and expect precise retrieval — the embedding becomes an average of everything in it. So you split documents into chunks sized to hold roughly one idea. Get the boundaries wrong and no downstream step can recover; retrieval quality is bounded above by chunk quality. The biggest wins come from splitting on the document's own structure (headings, sections, code fences) rather than fixed character counts. We cover the full playbook — structure-aware splitting, contextual retrieval, token budgets — in RAG isn't dead, but your chunking strategy probably is.
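
As a rough illustration of structure-aware splitting, the sketch below starts a new chunk at each Markdown heading and refuses to split inside a code fence. It assumes Markdown input; `split_on_headings` is a hypothetical helper, not a library function, and it deliberately ignores token budgets and oversized sections.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")
FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block

def split_on_headings(markdown: str) -> list[str]:
    """Start a new chunk at each Markdown heading, but never inside a code fence."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A "#" line inside a fence is a code comment, not a heading.
        if HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

doc = "\n".join([
    "# Install",
    "Run the installer.",
    "## Configure",
    "Set the API key:",
    FENCE + "bash",
    "# this comment is not a heading",
    "export API_KEY=...",
    FENCE,
    "## Troubleshooting",
    "See the error table.",
])
chunks = split_on_headings(doc)
print(len(chunks))  # 3
```

Each chunk keeps its heading, so the embedding sees the section title along with the body. A fixed character count would have cut the shell snippet in half.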

2. Embeddings and the vector store

Each chunk is converted to a vector, a list of numbers positioned so that semantically similar texts land near each other in vector space. At query time you embed the question and find the nearest chunk vectors. The OpenAI embeddings guide is a solid primer on the mechanics. You don't need a dedicated vector database to start: pgvector brings approximate nearest-neighbor search straight into Postgres.
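
"Find the nearest chunk vectors" boils down to something like the brute-force search below. The three-dimensional vectors and chunk IDs are made up for illustration; real embeddings have hundreds or thousands of dimensions, and an approximate index trades a little accuracy to avoid scoring every vector.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(query_vector: list[float], chunk_vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact nearest-neighbor search: score every chunk, keep the top k IDs."""
    scored = sorted(
        chunk_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Hypothetical toy vectors; a real embedding model would produce these.
chunk_vectors = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-errors": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

print(nearest(query_vector, chunk_vectors, k=2))  # ['refund-policy', 'shipping-times']
```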

Embeddings aren't static, either. When you change embedding models — or your corpus drifts away from the distribution you indexed on — retrieval silently degrades. Knowing when to re-index is its own discipline; see embedding drift: when and how to re-index your vector store.
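
One cheap guard against the model-change case is to record which embedding model built the index and check it at query time. The sketch below assumes a hypothetical `VectorIndex` container and made-up model names. It only catches a model mismatch; corpus drift needs the retrieval evals described later in this guide.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    embedding_model: str  # the model that produced every stored vector
    vectors: dict[str, list[float]] = field(default_factory=dict)

def needs_reindex(index: VectorIndex, query_model: str) -> bool:
    """Vectors from different embedding models live in different spaces,
    so comparing them gives meaningless distances."""
    return index.embedding_model != query_model

index = VectorIndex(embedding_model="embedder-v1")  # hypothetical model name
index.vectors["refund-policy"] = [0.9, 0.1, 0.0]

print(needs_reindex(index, "embedder-v1"))  # False
print(needs_reindex(index, "embedder-v2"))  # True
```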

3. Retrieval: dense, sparse, and hybrid

Dense vector search is great at meaning and bad at exact tokens — product codes, function names, error numbers. Keyword (sparse) search is the opposite. Production systems combine both, which is hybrid search. Because this is the stage most teams under-build, it gets its own deep dive: vector search explained — dense vs sparse vs hybrid.
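
Combining the two result lists needs a fusion rule. One common choice, picked here for illustration rather than taken from this guide, is reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) per chunk, and the sums decide the final order. The chunk IDs are invented, and k = 60 is a commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists by summing 1 / (k + rank) across lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["chunk-a", "chunk-b", "chunk-c"]   # hypothetical vector-search results
sparse = ["chunk-c", "chunk-a", "chunk-d"]  # hypothetical keyword-search results

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['chunk-a', 'chunk-c', 'chunk-b', 'chunk-d']
```

RRF only looks at ranks, so you never have to put cosine similarities and keyword scores on the same scale. Chunks that both retrievers agree on rise to the top.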

4. Reranking

Cheap retrieval optimizes for recall: get the right chunk somewhere in the top 40. A reranker — usually a cross-encoder — then reorders those candidates for precision, so the best few land at the top of the prompt. It is consistently the highest ROI-per-line change in a RAG pipeline, and it gets its own deep dive in reranking explained — cross-encoders and the precision step.

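def answer(query, vector_store, reranker, llm, build_prompt):
    # vector_store, reranker, llm, and build_prompt stand in for your own components
    candidates = vector_store.search(query, k=40)   # cheap, high recall
    top = reranker.rank(query, candidates)[:6]       # expensive, high precision
    return llm.generate(build_prompt(query, top))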

5. Generation and grounding

Finally the model writes the answer from the retrieved chunks. The prompt should instruct it to answer only from the provided context and to say "I don't know" when the context doesn't cover the question — otherwise it falls back on parametric memory and you're back to hallucinating. Validating that the output stays grounded and well-formed is the job of guardrails that validate LLM output before it reaches users, and forcing machine-readable answers is covered in structured outputs beat prompt-and-pray JSON parsing.
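
A grounding prompt along those lines might look like the sketch below. The exact wording is an assumption, not a tested template, and `build_grounded_prompt` is a hypothetical helper. Numbering the chunks is one simple way to let the model cite its sources.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that restricts the model to the retrieved context."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source numbers you used. "
        'If the context does not contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "What does E42 mean?",
    ["Error E42 means the API key has expired.", "Refunds are issued within 14 days."],
)
print(prompt)
```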

How to know it's working

The single most common mistake is judging retrieval by reading final answers. By then, generation has masked your retrieval bugs. Evaluate the retrieval step in isolation with a labeled set of (question, relevant chunk ID) pairs:

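def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top k results."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_chunk_id in eval_set:
        retrieved = retrieve(question, k=k)
        if gold_chunk_id in {c.id for c in retrieved}:
            hits += 1
    return hits / len(eval_set)   # the number you tune against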

Track recall@k (is the right chunk in the top k at all?) and a rank-sensitive metric like MRR or nDCG (is it near the top?). Tools such as Ragas formalize this, but a CSV of 50 hand-labeled questions and a nightly recall number catches most regressions. Treating these checks like tests is exactly the mindset in evals are unit tests for non-deterministic systems.
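
For the rank-sensitive side, a minimal MRR sketch follows. It assumes retrieved chunks expose an `.id` attribute and that each question has exactly one gold chunk. `fake_retrieve` and its canned rankings are placeholders for your real retriever.

```python
from collections import namedtuple

Chunk = namedtuple("Chunk", ["id", "text"])

def mean_reciprocal_rank(eval_set, retrieve, k=5):
    """Average of 1/rank of the gold chunk; a miss within the top k scores 0."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    total = 0.0
    for question, gold_chunk_id in eval_set:
        retrieved_ids = [c.id for c in retrieve(question, k=k)]
        if gold_chunk_id in retrieved_ids:
            total += 1.0 / (retrieved_ids.index(gold_chunk_id) + 1)
    return total / len(eval_set)

# Hypothetical canned rankings standing in for a real retriever.
FAKE_RESULTS = {
    "How do refunds work?": ["refunds", "shipping"],
    "What is error E42?": ["shipping", "errors"],
    "Who do I contact?": ["refunds", "shipping"],
}

def fake_retrieve(question, k=5):
    return [Chunk(id=chunk_id, text="") for chunk_id in FAKE_RESULTS[question][:k]]

eval_set = [
    ("How do refunds work?", "refunds"),  # gold at rank 1 -> 1.0
    ("What is error E42?", "errors"),     # gold at rank 2 -> 0.5
    ("Who do I contact?", "contacts"),    # gold missing   -> 0.0
]
mrr = mean_reciprocal_rank(eval_set, fake_retrieve)
print(mrr)  # 0.5
```

Recall@k would score the first two questions identically. MRR tells you the second one still has room to move up.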

Build the eval set before you tune anything. Without it you can't tell a real improvement from a lucky prompt — you're not engineering, you're vibing.

Making it cheap and fast

A naïve RAG call re-sends the same system prompt and the same retrieved context on every request, and pays full price each time. Two levers cut that dramatically: prompt caching, the optimization most LLM teams skip, and routing easy queries to cheaper models, covered in cutting LLM cost without cutting quality. Once your system starts acting on retrieved data — calling tools, taking steps — the control-flow problems in agentic tool-calling loops that don't spiral out of control become the next thing to get right.
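
How caching is exposed varies by provider, so treat this as a sketch under one assumption: that the cache matches on an identical prompt prefix. Under that assumption, the practical rule is to put stable content first and per-request content last. `SYSTEM_PROMPT` and `build_prompt` are hypothetical names.

```python
import os

SYSTEM_PROMPT = (
    "You answer questions about the product docs. "
    "Answer only from the provided context."
)

def build_prompt(question: str, chunks: list[str]) -> str:
    """Stable content first, per-request content last, so consecutive
    requests share the longest possible identical prefix."""
    context = "\n".join(chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"

chunks = ["Shipping takes 3 to 5 business days.", "Refunds are issued within 14 days."]
p1 = build_prompt("How long is shipping?", chunks)
p2 = build_prompt("When do refunds arrive?", chunks)

# Everything up to the question is byte-identical across the two requests.
shared = os.path.commonprefix([p1, p2])
print(len(shared), "of", len(p1), "characters shared")
```

Putting the question first would leave no shared prefix at all, which is the layout to avoid if you want cache hits.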

When NOT to use RAG

RAG is not free. If your knowledge fits in a few thousand stable tokens, just put it in the prompt. If you need the model to learn a style or format rather than recall facts, fine-tuning fits better. And if your retrieval corpus is tiny and rarely queried, a plain keyword search over a database may beat an embedding pipeline you have to maintain. Reach for RAG when the knowledge is large, changes often, and needs to be cited.

Takeaways

RAG = retrieve, then generate. It grounds a model in your current data without retraining, which is why it beats fine-tuning for facts that change.
It's a pipeline, not a feature. Chunking, embeddings, retrieval, reranking, and generation each decide quality independently — and the weakest stage caps the rest.
Hybrid search + reranking are where most teams find the biggest, cheapest wins.
Evaluate retrieval in isolation. Recall@k on a small labeled set tells you more than reading answers ever will.
Use it deliberately. Large, changing, citable knowledge says RAG; small or static knowledge often doesn't.

Start with the AI Engineering cluster to go deeper on any stage above — each article assumes the map you just read.

FAQ

Is RAG the same as fine-tuning? No. Fine-tuning changes the model's weights to teach behavior or style; RAG leaves the model alone and changes its input by retrieving facts at query time. They're complementary — you can fine-tune a model and still feed it retrieved context.

Do I need a vector database? Not to start. pgvector adds vector search to Postgres, which is plenty for most workloads. Reach for a dedicated vector DB when scale or latency demands it.

Does long context make RAG obsolete? No. Pasting everything into a long prompt is slower and more expensive per call, and answer quality degrades through the lost-in-the-middle effect. Retrieval keeps the prompt small, fresh, and citable.