RAG Isn't Dead — But Your Chunking Strategy Probably Is

5 min read · Updated Jun 29, 2026

Every few months someone declares RAG dead. Long-context models will swallow your whole corpus, the argument goes, so why bother retrieving? Then they paste 200K tokens into a prompt, watch quality collapse in the middle, and quietly go back to retrieval. Long context and retrieval solve different problems, and the lost-in-the-middle effect documented in the research has not gone away in 2026.

The uncomfortable truth: most RAG systems that "don't work" aren't bottlenecked on the model. They're bottlenecked on what you feed it. And the single biggest lever on retrieval quality is the least glamorous part of the pipeline — chunking. If you're new to the whole architecture, start with what RAG is and how the pipeline fits together; this article zooms in on the one stage that caps everything downstream.

Why chunking decides everything

A retrieval pipeline can only return chunks that exist. If your chunk boundaries cut a function signature away from its explanation, no embedding model or reranker can reassemble them. Retrieval quality is bounded above by chunk quality.

Three failure modes show up constantly:

  • Chunks too large. A 2,000-token chunk can easily cover five topics. Its embedding ends up as a blend of all five, so it matches everything weakly and nothing strongly.
  • Chunks too small. A 64-token chunk loses the context that makes it meaningful. "It returns 429 when the limit is exceeded" is useless without knowing what "it" is.
  • Chunks split mid-thought. Fixed-size character splitting cuts tables, code blocks, and sentences in half.
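To make the third failure mode concrete, here is a minimal sketch of a naive fixed-size character splitter. The function name `split_fixed` and the sample text are illustrative assumptions, not taken from any library.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample: two short sentences, 72 characters in total.
sample = (
    "The endpoint is rate limited. "
    "It returns 429 when the limit is exceeded."
)
pieces = split_fixed(sample, 40)
for piece in pieces:
    print(repr(piece))
```

With a 40-character budget the cut lands inside the second sentence: the first piece ends with "It returns" and the status code lands in the next piece, so neither chunk states the full fact on its own.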

If you remember one thing: fixed-size character chunking is a baseline, not a strategy. It's where you start, not where you ship.

Start with structure, not character counts

Documents already have structure: headings, sections, list items, code fences. Split on those boundaries first, then pack sections into chunks up to a token budget. LangChain's recursive and structure-aware splitters and LlamaIndex's node parsers both do this, but the idea is simple enough to own yourself:

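# Assumes two helpers defined elsewhere: split_on_headings(doc) -> list[str]
# and count_tokens(text) -> int (ideally backed by the embedding model's tokenizer).
def chunk_markdown(doc: str, max_tokens: int = 512) -> list[str]:
    sections = split_on_headings(doc)        # respect the document's own structure
    chunks, current, size = [], [], 0
    for section in sections:
        n = count_tokens(section)
        # Flush the running chunk before this section would push it over budget.
        if size + n > max_tokens and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(section)
        size += n
    # Emit whatever is left after the last section.
    if current:
        chunks.append("\n\n".join(current))
    # Note: a single section larger than max_tokens is emitted as one oversized chunk.
    return chunks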

Measure your chunks in tokens, not characters, using the same tokenizer family as your embedding model. OpenAI documents this in the embeddings guide; a "500 character" chunk can be anywhere from 80 to 200 tokens depending on the content.
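The chunk_markdown function above relies on two helpers, split_on_headings and count_tokens, that the article leaves undefined. One possible sketch follows, under stated assumptions: headings are ATX-style markdown lines starting with one to six `#` characters, and count_tokens is only a crude word-and-punctuation approximation that should be swapped for your embedding model's real tokenizer. The regex does not know about code fences, so a `# comment` line inside a fenced block would be misread as a heading.

```python
import re

# Assumption: ATX-style markdown headings ("# Title" through "###### Title").
HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)


def split_on_headings(doc: str) -> list[str]:
    """Split markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(doc)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(doc)]
    sections = (doc[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [s for s in sections if s]


def count_tokens(text: str) -> int:
    """Crude stand-in: counts words and punctuation marks, not real tokens."""
    return len(re.findall(r"\w+|[^\w\s]", text))
```

Keeping the tokenizer behind a single function means the approximation can be replaced later without touching the packing logic.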

Add context back with contextual retrieval

Here's the technique that moved the needle most for us. Before embedding a chunk, prepend a short, model-generated description of where it sits in the document. Anthropic published this as contextual retrieval and reported a 35% drop in top-20 retrieval failure rate from contextual embeddings alone, and 49% when combined with contextual BM25.

Original chunk:
  "It returns 429 when the rate limit is exceeded."
 
Contextualized chunk:
  "From the Billing API reference, Rate Limits section:
   It returns 429 when the rate limit is exceeded."

The contextualized chunk embeds into a far more useful region of vector space. You pay for it once at index time, then benefit on every query.
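As a rough sketch of the index-time step, the function below prepends a situating description to a chunk before it is embedded. The names `contextualize`, `describe`, and `heading_path_context` are hypothetical. In the published technique the description comes from a model call; here `describe` is an injected callable so that call can be plugged in, and `heading_path_context` is a model-free stand-in that simply names the nearest markdown heading above the chunk.

```python
from typing import Callable


def contextualize(chunk: str, document: str,
                  describe: Callable[[str, str], str]) -> str:
    """Prepend a short situating description to the chunk before embedding."""
    context = describe(document, chunk).strip()
    return f"{context}\n{chunk}" if context else chunk


def heading_path_context(document: str, chunk: str) -> str:
    """Model-free stand-in: name the nearest markdown heading above the chunk."""
    position = document.find(chunk)
    if position == -1:
        return ""  # chunk text not found verbatim; add no context
    headings = [line.lstrip("#").strip()
                for line in document[:position].splitlines()
                if line.startswith("#")]
    return f"From the section: {headings[-1]}." if headings else ""
```

Store the original chunk text alongside the contextualized one, so the prompt can show the model either version while the embedding is computed from the contextualized text.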

Don't trust a single vector

Dense embeddings are great at semantics and bad at exact terms: product codes, function names, error numbers. Hybrid search (dense plus a keyword signal like BM25) fixes the long tail; the mechanics of fusing the two are covered in vector search explained (dense vs sparse vs hybrid). Pairing pgvector with Postgres full-text search handles this well enough that you don't need a separate vector database for most workloads; see the pgvector docs and the Postgres full-text search docs.
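One common way to fuse a dense ranking with a keyword ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. The sketch below is one option rather than the only one; the chunk ids are made up, and k = 60 is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk ids; each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["c3", "c1", "c7"]    # hypothetical ids from vector search
keyword = ["c1", "c9", "c3"]  # hypothetical ids from BM25 or full-text search
fused = reciprocal_rank_fusion([dense, keyword])
print(fused)
```

Chunks that appear in both lists rise to the top, while a chunk found by only one retriever still survives into the candidate set for the reranker.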

Then rerank. Pull the top 20–50 candidates with cheap retrieval, then reorder them with a cross-encoder reranker before they reach the model. Reranking is often the highest-return change per line of code in a pipeline, and it's worth understanding why a cross-encoder beats the embedding model that fetched the candidates; reranking explained walks through the reasoning. The flow looks like this:

candidates = vector_store.search(query, k=40)      # cheap, high recall
top = reranker.rank(query, candidates)[:6]          # expensive, high precision
answer = llm.generate(prompt(query, top))

Measure retrieval on its own

The mistake almost everyone makes: judging retrieval by reading final answers. By then, generation has masked your retrieval bugs. Evaluate the retrieval step in isolation with a labeled set of (question, relevant_chunk) pairs and track:

  • Recall@k — is the right chunk in the top k at all?
  • MRR / nDCG — is it near the top?

Tools like Ragas formalize this, but a CSV of 50 hand-labeled questions and a nightly recall number will catch the large majority of regressions.
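Both headline metrics fit in a few lines. The sketch below assumes exactly one labeled relevant chunk per question, matching the (question, relevant_chunk) pairs described above; the function names and the list-of-ids input shape are illustrative choices, not a particular tool's API.

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of questions whose labeled chunk appears in the top k results."""
    if not relevant or len(retrieved) != len(relevant):
        raise ValueError("need one ranked list and one labeled chunk per question")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(retrieved, relevant))
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Average of 1 / rank of the labeled chunk; contributes 0 when it is missing."""
    if not relevant or len(retrieved) != len(relevant):
        raise ValueError("need one ranked list and one labeled chunk per question")
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(relevant)
```

Run both over the labeled set after every chunking or retrieval change, and compare against the previous night's numbers rather than against a fixed target.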

Build the eval set before you tune anything. Without it you're not engineering, you're vibing — and you can't tell a real improvement from a lucky prompt.

The order that actually works

If you're starting over, do it in this order — each step only matters once the previous one is solid:

  1. Structure-aware chunking sized in tokens.
  2. Contextual retrieval to re-inject document context.
  3. Hybrid search for the exact-term long tail.
  4. Reranking to fix precision at the top.
  5. An isolated retrieval eval so you can prove each change helps.

RAG isn't dead. It's just that "throw documents in a vector DB and pray" was never RAG — it was a demo. The engineering is in the retrieval, and the retrieval starts with how you cut the text.
