Reranking Explained: Cross-Encoders and the Precision Step
Reranking is the step that takes the candidate chunks your retriever found and reorders them by how well each one actually answers the query, so the best two or three land at the very top of the prompt, where the model reads most carefully. It is the precision stage of a RAG pipeline, and it is often the highest return-on-effort change you can make once retrieval is in place.
The reason it matters is a mismatch most teams discover the hard way. Your retriever optimizes for recall — get the right chunk somewhere in the top 40 — because that is what cosine distance and BM25 are cheap enough to do at scale. But the model only attends closely to the first few chunks. A reranker bridges that gap: pull a wide net of candidates, then spend a little more compute to put the genuinely relevant ones first. This article explains how a reranker does that, why cross-encoders beat the embedding models that fetched the candidates, and what the precision costs you.
Why retrieval alone leaves precision on the table
Dense vector search and keyword search both score a query against a document using representations that were computed independently. In the dense case, your embedding model turned each chunk into a vector once, at index time, with no idea what query would arrive. At query time it embeds the question, also in isolation, and compares the two with cosine distance. That independence is exactly what makes it fast (you precompute millions of chunk vectors and do approximate nearest-neighbor lookups in milliseconds), but it is also why the ranking is coarse.
This design is called a bi-encoder: two separate passes through the model, one per side, that never see each other. The query and the document only ever meet as two finished vectors. Subtle distinctions — does this passage answer the question or just mention the same nouns? — get flattened into a single dot product. So the right chunk often lands at rank 7 or 18 instead of rank 1, even when retrieval "worked."
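To make the bi-encoder idea concrete, here is a minimal sketch in plain Python. The chunk IDs and the 3-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and come from a model), and `bi_encoder_search` is a hypothetical helper, not a library call.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Index time: chunk vectors are computed once, with no knowledge of any query.
# These toy vectors are made up; they are not real embeddings.
chunk_vectors = {
    "rotate-keys-howto": [0.9, 0.1, 0.3],
    "api-key-glossary":  [0.8, 0.2, 0.1],
    "billing-faq":       [0.1, 0.9, 0.2],
}

def bi_encoder_search(query_vector, k=2):
    """Rank chunks by comparing the finished query vector to the stored vectors."""
    scored = [(cosine(query_vector, v), cid) for cid, v in chunk_vectors.items()]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

# Query time: the query is embedded in isolation, then compared.
query_vector = [0.85, 0.15, 0.15]
results = bi_encoder_search(query_vector)
```

With these particular toy numbers the glossary entry edges out the how-to, which is the failure mode described above: both chunks sit close to the query, and a single similarity score cannot tell which one actually answers it.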
Recall and precision are different jobs. Retrieval maximizes recall: is the right chunk in the candidate set at all? Reranking maximizes precision: is it at the top? A pipeline can have excellent recall and still feed the model mediocre context.
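The gap between the two jobs can be shown with a few lines of arithmetic. This is a sketch over a hypothetical ranking in which the gold chunk sits at rank 7 of 40; the chunk IDs and helper names are illustrative only.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold chunk appears anywhere in the top k, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold chunk (1-indexed), or 0.0 if it is absent."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid == gold_id:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval output: 40 candidates with the gold chunk at rank 7.
ranked_ids = [f"chunk-{i}" for i in range(1, 41)]
gold_id = "chunk-7"

recall_40 = recall_at_k(ranked_ids, gold_id, k=40)  # gold is in the candidate set
recall_5 = recall_at_k(ranked_ids, gold_id, k=5)    # but not in the top 5
rr = reciprocal_rank(ranked_ids, gold_id)           # position-sensitive score
```

Recall at depth 40 is perfect here, yet the gold chunk would never reach a prompt that keeps only five chunks.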
Cross-encoders: let the query and document talk
A cross-encoder throws out the independence assumption. Instead of encoding the
query and document separately, it concatenates them into a single input — [query] [SEP] [document] — and runs that pair through a transformer together. Now every token
of the query can attend to every token of the document. The model isn't comparing two
frozen summaries; it is reading the question and the passage as one text and scoring
how well they match.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
query = "how do I rotate a leaked API key without downtime"

# vector_store is your existing first-stage index; each hit is assumed to expose .text
candidates = vector_store.search(query, k=40)  # bi-encoder recall: fast, coarse

# Score each (query, chunk) PAIR jointly: this is the expensive, accurate part
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)

# Sort on the score alone, highest first, so candidate objects are never compared
ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda x: -x[0])]
top = ranked[:5]  # high precision for the prompt
That joint pass is what makes cross-encoders accurate, and what makes them expensive. A bi-encoder embeds a document once and reuses it forever. A cross-encoder must run a full forward pass for every (query, document) pair, at query time, because the score depends on both. The technique was popularized for search by Nogueira and Cho's "Passage Re-ranking with BERT", and the Sentence-Transformers documentation lays out the same retrieve-then-rerank pattern this section uses.
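A rough count of forward passes shows why the expensive model is reserved for a shortlist. The corpus size and query volume below are hypothetical, and the count treats every pass as equal, which is a simplification.

```python
def bi_encoder_passes(num_docs, num_queries):
    """Each document is embedded once at index time; each query once at query time."""
    return num_docs + num_queries

def cross_encoder_passes(docs_scored_per_query, num_queries):
    """Every (query, document) pair needs its own forward pass at query time."""
    return docs_scored_per_query * num_queries

num_docs = 1_000_000   # hypothetical corpus size
num_queries = 10_000   # hypothetical query volume

bi_total = bi_encoder_passes(num_docs, num_queries)
full_corpus = cross_encoder_passes(num_docs, num_queries)  # cross-encoder over everything
shortlist = cross_encoder_passes(40, num_queries)          # cross-encoder over the top 40 only
```

Under these assumptions, scoring the whole corpus costs ten billion pair evaluations, while scoring a 40-candidate shortlist costs four hundred thousand.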
The mental model: a bi-encoder is a librarian who pre-sorts the whole library so any question lands you in roughly the right aisle. A cross-encoder is an expert who reads your exact question and the 40 books from that aisle, then hands you the best three. You can't afford the expert for the whole library — only for the shortlist.
Where reranking fits: retrieve wide, rerank narrow
The architecture that makes cross-encoders affordable is two-stage retrieval. Stage one is cheap and high-recall: hybrid search pulls 20–100 candidates. Stage two is expensive and high-precision: the cross-encoder scores just those candidates and keeps the top handful. You never run the slow model over the full corpus — only over the shortlist the fast model already narrowed down.
QUERY → hybrid retrieve (top 40, ~5ms) → rerank (cross-encoder, ~40 scored, ~80ms)
      → keep top 5 → generate
Two parameters define the trade-off. The retrieval depth (how many candidates you rerank) sets your recall ceiling: the reranker can only promote a chunk that retrieval actually fetched, so if the answer wasn't in the top 40, reranking can't save you. The final cut (how many you keep) sets how much context reaches the model. Common defaults are rerank 25–50, keep 3–8; tune both against your eval set rather than guessing.
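The two knobs can be sketched as parameters of one function. Everything here is a stand-in so the example runs without a model or an index: `toy_retrieve` and `toy_score` are hypothetical placeholders for the hybrid retriever and the cross-encoder, and the corpus is invented.

```python
def two_stage_search(query, retrieve, score, depth=40, keep=5):
    """Retrieve `depth` candidates cheaply, then rerank and keep the top `keep`.

    `retrieve(query, k)` and `score(query, text)` are caller-supplied stand-ins
    for the first-stage retriever and the cross-encoder.
    """
    candidates = retrieve(query, depth)
    scored = [(score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

CORPUS = [
    "API keys are credentials used to authenticate requests.",
    "Billing is calculated monthly per seat.",
    "To rotate a leaked API key without downtime, issue a new key first, then revoke the old one.",
    "Our API supports key-based and OAuth authentication.",
]

def toy_retrieve(query, k):
    # Pretend first-stage retrieval: return the first k chunks in index order.
    return CORPUS[:k]

def toy_score(query, text):
    # Pretend relevance: how many query words appear in the chunk.
    words = set(query.lower().split())
    return sum(1 for w in words if w in text.lower())

query = "rotate leaked API key"
deep = two_stage_search(query, toy_retrieve, toy_score, depth=4, keep=1)
shallow = two_stage_search(query, toy_retrieve, toy_score, depth=2, keep=1)
```

With `depth=4` the rotation how-to is fetched and promoted to the top. With `depth=2` it never enters the candidate set, so the reranker can only return the best of what it was given.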
This is also the cleanest answer to "my hybrid search returns the right chunk but it's at position 9." Hybrid search fixed your recall; reranking is the piece that fixes the ordering. The two are designed to sit next to each other.
Hosted rerankers and late-interaction models
You don't have to self-host a cross-encoder. Managed reranking APIs such as Cohere Rerank, Jina, and Voyage take your query and candidate list over HTTP and return scored results, trading per-call cost for zero infrastructure. Cohere's rerank documentation is a good reference for the request shape, which carries the same inputs as the predict(pairs) call above: one query and a list of candidate texts.
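As a rough illustration of that request shape, the sketch below builds a request body and maps a response back to the original documents without making any network call. The field names (`query`, `documents`, `top_n`, `results`, `index`, `relevance_score`) are assumptions modeled on the general shape hosted rerankers document, the model name is a placeholder, and the response is hand-written. Check your provider's reference before relying on any of it.

```python
def build_rerank_request(query, documents, top_n, model):
    """Build a JSON-style body: one query plus a list of documents."""
    return {"model": model, "query": query, "documents": list(documents), "top_n": top_n}

def apply_rerank_response(documents, response):
    """Map scored results back to the original documents, best first."""
    results = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [(documents[r["index"]], r["relevance_score"]) for r in results]

documents = ["glossary entry", "rotation how-to", "billing FAQ"]
body = build_rerank_request(
    "rotate a leaked API key", documents, top_n=2, model="your-rerank-model"
)

# Hand-written stand-in for what a service might return; not real API output.
fake_response = {
    "results": [
        {"index": 0, "relevance_score": 0.41},
        {"index": 1, "relevance_score": 0.93},
    ]
}
ranked = apply_rerank_response(documents, fake_response)
```

Returning indexes rather than the texts themselves lets you join scores back to whatever metadata your candidates carry.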
There's also a middle ground between bi-encoders and cross-encoders: late-interaction models like ColBERT, introduced by Khattab and Zaharia. Instead of one vector per document, ColBERT keeps a vector per token and computes a fine-grained match at query time. That is more precise than a single-vector bi-encoder and cheaper than a full cross-encoder, at the cost of a much larger index. For most teams a hosted cross-encoder reranker is the simpler first move; reach for late interaction when the cross-encoder's latency becomes the bottleneck.
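The per-token match in late interaction is usually described as a MaxSim sum: each query token takes its best-matching document token, and those maxima are added up. Below is a minimal sketch of that scoring rule with invented 2-dimensional token vectors. It illustrates the operator only and is not ColBERT's actual implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_token_vecs, doc_token_vecs):
    """For each query token, take its best-matching document token,
    then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_token_vecs) for q in query_token_vecs)

# Invented token vectors, one per token; real models use far more dimensions.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # has a match for both query tokens
doc_b_tokens = [[0.9, 0.1], [0.8, 0.2]]              # matches only the first query token

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Document vectors are still precomputed, which keeps query-time work light, but storing one vector per token is what makes the index so much larger.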
Measure it, or you're guessing
Reranking is easy to add and easy to fool yourself about, because better-looking answers can come from luck. Evaluate it the same way you evaluate the rest of retrieval: on a labeled set, with rank-sensitive metrics. Recall@k measured at the retrieval depth does not move when you rerank (the same chunks are present, just reordered), so the number to watch is a metric that rewards position, like nDCG or MRR.
def mrr(eval_set, retrieve, rerank, k=5):
    """Mean reciprocal rank of the gold chunk within the top k after reranking."""
    total = 0.0
    for question, gold_id in eval_set:
        candidates = retrieve(question, k=40)
        ranked = rerank(question, candidates)[:k]
        for rank, c in enumerate(ranked, start=1):
            if c.id == gold_id:
                total += 1 / rank  # reward putting gold near the top
                break
    return total / len(eval_set)
Run it twice: once feeding retrieve straight through (a rerank that returns the candidates unchanged), once with the reranker, and
compare. If MRR jumps, the reranker is earning its latency. The broader discipline of
treating these checks like a test suite is covered in
evals are unit tests for non-deterministic systems,
and the standard academic yardstick for retrieval-plus-reranking quality is the
BEIR benchmark, which measures exactly this kind of
zero-shot ranking across many datasets.
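For the nDCG side of that comparison, here is a minimal sketch using the common log2 discount. The relevance labels are hypothetical, and the ideal ordering is computed from the same labeled list, which assumes every relevant chunk is already among the candidates.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """DCG of the ranking's top k, divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels in ranked order (1 = relevant, 0 = not).
before_rerank = [0, 0, 1, 0, 1]
after_rerank = [1, 1, 0, 0, 0]

ndcg_before = ndcg_at_k(before_rerank, k=5)
ndcg_after = ndcg_at_k(after_rerank, k=5)
```

Both rankings contain the same two relevant chunks, so recall at depth 5 is identical; only the position-sensitive score separates them.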
The cost, and when to skip it
Reranking adds a synchronous model call on the critical path of every query — typically tens to a couple hundred milliseconds for 25–50 candidates, plus per-token cost if it's hosted. That's usually worth it, but not always. Skip or trim reranking when latency is your hard constraint and recall is already good, when your corpus is small enough that the top-5 bi-encoder results are reliably correct, or when queries are simple lookups rather than nuanced questions. And when you do keep it, the same belt-tightening that applies elsewhere in the stack applies here: cutting LLM cost without cutting quality covers caching and routing so reranking isn't paying full price on easy traffic.
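One way to reason about the latency trade-off is to work backwards from a budget to a rerank depth. This sketch assumes scoring time grows linearly with the number of pairs, which batching on a GPU can beat. The budget and overhead figures are hypothetical, and the per-pair cost is derived from the ~80ms for ~40 candidates used earlier in this article.

```python
def max_rerank_depth(budget_ms, fixed_overhead_ms, ms_per_pair):
    """Largest number of (query, chunk) pairs that fits in the latency budget."""
    if ms_per_pair <= 0:
        raise ValueError("ms_per_pair must be positive")
    usable = budget_ms - fixed_overhead_ms
    return max(0, int(usable // ms_per_pair))

# ~80 ms for ~40 candidates works out to roughly 2 ms per pair.
ms_per_pair = 80 / 40

depth_generous = max_rerank_depth(budget_ms=150, fixed_overhead_ms=10, ms_per_pair=ms_per_pair)
depth_tight = max_rerank_depth(budget_ms=30, fixed_overhead_ms=10, ms_per_pair=ms_per_pair)
```

If the depth that fits your budget is smaller than the depth your eval set says you need for recall, that is the signal to trim, cache, or skip reranking for that traffic.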
One thing reranking does not fix: a stale index. If your embeddings have drifted or the corpus moved on, the candidate set is wrong before the reranker ever sees it, and reordering garbage gives you well-ordered garbage. Keep an eye on embedding drift and when to re-index, and remember that retrieval quality is still capped upstream by your chunking strategy.
Takeaways
- Reranking is the precision step. Retrieval gets the right chunk into the candidate set; reranking gets it to the top, where the model actually reads.
- Cross-encoders beat bi-encoders because they score the query and document together instead of comparing two independently-computed vectors — accurate, but too expensive to run over the whole corpus.
- Retrieve wide, rerank narrow. Two-stage retrieval makes cross-encoders affordable: fetch 20–100 candidates cheaply, then rerank only those.
- Measure with rank-sensitive metrics. Recall at the retrieval depth won't move; watch nDCG or MRR, and A/B the reranker against raw retrieval before trusting it.
- It costs latency. Worth it for nuanced questions over large corpora; skippable when recall is already high and speed is the binding constraint.
Keep going in the AI Engineering cluster, or start from the RAG guide to see where reranking sits in the full pipeline.
FAQ
What's the difference between a reranker and a retriever? The retriever scans the whole corpus quickly to produce candidates (high recall, coarse order). The reranker scores only those candidates slowly and accurately to reorder them (high precision). You need both — they optimize different things.
Does reranking replace hybrid search? No, it sits after it. Hybrid search maximizes the chance the right chunk is in the candidate set; reranking maximizes the chance it's at the top. They compose.
Will reranking improve a bad retriever? Only within limits. A reranker can only reorder chunks the retriever already returned — if the answer isn't in the top-k candidates, no amount of reranking puts it there. Fix recall first, then add precision.

