How reranking turns high-recall retrieval into high-precision context: cross-encoders vs bi-encoders, where rerankers fit in a RAG pipeline, and what they cost.
AI Engineering🔥
Building real software with LLMs: RAG, agents, evals, prompt engineering, vector search, and production AI systems that actually ship.
11 articles
How vector search actually retrieves text: dense embeddings vs sparse keyword search, why hybrid wins, and how to fuse the two with reciprocal rank fusion.
What RAG is, when to use it, and how the retrieval pipeline actually works — chunking, embeddings, hybrid search, reranking, and evaluation, end to end.
Most failing RAG systems don't have a model problem; they have a retrieval problem. Here's how chunking, embeddings, and reranking actually decide whether your answers are any good.
An LLM will confidently return malformed JSON, leaked prompts, or unsafe content. Treat its output as untrusted input and validate it like you would a form submission.
Most LLM bills are bloated by sending every request to your biggest model. Routing and caching cut cost dramatically while holding quality steady.
Your RAG retrieval quality decays silently as data, models, and queries shift. A practical guide to detecting embedding drift and re-indexing safely.
Begging a model for JSON and hoping it parses is a bug waiting to happen. Schema-constrained structured outputs make it a guarantee. Here's how.
Agents call tools in a loop. Without the right guardrails, that loop burns money, hangs, or repeats itself forever. Here's how to keep it bounded.
You wouldn't ship code without tests. Stop shipping prompts without evals. A practical guide to building evaluation suites for LLM features.
Prompt caching can cut latency and cost on repeated context by an order of magnitude. Here's how it works and why most teams leave it on the table.

