Retrieval‑Augmented Generation (RAG) is a pattern for building AI agents that can answer questions and complete tasks using your data, without retraining the model. Instead of relying purely on what a model “remembers” from pretraining, a RAG agent retrieves relevant context (docs, tickets, code, policies, database rows) at runtime and feeds it into the model to generate grounded outputs.
This post is a demo‑style walkthrough of a typical RAG pipeline, the components you need, and what changes when you wrap RAG inside an agent loop (tool use, planning, memory, and evaluation).
What problem RAG solves
Hallucinations and stale knowledge: base models can invent details or miss your latest updates.
Private/domain data: your internal docs aren’t in public training data.
Traceability: you often need citations back to sources.
Cost and agility: updating an index is usually cheaper and faster than fine‑tuning.
RAG in one sentence
RAG = Retrieve the right context + Generate an answer constrained by that context.
The RAG pipeline (end to end)
A production RAG system has two phases:
1) Indexing (offline): prepare and store knowledge so it can be found later.
2) Retrieval + generation (online): at question time, fetch the best context and answer with it.
Here’s the canonical flow:
Ingest sources (PDFs, HTML, Notion, Confluence, Git, tickets, DB exports).
Clean + normalize (remove boilerplate, fix encoding, split by logical sections).
Chunk text into retrieval units (e.g., 200–800 tokens with overlap).
Embed chunks into vectors using an embedding model.
Store chunks + metadata in a vector database (and/or search index).
At query time: embed the user question.
Retrieve top‑k similar chunks (optionally with filters).
Rerank results with a cross‑encoder / LLM scoring step (optional but common).
Compose a prompt that includes the retrieved chunks + instructions.
Generate the answer with the LLM (often with citations).
Post‑process (format, safety checks, tool calls, logging).
Indexing: getting your knowledge ready
Indexing quality determines retrieval quality. If you index messy content, you’ll retrieve messy context—and the model will faithfully produce messy answers.
1) Ingestion + cleaning
Common steps include de‑duplicating pages, stripping navigation menus, preserving headings, and attaching metadata such as source, url, title, owner, updated_at, and access_level.
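As a rough illustration of the de-duplication and metadata steps, here is a minimal sketch. The `Document` class, the `normalize` and `deduplicate` helpers, and the example URLs are hypothetical names chosen for this example, and exact-match hashing is only one way to de-duplicate (near-duplicate detection needs more than this).

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container: cleaned text plus the metadata used later for filtering."""
    text: str
    metadata: dict = field(default_factory=dict)


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: list[Document]) -> list[Document]:
    """Keep the first document seen for each distinct normalized text."""
    seen: set[str] = set()
    unique: list[Document] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc.text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    Document("Webhooks  retry on failure.\n",
             {"source": "docs", "url": "https://example.com/a", "access_level": "internal"}),
    Document("webhooks retry on failure.",
             {"source": "wiki", "url": "https://example.com/b", "access_level": "internal"}),
    Document("Rate limits apply per tenant.",
             {"source": "docs", "url": "https://example.com/c", "access_level": "public"}),
]
unique_docs = deduplicate(docs)
print([d.metadata["url"] for d in unique_docs])
```

The second document is dropped because it differs from the first only in whitespace and casing.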
2) Chunking strategy
Chunking is the art of choosing what your retriever returns. Too small and you lose meaning; too large and you waste context window and dilute similarity.
Token‑based chunks: e.g., 400 tokens with 80 token overlap. Simple and effective.
Structure‑aware chunks: split by headings, paragraphs, code blocks, table rows.
Semantic chunking: split when topic shifts (more complex, sometimes better).
Add a “chunk title”: prepend the section heading to each chunk to improve retrieval.
Rule of thumb: start with 300–600 token chunks + ~10–20% overlap, then iterate using retrieval evaluation (recall@k, answer quality, citation accuracy).
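The token-based strategy with overlap and the "chunk title" idea can be sketched as follows. This is an illustration under assumptions: whitespace-separated words stand in for tokens (a real tokenizer will count differently), and `chunk_tokens` and `chunk_text` are hypothetical helper names.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 80) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


def chunk_text(text: str, title: str = "", size: int = 400, overlap: int = 80) -> list[str]:
    """Chunk text and prepend the section title to every chunk."""
    tokens = text.split()  # assumption: whitespace words approximate model tokens
    prefix = f"{title}\n" if title else ""
    return [prefix + " ".join(chunk) for chunk in chunk_tokens(tokens, size, overlap)]


for chunk in chunk_text("a b c d e", title="Webhooks > Retries", size=3, overlap=1):
    print(repr(chunk))
```

With `size=400` and `overlap=80` the window advances 320 tokens at a time, so the last 80 tokens of one chunk reappear at the start of the next.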
3) Embeddings + metadata
Each chunk becomes a vector embedding plus metadata. Metadata enables filtered retrieval (e.g., only documents the user is allowed to see, only the most recent policy version, only a specific product line).
4) Where to store: vector DB + hybrid search
Most teams use a vector database (or a search engine with vector support) and often add hybrid search (BM25 keyword search + vector similarity). Hybrid retrieval helps when queries include exact terms (error codes, IDs, function names) that embeddings might blur.
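One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The post does not say which fusion method to use, so treat this as one option among several; the function name and the document IDs are made up for the example, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of IDs: each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["err-429", "rate-limits", "retries"]   # e.g. from BM25
vector_ranking = ["rate-limits", "backoff", "err-429"]    # e.g. from embedding similarity
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.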
Online retrieval: answering questions with your index
At runtime, the system turns the user’s question into a retrieval query, fetches supporting context, and prompts the model to answer using that context.
Query rewriting (optional, but powerful)
Users ask vague questions (“Can I share this doc externally?”). Query rewriting converts that into a search‑friendly query (“external sharing policy for customer documentation; allowed methods; approvals required”). In agent settings, rewriting may also include the user’s goal, recent turns, and known constraints.
Retrieval: similarity + filters
Compute embedding for the query.
Retrieve top‑k chunks by cosine similarity (or equivalent).
Apply metadata filters (tenant, permissions, region, product, freshness), ideally as part of the search itself so that disallowed chunks never become candidates.
Optionally combine vector + keyword scores (hybrid).
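The retrieval steps above can be sketched in plain Python as a brute-force search. A real system would delegate this to a vector database; the `Chunk` class, the `retrieve` signature, and the `tenant` field are assumptions for illustration. Filters are applied before scoring so that disallowed chunks are never candidates.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical index entry: an ID, its embedding, and filterable metadata."""
    chunk_id: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vector: list[float], chunks: list[Chunk],
             filters: dict, k: int = 5) -> list[tuple[float, Chunk]]:
    """Filter on exact metadata matches first, then return the k most similar chunks."""
    allowed = [c for c in chunks
               if all(c.metadata.get(key) == value for key, value in filters.items())]
    scored = [(cosine(query_vector, c.vector), c) for c in allowed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


index = [
    Chunk("a", [1.0, 0.0], {"tenant": "acme"}),
    Chunk("b", [0.9, 0.1], {"tenant": "other"}),
    Chunk("c", [0.0, 1.0], {"tenant": "acme"}),
]
results = retrieve([1.0, 0.1], index, filters={"tenant": "acme"}, k=2)
print([chunk.chunk_id for _, chunk in results])
```

Chunk "b" is the second-closest vector overall, but it belongs to another tenant and is excluded before any similarity is computed.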
Reranking: choose the best few
Initial retrieval often returns “pretty relevant” chunks. A reranker (cross‑encoder or LLM scoring) reorders candidates by true relevance to the question. Many pipelines retrieve k=20–50 then rerank and keep n=4–10 for the prompt.
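The retrieve-many, keep-few pattern might look like the sketch below. The scoring function is pluggable; `term_overlap` is a deliberately crude stand-in for a cross-encoder or LLM scorer so that the example runs without any model, and both function names are hypothetical.

```python
from typing import Callable


def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], keep: int = 5) -> list[str]:
    """Reorder candidates by score(question, text), highest first, and keep the top few."""
    ranked = sorted(candidates, key=lambda text: score(question, text), reverse=True)
    return ranked[:keep]


def term_overlap(question: str, text: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of question terms that appear in the text."""
    question_terms = set(question.lower().split())
    text_terms = set(text.lower().split())
    if not question_terms:
        return 0.0
    return len(question_terms & text_terms) / len(question_terms)


candidates = [
    "Billing invoices are issued monthly",
    "A 429 means the webhook rate limit was exceeded",
    "Webhook payloads are signed",
]
top = rerank("webhook 429 rate limit", candidates, score=term_overlap, keep=2)
print(top)
```

In a real pipeline `score` would wrap a model call, and `candidates` would be the 20 to 50 chunks returned by the first-stage retriever.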
Prompt construction: constrain the model
A good RAG prompt clearly separates instructions from context, and tells the model what to do when context is missing.
SYSTEM: You are a helpful assistant. Answer using ONLY the provided context.
If the context does not contain the answer, say "I don't know based on the provided documents" and ask a clarifying question.
USER QUESTION:
{question}
CONTEXT (cite sources):
[1] {chunk_1_text}
Source: {chunk_1_source}
[2] {chunk_2_text}
Source: {chunk_2_source}
...
ASSISTANT:
- Answer:
- Citations: (e.g., [1], [2])
- Follow‑up question (if needed):
If you don’t explicitly define “what to do when context is insufficient,” models will often guess. Make refusal/clarification part of the contract.
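The template above can be filled programmatically. The following is a minimal sketch, assuming each retrieved chunk is a dict with `text` and `source` keys; that shape and the `build_prompt` name are choices made for this example, not a fixed interface.

```python
SYSTEM_INSTRUCTIONS = (
    "SYSTEM: You are a helpful assistant. Answer using ONLY the provided context.\n"
    "If the context does not contain the answer, say "
    "\"I don't know based on the provided documents\" and ask a clarifying question."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble instructions, question, and numbered context into one prompt string."""
    context_lines: list[str] = []
    for number, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{number}] {chunk['text']}")
        context_lines.append(f"Source: {chunk['source']}")
    # Make an empty retrieval explicit so the refusal instruction can apply.
    context = "\n".join(context_lines) if context_lines else "(no context retrieved)"
    return "\n".join([
        SYSTEM_INSTRUCTIONS,
        "USER QUESTION:",
        question,
        "CONTEXT (cite sources):",
        context,
        "ASSISTANT:",
    ])


prompt = build_prompt(
    "What does error 429 mean?",
    [{"text": "A 429 response means the rate limit was exceeded.", "source": "docs/webhooks"}],
)
print(prompt)
```

Numbering the chunks in the prompt gives the model stable labels to cite, and the same numbers can be mapped back to source URLs after generation.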
What changes when RAG lives inside an AI agent
In a chat RAG app, retrieval typically happens once per user question. In an agent, retrieval becomes one tool among many, and the agent may call it multiple times while planning and executing a task.
Multi‑step retrieval: the agent may retrieve policies, then retrieve a specific exception process, then retrieve a form template.
Tool selection: the agent decides when to retrieve vs. when to call APIs (CRM, ticketing, database).
State + memory: the agent uses conversation state (and sometimes long‑term memory) to shape the query.
Verification loop: the agent can retrieve again to validate claims or resolve contradictions.
Structured outputs: the agent may need JSON, a checklist, or an action plan grounded in sources.
A simple agent loop with RAG
done = False
while not done:
    # Reset per step so both names exist even when a branch is skipped.
    context, tool_result = None, None
    plan = LLM("What is the next best step?")
    if plan.requires_knowledge:
        query = rewrite(question, state)
        candidates = retrieve(query, filters=permissions)
        context = rerank_and_select(candidates)
    if plan.requires_action:
        tool_result = call_tool(plan.tool, plan.args)
    answer = LLM(prompt(instructions, context, tool_result, state))
    state.update(answer, tool_result)
    done = stop_condition(state)
The key idea: retrieval is iterative. Agents can adapt based on what they find (or don’t find), ask clarifying questions, and gather additional context before committing to an answer.
A concrete demo scenario: support agent answering with citations
Imagine you’re building a support agent for an internal product. The agent must answer questions using the latest troubleshooting guide and link back to source pages.
User question
“Our webhook deliveries are failing with error 429. What should I do?”
Agent behavior (high level)
Rewrite query: “webhook 429 rate limit retries backoff headers recommended limits”.
Retrieve + rerank relevant sections from docs and incident runbooks.
If docs mention multiple rate‑limit types, ask a clarification (customer vs. platform vs. destination).
Generate a step‑by‑step fix with citations.
Optionally call an API tool to check the customer’s current delivery rate and suggest settings.
Example answer format
Diagnosis: what 429 means in this system (cite).
Immediate mitigation: reduce concurrency / enable backoff (cite).
Long‑term fix: request limit increase or implement queueing (cite).
Next question: ask for the destination provider and current rate settings if missing.
Common failure modes (and how to debug them)
Bad chunks in, bad answers out: inspect retrieved chunks for boilerplate or partial sentences; fix cleaning/chunking.
Low recall: increase k, improve query rewriting, add hybrid search, or re-embed with a better model.
Wrong docs (freshness/versioning): store version/updated_at and filter or boost recent docs.
Prompt leakage: model uses prior knowledge; tighten instructions and add refusal behavior.
Citations don’t match claims: force citation per sentence, or post‑hoc verify each claim against sources.
Permission leaks: enforce access filters before retrieval and never pass unauthorized text into the prompt.
Debug tip: log the full retrieval trace—rewritten query, filters, top‑k titles, reranked scores, final prompt context. Most RAG issues become obvious when you can see what the model saw.
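A retrieval trace can be as simple as one JSON line per request. The sketch below assumes a hypothetical `RetrievalTrace` record whose fields mirror the items listed in the debug tip; where the line is written (stdout, a log file, a tracing system) is left open.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RetrievalTrace:
    """Hypothetical record of everything that shaped the final prompt."""
    rewritten_query: str
    filters: dict
    top_k_titles: list[str]
    reranked_scores: list[float]
    prompt_context: list[str]


def trace_to_json(trace: RetrievalTrace) -> str:
    """Serialize the trace as a single JSON line suitable for log files."""
    return json.dumps(asdict(trace), ensure_ascii=False, sort_keys=True)


trace = RetrievalTrace(
    rewritten_query="webhook 429 rate limit retries backoff",
    filters={"tenant": "acme", "access_level": "internal"},
    top_k_titles=["Webhook rate limits", "Retry policy", "Billing FAQ"],
    reranked_scores=[0.91, 0.77, 0.12],
    prompt_context=["Webhook rate limits", "Retry policy"],
)
line = trace_to_json(trace)
print(line)
```

Comparing `top_k_titles` with `prompt_context` shows at a glance whether a bad answer came from first-stage retrieval or from the reranking cut.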
How to evaluate a RAG agent
Evaluate both retrieval and generation—and do it on realistic tasks.
Retrieval metrics: recall@k, MRR, nDCG, and “was the needed chunk retrieved?”.
Answer metrics: groundedness (faithfulness to sources), completeness, correctness, refusal quality.
Citation metrics: citation precision/recall; “does each cited source support the claim?”.
Agent metrics: tool accuracy, number of steps, time/cost per task, success rate on end‑to‑end scenarios.
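The retrieval metrics listed above are simple to compute once you have, for each test question, the ranked IDs the retriever returned and the set of IDs a human marked as relevant. This sketch assumes binary relevance and unique IDs within each ranking; the function names and chunk IDs are illustrative.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: the average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """nDCG with binary gains: discounted gain divided by the best achievable gain."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


retrieved = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(recall_at_k(retrieved, relevant, k=3))
print(reciprocal_rank(retrieved, relevant))
print(round(ndcg_at_k(retrieved, relevant, k=4), 4))
```

Here one of the two relevant chunks is in the top 3 (recall@3 of 0.5) and the first relevant chunk sits at rank 2 (reciprocal rank of 0.5). Answer and citation metrics usually need human or model-based judgment and are not covered by this sketch.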