Retrieval-augmented generation (RAG) is a technique for giving a language model your own documents: instead of relying only on what it learned in training, the system retrieves the most relevant pieces of your data and puts them in the prompt, so the model answers from those sources — with citations — rather than guessing. It is the standard way to build a chatbot over a company wiki, a contract set, or a product manual, and it is the single most effective cure for hallucination on private knowledge.
This guide covers how RAG works, the stack you build it with, how to get a basic pipeline right, and the techniques that go beyond it.
How RAG works
There are two phases. Ingestion (done once, then updated): split your documents into chunks, convert each chunk into a vector with an embedding model, and store those vectors in a database. Query (every question): embed the user's question, retrieve the handful of chunks whose vectors are most similar, optionally rerank them, and hand them to the LLM as context so it can answer grounded in your data. The pipeline in the hero above is exactly that flow, end to end.
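The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the "database" is a plain list, each document is treated as a single chunk, and the names `ingest`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion phase: embed every chunk and store (chunk, vector) pairs."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 3) -> list[str]:
    """Query phase: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks so the model can cite them."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Support is available by email on weekdays.",
    "Shipping takes five business days within the country.",
]
index = ingest(docs)
question = "How many days do I have for refunds?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # this string is what gets sent to the LLM
```

In a real system the list would be a vector database and `embed` would call an embedding model, but the shape of the flow stays the same.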
The stack
A RAG system is a few interchangeable parts:
- Embedding model — turns text into vectors. Popular choices include OpenAI's text-embedding-3, the open BGE family, and Voyage. You can run open embedding models locally; see our guide to running models locally.
- Vector database — stores and searches the vectors: Pinecone, Weaviate, Qdrant, Chroma, or pgvector if you already use Postgres.
- Orchestration framework — LangChain, LlamaIndex, or Haystack wire the steps together so you are not gluing APIs by hand.
- The generator — any LLM, open or closed. For cost-sensitive or private deployments, an open-source model works well here.
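One way to picture "interchangeable parts" is as three small interfaces that a pipeline depends on. The sketch below is an assumption for illustration only: the `Embedder`, `VectorStore`, and `Generator` protocols and the toy classes are hypothetical and do not mirror the API of any framework named above.

```python
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    def add(self, chunk: str, vector: list[float]) -> None: ...
    def search(self, vector: list[float], k: int) -> list[str]: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class RagPipeline:
    """Depends only on the three interfaces, so any part can be swapped."""

    def __init__(self, embedder: Embedder, store: VectorStore, generator: Generator) -> None:
        self.embedder, self.store, self.generator = embedder, store, generator

    def ingest(self, chunks: list[str]) -> None:
        for chunk in chunks:
            self.store.add(chunk, self.embedder.embed(chunk))

    def ask(self, question: str, k: int = 3) -> str:
        context = self.store.search(self.embedder.embed(question), k)
        prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
        return self.generator.generate(prompt)


# Toy implementations, purely for demonstration.
class KeywordEmbedder:
    VOCAB = ("refund", "shipping", "support")

    def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [float(lowered.count(word)) for word in self.VOCAB]


class InMemoryStore:
    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def add(self, chunk: str, vector: list[float]) -> None:
        self.rows.append((chunk, vector))

    def search(self, vector: list[float], k: int) -> list[str]:
        def score(row: tuple[str, list[float]]) -> float:
            return sum(a * b for a, b in zip(vector, row[1]))  # dot product
        return [chunk for chunk, _ in sorted(self.rows, key=score, reverse=True)[:k]]


class EchoGenerator:
    def generate(self, prompt: str) -> str:
        return prompt  # a real generator would call an LLM here


pipeline = RagPipeline(KeywordEmbedder(), InMemoryStore(), EchoGenerator())
pipeline.ingest(["Refunds are accepted within 30 days.", "Shipping takes five business days."])
answer = pipeline.ask("What is the refund policy?", k=1)
```

Replacing `InMemoryStore` with a wrapper around a hosted vector database, or `EchoGenerator` with a call to an LLM, would leave `RagPipeline` unchanged.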
Get the basics right
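The most important lesson in RAG is that most quality problems come from data preparation and retrieval, not from the model. Three settings carry most of the weight: how you chunk the text (size and overlap), which embedding model you use, and how many chunks you retrieve.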
A sensible default pipeline: split text recursively into chunks of roughly 400 tokens with about 100 tokens of overlap, embed them with one good model used consistently across ingestion and query, and retrieve the top three to five chunks by cosine similarity — adding a reranker if precision matters. Get those right and a mid-tier model will outperform a frontier model on a careless pipeline.
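The chunk size and overlap defaults above can be illustrated with a sliding window. This is a simplified sketch with assumptions worth flagging: it counts whitespace-separated words as "tokens" (a real pipeline would use the embedding model's tokenizer), it uses a fixed window rather than true recursive splitting on paragraph and sentence boundaries, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], size: int = 400, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


# Assumption: whitespace words stand in for model tokens.
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, size=400, overlap=100)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.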
Beyond basic RAG
Two directions push past the simple flow. Agentic RAG lets an AI agent decide when and what to retrieve — running multiple searches, following up, and reasoning across results rather than doing one fixed lookup; tools like n8n ship this as a built-in node. Graph RAG retrieves over a knowledge graph instead of flat chunks, which helps with questions that span many documents. And it is worth knowing when not to use RAG: with today's million-token context windows you can sometimes just put the whole document in the prompt — but for large or frequently changing corpora, retrieval is still cheaper, faster, and more accurate than stuffing everything in. (RAG is also what gives a voice agent its real knowledge.)