RAG Explained – How Retrieval-Augmented Generation Works

Last updated: March 2026

Language models know what they were trained on. They do not know what happened after training, they do not know your company's internal documents, and they hallucinate when asked to recall specific facts. Retrieval-Augmented Generation (RAG) solves this by giving the model access to external knowledge at query time. It is one of the most important patterns in production AI development.

What is Retrieval-Augmented Generation

RAG is a technique that enhances LLM responses by retrieving relevant documents from an external knowledge base and injecting them into the prompt as context before asking the model to answer.

Instead of relying on the model's training data alone, a RAG system:

Takes the user's question
Retrieves the most relevant documents from a vector database
Injects those documents into the prompt
Asks the model to answer based on the provided context

The model's job becomes synthesis and reasoning over the retrieved text — not memorization. This grounds answers in real, up-to-date, verifiable source material.

Why RAG Matters for Developers

RAG solves three core problems that developers face when building AI applications:

Knowledge cutoff — LLMs are trained on data up to a certain date. RAG lets you query real-time or recent data without retraining the model.

Private data — Models do not know your internal documentation, product data, or knowledge base. RAG injects this context at query time without exposing training data.

Hallucination reduction — When instructed to answer only from provided context, models hallucinate far less. The retrieved documents anchor the answer.

Citation and auditability — RAG systems can return the source documents alongside the answer, making responses verifiable and auditable.

For building a full RAG application, see how to build a RAG app. For the underlying retrieval mechanism, see vector databases explained.

How RAG Works

RAG has two main phases: indexing (done once, offline) and retrieval + generation (done at query time).

Phase 1: Indexing

plaintext

Documents → Chunking → Embedding → Vector Store

Load documents — PDFs, web pages, markdown files, database records
Chunk — Split into segments of 200–500 tokens with overlap
Embed — Convert each chunk to a dense vector using an embedding model
Store — Save vectors and text in a vector database

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Embed and store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

Phase 2: Retrieval and Generation

plaintext

User Query → Embed Query → Vector Search → Top-K Chunks → Prompt → LLM → Answer

Python

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based ONLY on the context below. If the answer is not in the context, say you do not know.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n---\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Get answer
answer = chain.invoke("What is our refund policy?")
print(answer)

# Get sources separately
sources = retriever.invoke("What is our refund policy?")
print([doc.metadata for doc in sources])

The RAG Prompt

The prompt template is the core of the RAG pattern:

plaintext

Answer the question based ONLY on the provided context.
If the context does not contain enough information to answer,
say "I do not have enough information to answer this."

Context:
{retrieved_chunks}

Question: {user_question}

Answer:

The instruction "based ONLY on the provided context" is critical. Without it, the model may mix retrieved information with its prior knowledge, reducing grounding.

Practical Examples

Document Q&A System

Python

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load and index
loader = PyPDFLoader("company_handbook.pdf")
docs = loader.load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
).split_documents(docs)

vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Build LCEL chain
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based ONLY on the context. If not in context, say you do not know.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)

answer = chain.invoke("How many vacation days do employees get?")
print(answer)

Tools and Frameworks

LangChain — The most common framework for building RAG pipelines. Provides document loaders, text splitters, vector store integrations, and retrieval chains.

LlamaIndex — Specializes in data-connected LLM applications. Excellent for complex document hierarchies and multi-document reasoning.

Chroma — Open-source, embedded vector database. Easy to set up locally, no server required.

FAISS — Facebook's library for efficient similarity search. Fast and battle-tested but does not persist to disk natively.

Pinecone / Weaviate / Qdrant — Managed cloud vector databases with production features: scaling, filtering, and metadata search.

For a deep dive on vector databases, see vector databases explained. For embeddings, see embeddings explained.

Common Mistakes

Chunks too large or too small — Chunks that are too large dilute the relevant signal. Too small and they lose context. The sweet spot is usually 200–500 tokens with 10–15% overlap.

No metadata filtering — When your knowledge base spans multiple domains or document types, filter by metadata at retrieval time. Otherwise you retrieve irrelevant chunks from the wrong context.

Retrieving too few or too many chunks — Three to five chunks is typical. Too few misses relevant information. Too many fills the context window with noise and dilutes attention.

Not grounding the model in the context — Without explicit instructions to use only the provided context, models blend retrieved facts with prior knowledge and hallucinate. Always include grounding instructions.

Ignoring retrieval quality — If retrieval returns irrelevant chunks, the generation step has nothing to work with. Measure retrieval precision and recall independently of generation quality.

Best Practices

Evaluate retrieval and generation separately — A good RAG system requires both good retrieval (right chunks returned) and good generation (right answer from those chunks). Measure each independently.
Add metadata to your documents — Source, date, document type, section. Good metadata enables filtered retrieval and source citation.
Test with adversarial queries — Queries that should return "I do not know" are as important to test as queries that should return an answer.
Use a re-ranker for high-stakes retrieval — A cross-encoder re-ranker (like Cohere Rerank) improves precision by scoring retrieved chunks against the query before passing them to the model.
Monitor chunk quality — Chunks split in the middle of a sentence or table lose meaning. Review a sample of chunks manually after indexing.

Frequently Asked Questions

What is the difference between RAG and fine-tuning? RAG retrieves external knowledge at query time and injects it into the prompt — model weights never change. Fine-tuning updates the model's weights on new training data, baking knowledge into the model. Use RAG when your data changes frequently or you need source attribution. Use fine-tuning when you need to change the model's behavior, tone, or output format in ways that prompting cannot achieve.

How many documents can a RAG system handle? The practical limit depends on your vector database, not on RAG as a technique. Chroma handles millions of vectors on a single machine. Managed databases like Pinecone and Qdrant scale to hundreds of millions. A 10,000-document knowledge base with 100 chunks per document — 1 million vectors — is well within range of any production vector store.

What embedding model should I use for RAG? text-embedding-3-small from OpenAI is the best default: it balances cost, speed, and quality for most English-language tasks. For multilingual corpora, use multilingual-e5-large. For a fully local, zero-cost setup, BAAI/bge-small-en-v1.5 from HuggingFace performs comparably.

When does RAG fail and how do I fix it? RAG fails in two ways: retrieval failure (right chunks not returned) and generation failure (right chunks returned but model ignores them). For retrieval failures: shrink chunk size, increase k, add a reranker, or improve your embedding model. For generation failures: strengthen the grounding instruction ("answer ONLY from provided context") and evaluate with RAGAS faithfulness scores.

Does RAG work with non-English documents? Yes. Use a multilingual embedding model (multilingual-e5-large or paraphrase-multilingual-mpnet-base-v2) and an LLM that supports your target language. The RAG architecture is language-agnostic — the embedding model handles the semantic matching, and the LLM handles generation.

What is the difference between RAG and semantic search? Semantic search retrieves relevant documents. RAG retrieves relevant documents AND synthesizes an answer from them using an LLM. Semantic search is the retrieval component of RAG — RAG adds the generation step on top.

Can I use RAG without OpenAI? Yes. The entire pipeline runs locally. Use Ollama + Llama 3 for the LLM, BAAI/bge-small-en-v1.5 for embeddings (via HuggingFace), and Chroma for the vector store. No API keys required.

Key Takeaways

RAG gives LLMs access to external knowledge at query time without changing model weights — ideal for private data, up-to-date content, and citation requirements
The pipeline has two phases: indexing (chunk → embed → store, run once) and querying (embed query → retrieve → generate, run on every request)
The grounding instruction in the prompt ("answer ONLY from the provided context") is critical — without it, models blend retrieved content with training-data hallucinations
Use RecursiveCharacterTextSplitter at 400–512 chars with 10–15% overlap as the default chunking strategy
Use text-embedding-3-small (OpenAI) or BAAI/bge-small-en-v1.5 (free, local) for embeddings
Use LCEL (prompt | llm | StrOutputParser()) — not deprecated RetrievalQA.from_chain_type()
Evaluate retrieval and generation separately — most RAG problems are retrieval problems, not generation problems
RAG beats fine-tuning for use cases that need up-to-date data, source attribution, or fast knowledge base updates

What to Learn Next

Build a full RAG pipeline → LangChain RAG Tutorial
Evaluate your RAG system → RAG Evaluation with RAGAS
Handle multiple documents → Multi-Document RAG
RAG vs fine-tuning decision guide → RAG vs Fine-tuning