RAG — giving AI access to your data

How retrieval-augmented generation works, when to use it, and the simpler alternatives most people overlook.

A trained model knows what it read during training. It doesn't know your company's wiki, last week's emails, or the PDF you just opened. RAG — Retrieval-Augmented Generation — is the standard pattern for fixing that. The name sounds complicated; the idea is simple.

The 30-second mental model

You have a pile of documents the model doesn't know about.
You ask a question.
A search step finds the most relevant chunks from your pile.
The model answers with those chunks attached to the prompt.

That's it. Retrieval (find the docs) + Generation (answer with them in context).

How retrieval actually works

Chunk. Split every document into pieces of roughly 500–1,500 tokens. Overlap adjacent pieces slightly so a passage that straddles a boundary still appears whole in at least one chunk.
Embed. Run each chunk through an embedding model (OpenAI, Cohere, Voyage, open-source like nomic-embed). Out comes a high-dimensional vector — a "fingerprint of meaning".
Store. Save the chunks + their vectors in a vector database — Postgres with pgvector, Pinecone, Qdrant, Weaviate, LanceDB.
Search. When a question comes in, embed it with the same model. Find the chunks whose vectors are nearest to the question's vector, typically by cosine similarity.
Generate. Put the top 3–10 chunks into the LLM's prompt with an instruction such as: "Answer the user's question using only these snippets." Ask the model to cite the source of each claim where possible.
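The five steps above can be sketched end to end in a few dozen lines. This is a minimal illustration under loud assumptions, not a production design: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, words stand in for tokens, the "vector database" is a plain list, and all function names and sample documents are invented for the example.

```python
import math
import re
from collections import Counter

def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping windows of words (words stand in for tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def toy_embed(text):
    """ASSUMPTION: stand-in for a real embedding model (sparse word counts)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(docs, size=200, overlap=40):
    """Chunk, embed, and store every document; the store here is just a list."""
    index = []
    for name, text in docs.items():
        for i, chunk in enumerate(chunk_words(text, size, overlap)):
            index.append({"source": f"{name}#{i}", "text": chunk, "vector": toy_embed(chunk)})
    return index

def search(index, question, k=3):
    """Embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    return sorted(index, key=lambda row: cosine(q, row["vector"]), reverse=True)[:k]

def build_prompt(question, hits):
    """Attach the retrieved chunks to the prompt as numbered snippets."""
    snippets = "\n".join(f"[{n}] ({h['source']}) {h['text']}" for n, h in enumerate(hits, 1))
    return (
        "Answer the user's question using only these snippets. "
        "Cite the snippet number for every claim.\n\n"
        f"{snippets}\n\nQuestion: {question}"
    )

# Hypothetical sample documents, with tiny chunk sizes so the overlap is visible.
docs = {
    "vacation.md": "Employees accrue vacation days monthly. Unused vacation days expire in March.",
    "expenses.md": "Submit expense reports within 30 days. Attach receipts to every expense report.",
}
index = build_index(docs, size=8, overlap=2)
question = "When do unused vacation days expire?"
hits = search(index, question, k=2)
prompt = build_prompt(question, hits)
print(prompt)
```

In a real system, `toy_embed` would be replaced by a call to an embedding model and the list by a vector store; the shape of the pipeline stays the same.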

When RAG is the right answer

Internal company knowledge. Wiki, runbooks, sales playbooks, policy docs.
Customer support. Help center articles + past tickets.
Code search. "Find a function that does X" across a large codebase.
Anywhere with a long tail of facts that can't all fit in a system prompt.

The "you might not need RAG" alternatives

Most non-engineers reach for RAG and over-engineer. Try these first:

Long context window. 2026 models hold 200K–2M tokens. If your data is under 100K tokens (about 75K words), just paste it into the prompt or attach it as a file. Simpler, no infrastructure.
Projects / Custom GPTs / Gems. Upload your docs as knowledge files. The product handles the chunking + retrieval for you. See the projects lesson.
A really good system prompt. Often the "data" you wanted to retrieve is just a few hundred words of policy. Bake it into the system prompt and skip retrieval entirely.
MCP filesystem server. Let the agent read your files on demand. No vector DB needed; the agent greps its way to the answer. See the MCP lesson.
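To check whether the long-context option applies, a rough size estimate is usually enough. The sketch below assumes the rule of thumb implied above (about 0.75 English words per token) and the 100K-token budget; real tokenizers vary by model and language, and the function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # ASSUMPTION: rough average for English; real tokenizers differ

def estimate_tokens(text):
    """Estimate the token count of a text from its word count."""
    return math.ceil(len(text.split()) / WORDS_PER_TOKEN)

def fits_in_prompt(texts, budget_tokens=100_000):
    """Return (estimated_total_tokens, True if everything fits within the budget)."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total <= budget_tokens

# Hypothetical corpus: two short documents.
corpus = ["Refunds are issued within 14 days.", "Support hours are 9 to 5 on weekdays."]
total, fits = fits_in_prompt(corpus)
print(total, fits)
```

If the estimate lands well under the budget, pasting the documents into the prompt is likely the simpler path; near the limit, a precise count from the model's own tokenizer is worth the extra step.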

Where RAG goes wrong

Symptom	Likely cause
Model invents an answer	Retrieval didn't return relevant chunks; model filled the gap
Answer cites a wrong chunk	Chunking sliced the context away from the answer
Retrieval misses obvious matches	Embedding model can't tell that the question and the doc share a topic; try a better embedder or add keyword search
Slow / expensive	Too many retrieved chunks, or chunks too large; three chunks of ~800 tokens each is usually plenty
Stale answers	Embedding store never re-indexes; add a cron job or webhook that re-embeds changed documents
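For the stale-answers row, a scheduled job needs some way to decide what to re-embed. One possible approach, sketched here as an assumption rather than a prescribed method, is to store a content hash per document at index time and compare it on each run. The function names and sample data are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Compare current documents with the hashes saved at the last indexing run.

    Returns (changed_or_new, deleted): names to re-embed and names to drop.
    """
    changed = sorted(
        name for name, text in current_docs.items()
        if indexed_hashes.get(name) != content_hash(text)
    )
    deleted = sorted(name for name in indexed_hashes if name not in current_docs)
    return changed, deleted

# Hypothetical state: what was indexed last time vs. what exists now.
indexed = {
    "pricing.md": content_hash("Plan A costs $10."),
    "faq.md": content_hash("We ship worldwide."),
    "old-promo.md": content_hash("Summer sale."),
}
current = {
    "pricing.md": "Plan A costs $12.",    # edited since the last run
    "faq.md": "We ship worldwide.",       # unchanged
    "returns.md": "Returns within 30 days.",  # new
}
to_embed, to_delete = plan_reindex(current, indexed)
print(to_embed, to_delete)
```

Only the changed and new documents need to go back through chunking and embedding, which keeps a frequent cron run cheap.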

2026 patterns worth knowing

Hybrid search. Combine semantic (embeddings) with keyword (BM25). Names, IDs, exact phrases — keywords still beat embeddings for those.
Reranking. Retrieve 30 candidates, then ask a small reranker model to score and keep the top 3. Big quality jump for small cost.
Agentic retrieval. Instead of one search step, let the agent search, read, search again. Costs more but answers harder questions.
Citation-required prompts. "Cite the chunk number for every claim." Cuts hallucinations and makes verification possible.
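Hybrid search and reranking can be sketched together. The text above does not say how to combine the two result lists; reciprocal rank fusion (RRF) is one common choice and is used here as an assumption. The `toy_score` function is a hypothetical stand-in for a real reranker model, and the document IDs and texts are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

def rerank(question, candidate_ids, texts, score_fn, keep=3):
    """Score each candidate against the question and keep the best few."""
    ranked = sorted(candidate_ids, key=lambda d: score_fn(question, texts[d]), reverse=True)
    return ranked[:keep]

def toy_score(question, text):
    """ASSUMPTION: stand-in for a reranker model; counts shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))

# Hypothetical results from two retrievers over the same corpus.
semantic_hits = ["doc_b", "doc_a", "doc_c"]   # from embedding search
keyword_hits = ["doc_a", "doc_d", "doc_b"]    # from BM25 or similar
texts = {
    "doc_a": "invoice INV-1042 was refunded",
    "doc_b": "refund policy overview",
    "doc_c": "office seating chart",
    "doc_d": "invoice template",
}

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
top = rerank("status of invoice INV-1042", fused, texts, toy_score, keep=2)
print(fused, top)
```

RRF only needs rank positions, so it sidesteps the problem that embedding similarities and keyword scores live on different scales.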

Build path: if you want to ship a RAG app, the shortest route in 2026 is OpenAI's file search tool (introduced with the Assistants API, which OpenAI has since deprecated in favor of the Responses API) or Anthropic's file search. Both handle chunking, embedding, and retrieval for you. Roll-your-own pays off only when you need control over the embeddings, the reranker, or the storage layer.

Is fine-tuning the same thing?

No, opposite ends of the spectrum. Fine-tuning bakes patterns into the model's weights — permanent, slow to change. RAG keeps the data outside the model — fast to update, fast to swap. Most use cases want RAG; fine-tune only when the model needs to learn a consistent format or style it can't get from a prompt.

What does an embedding model actually output?

A vector of, say, 768, 1,536, or 3,072 numbers, depending on the model. No single number means anything on its own; distances between vectors are what matter. Two vectors close together = two pieces of text that are about similar things.
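A small worked example may make "close together" concrete. The four-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented embeddings for three short texts.
refund_policy = [0.9, 0.1, 0.0, 0.2]   # "How do refunds work?"
money_back = [0.8, 0.2, 0.1, 0.3]      # "Can I get my money back?"
office_map = [0.0, 0.1, 0.9, 0.1]      # "Where is the third-floor kitchen?"

similar = cosine_similarity(refund_policy, money_back)
unrelated = cosine_similarity(refund_policy, office_map)
print(round(similar, 3), round(unrelated, 3))
```

The two refund-related vectors score close to 1, while the unrelated pair scores close to 0, even though no individual coordinate has a nameable meaning.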

Do I need a vector DB or can I use Postgres?

Postgres + pgvector is excellent for under ~10M chunks and is the default recommendation now. Reach for a dedicated vector DB (Pinecone, Qdrant, Weaviate) when you need distributed scale, fancy hybrid search, or specific operational features.
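For a sense of what the Postgres route involves, the sketch below builds the SQL a small pgvector setup might use. The table name `chunks`, its columns, and the helper names are hypothetical; `vector(n)` and the `<=>` cosine-distance operator come from pgvector itself, and the `%s` placeholders assume a psycopg-style driver. Nothing here connects to a database.

```python
def to_vector_literal(vec):
    """Format a list of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def schema_sql(dimensions):
    """DDL for a hypothetical chunks table; dimensions must match the embedding model."""
    return (
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        "CREATE TABLE IF NOT EXISTS chunks (\n"
        "    id bigserial PRIMARY KEY,\n"
        "    source text NOT NULL,\n"
        "    content text NOT NULL,\n"
        f"    embedding vector({int(dimensions)})\n"
        ");"
    )

# Parameterized statements; pass values separately so the driver handles quoting.
INSERT_SQL = "INSERT INTO chunks (source, content, embedding) VALUES (%s, %s, %s::vector);"
NEAREST_SQL = (
    "SELECT source, content FROM chunks "
    "ORDER BY embedding <=> %s::vector "  # <=> is cosine distance: smaller means closer
    "LIMIT %s;"
)

question_vector = [0.1, 0.2, 0.3]  # would come from the embedding model
params = (to_vector_literal(question_vector), 3)
print(schema_sql(1536))
print(NEAREST_SQL, params)
```

With a driver such as psycopg, these strings would be passed to `cursor.execute` along with the parameter tuples. For larger tables, pgvector also offers approximate indexes, which are worth reading about before relying on a sequential scan.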