RAG — giving AI access to your data
How retrieval-augmented generation works, when to use it, and the simpler alternatives most people overlook.
A trained model knows what it read during training. It doesn't know your company's wiki, last week's emails, or the PDF you just opened. RAG — Retrieval-Augmented Generation — is the standard pattern for fixing that. The name sounds complicated; the idea is simple.
The 30-second mental model
- You have a pile of documents the model doesn't know about.
- You ask a question.
- A search step finds the most relevant chunks from your pile.
- The model answers with those chunks attached to the prompt.
That's it. Retrieval (find the docs) + Generation (answer with them in context).
How retrieval actually works
- Chunk. Split every document into ~500–1500 token pieces. Overlap them slightly so context isn't sliced apart.
- Embed. Run each chunk through an embedding model (OpenAI, Cohere, Voyage, open-source like nomic-embed). Out comes a high-dimensional vector — a "fingerprint of meaning".
- Store. Save the chunks + their vectors in a vector database — Postgres with pgvector, Pinecone, Qdrant, Weaviate, LanceDB.
- Search. When a question comes in, embed it too. Find the chunks whose vectors are nearest (cosine similarity).
- Generate. Put the top 3–10 chunks into the LLM's prompt: "Answer the user's question using only these snippets." Ask the model to cite the source of each claim where possible.
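The five steps above can be sketched end to end in a few dozen lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, chunk sizes are counted in words rather than tokens, the index is a plain list instead of a vector database, and the document names are made up.

```python
import math
import re
from collections import Counter

def chunk_words(text, size=200, overlap=40):
    """Split text into overlapping chunks. Words stand in for tokens here."""
    words = text.split()
    step = size - overlap  # assumes overlap < size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(docs, size=200, overlap=40):
    """Chunk + embed + store. docs maps a source name to its text."""
    index = []
    for source, text in docs.items():
        for chunk in chunk_words(text, size, overlap):
            index.append((source, chunk, toy_embed(chunk)))
    return index

def search(index, question, k=3):
    """Embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)
    return ranked[:k]

def build_prompt(question, hits):
    """Attach the retrieved chunks, numbered so the model can cite them."""
    snippets = "\n".join(
        f"[{i}] ({source}) {chunk}" for i, (source, chunk, _) in enumerate(hits, 1)
    )
    return (
        "Answer the user's question using only these snippets. "
        "Cite the snippet number for each claim.\n\n"
        f"{snippets}\n\nQuestion: {question}"
    )

# Made-up documents for illustration only.
docs = {
    "vacation-policy.md": "Employees accrue 20 vacation days per year. Unused vacation days expire in March.",
    "expense-policy.md": "Meals on business trips are reimbursed up to 50 dollars per day with a receipt.",
}
question = "How many vacation days do employees get?"
index = build_index(docs)
hits = search(index, question, k=1)
prompt = build_prompt(question, hits)  # this string is what you would send to the LLM
```

Swapping `toy_embed` for a real embedding model and the list for a vector store changes the quality and scale, but not the shape of the pipeline.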
When RAG is the right answer
- Internal company knowledge. Wiki, runbooks, sales playbooks, policy docs.
- Customer support. Help center articles + past tickets.
- Code search. "Find a function that does X" across a large codebase.
- Anywhere with a long tail of facts that can't all fit in a system prompt.
The "you might not need RAG" alternatives
Most non-engineers reach for RAG first and end up over-engineering. Try these first:
- Long context window. 2026 models hold 200K–2M tokens. If your data is under 100K tokens (~75K words), just paste it into the prompt or attach it as a file. Simpler, no infrastructure.
- Projects / Custom GPTs / Gems. Upload your docs as knowledge files. The product handles the chunking + retrieval for you. See the projects lesson.
- A really good system prompt. Often the "data" you wanted to retrieve is just a few hundred words of policy. Bake it into the system prompt and skip retrieval entirely.
- MCP filesystem server. Let the agent read your files on demand. No vector DB needed; the agent greps its way to the answer. See the MCP lesson.
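A quick way to check whether the long-context route applies is to estimate the size of your data before building anything. The sketch below assumes the rough ratio used above (100K tokens is about 75K words, so 0.75 words per token); real tokenizers vary by model and language, so treat the result as a ballpark. The function names are illustrative.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a word count. The 0.75 ratio is an assumption."""
    return round(len(text.split()) / words_per_token)

def fits_in_prompt(texts, budget_tokens=100_000):
    """Return (estimated total tokens, whether the total is under the budget)."""
    total = sum(estimate_tokens(t) for t in texts)
    return total, total < budget_tokens

# Example: two small made-up documents easily fit, so pasting them is simpler than RAG.
total, ok = fits_in_prompt(["refund policy text " * 100, "shipping policy text " * 200])
```

If `ok` is true, pasting the documents into the prompt is probably the simpler option.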
Where RAG goes wrong
| Symptom | Likely cause |
|---|---|
| Model invents an answer | Retrieval didn't return relevant chunks; model filled the gap |
| Answer cites a wrong chunk | Chunking sliced the context away from the answer |
| Retrieval misses obvious matches | Embedding model can't tell the question and the doc share a topic — try a better embedder or add keyword search |
| Slow / expensive | Too many retrieved chunks, or chunks too large. The top 3 chunks of ~800 tokens each is usually plenty |
| Stale answers | Embedding store never re-indexes — add a cron / webhook |
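
For the stale-answers row, one common approach (an assumption here, not something the table prescribes) is to store a content hash per document at index time and have the cron job or webhook re-embed only what changed. A minimal sketch, with hypothetical function names:

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Compare live documents against the hashes saved at the last index run.

    current_docs: dict of name -> current text
    indexed_hashes: dict of name -> hash recorded when the doc was last embedded
    Returns (names to re-embed, names to delete from the index).
    """
    to_embed = [
        name for name, text in current_docs.items()
        if indexed_hashes.get(name) != content_hash(text)
    ]
    to_delete = [name for name in indexed_hashes if name not in current_docs]
    return sorted(to_embed), sorted(to_delete)
```

Running this on a schedule keeps re-indexing cheap, because unchanged documents are never re-embedded.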
2026 patterns worth knowing
- Hybrid search. Combine semantic (embeddings) with keyword (BM25). Names, IDs, exact phrases — keywords still beat embeddings for those.
- Reranking. Retrieve 30 candidates, then ask a small reranker model to score and keep the top 3. Big quality jump for small cost.
- Agentic retrieval. Instead of one search step, let the agent search, read, search again. Costs more but answers harder questions.
- Citation-required prompts. "Cite the chunk number for every claim." Cuts hallucinations and makes verification possible.
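Hybrid search and reranking both come down to combining ranked lists. The sketch below merges a semantic ranking and a keyword ranking with reciprocal rank fusion (one common fusion method, chosen here as an assumption; the constant `k=60` is a conventional default), then keeps the top results according to a scoring function. `score_fn` is a hypothetical stand-in for a real reranker model, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs. Each list contributes 1 / (k + rank) per ID."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

def retrieve_then_rerank(question, candidates, score_fn, keep=3):
    """Score each candidate against the question and keep the best few.

    score_fn(question, candidate) is a placeholder for a reranker model call.
    """
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:keep]

semantic = ["doc-a", "doc-b", "doc-c"]   # order from embedding search
keyword = ["doc-c", "doc-a", "doc-d"]    # order from keyword (BM25) search
fused = reciprocal_rank_fusion([semantic, keyword])
```

Documents that rank well in both lists rise to the top, while a document found by only one method still stays in the candidate set for the reranker.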
Build path: if you want to ship a RAG app, the shortest route in 2026 is OpenAI Assistants API or
Anthropic's file search. Both handle chunking, embedding, and retrieval for you. Rolling your own pays off only when
you need control over the embeddings, the reranker, or the storage layer.
Is fine-tuning the same thing?
No. They sit at opposite ends of the spectrum. Fine-tuning bakes patterns into the model's weights:
permanent and slow to change. RAG keeps the data outside the model, so it is fast to update
and fast to swap. Most use cases want RAG; fine-tune only when the model needs to learn a
consistent format or style it can't get from a prompt.
What does an embedding model actually output?
A vector of numbers, commonly 768, 1536, or 3072 of them depending on the model. No single number
means anything on its own; distances between vectors are what matter. Two vectors that sit close
together represent two pieces of text that are about similar things.
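To make "close together" concrete, here is cosine similarity computed by hand on tiny made-up vectors. Real embeddings have hundreds or thousands of dimensions and come from a model; the three-number vectors and their names below are invented purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "embeddings" for three pieces of text.
refund_policy = [0.9, 0.1, 0.0]
returns_faq = [0.8, 0.2, 0.1]
office_map = [0.0, 0.1, 0.9]

similar = cosine_similarity(refund_policy, returns_faq)    # close to 1
unrelated = cosine_similarity(refund_policy, office_map)   # close to 0
```

A search step simply computes this score between the question's vector and every stored chunk's vector and keeps the highest ones.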
Do I need a vector DB or can I use Postgres?
Postgres + pgvector is excellent for under ~10M chunks and is the default recommendation now.
Reach for a dedicated vector DB (Pinecone, Qdrant, Weaviate) when you need distributed scale,
fancy hybrid search, or specific operational features.
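For a sense of what the Postgres route looks like, the sketch below holds the SQL as Python strings. The `vector` extension, the `vector(n)` column type, and the `<=>` cosine-distance operator come from pgvector; the table name, column names, and the 1536 dimension are assumptions for illustration, and the `%s` placeholders assume a psycopg-style driver, which is not imported or run here.

```python
# Schema: one row per chunk, with its text and its embedding.
# Table and column names are hypothetical; 1536 must match your embedding model.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    source text NOT NULL,
    content text NOT NULL,
    embedding vector(1536) NOT NULL
);
"""

# Nearest-neighbor search: <=> is pgvector's cosine distance (smaller = closer).
SEARCH_SQL = """
SELECT source, content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""

def to_vector_literal(values):
    """Format a list of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"

def search_params(question_embedding, k=5):
    """Parameters to pass alongside SEARCH_SQL to a driver's execute() call."""
    return (to_vector_literal(question_embedding), k)
```

With a driver connection in hand, running the search would look roughly like `cursor.execute(SEARCH_SQL, search_params(embedding))`, where `embedding` is the question's vector from your embedding model.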