Learn AI

    6 min

    RAG — giving AI access to your data

    How retrieval-augmented generation works, when to use it, and the simpler alternatives most people overlook.

    A trained model knows what it read during training. It doesn't know your company's wiki, last week's emails, or the PDF you just opened. RAG — Retrieval-Augmented Generation — is the standard pattern for fixing that. The name sounds complicated; the idea is simple.

    The 30-second mental model

    1. You have a pile of documents the model doesn't know about.
    2. You ask a question.
    3. A search step finds the most relevant chunks from your pile.
    4. The model answers with those chunks attached to the prompt.

    That's it. Retrieval (find the docs) + Generation (answer with them in context).

    [Diagram: your question → retriever (searches your docs) → top-k chunks → LLM → grounded answer]

    How retrieval actually works

    1. Chunk. Split every document into ~500–1500 token pieces. Overlap them slightly so context isn't sliced apart.
    2. Embed. Run each chunk through an embedding model (OpenAI, Cohere, Voyage, open-source like nomic-embed). Out comes a high-dimensional vector — a "fingerprint of meaning".
    3. Store. Save the chunks + their vectors in a vector database — Postgres with pgvector, Pinecone, Qdrant, Weaviate, LanceDB.
    4. Search. When a question comes in, embed it with the same model, then find the chunks whose vectors are nearest to the question's vector (typically measured by cosine similarity).
    5. Generate. Stuff the top 3–10 chunks into the LLM's prompt: "Answer the user's question using only these snippets." Cite the source where possible.
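    The five steps above can be sketched end to end in plain Python. This is a toy under loud assumptions: `toy_embed` is a hypothetical stand-in that hashes word counts into a vector (a real system would call an embedding model), words stand in for tokens, an in-memory list stands in for the vector database, and all function names are illustrative.

```python
import hashlib
import math
import re


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Step 1: split text into overlapping windows of words (words stand in for tokens)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Step 2 (ASSUMPTION: toy stand-in for an embedding model): hashed word counts, unit length."""
    vec = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; both inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


def build_index(documents: list[str]) -> list[tuple[str, list[float]]]:
    """Step 3: store each chunk next to its vector (a list stands in for a vector DB)."""
    return [(chunk, toy_embed(chunk)) for doc in documents for chunk in chunk_words(doc)]


def retrieve(question: str, index: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Step 4: embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 5: attach numbered snippets to the prompt so the model can cite them."""
    snippets = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the user's question using only these snippets. "
        "Cite snippet numbers.\n\n"
        f"{snippets}\n\nQuestion: {question}"
    )


docs = [
    "To reset your password, open Settings and choose Reset password.",
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
]
index = build_index(docs)
question = "How do I reset my password?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

    The resulting `prompt` string is what would be sent to the LLM. Swapping `toy_embed` for a real embedding call and the list for pgvector or another store changes the plumbing, not the shape of the pipeline.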

    When RAG is the right answer

    • Internal company knowledge. Wiki, runbooks, sales playbooks, policy docs.
    • Customer support. Help center articles + past tickets.
    • Code search. "Find a function that does X" across a large codebase.
    • Anywhere with a long tail of facts that can't all fit in a system prompt.

    The "you might not need RAG" alternatives

    Most non-engineers who reach for RAG end up over-engineering. Try these first:

    • Long context window. 2026 models accept 200K–2M tokens of context. If your data is under 100K tokens (~75K words), just paste it into the prompt or attach it as a file. Simpler, no infrastructure.
    • Projects / Custom GPTs / Gems. Upload your docs as knowledge files. The product handles the chunking + retrieval for you. See the projects lesson.
    • A really good system prompt. Often the "data" you wanted to retrieve is just a few hundred words of policy. Bake it into the system prompt and skip retrieval entirely.
    • MCP filesystem server. Let the agent read your files on demand. No vector DB needed; the agent greps its way to the answer. See the MCP lesson.
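    A quick way to apply the long-context rule of thumb is to estimate token count from word count. The sketch below assumes the 4-tokens-per-3-words ratio implied by "100K tokens (~75K words)"; real tokenizers vary by model and language, so treat the result as a rough guide. The function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 tokens per 3 words, rounded up (ASSUMED ratio)."""
    words = len(text.split())
    return (words * 4 + 2) // 3  # integer ceiling of words * 4 / 3


def fits_in_prompt(text: str, budget_tokens: int = 100_000) -> bool:
    """True if the text is estimated to stay under the budget, so pasting it in may be enough."""
    return estimate_tokens(text) < budget_tokens


handbook = "Refunds are issued within 14 days of purchase. " * 1000
print(estimate_tokens(handbook), fits_in_prompt(handbook))
```

    If `fits_in_prompt` returns True for your whole document set, pasting or attaching the data is worth trying before building any retrieval pipeline.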

    Where RAG goes wrong

    Symptom                          | Likely cause
    Model invents an answer          | Retrieval didn't return relevant chunks; model filled the gap
    Answer cites a wrong chunk       | Chunking sliced the context away from the answer
    Retrieval misses obvious matches | Embedding model can't tell the question and the doc share a topic; try a better embedder or add keyword search
    Slow / expensive                 | Too many retrieved chunks, or chunks too large; top-3 of ~800 tokens is usually plenty
    Stale answers                    | Embedding store never re-indexes; add a cron / webhook
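    For the stale-answers row, one possible approach (an assumption, not something the lesson prescribes) is to store a content hash next to each document's embeddings and have the cron job or webhook re-embed only what changed. The names `content_hash` and `plan_reindex` and the document IDs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint a document's text so changes can be detected cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], stored_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (ids to re-embed, ids to delete from the store).

    current_docs maps document id -> current text.
    stored_hashes maps document id -> hash recorded at the last indexing run.
    """
    changed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or edited
    )
    deleted = sorted(doc_id for doc_id in stored_hashes if doc_id not in current_docs)
    return changed, deleted


stored = {
    "wiki/onboarding": content_hash("Laptops ship in 5 days."),
    "wiki/vpn": content_hash("Use the corporate VPN."),
    "wiki/old-policy": content_hash("Retired policy."),
}
current = {
    "wiki/onboarding": "Laptops ship in 3 days.",  # edited since last run
    "wiki/vpn": "Use the corporate VPN.",          # unchanged
    "wiki/expenses": "Submit receipts monthly.",   # new
}
to_embed, to_delete = plan_reindex(current, stored)
print(to_embed, to_delete)
```

    Running this on a schedule keeps re-indexing cost proportional to what actually changed rather than to the size of the whole corpus.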

    2026 patterns worth knowing

    • Hybrid search. Combine semantic (embeddings) with keyword (BM25). Names, IDs, exact phrases — keywords still beat embeddings for those.
    • Reranking. Retrieve 30 candidates, then ask a small reranker model to score and keep the top 3. Big quality jump for small cost.
    • Agentic retrieval. Instead of one search step, let the agent search, read, search again. Costs more but answers harder questions.
    • Citation-required prompts. "Cite the chunk number for every claim." Cuts hallucinations and makes verification possible.
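    Hybrid search needs some way to merge the semantic and keyword result lists. One common choice is reciprocal rank fusion (RRF), sketched below; it is named here as an illustration, not as the method this lesson prescribes. The chunk IDs are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


semantic_hits = ["a", "b", "c"]  # hypothetical ids from embedding search
keyword_hits = ["c", "a", "d"]   # hypothetical ids from BM25 keyword search
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
top_3 = fused[:3]  # the candidates that would go to a reranker or the prompt
print(top_3)
```

    RRF only looks at rank positions, which sidesteps the problem that embedding scores and BM25 scores live on different scales.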

    Build path: if you want to ship a RAG app, the shortest route in 2026 is the OpenAI Assistants API or Anthropic's file search; both handle chunking, embedding, and retrieval for you. Roll-your-own pays off only when you need control over the embeddings, the reranker, or the storage layer.

    Is fine-tuning the same thing?

    No, they sit at opposite ends of the spectrum. Fine-tuning bakes patterns into the model's weights: permanent, slow to change. RAG keeps the data outside the model: fast to update, fast to swap. Most use cases want RAG; fine-tune only when the model needs to learn a consistent format or style it can't get from a prompt.

    What does an embedding model actually output?

    A vector of numbers, typically 768, 1536, or 3072 of them depending on the model. No single number means anything on its own; the distances between vectors are what matter. Two vectors that are close together represent two pieces of text about similar things.

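    To see why distances matter more than individual numbers, here is cosine similarity computed on made-up 3-number vectors. Real embeddings have hundreds or thousands of dimensions, and the values below are invented for illustration, not output from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Invented vectors standing in for the embeddings of three short texts.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
office_hours = [0.0, 0.2, 0.9]

similar = cosine_similarity(refund_policy, money_back)
different = cosine_similarity(refund_policy, office_hours)
print(round(similar, 2), round(different, 2))
```

    The two refund-related vectors score far higher with each other than either does with the office-hours vector, which is the property retrieval relies on.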
    Do I need a vector DB or can I use Postgres?

    Postgres + pgvector is excellent for under ~10M chunks and is the default recommendation now. Reach for a dedicated vector DB (Pinecone, Qdrant, Weaviate) when you need distributed scale, fancy hybrid search, or specific operational features.