
RAG

Retrieval-Augmented Generation · February 17, 2026

Summary

RAG (Retrieval-Augmented Generation) is a technique that makes AI assistants smarter by letting them look up information before answering. Instead of relying only on what they learned during training, RAG systems search through documents, databases, or the web to find relevant facts — then use those facts to generate accurate, up-to-date responses.

1
Elementary School
Ages 8-10

Imagine you have a really smart friend who knows a lot of things. But sometimes, when you ask them a question, they're not 100% sure of the answer — maybe it's about something that happened last week, or something very specific.

RAG is like giving your smart friend a backpack full of books and notes. Before answering your question, they can quickly flip through their backpack to find the right information. Then they give you an answer based on what they just read!

So instead of just guessing or saying what they think they know, your friend can actually look it up and give you a better answer. That's what RAG does for AI — it lets the AI "look things up" before responding to you.

2
High School
Ages 14-18

Historical context: The term "Retrieval-Augmented Generation" was coined by Facebook AI Research (now Meta AI) in a 2020 paper. It emerged because large language models (LLMs) like GPT have a fundamental limitation: their knowledge is frozen at the time of training. If you train a model in 2023, it doesn't know about events in 2024.

Before RAG, there were two separate worlds: retrieval systems (like Google Search, which finds documents) and generation systems (like ChatGPT, which writes text). RAG combined them into a single pipeline.

Question → Retrieve relevant documents → Generate answer using those documents

The core insight: Instead of trying to cram all knowledge into a model's weights (which is expensive and gets outdated), let the model access an external knowledge base at inference time.

Why this matters:

  • Accuracy: Answers are grounded in actual documents, not just "vibes"
  • Freshness: Knowledge can be updated without retraining
  • Attribution: You can cite sources for where the information came from
  • Domain specialization: A general AI can become an expert on your company's documents
Example

You ask an AI: "What was our company's revenue last quarter?" A vanilla LLM would have no idea. A RAG system searches your internal financial documents, finds the quarterly report, and answers: "$4.2M, up 15% from Q2" — with a link to the source.

3
College Undergraduate
Ages 18-22

The RAG pipeline explained (a minimal end-to-end sketch follows the steps):

1. Indexing (offline): Documents are chunked into smaller pieces (typically 256-512 tokens), then converted into vector embeddings using an embedding model (like OpenAI's text-embedding-3 or open-source alternatives like BGE, E5). These vectors are stored in a vector database.

2. Retrieval (at query time): When a user asks a question, the query is also converted to a vector. The system performs a similarity search (cosine similarity, dot product) to find the k most relevant document chunks.

3. Augmentation: The retrieved chunks are inserted into the LLM's prompt as context, typically in a format like: "Based on the following documents: [chunks]. Answer the user's question: [query]"

4. Generation: The LLM generates an answer, ideally grounded in the provided context.
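
A minimal end-to-end sketch of these four steps in Python, assuming the sentence-transformers package and a small open-source embedding model; a real system would swap in a proper vector database and an LLM client for the final generation step:

    # Minimal RAG sketch: embed chunks, retrieve by cosine similarity, build a grounded prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

    # 1. Indexing (offline): embed pre-chunked documents and keep the vectors.
    chunks = [
        "Q3 revenue was $4.2M, up 15% from Q2.",
        "The company was founded in 2019 in Berlin.",
        "Headcount grew from 40 to 65 employees during Q3.",
    ]
    chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        # 2. Retrieval: embed the query and rank chunks by cosine similarity.
        # With normalized vectors, the dot product equals cosine similarity.
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = chunk_vectors @ q
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]

    def build_prompt(query: str) -> str:
        # 3. Augmentation: insert the retrieved chunks into the prompt as context.
        context = "\n".join(f"- {c}" for c in retrieve(query))
        return (
            f"Based on the following documents:\n{context}\n\n"
            f"Answer the user's question: {query}"
        )

    # 4. Generation: the assembled prompt would be passed to any LLM client.
    print(build_prompt("What was revenue last quarter?"))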

Key technical components:

  • Embedding models: Convert text to dense vectors that capture semantic meaning
  • Vector databases: Specialized databases (Pinecone, Weaviate, Chroma, pgvector) optimized for similarity search
  • Chunking strategies: How to split documents — by sentences, paragraphs, or semantic boundaries
  • Reranking: A second-stage model that re-scores retrieved chunks for relevance
Chunking Matters

If you chunk too small (single sentences), you lose context. If you chunk too large (entire documents), retrieval becomes imprecise. The sweet spot depends on your use case — customer support FAQs might use small chunks; legal document analysis might need larger ones with overlap.
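
A toy chunker illustrating that trade-off, using word counts as a stand-in for tokens; the default chunk_size and overlap values are arbitrary, not recommendations:

    # Naive fixed-size chunker with overlap (word-based as a stand-in for tokens).
    def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
        words = text.split()
        chunks = []
        step = chunk_size - overlap
        for start in range(0, len(words), step):
            piece = words[start:start + chunk_size]
            if piece:
                chunks.append(" ".join(piece))
        return chunks

    # Larger chunk_size keeps more context per chunk; overlap reduces the chance
    # that a relevant sentence is cut in half at a chunk boundary.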

4
Graduate Student
Advanced degree level

Advanced RAG patterns:

Hybrid search: Combining dense vector search with sparse lexical search (BM25). Dense search captures semantic similarity ("car" matches "automobile"), while sparse search excels at exact matches (product codes, names). Production systems often use both.
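
One common way to merge the dense and sparse result lists is reciprocal rank fusion, sketched below; the function name and the constant k = 60 are conventional choices rather than anything prescribed by a particular system:

    # Reciprocal rank fusion: merge two ranked lists of document ids.
    def rrf(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
        # A document's score is the sum of 1 / (k + rank) over every list it appears in.
        scores: dict[str, float] = {}
        for ranked in (dense_ranked, sparse_ranked):
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # "doc7" ranks well in both lists, so it tops the fused ranking.
    print(rrf(["doc7", "doc2", "doc9"], ["doc1", "doc7", "doc2"]))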

Query transformation: Before retrieval, transform the user query to improve results:

  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then search for documents similar to that answer (see the sketch after this list)
  • Query expansion: Add synonyms or related terms
  • Query decomposition: Break complex questions into sub-questions
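
A HyDE-style sketch, with the LLM client and the vector-store search passed in as placeholders rather than tied to any particular library:

    from typing import Callable

    def hyde_retrieve(
        query: str,
        call_llm: Callable[[str], str],                  # your LLM client (assumption)
        vector_search: Callable[[str, int], list[str]],  # your vector store (assumption)
        k: int = 5,
    ) -> list[str]:
        # Generate a hypothetical answer, then search for documents similar to it.
        hypothetical = call_llm(f"Write a short passage that plausibly answers: {query}")
        # The fabricated answer usually sits closer to real answer passages in
        # embedding space than the bare question does.
        return vector_search(hypothetical, k)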

Agentic RAG: Instead of a single retrieve-then-generate step, an agent iteratively decides: "Do I need more information? Should I search a different source? Is this answer complete?" This handles complex, multi-step reasoning.
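
A rough sketch of that loop, assuming an LLM client, a search function, and a simple JSON action protocol; all three are illustrative, not a standard interface:

    import json
    from typing import Callable

    def agentic_answer(
        question: str,
        call_llm: Callable[[str], str],      # your LLM client (assumption)
        search: Callable[[str], list[str]],  # your retrieval function (assumption)
        max_steps: int = 4,
    ) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            # The model decides whether it has enough evidence or needs another search.
            decision = json.loads(call_llm(
                "You answer questions using gathered notes.\n"
                f"Question: {question}\nNotes so far: {notes}\n"
                'Reply as JSON: {"action": "search" or "answer", "content": "..."}'
            ))
            if decision["action"] == "answer":
                return decision["content"]
            notes.extend(search(decision["content"]))  # gather more evidence, then re-decide
        return call_llm(f"Answer {question!r} using only these notes: {notes}")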

Multi-modal RAG: Extending retrieval beyond text to images, tables, and structured data. Models like GPT-4V can process retrieved images; table retrieval requires special handling (text-to-SQL, structured extraction).

Evaluation metrics:

  • Retrieval: Recall@k, MRR (Mean Reciprocal Rank), NDCG (see the sketch after this list)
  • Generation: Faithfulness (does the answer match the sources?), relevance, completeness
  • End-to-end: Answer correctness, human preference scores
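
A small sketch of the first two retrieval metrics over a single query's ranked results:

    # Recall@k and MRR for one query, given the ids the retriever returned
    # and the set of ids labeled relevant in an evaluation set.
    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the relevant documents that appear in the top-k results.
        return len(set(retrieved[:k]) & relevant) / len(relevant)

    def mrr(retrieved: list[str], relevant: set[str]) -> float:
        # Reciprocal rank of the first relevant hit (0 if none was retrieved).
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    print(recall_at_k(["d3", "d1", "d8"], {"d1", "d5"}, k=3))  # 0.5
    print(mrr(["d3", "d1", "d8"], {"d1", "d5"}))               # 0.5

In practice these values are averaged over a whole set of evaluation queries.
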
The "Lost in the Middle" Problem

Research shows LLMs pay more attention to information at the beginning and end of long contexts, neglecting the middle. This affects RAG: if the most relevant chunk lands in the middle of 10 retrieved chunks, the model might ignore it. Solutions include reranking, limiting context, or position-aware prompting.
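
One simple position-aware mitigation is to interleave the ranked chunks so the strongest ones sit at both ends of the context, leaving weaker ones in the middle; a sketch:

    def reorder_for_position(chunks_by_relevance: list[str]) -> list[str]:
        # Input is best-first; output places the best chunks at the start and end.
        front, back = [], []
        for i, chunk in enumerate(chunks_by_relevance):
            (front if i % 2 == 0 else back).append(chunk)
        return front + back[::-1]

    print(reorder_for_position(["c1", "c2", "c3", "c4", "c5"]))
    # ['c1', 'c3', 'c5', 'c4', 'c2']  -> the two most relevant chunks end up at the edges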

5
Expert
Researchers & practitioners

GraphRAG and structured retrieval:

Microsoft Research's GraphRAG (2024) constructs a knowledge graph from source documents before retrieval. For queries requiring synthesis across multiple documents ("What are the common themes in all customer complaints?"), graph traversal outperforms flat vector search. The graph captures relationships that vector similarity misses.
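
A toy illustration of the general idea (extract triples, index them as a graph, pull a query entity's neighborhood as context) using networkx; this is only a sketch of the concept, not Microsoft's GraphRAG pipeline, which layers entity extraction, community detection, and summarization on top:

    import networkx as nx

    # Facts extracted from documents as (entity, relation, entity) triples (toy data).
    graph = nx.Graph()
    triples = [
        ("Customer A", "complained about", "late delivery"),
        ("Customer B", "complained about", "late delivery"),
        ("late delivery", "caused by", "warehouse backlog"),
    ]
    for head, relation, tail in triples:
        graph.add_edge(head, tail, fact=f"{head} {relation} {tail}")

    def neighborhood_facts(entity: str, hops: int = 2) -> list[str]:
        # Collect every stored fact within `hops` edges of the entity,
        # then hand those facts to the LLM as context.
        nodes = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
        return sorted({graph[u][v]["fact"] for u, v in graph.edges(nodes)})

    print(neighborhood_facts("Customer A"))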

Fine-tuning vs. RAG trade-offs:

  • Fine-tuning: Better for style/format, faster inference, no retrieval latency. But expensive, knowledge gets stale, hard to update
  • RAG: Easy to update, attributable, handles long-tail knowledge. But adds latency, retrieval can fail, context window limits
  • Hybrid: Production systems often fine-tune a base model on domain style while using RAG for factual grounding

Retrieval-augmented pre-training (RETRO, REALM):

Instead of adding retrieval only at inference, these architectures incorporate retrieval during training. The model learns to use retrieved documents as part of its core reasoning, not just as a bolted-on feature.

Self-RAG and adaptive retrieval:

Self-RAG (2023) trains the model to decide when to retrieve (not every query needs it) and to critique its own outputs for faithfulness. This reduces unnecessary retrieval calls and improves answer quality.
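
A sketch in the spirit of Self-RAG, using plain yes/no prompts where the actual paper trains special reflection tokens; the LLM client and retriever are passed in as placeholders:

    from typing import Callable

    def adaptive_answer(
        query: str,
        call_llm: Callable[[str], str],        # your LLM client (assumption)
        retrieve: Callable[[str], list[str]],  # your retriever (assumption)
    ) -> str:
        # Step 1: decide whether outside evidence is needed at all.
        needs_retrieval = call_llm(
            f"Does answering this require looking up external facts? yes/no\n{query}"
        ).strip().lower().startswith("yes")

        context = retrieve(query) if needs_retrieval else []
        draft = call_llm(f"Context: {context}\nQuestion: {query}\nAnswer:")

        # Step 2: if we retrieved, critique the draft for faithfulness to the sources.
        if context:
            supported = call_llm(
                f"Is this answer fully supported by the context? yes/no\n"
                f"Context: {context}\nAnswer: {draft}"
            ).strip().lower().startswith("yes")
            if not supported:
                draft = call_llm(
                    f"Rewrite the answer using only the context.\n"
                    f"Context: {context}\nQuestion: {query}"
                )
        return draft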

Production considerations:

  • Latency budget: Embedding + vector search + reranking + generation adds 200-500ms per query
  • Cost: Embedding 1M documents, storing vectors, LLM inference with large contexts
  • Security: Access control — can this user see these retrieved documents? (see the filtering sketch after this list)
  • Observability: Logging retrieval quality, attribution tracking, feedback loops
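
A sketch of that access-control point, filtering retrieved chunks against the caller's group memberships before they ever reach the prompt; the metadata fields are hypothetical:

    # Drop any retrieved chunk the current user is not allowed to see.
    def authorized_chunks(user_groups: set[str], retrieved: list[dict]) -> list[dict]:
        return [
            chunk for chunk in retrieved
            if chunk["allowed_groups"] & user_groups
        ]

    retrieved = [
        {"text": "Q3 board deck", "allowed_groups": {"finance", "exec"}},
        {"text": "Public FAQ", "allowed_groups": {"everyone"}},
    ]
    print(authorized_chunks({"everyone"}, retrieved))  # only the public FAQ survives
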
Corrective RAG (CRAG)

CRAG adds a self-correction step: after retrieval, a lightweight evaluator scores document relevance. Low-confidence retrievals trigger web search fallback or query rewriting. This handles cases where the local knowledge base lacks coverage.
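
A sketch of that corrective flow, with the retriever, evaluator, web search, and query rewriter all passed in as placeholders:

    from typing import Callable

    def corrective_retrieve(
        query: str,
        retrieve: Callable[[str], list[str]],          # local vector search (assumption)
        relevance_score: Callable[[str, str], float],  # lightweight evaluator (assumption)
        web_search: Callable[[str], list[str]],        # fallback source (assumption)
        rewrite_query: Callable[[str], str],           # query rewriter (assumption)
        threshold: float = 0.5,
    ) -> list[str]:
        docs = retrieve(query)
        scores = [relevance_score(query, d) for d in docs]

        # If nothing clears the bar, assume the local knowledge base lacks coverage.
        if not scores or max(scores) < threshold:
            return web_search(rewrite_query(query))

        # Otherwise keep only the chunks the evaluator considers relevant enough.
        return [d for d, s in zip(docs, scores) if s >= threshold]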

