
RAG

Retrieval-Augmented Generation · February 17, 2026

Summary

RAG (Retrieval-Augmented Generation) is a technique that makes AI assistants smarter by letting them look up information before answering. Instead of relying only on what they learned during training, RAG systems search through documents, databases, or the web to find relevant facts — then use those facts to generate accurate, up-to-date responses.

1
Elementary School
Ages 8-10

Imagine you have a really smart friend who knows a lot of things. But sometimes, when you ask them a question, they're not 100% sure of the answer — maybe it's about something that happened last week, or something very specific.

RAG is like giving your smart friend a backpack full of books and notes. Before answering your question, they can quickly flip through their backpack to find the right information. Then they give you an answer based on what they just read!

So instead of just guessing or saying what they think they know, your friend can actually look it up and give you a better answer. That's what RAG does for AI — it lets the AI "look things up" before responding to you.

2
High School
Ages 14-18

Historical context: The term "Retrieval-Augmented Generation" was coined by Facebook AI Research (now Meta AI) in a 2020 paper. It emerged because large language models (LLMs) like GPT have a fundamental limitation: their knowledge is frozen at the time of training. If you train a model in 2023, it doesn't know about events in 2024.

Before RAG, there were two separate worlds: retrieval systems (like Google Search, which finds documents) and generation systems (like ChatGPT, which writes text). RAG combined them into a single pipeline.

Question → Retrieve relevant documents → Generate answer using those documents

The core insight: Instead of trying to cram all knowledge into a model's weights (which is expensive and gets outdated), let the model access an external knowledge base at inference time.

Why this matters:

  • Accuracy: Answers are grounded in actual documents, not just "vibes"
  • Freshness: Knowledge can be updated without retraining
  • Attribution: You can cite sources for where the information came from
  • Domain specialization: A general AI can become an expert on your company's documents
Example

You ask an AI: "What was our company's revenue last quarter?" A vanilla LLM would have no idea. A RAG system searches your internal financial documents, finds the quarterly report, and answers: "$4.2M, up 15% from Q2" — with a link to the source.

3
College Undergraduate
Ages 18-22

The RAG pipeline explained (a minimal end-to-end sketch follows the steps):

1. Indexing (offline): Documents are chunked into smaller pieces (typically 256-512 tokens), then converted into vector embeddings using an embedding model (like OpenAI's text-embedding-3 or open-source alternatives like BGE, E5). These vectors are stored in a vector database.

2. Retrieval (at query time): When a user asks a question, the query is also converted to a vector. The system performs a similarity search (cosine similarity, dot product) to find the k most relevant document chunks.

3. Augmentation: The retrieved chunks are inserted into the LLM's prompt as context, typically in a format like: "Based on the following documents: [chunks]. Answer the user's question: [query]"

4. Generation: The LLM generates an answer, ideally grounded in the provided context.
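
A minimal end-to-end sketch of these four steps in Python, assuming the sentence-transformers package and a small open-source embedding model; a real system would swap in a proper vector database and an LLM client for the final generation step:

    # Minimal RAG sketch: embed chunks, retrieve by cosine similarity, build a grounded prompt.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

    # 1. Indexing (offline): embed pre-chunked documents and keep the vectors.
    chunks = [
        "Q3 revenue was $4.2M, up 15% from Q2.",
        "The company was founded in 2019 in Berlin.",
        "Headcount grew from 40 to 65 employees during Q3.",
    ]
    chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

    def retrieve(query: str, k: int = 2) -> list[str]:
        # 2. Retrieval: embed the query and rank chunks by cosine similarity.
        # With normalized vectors, the dot product equals cosine similarity.
        q = embedder.encode([query], normalize_embeddings=True)[0]
        scores = chunk_vectors @ q
        top = np.argsort(scores)[::-1][:k]
        return [chunks[i] for i in top]

    def build_prompt(query: str) -> str:
        # 3. Augmentation: insert the retrieved chunks into the prompt as context.
        context = "\n".join(f"- {c}" for c in retrieve(query))
        return (
            f"Based on the following documents:\n{context}\n\n"
            f"Answer the user's question: {query}"
        )

    # 4. Generation: the assembled prompt would be passed to any LLM client.
    print(build_prompt("What was revenue last quarter?"))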

Key technical components:

  • Embedding models: Convert text to dense vectors that capture semantic meaning
  • Vector databases: Specialized databases (Pinecone, Weaviate, Chroma, pgvector) optimized for similarity search
  • Chunking strategies: How to split documents — by sentences, paragraphs, or semantic boundaries
  • Reranking: A second-stage model that re-scores retrieved chunks for relevance
Chunking Matters

If you chunk too small (single sentences), you lose context. If you chunk too large (entire documents), retrieval becomes imprecise. The sweet spot depends on your use case — customer support FAQs might use small chunks; legal document analysis might need larger ones with overlap.
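
A toy chunker illustrating that trade-off, using word counts as a stand-in for tokens; the default chunk_size and overlap values are arbitrary, not recommendations:

    # Naive fixed-size chunker with overlap (word-based as a stand-in for tokens).
    def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
        words = text.split()
        chunks = []
        step = chunk_size - overlap
        for start in range(0, len(words), step):
            piece = words[start:start + chunk_size]
            if piece:
                chunks.append(" ".join(piece))
        return chunks

    # Larger chunk_size keeps more context per chunk; overlap reduces the chance
    # that a relevant sentence is cut in half at a chunk boundary.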

4
Graduate Student
Advanced degree level

Advanced RAG patterns:

Hybrid search: Combining dense vector search with sparse lexical search (BM25). Dense search captures semantic similarity ("car" matches "automobile"), while sparse search excels at exact matches (product codes, names). Production systems often use both.
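
One common way to merge the dense and sparse result lists is reciprocal rank fusion, sketched below; the function name and the constant k = 60 are conventional choices rather than anything prescribed by a particular system:

    # Reciprocal rank fusion: merge two ranked lists of document ids.
    def rrf(dense_ranked: list[str], sparse_ranked: list[str], k: int = 60) -> list[str]:
        # A document's score is the sum of 1 / (k + rank) over every list it appears in.
        scores: dict[str, float] = {}
        for ranked in (dense_ranked, sparse_ranked):
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # "doc7" ranks well in both lists, so it tops the fused ranking.
    print(rrf(["doc7", "doc2", "doc9"], ["doc1", "doc7", "doc2"]))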

Query transformation: Before retrieval, transform the user query to improve results:

  • HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, then search for documents similar to that answer (see the sketch after this list)
  • Query expansion: Add synonyms or related terms
  • Query decomposition: Break complex questions into sub-questions
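
A HyDE-style sketch, with the LLM client and the vector-store search passed in as placeholders rather than tied to any particular library:

    from typing import Callable

    def hyde_retrieve(
        query: str,
        call_llm: Callable[[str], str],                  # your LLM client (assumption)
        vector_search: Callable[[str, int], list[str]],  # your vector store (assumption)
        k: int = 5,
    ) -> list[str]:
        # Generate a hypothetical answer, then search for documents similar to it.
        hypothetical = call_llm(f"Write a short passage that plausibly answers: {query}")
        # The fabricated answer usually sits closer to real answer passages in
        # embedding space than the bare question does.
        return vector_search(hypothetical, k)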

Agentic RAG: Instead of a single retrieve-then-generate step, an agent iteratively decides: "Do I need more information? Should I search a different source? Is this answer complete?" This handles complex, multi-step reasoning.
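
A rough sketch of that loop, assuming an LLM client, a search function, and a simple JSON action protocol; all three are illustrative, not a standard interface:

    import json
    from typing import Callable

    def agentic_answer(
        question: str,
        call_llm: Callable[[str], str],      # your LLM client (assumption)
        search: Callable[[str], list[str]],  # your retrieval function (assumption)
        max_steps: int = 4,
    ) -> str:
        notes: list[str] = []
        for _ in range(max_steps):
            # The model decides whether it has enough evidence or needs another search.
            decision = json.loads(call_llm(
                "You answer questions using gathered notes.\n"
                f"Question: {question}\nNotes so far: {notes}\n"
                'Reply as JSON: {"action": "search" or "answer", "content": "..."}'
            ))
            if decision["action"] == "answer":
                return decision["content"]
            notes.extend(search(decision["content"]))  # gather more evidence, then re-decide
        return call_llm(f"Answer {question!r} using only these notes: {notes}")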

Multi-modal RAG: Extending retrieval beyond text to images, tables, and structured data. Models like GPT-4V can process retrieved images; table retrieval requires special handling (text-to-SQL, structured extraction).

Evaluation metrics:

  • Retrieval: Recall@k, MRR (Mean Reciprocal Rank), NDCG (see the sketch after this list)
  • Generation: Faithfulness (does the answer match the sources?), relevance, completeness
  • End-to-end: Answer correctness, human preference scores
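
A small sketch of the first two retrieval metrics over a single query's ranked results:

    # Recall@k and MRR for one query, given the ids the retriever returned
    # and the set of ids labeled relevant in an evaluation set.
    def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
        # Fraction of the relevant documents that appear in the top-k results.
        return len(set(retrieved[:k]) & relevant) / len(relevant)

    def mrr(retrieved: list[str], relevant: set[str]) -> float:
        # Reciprocal rank of the first relevant hit (0 if none was retrieved).
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    print(recall_at_k(["d3", "d1", "d8"], {"d1", "d5"}, k=3))  # 0.5
    print(mrr(["d3", "d1", "d8"], {"d1", "d5"}))               # 0.5

In practice these values are averaged over a whole set of evaluation queries.
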
The "Lost in the Middle" Problem

Research shows LLMs pay more attention to information at the beginning and end of long contexts, neglecting the middle. This affects RAG: if the most relevant chunk lands in the middle of 10 retrieved chunks, the model might ignore it. Solutions include reranking, limiting context, or position-aware prompting.
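
One simple position-aware mitigation is to interleave the ranked chunks so the strongest ones sit at both ends of the context, leaving weaker ones in the middle; a sketch:

    def reorder_for_position(chunks_by_relevance: list[str]) -> list[str]:
        # Input is best-first; output places the best chunks at the start and end.
        front, back = [], []
        for i, chunk in enumerate(chunks_by_relevance):
            (front if i % 2 == 0 else back).append(chunk)
        return front + back[::-1]

    print(reorder_for_position(["c1", "c2", "c3", "c4", "c5"]))
    # ['c1', 'c3', 'c5', 'c4', 'c2']  -> the two most relevant chunks end up at the edges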

5
Expert
Researchers & practitioners

GraphRAG and structured retrieval:

Microsoft Research's GraphRAG (2024) constructs a knowledge graph from source documents before retrieval. For queries requiring synthesis across multiple documents ("What are the common themes in all customer complaints?"), graph traversal outperforms flat vector search. The graph captures relationships that vector similarity misses.
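
A toy illustration of the general idea (extract triples, index them as a graph, pull a query entity's neighborhood as context) using networkx; this is only a sketch of the concept, not Microsoft's GraphRAG pipeline, which layers entity extraction, community detection, and summarization on top:

    import networkx as nx

    # Facts extracted from documents as (entity, relation, entity) triples (toy data).
    graph = nx.Graph()
    triples = [
        ("Customer A", "complained about", "late delivery"),
        ("Customer B", "complained about", "late delivery"),
        ("late delivery", "caused by", "warehouse backlog"),
    ]
    for head, relation, tail in triples:
        graph.add_edge(head, tail, fact=f"{head} {relation} {tail}")

    def neighborhood_facts(entity: str, hops: int = 2) -> list[str]:
        # Collect every stored fact within `hops` edges of the entity,
        # then hand those facts to the LLM as context.
        nodes = nx.single_source_shortest_path_length(graph, entity, cutoff=hops)
        return sorted({graph[u][v]["fact"] for u, v in graph.edges(nodes)})

    print(neighborhood_facts("Customer A"))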

Fine-tuning vs. RAG trade-offs:

  • Fine-tuning: Better for style/format, faster inference, no retrieval latency. But expensive, knowledge gets stale, hard to update
  • RAG: Easy to update, attributable, handles long-tail knowledge. But adds latency, retrieval can fail, context window limits
  • Hybrid: Production systems often fine-tune a base model on domain style while using RAG for factual grounding

Retrieval-augmented pre-training (RETRO, REALM):

Instead of adding retrieval only at inference, these architectures incorporate retrieval during training. The model learns to use retrieved documents as part of its core reasoning, not just as a bolted-on feature.

Self-RAG and adaptive retrieval:

Self-RAG (2023) trains the model to decide when to retrieve (not every query needs it) and to critique its own outputs for faithfulness. This reduces unnecessary retrieval calls and improves answer quality.
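
A sketch in the spirit of Self-RAG, using plain yes/no prompts where the actual paper trains special reflection tokens; the LLM client and retriever are passed in as placeholders:

    from typing import Callable

    def adaptive_answer(
        query: str,
        call_llm: Callable[[str], str],        # your LLM client (assumption)
        retrieve: Callable[[str], list[str]],  # your retriever (assumption)
    ) -> str:
        # Step 1: decide whether outside evidence is needed at all.
        needs_retrieval = call_llm(
            f"Does answering this require looking up external facts? yes/no\n{query}"
        ).strip().lower().startswith("yes")

        context = retrieve(query) if needs_retrieval else []
        draft = call_llm(f"Context: {context}\nQuestion: {query}\nAnswer:")

        # Step 2: if we retrieved, critique the draft for faithfulness to the sources.
        if context:
            supported = call_llm(
                f"Is this answer fully supported by the context? yes/no\n"
                f"Context: {context}\nAnswer: {draft}"
            ).strip().lower().startswith("yes")
            if not supported:
                draft = call_llm(
                    f"Rewrite the answer using only the context.\n"
                    f"Context: {context}\nQuestion: {query}"
                )
        return draft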

Production considerations:

  • Latency budget: Embedding + vector search + reranking + generation adds 200-500ms per query
  • Cost: Embedding 1M documents, storing vectors, LLM inference with large contexts
  • Security: Access control — can this user see these retrieved documents? (see the filtering sketch after this list)
  • Observability: Logging retrieval quality, attribution tracking, feedback loops
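
A sketch of that access-control point, filtering retrieved chunks against the caller's group memberships before they ever reach the prompt; the metadata fields are hypothetical:

    # Drop any retrieved chunk the current user is not allowed to see.
    def authorized_chunks(user_groups: set[str], retrieved: list[dict]) -> list[dict]:
        return [
            chunk for chunk in retrieved
            if chunk["allowed_groups"] & user_groups
        ]

    retrieved = [
        {"text": "Q3 board deck", "allowed_groups": {"finance", "exec"}},
        {"text": "Public FAQ", "allowed_groups": {"everyone"}},
    ]
    print(authorized_chunks({"everyone"}, retrieved))  # only the public FAQ survives
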
Corrective RAG (CRAG)

CRAG adds a self-correction step: after retrieval, a lightweight evaluator scores document relevance. Low-confidence retrievals trigger web search fallback or query rewriting. This handles cases where the local knowledge base lacks coverage.
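
A sketch of that corrective flow, with the retriever, evaluator, web search, and query rewriter all passed in as placeholders:

    from typing import Callable

    def corrective_retrieve(
        query: str,
        retrieve: Callable[[str], list[str]],          # local vector search (assumption)
        relevance_score: Callable[[str, str], float],  # lightweight evaluator (assumption)
        web_search: Callable[[str], list[str]],        # fallback source (assumption)
        rewrite_query: Callable[[str], str],           # query rewriter (assumption)
        threshold: float = 0.5,
    ) -> list[str]:
        docs = retrieve(query)
        scores = [relevance_score(query, d) for d in docs]

        # If nothing clears the bar, assume the local knowledge base lacks coverage.
        if not scores or max(scores) < threshold:
            return web_search(rewrite_query(query))

        # Otherwise keep only the chunks the evaluator considers relevant enough.
        return [d for d, s in zip(docs, scores) if s >= threshold]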

