Retrieval-Augmented Generation (RAG)


Have you ever asked ChatGPT something like:

“Who won the IPL 2024 finals?”

…and it confidently gave you the wrong answer?

That happens because AI models, including GPT, don’t actually know everything. They’re trained on huge amounts of data, but their knowledge is frozen at the time of training. If you ask about recent events or company-specific data, they might hallucinate, meaning they produce a plausible-sounding answer that isn’t grounded in any real source.

Now imagine this instead:

  • You have your own knowledge base (a large source of information)

  • The AI first searches that knowledge base

  • Then it reads the retrieved passages as context

  • Finally, it generates a smart, relevant answer

That’s exactly what Retrieval-Augmented Generation (RAG) does.
It bridges the gap between an AI model’s training data and your real-world, up-to-date information.

Why Do We Need RAG?

Think of a library.

  • GPT is like a librarian who has read millions of books.

  • But the librarian can’t remember everything perfectly.

  • Sometimes, you want fresh information or specific documents that aren’t in their memory.

RAG acts like giving the librarian a catalog system:

  • First, they search the right shelf (retrieval)

  • Then, they summarize and explain (generation)

This makes AI:

  • More accurate

  • More reliable

  • More context-aware

  • Better suited to questions that need up-to-date knowledge

How RAG Works (Retriever + Generator)

Let’s break it into two main components:

Step 1 — Retriever 🔍

  • Think of it like Google Search for your knowledge base.

  • It finds the documents in your data source that are most relevant to your query.

  • Uses vector embeddings to compare meaning, not just keywords.

For example:

You ask: “How to install Ubuntu on Raspberry Pi?”

  • Retriever looks into your docs/wiki

  • Finds the most relevant guides

  • Sends them to the generator

Step 2 — Generator ✍️

  • This is your LLM (e.g., GPT, Claude, Gemma).

  • It reads the retrieved documents and uses them to create an accurate, human-like answer.

Example answer:

“To install Ubuntu on a Raspberry Pi, download the Ubuntu Server image, flash it using Raspberry Pi Imager, insert the SD card, and boot your Pi. Make sure to enable SSH if needed.”

Quick Example Flow

You ask: “Who is the CEO of OpenAI?”

  • Retriever: Searches your knowledge base → finds a doc saying “Sam Altman is the CEO.”

  • Generator: Reads it → gives you a natural reply:

“The current CEO of OpenAI is Sam Altman.”
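
The flow above can be sketched in a few lines of Python. This is a minimal sketch, not a production setup: the `retrieve` function here is a toy that ranks documents by shared words (a real retriever would use embeddings), and `build_prompt`, the sample `docs`, and the prompt wording are all hypothetical. The final prompt is what you would hand to whichever LLM you use as the generator.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank docs by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved passages in front of the question for the generator."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}"
    )


# Hypothetical knowledge base with two short documents.
docs = [
    "Sam Altman is the CEO of OpenAI.",
    "Ubuntu Server can be flashed with Raspberry Pi Imager.",
]
query = "Who is the CEO of OpenAI?"
retrieved = retrieve(query, docs)
prompt = build_prompt(query, retrieved)
print(prompt)  # send this string to your LLM of choice
```

Keeping retrieval and prompt building as separate functions mirrors the two components: you can swap the toy retriever for a vector search later without touching the generator side.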

What is Indexing?

Before AI can retrieve anything, we need a searchable structure. That’s where indexing comes in.

Think of indexing like building the index at the back of a book, so you can jump straight to the right page:

  • It breaks your documents into chunks

  • Converts them into vectors (we’ll get there in a sec)

  • Stores them in a vector database like Pinecone, Weaviate, or Milvus, or in a similarity-search library like FAISS

  • When you search, AI compares your query vector to these stored vectors and fetches the closest matches.
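
To make the indexing steps concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: `TinyIndex` is a hypothetical stand-in for a vector database, and `embed` produces simple word-count vectors rather than calling a learned embedding model. It only shows the shape of the idea: store a vector per chunk, then compare the query vector against all of them.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector (NOT a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyIndex:
    """Hypothetical in-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        """Index one chunk by storing it next to its vector."""
        self.entries.append((chunk, embed(chunk)))

    def search(self, query: str, top_k: int = 2) -> list[tuple[float, str]]:
        """Return the top_k (score, chunk) pairs closest to the query."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), chunk) for chunk, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


index = TinyIndex()
for chunk in [
    "Flash the Ubuntu Server image with Raspberry Pi Imager.",
    "Enable SSH before the first boot if you need remote access.",
    "Store API keys in environment variables.",
]:
    index.add(chunk)

results = index.search("How do I flash Ubuntu on a Raspberry Pi?", top_k=1)
print(results[0][1])
```

A real vector database does the same job at scale, typically with approximate nearest-neighbor search instead of scoring every stored vector.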

Why Do We Perform Vectorization?

Normal keyword search sucks for AI. Why?

  • If you search “AI laws”, a plain keyword search might skip documents that say “legal regulations for artificial intelligence” because they share no words with the query.

  • But AI needs meaning, not exact words.

That’s why we use vector embeddings:

  • We convert text → numerical vectors in a high-dimensional space.

  • Sentences with similar meaning end up closer together.

  • This makes retrieval semantic instead of keyword-based.

Example:

  • “Install Ubuntu on Pi” → Vector A

  • “Setup Raspberry Pi with Ubuntu” → Vector B

  • A & B are close in vector space → retriever understands both are related
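
Here is a small sketch of how "closeness" is usually measured: cosine similarity between two vectors. As an assumption for illustration, the vectors below are plain word counts over a shared vocabulary, not output from a real embedding model, and the helper names (`tokens`, `vectorize`, `similarity`) are hypothetical. The example sentences come from this article.

```python
import math
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text: str, vocab: list[str]) -> list[int]:
    """Turn text into a dense vector with one dimension per vocabulary word."""
    words = tokens(text)
    return [words.count(term) for term in vocab]


def cosine_similarity(a: list[int], b: list[int]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sentences = [
    "Install Ubuntu on Pi",
    "Setup Raspberry Pi with Ubuntu",
    "AI laws",
    "legal regulations for artificial intelligence",
]
# One dimension per distinct word across all sentences.
vocab = sorted({word for s in sentences for word in tokens(s)})


def similarity(left: str, right: str) -> float:
    return cosine_similarity(vectorize(left, vocab), vectorize(right, vocab))


ubuntu_score = similarity(sentences[0], sentences[1])
laws_score = similarity(sentences[2], sentences[3])
print(f"{ubuntu_score:.2f}")
print(f"{laws_score:.2f}")
```

With word counts, the two Ubuntu sentences score about 0.45 because they share "ubuntu" and "pi", while the "AI laws" pair scores 0.00 because it shares no words at all. That second result is exactly the gap learned embeddings are meant to close: a trained embedding model would be expected to place those two phrases close together.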

Why Does RAG Exist?

We created RAG because LLMs alone aren’t enough:

  • They often lack deep, domain-specific knowledge

  • They hallucinate when uncertain

  • On their own, they can’t access real-time data

  • They don’t know your internal documents, because those were never part of the training data

RAG lets you connect AI to your data safely, without retraining the whole model.
That’s why companies, chatbots, SaaS platforms, and knowledge assistants rely on RAG.

Why We Perform Chunking

Imagine dumping a 500-page PDF into ChatGPT.
It may not even fit in the model’s context window, and if it does, the model would struggle to find the relevant parts efficiently.

That’s why we split documents into smaller pieces → called chunks.

  • A common chunk size is roughly 300 to 800 tokens, though the best value depends on your documents and embedding model

  • Each chunk is indexed separately

  • This makes searching faster and more accurate
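
A minimal sketch of fixed-size chunking might look like this. One loud assumption: it counts whitespace-separated words as "tokens", while real pipelines count tokens with the model's own tokenizer, so the numbers will differ. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Words stand in for tokens here; a real pipeline would use a tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]


text = "one two three four five six seven eight nine ten"
chunks = chunk_tokens(text, chunk_size=4)
for number, chunk in enumerate(chunks, start=1):
    print(number, chunk)
```

Each chunk returned here would then be embedded and indexed on its own.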

Why Overlapping is Used in Chunking

Sometimes, the important context spans the boundary between two chunks.

Example:

  • Chunk 1 ends with: “The API key should be stored securely.”

  • Chunk 2 starts with: “Never commit secrets to GitHub.”

If we don’t overlap, AI might miss the connection between them.

That’s why we use sliding windows:

  • Each chunk shares some sentences with the previous one

  • This lowers the chance that related context gets cut off at a chunk boundary
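
Here is a minimal sliding-window sketch using the API key example from above. Two assumptions to flag: it overlaps by words rather than whole sentences (sentence-based overlap works the same way with a different splitter), and the function name `chunk_with_overlap` and the sizes used are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word chunks where each chunk repeats the last
    `overlap` words of the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "The API key should be stored securely. Never commit secrets to GitHub."
chunks = chunk_with_overlap(text, chunk_size=8, overlap=3)
for chunk in chunks:
    print(chunk)
```

With these settings the second chunk starts with "stored securely. Never", so the link between storing the key and not committing secrets survives in one chunk instead of being split across two.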

Final Thoughts

Retrieval-Augmented Generation (RAG) is like giving your AI Google + Brain Power:

  • Retriever → finds the right knowledge

  • Generator → writes smart answers

  • Indexing + Vectorization → make search semantic

  • Chunking + Overlap → make results accurate

If you’re building:

  • AI-powered chatbots 🤖

  • Document assistants

  • Knowledge search systems

  • Customer support bots

…RAG is very likely to be part of your stack.

Quick Summary

| Concept | Why It Matters |
|---|---|
| RAG | Combines retrieval + generation for accurate answers |
| Retriever | Finds the most relevant documents |
| Generator | Uses docs + LLM to create responses |
| Indexing | Stores documents in a searchable vector format |
| Vectorization | Finds meaning, not just keywords |
| Chunking | Splits large docs for faster, better search |
| Overlap | Preserves context between chunks |