πŸ” Retrieval Augmented Generation

Unit 5: The Secret Sauce of Modern AI

Building Context-Aware LLM Systems

πŸ“š What is RAG and Why Does It Matter?

The Problem with Plain LLMs

LLMs have knowledge cutoff dates and don't know about your private data

❌ Without RAG

Q: "What's in our Q3 2024 financial report?"

A: "I don't have access to your company's financial reports."

β†’

βœ… With RAG

Q: "What's in our Q3 2024 financial report?"

A: "According to your Q3 report, revenue increased 23% to $5.2M..."

🎯 RAG = Give LLMs access to external knowledge without retraining!

5.1 Introduction to RAG

🎯 What is RAG?

Retrieval Augmented Generation

Retrieve relevant information from external sources β†’
Augment the prompt with this context β†’
Generate accurate, grounded responses

Basic RAG Flow

1. User Query

"What's our vacation policy?"

β†’
2. Retrieve Relevant Documents

Search knowledge base for vacation policy

β†’
3. Augment Prompt

Context: [vacation policy text] + Question

β†’
4. Generate Answer

"Employees receive 15 days PTO annually..."
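The four steps above can be sketched end to end in plain Python. The knowledge base, the word-overlap scoring, and the prompt template below are toy stand-ins for illustration; step 4 would send the augmented prompt to a real LLM.

```python
# Illustrative sketch of the 4-step RAG flow with a toy keyword retriever.
KNOWLEDGE_BASE = [
    "Vacation policy: employees receive 15 days PTO annually.",
    "Expense policy: submit receipts within 30 days.",
    "Remote work policy: up to 3 days per week from home.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: rank documents by naive word overlap with the query."""
    q = {w.strip(".,:;?!") for w in query.lower().split()}
    def score(doc):
        d = {w.strip(".,:;?!") for w in doc.lower().split()}
        return len(q & d)
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]

def augment(query: str, contexts: list[str]) -> str:
    """Step 3: prepend retrieved context to the user's question."""
    return "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}"

query = "What's our vacation policy?"       # Step 1: user query
prompt = augment(query, retrieve(query))    # Steps 2-3
print(prompt)                               # Step 4 would send this to the LLM
```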

5.1 Introduction to RAG

πŸ’‘ Why RAG Over Fine-tuning?

| Aspect | Fine-tuning | RAG |
|---|---|---|
| Cost | 💰💰💰 High (retraining) | 💰 Low (just retrieval) |
| Update Speed | 🐌 Slow (retrain needed) | ⚡ Instant (update docs) |
| Knowledge Source | Baked into weights | External, auditable |
| Accuracy | Can hallucinate | Grounded in sources |
| Transparency | Black box | Can cite sources |
| Best For | Behavior/style changes | Knowledge/facts |

πŸ’‘ Best of Both Worlds: Fine-tune for behavior + RAG for knowledge = powerful combination!

5.2 Understanding Tokens

πŸ”€ What Are Tokens?

Definition

Tokens = The basic units LLMs process. Not exactly wordsβ€”subword pieces!

Tokenization Examples

Text: "Hello, how are you?"

Tokens: ["Hello", ",", " how", " are", " you", "?"]

Count: 6 tokens

Text: "Supercalifragilisticexpialidocious"

Tokens: ["Super", "cal", "ifrag", "ilistic", "exp", "ial", "idoc", "ious"]

Count: 8 tokens (long word = more tokens!)

πŸ’‘ Rule of Thumb: 1 token β‰ˆ 0.75 words in English. 100 words β‰ˆ 133 tokens

5.2 Understanding Tokens

⚠️ Token Limitations & Context Windows

Context Window Sizes

| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-3.5-16k | 16,384 tokens | ~12,000 words |
| GPT-4 | 8,192 tokens | ~6,000 words |
| GPT-4-32k | 32,768 tokens | ~24,000 words |
| Claude 3 | 200,000 tokens | ~150,000 words |

The RAG Problem

Your company knowledge base = 10,000 documents = millions of tokens!

Solution: Retrieve only the MOST RELEVANT chunks, not everything

5.3 Vector Embeddings

🎯 What Are Vector Embeddings?

Definition

Embeddings = Numerical representations of text that capture semantic meaning

From Text to Numbers

Text: "The cat sat on the mat"

Embedding (1536 dimensions):

[0.023, -0.145, 0.678, 0.234, ..., 0.892]

A vector of 1536 numbers!

Text: "A feline rested on the rug"

Embedding:

[0.028, -0.142, 0.681, 0.229, ..., 0.895]

Similar meaning = similar vectors!

🎯 Magic: Sentences with similar meanings have similar embeddings, even with different words!

5.3 Vector Embeddings

🧠 How Embeddings Capture Meaning

Semantic Similarity in Vector Space

"king" is close to "queen", "monarch"

"dog" is close to "cat", "puppy"

"king" - "man" + "woman" β‰ˆ "queen"

Generating Embeddings

from langchain.embeddings import OpenAIEmbeddings

# Initialize embeddings model
embeddings = OpenAIEmbeddings()

# Generate embedding for text
text = "Machine learning is amazing"
vector = embeddings.embed_query(text)

print(len(vector))  # 1536 dimensions
print(vector[:5])   # e.g. [0.023, -0.145, 0.678, 0.234, -0.012]

# Embed multiple documents at once
docs = ["doc1", "doc2", "doc3"]
doc_vectors = embeddings.embed_documents(docs)
5.3 Vector Embeddings

πŸ€– Popular Embedding Models

| Model | Dimensions | Provider | Best For |
|---|---|---|---|
| text-embedding-ada-002 | 1536 | OpenAI | General purpose, most popular |
| text-embedding-3-small | 1536 | OpenAI | Faster, cheaper |
| text-embedding-3-large | 3072 | OpenAI | Best quality |
| all-MiniLM-L6-v2 | 384 | Open source | Fast, free, local |
| BGE-large | 1024 | Open source | High quality, free |

πŸ’‘ Cost Consideration: OpenAI embeddings cost ~$0.0001 per 1K tokens. For 1M documents, that's ~$100!

5.4 Vector Databases

πŸ—„οΈ What Are Vector Databases?

Definition

Specialized databases optimized for storing and searching high-dimensional vectors

Traditional Database

Search: "Find rows where name = 'John'"

β†’ Exact match

VS

Vector Database

Search: "Find vectors similar to [0.23, 0.45, ...]"

β†’ Semantic similarity

Why We Need Them

⚑ Speed

Find similar vectors among millions in milliseconds

πŸ“ Similarity

Built-in distance calculations (cosine, euclidean)

πŸ—οΈ Scalability

Handle billions of vectors efficiently

5.4 Vector Databases

πŸ—„οΈ Popular Vector Databases

Pinecone

  • Type: Managed cloud
  • Pros: Easy setup, scalable, reliable
  • Cons: Costs money, vendor lock-in
  • Best for: Production apps

Chroma

  • Type: Open source, local
  • Pros: Free, easy, LangChain integration
  • Cons: Not for massive scale
  • Best for: Development, prototypes

Weaviate

  • Type: Open source + cloud
  • Pros: Feature-rich, GraphQL API
  • Cons: Complex setup
  • Best for: Enterprise apps

FAISS

  • Type: Library (Meta)
  • Pros: Very fast, battle-tested
  • Cons: Just a library, not full DB
  • Best for: Custom solutions

πŸ’‘ Recommendation: Start with Chroma (free, easy). Scale to Pinecone for production.

5.4 Vector Databases

πŸ”§ Using Chroma (Hands-on)

# Install: pip install chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# 1. Create embeddings model
embeddings = OpenAIEmbeddings()

# 2. Sample documents
docs = [
    "Paris is the capital of France",
    "Python is a programming language",
    "The Eiffel Tower is in Paris",
    "Machine learning uses Python"
]

# 3. Create vector store from documents
vectorstore = Chroma.from_texts(
    texts=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk
)

# 4. Search for similar documents
query = "Tell me about France"
results = vectorstore.similarity_search(query, k=2)

for doc in results:
    print(doc.page_content)

Output:

Paris is the capital of France
The Eiffel Tower is in Paris

5.5 Chunking Strategies

βœ‚οΈ Why Chunking Matters

The Chunking Problem

Documents are too large to fit in context windows. We must split them into chunks!

❌ Bad Chunking

Chunk 1: "The company was founded in"
Chunk 2: "1998 by two engineers who"

Sentence split! Context lost! 😱

VS

βœ… Good Chunking

Chunk 1: "The company was founded in 1998 by two engineers who wanted to..."

Complete thought preserved! ✨

Chunking Trade-offs

  • Too small: Not enough context, many API calls
  • Too large: Irrelevant info, expensive tokens
  • Just right: Goldilocks zoneβ€”meaningful context, efficient retrieval
5.5 Chunking Strategies

🎯 Chunking Strategies

1. Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_text(text)

βœ… Simple, predictable
❌ May split sentences

2. Recursive Splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(text)

βœ… Preserves structure
βœ… Most recommended

3. Semantic Chunking

# SemanticChunker lives in langchain_experimental
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings()
)
chunks = splitter.split_text(text)

βœ… Topic-based splits
❌ Slower, more expensive

4. Document-Specific

  • Markdown: Split by headers
  • Code: Split by functions
  • HTML: Split by sections

βœ… Respects structure
βœ… Domain-specific

5.5 Chunking & Metadata

🏷️ Adding Metadata to Chunks

Why Metadata?

Attach additional information to chunks for better filtering and context

from langchain.schema import Document
from langchain.vectorstores import Chroma

# Create document with metadata
doc = Document(
    page_content="Paris is the capital of France",
    metadata={
        "source": "geography.pdf",
        "page": 42,
        "author": "John Smith",
        "date": "2024-01-15",
        "category": "geography"
    }
)

# Store in vector DB
vectorstore = Chroma.from_documents([doc], embeddings)

# Filter by metadata during search
results = vectorstore.similarity_search(
    query="capitals",
    filter={"category": "geography"}
)

✨ Use Cases: Filter by date, source, author, department, security level, etc.

5.6 Similarity Search

πŸ” How Similarity Search Works

The Core Concept

Find vectors that are "closest" to the query vector in high-dimensional space

Distance Metrics

Cosine Similarity

similarity = cos(ΞΈ)
Range: [-1, 1]
1 = identical
0 = orthogonal
-1 = opposite

βœ… Most common for text

Euclidean Distance

d = √(Σ(a_i - b_i)²)
Range: [0, ∞]
0 = identical
larger = farther

βœ… Intuitive "distance"

Dot Product

dp = Ξ£(a_i Γ— b_i)
Range: [-∞, ∞]
higher = more similar

βœ… Fast to compute

πŸ’‘ Default: Cosine similarity works best for most RAG applications!

5.6 Similarity Search

🎯 Retrieval Methods in LangChain

from langchain.vectorstores import Chroma

# Assume vectorstore is already created

# 1. Basic similarity search (top k)
results = vectorstore.similarity_search(
    query="What is machine learning?",
    k=3  # Return top 3 most similar
)

# 2. Similarity search with scores
results = vectorstore.similarity_search_with_score(
    query="What is machine learning?",
    k=3
)
for doc, score in results:
    print(f"Score: {score:.3f} - {doc.page_content}")

# 3. MMR (Maximum Marginal Relevance) - diverse results
results = vectorstore.max_marginal_relevance_search(
    query="machine learning",
    k=3,
    fetch_k=10,  # Fetch 10, return 3 diverse ones
    lambda_mult=0.5  # Balance relevance vs diversity
)

# 4. With metadata filtering
results = vectorstore.similarity_search(
    query="policy updates",
    k=3,
    filter={"year": 2024}
)
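The MMR option above greedily picks documents that are relevant to the query but dissimilar to documents already selected, weighted by `lambda_mult`. A plain-Python sketch of that selection loop, operating on precomputed similarity scores for illustration:

```python
# Sketch of Maximum Marginal Relevance selection.
# query_sims[i]: similarity of doc i to the query.
# doc_sims[i][j]: similarity between docs i and j (e.g. cosine).
def mmr_select(query_sims, doc_sims, k, lambda_mult=0.5):
    selected: list[int] = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already picked.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR picks 0, then skips 1 for the
# distinct doc 2 even though doc 1 is more relevant in isolation.
query_sims = [0.9, 0.88, 0.7]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]
```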
5.7 Hybrid Search Methods

πŸ”€ Hybrid Search: Best of Both Worlds

What is Hybrid Search?

Combine semantic search (embeddings) + keyword search (BM25) for better results

Semantic Search Only

Query: "Python programming"

Finds: "Coding in Python", "Programming languages"

βœ… Understands meaning
❌ Might miss exact term matches

+

Keyword Search (BM25)

Query: "Python programming"

Finds: Exact word matches of "Python" and "programming"

βœ… Precise term matching
❌ Misses synonyms

🎯 Hybrid = Semantic understanding + exact keyword matching = best results!
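A common way to merge the two ranked lists is weighted Reciprocal Rank Fusion (the method LangChain's EnsembleRetriever uses under the hood, to the best of my knowledge): each document scores `weight / (rank_constant + rank)` per list it appears in, and the scores are summed. A minimal sketch:

```python
# Sketch of weighted Reciprocal Rank Fusion (RRF) over two ranked lists.
def rrf_fuse(ranked_lists, weights, rank_constant=60):
    scores: dict[str, float] = {}
    for docs, w in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["coding in python", "programming languages", "snake facts"]
keyword  = ["python programming", "coding in python", "monty python"]

# A doc ranked by BOTH retrievers ("coding in python") outscores docs
# that appear in only one list.
print(rrf_fuse([semantic, keyword], weights=[0.5, 0.5]))
```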

5.7 Hybrid Search Methods

πŸ”§ Implementing Hybrid Search

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma

# 1. Semantic retriever (vector search)
vectorstore = Chroma.from_documents(docs, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 2. Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3

# 3. Combine them with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.5, 0.5]  # 50% semantic, 50% keyword
)

# 4. Use hybrid retriever
results = ensemble_retriever.get_relevant_documents(
    "What is the company vacation policy?"
)

for doc in results:
    print(doc.page_content)

πŸ’‘ Tuning Weights: Try 0.7/0.3 or 0.6/0.4. Test what works best for your data!

5.8 Re-ranking Techniques

🎯 Re-ranking: The Secret Weapon

The Problem

Initial retrieval might return 100 documents, but only top 3-5 go to LLM. Order matters!

Re-ranking Flow

Step 1: Initial Retrieval

Fetch 100 potentially relevant documents (fast, rough)

Step 2: Re-rank

Score and re-order the 100 documents (slower, precise)

Step 3: Return Top K

Send top 3-5 best documents to LLM

🎯 Result: Better accuracy without sending 100 documents to expensive LLM!
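The three steps can be sketched as a small retrieve-then-rerank pipeline. Both scoring functions below are toy stand-ins: a real system would use vector search for the first pass and a cross-encoder model for the re-ranker, and the term weights are purely hypothetical.

```python
# Sketch of two-stage retrieval: cheap first pass, precise second pass.
def first_pass(query, corpus, n=100):
    """Step 1 (fast, rough): rank by word overlap, keep top n candidates."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:n]

def rerank(query, candidates, scorer, k=3):
    """Steps 2-3 (slower, precise): re-score pairs, return top k."""
    return sorted(candidates, key=lambda d: scorer(query, d), reverse=True)[:k]

corpus = [
    "vacation policy grants 15 days pto",
    "the office dog naps all day",
    "pto accrual rules for new employees",
]

# Toy pairwise scorer with made-up term weights (stand-in for a cross-encoder).
weights = {"vacation": 2.0, "pto": 2.0, "policy": 1.5}
def scorer(query, doc):
    shared = set(query.lower().split()) & set(doc.lower().split())
    return sum(weights.get(w, 0.1) for w in shared)

top = rerank("vacation pto policy", first_pass("vacation pto policy", corpus),
             scorer, k=1)
print(top)
```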

5.8 Re-ranking Techniques

πŸ”§ Re-ranking Methods

1. Cross-Encoder Models

Models specifically trained to score query-document pairs

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Using Cohere's reranker
compressor = CohereRerank()
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

2. LLM-based Re-ranking

Use LLM to score relevance

from langchain.retrievers.document_compressors import LLMChainFilter

# LLM judges relevance
compressor = LLMChainFilter.from_llm(llm)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

⚠️ Trade-offs

  • Cross-Encoder: Faster, cheaper, but needs separate API
  • LLM Re-ranker: No extra API, but slower and more expensive
5.9 Evaluation Metrics

πŸ“Š Why Evaluate RAG Systems?

The Challenge

How do you know if your RAG system is actually working well?

❌ Without Evaluation

  • Guessing if it works
  • Can't compare approaches
  • No way to improve systematically
  • Production surprises

βœ… With Evaluation

  • Quantify performance
  • A/B test changes
  • Track improvements
  • Confidence in production

Two Types of Metrics

Retrieval Metrics

Did we retrieve the RIGHT documents?

Generation Metrics

Did the LLM generate a GOOD answer?

5.9 Evaluation Metrics

🎯 Retrieval Metrics

Context Precision & Recall

Context Precision: Of retrieved docs, how many are relevant?

Precision = (Relevant Retrieved) / (Total Retrieved)

Context Recall: Of all relevant docs, how many did we retrieve?

Recall = (Relevant Retrieved) / (Total Relevant)

Example

Query: "What's the vacation policy?"

Total relevant documents in DB: 5

Retrieved: 3 documents

Relevant among retrieved: 2 documents

Precision = 2/3 = 0.67 (67% of retrieved were relevant)

Recall = 2/5 = 0.40 (40% of relevant docs were found)
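The arithmetic above is just set intersection over retrieved and relevant document IDs:

```python
# Precision/recall over retrieved vs. relevant document sets.
def precision_recall(retrieved: set, relevant: set):
    tp = len(retrieved & relevant)          # relevant docs we actually found
    return tp / len(retrieved), tp / len(relevant)

relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}  # 5 relevant docs in DB
retrieved = {"doc1", "doc2", "doc9"}                 # retrieved 3, only 2 relevant

p, r = precision_recall(retrieved, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")  # Precision = 0.67, Recall = 0.40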

5.9 Evaluation Metrics

πŸ“ Generation Metrics

| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 (1 = fully grounded) |
| Answer Relevance | Does the answer address the question? | 0-1 (1 = completely relevant) |
| Context Relevance | Is the retrieved context relevant to query? | 0-1 (1 = highly relevant) |
| BLEU Score | N-gram overlap with reference answer | 0-1 (1 = perfect match) |
| ROUGE Score | Recall-oriented overlap with reference | 0-1 (1 = perfect match) |

⚠️ Important: BLEU/ROUGE need reference answers. Faithfulness/Relevance can be computed automatically using LLMs!
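To make the overlap idea concrete, here is a minimal ROUGE-1 recall (fraction of reference unigrams that appear in the candidate). Real implementations such as the rouge-score package also handle stemming, ROUGE-2, and ROUGE-L.

```python
# Minimal ROUGE-1 recall: reference-unigram overlap with the candidate.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)  # clipped unigram matches
    return overlap / sum(ref.values())

# Only "15" and "days" overlap with the 8-word reference -> 2/8 = 0.25.
print(rouge1_recall(
    "Employees get 15 days PTO annually",
    "The company offers 15 days paid time off",
))
```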

5.9 Evaluation Metrics

πŸ”§ Evaluating RAG with RAGAS

RAGAS: RAG Assessment Framework

# Install: pip install ragas
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Your RAG pipeline results
data = {
    "question": ["What is the vacation policy?"],
    "answer": ["Employees get 15 days PTO annually"],
    "contexts": [["Vacation policy: 15 days...", "PTO accrues..."]],
    "ground_truth": ["The company offers 15 days paid time off"]
}

# Evaluate (ragas expects a HuggingFace Dataset)
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)

Output:

faithfulness: 0.95
answer_relevancy: 0.92
context_precision: 0.88
context_recall: 0.85

πŸ—οΈ Building a Complete RAG System

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Load documents
loader = PyPDFLoader("company_docs.pdf")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Ask questions!
answer = qa_chain.run("What is the vacation policy?")
print(answer)

πŸš€ Advanced RAG Patterns

1. Multi-Query RAG

Generate multiple versions of the query for better retrieval

from langchain.retrievers import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI()
)
# Generates 3-5 variations of query

2. Parent Document Retriever

Retrieve small chunks, but return larger parent documents

from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=small_splitter,
    parent_splitter=large_splitter
)

3. Self-Query Retriever

Extract metadata filters from natural language

from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(),
    vectorstore=vectorstore,
    document_content_description="docs",
    metadata_field_info=metadata_info
)

4. Contextual Compression

Compress retrieved docs to only relevant parts

from langchain.retrievers import ContextualCompressionRetriever

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

✨ RAG Best Practices

πŸ“ Document Preparation

  • Clean your data: Remove noise, formatting issues
  • Add metadata: Source, date, category, author
  • Test chunking: Experiment with sizes (500-1500 tokens)
  • Maintain structure: Don't break paragraphs/sections

πŸ” Retrieval Optimization

  • Start with k=3-5: Don't retrieve too many
  • Use hybrid search: Semantic + keyword
  • Add re-ranking: Improves top results
  • Filter by metadata: When possible

πŸ€– Generation Quality

  • System prompts: "Answer based only on context"
  • Cite sources: Include document references
  • Handle no-answer: "I don't have information about..."
  • Temperature=0: For factual accuracy

πŸ“Š Evaluation & Monitoring

  • Create test set: 50-100 Q&A pairs
  • Track metrics: Faithfulness, relevance, precision
  • A/B test changes: Before deploying
  • User feedback: Thumbs up/down

⚠️ Common RAG Challenges & Solutions

| Challenge | Symptom | Solution |
|---|---|---|
| Hallucination | LLM makes up information | Lower temperature, stronger system prompt, check faithfulness |
| Irrelevant Retrieval | Wrong documents retrieved | Better chunking, hybrid search, re-ranking, metadata filters |
| Context Overflow | Too many tokens in context | Reduce k, use compression, summarize chunks |
| Poor Answers | Answers lack detail/accuracy | Improve prompts, increase k, check chunk quality |
| Slow Performance | High latency | Cache embeddings, use faster models, optimize DB |
| High Costs | Expensive API bills | Cache results, use smaller embeddings, optimize k |

🌍 Real-World RAG Applications

πŸ“š Enterprise Knowledge Base

Use Case: Internal documentation Q&A

  • Confluence, SharePoint, Google Drive docs
  • IT policies, HR handbooks, procedures
  • Can substantially reduce support ticket volume

🎧 Customer Support

Use Case: Product documentation assistant

  • Answer product questions instantly
  • Cite sources from manuals
  • 24/7 availability

βš–οΈ Legal Research

Use Case: Case law search

  • Search through thousands of cases
  • Find relevant precedents
  • Saves hours of manual research

πŸ₯ Medical Information

Use Case: Clinical guidelines

  • Query medical literature
  • Evidence-based recommendations
  • Always cite research sources

🎯 Key Takeaways

πŸ” Core Concepts

  • RAG = Retrieve + Augment + Generate
  • Embeddings capture semantic meaning
  • Vector DBs enable similarity search
  • Chunking strategy matters!

πŸ› οΈ Technical Skills

  • Build RAG pipelines with LangChain
  • Use Chroma/Pinecone for storage
  • Implement hybrid search
  • Apply re-ranking techniques

πŸ“Š Quality & Evaluation

  • Measure precision, recall, faithfulness
  • Use RAGAS for evaluation
  • A/B test improvements
  • Monitor in production

πŸ”₯ The RAG Revolution

RAG lets LLMs access ANY knowledge without retraining

This is why ChatGPT plugins, Claude Projects, and most AI products use RAG under the hood!

πŸ“ Homework Assignment

Assignment: Build a Production RAG System

Due: Next class

Project: Domain-Specific Q&A System

Requirements (100 points):

  1. Data Collection (15 pts):

    Gather 10+ documents in your chosen domain (PDFs, websites, etc.)

  2. RAG Pipeline (40 pts):
    • Document loading and preprocessing
    • Smart chunking with overlap
    • Vector store implementation (Chroma or Pinecone)
    • Retrieval with k=3-5
    • QA chain with citations
  3. Advanced Features (25 pts):

    Implement at least 2 of: Hybrid search, Re-ranking, Metadata filtering, Multi-query, Compression

  4. Evaluation (20 pts):

    Create 10 test Q&A pairs, compute RAGAS metrics, analyze results

πŸ“¦ Assignment Deliverables

1. Code (GitHub)

  • Complete RAG implementation
  • Clean, commented code
  • requirements.txt
  • README with setup instructions

2. Documentation

  • Domain and data sources
  • Architecture decisions
  • Chunking strategy rationale
  • Advanced features explanation

3. Demo Video (5 min)

  • Show system in action
  • Ask 5+ questions
  • Demonstrate advanced features
  • Show source citations

4. Evaluation Report

  • Test questions and answers
  • RAGAS metric scores
  • Analysis of results
  • Improvement suggestions

πŸ’‘ Domain Suggestions: Tech documentation, legal documents, medical info, academic papers, company policies, product manuals, news articles, research papers

πŸ“š Resources & Next Steps

πŸ“– Essential Reading

  • Lewis et al. (2020): Original RAG paper
  • LangChain RAG Docs: Comprehensive guide
  • Pinecone Learning Center: RAG tutorials
  • RAGAS Documentation: Evaluation framework

πŸ› οΈ Tools & Libraries

  • LangChain: RAG framework
  • LlamaIndex: Alternative framework
  • Chroma: Vector database
  • RAGAS: Evaluation metrics

πŸ“… Coming Next

βš–οΈ Unit 6: Ethical and Responsible AI

  • Bias and fairness in AI systems
  • Privacy and data protection
  • Model misuse and safeguards
  • Responsible AI practices

❓ Questions?

Let's Discuss!

Any questions about:

  • RAG architecture and workflow?
  • Vector embeddings and databases?
  • Chunking strategies?
  • Hybrid search and re-ranking?
  • Evaluation metrics?
  • Your project ideas?

Thank You! πŸŽ‰

You now understand RAG systems!

The foundation of modern AI applications

πŸ“§ Questions? Reach out anytime!

πŸ’» Start building your RAG system!

πŸ” Experiment with different techniques!

βš–οΈ Next: Ethics in AI - crucial for responsible development!
