πŸ” Retrieval Augmented Generation

Unit 5: The Secret Sauce of Modern AI

Building Context-Aware LLM Systems

πŸ“š What is RAG and Why Does It Matter?

The Problem with Plain LLMs

LLMs have knowledge cutoff dates and don't know about your private data

❌ Without RAG

Q: "What's in our Q3 2024 financial report?"

A: "I don't have access to your company's financial reports."

β†’

βœ… With RAG

Q: "What's in our Q3 2024 financial report?"

A: "According to your Q3 report, revenue increased 23% to $5.2M..."

🎯 RAG = Give LLMs access to external knowledge without retraining!

5.1 Introduction to RAG

🎯 What is RAG?

Retrieval Augmented Generation

Retrieve relevant information from external sources β†’
Augment the prompt with this context β†’
Generate accurate, grounded responses

Basic RAG Flow

1. User Query

"What's our vacation policy?"

β†’
2. Retrieve Relevant Documents

Search knowledge base for vacation policy

β†’
3. Augment Prompt

Context: [vacation policy text] + Question

β†’
4. Generate Answer

"Employees receive 15 days PTO annually..."
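The four steps above can be sketched end to end in plain Python. The knowledge base, the word-overlap scoring, and the prompt template below are toy stand-ins for illustration; step 4 would send the augmented prompt to a real LLM.

```python
# Illustrative sketch of the 4-step RAG flow with a toy keyword retriever.
KNOWLEDGE_BASE = [
    "Vacation policy: employees receive 15 days PTO annually.",
    "Expense policy: submit receipts within 30 days.",
    "Remote work policy: up to 3 days per week from home.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Step 2: rank documents by naive word overlap with the query."""
    q = {w.strip(".,:;?!") for w in query.lower().split()}
    def score(doc):
        d = {w.strip(".,:;?!") for w in doc.lower().split()}
        return len(q & d)
    return sorted(KNOWLEDGE_BASE, key=score, reverse=True)[:k]

def augment(query: str, contexts: list[str]) -> str:
    """Step 3: prepend retrieved context to the user's question."""
    return "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}"

query = "What's our vacation policy?"       # Step 1: user query
prompt = augment(query, retrieve(query))    # Steps 2-3
print(prompt)                               # Step 4 would send this to the LLM
```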

5.1 Introduction to RAG

πŸ’‘ Why RAG Over Fine-tuning?

| Aspect | Fine-tuning | RAG |
|---|---|---|
| Cost | 💰💰💰 High (retraining) | 💰 Low (just retrieval) |
| Update Speed | 🐌 Slow (retrain needed) | ⚡ Instant (update docs) |
| Knowledge Source | Baked into weights | External, auditable |
| Accuracy | Can hallucinate | Grounded in sources |
| Transparency | Black box | Can cite sources |
| Best For | Behavior/style changes | Knowledge/facts |

πŸ’‘ Best of Both Worlds: Fine-tune for behavior + RAG for knowledge = powerful combination!

5.2 Understanding Tokens

πŸ”€ What Are Tokens?

Definition

Tokens = The basic units LLMs process. Not exactly wordsβ€”subword pieces!

Tokenization Examples

Text: "Hello, how are you?"

Tokens: ["Hello", ",", " how", " are", " you", "?"]

Count: 6 tokens

Text: "Supercalifragilisticexpialidocious"

Tokens: ["Super", "cal", "ifrag", "ilistic", "exp", "ial", "idoc", "ious"]

Count: 8 tokens (long word = more tokens!)

πŸ’‘ Rule of Thumb: 1 token β‰ˆ 0.75 words in English. 100 words β‰ˆ 133 tokens

5.2 Understanding Tokens

⚠️ Token Limitations & Context Windows

Context Window Sizes

| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-3.5-16k | 16,384 tokens | ~12,000 words |
| GPT-4 | 8,192 tokens | ~6,000 words |
| GPT-4-32k | 32,768 tokens | ~24,000 words |
| Claude 3 | 200,000 tokens | ~150,000 words |

The RAG Problem

Your company knowledge base = 10,000 documents = millions of tokens!

Solution: Retrieve only the MOST RELEVANT chunks, not everything

5.3 Vector Embeddings

🎯 What Are Vector Embeddings?

Definition

Embeddings = Numerical representations of text that capture semantic meaning

From Text to Numbers

Text: "The cat sat on the mat"

Embedding (1536 dimensions):

[0.023, -0.145, 0.678, 0.234, ..., 0.892]

A vector of 1536 numbers!

Text: "A feline rested on the rug"

Embedding:

[0.028, -0.142, 0.681, 0.229, ..., 0.895]

Similar meaning = similar vectors!

🎯 Magic: Sentences with similar meanings have similar embeddings, even with different words!

5.3 Vector Embeddings

🧠 How Embeddings Capture Meaning

Semantic Similarity in Vector Space

"king" is close to "queen", "monarch"

"dog" is close to "cat", "puppy"

"king" - "man" + "woman" β‰ˆ "queen"

Generating Embeddings

from langchain.embeddings import OpenAIEmbeddings

# Initialize embeddings model
embeddings = OpenAIEmbeddings()

# Generate embedding for text
text = "Machine learning is amazing"
vector = embeddings.embed_query(text)

print(len(vector))  # 1536 dimensions
print(vector[:5])   # e.g. [0.023, -0.145, 0.678, 0.234, -0.012]

# Embed multiple documents at once
docs = ["doc1", "doc2", "doc3"]
doc_vectors = embeddings.embed_documents(docs)
5.3 Vector Embeddings

πŸ€– Popular Embedding Models

| Model | Dimensions | Provider | Best For |
|---|---|---|---|
| text-embedding-ada-002 | 1536 | OpenAI | General purpose, most popular |
| text-embedding-3-small | 1536 | OpenAI | Faster, cheaper |
| text-embedding-3-large | 3072 | OpenAI | Best quality |
| all-MiniLM-L6-v2 | 384 | Open source | Fast, free, local |
| BGE-large | 1024 | Open source | High quality, free |

πŸ’‘ Cost Consideration: OpenAI embeddings cost ~$0.0001 per 1K tokens. For 1M documents, that's ~$100!

5.4 Vector Databases

πŸ—„οΈ What Are Vector Databases?

Definition

Specialized databases optimized for storing and searching high-dimensional vectors

Traditional Database

Search: "Find rows where name = 'John'"

β†’ Exact match

VS

Vector Database

Search: "Find vectors similar to [0.23, 0.45, ...]"

β†’ Semantic similarity

Why We Need Them

⚑ Speed

Find similar vectors among millions in milliseconds

πŸ“ Similarity

Built-in distance calculations (cosine, euclidean)

πŸ—οΈ Scalability

Handle billions of vectors efficiently

5.4 Vector Databases

πŸ—„οΈ Popular Vector Databases

Pinecone

  • Type: Managed cloud
  • Pros: Easy setup, scalable, reliable
  • Cons: Costs money, vendor lock-in
  • Best for: Production apps

Chroma

  • Type: Open source, local
  • Pros: Free, easy, LangChain integration
  • Cons: Not for massive scale
  • Best for: Development, prototypes

Weaviate

  • Type: Open source + cloud
  • Pros: Feature-rich, GraphQL API
  • Cons: Complex setup
  • Best for: Enterprise apps

FAISS

  • Type: Library (Meta)
  • Pros: Very fast, battle-tested
  • Cons: Just a library, not full DB
  • Best for: Custom solutions

πŸ’‘ Recommendation: Start with Chroma (free, easy). Scale to Pinecone for production.

5.4 Vector Databases

πŸ”§ Using Chroma (Hands-on)

# Install: pip install chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# 1. Create embeddings model
embeddings = OpenAIEmbeddings()

# 2. Sample documents
docs = [
    "Paris is the capital of France",
    "Python is a programming language",
    "The Eiffel Tower is in Paris",
    "Machine learning uses Python"
]

# 3. Create vector store from documents
vectorstore = Chroma.from_texts(
    texts=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk
)

# 4. Search for similar documents
query = "Tell me about France"
results = vectorstore.similarity_search(query, k=2)

for doc in results:
    print(doc.page_content)

Output:

Paris is the capital of France
The Eiffel Tower is in Paris

5.5 Chunking Strategies

βœ‚οΈ Why Chunking Matters

The Chunking Problem

Documents are too large to fit in context windows. We must split them into chunks!

❌ Bad Chunking

Chunk 1: "The company was founded in"
Chunk 2: "1998 by two engineers who"

Sentence split! Context lost! 😱

VS

βœ… Good Chunking

Chunk 1: "The company was founded in 1998 by two engineers who wanted to..."

Complete thought preserved! ✨

Chunking Trade-offs

  • Too small: Not enough context, many API calls
  • Too large: Irrelevant info, expensive tokens
  • Just right: Goldilocks zoneβ€”meaningful context, efficient retrieval
5.5 Chunking Strategies

🎯 Chunking Strategies

1. Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_text(text)

βœ… Simple, predictable
❌ May split sentences

2. Recursive Splitting

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(text)

βœ… Preserves structure
βœ… Most recommended

3. Semantic Chunking

# SemanticChunker lives in langchain_experimental
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

splitter = SemanticChunker(
    embeddings=OpenAIEmbeddings()
)
chunks = splitter.split_text(text)

βœ… Topic-based splits
❌ Slower, more expensive

4. Document-Specific

  • Markdown: Split by headers
  • Code: Split by functions
  • HTML: Split by sections

βœ… Respects structure
βœ… Domain-specific

5.5 Chunking & Metadata

🏷️ Adding Metadata to Chunks

Why Metadata?

Attach additional information to chunks for better filtering and context

from langchain.schema import Document
from langchain.vectorstores import Chroma

# Create document with metadata
doc = Document(
    page_content="Paris is the capital of France",
    metadata={
        "source": "geography.pdf",
        "page": 42,
        "author": "John Smith",
        "date": "2024-01-15",
        "category": "geography"
    }
)

# Store in vector DB
vectorstore = Chroma.from_documents([doc], embeddings)

# Filter by metadata during search
results = vectorstore.similarity_search(
    query="capitals",
    filter={"category": "geography"}
)

✨ Use Cases: Filter by date, source, author, department, security level, etc.

5.6 Similarity Search

πŸ” How Similarity Search Works

The Core Concept

Find vectors that are "closest" to the query vector in high-dimensional space

Distance Metrics

Cosine Similarity

similarity = cos(ΞΈ)
Range: [-1, 1]
1 = identical
0 = orthogonal
-1 = opposite

βœ… Most common for text

Euclidean Distance

d = √(Σ(a_i - b_i)²)
Range: [0, ∞]
0 = identical
larger = farther

βœ… Intuitive "distance"

Dot Product

dp = Ξ£(a_i Γ— b_i)
Range: [-∞, ∞]
higher = more similar

βœ… Fast to compute

πŸ’‘ Default: Cosine similarity works best for most RAG applications!

5.6 Similarity Search

🎯 Retrieval Methods in LangChain

from langchain.vectorstores import Chroma

# Assume vectorstore is already created

# 1. Basic similarity search (top k)
results = vectorstore.similarity_search(
    query="What is machine learning?",
    k=3  # Return top 3 most similar
)

# 2. Similarity search with scores
results = vectorstore.similarity_search_with_score(
    query="What is machine learning?",
    k=3
)
for doc, score in results:
    print(f"Score: {score:.3f} - {doc.page_content}")

# 3. MMR (Maximum Marginal Relevance) - diverse results
results = vectorstore.max_marginal_relevance_search(
    query="machine learning",
    k=3,
    fetch_k=10,  # Fetch 10, return 3 diverse ones
    lambda_mult=0.5  # Balance relevance vs diversity
)

# 4. With metadata filtering
results = vectorstore.similarity_search(
    query="policy updates",
    k=3,
    filter={"year": 2024}
)
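The MMR option above greedily picks documents that are relevant to the query but dissimilar to documents already selected, weighted by `lambda_mult`. A plain-Python sketch of that selection loop, operating on precomputed similarity scores for illustration:

```python
# Sketch of Maximum Marginal Relevance selection.
# query_sims[i]: similarity of doc i to the query.
# doc_sims[i][j]: similarity between docs i and j (e.g. cosine).
def mmr_select(query_sims, doc_sims, k, lambda_mult=0.5):
    selected: list[int] = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to anything already picked.
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; MMR picks 0, then skips 1 for the
# distinct doc 2 even though doc 1 is more relevant in isolation.
query_sims = [0.9, 0.88, 0.7]
doc_sims = [[1.0, 0.95, 0.1],
            [0.95, 1.0, 0.1],
            [0.1, 0.1, 1.0]]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]
```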
5.7 Hybrid Search Methods

πŸ”€ Hybrid Search: Best of Both Worlds

What is Hybrid Search?

Combine semantic search (embeddings) + keyword search (BM25) for better results

Semantic Search Only

Query: "Python programming"

Finds: "Coding in Python", "Programming languages"

βœ… Understands meaning
❌ Might miss exact term matches

+

Keyword Search (BM25)

Query: "Python programming"

Finds: Exact word matches of "Python" and "programming"

βœ… Precise term matching
❌ Misses synonyms

🎯 Hybrid = Semantic understanding + exact keyword matching = best results!
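A common way to merge the two ranked lists is weighted Reciprocal Rank Fusion (the method LangChain's EnsembleRetriever uses under the hood, to the best of my knowledge): each document scores `weight / (rank_constant + rank)` per list it appears in, and the scores are summed. A minimal sketch:

```python
# Sketch of weighted Reciprocal Rank Fusion (RRF) over two ranked lists.
def rrf_fuse(ranked_lists, weights, rank_constant=60):
    scores: dict[str, float] = {}
    for docs, w in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["coding in python", "programming languages", "snake facts"]
keyword  = ["python programming", "coding in python", "monty python"]

# A doc ranked by BOTH retrievers ("coding in python") outscores docs
# that appear in only one list.
print(rrf_fuse([semantic, keyword], weights=[0.5, 0.5]))
```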

5.7 Hybrid Search Methods

πŸ”§ Implementing Hybrid Search

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma

# 1. Semantic retriever (vector search)
vectorstore = Chroma.from_documents(docs, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 2. Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3

# 3. Combine them with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, bm25_retriever],
    weights=[0.5, 0.5]  # 50% semantic, 50% keyword
)

# 4. Use hybrid retriever
results = ensemble_retriever.get_relevant_documents(
    "What is the company vacation policy?"
)

for doc in results:
    print(doc.page_content)

πŸ’‘ Tuning Weights: Try 0.7/0.3 or 0.6/0.4. Test what works best for your data!

5.8 Re-ranking Techniques

🎯 Re-ranking: The Secret Weapon

The Problem

Initial retrieval might return 100 documents, but only top 3-5 go to LLM. Order matters!

Re-ranking Flow

Step 1: Initial Retrieval

Fetch 100 potentially relevant documents (fast, rough)

Step 2: Re-rank

Score and re-order the 100 documents (slower, precise)

Step 3: Return Top K

Send top 3-5 best documents to LLM

🎯 Result: Better accuracy without sending 100 documents to expensive LLM!
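The three steps can be sketched as a small retrieve-then-rerank pipeline. Both scoring functions below are toy stand-ins: a real system would use vector search for the first pass and a cross-encoder model for the re-ranker, and the term weights are purely hypothetical.

```python
# Sketch of two-stage retrieval: cheap first pass, precise second pass.
def first_pass(query, corpus, n=100):
    """Step 1 (fast, rough): rank by word overlap, keep top n candidates."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:n]

def rerank(query, candidates, scorer, k=3):
    """Steps 2-3 (slower, precise): re-score pairs, return top k."""
    return sorted(candidates, key=lambda d: scorer(query, d), reverse=True)[:k]

corpus = [
    "vacation policy grants 15 days pto",
    "the office dog naps all day",
    "pto accrual rules for new employees",
]

# Toy pairwise scorer with made-up term weights (stand-in for a cross-encoder).
weights = {"vacation": 2.0, "pto": 2.0, "policy": 1.5}
def scorer(query, doc):
    shared = set(query.lower().split()) & set(doc.lower().split())
    return sum(weights.get(w, 0.1) for w in shared)

top = rerank("vacation pto policy", first_pass("vacation pto policy", corpus),
             scorer, k=1)
print(top)
```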

5.8 Re-ranking Techniques

πŸ”§ Re-ranking Methods

1. Cross-Encoder Models

Models specifically trained to score query-document pairs

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Using Cohere's reranker
compressor = CohereRerank()
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

2. LLM-based Re-ranking

Use LLM to score relevance

from langchain.retrievers.document_compressors import LLMChainFilter

# LLM judges relevance
compressor = LLMChainFilter.from_llm(llm)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever()
)

⚠️ Trade-offs

  • Cross-Encoder: Faster, cheaper, but needs separate API
  • LLM Re-ranker: No extra API, but slower and more expensive
5.9 Evaluation Metrics

πŸ“Š Why Evaluate RAG Systems?

The Challenge

How do you know if your RAG system is actually working well?

❌ Without Evaluation

  • Guessing if it works
  • Can't compare approaches
  • No way to improve systematically
  • Production surprises

βœ… With Evaluation

  • Quantify performance
  • A/B test changes
  • Track improvements
  • Confidence in production

Two Types of Metrics

Retrieval Metrics

Did we retrieve the RIGHT documents?

Generation Metrics

Did the LLM generate a GOOD answer?

5.9 Evaluation Metrics

🎯 Retrieval Metrics

Context Precision & Recall

Context Precision: Of retrieved docs, how many are relevant?

Precision = (Relevant Retrieved) / (Total Retrieved)

Context Recall: Of all relevant docs, how many did we retrieve?

Recall = (Relevant Retrieved) / (Total Relevant)

Example

Query: "What's the vacation policy?"

Total relevant documents in DB: 5

Retrieved: 3 documents

Relevant among retrieved: 2 documents

Precision = 2/3 = 0.67 (67% of retrieved were relevant)

Recall = 2/5 = 0.40 (40% of relevant docs were found)
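The arithmetic above is just set intersection over retrieved and relevant document IDs:

```python
# Precision/recall over retrieved vs. relevant document sets.
def precision_recall(retrieved: set, relevant: set):
    tp = len(retrieved & relevant)          # relevant docs we actually found
    return tp / len(retrieved), tp / len(relevant)

relevant = {"doc1", "doc2", "doc3", "doc4", "doc5"}  # 5 relevant docs in DB
retrieved = {"doc1", "doc2", "doc9"}                 # retrieved 3, only 2 relevant

p, r = precision_recall(retrieved, relevant)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")  # Precision = 0.67, Recall = 0.40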

5.9 Evaluation Metrics

πŸ“ Generation Metrics

| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 (1 = fully grounded) |
| Answer Relevance | Does the answer address the question? | 0-1 (1 = completely relevant) |
| Context Relevance | Is the retrieved context relevant to query? | 0-1 (1 = highly relevant) |
| BLEU Score | N-gram overlap with reference answer | 0-1 (1 = perfect match) |
| ROUGE Score | Recall-oriented overlap with reference | 0-1 (1 = perfect match) |

⚠️ Important: BLEU/ROUGE need reference answers. Faithfulness/Relevance can be computed automatically using LLMs!
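To make the overlap idea concrete, here is a minimal ROUGE-1 recall (fraction of reference unigrams that appear in the candidate). Real implementations such as the rouge-score package also handle stemming, ROUGE-2, and ROUGE-L.

```python
# Minimal ROUGE-1 recall: reference-unigram overlap with the candidate.
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)  # clipped unigram matches
    return overlap / sum(ref.values())

# Only "15" and "days" overlap with the 8-word reference -> 2/8 = 0.25.
print(rouge1_recall(
    "Employees get 15 days PTO annually",
    "The company offers 15 days paid time off",
))
```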

5.9 Evaluation Metrics

πŸ”§ Evaluating RAG with RAGAS

RAGAS: RAG Assessment Framework

# Install: pip install ragas
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)

# Your RAG pipeline results
data = {
    "question": ["What is the vacation policy?"],
    "answer": ["Employees get 15 days PTO annually"],
    "contexts": [["Vacation policy: 15 days...", "PTO accrues..."]],
    "ground_truth": ["The company offers 15 days paid time off"]
}

# Evaluate (ragas expects a HuggingFace Dataset)
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)

Output:

faithfulness: 0.95
answer_relevancy: 0.92
context_precision: 0.88
context_recall: 0.85

πŸ—οΈ Building a Complete RAG System

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Load documents
loader = PyPDFLoader("company_docs.pdf")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

# 5. Ask questions!
answer = qa_chain.run("What is the vacation policy?")
print(answer)

πŸš€ Advanced RAG Patterns

1. Multi-Query RAG

Generate multiple versions of the query for better retrieval

from langchain.retrievers import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=ChatOpenAI()
)
# Generates 3-5 variations of query

2. Parent Document Retriever

Retrieve small chunks, but return larger parent documents

from langchain.retrievers import ParentDocumentRetriever

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=small_splitter,
    parent_splitter=large_splitter
)

3. Self-Query Retriever

Extract metadata filters from natural language

from langchain.retrievers import SelfQueryRetriever

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(),
    vectorstore=vectorstore,
    document_content_description="docs",
    metadata_field_info=metadata_info
)

4. Contextual Compression

Compress retrieved docs to only relevant parts

from langchain.retrievers import ContextualCompressionRetriever

retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

✨ RAG Best Practices

πŸ“ Document Preparation

  • Clean your data: Remove noise, formatting issues
  • Add metadata: Source, date, category, author
  • Test chunking: Experiment with sizes (500-1500 tokens)
  • Maintain structure: Don't break paragraphs/sections

πŸ” Retrieval Optimization

  • Start with k=3-5: Don't retrieve too many
  • Use hybrid search: Semantic + keyword
  • Add re-ranking: Improves top results
  • Filter by metadata: When possible

πŸ€– Generation Quality

  • System prompts: "Answer based only on context"
  • Cite sources: Include document references
  • Handle no-answer: "I don't have information about..."
  • Temperature=0: For factual accuracy

πŸ“Š Evaluation & Monitoring

  • Create test set: 50-100 Q&A pairs
  • Track metrics: Faithfulness, relevance, precision
  • A/B test changes: Before deploying
  • User feedback: Thumbs up/down

⚠️ Common RAG Challenges & Solutions

| Challenge | Symptom | Solution |
|---|---|---|
| Hallucination | LLM makes up information | Lower temperature, stronger system prompt, check faithfulness |
| Irrelevant Retrieval | Wrong documents retrieved | Better chunking, hybrid search, re-ranking, metadata filters |
| Context Overflow | Too many tokens in context | Reduce k, use compression, summarize chunks |
| Poor Answers | Answers lack detail/accuracy | Improve prompts, increase k, check chunk quality |
| Slow Performance | High latency | Cache embeddings, use faster models, optimize DB |
| High Costs | Expensive API bills | Cache results, use smaller embeddings, optimize k |

🌍 Real-World RAG Applications

πŸ“š Enterprise Knowledge Base

Use Case: Internal documentation Q&A

  • Confluence, SharePoint, Google Drive docs
  • IT policies, HR handbooks, procedures
  • Can substantially reduce support ticket volume

🎧 Customer Support

Use Case: Product documentation assistant

  • Answer product questions instantly
  • Cite sources from manuals
  • 24/7 availability

βš–οΈ Legal Research

Use Case: Case law search

  • Search through thousands of cases
  • Find relevant precedents
  • Saves hours of manual research

πŸ₯ Medical Information

Use Case: Clinical guidelines

  • Query medical literature
  • Evidence-based recommendations
  • Always cite research sources

🎯 Key Takeaways

πŸ” Core Concepts

  • RAG = Retrieve + Augment + Generate
  • Embeddings capture semantic meaning
  • Vector DBs enable similarity search
  • Chunking strategy matters!

πŸ› οΈ Technical Skills

  • Build RAG pipelines with LangChain
  • Use Chroma/Pinecone for storage
  • Implement hybrid search
  • Apply re-ranking techniques

πŸ“Š Quality & Evaluation

  • Measure precision, recall, faithfulness
  • Use RAGAS for evaluation
  • A/B test improvements
  • Monitor in production

πŸ”₯ The RAG Revolution

RAG lets LLMs access ANY knowledge without retraining

This is why ChatGPT plugins, Claude Projects, and most AI products use RAG under the hood!

πŸ“ Homework Assignment

Assignment: Build a Production RAG System

Due: Next class

Project: Domain-Specific Q&A System

Requirements (100 points):

  1. Data Collection (15 pts):

    Gather 10+ documents in your chosen domain (PDFs, websites, etc.)

  2. RAG Pipeline (40 pts):
    • Document loading and preprocessing
    • Smart chunking with overlap
    • Vector store implementation (Chroma or Pinecone)
    • Retrieval with k=3-5
    • QA chain with citations
  3. Advanced Features (25 pts):

    Implement at least 2 of: Hybrid search, Re-ranking, Metadata filtering, Multi-query, Compression

  4. Evaluation (20 pts):

    Create 10 test Q&A pairs, compute RAGAS metrics, analyze results

πŸ“¦ Assignment Deliverables

1. Code (GitHub)

  • Complete RAG implementation
  • Clean, commented code
  • requirements.txt
  • README with setup instructions

2. Documentation

  • Domain and data sources
  • Architecture decisions
  • Chunking strategy rationale
  • Advanced features explanation

3. Demo Video (5 min)

  • Show system in action
  • Ask 5+ questions
  • Demonstrate advanced features
  • Show source citations

4. Evaluation Report

  • Test questions and answers
  • RAGAS metric scores
  • Analysis of results
  • Improvement suggestions

πŸ’‘ Domain Suggestions: Tech documentation, legal documents, medical info, academic papers, company policies, product manuals, news articles, research papers

πŸ“š Resources & Next Steps

πŸ“– Essential Reading

  • Lewis et al. (2020): Original RAG paper
  • LangChain RAG Docs: Comprehensive guide
  • Pinecone Learning Center: RAG tutorials
  • RAGAS Documentation: Evaluation framework

πŸ› οΈ Tools & Libraries

  • LangChain: RAG framework
  • LlamaIndex: Alternative framework
  • Chroma: Vector database
  • RAGAS: Evaluation metrics

πŸ“… Coming Next

βš–οΈ Unit 6: Ethical and Responsible AI

  • Bias and fairness in AI systems
  • Privacy and data protection
  • Model misuse and safeguards
  • Responsible AI practices

❓ Questions?

Let's Discuss!

Any questions about:

  • RAG architecture and workflow?
  • Vector embeddings and databases?
  • Chunking strategies?
  • Hybrid search and re-ranking?
  • Evaluation metrics?
  • Your project ideas?

Thank You! πŸŽ‰

You now understand RAG systems!

The foundation of modern AI applications

πŸ“§ Questions? Reach out anytime!

πŸ’» Start building your RAG system!

πŸ” Experiment with different techniques!

βš–οΈ Next: Ethics in AI - crucial for responsible development!
