Retrieval Augmented Generation
Unit 5: The Secret Sauce of Modern AI
Building Context-Aware LLM Systems
What is RAG and Why Does It Matter?
The Problem with Plain LLMs
LLMs have knowledge cutoff dates and don't know about your private data
❌ Without RAG
Q: "What's in our Q3 2024 financial report?"
A: "I don't have access to your company's financial reports."
✅ With RAG
Q: "What's in our Q3 2024 financial report?"
A: "According to your Q3 report, revenue increased 23% to $5.2M..."
RAG = give LLMs access to external knowledge without retraining!
What is RAG?
Retrieval Augmented Generation:
Retrieve relevant information from external sources →
Augment the prompt with this context →
Generate accurate, grounded responses
Basic RAG Flow
"What's our vacation policy?"
Search knowledge base for vacation policy
Context: [vacation policy text] + Question
"Employees receive 15 days PTO annually..."
Why RAG Over Fine-tuning?
| Aspect | Fine-tuning | RAG |
|---|---|---|
| Cost | High (retraining) | Low (just retrieval) |
| Update Speed | Slow (retrain needed) | Instant (update docs) |
| Knowledge Source | Baked into weights | External, auditable |
| Accuracy | Can hallucinate | Grounded in sources |
| Transparency | Black box | Can cite sources |
| Best For | Behavior/style changes | Knowledge/facts |
Best of Both Worlds: fine-tune for behavior + RAG for knowledge = a powerful combination!
What Are Tokens?
Definition
Tokens = the basic units LLMs process. Not exactly words: subword pieces!
Tokenization Examples
Text: "Hello, how are you?"
Tokens: ["Hello", ",", " how", " are", " you", "?"]
Count: 6 tokens
Text: "Supercalifragilisticexpialidocious"
Tokens: ["Super", "cal", "ifrag", "ilistic", "exp", "ial", "idoc", "ious"]
Count: 8 tokens (long word = more tokens!)
Rule of Thumb: 1 token ≈ 0.75 words in English, so 100 words ≈ 133 tokens.
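That rule of thumb is easy to encode. The function below is only a rough estimator built on the 0.75 ratio; for exact counts you would use a real tokenizer (e.g. OpenAI's tiktoken):

```python
def estimate_tokens(text):
    # Rough estimate from the 1 token ~ 0.75 English words heuristic.
    words = len(text.split())
    return round(words / 0.75)

print(estimate_tokens(" ".join(["word"] * 100)))  # 133
```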
⚠️ Token Limitations & Context Windows
Context Window Sizes
| Model | Context Window | Equivalent |
|---|---|---|
| GPT-3.5 | 4,096 tokens | ~3,000 words |
| GPT-3.5-16k | 16,384 tokens | ~12,000 words |
| GPT-4 | 8,192 tokens | ~6,000 words |
| GPT-4-32k | 32,768 tokens | ~24,000 words |
| Claude 3 | 200,000 tokens | ~150,000 words |
The RAG Problem
Your company knowledge base = 10,000 documents = millions of tokens!
Solution: Retrieve only the MOST RELEVANT chunks, not everything
What Are Vector Embeddings?
Definition
Embeddings = Numerical representations of text that capture semantic meaning
From Text to Numbers
Text: "The cat sat on the mat"
Embedding (1536 dimensions):
[0.023, -0.145, 0.678, 0.234, ..., 0.892]
A vector of 1536 numbers!
Text: "A feline rested on the rug"
Embedding:
[0.028, -0.142, 0.681, 0.229, ..., 0.895]
Similar meaning = similar vectors!
Magic: sentences with similar meanings have similar embeddings, even with different words!
How Embeddings Capture Meaning
Semantic Similarity in Vector Space
"king" is close to "queen", "monarch"
"dog" is close to "cat", "puppy"
"king" - "man" + "woman" β "queen"
Generating Embeddings
from langchain.embeddings import OpenAIEmbeddings
# Initialize embeddings model
embeddings = OpenAIEmbeddings()
# Generate embedding for text
text = "Machine learning is amazing"
vector = embeddings.embed_query(text)
print(len(vector)) # 1536 dimensions
print(vector[:5]) # [0.023, -0.145, 0.678, 0.234, -0.012]
# Embed multiple documents at once
docs = ["doc1", "doc2", "doc3"]
doc_vectors = embeddings.embed_documents(docs)
Popular Embedding Models
| Model | Dimensions | Provider | Best For |
|---|---|---|---|
| text-embedding-ada-002 | 1536 | OpenAI | General purpose, most popular |
| text-embedding-3-small | 1536 | OpenAI | Faster, cheaper |
| text-embedding-3-large | 3072 | OpenAI | Best quality |
| all-MiniLM-L6-v2 | 384 | Open source | Fast, free, local |
| BGE-large | 1024 | Open source | High quality, free |
Cost Consideration: OpenAI embeddings cost ~$0.0001 per 1K tokens, so embedding 1M documents of roughly 1K tokens each comes to about $100.
What Are Vector Databases?
Definition
Specialized databases optimized for storing and searching high-dimensional vectors
Traditional Database
Search: "Find rows where name = 'John'"
→ Exact match
Vector Database
Search: "Find vectors similar to [0.23, 0.45, ...]"
→ Semantic similarity
Why We Need Them
Speed
Find similar vectors among millions in milliseconds
Similarity
Built-in distance calculations (cosine, Euclidean)
Scalability
Handle billions of vectors efficiently
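At small scale, similarity search is just a linear scan over stored vectors; a vector database does essentially this, plus indexing tricks (e.g. HNSW) to avoid comparing against every vector. A minimal sketch with made-up 3-dimensional vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index, k=2):
    # index: list of (doc_id, vector); brute-force linear scan, O(n) per query.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

index = [
    ("doc_cats", [0.9, 0.1, 0.0]),
    ("doc_dogs", [0.8, 0.2, 0.1]),
    ("doc_python", [0.0, 0.1, 0.9]),
]
top = search([1.0, 0.0, 0.0], index, k=2)
print([doc_id for doc_id, _ in top])  # ['doc_cats', 'doc_dogs']
```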
Popular Vector Databases
Pinecone
- Type: Managed cloud
- Pros: Easy setup, scalable, reliable
- Cons: Costs money, vendor lock-in
- Best for: Production apps
Chroma
- Type: Open source, local
- Pros: Free, easy, LangChain integration
- Cons: Not for massive scale
- Best for: Development, prototypes
Weaviate
- Type: Open source + cloud
- Pros: Feature-rich, GraphQL API
- Cons: Complex setup
- Best for: Enterprise apps
FAISS
- Type: Library (Meta)
- Pros: Very fast, battle-tested
- Cons: Just a library, not full DB
- Best for: Custom solutions
Recommendation: start with Chroma (free, easy); scale to Pinecone for production.
Using Chroma (Hands-on)
# Install: pip install chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
# 1. Create embeddings model
embeddings = OpenAIEmbeddings()
# 2. Sample documents
docs = [
"Paris is the capital of France",
"Python is a programming language",
"The Eiffel Tower is in Paris",
"Machine learning uses Python"
]
# 3. Create vector store from documents
vectorstore = Chroma.from_texts(
texts=docs,
embedding=embeddings,
persist_directory="./chroma_db" # Save to disk
)
# 4. Search for similar documents
query = "Tell me about France"
results = vectorstore.similarity_search(query, k=2)
for doc in results:
print(doc.page_content)
Output:
Paris is the capital of France
The Eiffel Tower is in Paris
Why Chunking Matters
The Chunking Problem
Documents are too large to fit in context windows. We must split them into chunks!
❌ Bad Chunking
Chunk 1: "The company was founded in"
Chunk 2: "1998 by two engineers who"
Sentence split! Context lost!
✅ Good Chunking
Chunk 1: "The company was founded in 1998 by two engineers who wanted to..."
Complete thought preserved!
Chunking Trade-offs
- Too small: Not enough context, many API calls
- Too large: Irrelevant info, expensive tokens
- Just right: the Goldilocks zone of meaningful context and efficient retrieval
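Fixed-size chunking with overlap, the simplest strategy, fits in a few lines. This sketch splits on raw characters only; real splitters also prefer breaking at paragraph and sentence boundaries:

```python
def chunk_text(text, chunk_size, overlap):
    # Slide a window of chunk_size characters, stepping by chunk_size - overlap
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```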
Chunking Strategies
1. Fixed-Size Chunking
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=500,
chunk_overlap=50
)
chunks = splitter.split_text(text)
✅ Simple, predictable
❌ May split sentences
2. Recursive Splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separators=["\n\n", "\n", " ", ""]
)
✅ Preserves structure
✅ Most recommended
3. Semantic Chunking
from langchain_experimental.text_splitter import SemanticChunker
splitter = SemanticChunker(
embeddings=OpenAIEmbeddings()
)
chunks = splitter.split_text(text)
✅ Topic-based splits
❌ Slower, more expensive
4. Document-Specific
- Markdown: Split by headers
- Code: Split by functions
- HTML: Split by sections
✅ Respects structure
✅ Domain-specific
Adding Metadata to Chunks
Why Metadata?
Attach additional information to chunks for better filtering and context
from langchain.schema import Document
# Create document with metadata
doc = Document(
page_content="Paris is the capital of France",
metadata={
"source": "geography.pdf",
"page": 42,
"author": "John Smith",
"date": "2024-01-15",
"category": "geography"
}
)
# Store in vector DB
vectorstore = Chroma.from_documents([doc], embeddings)
# Filter by metadata during search
results = vectorstore.similarity_search(
query="capitals",
filter={"category": "geography"}
)
Use Cases: filter by date, source, author, department, security level, etc.
How Similarity Search Works
The Core Concept
Find vectors that are "closest" to the query vector in high-dimensional space
Distance Metrics
Cosine Similarity
similarity = cos(θ)
Range: [-1, 1]
1 = identical
0 = orthogonal
-1 = opposite
→ Most common for text
Euclidean Distance
d = √Σ(a_i - b_i)²
Range: [0, ∞]
0 = identical
larger = farther
→ Intuitive "distance"
Dot Product
dp = Σ(a_i × b_i)
Range: (-∞, ∞)
higher = more similar
→ Fast to compute
Default: cosine similarity works best for most RAG applications!
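All three metrics are a few lines each in plain Python (for illustration; in practice the vector database computes these for you):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| * |b|)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [0.0, 1.0]           # orthogonal vectors
print(cosine_similarity(a, b))           # 0.0
print(euclidean_distance(a, a))          # 0.0 (identical)
print(dot([1, 2], [3, 4]))               # 11
```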
Retrieval Methods in LangChain
from langchain.vectorstores import Chroma
# Assume vectorstore is already created
# 1. Basic similarity search (top k)
results = vectorstore.similarity_search(
query="What is machine learning?",
k=3 # Return top 3 most similar
)
# 2. Similarity search with scores
results = vectorstore.similarity_search_with_score(
query="What is machine learning?",
k=3
)
for doc, score in results:
print(f"Score: {score:.3f} - {doc.page_content}")
# 3. MMR (Maximum Marginal Relevance) - diverse results
results = vectorstore.max_marginal_relevance_search(
query="machine learning",
k=3,
fetch_k=10, # Fetch 10, return 3 diverse ones
lambda_mult=0.5 # Balance relevance vs diversity
)
# 4. With metadata filtering
results = vectorstore.similarity_search(
query="policy updates",
k=3,
filter={"year": 2024}
)
Hybrid Search: Best of Both Worlds
What is Hybrid Search?
Combine semantic search (embeddings) + keyword search (BM25) for better results
Semantic Search Only
Query: "Python programming"
Finds: "Coding in Python", "Programming languages"
✅ Understands meaning
❌ Might miss exact term matches
Keyword Search (BM25)
Query: "Python programming"
Finds: Exact word matches of "Python" and "programming"
✅ Precise term matching
❌ Misses synonyms
Hybrid = semantic understanding + exact keyword matching = best results!
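The fusion step is just a weighted sum of the two score lists. A sketch with made-up, already-normalized scores (real implementations such as LangChain's EnsembleRetriever typically fuse ranks rather than raw scores):

```python
def hybrid_scores(semantic, keyword, w_sem=0.5, w_kw=0.5):
    # semantic / keyword: {doc: score in [0, 1]}; weighted sum per doc,
    # then rank documents by the fused score.
    docs = set(semantic) | set(keyword)
    fused = {d: w_sem * semantic.get(d, 0.0) + w_kw * keyword.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

semantic = {"coding_in_python": 0.9, "snake_care": 0.6}
keyword = {"coding_in_python": 0.8, "python_wiki": 0.9}
print(hybrid_scores(semantic, keyword))
# ['coding_in_python', 'python_wiki', 'snake_care']
```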
Implementing Hybrid Search
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Chroma
# 1. Semantic retriever (vector search)
vectorstore = Chroma.from_documents(docs, embeddings)
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
# 2. Keyword retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 3
# 3. Combine them with weights
ensemble_retriever = EnsembleRetriever(
retrievers=[semantic_retriever, bm25_retriever],
weights=[0.5, 0.5] # 50% semantic, 50% keyword
)
# 4. Use hybrid retriever
results = ensemble_retriever.get_relevant_documents(
"What is the company vacation policy?"
)
for doc in results:
print(doc.page_content)
Tuning Weights: try 0.7/0.3 or 0.6/0.4, and test what works best on your data.
Re-ranking: The Secret Weapon
The Problem
Initial retrieval might return 100 documents, but only top 3-5 go to LLM. Order matters!
Re-ranking Flow
Step 1: Initial Retrieval
Fetch 100 potentially relevant documents (fast, rough)
Step 2: Re-rank
Score and re-order the 100 documents (slower, precise)
Step 3: Return Top K
Send top 3-5 best documents to LLM
Result: better accuracy without sending 100 documents to an expensive LLM!
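The two-stage pattern is independent of the scoring model. This sketch uses a trivial term-overlap scorer as the "precise" second stage purely for illustration; a real re-ranker would be a cross-encoder or an LLM:

```python
def rerank(query_terms, candidates, top_k=3):
    # candidates: list of (doc_text, rough_score) from fast first-stage retrieval.
    # Re-score each doc with a finer function, then keep only the top_k.
    def fine_score(doc_text):
        words = set(doc_text.lower().split())
        return len(words & set(query_terms)) / len(query_terms)

    rescored = sorted(candidates, key=lambda c: fine_score(c[0]), reverse=True)
    return [doc for doc, _ in rescored[:top_k]]

candidates = [
    ("vacation days and PTO policy", 0.9),
    ("office parking rules", 0.8),
    ("how PTO accrues each month", 0.7),
]
print(rerank(["pto", "policy"], candidates, top_k=2))
# ['vacation days and PTO policy', 'how PTO accrues each month']
```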
Re-ranking Methods
1. Cross-Encoder Models
Models specifically trained to score query-document pairs
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
# Using Cohere's reranker
compressor = CohereRerank()
retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
2. LLM-based Re-ranking
Use LLM to score relevance
from langchain.retrievers.document_compressors import LLMChainFilter
# LLM judges relevance
compressor = LLMChainFilter.from_llm(llm)
retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=vectorstore.as_retriever()
)
⚠️ Trade-offs
- Cross-Encoder: Faster, cheaper, but needs separate API
- LLM Re-ranker: No extra API, but slower and more expensive
Why Evaluate RAG Systems?
The Challenge
How do you know if your RAG system is actually working well?
❌ Without Evaluation
- Guessing if it works
- Can't compare approaches
- No way to improve systematically
- Production surprises
✅ With Evaluation
- Quantify performance
- A/B test changes
- Track improvements
- Confidence in production
Two Types of Metrics
Retrieval Metrics
Did we retrieve the RIGHT documents?
Generation Metrics
Did the LLM generate a GOOD answer?
Retrieval Metrics
Context Precision & Recall
Context Precision: Of retrieved docs, how many are relevant?
Precision = (Relevant Retrieved) / (Total Retrieved)
Context Recall: Of all relevant docs, how many did we retrieve?
Recall = (Relevant Retrieved) / (Total Relevant)
Example
Query: "What's the vacation policy?"
Total relevant documents in DB: 5
Retrieved: 3 documents
Relevant among retrieved: 2 documents
Precision = 2/3 = 0.67 (67% of retrieved were relevant)
Recall = 2/5 = 0.40 (40% of relevant docs were found)
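The example numbers above fall out of a two-line function:

```python
def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# 3 retrieved, 2 of them relevant, out of 5 relevant docs in the DB
p, r = precision_recall(["d1", "d2", "d7"], ["d1", "d2", "d3", "d4", "d5"])
print(round(p, 2), r)  # 0.67 0.4
```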
Generation Metrics
| Metric | What It Measures | Range |
|---|---|---|
| Faithfulness | Is the answer grounded in retrieved context? | 0-1 (1 = fully grounded) |
| Answer Relevance | Does the answer address the question? | 0-1 (1 = completely relevant) |
| Context Relevance | Is the retrieved context relevant to query? | 0-1 (1 = highly relevant) |
| BLEU Score | N-gram overlap with reference answer | 0-1 (1 = perfect match) |
| ROUGE Score | Recall-oriented overlap with reference | 0-1 (1 = perfect match) |
⚠️ Important: BLEU/ROUGE need reference answers. Faithfulness/relevance can be computed automatically using LLMs!
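To make the ROUGE idea concrete, here is a bare-bones ROUGE-1 recall (unigram recall against the reference). Real implementations, such as the rouge-score package, also handle stemming and longest-common-subsequence variants:

```python
def rouge1_recall(answer, reference):
    # Fraction of distinct reference words that also appear in the answer.
    ans = answer.lower().split()
    ref = set(reference.lower().split())
    return sum(1 for w in ref if w in ans) / len(ref)

print(rouge1_recall("employees get 15 days pto", "15 days pto for employees"))
# 0.8
```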
Evaluating RAG with RAGAS
RAGAS: RAG Assessment Framework
# Install: pip install ragas
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
# Your RAG pipeline results (ragas expects a Hugging Face Dataset;
# column names can vary slightly between ragas versions)
from datasets import Dataset
data = Dataset.from_dict({
    "question": ["What is the vacation policy?"],
    "answer": ["Employees get 15 days PTO annually"],
    "contexts": [["Vacation policy: 15 days...", "PTO accrues..."]],
    "ground_truth": ["The company offers 15 days paid time off"]
})
# Evaluate
result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)
Output:
faithfulness: 0.95
answer_relevancy: 0.92
context_precision: 0.88
context_recall: 0.85
Building a Complete RAG System
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# 1. Load documents
loader = PyPDFLoader("company_docs.pdf")
documents = loader.load()
# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
# 3. Create embeddings and store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)
# 4. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(temperature=0),
chain_type="stuff",
retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
# 5. Ask questions!
answer = qa_chain.run("What is the vacation policy?")
print(answer)
Advanced RAG Patterns
1. Multi-Query RAG
Generate multiple versions of the query for better retrieval
from langchain.retrievers import MultiQueryRetriever
retriever = MultiQueryRetriever.from_llm(
retriever=vectorstore.as_retriever(),
llm=ChatOpenAI()
)
# Generates 3-5 variations of query
2. Parent Document Retriever
Retrieve small chunks, but return larger parent documents
from langchain.retrievers import ParentDocumentRetriever
retriever = ParentDocumentRetriever(
vectorstore=vectorstore,
docstore=docstore,
child_splitter=small_splitter,
parent_splitter=large_splitter
)
3. Self-Query Retriever
Extract metadata filters from natural language
from langchain.retrievers import SelfQueryRetriever
retriever = SelfQueryRetriever.from_llm(
llm=ChatOpenAI(),
vectorstore=vectorstore,
document_content_description="docs",
metadata_field_info=metadata_info
)
4. Contextual Compression
Compress retrieved docs to only relevant parts
from langchain.retrievers import ContextualCompressionRetriever
retriever = ContextualCompressionRetriever(
base_compressor=compressor,
base_retriever=base_retriever
)
RAG Best Practices
Document Preparation
- Clean your data: Remove noise, formatting issues
- Add metadata: Source, date, category, author
- Test chunking: Experiment with sizes (500-1500 tokens)
- Maintain structure: Don't break paragraphs/sections
Retrieval Optimization
- Start with k=3-5: Don't retrieve too many
- Use hybrid search: Semantic + keyword
- Add re-ranking: Improves top results
- Filter by metadata: When possible
Generation Quality
- System prompts: "Answer based only on context"
- Cite sources: Include document references
- Handle no-answer: "I don't have information about..."
- Temperature=0: For factual accuracy
Evaluation & Monitoring
- Create test set: 50-100 Q&A pairs
- Track metrics: Faithfulness, relevance, precision
- A/B test changes: Before deploying
- User feedback: Thumbs up/down
⚠️ Common RAG Challenges & Solutions
| Challenge | Symptom | Solution |
|---|---|---|
| Hallucination | LLM makes up information | Lower temperature, stronger system prompt, check faithfulness |
| Irrelevant Retrieval | Wrong documents retrieved | Better chunking, hybrid search, re-ranking, metadata filters |
| Context Overflow | Too many tokens in context | Reduce k, use compression, summarize chunks |
| Poor Answers | Answers lack detail/accuracy | Improve prompts, increase k, check chunk quality |
| Slow Performance | High latency | Cache embeddings, use faster models, optimize DB |
| High Costs | Expensive API bills | Cache results, use smaller embeddings, optimize k |
Real-World RAG Applications
Enterprise Knowledge Base
Use Case: Internal documentation Q&A
- Confluence, SharePoint, Google Drive docs
- IT policies, HR handbooks, procedures
- Can significantly reduce support ticket volume
Customer Support
Use Case: Product documentation assistant
- Answer product questions instantly
- Cite sources from manuals
- 24/7 availability
Legal Research
Use Case: Case law search
- Search through thousands of cases
- Find relevant precedents
- Saves hours of manual research
Medical Information
Use Case: Clinical guidelines
- Query medical literature
- Evidence-based recommendations
- Always cite research sources
Key Takeaways
Core Concepts
- RAG = Retrieve + Augment + Generate
- Embeddings capture semantic meaning
- Vector DBs enable similarity search
- Chunking strategy matters!
Technical Skills
- Build RAG pipelines with LangChain
- Use Chroma/Pinecone for storage
- Implement hybrid search
- Apply re-ranking techniques
Quality & Evaluation
- Measure precision, recall, faithfulness
- Use RAGAS for evaluation
- A/B test improvements
- Monitor in production
The RAG Revolution
RAG lets LLMs access ANY knowledge without retraining
This is why ChatGPT plugins, Claude Projects, and most AI products use RAG under the hood!
Homework Assignment
Assignment: Build a Production RAG System
Due: Next class
Project: Domain-Specific Q&A System
Requirements (100 points):
1. Data Collection (15 pts): gather 10+ documents in your chosen domain (PDFs, websites, etc.)
2. RAG Pipeline (40 pts):
   - Document loading and preprocessing
   - Smart chunking with overlap
   - Vector store implementation (Chroma or Pinecone)
   - Retrieval with k=3-5
   - QA chain with citations
3. Advanced Features (25 pts): implement at least 2 of: hybrid search, re-ranking, metadata filtering, multi-query, compression
4. Evaluation (20 pts): create 10 test Q&A pairs, compute RAGAS metrics, analyze results
Assignment Deliverables
1. Code (GitHub)
- Complete RAG implementation
- Clean, commented code
- Requirements.txt
- README with setup instructions
2. Documentation
- Domain and data sources
- Architecture decisions
- Chunking strategy rationale
- Advanced features explanation
3. Demo Video (5 min)
- Show system in action
- Ask 5+ questions
- Demonstrate advanced features
- Show source citations
4. Evaluation Report
- Test questions and answers
- RAGAS metric scores
- Analysis of results
- Improvement suggestions
Domain Suggestions: tech documentation, legal documents, medical info, academic papers, company policies, product manuals, news articles
Resources & Next Steps
Essential Reading
- Lewis et al. (2020): Original RAG paper
- LangChain RAG Docs: Comprehensive guide
- Pinecone Learning Center: RAG tutorials
- RAGAS Documentation: Evaluation framework
Tools & Libraries
- LangChain: RAG framework
- LlamaIndex: Alternative framework
- Chroma: Vector database
- RAGAS: Evaluation metrics
Coming Next
Unit 6: Ethical and Responsible AI
- Bias and fairness in AI systems
- Privacy and data protection
- Model misuse and safeguards
- Responsible AI practices
Questions?
Let's Discuss!
Any questions about:
- RAG architecture and workflow?
- Vector embeddings and databases?
- Chunking strategies?
- Hybrid search and re-ranking?
- Evaluation metrics?
- Your project ideas?
Thank You!
You now understand RAG systems!
The foundation of modern AI applications
Questions? Reach out anytime!
Start building your RAG system!
Experiment with different techniques!
Next: Ethics in AI - crucial for responsible development!