Foundations of Large Language Models
Unit 2: How Modern AI Really Works
Understanding Transformers, Architectures & Training
What We'll Learn Today
Journey Through Modern LLMs
2.1 Transformer Architecture & Attention
The revolutionary architecture that changed everything
2.2 Key LLM Types
GPT, BERT, LLaMA - understanding the big players
2.3 Training Methodologies
Pre-training, fine-tuning, and how models learn
By the end: You'll understand how ChatGPT, Claude, and other LLMs actually work under the hood!
The World Before Transformers
The Problem: Sequential Processing
RNNs & LSTMs (Pre-2017)
- Process word by word
- Can't parallelize
- Forget long contexts
- Slow to train
Word1 → Word2 → Word3 → Word4
(Must process in order)
Transformers (2017+)
- Process all words at once
- Highly parallelizable
- Remember everything
- Fast to train
Word1 ↔ Word2 ↔ Word3 ↔ Word4
(All words see each other)
The Breakthrough: the "Attention Is All You Need" paper (2017) introduced the transformer, which became the foundation of all modern LLMs!
The Transformer Architecture
Input Text
Embedding Layer
Converts words to vectors
Encoder / Decoder Layers
Self-Attention + Feed Forward
(Repeated 12-96 times)
Output Layer
Predicts next token
Generated Text
What is "Attention"?
The Core Idea
Attention = "Which words should I focus on?"
Example: Understanding Context
"The animal didn't cross the street because it was too tired."
"it" pays attention to "animal" (not "street")
Attention weight: animal = 0.8, street = 0.1, tired = 0.1
Without Attention
Old models tied "it" to just the previous word or a fixed distance
❌ Couldn't capture long-range dependencies
With Attention
Transformers let "it" look at ALL previous words and decide which ones matter
✅ Captures long-range context
Self-Attention: The Magic Formula
How It Works (Simplified)
Step 1: Create Q, K, V
For each word, create 3 vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What info do I provide?"
Step 2: Calculate Scores
Compare Query with all Keys:
Higher score = more relevant
Step 3: Softmax
Convert scores to probabilities:
Sum to 1.0
Step 4: Weighted Sum
Combine Values using weights:
Final representation
Intuition: Every word asks "Who's relevant to me?" and combines info from those words!
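The four steps above can be sketched in a few lines of NumPy. This is a toy single-head version: the Q/K/V projection matrices are random stand-ins, not trained weights.

```python
import numpy as np

def self_attention(X):
    """Toy scaled dot-product self-attention (single head).

    X: (seq_len, d) matrix of word vectors. The Wq/Wk/Wv projections
    here are random placeholders; in a real model they are learned.
    """
    rng = np.random.default_rng(0)
    d = X.shape[1]
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Step 1: queries, keys, values
    scores = Q @ K.T / np.sqrt(d)          # Step 2: compare each Q with every K
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 3: softmax
    return weights @ V, weights            # Step 4: weighted sum of values

X = np.random.default_rng(1).normal(size=(4, 8))  # 4 "words", 8-dim embeddings
out, weights = self_attention(X)
print(out.shape)             # (4, 8): one context-aware vector per word
print(weights.sum(axis=-1))  # each row of attention weights sums to 1.0
```

Each row of `weights` plays the role of the "animal = 0.8, street = 0.1" numbers from the earlier example.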
Multi-Head Attention
Why Multiple "Heads"?
Different heads learn different relationships!
Example: Analyzing "The cat sat on the mat"
Head 1: Subject-Verb
Focuses on:
cat → sat
Learns: Who did the action?
Head 2: Prepositions
Focuses on:
sat → on → mat
Learns: Spatial relationships
Head 3: Determiners
Focuses on:
the → cat
the → mat
Learns: Article-noun pairs
Typical Setup: GPT-3 has 96 attention heads per layer! Each specializes in different linguistic patterns.
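Mechanically, "heads" just means slicing each token's vector into smaller chunks, running attention on each chunk independently, and concatenating the results. A minimal sketch of that split-and-merge bookkeeping (sizes are toy values, not real model dimensions):

```python
import numpy as np

d_model, n_heads = 8, 2
head_dim = d_model // n_heads   # each head works on a smaller slice
X = np.random.default_rng(0).normal(size=(4, d_model))  # 4 tokens

# Split each token's vector into per-head chunks: (tokens, heads, head_dim)
per_head = X.reshape(4, n_heads, head_dim)
# ...each head would run its own attention over its slice, learning its own
# patterns (subject-verb, prepositions, ...), then the head outputs are
# concatenated back into one d_model-sized vector per token:
merged = per_head.reshape(4, d_model)
print(merged.shape)  # (4, 8)
```

The split and concatenation are lossless, so multiple heads cost roughly the same compute as one big head while letting each slice specialize.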
Positional Encoding: Word Order Matters
The Problem
Attention alone has NO sense of order!
"Dog bites man" and "Man bites dog" would look the same!
The Solution: Add Position Information
Without Position
Word: [0.2, 0.5, 0.8]
Position: ???
With Position Encoding
Word: [0.2, 0.5, 0.8]
Position: [0.1, 0.0, 0.3]
Final: [0.3, 0.5, 1.1]
How It Works
- Sinusoidal Encoding: Use sine/cosine waves of different frequencies
- Learned Encoding: Let the model learn position patterns
- Result: Each position gets a unique "signature"
Key Insight: Positional encoding lets transformers distinguish "first word", "last word", "middle word", etc.
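The sinusoidal variant from the original transformer paper can be written in a few lines. Each position gets a unique pattern of sine/cosine values, which is simply added to the word vector:

```python
import numpy as np

def sinusoidal_pe(seq_len, d):
    """Sinusoidal positional encoding ("Attention Is All You Need")."""
    pos = np.arange(seq_len)[:, None]   # positions 0..seq_len-1
    i = np.arange(0, d, 2)[None, :]     # even dimension indices
    angles = pos / (10000 ** (i / d))   # one frequency per dimension pair
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)        # even dims: sine
    pe[:, 1::2] = np.cos(angles)        # odd dims: cosine
    return pe

pe = sinusoidal_pe(seq_len=6, d=8)
print(pe.shape)  # (6, 8): one unique "signature" row per position
# Usage matches the slide's example:
# final_embedding = word_vector + pe[position]
```

Because the frequencies differ per dimension, no two positions share the same row, which is exactly the unique "signature" described above.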
Complete Transformer Block
Input Embeddings + Positional Encoding
Multi-Head Self-Attention
Look at all other tokens
Add & Normalize
Residual connection + Layer Norm
Feed Forward Network
2-layer neural network
Add & Normalize
Another residual + Layer Norm
Output (to next layer)
This block repeats 12-96 times! GPT-3: 96 layers; GPT-4: estimated 120+ layers (unconfirmed)
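The wiring of one block (attention, residual add, layer norm, feed-forward, residual add, layer norm) can be sketched with toy stand-ins. The identity "attention" and random feed-forward weights below are placeholders, not trained parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit scale."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def transformer_block(x, attn, ffn):
    # Sub-layer 1: self-attention with residual connection + LayerNorm
    x = layer_norm(x + attn(x))
    # Sub-layer 2: position-wise feed-forward with residual + LayerNorm
    x = layer_norm(x + ffn(x))
    return x

rng = np.random.default_rng(0)
d, hidden = 8, 32
W1, W2 = rng.normal(size=(d, hidden)), rng.normal(size=(hidden, d))
ffn = lambda x: np.maximum(x @ W1, 0) @ W2  # the "2-layer neural network"
attn = lambda x: x                          # placeholder for self-attention
out = transformer_block(rng.normal(size=(4, d)), attn, ffn)
print(out.shape)  # (4, 8): same shape in and out, so blocks can stack
```

Because input and output shapes match, dozens of these blocks can be chained, which is exactly what "repeated 12-96 times" means.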
The LLM Landscape
Three Main Architecture Families
Encoder-Only
Example: BERT
Best for:
- Classification
- Understanding text
- Question answering
- Named entity recognition
Decoder-Only
Example: GPT, LLaMA
Best for:
- Text generation
- Creative writing
- Code generation
- Conversations
Encoder-Decoder
Example: T5, BART
Best for:
- Translation
- Summarization
- Text-to-text tasks
- Question generation
Today's Focus: We'll dive deep into the most popular ones used in GenAI applications!
GPT: Generative Pre-trained Transformer
Core Idea: Predict the Next Word
Given: "The cat sat on the ___"
Predict: "mat" (or "chair", "floor", etc.)
Architecture
- Type: Decoder-only
- Attention: Causal (left-to-right)
- Training: Next token prediction
- Direction: Forward only
Key Features
- Autoregressive generation
- Can't see future words
- Excellent at generation
- Zero-shot learning
The GPT Evolution
| Model | Parameters | Year | Key Feature |
|---|---|---|---|
| GPT-1 | 117M | 2018 | Proof of concept |
| GPT-2 | 1.5B | 2019 | "Too dangerous to release" |
| GPT-3 | 175B | 2020 | Few-shot learning breakthrough |
| GPT-4 | ~1.7T* | 2023 | Multimodal, reasoning |
*Estimated, not officially confirmed
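GPT's "causal (left-to-right)" attention is implemented as a triangular mask over the attention scores. A small sketch of what that mask looks like and what it does:

```python
import numpy as np

seq_len = 5
# Causal mask: token i may attend only to tokens 0..i (lower triangle)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]

# Masked positions get a score of -inf before the softmax, so their
# attention weight becomes exactly 0: the model cannot see the future.
scores = np.where(mask, 0.0, -np.inf)  # toy scores: equal where visible
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights[2])  # ~[0.333 0.333 0.333 0.    0.   ]
```

Row 2 shows the idea: the third token spreads its attention over tokens 0-2 and gives exactly zero weight to tokens 3-4.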
BERT: Bidirectional Encoder Representations from Transformers
Core Idea: Fill in the Blank
Given: "The cat [MASK] on the mat"
Predict: "sat"
Can see words BEFORE and AFTER the blank!
Architecture
- Type: Encoder-only
- Attention: Bidirectional
- Training: Masked Language Model
- Direction: Both ways ↔
Key Features
- Sees full context
- Better understanding
- Not for generation
- Great for classification
GPT vs BERT: The Key Difference
GPT (Causal Attention)
"The cat sat on the"
Can only look ← left
Predicts next word
BERT (Bidirectional)
"The cat [MASK] on the mat"
Looks ← left and right →
Understands full sentence
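BERT's masked-language-model training data can be generated with a few lines of code. This is a simplified sketch: real BERT also sometimes swaps in a random token or keeps the original instead of always using [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Toy MLM corruption: hide some tokens behind [MASK].

    Returns the corrupted sequence plus, per position, the token the
    model must predict (None where no loss is computed).
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)     # model must recover this token
        else:
            masked.append(tok)
            targets.append(None)    # no loss on unmasked positions
    return masked, targets

sentence = "the cat sat on the mat".split()
masked, targets = mask_tokens(sentence, mask_prob=0.5)
print(masked)  # some words replaced by [MASK]
```

Because the model sees the whole corrupted sentence at once, it can use words on both sides of each [MASK], which is precisely the bidirectional advantage over GPT's left-to-right view.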
LLaMA: Open Source Revolution
Meta's LLaMA (Large Language Model Meta AI)
Open-source alternative to GPT, designed for efficiency
What Makes LLaMA Special?
- Open weights: Anyone can use/modify
- Efficient: Smaller but competitive
- Decoder-only: Like GPT architecture
- Research-friendly: Democratizes AI
LLaMA Family
- LLaMA 1: 7B, 13B, 33B, 65B (2023)
- LLaMA 2: 7B, 13B, 70B (2023)
- LLaMA 3: 8B, 70B, 405B (2024)
- Commercial use: Allowed!
Why Open Source Matters
Research
Scientists can study how models work
Cost
Run on your own hardware, no API fees
Customization
Fine-tune for specific domains
Impact: LLaMA sparked an explosion of open-source models: Alpaca, Vicuna, Mistral, and hundreds more!
Comparing Major LLM Types
| Feature | GPT | BERT | LLaMA |
|---|---|---|---|
| Architecture | Decoder-only | Encoder-only | Decoder-only |
| Attention | Causal (left-to-right) | Bidirectional | Causal |
| Best Use | Text generation | Understanding/Classification | Text generation |
| Training Task | Next token prediction | Masked language model | Next token prediction |
| Generation? | ✅ Excellent | ❌ Not designed for it | ✅ Excellent |
| Understanding? | ✅ Good | ✅ Excellent | ✅ Good |
| Open Source? | ❌ No (API only) | ✅ Yes | ✅ Yes |
| Example Apps | ChatGPT, Copilot | Search engines, QA | Ollama, local chatbots |
Quick Guide: Need generation? → GPT/LLaMA. Need understanding? → BERT. Need both? → Use different models for different tasks!
How LLMs Learn
Phase 1: Pre-training
Learn language from massive text
Billions of words, weeks of training
Phase 2: Fine-tuning
Adapt to specific tasks
Task-specific data, hours to days
Phase 3: Alignment (Optional)
Make it helpful, harmless, honest
Human feedback, days to weeks
Result: A powerful model that understands language AND follows instructions!
Pre-training: Learning Language
What is Pre-training?
Learning general language patterns from massive amounts of text WITHOUT specific task labels
Training Data
- Books, articles, websites
- Common Crawl (web scrape)
- Wikipedia, Reddit
- GitHub code (for coding models)
- Total: Trillions of tokens
Training Objective
- GPT: Predict next word
- BERT: Predict masked words
- Self-supervised (no labels needed)
- Learns grammar, facts, reasoning
The Scale of Pre-training (GPT-3)
- Parameters: 175B
- Tokens trained on: ~300B
- Estimated cost: several million dollars
Reality Check: Pre-training from scratch requires massive compute (thousands of GPUs) and millions of dollars. Most developers fine-tune existing models!
Fine-tuning: Specialization
What is Fine-tuning?
Taking a pre-trained model and adapting it to YOUR specific task with YOUR data
Types of Fine-tuning
Full Fine-tuning
- Update ALL model weights
- Most powerful but expensive
- Requires significant compute
- Risk of catastrophic forgetting
All 175B parameters change
Parameter-Efficient (LoRA, Adapters)
- Update SMALL subset of weights
- Much cheaper and faster
- Works on consumer GPUs
- Preserves general knowledge
Only ~1% of parameters change
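The LoRA trick is easy to see in matrix form: keep the big pre-trained weight frozen and learn a small low-rank update beside it. A minimal sketch (toy sizes; real models use d in the thousands):

```python
import numpy as np

d, r = 512, 8   # full dimension vs. low rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))         # frozen pre-trained weight matrix
A = rng.normal(size=(d, r)) * 0.01  # small trainable down-projection
B = np.zeros((r, d))                # trainable up-projection, starts at zero

# LoRA: the effective weight is W + A @ B; only A and B are trained.
# Since B starts at zero, training begins exactly at the pre-trained model.
W_effective = W + A @ B

full_params = W.size            # what a full fine-tune would update
lora_params = A.size + B.size   # what LoRA updates
print(f"LoRA trains {lora_params / full_params:.1%} of the parameters")
# ~3.1% at these toy sizes; at realistic dimensions it drops below 1%
```

The memory and compute savings come entirely from that ratio: two thin matrices stand in for one huge one, which is why LoRA fits on consumer GPUs.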
Fine-tuning Examples
Medical Chatbot
Base: PaLM
+ Medical Q&A dataset
= Med-PaLM
Code Assistant
Base: GPT-3
+ GitHub code
= Codex (the model behind GitHub Copilot)
Instruction Fine-tuning
Teaching Models to Follow Instructions
Pre-trained models complete text. Instruction-tuned models follow commands!
Before vs After Instruction Tuning
❌ Before (Base Model)
Prompt: "Translate to French: Hello"
Output: "Translate to Spanish: Hola Translate to German..."
Just continues the pattern!
✅ After (Instruction-Tuned)
Prompt: "Translate to French: Hello"
Output: "Bonjour"
Follows the instruction!
Training Data Format
Instruction: Summarize this article in 2 sentences
Input: [Long article text]
Output: [2-sentence summary]
Result: Models like ChatGPT, Claude, and others that helpfully respond to your requests!
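In code, one training example in the instruction/input/output format above is just a small record that gets flattened into a single training string. The exact field names and prompt template vary by dataset (the three keys below follow the common Alpaca-style convention):

```python
# One instruction-tuning example (Alpaca-style field names).
example = {
    "instruction": "Summarize this article in 2 sentences",
    "input": "[Long article text]",
    "output": "[2-sentence summary]",
}

# For training, the fields are flattened into one prompt string and the
# model learns to produce everything after "Response:".
prompt = (
    f"Instruction: {example['instruction']}\n"
    f"Input: {example['input']}\n"
    f"Response: {example['output']}"
)
print(prompt)
```

Thousands of such examples, covering many different instructions, are what turn a pattern-continuing base model into one that follows commands.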
RLHF: Learning from Human Feedback
Reinforcement Learning from Human Feedback
The secret sauce behind ChatGPT's helpfulness
The RLHF Process
Step 1: Collect Comparisons
Humans rank multiple model outputs: "Which response is better?"
Prompt: "Explain quantum computing"
Response A: ★★★★★ (clear, accurate)
Response B: ★★ (confusing, wrong)
Step 2: Train Reward Model
Learn to predict which outputs humans prefer
Step 3: Optimize with RL
Fine-tune model to maximize reward (human preference)
Step 4: Iterate
Collect more feedback, improve continuously
Impact: RLHF makes models helpful, harmless, and honest. It's why ChatGPT refuses harmful requests!
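Step 2 (training the reward model) typically uses a pairwise Bradley-Terry loss on the human rankings. A minimal sketch, assuming scalar reward scores for the chosen and rejected responses:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for a reward model.

    Low when the reward model scores the human-preferred response
    higher than the rejected one; high when it gets the order wrong.
    """
    # P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)
    margin = reward_chosen - reward_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# Reward model agrees with the human ranking -> small loss
print(preference_loss(2.0, -1.0))  # ~0.049
# Reward model disagrees -> large loss
print(preference_loss(-1.0, 2.0))  # ~3.049
```

Minimizing this loss over many ranked pairs teaches the reward model to predict human preferences, which the RL step (Step 3) then maximizes.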
Complete Training Pipeline
Pre-training
- Data: Trillions of tokens
- Time: Weeks to months
- Cost: Millions of dollars
→ Base Model
Supervised Fine-tuning
- Data: Thousands of examples
- Time: Hours to days
- Cost: Thousands of dollars
→ Task Model
RLHF
- Data: Human preferences
- Time: Days to weeks
- Cost: Tens of thousands of dollars
→ Aligned Model
What Can YOU Do?
❌ Usually Not Feasible
- Pre-training from scratch
- Training 100B+ models
- Full fine-tuning large models
✅ Totally Possible!
- LoRA fine-tuning
- Instruction tuning smaller models
- Using open-source models
- Prompt engineering
Key Takeaways
Transformers
- Self-attention mechanism
- Parallel processing
- Positional encoding
- Multi-head attention
LLM Types
- GPT: Generation master
- BERT: Understanding expert
- LLaMA: Open-source hero
- Choose based on task
Training
- Pre-training: Learn language
- Fine-tuning: Specialize
- RLHF: Align with humans
- You can fine-tune too!
The Big Picture
Transformers revolutionized NLP → led to GPT/BERT/LLaMA → trained on massive data → fine-tuned for specific tasks → aligned with human values → powers ChatGPT, Claude, and everything you'll build!
Real-World Impact
These Architectures Power:
Conversational AI
- ChatGPT (GPT-4)
- Claude (Transformer-based)
- Gemini (Google)
- Customer service bots
Code Generation
- GitHub Copilot
- Amazon CodeWhisperer
- Cursor AI
- Replit Ghostwriter
Search & Understanding
- Google Search (BERT)
- Bing Chat
- Perplexity AI
- Semantic search
Content Creation
- Jasper AI
- Copy.ai
- Notion AI
- Writing assistants
Your Future: With this knowledge, you can build the NEXT generation of AI applications!
Understanding Check
Quick Quiz: Test Your Knowledge
1. What's the main advantage of transformers over RNNs?
   Hint: Think about parallel processing
2. Why can't BERT generate text like GPT?
   Hint: Think about attention direction
3. What's the difference between pre-training and fine-tuning?
   Hint: Think about data and objectives
4. How does multi-head attention help?
   Hint: Different relationships
Discussion: We'll go through these together. No wrong answers - this is about understanding!
Mini Lab Activity
Exploring Attention in Action
Visit: https://transformer.huggingface.co/
What to Do:
1. Try different sentences
   Example: "The animal didn't cross the street because it was too tired."
2. Observe attention patterns
   Which words does "it" attend to most?
3. Experiment with ambiguous pronouns
   Does the model understand context correctly?
4. Compare different attention heads
   Do different heads focus on different relationships?
Goal: Develop intuition for how attention mechanisms actually work in practice!
Resources & Next Steps
Must-Read Papers
- "Attention Is All You Need" (2017) - The original transformer paper
- "BERT: Pre-training..." (2018) - Understanding BERT
- "Language Models are Few-Shot..." (2020) - GPT-3 paper
- "LLaMA: Open and Efficient..." (2023)
Interactive Resources
- The Illustrated Transformer (Jay Alammar's blog)
- Hugging Face Course (free NLP course)
- 3Blue1Brown (Attention video)
- Andrej Karpathy's YouTube
Coming Next
Unit 3: Prompt Engineering Fundamentals
- How to write effective prompts
- Zero-shot and few-shot learning
- Chain-of-thought prompting
- Building your first prompt library
Homework Assignment
Assignment: LLM Architecture Analysis
Due: Next class
Tasks:
1. Compare GPT vs BERT
   Create a diagram showing their architectural differences. Include attention mechanisms, training objectives, and best use cases.
2. Research a Specific Model
   Choose one: GPT-4, Claude, LLaMA 3, Gemini, or Mistral. Write 1 page about its architecture, training, and unique features.
3. Attention Visualization
   Use the Hugging Face transformer tool. Take screenshots of 3 interesting attention patterns and explain what they show.
4. Reflection (500 words)
   How might you use fine-tuning in your future projects? What domain would you specialize a model for?
Bonus: Try using an open-source model locally with Ollama. Document your experience!
Questions?
Let's Discuss!
Any questions about:
- Transformer architecture or attention?
- Differences between GPT, BERT, LLaMA?
- Pre-training vs fine-tuning?
- How to apply this in your projects?
- Anything else about LLMs?
Thank You!
You now understand how modern AI works!
This knowledge is the foundation for everything we'll build
Questions? Office hours or email me!
Complete the assignment!
Next: Prompt Engineering - where the magic happens!