🧠 Foundations of Large Language Models

Unit 2: How Modern AI Really Works

Understanding Transformers, Architectures & Training

📚 What We'll Learn Today

Journey Through Modern LLMs

2.1 Transformer Architecture & Attention

The revolutionary architecture that changed everything

2.2 Key LLM Types

GPT, BERT, LLaMA - understanding the big players

2.3 Training Methodologies

Pre-training, fine-tuning, and how models learn

💡 By the end: You'll understand how ChatGPT, Claude, and other LLMs actually work under the hood!

⏮️ The World Before Transformers

The Problem: Sequential Processing

๐ŸŒ RNNs & LSTMs (Pre-2017)

  • Process word by word
  • Can't parallelize
  • Forget long contexts
  • Slow to train

Word1 โ†’ Word2 โ†’ Word3 โ†’ Word4

(Must process in order)

โšก Transformers (2017+)

  • Process all words at once
  • Highly parallelizable
  • Remember everything
  • Fast to train

Word1 โŸท Word2 โŸท Word3 โŸท Word4

(All words see each other)

๐ŸŽฏ The Breakthrough: "Attention Is All You Need" paper (2017) - transformers became the foundation of all modern LLMs!

2.1 Transformer Architecture

๐Ÿ—๏ธ The Transformer Architecture

๐Ÿ“ฅ

Input Text

โ†“

Embedding Layer

Converts words to vectors

โ†“

Encoder / Decoder Layers

Self-Attention + Feed Forward

(Repeated 12-96 times)

โ†“

Output Layer

Predicts next token

โ†“
๐Ÿ“ค

Generated Text

2.1 Transformer Architecture

👀 What is "Attention"?

The Core Idea

Attention = "Which words should I focus on?"

Example: Understanding Context

"The animal didn't cross the street because it was too tired."

"it" pays attention to "animal" (not "street")

Attention weight: animal = 0.8, street = 0.1, tired = 0.1

Without Attention

Old models: "it" = just the previous word or a fixed distance

❌ Couldn't capture long-range dependencies

With Attention

Transformers: "it" = look at ALL previous words and decide which matter

✅ Understands context far better

2.1 Transformer Architecture

๐Ÿ” Self-Attention: The Magic Formula

How It Works (Simplified)

Step 1: Create Q, K, V

For each word, create 3 vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What info do I provide?"

Step 2: Calculate Scores

Compare Query with all Keys:

Score = (Q · K) / √d_k

Higher score = more relevant (dividing by √d_k keeps scores stable)

Step 3: Softmax

Convert scores to probabilities:

Attention Weights = softmax(Scores)

Sum to 1.0

Step 4: Weighted Sum

Combine Values using weights:

Output = Σ(Weights × Values)

Final representation

💡 Intuition: Every word asks "Who's relevant to me?" and combines info from those words!
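The four steps above can be sketched in a few lines of NumPy. This is a toy illustration with random weights, not a trained model; it also includes the √d_k scaling used in the real transformer formula:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Step 1: queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # Step 2: compare every Q with every K
    weights = softmax(scores, axis=-1)     # Step 3: each row sums to 1.0
    return weights @ V, weights            # Step 4: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)            # (4, 8): one context-aware vector per token
print(weights.sum(axis=1))  # each token's attention weights sum to 1
```

Each row of `weights` is one token's answer to "who's relevant to me?".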

2.1 Transformer Architecture

🎭 Multi-Head Attention

Why Multiple "Heads"?

Different heads learn different relationships!

Example: Analyzing "The cat sat on the mat"

Head 1: Subject-Verb

Focuses on:

cat ↔ sat

Learns: Who did the action?

Head 2: Prepositions

Focuses on:

sat ↔ on ↔ mat

Learns: Spatial relationships

Head 3: Determiners

Focuses on:

the ↔ cat

the ↔ mat

Learns: Article-noun pairs

🎯 Typical Setup: GPT-3 has 96 attention heads per layer! Each specializes in different linguistic patterns.
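The mechanical part of multi-head attention is simple: the embedding is sliced so each head attends over its own smaller subspace. A minimal sketch (the dimensions here are made up for illustration):

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (seq_len, d_model) into (n_heads, seq_len, d_head).

    Each head then runs its own attention over a smaller slice of the
    embedding, which is what lets different heads specialize in
    different relationships (subject-verb, prepositions, ...).
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

X = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 6 tokens, d_model = 8
heads = split_heads(X, n_heads=4)                # 4 heads of d_head = 2
print(heads.shape)  # (4, 6, 2)
```

After each head computes its own attention output, the slices are concatenated back to `d_model` and passed through one more linear layer.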

2.1 Transformer Architecture

๐Ÿ“ Positional Encoding: Word Order Matters

The Problem

Attention alone has NO sense of order!

"Dog bites man" and "Man bites dog" would look the same!

The Solution: Add Position Information

Without Position

Word: [0.2, 0.5, 0.8]

Position: ???

With Position Encoding

Word: [0.2, 0.5, 0.8]

Position: [0.1, 0.0, 0.3]

Final: [0.3, 0.5, 1.1]

How It Works

  • Sinusoidal Encoding: Use sine/cosine waves of different frequencies
  • Learned Encoding: Let the model learn position patterns
  • Result: Each position gets a unique "signature"

💡 Key Insight: Position encoding lets transformers understand "first word", "last word", "middle word" etc.
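The sinusoidal variant can be sketched directly from the original paper's formula (sequence length and dimension chosen arbitrarily for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dims get sine
    pe[:, 1::2] = np.cos(angles)            # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# Each row is one position's unique "signature"; it is simply ADDED to
# that position's word embedding before the first attention layer.
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0: alternating 0 and 1 (sin 0, cos 0)
```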

2.1 Transformer Architecture

🧱 Complete Transformer Block

Input Embeddings + Positional Encoding

↓

Multi-Head Self-Attention

Look at all other tokens

↓

Add & Normalize

Residual connection + Layer Norm

↓

Feed Forward Network

2-layer neural network

↓

Add & Normalize

Another residual + Layer Norm

↓

Output (to next layer)

🔄 This block repeats 12-96 times! GPT-3: 96 layers, GPT-4: estimated 120+ layers
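The diagram above maps onto a short forward pass. This sketch uses a placeholder identity function where real multi-head attention would go, just to make the Add & Normalize structure visible:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    """One block: attention then feed-forward, each wrapped in a
    residual connection ('Add') followed by layer norm ('Normalize')."""
    x = layer_norm(x + attn(x))  # Multi-Head Self-Attention + Add & Normalize
    x = layer_norm(x + ffn(x))   # Feed Forward Network + Add & Normalize
    return x

# Toy stand-ins: identity "attention" and a tiny 2-layer feed-forward net
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
ffn = lambda x: np.maximum(x @ W1, 0) @ W2   # linear -> ReLU -> linear
attn = lambda x: x                           # placeholder for real attention

x = rng.normal(size=(4, 8))
out = transformer_block(x, attn, ffn)
print(out.shape)  # (4, 8): same shape in and out, so blocks stack cleanly
```

The shape-preserving design is what makes "repeat 12-96 times" possible: each block's output feeds the next block unchanged in shape.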

2.2 Key LLM Types

๐ŸŒ The LLM Landscape

Three Main Architecture Families

Encoder-Only

Example: BERT

Best for:

  • Classification
  • Understanding text
  • Question answering
  • Named entity recognition

Decoder-Only

Example: GPT, LLaMA

Best for:

  • Text generation
  • Creative writing
  • Code generation
  • Conversations

Encoder-Decoder

Example: T5, BART

Best for:

  • Translation
  • Summarization
  • Text-to-text tasks
  • Question generation

💡 Today's Focus: We'll dive deep into the most popular ones used in GenAI applications!

2.2 Key LLM Types

🤖 GPT: Generative Pre-trained Transformer

Core Idea: Predict the Next Word

Given: "The cat sat on the ___"

Predict: "mat" (or "chair", "floor", etc.)

Architecture

  • Type: Decoder-only
  • Attention: Causal (left-to-right)
  • Training: Next token prediction
  • Direction: Forward only

Key Features

  • Autoregressive generation
  • Can't see future words
  • Excellent at generation
  • Zero-shot learning
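Autoregressive generation is just a loop: predict the next word, append it, repeat. This toy uses a hand-written word table in place of a real transformer, but the loop has the same shape:

```python
# Toy "model": for each word, a probability distribution over next words.
# Real GPT computes these probabilities with a transformer over the whole
# context; the surrounding generation loop looks just like this.
next_word_probs = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"on": 1.0},
    "on":  {"the": 1.0},
}

def generate(prompt, max_new_tokens=4):
    words = prompt.split()
    for _ in range(max_new_tokens):
        last = words[-1]
        if last not in next_word_probs:
            break  # no known continuation: stop generating
        probs = next_word_probs[last]
        # Greedy decoding: always pick the highest-probability next word
        words.append(max(probs, key=probs.get))
    return " ".join(words)

print(generate("the"))  # -> "the cat sat on the"
```

Note the causal constraint baked into the loop: each prediction can only use words already generated, never future ones.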

The GPT Evolution

Model   Parameters   Year   Key Feature
GPT-1   117M         2018   Proof of concept
GPT-2   1.5B         2019   "Too dangerous to release"
GPT-3   175B         2020   Few-shot learning breakthrough
GPT-4   ~1.7T*       2023   Multimodal, reasoning

*Estimated, not officially confirmed

2.2 Key LLM Types

📖 BERT: Bidirectional Encoder Representations

Core Idea: Fill in the Blank

Given: "The cat [MASK] on the mat"

Predict: "sat"

Can see words BEFORE and AFTER the blank!

Architecture

  • Type: Encoder-only
  • Attention: Bidirectional
  • Training: Masked Language Model
  • Direction: Both ways ⟷

Key Features

  • Sees full context
  • Better understanding
  • Not for generation
  • Great for classification

GPT vs BERT: The Key Difference

GPT (Causal Attention)

"The cat sat on the"

Can only look ← left

Predicts next word

BERT (Bidirectional)

"The cat [MASK] on the mat"

Looks ← left and right →

Understands full sentence
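The difference can be made concrete with attention masks: a lower-triangular matrix for GPT-style causal attention versus an all-ones matrix for BERT-style bidirectional attention (a sketch with a 5-token sequence):

```python
import numpy as np

seq_len = 5  # e.g. "The cat sat on the"

# GPT-style causal mask: token i may attend only to positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every token sees every other token
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.astype(int))
# Row 2 ("sat") can see "The", "cat", "sat" -- nothing to its right.
# That's why GPT can generate: it never peeks at the future tokens
# it is supposed to predict.
print(bidirectional.astype(int))
# All ones: full left-and-right context, great for understanding,
# but useless for next-token prediction (the answer is visible).
```

In practice the mask is applied by setting disallowed attention scores to -infinity before the softmax, so their weights become zero.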

2.2 Key LLM Types

🦙 LLaMA: Open Source Revolution

Meta's LLaMA (Large Language Model Meta AI)

Open-source alternative to GPT, designed for efficiency

What Makes LLaMA Special?

  • Open weights: Anyone can use/modify
  • Efficient: Smaller but competitive
  • Decoder-only: Like GPT architecture
  • Research-friendly: Democratizes AI

LLaMA Family

  • LLaMA 1: 7B, 13B, 33B, 65B (2023)
  • LLaMA 2: 7B, 13B, 70B (2023)
  • LLaMA 3: 8B, 70B (2024); the 405B model arrived with LLaMA 3.1
  • Commercial use: Allowed!

Why Open Source Matters

🔬 Research

Scientists can study how models work

💰 Cost

Run on your own hardware, no API fees

🛠️ Customization

Fine-tune for specific domains

🎯 Impact: LLaMA sparked an explosion of open-source models: Alpaca, Vicuna, Mistral, and hundreds more!

2.2 Key LLM Types

โš–๏ธ Comparing Major LLM Types

Feature GPT BERT LLaMA
Architecture Decoder-only Encoder-only Decoder-only
Attention Causal (left-to-right) Bidirectional Causal
Best Use Text generation Understanding/Classification Text generation
Training Task Next token prediction Masked language model Next token prediction
Generation? โœ… Excellent โŒ Not designed for it โœ… Excellent
Understanding? โœ… Good โœ… Excellent โœ… Good
Open Source? โŒ No (API only) โœ… Yes โœ… Yes
Example Apps ChatGPT, Copilot Search engines, QA Ollama, local chatbots

๐Ÿ’ก Quick Guide: Need generation? โ†’ GPT/LLaMA. Need understanding? โ†’ BERT. Need both? โ†’ Use different models for different tasks!

2.3 Training Methodologies

🎓 How LLMs Learn

Phase 1: Pre-training

Learn language from massive text

Billions of words, weeks of training

↓

Phase 2: Fine-tuning

Adapt to specific tasks

Task-specific data, hours to days

↓

Phase 3: Alignment (Optional)

Make it helpful, harmless, honest

Human feedback, days to weeks

🎯 Result: A powerful model that understands language AND follows instructions!

2.3 Training Methodologies

📚 Pre-training: Learning Language

What is Pre-training?

Learning general language patterns from massive amounts of text WITHOUT specific task labels

Training Data

  • Books, articles, websites
  • Common Crawl (web scrape)
  • Wikipedia, Reddit
  • GitHub code (for coding models)
  • Total: Trillions of tokens

Training Objective

  • GPT: Predict next word
  • BERT: Predict masked words
  • Self-supervised (no labels needed)
  • Learns grammar, facts, reasoning

The Scale of Pre-training

175B

Parameters (GPT-3)

300B

Tokens trained on

$4.6M

Estimated cost (GPT-3)

โš ๏ธ Reality Check: Pre-training from scratch requires massive compute (thousands of GPUs) and millions of dollars. Most developers fine-tune existing models!

2.3 Training Methodologies

🎯 Fine-tuning: Specialization

What is Fine-tuning?

Taking a pre-trained model and adapting it to YOUR specific task with YOUR data

Types of Fine-tuning

Full Fine-tuning

  • Update ALL model weights
  • Most powerful but expensive
  • Requires significant compute
  • Risk of catastrophic forgetting

🔄 All 175B parameters change

Parameter-Efficient (LoRA, Adapters)

  • Update SMALL subset of weights
  • Much cheaper and faster
  • Works on consumer GPUs
  • Preserves general knowledge

🎯 Only ~1% of parameters change
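The parameter math behind LoRA can be sketched in NumPy. Sizes here are illustrative, and real LoRA also scales the update and trains A and B by gradient descent; this only shows the low-rank idea:

```python
import numpy as np

# LoRA idea: freeze the big pre-trained weight W and learn a low-rank
# update B @ A instead. Only the two thin matrices A and B are trained.
d, rank = 512, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))             # frozen pre-trained weight
A = rng.normal(size=(rank, d)) * 0.01   # trainable, small random init
B = np.zeros((d, rank))                 # trainable, initialized to zero

def lora_forward(x):
    # Effective weight is W + B @ A, computed without ever materializing
    # a second full (d x d) matrix -- just the two thin ones.
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(3, d))
# With B at zero the LoRA update is a no-op, so fine-tuning starts
# exactly from the pre-trained model's behaviour.
print(np.allclose(lora_forward(x), x @ W.T))  # True

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

The trainable fraction here is 2·rank/d, which is why LoRA fits on consumer GPUs while the full weight matrix stays frozen.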

Fine-tuning Examples

Medical Chatbot

Base: PaLM

+ Medical Q&A dataset

= Med-PaLM

Code Assistant

Base: GPT-3

+ GitHub code

= Codex (powers GitHub Copilot)

2.3 Training Methodologies

๐Ÿ“ Instruction Fine-tuning

Teaching Models to Follow Instructions

Pre-trained models complete text. Instruction-tuned models follow commands!

Before vs After Instruction Tuning

โŒ Before (Base Model)

Prompt: "Translate to French: Hello"

Output: "Translate to Spanish: Hola Translate to German..."

Just continues the pattern!

✅ After (Instruction-Tuned)

Prompt: "Translate to French: Hello"

Output: "Bonjour"

Follows the instruction!

Training Data Format

Instruction: Summarize this article in 2 sentences
Input: [Long article text]
Output: [2-sentence summary]

🎯 Result: Models like ChatGPT, Claude, and others that helpfully respond to your requests!

2.3 Training Methodologies

🎭 RLHF: Learning from Human Feedback

Reinforcement Learning from Human Feedback

The secret sauce behind ChatGPT's helpfulness

The RLHF Process

Step 1: Collect Comparisons

Humans rank multiple model outputs: "Which response is better?"

Prompt: "Explain quantum computing"

Response A: ⭐⭐⭐⭐⭐ (clear, accurate)

Response B: ⭐⭐ (confusing, wrong)

Step 2: Train Reward Model

Learn to predict which outputs humans prefer

Step 3: Optimize with RL

Fine-tune model to maximize reward (human preference)

Step 4: Iterate

Collect more feedback, improve continuously
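Step 2's reward model is typically trained with a pairwise (Bradley-Terry-style) loss on the human comparisons, sketched here with made-up reward scores:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Pushes the reward model to score the human-preferred response
    above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Reward model already ranks the preferred response higher: tiny loss
print(preference_loss(reward_chosen=2.0, reward_rejected=-1.0))  # ~0.049
# Reward model has it backwards: large loss, strong training signal
print(preference_loss(reward_chosen=-1.0, reward_rejected=2.0))  # ~3.049
```

In Step 3 the language model is then fine-tuned (commonly with PPO) to produce outputs this reward model scores highly.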

🎯 Impact: RLHF makes models helpful, harmless, and honest. It's why ChatGPT refuses harmful requests!

2.3 Training Methodologies

๐Ÿญ Complete Training Pipeline

Pre-training

๐Ÿ“š Trillions of tokens

โฑ๏ธ Weeks/months

๐Ÿ’ฐ $Millions

โ†’ Base Model

โ†’

Supervised Fine-tuning

๐Ÿ“ Thousands of examples

โฑ๏ธ Hours/days

๐Ÿ’ฐ $Thousands

โ†’ Task Model

โ†’

RLHF

๐Ÿ‘ฅ Human preferences

โฑ๏ธ Days/weeks

๐Ÿ’ฐ $Tens of thousands

โ†’ Aligned Model

What Can YOU Do?

โŒ Usually Not Feasible

  • Pre-training from scratch
  • Training 100B+ models
  • Full fine-tuning large models

✅ Totally Possible!

  • LoRA fine-tuning
  • Instruction tuning smaller models
  • Using open-source models
  • Prompt engineering

🎯 Key Takeaways

🏗️ Transformers

  • Self-attention mechanism
  • Parallel processing
  • Positional encoding
  • Multi-head attention

🤖 LLM Types

  • GPT: Generation master
  • BERT: Understanding expert
  • LLaMA: Open-source hero
  • Choose based on task

🎓 Training

  • Pre-training: Learn language
  • Fine-tuning: Specialize
  • RLHF: Align with humans
  • You can fine-tune too!

🔥 The Big Picture

Transformers revolutionized NLP → Led to GPT/BERT/LLaMA → Trained on massive data → Fine-tuned for specific tasks → Aligned with human values → Powers ChatGPT, Claude, and everything you'll build!

๐ŸŒ Real-World Impact

These Architectures Power:

๐Ÿ’ฌ Conversational AI

  • ChatGPT (GPT-4)
  • Claude (Transformer-based)
  • Gemini (Google)
  • Customer service bots

๐Ÿ’ป Code Generation

  • GitHub Copilot
  • Amazon CodeWhisperer
  • Cursor AI
  • Replit Ghostwriter

๐Ÿ” Search & Understanding

  • Google Search (BERT)
  • Bing Chat
  • Perplexity AI
  • Semantic search

๐Ÿ“ Content Creation

  • Jasper AI
  • Copy.ai
  • Notion AI
  • Writing assistants

๐ŸŽฏ Your Future: With this knowledge, you can build the NEXT generation of AI applications!

💪 Understanding Check

Quick Quiz: Test Your Knowledge

  1. What's the main advantage of transformers over RNNs?

    Hint: Think about parallel processing

  2. Why can't BERT generate text like GPT?

    Hint: Think about attention direction

  3. What's the difference between pre-training and fine-tuning?

    Hint: Think about data and objectives

  4. How does multi-head attention help?

    Hint: Different relationships

💡 Discussion: We'll go through these together. No wrong answers—this is about understanding!

🔬 Mini Lab Activity

Exploring Attention in Action

Visit: https://transformer.huggingface.co/

What to Do:

  1. Try different sentences

    Example: "The animal didn't cross the street because it was too tired."

  2. Observe attention patterns

    Which words does "it" attend to most?

  3. Experiment with ambiguous pronouns

    Does the model understand context correctly?

  4. Compare different attention heads

    Do different heads focus on different relationships?

🎯 Goal: Develop intuition for how attention mechanisms actually work in practice!

📚 Resources & Next Steps

📖 Must-Read Papers

  • "Attention Is All You Need" (2017) - The original transformer paper
  • "BERT: Pre-training..." (2018) - Understanding BERT
  • "Language Models are Few-Shot..." (2020) - GPT-3 paper
  • "LLaMA: Open and Efficient..." (2023)

🎯 Interactive Resources

  • The Illustrated Transformer (Jay Alammar's blog)
  • Hugging Face Course (free NLP course)
  • 3Blue1Brown (Attention video)
  • Andrej Karpathy's YouTube

📅 Coming Next

🎨 Unit 3: Prompt Engineering Fundamentals

  • How to write effective prompts
  • Zero-shot and few-shot learning
  • Chain-of-thought prompting
  • Building your first prompt library

๐Ÿ“ Homework Assignment

Assignment: LLM Architecture Analysis

Due: Next class

Tasks:

  1. Compare GPT vs BERT

    Create a diagram showing their architectural differences. Include attention mechanisms, training objectives, and best use cases.

  2. Research a Specific Model

    Choose one: GPT-4, Claude, LLaMA 3, Gemini, or Mistral. Write 1 page about its architecture, training, and unique features.

  3. Attention Visualization

    Use the Hugging Face transformer tool. Take screenshots of 3 interesting attention patterns and explain what they show.

  4. Reflection (500 words)

    How might you use fine-tuning in your future projects? What domain would you specialize a model for?

💡 Bonus: Try using an open-source model locally with Ollama. Document your experience!

โ“ Questions?

Let's Discuss!

Any questions about:

  • Transformer architecture or attention?
  • Differences between GPT, BERT, LLaMA?
  • Pre-training vs fine-tuning?
  • How to apply this in your projects?
  • Anything else about LLMs?

Thank You! 🎉

You now understand how modern AI works!

This knowledge is the foundation for everything we'll build

📧 Questions? Office hours or email me!

🔬 Complete the assignment!

🎨 Next: Prompt Engineering - where the magic happens!
