🧠 Foundations of Large Language Models

Unit 2: How Modern AI Really Works

Understanding Transformers, Architectures & Training

📚 What We'll Learn Today

Journey Through Modern LLMs

2.1 Transformer Architecture & Attention

The revolutionary architecture that changed everything

2.2 Key LLM Types

GPT, BERT, LLaMA - understanding the big players

2.3 Training Methodologies

Pre-training, fine-tuning, and how models learn

💡 By the end: You'll understand how ChatGPT, Claude, and other LLMs actually work under the hood!

⏮️ The World Before Transformers

The Problem: Sequential Processing

๐ŸŒ RNNs & LSTMs (Pre-2017)

  • Process word by word
  • Can't parallelize
  • Forget long contexts
  • Slow to train

Word1 โ†’ Word2 โ†’ Word3 โ†’ Word4

(Must process in order)

โšก Transformers (2017+)

  • Process all words at once
  • Highly parallelizable
  • Remember everything
  • Fast to train

Word1 โŸท Word2 โŸท Word3 โŸท Word4

(All words see each other)

๐ŸŽฏ The Breakthrough: "Attention Is All You Need" paper (2017) - transformers became the foundation of all modern LLMs!

2.1 Transformer Architecture

๐Ÿ—๏ธ The Transformer Architecture

๐Ÿ“ฅ

Input Text

โ†“

Embedding Layer

Converts words to vectors

โ†“

Encoder / Decoder Layers

Self-Attention + Feed Forward

(Repeated 12-96 times)

โ†“

Output Layer

Predicts next token

โ†“
๐Ÿ“ค

Generated Text

2.1 Transformer Architecture

👀 What is "Attention"?

The Core Idea

Attention = "Which words should I focus on?"

Example: Understanding Context

"The animal didn't cross the street because it was too tired."

"it" pays attention to "animal" (not "street")

Attention weight: animal = 0.8, street = 0.1, tired = 0.1

Without Attention

Old models: "it" = just the previous word or a fixed distance

❌ Couldn't capture long-range dependencies

With Attention

Transformers: "it" = look at ALL previous words and decide which matter

✅ Understands context far better

2.1 Transformer Architecture

๐Ÿ” Self-Attention: The Magic Formula

How It Works (Simplified)

Step 1: Create Q, K, V

For each word, create 3 vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What info do I provide?"

Step 2: Calculate Scores

Compare Query with all Keys:

Score = (Q · K) / √d_k

Higher score = more relevant (dividing by √d_k keeps scores stable)

Step 3: Softmax

Convert scores to probabilities:

Attention Weights = softmax(Scores)

Sum to 1.0

Step 4: Weighted Sum

Combine Values using weights:

Output = Σ(Weights × Values)

Final representation

💡 Intuition: Every word asks "Who's relevant to me?" and combines info from those words!
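The four steps above can be sketched in a few lines of NumPy. This is a toy illustration with random weights, not a trained model; it also includes the √d_k scaling used in the real transformer formula:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Step 1: queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # Step 2: compare every Q with every K
    weights = softmax(scores, axis=-1)     # Step 3: each row sums to 1.0
    return weights @ V, weights            # Step 4: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)            # (4, 8): one context-aware vector per token
print(weights.sum(axis=1))  # each token's attention weights sum to 1
```

Each row of `weights` is one token's answer to "who's relevant to me?".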

2.1 Transformer Architecture

🎭 Multi-Head Attention

Why Multiple "Heads"?

Different heads learn different relationships!

Example: Analyzing "The cat sat on the mat"

Head 1: Subject-Verb

Focuses on:

cat ↔ sat

Learns: Who did the action?

Head 2: Prepositions

Focuses on:

sat ↔ on ↔ mat

Learns: Spatial relationships

Head 3: Determiners

Focuses on:

the ↔ cat

the ↔ mat

Learns: Article-noun pairs

🎯 Typical Setup: GPT-3 has 96 attention heads per layer! Each specializes in different linguistic patterns.
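The mechanical part of multi-head attention is simple: the embedding is sliced so each head attends over its own smaller subspace. A minimal sketch (the dimensions here are made up for illustration):

```python
import numpy as np

def split_heads(X, n_heads):
    """Reshape (seq_len, d_model) into (n_heads, seq_len, d_head).

    Each head then runs its own attention over a smaller slice of the
    embedding, which is what lets different heads specialize in
    different relationships (subject-verb, prepositions, ...).
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    return X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

X = np.arange(6 * 8, dtype=float).reshape(6, 8)  # 6 tokens, d_model = 8
heads = split_heads(X, n_heads=4)                # 4 heads of d_head = 2
print(heads.shape)  # (4, 6, 2)
```

After each head computes its own attention output, the slices are concatenated back to `d_model` and passed through one more linear layer.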

2.1 Transformer Architecture

๐Ÿ“ Positional Encoding: Word Order Matters

The Problem

Attention alone has NO sense of order!

"Dog bites man" and "Man bites dog" would look the same!

The Solution: Add Position Information

Without Position

Word: [0.2, 0.5, 0.8]

Position: ???

With Position Encoding

Word: [0.2, 0.5, 0.8]

Position: [0.1, 0.0, 0.3]

Final: [0.3, 0.5, 1.1]

How It Works

  • Sinusoidal Encoding: Use sine/cosine waves of different frequencies
  • Learned Encoding: Let the model learn position patterns
  • Result: Each position gets a unique "signature"

💡 Key Insight: Position encoding lets transformers understand "first word", "last word", "middle word" etc.
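The sinusoidal variant can be sketched directly from the original paper's formula (sequence length and dimension chosen arbitrarily for illustration):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dims get sine
    pe[:, 1::2] = np.cos(angles)            # odd dims get cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=16)
# Each row is one position's unique "signature"; it is simply ADDED to
# that position's word embedding before the first attention layer.
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0: alternating 0 and 1 (sin 0, cos 0)
```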

2.1 Transformer Architecture

🧱 Complete Transformer Block

Input Embeddings + Positional Encoding

↓

Multi-Head Self-Attention

Look at all other tokens

↓

Add & Normalize

Residual connection + Layer Norm

↓

Feed Forward Network

2-layer neural network

↓

Add & Normalize

Another residual + Layer Norm

↓

Output (to next layer)

🔄 This block repeats 12-96 times! GPT-3: 96 layers, GPT-4: estimated 120+ layers
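The diagram above maps onto a short forward pass. This sketch uses a placeholder identity function where real multi-head attention would go, just to make the Add & Normalize structure visible:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    """One block: attention then feed-forward, each wrapped in a
    residual connection ('Add') followed by layer norm ('Normalize')."""
    x = layer_norm(x + attn(x))  # Multi-Head Self-Attention + Add & Normalize
    x = layer_norm(x + ffn(x))   # Feed Forward Network + Add & Normalize
    return x

# Toy stand-ins: identity "attention" and a tiny 2-layer feed-forward net
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
ffn = lambda x: np.maximum(x @ W1, 0) @ W2   # linear -> ReLU -> linear
attn = lambda x: x                           # placeholder for real attention

x = rng.normal(size=(4, 8))
out = transformer_block(x, attn, ffn)
print(out.shape)  # (4, 8): same shape in and out, so blocks stack cleanly
```

The shape-preserving design is what makes "repeat 12-96 times" possible: each block's output feeds the next block unchanged in shape.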

2.2 Key LLM Types

๐ŸŒ The LLM Landscape

Three Main Architecture Families

Encoder-Only

Example: BERT

Best for:

  • Classification
  • Understanding text
  • Question answering
  • Named entity recognition

Decoder-Only

Example: GPT, LLaMA

Best for:

  • Text generation
  • Creative writing
  • Code generation
  • Conversations

Encoder-Decoder

Example: T5, BART

Best for:

  • Translation
  • Summarization
  • Text-to-text tasks
  • Question generation

💡 Today's Focus: We'll dive deep into the most popular ones used in GenAI applications!

2.2 Key LLM Types

🤖 GPT: Generative Pre-trained Transformer

Core Idea: Predict the Next Word

Given: "The cat sat on the ___"

Predict: "mat" (or "chair", "floor", etc.)

Architecture

  • Type: Decoder-only
  • Attention: Causal (left-to-right)
  • Training: Next token prediction
  • Direction: Forward only

Key Features

  • Autoregressive generation
  • Can't see future words
  • Excellent at generation
  • Zero-shot learning
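Autoregressive generation is just a loop: predict the next word, append it, repeat. This toy uses a hand-written word table in place of a real transformer, but the loop has the same shape:

```python
# Toy "model": for each word, a probability distribution over next words.
# Real GPT computes these probabilities with a transformer over the whole
# context; the surrounding generation loop looks just like this.
next_word_probs = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"on": 1.0},
    "on":  {"the": 1.0},
}

def generate(prompt, max_new_tokens=4):
    words = prompt.split()
    for _ in range(max_new_tokens):
        last = words[-1]
        if last not in next_word_probs:
            break  # no known continuation: stop generating
        probs = next_word_probs[last]
        # Greedy decoding: always pick the highest-probability next word
        words.append(max(probs, key=probs.get))
    return " ".join(words)

print(generate("the"))  # -> "the cat sat on the"
```

Note the causal constraint baked into the loop: each prediction can only use words already generated, never future ones.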

The GPT Evolution

Model   Parameters   Year   Key Feature
GPT-1   117M         2018   Proof of concept
GPT-2   1.5B         2019   "Too dangerous to release"
GPT-3   175B         2020   Few-shot learning breakthrough
GPT-4   ~1.7T*       2023   Multimodal, reasoning

*Estimated, not officially confirmed

2.2 Key LLM Types

📖 BERT: Bidirectional Encoder Representations

Core Idea: Fill in the Blank

Given: "The cat [MASK] on the mat"

Predict: "sat"

Can see words BEFORE and AFTER the blank!

Architecture

  • Type: Encoder-only
  • Attention: Bidirectional
  • Training: Masked Language Model
  • Direction: Both ways ⟷

Key Features

  • Sees full context
  • Better understanding
  • Not for generation
  • Great for classification

GPT vs BERT: The Key Difference

GPT (Causal Attention)

"The cat sat on the"

Can only look ← left

Predicts next word

BERT (Bidirectional)

"The cat [MASK] on the mat"

Looks ← left and right →

Understands full sentence
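The difference can be made concrete with attention masks: a lower-triangular matrix for GPT-style causal attention versus an all-ones matrix for BERT-style bidirectional attention (a sketch with a 5-token sequence):

```python
import numpy as np

seq_len = 5  # e.g. "The cat sat on the"

# GPT-style causal mask: token i may attend only to positions <= i
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every token sees every other token
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

print(causal.astype(int))
# Row 2 ("sat") can see "The", "cat", "sat" -- nothing to its right.
# That's why GPT can generate: it never peeks at the future tokens
# it is supposed to predict.
print(bidirectional.astype(int))
# All ones: full left-and-right context, great for understanding,
# but useless for next-token prediction (the answer is visible).
```

In practice the mask is applied by setting disallowed attention scores to -infinity before the softmax, so their weights become zero.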

2.2 Key LLM Types

🦙 LLaMA: Open Source Revolution

Meta's LLaMA (Large Language Model Meta AI)

Open-source alternative to GPT, designed for efficiency

What Makes LLaMA Special?

  • Open weights: Anyone can use/modify
  • Efficient: Smaller but competitive
  • Decoder-only: Like GPT architecture
  • Research-friendly: Democratizes AI

LLaMA Family

  • LLaMA 1: 7B, 13B, 33B, 65B (2023)
  • LLaMA 2: 7B, 13B, 70B (2023)
  • LLaMA 3: 8B, 70B (2024); the 405B model arrived with LLaMA 3.1
  • Commercial use: Allowed!

Why Open Source Matters

🔬 Research

Scientists can study how models work

💰 Cost

Run on your own hardware, no API fees

🛠️ Customization

Fine-tune for specific domains

🎯 Impact: LLaMA sparked an explosion of open-source models: Alpaca, Vicuna, Mistral, and hundreds more!

2.2 Key LLM Types

โš–๏ธ Comparing Major LLM Types

Feature GPT BERT LLaMA
Architecture Decoder-only Encoder-only Decoder-only
Attention Causal (left-to-right) Bidirectional Causal
Best Use Text generation Understanding/Classification Text generation
Training Task Next token prediction Masked language model Next token prediction
Generation? โœ… Excellent โŒ Not designed for it โœ… Excellent
Understanding? โœ… Good โœ… Excellent โœ… Good
Open Source? โŒ No (API only) โœ… Yes โœ… Yes
Example Apps ChatGPT, Copilot Search engines, QA Ollama, local chatbots

๐Ÿ’ก Quick Guide: Need generation? โ†’ GPT/LLaMA. Need understanding? โ†’ BERT. Need both? โ†’ Use different models for different tasks!

2.3 Training Methodologies

🎓 How LLMs Learn

Phase 1: Pre-training

Learn language from massive text

Billions of words, weeks of training

↓

Phase 2: Fine-tuning

Adapt to specific tasks

Task-specific data, hours to days

↓

Phase 3: Alignment (Optional)

Make it helpful, harmless, honest

Human feedback, days to weeks

🎯 Result: A powerful model that understands language AND follows instructions!

2.3 Training Methodologies

📚 Pre-training: Learning Language

What is Pre-training?

Learning general language patterns from massive amounts of text WITHOUT specific task labels

Training Data

  • Books, articles, websites
  • Common Crawl (web scrape)
  • Wikipedia, Reddit
  • GitHub code (for coding models)
  • Total: Trillions of tokens

Training Objective

  • GPT: Predict next word
  • BERT: Predict masked words
  • Self-supervised (no labels needed)
  • Learns grammar, facts, reasoning

The Scale of Pre-training

175B

Parameters (GPT-3)

300B

Tokens trained on

$4.6M

Estimated cost (GPT-3)

โš ๏ธ Reality Check: Pre-training from scratch requires massive compute (thousands of GPUs) and millions of dollars. Most developers fine-tune existing models!

2.3 Training Methodologies

🎯 Fine-tuning: Specialization

What is Fine-tuning?

Taking a pre-trained model and adapting it to YOUR specific task with YOUR data

Types of Fine-tuning

Full Fine-tuning

  • Update ALL model weights
  • Most powerful but expensive
  • Requires significant compute
  • Risk of catastrophic forgetting

🔄 All 175B parameters change

Parameter-Efficient (LoRA, Adapters)

  • Update SMALL subset of weights
  • Much cheaper and faster
  • Works on consumer GPUs
  • Preserves general knowledge

🎯 Only ~1% of parameters change
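The parameter math behind LoRA can be sketched in NumPy. Sizes here are illustrative, and real LoRA also scales the update and trains A and B by gradient descent; this only shows the low-rank idea:

```python
import numpy as np

# LoRA idea: freeze the big pre-trained weight W and learn a low-rank
# update B @ A instead. Only the two thin matrices A and B are trained.
d, rank = 512, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))             # frozen pre-trained weight
A = rng.normal(size=(rank, d)) * 0.01   # trainable, small random init
B = np.zeros((d, rank))                 # trainable, initialized to zero

def lora_forward(x):
    # Effective weight is W + B @ A, computed without ever materializing
    # a second full (d x d) matrix -- just the two thin ones.
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(3, d))
# With B at zero the LoRA update is a no-op, so fine-tuning starts
# exactly from the pre-trained model's behaviour.
print(np.allclose(lora_forward(x), x @ W.T))  # True

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

The trainable fraction here is 2·rank/d, which is why LoRA fits on consumer GPUs while the full weight matrix stays frozen.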

Fine-tuning Examples

Medical Chatbot

Base: PaLM

+ Medical Q&A dataset

= Med-PaLM

Code Assistant

Base: GPT-3

+ GitHub code

= Codex (powers GitHub Copilot)

2.3 Training Methodologies

๐Ÿ“ Instruction Fine-tuning

Teaching Models to Follow Instructions

Pre-trained models complete text. Instruction-tuned models follow commands!

Before vs After Instruction Tuning

โŒ Before (Base Model)

Prompt: "Translate to French: Hello"

Output: "Translate to Spanish: Hola Translate to German..."

Just continues the pattern!

✅ After (Instruction-Tuned)

Prompt: "Translate to French: Hello"

Output: "Bonjour"

Follows the instruction!

Training Data Format

Instruction: Summarize this article in 2 sentences
Input: [Long article text]
Output: [2-sentence summary]

🎯 Result: Models like ChatGPT, Claude, and others that helpfully respond to your requests!

2.3 Training Methodologies

🎭 RLHF: Learning from Human Feedback

Reinforcement Learning from Human Feedback

The secret sauce behind ChatGPT's helpfulness

The RLHF Process

Step 1: Collect Comparisons

Humans rank multiple model outputs: "Which response is better?"

Prompt: "Explain quantum computing"

Response A: ⭐⭐⭐⭐⭐ (clear, accurate)

Response B: ⭐⭐ (confusing, wrong)

Step 2: Train Reward Model

Learn to predict which outputs humans prefer

Step 3: Optimize with RL

Fine-tune model to maximize reward (human preference)

Step 4: Iterate

Collect more feedback, improve continuously
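Step 2's reward model is typically trained with a pairwise (Bradley-Terry-style) loss on the human comparisons, sketched here with made-up reward scores:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Pushes the reward model to score the human-preferred response
    above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Reward model already ranks the preferred response higher: tiny loss
print(preference_loss(reward_chosen=2.0, reward_rejected=-1.0))  # ~0.049
# Reward model has it backwards: large loss, strong training signal
print(preference_loss(reward_chosen=-1.0, reward_rejected=2.0))  # ~3.049
```

In Step 3 the language model is then fine-tuned (commonly with PPO) to produce outputs this reward model scores highly.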

🎯 Impact: RLHF makes models helpful, harmless, and honest. It's why ChatGPT refuses harmful requests!

2.3 Training Methodologies

๐Ÿญ Complete Training Pipeline

Pre-training

๐Ÿ“š Trillions of tokens

โฑ๏ธ Weeks/months

๐Ÿ’ฐ $Millions

โ†’ Base Model

โ†’

Supervised Fine-tuning

๐Ÿ“ Thousands of examples

โฑ๏ธ Hours/days

๐Ÿ’ฐ $Thousands

โ†’ Task Model

โ†’

RLHF

๐Ÿ‘ฅ Human preferences

โฑ๏ธ Days/weeks

๐Ÿ’ฐ $Tens of thousands

โ†’ Aligned Model

What Can YOU Do?

โŒ Usually Not Feasible

  • Pre-training from scratch
  • Training 100B+ models
  • Full fine-tuning large models

✅ Totally Possible!

  • LoRA fine-tuning
  • Instruction tuning smaller models
  • Using open-source models
  • Prompt engineering

🎯 Key Takeaways

🏗️ Transformers

  • Self-attention mechanism
  • Parallel processing
  • Positional encoding
  • Multi-head attention

🤖 LLM Types

  • GPT: Generation master
  • BERT: Understanding expert
  • LLaMA: Open-source hero
  • Choose based on task

🎓 Training

  • Pre-training: Learn language
  • Fine-tuning: Specialize
  • RLHF: Align with humans
  • You can fine-tune too!

🔥 The Big Picture

Transformers revolutionized NLP → Led to GPT/BERT/LLaMA → Trained on massive data → Fine-tuned for specific tasks → Aligned with human values → Powers ChatGPT, Claude, and everything you'll build!

๐ŸŒ Real-World Impact

These Architectures Power:

๐Ÿ’ฌ Conversational AI

  • ChatGPT (GPT-4)
  • Claude (Transformer-based)
  • Gemini (Google)
  • Customer service bots

๐Ÿ’ป Code Generation

  • GitHub Copilot
  • Amazon CodeWhisperer
  • Cursor AI
  • Replit Ghostwriter

๐Ÿ” Search & Understanding

  • Google Search (BERT)
  • Bing Chat
  • Perplexity AI
  • Semantic search

๐Ÿ“ Content Creation

  • Jasper AI
  • Copy.ai
  • Notion AI
  • Writing assistants

๐ŸŽฏ Your Future: With this knowledge, you can build the NEXT generation of AI applications!

💪 Understanding Check

Quick Quiz: Test Your Knowledge

  1. What's the main advantage of transformers over RNNs?

    Hint: Think about parallel processing

  2. Why can't BERT generate text like GPT?

    Hint: Think about attention direction

  3. What's the difference between pre-training and fine-tuning?

    Hint: Think about data and objectives

  4. How does multi-head attention help?

    Hint: Different relationships

💡 Discussion: We'll go through these together. No wrong answers—this is about understanding!

🔬 Mini Lab Activity

Exploring Attention in Action

Visit: https://transformer.huggingface.co/

What to Do:

  1. Try different sentences

    Example: "The animal didn't cross the street because it was too tired."

  2. Observe attention patterns

    Which words does "it" attend to most?

  3. Experiment with ambiguous pronouns

    Does the model understand context correctly?

  4. Compare different attention heads

    Do different heads focus on different relationships?

🎯 Goal: Develop intuition for how attention mechanisms actually work in practice!

📚 Resources & Next Steps

📖 Must-Read Papers

  • "Attention Is All You Need" (2017) - The original transformer paper
  • "BERT: Pre-training..." (2018) - Understanding BERT
  • "Language Models are Few-Shot..." (2020) - GPT-3 paper
  • "LLaMA: Open and Efficient..." (2023)

🎯 Interactive Resources

  • The Illustrated Transformer (Jay Alammar's blog)
  • Hugging Face Course (free NLP course)
  • 3Blue1Brown (Attention video)
  • Andrej Karpathy's YouTube

📅 Coming Next

🎨 Unit 3: Prompt Engineering Fundamentals

  • How to write effective prompts
  • Zero-shot and few-shot learning
  • Chain-of-thought prompting
  • Building your first prompt library

๐Ÿ“ Homework Assignment

Assignment: LLM Architecture Analysis

Due: Next class

Tasks:

  1. Compare GPT vs BERT

    Create a diagram showing their architectural differences. Include attention mechanisms, training objectives, and best use cases.

  2. Research a Specific Model

    Choose one: GPT-4, Claude, LLaMA 3, Gemini, or Mistral. Write 1 page about its architecture, training, and unique features.

  3. Attention Visualization

    Use the Hugging Face transformer tool. Take screenshots of 3 interesting attention patterns and explain what they show.

  4. Reflection (500 words)

    How might you use fine-tuning in your future projects? What domain would you specialize a model for?

💡 Bonus: Try using an open-source model locally with Ollama. Document your experience!

โ“ Questions?

Let's Discuss!

Any questions about:

  • Transformer architecture or attention?
  • Differences between GPT, BERT, LLaMA?
  • Pre-training vs fine-tuning?
  • How to apply this in your projects?
  • Anything else about LLMs?

Thank You! 🎉

You now understand how modern AI works!

This knowledge is the foundation for everything we'll build

📧 Questions? Office hours or email me!

🔬 Complete the assignment!

🎨 Next: Prompt Engineering - where the magic happens!
