Unit 7: Advanced Topics in Generative AI
Exploring the Frontiers of AI Innovation
4 Hours • 5 Topics
Unit Overview
What We'll Explore
This unit covers cutting-edge developments in Generative AI, from multimodal systems to autonomous agents and practical implementation strategies.
🎯 Learning Objectives
- Understand multimodal AI architectures
- Design agentic AI systems
- Explore emerging research frontiers
- Implement self-hosted LLMs
- Fine-tune models for specific tasks
📚 Topics Covered
- Multimodal AI Fundamentals
- Agentic AI Systems
- Emerging Research Directions
- Self-hosting LLMs
- Finetuning LLMs
7.1 Fundamentals of Multimodal AI
Beyond Text: AI That Sees, Hears, and Understands
Multimodal AI systems process and integrate multiple types of data including text, images, audio, and video to create richer, more contextual understanding.
Key Concept: Multimodal AI combines different sensory modalities to achieve more human-like perception and reasoning capabilities.
What is Multimodal AI?
Traditional AI
Single modality input (text OR image OR audio)
Limited context understanding
Multimodal AI
Multiple modality inputs (text AND image AND audio)
Rich contextual understanding
Any-to-Any
Any input to any output modality
Maximum flexibility
Common Modality Combinations
- Vision-Language: GPT-4V, Claude 3, Gemini (image + text understanding)
- Audio-Language: Whisper, Speech synthesis models
- Video Understanding: Combining temporal visual + audio + language
- Multimodal Generation: DALL-E 3, Stable Diffusion, Sora (text to image/video)
Multimodal AI Architectures
Early Fusion
Combine modalities at input level before processing
- Concatenate features early
- Joint embedding space
- Example: CLIP embeddings
Late Fusion
Process modalities separately, combine at output
- Independent encoders
- Merge predictions/features
- More modular approach
Cross-Attention Mechanisms
Modern approach: Use cross-attention layers to allow modalities to attend to each other, enabling deep interaction between vision and language tokens.
- Vision tokens attend to language tokens and vice versa
- Flexible integration at multiple layers
- Used in GPT-4V, Flamingo, and other state-of-the-art models
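The idea above can be illustrated with a minimal single-head cross-attention sketch in NumPy: one modality's tokens act as queries over another modality's tokens (the keys/values). The dimensions and random "embeddings" here are illustrative placeholders, not taken from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: one modality's tokens (queries)
    attend over another modality's tokens (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)       # each query's distribution over kv tokens
    return weights @ values                  # (n_q, d) blended representation

rng = np.random.default_rng(0)
vision_tokens = rng.standard_normal((4, 8))  # e.g. 4 image-patch embeddings, dim 8
text_tokens = rng.standard_normal((6, 8))    # e.g. 6 word embeddings, dim 8

# vision attends to language; the reverse direction is symmetric
fused = cross_attention(vision_tokens, text_tokens, text_tokens)
print(fused.shape)  # (4, 8)
```

Real models stack many such heads at multiple layers and learn separate projection matrices for queries, keys, and values; this sketch only shows the core attention step.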
Key Multimodal Models & Applications
| Model | Modalities | Key Capabilities |
|---|---|---|
| GPT-4V | Text, Images | Visual understanding, OCR, diagram analysis, image description |
| CLIP | Text, Images | Zero-shot image classification, image-text matching |
| Whisper | Audio, Text | Speech recognition, translation, multilingual transcription |
| ImageBind | 6 modalities | Joint embedding of images, text, audio, depth, thermal, IMU data |
| Gemini | Text, Image, Audio, Video | Native multimodal reasoning, video understanding |
Industry Impact: Multimodal AI enables applications like medical image diagnosis with contextual patient data, autonomous vehicles combining camera and LIDAR, and accessibility tools for vision-impaired users.
Challenges & Future Directions
Current Challenges
- Alignment: Ensuring modalities are properly synchronized and aligned
- Data Requirements: Need paired multimodal datasets
- Computational Cost: Processing multiple modalities is expensive
- Evaluation: Difficult to benchmark multimodal understanding
- Hallucination: Models may generate incorrect cross-modal associations
Emerging Solutions
- Contrastive Learning: Better alignment through contrastive objectives
- Efficient Architectures: Adapter layers, parameter-efficient methods
- Synthetic Data: Generating paired multimodal data
- Unified Tokenization: Treating all modalities as token sequences
7.2 Agentic AI Systems
From Chatbots to Autonomous Agents
Agentic AI systems can plan, use tools, interact with environments, and autonomously pursue complex goals over extended periods.
What are AI Agents?
Traditional LLMs
- Respond to single prompts
- No persistent state or memory
- Cannot take external actions
- Passive information providers
Agentic Systems
- Plan multi-step workflows
- Maintain context and memory
- Use tools and APIs
- Active goal pursuers
Key Components of AI Agents
Perception
Observe environment and user inputs
Reasoning
Plan actions and make decisions
Action
Execute tools and interact with world
Agent Architecture Patterns
ReAct (Reasoning + Acting)
Interleaves reasoning traces with action execution. Agent thinks aloud about what to do, then acts.
Pattern: Thought → Action → Observation → Thought → Action...
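The Thought → Action → Observation cycle can be sketched with a scripted stand-in for the LLM and one toy tool. Everything here (the calculator tool, the hard-coded trace) is a hypothetical illustration; in a real agent the thoughts and actions are generated by the model at each step.

```python
def calculator(expression: str) -> str:
    # toy tool: evaluate a basic arithmetic expression (no builtins exposed)
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

# scripted trace standing in for LLM outputs at each turn
SCRIPT = [
    {"thought": "I need the subtotal first.", "action": ("calculator", "19 * 3")},
    {"thought": "Now add the shipping fee.", "action": ("calculator", "57 + 4")},
    {"thought": "I have the final answer.", "action": ("finish", "61")},
]

def react_loop(script):
    observations = []
    for step in script:
        tool, arg = step["action"]
        if tool == "finish":          # the agent decides it is done
            return arg, observations
        obs = TOOLS[tool](arg)        # Action: execute the tool
        observations.append(obs)      # Observation: result feeds the next Thought
    return None, observations

answer, trace = react_loop(SCRIPT)
print(answer, trace)  # 61 ['57', '61']
```

The key property of ReAct is that each observation is appended to the model's context before it produces the next thought, which the scripted loop mimics by threading `observations` forward.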
ReWOO (Reasoning Without Observation)
Plans all actions upfront before execution
- Create complete plan first
- Execute all actions
- More efficient, less flexible
Reflexion
Learns from mistakes through self-reflection
- Execute action
- Evaluate outcome
- Reflect and improve
Tool Use & Function Calling
Extending Agent Capabilities
Agents become powerful when they can use external tools: APIs, calculators, search engines, databases, code interpreters, and more.
Common Tool Categories
- Information Retrieval: Search, databases, RAG systems
- Computation: Calculators, code execution, data analysis
- Communication: Email, messaging, notifications
- File Operations: Read, write, modify documents
- Web Interaction: Browser automation, API calls
Function Calling Pattern
- Define available functions with schemas
- LLM decides which function to call
- Extract parameters from context
- Execute function in environment
- Return results to LLM
- Continue conversation with results
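The six steps above can be sketched as a small dispatch loop. The schema shape loosely follows the OpenAI-style tool-call convention, but the function name, arguments, and stub implementation are all hypothetical.

```python
import json

# Step 1: a schema advertised to the model (illustrative, OpenAI-like shape)
FUNCTIONS = {
    "get_weather": {
        "description": "Look up current weather for a city",
        "parameters": {"city": "string"},
    }
}

def get_weather(city: str) -> dict:
    # stub implementation; a real version would call a weather API
    return {"city": city, "temp_c": 21, "condition": "sunny"}

DISPATCH = {"get_weather": get_weather}

def handle_tool_call(raw_call: str) -> str:
    """Steps 2-5: parse the model's tool call, execute it, return results."""
    call = json.loads(raw_call)           # the model's output as JSON
    fn = DISPATCH[call["name"]]           # step 2: which function was chosen
    result = fn(**call["arguments"])      # steps 3-4: extracted params, executed
    return json.dumps(result)             # step 5: serialized back into the chat

# e.g. the model emitted this tool call after reading the user's question
model_output = '{"name": "get_weather", "arguments": {"city": "Pune"}}'
print(handle_tool_call(model_output))
```

Step 6 (continuing the conversation) amounts to appending the returned JSON as a tool message and calling the model again; production code would also validate arguments against the schema before dispatching.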
Popular Agent Frameworks
| Framework | Key Features | Best For |
|---|---|---|
| LangGraph | State machines, cycles, human-in-loop | Complex workflows, production systems |
| AutoGPT | Autonomous goal pursuit, self-prompting | Open-ended tasks, research |
| CrewAI | Multi-agent collaboration, role assignment | Team-based workflows |
| Microsoft AutoGen | Conversable agents, group chat | Multi-agent conversations |
| OpenAI Assistants | Managed threads, built-in tools, retrieval | Quick prototypes, managed infrastructure |
Important: Agent systems can be unpredictable and may take unexpected actions. Always implement proper safety guardrails, monitoring, and human oversight for production deployments.
Challenges in Agentic AI
Technical Challenges
- Long-term Planning: Difficulty maintaining coherent plans over many steps
- Error Propagation: Early mistakes compound over time
- Tool Reliability: Agents struggle when tools fail or return unexpected results
- Context Management: Keeping relevant information in limited context windows
Safety & Control
- Unpredictability: Agents may take unexpected actions
- Runaway Behavior: Infinite loops or excessive API calls
- Security Risks: Potential for malicious tool use
- Alignment: Ensuring agents pursue intended goals
Best Practices
- Implement rate limiting and budgets
- Add human approval for critical actions
- Use simulation environments for testing
- Monitor and log all agent actions
- Design clear success/failure criteria
7.3 Emerging Research Directions
The Cutting Edge of Generative AI
Exploring the latest research trends, breakthrough techniques, and future directions that will shape the next generation of AI systems.
Advanced Reasoning Capabilities
Chain-of-Thought & Beyond
Teaching models to think step-by-step has dramatically improved complex reasoning tasks.
Current Approaches
- Chain-of-Thought (CoT): Explicit reasoning steps
- Tree of Thoughts: Exploring multiple reasoning paths
- Graph of Thoughts: Non-linear reasoning structures
- Self-Consistency: Sampling multiple reasoning paths
Emerging Techniques
- Process Reward Models: Supervising reasoning process, not just outcomes
- Inference-time Compute: Extended thinking during generation
- Metacognition: Models reasoning about their own reasoning
- Symbolic Integration: Hybrid neural-symbolic systems
Efficiency & Scalability Research
Model Compression
- Quantization: 4-bit, 3-bit, 1-bit models
- Pruning: Removing unnecessary parameters
- Distillation: Training smaller models from larger ones
- Sparse Models: Mixture of Experts (MoE)
Training Efficiency
- Flash Attention: Memory-efficient attention mechanisms
- Gradient Checkpointing: Trade compute for memory
- Mixed Precision: FP16, BF16 training
- Zero Redundancy Optimizer: Distributed training optimization
Context Length Extension
- RoPE Scaling: Extending rotary embeddings
- ALiBi: Attention with Linear Biases
- Sparse Attention: Longformer, BigBird patterns
- Retrieval Augmentation: Infinite context via RAG
Breakthrough: Models like Claude and Gemini now support 100K-200K+ token contexts, enabling processing of entire codebases, books, and long documents.
AI Alignment & Safety
Ensuring AI Systems are Helpful, Harmless, and Honest
Alignment research focuses on making AI systems that reliably do what humans intend while avoiding harmful behaviors.
Current Techniques
- RLHF: Reinforcement Learning from Human Feedback
- Constitutional AI: Self-critique against principles
- Red Teaming: Adversarial testing for vulnerabilities
- Interpretability: Understanding model internals
Open Problems
- Scalable Oversight: Supervising superhuman AI
- Robustness: Resisting jailbreaks and adversarial attacks
- Value Learning: Inferring human preferences accurately
- Long-term Safety: Ensuring safe behavior as capabilities increase
Other Exciting Research Areas
🧬 Code Generation
Models that write, debug, and optimize code
- AlphaCode, CodeLlama
- Program synthesis
- Automated debugging
🔬 Scientific AI
AI for scientific discovery
- AlphaFold for proteins
- Material discovery
- Mathematical reasoning
🎮 Embodied AI
AI in physical or simulated worlds
- Robotics integration
- Virtual agents
- Simulation-to-reality
🌐 Multilingual & Cross-lingual
Better support for low-resource languages, cross-lingual transfer, and culturally-aware models
⚡ On-device AI
Running powerful models on smartphones, edge devices, and without cloud connectivity
7.4 Self-hosting LLMs
Running Your Own Language Models
Self-hosting LLMs gives you complete control over data privacy, customization, and costs, but requires technical infrastructure and expertise.
Why Self-host? Data privacy, cost control at scale, offline operation, full customization, no rate limits, and compliance requirements.
Hardware Requirements
Estimating VRAM Needs
Rule of Thumb: Model size in billions × 2 bytes (FP16) = VRAM needed in GB
Example: 7B model × 2 = 14 GB VRAM for the weights alone; leave headroom on top for the KV cache and activations
| Model Size | FP16 VRAM | 4-bit VRAM | Recommended GPU |
|---|---|---|---|
| 7B (e.g., Llama 2 7B) | ~14 GB | ~4 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| 13B (e.g., Llama 2 13B) | ~26 GB | ~7 GB | RTX 3090 24GB, RTX 4090 24GB |
| 30-34B | ~60 GB | ~16 GB | A100 40GB, Multi-GPU setup |
| 70B (e.g., Llama 2 70B) | ~140 GB | ~35 GB | A100 80GB × 2, H100 80GB |
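The rule of thumb behind the table can be written as a small helper. The 20% overhead buffer is the one suggested later under common pitfalls; treat the outputs as rough planning numbers, not guarantees.

```python
def estimate_vram_gb(params_billion: float, bits: int = 16, overhead: float = 0.2) -> float:
    """Weights-only VRAM estimate (params x bytes-per-param), plus a buffer
    for the KV cache and activations (the ~20% mentioned under pitfalls)."""
    bytes_per_param = bits / 8
    weights_gb = params_billion * bytes_per_param
    return round(weights_gb * (1 + overhead), 1)

print(estimate_vram_gb(7))            # 16.8  (FP16 7B, with buffer)
print(estimate_vram_gb(7, bits=4))    # 4.2   (4-bit 7B)
print(estimate_vram_gb(70, bits=4))   # 42.0  (4-bit 70B)
```

Note the buffered numbers land a little above the weights-only figures in the table, which is the point: a model whose weights just barely fit will usually fail at inference time.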
Popular Open Models for Self-hosting
Llama Family (Meta)
- Llama 3.1: 8B, 70B, 405B variants
- Llama 3.2: Vision models, lightweight variants
- Instruction-tuned versions available
- Strong general-purpose performance
Mistral AI Models
- Mistral 7B: Efficient, powerful 7B model
- Mixtral 8×7B: MoE architecture, ~47B total parameters (~13B active per token)
- Apache 2.0 license
- Excellent efficiency
Falcon (TII)
7B, 40B, 180B models with strong performance
Phi (Microsoft)
Small but powerful models (1.3B-3.8B)
Qwen (Alibaba)
Multilingual models with strong coding abilities
Self-hosting Tools & Frameworks
| Tool | Best For | Key Features |
|---|---|---|
| Ollama | Easy local deployment | One-command installation, model library, REST API |
| vLLM | Production inference | PagedAttention, high throughput, OpenAI-compatible API |
| Text Generation Inference (TGI) | Hugging Face models | Optimized serving, streaming, token authentication |
| llama.cpp | CPU/consumer GPU | Quantization, low memory, cross-platform |
| LM Studio | Desktop GUI | User-friendly, model discovery, local chat interface |
| LocalAI | OpenAI drop-in replacement | API compatibility, multi-model, self-contained |
Model Quantization
Reducing Model Size & Memory Requirements
Quantization converts model weights from high precision (FP16/32) to lower precision (INT8/4) to reduce memory and increase speed.
Quantization Methods
- GPTQ: Post-training quantization, 4-bit, fast inference
- GGUF: llama.cpp format, 2-8 bit, CPU-friendly
- AWQ: Activation-aware, preserves important weights
- bitsandbytes: 8-bit, 4-bit quantization library
Precision Trade-offs
- FP16: Near-full quality, half the memory of FP32
- 8-bit: Minimal quality loss, 2× savings
- 4-bit: Small quality loss, 4× savings
- 3-bit/2-bit: Noticeable degradation, experimental
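A toy symmetric int8 quantizer makes the trade-off concrete: one scale factor maps the FP32 weights onto the int8 range and back. This is a simplified sketch; methods like GPTQ and AWQ quantize per-group and correct for the resulting error, which this example does not.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: a single scale maps the
    largest-magnitude weight to +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32) * 0.02  # fake weight tensor
q, scale = quantize_int8(w)

err = np.abs(w - dequantize(q, scale)).max()
print(q.nbytes, w.nbytes)  # 1024 4096  -> 4x smaller
print(err <= scale)        # True: rounding error is bounded by one step
```

The 4x size reduction here matches the "8-bit vs FP32" row of the trade-off list; 4-bit formats push further by packing two weights per byte at the cost of coarser steps.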
Practical Tip: For most use cases, 4-bit quantization (GPTQ or GGUF Q4) provides excellent quality while making large models accessible on consumer hardware.
Production Deployment Considerations
Performance Optimization
- Batching: Process multiple requests together
- Caching: KV cache for efficient generation
- Load Balancing: Distribute across multiple instances
- Streaming: Token-by-token output for responsiveness
Infrastructure
- Monitoring: Track latency, throughput, errors
- Scaling: Auto-scale based on demand
- Security: Authentication, rate limiting, input validation
- Logging: Track usage, debug issues
Common Pitfalls
- Underestimating memory requirements (add 20% buffer)
- Not implementing timeout mechanisms
- Ignoring prompt injection vulnerabilities
- Insufficient error handling and fallbacks
7.5 Finetuning LLMs
Customizing Models for Your Use Case
Finetuning adapts pre-trained models to specific tasks, domains, or behaviors by training on targeted datasets.
When to Finetune vs. Use Prompt Engineering
Use Prompt Engineering When:
- Task is well-defined and examples fit in context
- Need fast iteration and flexibility
- Limited training data available
- Model already performs reasonably well
- Budget/time constraints exist
Finetune When:
- Need consistent behavior across many requests
- Domain-specific knowledge not in base model
- Have quality training dataset (100s-1000s examples)
- Need to reduce latency/cost per request
- Specific output format or style required
The Hierarchy of Adaptation
1. Prompt Engineering → 2. Few-shot Learning → 3. RAG → 4. Finetuning → 5. Pre-training from Scratch
Start simple and move right only when necessary!
Full Finetuning vs. Parameter-Efficient Finetuning
| Aspect | Full Finetuning | PEFT (LoRA/QLoRA) |
|---|---|---|
| Parameters Updated | All model parameters | Small adapter layers (0.1-1%) |
| Memory Required | Very high (multiple copies) | Much lower (base model + adapters) |
| Training Time | Slow | Fast |
| GPU Requirements | A100/H100 for large models | Consumer GPUs sufficient |
| Storage | Full model copy per task | Small adapter files (MBs) |
| Use Case | Major domain shift | Most finetuning scenarios |
LoRA & QLoRA: Efficient Finetuning
Low-Rank Adaptation (LoRA)
LoRA freezes the base model and trains small adapter matrices that are added to attention layers. Instead of updating all weights, it learns low-rank decomposition matrices.
Key Innovation: W' = W + BA (where B and A are small, low-rank matrices)
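The W' = W + BA update can be checked numerically in a few lines. The dimensions, rank, and alpha below are illustrative choices, not values from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # hidden dimension
r = 16                                   # LoRA rank, r << d

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable (r x d), small random init
B = np.zeros((d, r))                     # trainable (d x r), zero init so W' = W at start

alpha = 32
W_prime = W + (alpha / r) * (B @ A)      # W' = W + BA, scaled as in the LoRA paper

full_params = W.size                     # what full finetuning would update
lora_params = A.size + B.size            # what LoRA actually trains
print(full_params, lora_params, full_params / lora_params)  # 262144 16384 16.0
```

Two properties fall out directly: because B starts at zero, training begins exactly at the pretrained weights, and because only A and B are trained, the trainable parameter count drops by a factor of d/(2r) per layer (16x in this toy case; far larger for big models where adapters touch only attention projections).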
LoRA Benefits
- Reduce trainable parameters by 10,000×
- Faster training and lower GPU memory
- Swap adapters for different tasks
- No inference latency overhead
- Maintain base model quality
QLoRA Enhancement
- Quantize base model to 4-bit
- Train LoRA adapters in higher precision
- Finetune 65B+ models on a single 48GB GPU
- Minimal quality degradation
- Democratizes large model finetuning
Practical Example: QLoRA enables finetuning a 65B-70B model on a single 48GB GPU (e.g., an A6000), which would otherwise require 8+ A100 GPUs with full finetuning; a 13B model finetunes comfortably on an RTX 4090 (24GB VRAM).
The Finetuning Process
Step-by-Step Workflow
- Define Objective: What specific behavior or knowledge do you want?
- Prepare Dataset: Collect and format high-quality examples
- Choose Base Model: Select appropriate size and architecture
- Configure Training: Set hyperparameters (learning rate, epochs, batch size)
- Train: Monitor loss curves and validation metrics
- Evaluate: Test on held-out data, compare to baseline
- Iterate: Refine dataset and hyperparameters based on results
- Deploy: Serve the finetuned model or adapter
Dataset Preparation for Finetuning
Quality Over Quantity
A smaller dataset of high-quality, diverse examples typically outperforms a large dataset of mediocre examples.
Data Format
Instruction Format (Most Common):
- System prompt (optional)
- User instruction/question
- Assistant response
Example: Alpaca, ShareGPT format
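Converting raw Q/A pairs into the Alpaca-style instruction format looks roughly like this; the questions and answers are made-up placeholders, and the field names follow the Alpaca convention.

```python
import json

# hypothetical raw Q/A pairs from, e.g., a support log
raw_pairs = [
    ("What is your return policy?", "Items can be returned within 30 days."),
    ("Do you ship internationally?", "Yes, to over 40 countries."),
]

def to_alpaca(question: str, answer: str) -> dict:
    return {
        "instruction": question,  # the user instruction
        "input": "",              # optional extra context (empty here)
        "output": answer,         # the target assistant response
    }

# one JSON object per line (JSONL), the shape most training scripts accept
jsonl = "\n".join(json.dumps(to_alpaca(q, a)) for q, a in raw_pairs)
print(jsonl.splitlines()[0])
```

ShareGPT format differs mainly in storing a list of alternating user/assistant turns per record, which suits multi-turn conversations better than single instruction/response pairs.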
Dataset Size Guidelines
- Minimum: 100-200 quality examples
- Sweet Spot: 1,000-10,000 examples
- Diminishing Returns: beyond ~50K examples
- Diversity: More important than size
Common Data Quality Issues
- Inconsistent formatting across examples
- Duplicates or near-duplicates
- Factual errors in training data
- Biased or toxic content
- Too similar to base model's existing knowledge
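The duplicate problem in the list above is cheap to attack with normalization plus hashing. This sketch only catches exact matches after normalization; true near-duplicate detection (e.g. MinHash) goes further.

```python
import hashlib

def normalize(text: str) -> str:
    # lowercase and collapse whitespace so trivial variants hash identically
    return " ".join(text.lower().split())

def dedupe(examples):
    """Drop examples whose normalized form has been seen before,
    keeping the first occurrence."""
    seen, kept = set(), []
    for ex in examples:
        h = hashlib.sha256(normalize(ex).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept

data = ["Reset your password here.", "reset  your password HERE.", "Contact support."]
print(dedupe(data))  # ['Reset your password here.', 'Contact support.']
```

Running the same pass on (instruction, output) pairs joined into one string also catches records that differ only in formatting, one of the inconsistency issues listed above.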
Key Hyperparameters for Finetuning
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | LoRA can use higher LR than full finetuning; start conservative |
| Batch Size | 4-32 | Larger is better but limited by GPU memory |
| Epochs | 1-5 | More epochs can cause overfitting; monitor validation loss |
| LoRA Rank (r) | 8-64 | Higher rank = more capacity but slower; 16 is common |
| LoRA Alpha | 16-32 | Scaling factor; often set to 2× rank |
| Warmup Steps | 5-10% of total | Gradually increase LR to stabilize training |
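The warmup row in the table can be made concrete with a tiny scheduler: linear warmup over the first few percent of steps, then linear decay. This is one common schedule (cosine decay is another); the 2e-4 peak and 5% warmup fraction are just example values from the table's ranges.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr over warmup_frac of training,
    then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # ramp up from 0
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)  # decay to 0

total = 1000
print(lr_at_step(0, total))     # 0.0   (start of warmup)
print(lr_at_step(50, total))    # 0.0002 (peak, end of 5% warmup)
print(lr_at_step(1000, total))  # 0.0   (end of training)
```

The purpose of the ramp is stability: large updates on a freshly initialized adapter with a cold optimizer state can blow up the loss, so the learning rate starts near zero.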
Popular Finetuning Frameworks
Hugging Face Libraries
- Transformers: Core library for models
- PEFT: Parameter-Efficient Fine-Tuning (LoRA, QLoRA)
- TRL: Transformer Reinforcement Learning (RLHF, DPO)
- Accelerate: Distributed training utilities
Other Frameworks
- Axolotl: User-friendly YAML configs
- LLaMA-Factory: GUI for finetuning
- Unsloth: 2× faster finetuning
- Ludwig: Low-code ML platform
Recommended Starting Point: Hugging Face PEFT + Transformers provides the most flexibility and community support. Axolotl is great for quick experimentation with configs.
Evaluating Finetuned Models
Quantitative Metrics
- Perplexity: Lower is better (language modeling)
- Task-specific: Accuracy, F1, BLEU, ROUGE
- Benchmarks: MMLU, HumanEval, MT-Bench
- Loss Curves: Monitor training and validation loss
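Perplexity follows directly from the per-token losses you already monitor: it is the exponential of the mean negative log-likelihood. The example loss values below are made up for illustration.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    Lower is better; a uniform guess over V tokens gives perplexity V."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. per-token cross-entropy losses (in nats) from an eval run
losses = [2.1, 1.8, 2.4, 1.9]
print(round(perplexity(losses), 2))  # 7.77

# sanity check: uniform over a 1000-token vocabulary -> perplexity 1000
uniform_nll = math.log(1000)
print(round(perplexity([uniform_nll] * 5)))  # 1000
```

One caveat when comparing finetuned and base models: perplexity is only comparable across models that share a tokenizer, since it is measured per token rather than per character.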
Qualitative Evaluation
- Human Review: Sample outputs manually
- A/B Testing: Compare to base model
- Domain Expert Review: Specialist evaluation
- Red Teaming: Test edge cases and failures
Watch for These Issues
- Overfitting: Perfect on training data, poor on new examples
- Catastrophic Forgetting: Loss of general capabilities
- Hallucination Increase: Making up information more frequently
- Style Drift: Unwanted changes in tone or behavior
Advanced Finetuning Techniques
DPO
Direct Preference Optimization
Simpler alternative to RLHF, trains directly on preference pairs
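The per-pair DPO objective is compact enough to write out: push the policy's log-probability ratio for the chosen response above that of the rejected one, measured relative to a frozen reference model. The log-probabilities below are invented numbers purely to show the mechanics.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log(sigmoid(beta * [(pi/ref ratio for chosen) - (pi/ref ratio for rejected)]))"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# policy favors the chosen answer more than the reference does -> low loss
low = dpo_loss(-10.0, -14.0, ref_chosen=-12.0, ref_rejected=-13.0)
# policy favors the rejected answer -> high loss
high = dpo_loss(-14.0, -10.0, ref_chosen=-13.0, ref_rejected=-12.0)
print(low < high)  # True
```

This is why DPO needs no separate reward model or RL loop: the preference data supervises the policy directly, with beta controlling how far it may drift from the reference.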
RLHF
Reinforcement Learning from Human Feedback
Train reward model, then optimize policy with RL
Mixture of Adapters
Multi-task Learning
Train multiple LoRA adapters for different tasks
Continued Pre-training
For domain adaptation, continue pre-training on domain-specific unlabeled text before instruction finetuning. This is particularly effective for specialized domains like medicine, law, or code.
Practical Finetuning Example
Use Case: Customer Support Chatbot
Fine-tuning a 7B model to handle company-specific customer inquiries with proper tone and accurate product information.
Implementation Steps
- Data Collection: 2,000 real customer conversations (anonymized)
- Base Model: Mistral 7B Instruct (strong baseline)
- Method: QLoRA with rank=16, alpha=32
- Training: 3 epochs, learning rate 2e-4, batch size 8
- Hardware: Single RTX 4090 (24GB), 6 hours training
- Results: 85% accuracy on the held-out test set, ~40% lower latency than a GPT-4 API baseline
- Deployment: vLLM serving with 50ms latency
Outcome: Reduced API costs by 90%, improved response relevance, maintained data privacy, and achieved sub-second response times.
Finetuning Best Practices
Before Training
- Start with prompt engineering baseline
- Curate high-quality, diverse dataset
- Split data into train/validation/test
- Document data sources and preprocessing
- Choose appropriate base model for task
During Training
- Monitor both training and validation loss
- Save checkpoints regularly
- Use early stopping to prevent overfitting
- Log hyperparameters for reproducibility
- Sample outputs periodically for quality check
After Training
- Compare to base model on diverse test cases
- Test edge cases and potential failure modes
- Validate on real-world data, not just test set
- Document model limitations and known issues
- Plan for monitoring and retraining schedule
Unit 7 Summary
Multimodal AI (7.1)
- Vision-language models
- Cross-attention architectures
- Applications across modalities
Agentic Systems (7.2)
- ReAct and agent patterns
- Tool use and function calling
- Safety and oversight
Research Directions (7.3)
- Advanced reasoning techniques
- Efficiency and scalability
- AI alignment and safety
Self-hosting & Finetuning (7.4-7.5)
- Running open models locally
- LoRA and QLoRA
- Production deployment
Key Takeaways from Unit 7
1. Multimodal is the Future
AI systems are moving beyond text to understand and generate across multiple modalities, enabling richer and more natural interactions.
2. Agents Extend Capabilities
Agentic systems that can plan, use tools, and take actions represent a paradigm shift from passive assistants to active problem-solvers.
3. Efficiency Enables Access
Techniques like quantization, LoRA, and optimized inference are democratizing access to powerful AI capabilities.
4. Customization Matters
Finetuning and self-hosting give you control over performance, privacy, and costs for production applications.
Future Outlook & Trends
Where is Generative AI Heading?
🚀 Scaling
Larger models with trillions of parameters and longer contexts
⚡ Efficiency
Smaller, faster models matching current large model performance
🤖 Autonomy
More capable agents handling complex, multi-step tasks
🌐 Multimodal
Seamless any-to-any generation across all modalities
🔒 Safety
Better alignment, robustness, and interpretability
🌍 Accessibility
Open models, on-device AI, democratized access
Resources for Further Learning
📚 Research Papers
- Attention is All You Need (Transformers)
- GPT-4 Technical Report
- LoRA: Low-Rank Adaptation
- Constitutional AI (Anthropic)
- ReAct: Synergizing Reasoning and Acting
🛠️ Practical Resources
- Hugging Face Documentation
- LangChain / LangGraph Tutorials
- Ollama Quick Start
- Axolotl Finetuning Guide
- vLLM Deployment Docs
Recommended Next Steps
- Experiment with multimodal models (GPT-4V, Claude)
- Build a simple agent with tool use
- Deploy an open model locally with Ollama
- Finetune a small model on a custom dataset
- Stay current with research (arXiv, AI conferences)
Thank You!
Unit 7: Advanced Topics in Generative AI
Keep Learning, Keep Building! 🚀
Questions? Discussion? Let's explore together!