Unit 7: Advanced Topics in Generative AI

Exploring the Frontiers of AI Innovation

4 Hours • 5 Topics

Unit Overview

What We'll Explore

This unit covers cutting-edge developments in Generative AI, from multimodal systems to autonomous agents and practical implementation strategies.

🎯 Learning Objectives

  • Understand multimodal AI architectures
  • Design agentic AI systems
  • Explore emerging research frontiers
  • Implement self-hosted LLMs
  • Fine-tune models for specific tasks

📚 Topics Covered

  • Multimodal AI Fundamentals
  • Agentic AI Systems
  • Emerging Research Directions
  • Self-hosting LLMs
  • Finetuning LLMs

7.1 Fundamentals of Multimodal AI


Beyond Text: AI That Sees, Hears, and Understands

Multimodal AI systems process and integrate multiple types of data including text, images, audio, and video to create richer, more contextual understanding.

Key Concept: Multimodal AI combines different sensory modalities to achieve more human-like perception and reasoning capabilities.

What is Multimodal AI?

Traditional AI

Single modality input (text OR image OR audio)

Limited context understanding

Multimodal AI

Multiple modality inputs (text AND image AND audio)

Rich contextual understanding

Any-to-Any

Any input to any output modality

Maximum flexibility

Common Modality Combinations

  • Vision-Language: GPT-4V, Claude 3, Gemini (image + text understanding)
  • Audio-Language: Whisper, Speech synthesis models
  • Video Understanding: Combining temporal visual + audio + language
  • Multimodal Generation: DALL-E 3, Stable Diffusion, Sora (text to image/video)

Multimodal AI Architectures

Early Fusion

Combine modalities at input level before processing

  • Concatenate features early
  • Joint embedding space
  • Example: CLIP embeddings

Late Fusion

Process modalities separately, combine at output

  • Independent encoders
  • Merge predictions/features
  • More modular approach

Cross-Attention Mechanisms

Modern approach: Use cross-attention layers to allow modalities to attend to each other, enabling deep interaction between vision and language tokens.

  • Vision tokens attend to language tokens and vice versa
  • Flexible integration at multiple layers
  • Used in GPT-4V, Flamingo, and other state-of-the-art models
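One way to see the mechanism: a minimal NumPy sketch of scaled dot-product cross-attention, with vision tokens as queries attending to language tokens as keys/values. The dimensions and random inputs are arbitrary toy values, not taken from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: tokens from one modality
    attend to tokens from another modality."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (n_q, d) fused features

rng = np.random.default_rng(0)
vision_tokens = rng.normal(size=(4, 8))  # 4 image-patch tokens
text_tokens = rng.normal(size=(6, 8))    # 6 language tokens

# Vision attends to language (the reverse direction works the same way)
fused = cross_attention(vision_tokens, text_tokens, text_tokens)
print(fused.shape)  # (4, 8)
```

In real models this runs at many layers, in both directions, with learned projection matrices producing the queries, keys, and values.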

Key Multimodal Models & Applications

| Model | Modalities | Key Capabilities |
|-------|------------|------------------|
| GPT-4V | Text, images | Visual understanding, OCR, diagram analysis, image description |
| CLIP | Text, images | Zero-shot image classification, image-text matching |
| Whisper | Audio, text | Speech recognition, translation, multilingual transcription |
| ImageBind | 6 modalities | Audio, depth, thermal, IMU data alignment |
| Gemini | Text, image, audio, video | Native multimodal reasoning, video understanding |

Industry Impact: Multimodal AI enables applications like medical image diagnosis with contextual patient data, autonomous vehicles combining camera and LIDAR, and accessibility tools for vision-impaired users.

Challenges & Future Directions

Current Challenges

  • Alignment: Ensuring modalities are properly synchronized and aligned
  • Data Requirements: Need paired multimodal datasets
  • Computational Cost: Processing multiple modalities is expensive
  • Evaluation: Difficult to benchmark multimodal understanding
  • Hallucination: Models may generate incorrect cross-modal associations

Emerging Solutions

  • Contrastive Learning: Better alignment through contrastive objectives
  • Efficient Architectures: Adapter layers, parameter-efficient methods
  • Synthetic Data: Generating paired multimodal data
  • Unified Tokenization: Treating all modalities as token sequences

7.2 Agentic AI Systems


From Chatbots to Autonomous Agents

Agentic AI systems can plan, use tools, interact with environments, and autonomously pursue complex goals over extended periods.

"The shift from passive language models to active agents represents one of the most significant transitions in AI development."

What are AI Agents?

Traditional LLMs

  • Respond to single prompts
  • No persistent state or memory
  • Cannot take external actions
  • Passive information providers

Agentic Systems

  • Plan multi-step workflows
  • Maintain context and memory
  • Use tools and APIs
  • Active goal pursuers

Key Components of AI Agents

Perception

Observe environment and user inputs

Reasoning

Plan actions and make decisions

Action

Execute tools and interact with world

Agent Architecture Patterns

ReAct (Reasoning + Acting)

Interleaves reasoning traces with action execution. Agent thinks aloud about what to do, then acts.

Pattern: Thought → Action → Observation → Thought → Action...
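The cycle can be sketched with a toy tool. In a real agent, the thought, action, and action input are generated by the LLM at each turn; here they are scripted strings, and `calculator` is a hypothetical stand-in for an external tool.

```python
# Toy tool standing in for a real external API (name is illustrative)
def calculator(expr):
    # Restricted eval for arithmetic only; a real tool would be safer still
    return str(eval(expr, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def react_step(thought, action, action_input):
    """One Thought -> Action -> Observation cycle.
    The observation is fed back to the LLM to produce the next thought."""
    observation = TOOLS[action](action_input)
    return observation

# Scripted trace (an LLM would produce these strings)
obs = react_step(
    thought="I need to multiply the quantities.",
    action="calculator",
    action_input="6 * 7",
)
print(obs)  # 42
```

The loop repeats until the model emits a final answer instead of another action.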

ReWOO (Reasoning Without Observation)

Plans all actions upfront before execution

  • Create complete plan first
  • Execute all actions
  • More efficient, less flexible

Reflexion

Learns from mistakes through self-reflection

  • Execute action
  • Evaluate outcome
  • Reflect and improve

Tool Use & Function Calling

Extending Agent Capabilities

Agents become powerful when they can use external tools: APIs, calculators, search engines, databases, code interpreters, and more.

Common Tool Categories

  • Information Retrieval: Search, databases, RAG systems
  • Computation: Calculators, code execution, data analysis
  • Communication: Email, messaging, notifications
  • File Operations: Read, write, modify documents
  • Web Interaction: Browser automation, API calls

Function Calling Pattern

  1. Define available functions with schemas
  2. LLM decides which function to call
  3. Extract parameters from context
  4. Execute function in environment
  5. Return results to LLM
  6. Continue conversation with results
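The six steps above can be sketched as a minimal dispatcher. The `get_weather` tool, its schema, and the simulated model message are hypothetical; real systems receive the function-call message from the LLM provider's API.

```python
import json

# Step 1: declare an available tool with a JSON schema (hypothetical tool)
TOOL_SCHEMAS = [{
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

def get_weather(city):
    # Stubbed result; a real tool would call an external API here
    return {"city": city, "temp_c": 21}

REGISTRY = {"get_weather": get_weather}

def handle_tool_call(llm_message):
    """Steps 2-5: the LLM names a function and supplies JSON arguments;
    we dispatch, execute, and serialize the result for the next turn."""
    call = llm_message["function_call"]
    args = json.loads(call["arguments"])
    result = REGISTRY[call["name"]](**args)
    return json.dumps(result)  # step 6: fed back to the LLM as a tool message

# Simulated model output choosing the tool (steps 2-3)
msg = {"function_call": {"name": "get_weather",
                         "arguments": '{"city": "Pune"}'}}
print(handle_tool_call(msg))
```

The key design point: the LLM only ever produces structured *requests*; the host application controls what actually executes.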

Popular Agent Frameworks

| Framework | Key Features | Best For |
|-----------|--------------|----------|
| LangGraph | State machines, cycles, human-in-the-loop | Complex workflows, production systems |
| AutoGPT | Autonomous goal pursuit, self-prompting | Open-ended tasks, research |
| CrewAI | Multi-agent collaboration, role assignment | Team-based workflows |
| Microsoft AutoGen | Conversable agents, group chat | Multi-agent conversations |
| OpenAI Assistants | Managed threads, built-in tools, retrieval | Quick prototypes, managed infrastructure |

Important: Agent systems can be unpredictable and may take unexpected actions. Always implement proper safety guardrails, monitoring, and human oversight for production deployments.

Challenges in Agentic AI

Technical Challenges

  • Long-term Planning: Difficulty maintaining coherent plans over many steps
  • Error Propagation: Early mistakes compound over time
  • Tool Reliability: Agents struggle when tools fail or return unexpected results
  • Context Management: Keeping relevant information in limited context windows

Safety & Control

  • Unpredictability: Agents may take unexpected actions
  • Runaway Behavior: Infinite loops or excessive API calls
  • Security Risks: Potential for malicious tool use
  • Alignment: Ensuring agents pursue intended goals

Best Practices

  • Implement rate limiting and budgets
  • Add human approval for critical actions
  • Use simulation environments for testing
  • Monitor and log all agent actions
  • Design clear success/failure criteria

7.3 Emerging Research Directions


The Cutting Edge of Generative AI

Exploring the latest research trends, breakthrough techniques, and future directions that will shape the next generation of AI systems.

Advanced Reasoning Capabilities

Chain-of-Thought & Beyond

Teaching models to think step-by-step has dramatically improved complex reasoning tasks.

Current Approaches

  • Chain-of-Thought (CoT): Explicit reasoning steps
  • Tree of Thoughts: Exploring multiple reasoning paths
  • Graph of Thoughts: Non-linear reasoning structures
  • Self-Consistency: Sampling multiple reasoning paths
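Self-consistency reduces to a majority vote over the final answers of independently sampled reasoning chains. A minimal sketch, with the sampled answers hard-coded rather than drawn from a model:

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from independently
    sampled chain-of-thought traces (the sampling itself is assumed)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from 5 sampled reasoning paths
sampled = ["42", "42", "41", "42", "40"]
best, agreement = self_consistency(sampled)
print(best, agreement)  # 42 0.6
```

The agreement ratio doubles as a rough confidence signal: low agreement across samples suggests the problem is hard or the reasoning is unstable.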

Emerging Techniques

  • Process Reward Models: Supervising reasoning process, not just outcomes
  • Inference-time Compute: Extended thinking during generation
  • Metacognition: Models reasoning about their own reasoning
  • Symbolic Integration: Hybrid neural-symbolic systems

Efficiency & Scalability Research

Model Compression

  • Quantization: 4-bit, 3-bit, 1-bit models
  • Pruning: Removing unnecessary parameters
  • Distillation: Training smaller models from larger ones
  • Sparse Models: Mixture of Experts (MoE)

Training Efficiency

  • Flash Attention: Memory-efficient attention mechanisms
  • Gradient Checkpointing: Trade compute for memory
  • Mixed Precision: FP16, BF16 training
  • Zero Redundancy Optimizer: Distributed training optimization

Context Length Extension

  • RoPE Scaling: Extending rotary embeddings
  • ALiBi: Attention with Linear Biases
  • Sparse Attention: Longformer, BigBird patterns
  • Retrieval Augmentation: Infinite context via RAG

Breakthrough: Models like Claude and Gemini now support 100K-200K+ token contexts, enabling processing of entire codebases, books, and long documents.

AI Alignment & Safety

Ensuring AI Systems are Helpful, Harmless, and Honest

Alignment research focuses on making AI systems that reliably do what humans intend while avoiding harmful behaviors.

Current Techniques

  • RLHF: Reinforcement Learning from Human Feedback
  • Constitutional AI: Self-critique against principles
  • Red Teaming: Adversarial testing for vulnerabilities
  • Interpretability: Understanding model internals

Open Problems

  • Scalable Oversight: Supervising superhuman AI
  • Robustness: Resisting jailbreaks and adversarial attacks
  • Value Learning: Inferring human preferences accurately
  • Long-term Safety: Ensuring safe behavior as capabilities increase

Other Exciting Research Areas

🧬 Code Generation

Models that write, debug, and optimize code

  • AlphaCode, CodeLlama
  • Program synthesis
  • Automated debugging

🔬 Scientific AI

AI for scientific discovery

  • AlphaFold for proteins
  • Material discovery
  • Mathematical reasoning

🎮 Embodied AI

AI in physical or simulated worlds

  • Robotics integration
  • Virtual agents
  • Simulation-to-reality

🌐 Multilingual & Cross-lingual

Better support for low-resource languages, cross-lingual transfer, and culturally-aware models

⚡ On-device AI

Running powerful models on smartphones, edge devices, and without cloud connectivity

7.4 Self-hosting LLMs


Running Your Own Language Models

Self-hosting LLMs gives you complete control over data privacy, customization, and costs, but requires technical infrastructure and expertise.

Why Self-host? Data privacy, cost control at scale, offline operation, full customization, no rate limits, and compliance requirements.

Hardware Requirements

Estimating VRAM Needs

Rule of Thumb: Model size in billions × 2 bytes (FP16) = VRAM needed in GB

Example: 7B model × 2 = 14GB VRAM minimum
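The rule of thumb can be wrapped in a small helper. The 20% overhead factor is an assumption (matching the buffer suggested under Common Pitfalls below) covering activations and KV cache on top of the raw weights:

```python
def vram_estimate_gb(params_billions, bytes_per_param=2.0, overhead=1.2):
    """Rule-of-thumb VRAM for inference: parameters x bytes per weight,
    plus ~20% headroom for activations and KV cache (an assumption)."""
    return params_billions * bytes_per_param * overhead

# FP16 (2 bytes/param) vs 4-bit (0.5 bytes/param) for a 7B model
print(round(vram_estimate_gb(7), 1))                        # 16.8
print(round(vram_estimate_gb(7, bytes_per_param=0.5), 1))   # 4.2
```

This matches the table below: ~14 GB of raw weights for a 7B FP16 model, with the buffer explaining why a 16 GB card is a safer fit than a 14 GB minimum suggests.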

| Model Size | FP16 VRAM | 4-bit VRAM | Recommended GPU |
|------------|-----------|------------|-----------------|
| 7B (e.g., Llama 2 7B) | ~14 GB | ~4 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| 13B (e.g., Llama 2 13B) | ~26 GB | ~7 GB | RTX 3090 24GB, RTX 4090 24GB |
| 30-34B | ~60 GB | ~16 GB | A100 40GB, multi-GPU setup |
| 70B (e.g., Llama 2 70B) | ~140 GB | ~35 GB | 2× A100 80GB or 2× H100 80GB |

Popular Open Models for Self-hosting

Llama Family (Meta)

  • Llama 3.1: 8B, 70B, 405B variants
  • Llama 3.2: Vision models, lightweight variants
  • Instruction-tuned versions available
  • Strong general-purpose performance

Mistral AI Models

  • Mistral 7B: Efficient, powerful 7B model
  • Mixtral 8×7B: MoE architecture, ~47B total parameters (~13B active per token)
  • Apache 2.0 license
  • Excellent efficiency

Falcon (TII)

7B, 40B, 180B models with strong performance

Phi (Microsoft)

Small but powerful models (1.3B-3.8B)

Qwen (Alibaba)

Multilingual models with strong coding abilities

Self-hosting Tools & Frameworks

| Tool | Best For | Key Features |
|------|----------|--------------|
| Ollama | Easy local deployment | One-command installation, model library, REST API |
| vLLM | Production inference | PagedAttention, high throughput, OpenAI-compatible API |
| Text Generation Inference (TGI) | Hugging Face models | Optimized serving, streaming, token authentication |
| llama.cpp | CPU/consumer GPU | Quantization, low memory, cross-platform |
| LM Studio | Desktop GUI | User-friendly, model discovery, local chat interface |
| LocalAI | OpenAI drop-in replacement | API compatibility, multi-model, self-contained |

Model Quantization

Reducing Model Size & Memory Requirements

Quantization converts model weights from high precision (FP16/32) to lower precision (INT8/4) to reduce memory and increase speed.

Quantization Methods

  • GPTQ: Post-training quantization, 4-bit, fast inference
  • GGUF: llama.cpp format, 2-8 bit, CPU-friendly
  • AWQ: Activation-aware, preserves important weights
  • bitsandbytes: 8-bit, 4-bit quantization library

Precision Trade-offs

  • FP16: Full quality, half the memory of FP32
  • 8-bit: Minimal quality loss, 2× savings
  • 4-bit: Small quality loss, 4× savings
  • 3-bit/2-bit: Noticeable degradation, experimental

Practical Tip: For most use cases, 4-bit quantization (GPTQ or GGUF Q4) provides excellent quality while making large models accessible on consumer hardware.
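A minimal NumPy sketch of symmetric int8 quantization shows where the savings and the error come from. Per-tensor scaling is the simplest variant; real methods such as GPTQ and AWQ are considerably more sophisticated (per-group scales, activation-aware weight selection).

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map float weights onto
    int8 [-127, 127] with a single scale factor."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# 4x memory saving vs FP32; rounding error bounded by half the scale step
print(w.nbytes // q.nbytes)                    # 4
print(float(np.abs(w - w_hat).max()) < scale)  # True
```

Going from 8-bit to 4-bit halves memory again but doubles the step size, which is why quality degrades gradually as precision drops.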

Production Deployment Considerations

Performance Optimization

  • Batching: Process multiple requests together
  • Caching: KV cache for efficient generation
  • Load Balancing: Distribute across multiple instances
  • Streaming: Token-by-token output for responsiveness

Infrastructure

  • Monitoring: Track latency, throughput, errors
  • Scaling: Auto-scale based on demand
  • Security: Authentication, rate limiting, input validation
  • Logging: Track usage, debug issues

Common Pitfalls

  • Underestimating memory requirements (add 20% buffer)
  • Not implementing timeout mechanisms
  • Ignoring prompt injection vulnerabilities
  • Insufficient error handling and fallbacks

7.5 Finetuning LLMs


Customizing Models for Your Use Case

Finetuning adapts pre-trained models to specific tasks, domains, or behaviors by training on targeted datasets.

"Finetuning is the bridge between general-purpose foundation models and specialized, production-ready AI systems."

When to Finetune vs. Use Prompt Engineering

Use Prompt Engineering When:

  • Task is well-defined and examples fit in context
  • Need fast iteration and flexibility
  • Limited training data available
  • Model already performs reasonably well
  • Budget/time constraints exist

Finetune When:

  • Need consistent behavior across many requests
  • Domain-specific knowledge not in base model
  • Have quality training dataset (100s-1000s examples)
  • Need to reduce latency/cost per request
  • Specific output format or style required

The Hierarchy of Adaptation

Prompt Engineering → Few-shot Learning → RAG → Finetuning → Pre-training from Scratch

Start simple and move right only when necessary!

Full Finetuning vs. Parameter-Efficient Finetuning

| Aspect | Full Finetuning | PEFT (LoRA/QLoRA) |
|--------|-----------------|-------------------|
| Parameters updated | All model parameters | Small adapter layers (0.1-1%) |
| Memory required | Very high (multiple copies) | Much lower (base model + adapters) |
| Training time | Slow | Fast |
| GPU requirements | A100/H100 for large models | Consumer GPUs sufficient |
| Storage | Full model copy per task | Small adapter files (MBs) |
| Use case | Major domain shift | Most finetuning scenarios |

LoRA & QLoRA: Efficient Finetuning

Low-Rank Adaptation (LoRA)

LoRA freezes the base model and trains small adapter matrices that are added to attention layers. Instead of updating all weights, it learns low-rank decomposition matrices.

Key Innovation: W' = W + BA (where B and A are small, low-rank matrices)
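The W' = W + BA idea fits in a few lines of NumPy. The base weight stays frozen, only the two small matrices would be trained, and the conventional zero initialization of B makes the adapter a no-op at the start of training. Dimensions here are toy values.

```python
import numpy as np

d, r = 512, 8                        # model dim, LoRA rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight
B = np.zeros((d, r))                 # trainable adapter, zero-initialized
A = rng.normal(size=(r, d)) * 0.01   # trainable adapter

def lora_forward(x):
    """Apply W' = W + BA lazily, without materializing the merged matrix."""
    return x @ W.T + (x @ A.T) @ B.T

full_params = d * d                  # what full finetuning would update
lora_params = d * r * 2              # what LoRA actually trains
print(full_params // lora_params)    # 32x fewer trainable params at this size

x = rng.normal(size=(1, d))
# With B at zero, the adapter contributes nothing: output equals base model
assert np.allclose(lora_forward(x), x @ W.T)
```

After training, BA can be merged into W once, which is why LoRA adds no inference latency.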

LoRA Benefits

  • Reduce trainable parameters by 10,000×
  • Faster training and lower GPU memory
  • Swap adapters for different tasks
  • No inference latency overhead
  • Maintain base model quality

QLoRA Enhancement

  • Quantize base model to 4-bit
  • Train LoRA adapters in higher precision
  • Finetune 70B models on single consumer GPU
  • Minimal quality degradation
  • Democratizes large model finetuning

Practical Example: QLoRA enables finetuning a 65-70B model on a single 48GB GPU, which would otherwise require 4-8× A100 GPUs with full finetuning; a 13B model fits comfortably on an RTX 4090 (24GB VRAM).

The Finetuning Process

Step-by-Step Workflow

  1. Define Objective: What specific behavior or knowledge do you want?
  2. Prepare Dataset: Collect and format high-quality examples
  3. Choose Base Model: Select appropriate size and architecture
  4. Configure Training: Set hyperparameters (learning rate, epochs, batch size)
  5. Train: Monitor loss curves and validation metrics
  6. Evaluate: Test on held-out data, compare to baseline
  7. Iterate: Refine dataset and hyperparameters based on results
  8. Deploy: Serve the finetuned model or adapter

Dataset Preparation for Finetuning

Quality Over Quantity

A smaller dataset of high-quality, diverse examples typically outperforms a large dataset of mediocre examples.

Data Format

Instruction Format (Most Common):

  • System prompt (optional)
  • User instruction/question
  • Assistant response

Example: Alpaca, ShareGPT format
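The instruction format above can be sketched as a single Alpaca-style record; the field names follow that format, and the field contents are invented purely for illustration:

```python
import json

# One Alpaca-style training record (contents are hypothetical)
record = {
    "instruction": "Summarize the customer's issue in one sentence.",
    "input": "My order arrived with a cracked screen and will not power on.",
    "output": "The customer received a device with a damaged screen that does not turn on.",
}

# Training files are typically JSON Lines: one serialized record per line
line = json.dumps(record)
print(line[:40])
```

Whatever format you choose, consistency matters most: every record should use the same fields, and the same chat template the base model expects.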

Dataset Size Guidelines

  • Minimum: 100-200 quality examples
  • Sweet Spot: 1,000-10,000 examples
  • Diminishing Returns: Beyond 50K
  • Diversity: More important than size

Common Data Quality Issues

  • Inconsistent formatting across examples
  • Duplicates or near-duplicates
  • Factual errors in training data
  • Biased or toxic content
  • Too similar to base model's existing knowledge

Key Hyperparameters for Finetuning

| Hyperparameter | Typical Range | Notes |
|----------------|---------------|-------|
| Learning Rate | 1e-5 to 5e-4 | LoRA can use a higher LR than full finetuning; start conservative |
| Batch Size | 4-32 | Larger is better but limited by GPU memory |
| Epochs | 1-5 | More epochs can cause overfitting; monitor validation loss |
| LoRA Rank (r) | 8-64 | Higher rank = more capacity but slower; 16 is common |
| LoRA Alpha | 16-32 | Scaling factor; often set to 2× rank |
| Warmup Steps | 5-10% of total | Gradually increase LR to stabilize training |
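The warmup guideline can be sketched as a simple schedule function. The linear decay after warmup is an assumption for illustration (cosine decay is also common); the 2e-4 peak matches the LoRA learning-rate range above.

```python
def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_frac=0.05):
    """Linear warmup to peak_lr over the first ~5% of steps,
    then linear decay to zero (decay shape is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps      # ramp up
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)  # decay

total = 1000
print(lr_at_step(0, total))     # 0.0 (start of warmup)
print(lr_at_step(50, total))    # 0.0002 (peak, end of warmup)
print(lr_at_step(1000, total))  # 0.0 (end of training)
```

Without warmup, the first large gradient updates on a freshly attached adapter can destabilize training; the ramp gives the optimizer statistics time to settle.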

Popular Finetuning Frameworks

Hugging Face Libraries

  • Transformers: Core library for models
  • PEFT: Parameter-Efficient Fine-Tuning (LoRA, QLoRA)
  • TRL: Transformer Reinforcement Learning (RLHF, DPO)
  • Accelerate: Distributed training utilities

Other Frameworks

  • Axolotl: User-friendly YAML configs
  • LLaMA-Factory: GUI for finetuning
  • Unsloth: 2× faster finetuning
  • Ludwig: Low-code ML platform

Recommended Starting Point: Hugging Face PEFT + Transformers provides the most flexibility and community support. Axolotl is great for quick experimentation with configs.

Evaluating Finetuned Models

Quantitative Metrics

  • Perplexity: Lower is better (language modeling)
  • Task-specific: Accuracy, F1, BLEU, ROUGE
  • Benchmarks: MMLU, HumanEval, MT-Bench
  • Loss Curves: Monitor training and validation loss
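Perplexity follows directly from the loss you already monitor: it is the exponential of the average per-token negative log-likelihood (the cross-entropy loss reported by most training frameworks). The example loss values are illustrative.

```python
import math

def perplexity(avg_nll):
    """Perplexity = exp(average per-token negative log-likelihood).
    Intuitively: the model's effective branching factor per token."""
    return math.exp(avg_nll)

# Lower loss means lower perplexity: e.g., a drop from 2.0 to 1.6
print(round(perplexity(2.0), 2))  # 7.39
print(round(perplexity(1.6), 2))  # 4.95
```

Because the relationship is exponential, a modest drop in loss is a meaningful perplexity improvement; always compare on the same held-out data and tokenizer.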

Qualitative Evaluation

  • Human Review: Sample outputs manually
  • A/B Testing: Compare to base model
  • Domain Expert Review: Specialist evaluation
  • Red Teaming: Test edge cases and failures

Watch for These Issues

  • Overfitting: Perfect on training data, poor on new examples
  • Catastrophic Forgetting: Loss of general capabilities
  • Hallucination Increase: Making up information more frequently
  • Style Drift: Unwanted changes in tone or behavior

Advanced Finetuning Techniques

DPO

Direct Preference Optimization

Simpler alternative to RLHF, trains directly on preference pairs

RLHF

Reinforcement Learning from Human Feedback

Train reward model, then optimize policy with RL

Mixture of Adapters

Multi-task Learning

Train multiple LoRA adapters for different tasks

Continued Pre-training

For domain adaptation, continue pre-training on domain-specific unlabeled text before instruction finetuning. This is particularly effective for specialized domains like medicine, law, or code.

Practical Finetuning Example

Use Case: Customer Support Chatbot

Fine-tuning a 7B model to handle company-specific customer inquiries with proper tone and accurate product information.

Implementation Steps

  1. Data Collection: 2,000 real customer conversations (anonymized)
  2. Base Model: Mistral 7B Instruct (strong baseline)
  3. Method: QLoRA with rank=16, alpha=32
  4. Training: 3 epochs, learning rate 2e-4, batch size 8
  5. Hardware: Single RTX 4090 (24GB), 6 hours training
  6. Results: 85% accuracy on test set, 40% faster than GPT-4
  7. Deployment: vLLM serving with 50ms latency

Outcome: Reduced API costs by 90%, improved response relevance, maintained data privacy, and achieved sub-second response times.

Finetuning Best Practices

Before Training

  • Start with prompt engineering baseline
  • Curate high-quality, diverse dataset
  • Split data into train/validation/test
  • Document data sources and preprocessing
  • Choose appropriate base model for task

During Training

  • Monitor both training and validation loss
  • Save checkpoints regularly
  • Use early stopping to prevent overfitting
  • Log hyperparameters for reproducibility
  • Sample outputs periodically for quality check

After Training

  • Compare to base model on diverse test cases
  • Test edge cases and potential failure modes
  • Validate on real-world data, not just test set
  • Document model limitations and known issues
  • Plan for monitoring and retraining schedule

Unit 7 Summary

Multimodal AI (7.1)

  • Vision-language models
  • Cross-attention architectures
  • Applications across modalities

Agentic Systems (7.2)

  • ReAct and agent patterns
  • Tool use and function calling
  • Safety and oversight

Research Directions (7.3)

  • Advanced reasoning techniques
  • Efficiency and scalability
  • AI alignment and safety

Self-hosting & Finetuning (7.4-7.5)

  • Running open models locally
  • LoRA and QLoRA
  • Production deployment

Key Takeaways from Unit 7

1. Multimodal is the Future

AI systems are moving beyond text to understand and generate across multiple modalities, enabling richer and more natural interactions.

2. Agents Extend Capabilities

Agentic systems that can plan, use tools, and take actions represent a paradigm shift from passive assistants to active problem-solvers.

3. Efficiency Enables Access

Techniques like quantization, LoRA, and optimized inference are democratizing access to powerful AI capabilities.

4. Customization Matters

Finetuning and self-hosting give you control over performance, privacy, and costs for production applications.

Future Outlook & Trends

Where is Generative AI Heading?

🚀 Scaling

Larger models with trillions of parameters and longer contexts

⚡ Efficiency

Smaller, faster models matching current large model performance

🤖 Autonomy

More capable agents handling complex, multi-step tasks

🌐 Multimodal

Seamless any-to-any generation across all modalities

🔒 Safety

Better alignment, robustness, and interpretability

🌍 Accessibility

Open models, on-device AI, democratized access

"We're still in the early days of generative AI. The most transformative applications haven't been built yet."

Resources for Further Learning

📚 Research Papers

  • Attention is All You Need (Transformers)
  • GPT-4 Technical Report
  • LoRA: Low-Rank Adaptation
  • Constitutional AI (Anthropic)
  • ReAct: Synergizing Reasoning and Acting

🛠️ Practical Resources

  • Hugging Face Documentation
  • LangChain / LangGraph Tutorials
  • Ollama Quick Start
  • Axolotl Finetuning Guide
  • vLLM Deployment Docs

Recommended Next Steps

  • Experiment with multimodal models (GPT-4V, Claude)
  • Build a simple agent with tool use
  • Deploy an open model locally with Ollama
  • Finetune a small model on a custom dataset
  • Stay current with research (arXiv, AI conferences)

Thank You!

Unit 7: Advanced Topics in Generative AI

Keep Learning, Keep Building! 🚀

Questions? Discussion? Let's explore together!
