Unit 7: Advanced Topics in Generative AI
Exploring the Frontiers of AI Innovation
4 Hours • 5 Topics
Unit Overview
What We'll Explore
This unit covers cutting-edge developments in Generative AI, from multimodal systems to autonomous agents and practical implementation strategies.
🎯 Learning Objectives
- Understand multimodal AI architectures
- Design agentic AI systems
- Explore emerging research frontiers
- Implement self-hosted LLMs
- Fine-tune models for specific tasks
📚 Topics Covered
- Multimodal AI Fundamentals
- Agentic AI Systems
- Emerging Research Directions
- Self-hosting LLMs
- Finetuning LLMs
7.1 Fundamentals of Multimodal AI
Beyond Text: AI That Sees, Hears, and Understands
Multimodal AI systems process and integrate multiple types of data including text, images, audio, and video to create richer, more contextual understanding.
Key Concept: Multimodal AI combines different sensory modalities to achieve more human-like perception and reasoning capabilities.
What is Multimodal AI?
Traditional AI
Single modality input (text OR image OR audio)
Limited context understanding
Multimodal AI
Multiple modality inputs (text AND image AND audio)
Rich contextual understanding
Any-to-Any
Any input to any output modality
Maximum flexibility
Common Modality Combinations
- Vision-Language: GPT-4V, Claude 3, Gemini (image + text understanding)
- Audio-Language: Whisper, Speech synthesis models
- Video Understanding: Combining temporal visual + audio + language
- Multimodal Generation: DALL-E 3, Stable Diffusion, Sora (text to image/video)
Multimodal AI Architectures
Early Fusion
Combine modalities at input level before processing
- Concatenate features early
- Joint embedding space
- Example: CLIP embeddings
Late Fusion
Process modalities separately, combine at output
- Independent encoders
- Merge predictions/features
- More modular approach
Cross-Attention Mechanisms
Modern approach: Use cross-attention layers to allow modalities to attend to each other, enabling deep interaction between vision and language tokens.
- Vision tokens attend to language tokens and vice versa
- Flexible integration at multiple layers
- Used in GPT-4V, Flamingo, and other state-of-the-art models
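The idea above can be illustrated with a minimal single-head cross-attention sketch in NumPy: one modality's tokens act as queries over another modality's tokens (the keys/values). The dimensions and random "embeddings" here are illustrative placeholders, not taken from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Single-head cross-attention: one modality's tokens (queries)
    attend over another modality's tokens (keys/values)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_kv) similarity scores
    weights = softmax(scores, axis=-1)       # each query's distribution over kv tokens
    return weights @ values                  # (n_q, d) blended representation

rng = np.random.default_rng(0)
vision_tokens = rng.standard_normal((4, 8))  # e.g. 4 image-patch embeddings, dim 8
text_tokens = rng.standard_normal((6, 8))    # e.g. 6 word embeddings, dim 8

# vision attends to language; the reverse direction is symmetric
fused = cross_attention(vision_tokens, text_tokens, text_tokens)
print(fused.shape)  # (4, 8)
```

Real models stack many such heads at multiple layers and learn separate projection matrices for queries, keys, and values; this sketch only shows the core attention step.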
Key Multimodal Models & Applications
| Model | Modalities | Key Capabilities |
|---|---|---|
| GPT-4V | Text, Images | Visual understanding, OCR, diagram analysis, image description |
| CLIP | Text, Images | Zero-shot image classification, image-text matching |
| Whisper | Audio, Text | Speech recognition, translation, multilingual transcription |
| ImageBind | 6 modalities | Joint embedding of images, text, audio, depth, thermal, IMU data |
| Gemini | Text, Image, Audio, Video | Native multimodal reasoning, video understanding |
Industry Impact: Multimodal AI enables applications like medical image diagnosis with contextual patient data, autonomous vehicles combining camera and LIDAR, and accessibility tools for vision-impaired users.
Challenges & Future Directions
Current Challenges
- Alignment: Ensuring modalities are properly synchronized and aligned
- Data Requirements: Need paired multimodal datasets
- Computational Cost: Processing multiple modalities is expensive
- Evaluation: Difficult to benchmark multimodal understanding
- Hallucination: Models may generate incorrect cross-modal associations
Emerging Solutions
- Contrastive Learning: Better alignment through contrastive objectives
- Efficient Architectures: Adapter layers, parameter-efficient methods
- Synthetic Data: Generating paired multimodal data
- Unified Tokenization: Treating all modalities as token sequences
7.2 Agentic AI Systems
From Chatbots to Autonomous Agents
Agentic AI systems can plan, use tools, interact with environments, and autonomously pursue complex goals over extended periods.
What are AI Agents?
Traditional LLMs
- Respond to single prompts
- No persistent state or memory
- Cannot take external actions
- Passive information providers
Agentic Systems
- Plan multi-step workflows
- Maintain context and memory
- Use tools and APIs
- Active goal pursuers
Key Components of AI Agents
Perception
Observe environment and user inputs
Reasoning
Plan actions and make decisions
Action
Execute tools and interact with world
Agent Architecture Patterns
ReAct (Reasoning + Acting)
Interleaves reasoning traces with action execution. Agent thinks aloud about what to do, then acts.
Pattern: Thought → Action → Observation → Thought → Action...
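The Thought → Action → Observation cycle can be sketched with a scripted stand-in for the LLM and one toy tool. Everything here (the calculator tool, the hard-coded trace) is a hypothetical illustration; in a real agent the thoughts and actions are generated by the model at each step.

```python
def calculator(expression: str) -> str:
    # toy tool: evaluate a basic arithmetic expression (no builtins exposed)
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

# scripted trace standing in for LLM outputs at each turn
SCRIPT = [
    {"thought": "I need the subtotal first.", "action": ("calculator", "19 * 3")},
    {"thought": "Now add the shipping fee.", "action": ("calculator", "57 + 4")},
    {"thought": "I have the final answer.", "action": ("finish", "61")},
]

def react_loop(script):
    observations = []
    for step in script:
        tool, arg = step["action"]
        if tool == "finish":          # the agent decides it is done
            return arg, observations
        obs = TOOLS[tool](arg)        # Action: execute the tool
        observations.append(obs)      # Observation: result feeds the next Thought
    return None, observations

answer, trace = react_loop(SCRIPT)
print(answer, trace)  # 61 ['57', '61']
```

The key property of ReAct is that each observation is appended to the model's context before it produces the next thought, which the scripted loop mimics by threading `observations` forward.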
ReWOO (Reasoning Without Observation)
Plans all actions upfront before execution
- Create complete plan first
- Execute all actions
- More efficient, less flexible
Reflexion
Learns from mistakes through self-reflection
- Execute action
- Evaluate outcome
- Reflect and improve
Tool Use & Function Calling
Extending Agent Capabilities
Agents become powerful when they can use external tools: APIs, calculators, search engines, databases, code interpreters, and more.
Common Tool Categories
- Information Retrieval: Search, databases, RAG systems
- Computation: Calculators, code execution, data analysis
- Communication: Email, messaging, notifications
- File Operations: Read, write, modify documents
- Web Interaction: Browser automation, API calls
Function Calling Pattern
- Define available functions with schemas
- LLM decides which function to call
- Extract parameters from context
- Execute function in environment
- Return results to LLM
- Continue conversation with results
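The six steps above can be sketched as a small dispatch loop. The schema shape loosely follows the OpenAI-style tool-call convention, but the function name, arguments, and stub implementation are all hypothetical.

```python
import json

# Step 1: a schema advertised to the model (illustrative, OpenAI-like shape)
FUNCTIONS = {
    "get_weather": {
        "description": "Look up current weather for a city",
        "parameters": {"city": "string"},
    }
}

def get_weather(city: str) -> dict:
    # stub implementation; a real version would call a weather API
    return {"city": city, "temp_c": 21, "condition": "sunny"}

DISPATCH = {"get_weather": get_weather}

def handle_tool_call(raw_call: str) -> str:
    """Steps 2-5: parse the model's tool call, execute it, return results."""
    call = json.loads(raw_call)           # the model's output as JSON
    fn = DISPATCH[call["name"]]           # step 2: which function was chosen
    result = fn(**call["arguments"])      # steps 3-4: extracted params, executed
    return json.dumps(result)             # step 5: serialized back into the chat

# e.g. the model emitted this tool call after reading the user's question
model_output = '{"name": "get_weather", "arguments": {"city": "Pune"}}'
print(handle_tool_call(model_output))
```

Step 6 (continuing the conversation) amounts to appending the returned JSON as a tool message and calling the model again; production code would also validate arguments against the schema before dispatching.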
Popular Agent Frameworks
| Framework | Key Features | Best For |
|---|---|---|
| LangGraph | State machines, cycles, human-in-loop | Complex workflows, production systems |
| AutoGPT | Autonomous goal pursuit, self-prompting | Open-ended tasks, research |
| CrewAI | Multi-agent collaboration, role assignment | Team-based workflows |
| Microsoft AutoGen | Conversable agents, group chat | Multi-agent conversations |
| OpenAI Assistants | Managed threads, built-in tools, retrieval | Quick prototypes, managed infrastructure |
Important: Agent systems can be unpredictable and may take unexpected actions. Always implement proper safety guardrails, monitoring, and human oversight for production deployments.
Challenges in Agentic AI
Technical Challenges
- Long-term Planning: Difficulty maintaining coherent plans over many steps
- Error Propagation: Early mistakes compound over time
- Tool Reliability: Agents struggle when tools fail or return unexpected results
- Context Management: Keeping relevant information in limited context windows
Safety & Control
- Unpredictability: Agents may take unexpected actions
- Runaway Behavior: Infinite loops or excessive API calls
- Security Risks: Potential for malicious tool use
- Alignment: Ensuring agents pursue intended goals
Best Practices
- Implement rate limiting and budgets
- Add human approval for critical actions
- Use simulation environments for testing
- Monitor and log all agent actions
- Design clear success/failure criteria
7.3 Emerging Research Directions
The Cutting Edge of Generative AI
Exploring the latest research trends, breakthrough techniques, and future directions that will shape the next generation of AI systems.
Advanced Reasoning Capabilities
Chain-of-Thought & Beyond
Teaching models to think step-by-step has dramatically improved complex reasoning tasks.
Current Approaches
- Chain-of-Thought (CoT): Explicit reasoning steps
- Tree of Thoughts: Exploring multiple reasoning paths
- Graph of Thoughts: Non-linear reasoning structures
- Self-Consistency: Sampling multiple reasoning paths
Emerging Techniques
- Process Reward Models: Supervising reasoning process, not just outcomes
- Inference-time Compute: Extended thinking during generation
- Metacognition: Models reasoning about their own reasoning
- Symbolic Integration: Hybrid neural-symbolic systems
Efficiency & Scalability Research
Model Compression
- Quantization: 4-bit, 3-bit, 1-bit models
- Pruning: Removing unnecessary parameters
- Distillation: Training smaller models from larger ones
- Sparse Models: Mixture of Experts (MoE)
Training Efficiency
- Flash Attention: Memory-efficient attention mechanisms
- Gradient Checkpointing: Trade compute for memory
- Mixed Precision: FP16, BF16 training
- Zero Redundancy Optimizer: Distributed training optimization
Context Length Extension
- RoPE Scaling: Extending rotary embeddings
- ALiBi: Attention with Linear Biases
- Sparse Attention: Longformer, BigBird patterns
- Retrieval Augmentation: Infinite context via RAG
Breakthrough: Models like Claude and Gemini now support 100K-200K+ token contexts, enabling processing of entire codebases, books, and long documents.
AI Alignment & Safety
Ensuring AI Systems are Helpful, Harmless, and Honest
Alignment research focuses on making AI systems that reliably do what humans intend while avoiding harmful behaviors.
Current Techniques
- RLHF: Reinforcement Learning from Human Feedback
- Constitutional AI: Self-critique against principles
- Red Teaming: Adversarial testing for vulnerabilities
- Interpretability: Understanding model internals
Open Problems
- Scalable Oversight: Supervising superhuman AI
- Robustness: Resisting jailbreaks and adversarial attacks
- Value Learning: Inferring human preferences accurately
- Long-term Safety: Ensuring safe behavior as capabilities increase
Other Exciting Research Areas
🧬 Code Generation
Models that write, debug, and optimize code
- AlphaCode, CodeLlama
- Program synthesis
- Automated debugging
🔬 Scientific AI
AI for scientific discovery
- AlphaFold for proteins
- Material discovery
- Mathematical reasoning
🎮 Embodied AI
AI in physical or simulated worlds
- Robotics integration
- Virtual agents
- Simulation-to-reality
🌐 Multilingual & Cross-lingual
Better support for low-resource languages, cross-lingual transfer, and culturally-aware models
⚡ On-device AI
Running powerful models on smartphones, edge devices, and without cloud connectivity
7.4 Self-hosting LLMs
Running Your Own Language Models
Self-hosting LLMs gives you complete control over data privacy, customization, and costs, but requires technical infrastructure and expertise.
Why Self-host? Data privacy, cost control at scale, offline operation, full customization, no rate limits, and compliance requirements.
Hardware Requirements
Estimating VRAM Needs
Rule of Thumb: Model size in billions × 2 bytes (FP16) = VRAM needed in GB
Example: 7B model × 2 = 14 GB VRAM for the weights alone; leave headroom on top for the KV cache and activations
| Model Size | FP16 VRAM | 4-bit VRAM | Recommended GPU |
|---|---|---|---|
| 7B (e.g., Llama 2 7B) | ~14 GB | ~4 GB | RTX 3060 12GB, RTX 4060 Ti 16GB |
| 13B (e.g., Llama 2 13B) | ~26 GB | ~7 GB | RTX 3090 24GB, RTX 4090 24GB |
| 30-34B | ~60 GB | ~16 GB | A100 40GB, Multi-GPU setup |
| 70B (e.g., Llama 2 70B) | ~140 GB | ~35 GB | A100 80GB × 2, H100 80GB |
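The rule of thumb behind the table can be written as a small helper. The 20% overhead buffer is the one suggested later under common pitfalls; treat the outputs as rough planning numbers, not guarantees.

```python
def estimate_vram_gb(params_billion: float, bits: int = 16, overhead: float = 0.2) -> float:
    """Weights-only VRAM estimate (params x bytes-per-param), plus a buffer
    for the KV cache and activations (the ~20% mentioned under pitfalls)."""
    bytes_per_param = bits / 8
    weights_gb = params_billion * bytes_per_param
    return round(weights_gb * (1 + overhead), 1)

print(estimate_vram_gb(7))            # 16.8  (FP16 7B, with buffer)
print(estimate_vram_gb(7, bits=4))    # 4.2   (4-bit 7B)
print(estimate_vram_gb(70, bits=4))   # 42.0  (4-bit 70B)
```

Note the buffered numbers land a little above the weights-only figures in the table, which is the point: a model whose weights just barely fit will usually fail at inference time.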
Popular Open Models for Self-hosting
Llama Family (Meta)
- Llama 3.1: 8B, 70B, 405B variants
- Llama 3.2: Vision models, lightweight variants
- Instruction-tuned versions available
- Strong general-purpose performance
Mistral AI Models
- Mistral 7B: Efficient, powerful 7B model
- Mixtral 8×7B: MoE architecture, ~47B total parameters (~13B active per token)
- Apache 2.0 license
- Excellent efficiency
Falcon (TII)
7B, 40B, 180B models with strong performance
Phi (Microsoft)
Small but powerful models (1.3B-3.8B)
Qwen (Alibaba)
Multilingual models with strong coding abilities
Self-hosting Tools & Frameworks
| Tool | Best For | Key Features |
|---|---|---|
| Ollama | Easy local deployment | One-command installation, model library, REST API |
| vLLM | Production inference | PagedAttention, high throughput, OpenAI-compatible API |
| Text Generation Inference (TGI) | Hugging Face models | Optimized serving, streaming, token authentication |
| llama.cpp | CPU/consumer GPU | Quantization, low memory, cross-platform |
| LM Studio | Desktop GUI | User-friendly, model discovery, local chat interface |
| LocalAI | OpenAI drop-in replacement | API compatibility, multi-model, self-contained |
Model Quantization
Reducing Model Size & Memory Requirements
Quantization converts model weights from high precision (FP16/32) to lower precision (INT8/4) to reduce memory and increase speed.
Quantization Methods
- GPTQ: Post-training quantization, 4-bit, fast inference
- GGUF: llama.cpp format, 2-8 bit, CPU-friendly
- AWQ: Activation-aware, preserves important weights
- bitsandbytes: 8-bit, 4-bit quantization library
Precision Trade-offs
- FP16: Near-full quality, half the memory of FP32
- 8-bit: Minimal quality loss, 2× savings
- 4-bit: Small quality loss, 4× savings
- 3-bit/2-bit: Noticeable degradation, experimental
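A toy symmetric int8 quantizer makes the trade-off concrete: one scale factor maps the FP32 weights onto the int8 range and back. This is a simplified sketch; methods like GPTQ and AWQ quantize per-group and correct for the resulting error, which this example does not.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor quantization: a single scale maps the
    largest-magnitude weight to +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32) * 0.02  # fake weight tensor
q, scale = quantize_int8(w)

err = np.abs(w - dequantize(q, scale)).max()
print(q.nbytes, w.nbytes)  # 1024 4096  -> 4x smaller
print(err <= scale)        # True: rounding error is bounded by one step
```

The 4x size reduction here matches the "8-bit vs FP32" row of the trade-off list; 4-bit formats push further by packing two weights per byte at the cost of coarser steps.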
Practical Tip: For most use cases, 4-bit quantization (GPTQ or GGUF Q4) provides excellent quality while making large models accessible on consumer hardware.
Production Deployment Considerations
Performance Optimization
- Batching: Process multiple requests together
- Caching: KV cache for efficient generation
- Load Balancing: Distribute across multiple instances
- Streaming: Token-by-token output for responsiveness
Infrastructure
- Monitoring: Track latency, throughput, errors
- Scaling: Auto-scale based on demand
- Security: Authentication, rate limiting, input validation
- Logging: Track usage, debug issues
Common Pitfalls
- Underestimating memory requirements (add 20% buffer)
- Not implementing timeout mechanisms
- Ignoring prompt injection vulnerabilities
- Insufficient error handling and fallbacks
7.5 Finetuning LLMs
Customizing Models for Your Use Case
Finetuning adapts pre-trained models to specific tasks, domains, or behaviors by training on targeted datasets.
When to Finetune vs. Use Prompt Engineering
Use Prompt Engineering When:
- Task is well-defined and examples fit in context
- Need fast iteration and flexibility
- Limited training data available
- Model already performs reasonably well
- Budget/time constraints exist
Finetune When:
- Need consistent behavior across many requests
- Domain-specific knowledge not in base model
- Have quality training dataset (100s-1000s examples)
- Need to reduce latency/cost per request
- Specific output format or style required
The Hierarchy of Adaptation
1. Prompt Engineering → 2. Few-shot Learning → 3. RAG → 4. Finetuning → 5. Pre-training from Scratch
Start simple and move right only when necessary!
Full Finetuning vs. Parameter-Efficient Finetuning
| Aspect | Full Finetuning | PEFT (LoRA/QLoRA) |
|---|---|---|
| Parameters Updated | All model parameters | Small adapter layers (0.1-1%) |
| Memory Required | Very high (multiple copies) | Much lower (base model + adapters) |
| Training Time | Slow | Fast |
| GPU Requirements | A100/H100 for large models | Consumer GPUs sufficient |
| Storage | Full model copy per task | Small adapter files (MBs) |
| Use Case | Major domain shift | Most finetuning scenarios |
LoRA & QLoRA: Efficient Finetuning
Low-Rank Adaptation (LoRA)
LoRA freezes the base model and trains small adapter matrices that are added to attention layers. Instead of updating all weights, it learns low-rank decomposition matrices.
Key Innovation: W' = W + BA (where B and A are small, low-rank matrices)
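The W' = W + BA update can be checked numerically in a few lines. The dimensions, rank, and alpha below are illustrative choices, not values from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                  # hidden dimension
r = 16                                   # LoRA rank, r << d

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable (r x d), small random init
B = np.zeros((d, r))                     # trainable (d x r), zero init so W' = W at start

alpha = 32
W_prime = W + (alpha / r) * (B @ A)      # W' = W + BA, scaled as in the LoRA paper

full_params = W.size                     # what full finetuning would update
lora_params = A.size + B.size            # what LoRA actually trains
print(full_params, lora_params, full_params / lora_params)  # 262144 16384 16.0
```

Two properties fall out directly: because B starts at zero, training begins exactly at the pretrained weights, and because only A and B are trained, the trainable parameter count drops by a factor of d/(2r) per layer (16x in this toy case; far larger for big models where adapters touch only attention projections).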
LoRA Benefits
- Reduce trainable parameters by 10,000×
- Faster training and lower GPU memory
- Swap adapters for different tasks
- No inference latency overhead
- Maintain base model quality
QLoRA Enhancement
- Quantize base model to 4-bit
- Train LoRA adapters in higher precision
- Finetune 65B+ models on a single 48GB GPU
- Minimal quality degradation
- Democratizes large model finetuning
Practical Example: QLoRA enables finetuning a 65B-70B model on a single 48GB GPU (e.g., an A6000), which would otherwise require 8+ A100 GPUs with full finetuning; a 13B model finetunes comfortably on an RTX 4090 (24GB VRAM).
The Finetuning Process
Step-by-Step Workflow
- Define Objective: What specific behavior or knowledge do you want?
- Prepare Dataset: Collect and format high-quality examples
- Choose Base Model: Select appropriate size and architecture
- Configure Training: Set hyperparameters (learning rate, epochs, batch size)
- Train: Monitor loss curves and validation metrics
- Evaluate: Test on held-out data, compare to baseline
- Iterate: Refine dataset and hyperparameters based on results
- Deploy: Serve the finetuned model or adapter
Dataset Preparation for Finetuning
Quality Over Quantity
A smaller dataset of high-quality, diverse examples typically outperforms a large dataset of mediocre examples.
Data Format
Instruction Format (Most Common):
- System prompt (optional)
- User instruction/question
- Assistant response
Example: Alpaca, ShareGPT format
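Converting raw Q/A pairs into the Alpaca-style instruction format looks roughly like this; the questions and answers are made-up placeholders, and the field names follow the Alpaca convention.

```python
import json

# hypothetical raw Q/A pairs from, e.g., a support log
raw_pairs = [
    ("What is your return policy?", "Items can be returned within 30 days."),
    ("Do you ship internationally?", "Yes, to over 40 countries."),
]

def to_alpaca(question: str, answer: str) -> dict:
    return {
        "instruction": question,  # the user instruction
        "input": "",              # optional extra context (empty here)
        "output": answer,         # the target assistant response
    }

# one JSON object per line (JSONL), the shape most training scripts accept
jsonl = "\n".join(json.dumps(to_alpaca(q, a)) for q, a in raw_pairs)
print(jsonl.splitlines()[0])
```

ShareGPT format differs mainly in storing a list of alternating user/assistant turns per record, which suits multi-turn conversations better than single instruction/response pairs.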
Dataset Size Guidelines
- Minimum: 100-200 quality examples
- Sweet Spot: 1,000-10,000 examples
- Diminishing Returns: beyond ~50K examples
- Diversity: More important than size
Common Data Quality Issues
- Inconsistent formatting across examples
- Duplicates or near-duplicates
- Factual errors in training data
- Biased or toxic content
- Too similar to base model's existing knowledge
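The duplicate problem in the list above is cheap to attack with normalization plus hashing. This sketch only catches exact matches after normalization; true near-duplicate detection (e.g. MinHash) goes further.

```python
import hashlib

def normalize(text: str) -> str:
    # lowercase and collapse whitespace so trivial variants hash identically
    return " ".join(text.lower().split())

def dedupe(examples):
    """Drop examples whose normalized form has been seen before,
    keeping the first occurrence."""
    seen, kept = set(), []
    for ex in examples:
        h = hashlib.sha256(normalize(ex).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(ex)
    return kept

data = ["Reset your password here.", "reset  your password HERE.", "Contact support."]
print(dedupe(data))  # ['Reset your password here.', 'Contact support.']
```

Running the same pass on (instruction, output) pairs joined into one string also catches records that differ only in formatting, one of the inconsistency issues listed above.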
Key Hyperparameters for Finetuning
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning Rate | 1e-5 to 5e-4 | LoRA can use higher LR than full finetuning; start conservative |
| Batch Size | 4-32 | Larger is better but limited by GPU memory |
| Epochs | 1-5 | More epochs can cause overfitting; monitor validation loss |
| LoRA Rank (r) | 8-64 | Higher rank = more capacity but slower; 16 is common |
| LoRA Alpha | 16-32 | Scaling factor; often set to 2× rank |
| Warmup Steps | 5-10% of total | Gradually increase LR to stabilize training |
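The warmup row in the table can be made concrete with a tiny scheduler: linear warmup over the first few percent of steps, then linear decay. This is one common schedule (cosine decay is another); the 2e-4 peak and 5% warmup fraction are just example values from the table's ranges.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_frac: float = 0.05) -> float:
    """Linear warmup to peak_lr over warmup_frac of training,
    then linear decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # ramp up from 0
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / remaining)  # decay to 0

total = 1000
print(lr_at_step(0, total))     # 0.0   (start of warmup)
print(lr_at_step(50, total))    # 0.0002 (peak, end of 5% warmup)
print(lr_at_step(1000, total))  # 0.0   (end of training)
```

The purpose of the ramp is stability: large updates on a freshly initialized adapter with a cold optimizer state can blow up the loss, so the learning rate starts near zero.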
Popular Finetuning Frameworks
Hugging Face Libraries
- Transformers: Core library for models
- PEFT: Parameter-Efficient Fine-Tuning (LoRA, QLoRA)
- TRL: Transformer Reinforcement Learning (RLHF, DPO)
- Accelerate: Distributed training utilities
Other Frameworks
- Axolotl: User-friendly YAML configs
- LLaMA-Factory: GUI for finetuning
- Unsloth: 2× faster finetuning
- Ludwig: Low-code ML platform
Recommended Starting Point: Hugging Face PEFT + Transformers provides the most flexibility and community support. Axolotl is great for quick experimentation with configs.
Evaluating Finetuned Models
Quantitative Metrics
- Perplexity: Lower is better (language modeling)
- Task-specific: Accuracy, F1, BLEU, ROUGE
- Benchmarks: MMLU, HumanEval, MT-Bench
- Loss Curves: Monitor training and validation loss
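Perplexity follows directly from the per-token losses you already monitor: it is the exponential of the mean negative log-likelihood. The example loss values below are made up for illustration.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    Lower is better; a uniform guess over V tokens gives perplexity V."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# e.g. per-token cross-entropy losses (in nats) from an eval run
losses = [2.1, 1.8, 2.4, 1.9]
print(round(perplexity(losses), 2))  # 7.77

# sanity check: uniform over a 1000-token vocabulary -> perplexity 1000
uniform_nll = math.log(1000)
print(round(perplexity([uniform_nll] * 5)))  # 1000
```

One caveat when comparing finetuned and base models: perplexity is only comparable across models that share a tokenizer, since it is measured per token rather than per character.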
Qualitative Evaluation
- Human Review: Sample outputs manually
- A/B Testing: Compare to base model
- Domain Expert Review: Specialist evaluation
- Red Teaming: Test edge cases and failures
Watch for These Issues
- Overfitting: Perfect on training data, poor on new examples
- Catastrophic Forgetting: Loss of general capabilities
- Hallucination Increase: Making up information more frequently
- Style Drift: Unwanted changes in tone or behavior
Advanced Finetuning Techniques
DPO
Direct Preference Optimization
Simpler alternative to RLHF, trains directly on preference pairs
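The per-pair DPO objective is compact enough to write out: push the policy's log-probability ratio for the chosen response above that of the rejected one, measured relative to a frozen reference model. The log-probabilities below are invented numbers purely to show the mechanics.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log(sigmoid(beta * [(pi/ref ratio for chosen) - (pi/ref ratio for rejected)]))"""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log(sigmoid(margin))

# policy favors the chosen answer more than the reference does -> low loss
low = dpo_loss(-10.0, -14.0, ref_chosen=-12.0, ref_rejected=-13.0)
# policy favors the rejected answer -> high loss
high = dpo_loss(-14.0, -10.0, ref_chosen=-13.0, ref_rejected=-12.0)
print(low < high)  # True
```

This is why DPO needs no separate reward model or RL loop: the preference data supervises the policy directly, with beta controlling how far it may drift from the reference.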
RLHF
Reinforcement Learning from Human Feedback
Train reward model, then optimize policy with RL
Mixture of Adapters
Multi-task Learning
Train multiple LoRA adapters for different tasks
Continued Pre-training
For domain adaptation, continue pre-training on domain-specific unlabeled text before instruction finetuning. This is particularly effective for specialized domains like medicine, law, or code.
Practical Finetuning Example
Use Case: Customer Support Chatbot
Fine-tuning a 7B model to handle company-specific customer inquiries with proper tone and accurate product information.
Implementation Steps
- Data Collection: 2,000 real customer conversations (anonymized)
- Base Model: Mistral 7B Instruct (strong baseline)
- Method: QLoRA with rank=16, alpha=32
- Training: 3 epochs, learning rate 2e-4, batch size 8
- Hardware: Single RTX 4090 (24GB), 6 hours training
- Results: 85% accuracy on the held-out test set, ~40% lower latency than a GPT-4 API baseline
- Deployment: vLLM serving with 50ms latency
Outcome: Reduced API costs by 90%, improved response relevance, maintained data privacy, and achieved sub-second response times.
Finetuning Best Practices
Before Training
- Start with prompt engineering baseline
- Curate high-quality, diverse dataset
- Split data into train/validation/test
- Document data sources and preprocessing
- Choose appropriate base model for task
During Training
- Monitor both training and validation loss
- Save checkpoints regularly
- Use early stopping to prevent overfitting
- Log hyperparameters for reproducibility
- Sample outputs periodically for quality check
After Training
- Compare to base model on diverse test cases
- Test edge cases and potential failure modes
- Validate on real-world data, not just test set
- Document model limitations and known issues
- Plan for monitoring and retraining schedule
Unit 7 Summary
Multimodal AI (7.1)
- Vision-language models
- Cross-attention architectures
- Applications across modalities
Agentic Systems (7.2)
- ReAct and agent patterns
- Tool use and function calling
- Safety and oversight
Research Directions (7.3)
- Advanced reasoning techniques
- Efficiency and scalability
- AI alignment and safety
Self-hosting & Finetuning (7.4-7.5)
- Running open models locally
- LoRA and QLoRA
- Production deployment
Key Takeaways from Unit 7
1. Multimodal is the Future
AI systems are moving beyond text to understand and generate across multiple modalities, enabling richer and more natural interactions.
2. Agents Extend Capabilities
Agentic systems that can plan, use tools, and take actions represent a paradigm shift from passive assistants to active problem-solvers.
3. Efficiency Enables Access
Techniques like quantization, LoRA, and optimized inference are democratizing access to powerful AI capabilities.
4. Customization Matters
Finetuning and self-hosting give you control over performance, privacy, and costs for production applications.
Future Outlook & Trends
Where is Generative AI Heading?
🚀 Scaling
Larger models with trillions of parameters and longer contexts
⚡ Efficiency
Smaller, faster models matching current large model performance
🤖 Autonomy
More capable agents handling complex, multi-step tasks
🌐 Multimodal
Seamless any-to-any generation across all modalities
🔒 Safety
Better alignment, robustness, and interpretability
🌍 Accessibility
Open models, on-device AI, democratized access
Resources for Further Learning
📚 Research Papers
- Attention is All You Need (Transformers)
- GPT-4 Technical Report
- LoRA: Low-Rank Adaptation
- Constitutional AI (Anthropic)
- ReAct: Synergizing Reasoning and Acting
🛠️ Practical Resources
- Hugging Face Documentation
- LangChain / LangGraph Tutorials
- Ollama Quick Start
- Axolotl Finetuning Guide
- vLLM Deployment Docs
Recommended Next Steps
- Experiment with multimodal models (GPT-4V, Claude)
- Build a simple agent with tool use
- Deploy an open model locally with Ollama
- Finetune a small model on a custom dataset
- Stay current with research (arXiv, AI conferences)
Thank You!
Unit 7: Advanced Topics in Generative AI
Keep Learning, Keep Building! 🚀
Questions? Discussion? Let's explore together!