Overview
Course: Self-Supervised Audio-Visual Learning. This course covers the theoretical and practical foundations of self-supervised audio-visual learning, an advanced area of deep learning that exploits multimodal audio and video signals without requiring explicit labels. It covers the principles of self-supervised learning, cross-modal alignment, shared representations, Transformer-based architectures, and contrastive learning. The course prepares students for applied research and for building scalable multimodal systems in real-world settings such as speech recognition, video understanding, robotics, and intelligent perception systems.
Course Content
Module 1: Foundations of Self-Supervised Learning
- Supervised vs Unsupervised vs Self-Supervised Learning
- Pretext Tasks and Proxy Objectives
- Representation Learning Principles
- Information Theory and Mutual Information
Module 2: Audio and Visual Signal Fundamentals
- Digital Audio Signals and Spectrograms
- Video Representation and Temporal Modeling
- Feature Extraction for Audio and Video
- Synchronization and Temporal Alignment
Module 3: Audio-Visual Correspondence Learning
- Cross-Modal Consistency Assumptions
- Audio-Visual Synchrony Detection
- Temporal Misalignment as Supervision
- Positive and Negative Pair Construction
Module 4: Contrastive Learning for Audio-Visual Data
- Contrastive Loss Functions
- InfoNCE and Variants
- Memory Banks and Large-Batch Training
- Hard Negative Mining
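As an illustration of the kind of objective this module covers, the following is a minimal sketch of an audio-to-video InfoNCE loss in plain Python. It is not taken from the course materials: the function names `l2_normalize` and `info_nce`, the temperature of 0.1, and the toy two-dimensional embeddings are all assumptions chosen for illustration. Each audio clip treats the video clip at the same batch index as its positive and every other video clip in the batch as a negative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(audio_emb, video_emb, temperature=0.1):
    """Mean audio-to-video InfoNCE loss over a batch of paired embeddings.

    audio_emb[i] and video_emb[i] are assumed to come from the same clip
    (the positive pair); all other video embeddings act as negatives.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    total = 0.0
    for i, a in enumerate(audio):
        # Temperature-scaled cosine similarity to every video in the batch.
        logits = [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / len(audio)


# Toy batch: two clips with hypothetical 2-D embeddings.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(audio_batch, [[1.0, 0.0], [0.0, 1.0]])   # pairs match
shuffled = info_nce(audio_batch, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped
```

In this toy batch the loss is close to zero when each audio embedding matches its own video and much larger when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward. Practical systems usually compute the same quantity with a tensor library, and often symmetrize it by adding the video-to-audio direction.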
Module 5: Architectures for Audio-Visual Self-Supervision
- CNN-Based Audio-Visual Encoders
- Transformer-Based Multimodal Models
- Cross-Attention Mechanisms
- Late Fusion vs Early Fusion
Module 6: Large-Scale Pretraining Strategies
- Dataset Construction without Labels
- Sampling and Augmentation Strategies
- Distributed Training for Multimodal Models
- Stability and Collapse Prevention
Module 7: Downstream Tasks and Transfer Learning
- Audio-Visual Speech Recognition
- Video Event Understanding
- Sound Source Localization
- Action Recognition with Audio Cues
Module 8: Evaluation and Benchmarking
- Linear Probing Protocols
- Fine-Tuning Strategies
- Cross-Dataset Generalization
- Multimodal Representation Metrics
Module 9: Advanced Topics and Research Frontiers
- Masked Modeling for Audio-Visual Data
- Foundation Models for Multimodal Learning
- Self-Supervised Learning in Robotics and Embodied AI
- Ethical and Bias Considerations in Multimodal Models