Overview
Course: Speech and Vision Multimodal Models. This course explores the development of multimodal models that integrate speech and computer vision, covering topics from the joint representation of acoustic and visual signals to advanced Transformer-based architectures. The focus is on building models that can understand, align, and reason over information from speech audio and video or images, with applications in audio-visual speech recognition, intelligent assistants, human-machine interaction, robotics, and multimodal perceptual systems.
Course Content
Module 1: Introduction to Speech and Vision Multimodality
- Multimodal Perception in AI Systems
- Speech and Vision Complementarity
- Historical Evolution of Multimodal Models
- Applications and Use Cases
Module 2: Fundamentals of Speech Representation
- Speech Signal Characteristics
- Spectrograms and Acoustic Features
- Deep Speech Encoders
- Temporal Modeling of Speech
Module 3: Fundamentals of Visual Representation
- Image and Video Feature Extraction
- Spatial and Temporal Visual Modeling
- CNNs and Vision Transformers
- Visual Attention Mechanisms
Module 4: Speech-Visual Alignment and Synchronization
- Temporal Alignment Challenges
- Audio-Visual Synchrony
- Lip Reading and Visual Speech Cues
- Cross-Modal Alignment Techniques
Module 5: Multimodal Fusion Strategies
- Early Fusion Architectures
- Late Fusion Architectures
- Hybrid and Hierarchical Fusion
- Cross-Modal Attention
Module 6: Transformer-Based Speech and Vision Models
- Multimodal Transformer Architectures
- Cross-Attention for Speech and Vision
- Shared vs Modality-Specific Encoders
- Scaling Multimodal Transformers
Module 7: Self-Supervised and Contrastive Learning
- Self-Supervised Pretraining for Speech and Vision
- Contrastive Objectives for Multimodal Learning
- Masked Modeling for Audio-Visual Data
- Transfer Learning to Downstream Tasks
Module 8: Downstream Tasks and Applications
- Audio-Visual Speech Recognition
- Visual Question Answering with Speech
- Multimodal Emotion Recognition
- Human-Computer Interaction Systems
Module 9: Evaluation and Deployment
- Multimodal Evaluation Protocols
- Robustness and Generalization
- Real-Time Inference Constraints
- Ethical Considerations in Multimodal AI