Overview
Course: Video Transformers in Depth. This course takes an in-depth look at Video Transformers, covering modern architectures for large-scale spatio-temporal video modeling. It ranges from the fundamentals of attention applied to visual sequences to advanced models used in research and production, including temporal Vision Transformers, factorized attention, sparse attention, masked video modeling, and integration with multimodal models. The goal is to equip students to design, train, and evaluate high-performance video models for complex visual understanding tasks.
Syllabus
Module 1: Foundations of Video Representation
- Video as Spatio-Temporal Data
- Frame-Based vs Clip-Based Modeling
- Temporal Sampling Strategies
- Challenges in Video Understanding
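As a small illustration of the temporal sampling topic in Module 1, the sketch below picks the frame indices for a clip by uniform sampling: the video is split into equal segments and the center frame of each segment is taken. The function name `uniform_frame_indices` and its parameters are hypothetical, chosen only for this example, and other strategies (random or dense sampling) are equally valid.

```python
from typing import List


def uniform_frame_indices(num_frames: int, clip_len: int) -> List[int]:
    """Pick clip_len frame indices spread evenly over num_frames frames."""
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    # Split the video into clip_len equal segments and take each segment's center.
    segment = num_frames / clip_len
    return [min(num_frames - 1, int(segment * (i + 0.5))) for i in range(clip_len)]


# A 32-frame video sampled down to an 8-frame clip.
indices = uniform_frame_indices(32, 8)
```

When the video is shorter than the requested clip, this sketch simply repeats frames, which is one possible way to handle short videos.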
Module 2: From CNNs to Video Transformers
- 2D CNNs vs 3D CNNs
- Limitations of Convolutional Video Models
- Motivation for Transformer-Based Video Models
- Tokenization of Video Data
Module 3: Core Video Transformer Architectures
- Vanilla Video Transformer
- Factorized Spatial-Temporal Attention
- Divided vs Joint Attention
- Patch Embedding for Video
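To make the contrast between joint and divided attention in Module 3 concrete, the sketch below counts tokens and query-key pairs for a single clip. It assumes non-overlapping tubelet patches, ignores any class token, and models divided attention as one temporal pass (each token attends to the same spatial location across time) followed by one spatial pass (each token attends within its own time step). The function names and the default tubelet size are illustrative assumptions, not taken from the course.

```python
def token_grid(frames, height, width, tubelet=(2, 16, 16)):
    """Return (temporal tokens, spatial tokens per time step) for one clip."""
    t, ph, pw = tubelet
    if frames % t or height % ph or width % pw:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return frames // t, (height // ph) * (width // pw)


def attention_pairs(frames, height, width, tubelet=(2, 16, 16)):
    """Count query-key pairs per layer for joint vs divided attention."""
    n_t, n_s = token_grid(frames, height, width, tubelet)
    n = n_t * n_s
    joint = n * n                # every token attends to every token
    divided = n * n_t + n * n_s  # temporal pass plus spatial pass
    return {"tokens": n, "joint": joint, "divided": divided}


# A 16-frame, 224x224 clip with 2x16x16 tubelets.
costs = attention_pairs(16, 224, 224)
```

For this clip the count comes to 1,568 tokens, about 2.46 million pairs for joint attention and about 0.32 million for divided attention, which is the kind of saving that motivates factorized designs.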
Module 4: Efficient Attention Mechanisms for Video
- Sparse Attention for Long Videos
- Sliding Window and Local Attention
- Performer and Linear Attention
- Memory-Efficient Transformer Variants
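A minimal sketch of the sliding-window idea from Module 4: a boolean mask in which each query position may attend only to keys within a fixed distance, so the number of allowed pairs grows roughly linearly with sequence length instead of quadratically. The helper names and the symmetric window definition are assumptions made for this example; real implementations usually avoid materializing the full mask.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j (|i - j| <= window)."""
    if seq_len <= 0 or window < 0:
        raise ValueError("seq_len must be positive and window non-negative")
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def allowed_pairs(mask):
    """Count the query-key pairs that the mask keeps."""
    return sum(sum(row) for row in mask)


# Six temporal positions, each attending to itself and one neighbor on each side.
mask = sliding_window_mask(6, 1)
kept = allowed_pairs(mask)
```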
Module 5: Temporal Modeling Strategies
- Long-Range Temporal Dependencies
- Positional Encoding for Video
- Temporal Hierarchies
- Multi-Scale Temporal Modeling
Module 6: Self-Supervised and Masked Video Modeling
- Masked Video Autoencoders
- Contrastive Video Representation Learning
- Cross-Clip Prediction Tasks
- Pretraining at Scale
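For the masked video modeling topic in Module 6, the sketch below shows one commonly used masking strategy, tube masking, in which the same spatial patches are hidden in every frame so that a masked patch cannot be recovered trivially from a neighboring frame. The function name, the default mask ratio, and the list-of-lists representation are illustrative assumptions rather than a reference implementation.

```python
import random


def tube_mask(num_frames, num_patches, mask_ratio=0.9, seed=None):
    """Return a per-frame boolean mask; True marks a patch hidden from the encoder."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Choose the hidden spatial positions once, then reuse them for every frame.
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    return [list(frame_mask) for _ in range(num_frames)]


# Eight temporal steps of 196 patches each, with 90% of the patches hidden.
mask = tube_mask(8, 196, mask_ratio=0.9, seed=0)
```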
Module 7: Multimodal Video Transformers
- Video-Text Transformers
- Audio-Visual Video Models
- Cross-Modal Attention
- Shared Multimodal Representations
Module 8: Downstream Tasks and Applications
- Action Recognition
- Video Classification
- Video Retrieval
- Event Detection
Module 9: Training, Scaling, and Optimization
- Large-Scale Video Datasets
- Distributed Training for Video Transformers
- Memory and Compute Optimization
- Stability and Regularization
Module 10: Evaluation and Research Frontiers
- Video Benchmarks and Metrics
- Generalization Across Domains
- Foundation Models for Video
- Open Research Challenges