Overview
This course presents Apache Flink SQL applied directly to data engineering work. Throughout the training, you will learn to build robust pipelines, ingest data at high scale, create advanced transformations, apply windows, perform complex joins, and integrate Flink SQL with systems such as Kafka, databases, and data lakes. The focus is on preparing you to design, implement, and operate modern streaming-based data architectures.
Course Outline
Module 1 – Introduction to Flink SQL for Data Engineering
- Role of Flink SQL in modern data engineering
- Streaming vs batch data engineering
- Architectural components and patterns
Module 2 – Data Ingestion with Flink SQL
- Kafka source DDLs
- Ingesting from files, object storage and JDBC connectors
- Schema definitions and serialization formats (JSON, Avro, Debezium, Protobuf)
- Creating multi-source pipelines
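As a taste of what this module covers, a Kafka source can be declared with a standard Flink SQL DDL. This is a minimal sketch; the table name, topic, broker address, and columns are illustrative, not part of the course materials:

```sql
-- Hypothetical Kafka-backed source table for an order stream.
CREATE TABLE orders (
  order_id   STRING,
  currency   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',                            -- Flink Kafka connector
  'topic' = 'orders',                               -- illustrative topic name
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'                                 -- could also be avro, debezium-json, protobuf
);
```

Swapping the `'format'` option is how the serialization formats listed above (JSON, Avro, Debezium, Protobuf) plug into the same DDL.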
Module 3 – Time Semantics and Watermarks
- Event time vs ingestion time vs processing time
- Custom watermark strategies
- Handling late, early and out-of-order data
- Practical patterns for real-world pipelines
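Watermark strategies from this module are declared directly in the table DDL. A minimal sketch, assuming a hypothetical `events` table and a 5-second out-of-orderness bound:

```sql
-- Event-time column plus a bounded-out-of-orderness watermark (illustrative values).
CREATE TABLE events (
  user_id    STRING,
  event_time TIMESTAMP(3),
  -- Tolerate records arriving up to 5 seconds late before the window closes.
  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'events',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);
```

Records arriving later than the watermark allows are dropped by default, which is where the late-data handling patterns in this module come in.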
Module 4 – Transformations for Data Engineering
- Filtering, projections and enrichment
- Computed fields and schema evolution
- Handling changelogs and upsert semantics
- Data normalization and denormalization with SQL
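Filtering, projection, and enrichment often combine into a single continuous `INSERT INTO ... SELECT`. A sketch under assumed table and column names (none of them from the course itself):

```sql
-- Continuous transformation: filter invalid rows, normalize a field,
-- and derive an enriched column in one pass.
INSERT INTO clean_orders
SELECT
  order_id,
  UPPER(country)     AS country,      -- normalization
  amount,
  amount * fx_rate   AS amount_usd    -- enrichment via a computed field
FROM raw_orders
WHERE amount > 0;                     -- filtering
```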
Module 5 – Windows and Aggregations
- Tumbling, hopping and session windows
- Hierarchical aggregations (multi-hop aggregations)
- Real-time metrics and time-series processing
- Materializing incremental results
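The window types above are expressed with windowing table-valued functions. A minimal tumbling-window sketch, assuming an `orders` table that already declares a watermark on `order_time`:

```sql
-- One-minute tumbling windows over the order stream (illustrative names).
SELECT
  window_start,
  window_end,
  COUNT(*)    AS order_cnt,
  SUM(amount) AS revenue
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;
```

Replacing `TUMBLE` with `HOP` or `SESSION` (with their respective parameters) yields the other window types covered in this module.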
Module 6 – Stream and Table Joins
- Stream–stream joins at scale
- Temporal table joins
- Lookup joins with external systems
- Joins for dimension modeling and enrichment
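A temporal table join enriches each stream record with the dimension values that were valid at that record's event time. A sketch assuming a `currency_rates` versioned table (primary key on `currency`, event-time attribute) and the hypothetical `orders` stream:

```sql
-- Join each order against the exchange rate in effect at order time.
SELECT
  o.order_id,
  o.amount * r.rate AS amount_usd
FROM orders AS o
JOIN currency_rates FOR SYSTEM_TIME AS OF o.order_time AS r
  ON o.currency = r.currency;
```

The `FOR SYSTEM_TIME AS OF` clause is what distinguishes this from a regular join: late-updating dimension rows do not retroactively change already-emitted results.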
Module 7 – Building Multi-Layer Pipelines (Bronze, Silver, Gold)
- Designing staged pipelines with Flink SQL
- Data cleansing and formatting
- Deduplication techniques
- Creating analytics-ready tables
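Deduplication in the silver layer is commonly done with `ROW_NUMBER()` over a partition key, keeping only the latest row. A sketch with hypothetical bronze-layer table and column names:

```sql
-- Keep only the most recent record per order_id.
SELECT order_id, amount, order_time
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY order_time DESC) AS rn
  FROM bronze_orders)
WHERE rn = 1;
```

Flink's planner recognizes this pattern and translates it into an efficient stateful deduplication operator rather than a full windowed rank.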
Module 8 – Output to Data Warehouses and OLAP Systems
- Writing to Kafka sinks
- Delivering results to OLAP systems (ClickHouse, Pinot, Druid)
- Exporting to data lakes (Hudi, Iceberg, Delta)
- Multi-sink pipeline strategies
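Sinks are declared with the same DDL style as sources. A sketch of an upsert-Kafka sink for a keyed, continuously updated aggregate (topic and schema are illustrative assumptions):

```sql
-- Upsert sink: downstream consumers see the latest value per key.
CREATE TABLE daily_revenue (
  day     DATE,
  revenue DECIMAL(12, 2),
  PRIMARY KEY (day) NOT ENFORCED       -- key used for upserts
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'daily-revenue',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```

Pointing the same `INSERT INTO` query at an Iceberg, Hudi, or Delta table instead only changes the `WITH` options, which is the basis of the multi-sink strategies discussed here.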
Module 9 – Monitoring, Debugging and Optimization
- Using EXPLAIN and analyzing query plans
- State tuning and memory optimization
- Avoiding backpressure and performance bottlenecks
- Techniques for high-throughput pipelines
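Query-plan analysis starts with `EXPLAIN`. A minimal sketch over an assumed `events` table:

```sql
-- Show the abstract syntax tree, optimized logical plan,
-- and physical execution plan for an aggregation query.
EXPLAIN PLAN FOR
SELECT user_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id;
```

Reading the physical plan reveals which operators hold state (here, the group aggregation), which is the starting point for the state-tuning and backpressure work in this module.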
Module 10 – Real-World Data Engineering Use Cases
- Real-time ETL pipeline
- Streaming CDC from operational databases
- IoT ingestion and normalization
- Real-time transformations for BI dashboards
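For the CDC use case, a Debezium-formatted Kafka topic can be read as a changelog table. A sketch with assumed topic and schema (the topic naming follows Debezium's typical `server.schema.table` convention, but is illustrative here):

```sql
-- Interpret Debezium change events as an updating table.
CREATE TABLE customers (
  id   INT,
  name STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'topic' = 'dbserver.inventory.customers',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'debezium-json'           -- inserts, updates, and deletes become changelog rows
);
```

Queries over this table automatically consume the insert/update/delete semantics, so downstream aggregations stay consistent with the operational database.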
Module 11 – Best Practices for Production
- Governance and schema evolution
- Versioning and catalog management
- Error handling and recovery strategies
- Observability and lineage considerations