Overview
The Spark SQL course provides a thorough understanding of structured data processing in Apache Spark. Participants learn to use Spark's SQL API and the DataFrame API to query, transform, and optimize large volumes of data in a distributed, efficient manner. The course combines theory and practice, enabling students to build robust, high-performance analytical applications.
Course Content
Module 1: Introduction to Spark SQL
- What is Spark SQL
- Evolution from RDDs to DataFrames and Datasets
- The Spark SQL Architecture
- SparkSession and the legacy SQLContext
Module 2: Working with DataFrames and Datasets
- Creating DataFrames from different sources
- Schema inference and manual schema definition
- DataFrame operations and transformations
- Converting between DataFrames and Datasets
Module 3: Querying Data Using Spark SQL
- Executing SQL queries with SparkSession
- Registering DataFrames as temporary views
- Running SQL queries on structured data
- Working with user-defined functions (UDFs)
Module 4: Data Sources and Formats
- Reading and writing data in different formats: CSV, JSON, Parquet, ORC
- Working with external databases using JDBC
- Integrating with Hive Metastore
- Partitioning and Bucketing
Module 5: Performance Optimization
- Understanding the Catalyst Optimizer
- Query execution plans and physical planning
- Tungsten engine and memory management
- Techniques for caching and persistence
Module 6: Aggregations and Window Functions
- Grouping and aggregating data
- Advanced SQL functions in Spark
- Window functions for analytical queries
Module 7: Integration and Real-World Scenarios
- Using Spark SQL in ETL pipelines
- Combining Spark SQL with Spark Streaming
- Working with Delta Lake and Structured Streaming
- Best practices and performance tuning
Module 8: Hands-on Labs
- DataFrame creation and transformations
- Executing and optimizing SQL queries
- Integrating Spark SQL with data lakes
- Practical exercises on query performance