Overview
The Apache Spark Fundamentals course provides a practical, comprehensive introduction to the Apache Spark ecosystem, from the fundamentals of distributed processing to the use of its main APIs in Big Data environments. Participants will learn to design, implement, and run Spark applications that process data at scale, with a focus on performance, scalability, and integration with other technologies in the ecosystem.
Course Outline
Module 1: Introduction to Apache Spark & Big Data
- Evolution of Big Data and limitations of traditional systems
- Spark overview and ecosystem (Spark Core, SQL, Streaming, MLlib, GraphX)
- Spark architecture: Driver, Executors, and cluster managers (Standalone, YARN, Mesos, Kubernetes)
- Understanding RDDs, lazy evaluation, and the execution model (see the sketch after this list)
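
To make the lazy-evaluation model concrete, here is a minimal PySpark sketch: transformations only build a lineage graph, and nothing executes until an action runs. It assumes a local Spark installation; all names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (filter, map) are lazy: they only record lineage.
numbers = sc.parallelize(range(1, 1_000_001))
evens = numbers.filter(lambda n: n % 2 == 0)
squared = evens.map(lambda n: n * n)

# Only an action (count, collect, ...) triggers a job on the executors.
print(squared.count())  # 500000

spark.stop()
```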
Module 2: Getting Started with Spark Environment Setup
- Installing Spark locally and running in cluster mode
- Spark shells: spark-shell (Scala) and pyspark (Python), plus spark-submit for packaged applications
- Configuring Spark properties (SparkConf, spark-defaults.conf) and resources (example after this list)
- Understanding Spark UI, logs and job lifecycle
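
As an illustration of how properties flow into an application, the sketch below sets a few common options programmatically; the same keys could equally go into conf/spark-defaults.conf or be passed to spark-submit with --conf. The values shown are placeholders, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("setup-demo")
    .master("local[4]")  # run locally with 4 worker threads
    .config("spark.sql.shuffle.partitions", "8")
    .config("spark.executor.memory", "2g")  # illustrative; not applied in local mode
    .getOrCreate()
)

# While the application runs, the Spark UI (jobs, stages, storage)
# is served by the driver, by default at http://localhost:4040.
print(spark.sparkContext.uiWebUrl)

spark.stop()
```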
Module 3: Core Spark APIs – RDDs, DataFrames and Datasets
- Working with RDDs: transformations, actions, persistence
- DataFrame API: schema, columns, SQL queries
- Dataset API: strongly typed in Scala/Java
- Converting between RDD, DataFrame, and Dataset, and optimization strategies (see the example after this list)
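
A short sketch of moving between the core APIs, assuming nothing beyond a local SparkSession; the sample rows are made up. Note that the strongly typed Dataset API exists only in Scala/Java, so this Python sketch covers RDDs and DataFrames.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("api-demo").getOrCreate()

# RDD of Row objects -> DataFrame (schema is inferred from the Rows).
rdd = spark.sparkContext.parallelize(
    [Row(name="ana", age=34), Row(name="bruno", age=28)]
)
df = spark.createDataFrame(rdd)

# DataFrame API: column expressions instead of opaque lambdas,
# which lets the Catalyst optimizer rewrite the plan.
df.filter(df.age > 30).select("name").show()

# DataFrame -> RDD when low-level, per-record control is needed.
print(df.rdd.map(lambda row: row.name).collect())

spark.stop()
```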
Module 4: Data Processing with Spark SQL
- Using Spark SQL for structured data processing
- Creating tables, views, and executing SQL queries
- Integrating with external data sources (Parquet, JSON, CSV, JDBC)
- Performance tuning: caching, partitioning, broadcast joins, and the Catalyst engine (example after this list)
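
The sketch below ties these topics together: reading external sources, registering a view for SQL, and forcing a broadcast join. The file paths and column names (country_id, total) are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical input paths.
orders = spark.read.parquet("/data/orders.parquet")
countries = spark.read.json("/data/countries.json")

# Register a view and query it with plain SQL.
orders.createOrReplaceTempView("orders")
spark.sql(
    "SELECT country_id, SUM(total) AS revenue FROM orders GROUP BY country_id"
).show()

# Broadcast the small dimension table to avoid shuffling the large one.
enriched = orders.join(broadcast(countries), "country_id")
enriched.explain()  # the plan should show a BroadcastHashJoin

spark.stop()
```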
Module 5: Advanced Data Processing & Performance Optimization
- Partitioning strategies and data skew mitigation (salting sketch after this list)
- Understanding Tungsten execution and Catalyst optimizer
- Caching/persistence and memory management
- Monitoring tasks, stages and shuffle operations
- Backpressure and job performance tuning
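
As an example of skew mitigation, the sketch below salts a synthetic hot key into buckets, aggregates in two phases, and caches the reused result. The skew pattern and bucket count are invented for illustration.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Synthetic skewed data: ~90% of rows share one hot key.
events = spark.range(0, 1_000_000).withColumn(
    "key",
    F.when(F.col("id") % 100 < 90, F.lit("hot")).otherwise(F.col("id").cast("string")),
)

# Salting: spread the hot key across 8 buckets, then aggregate twice.
salted = events.withColumn("salt", (F.rand() * 8).cast("int"))
partial = salted.groupBy("key", "salt").count()
result = partial.groupBy("key").agg(F.sum("count").alias("count"))

# Cache a reused result; spill to disk if it does not fit in memory.
result.persist(StorageLevel.MEMORY_AND_DISK)
result.orderBy(F.desc("count")).show(5)

spark.stop()
```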
Module 6: Integration & Ecosystem Connectivity
- Reading/writing from/to HDFS, S3, NoSQL (Cassandra, HBase)
- Integration with messaging systems: Kafka, Kinesis (streaming example after this list)
- Using Spark with other big data tools (Hive, Delta Lake)
- Handling unstructured and semi-structured data
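
A sketch of the Kafka integration using Structured Streaming. It assumes the spark-sql-kafka-0-10 connector is on the classpath; the broker address, topic name, and output paths are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
)

# Kafka records arrive as binary key/value pairs; decode before use.
decoded = stream.select(F.col("value").cast("string").alias("payload"))

query = (
    decoded.writeStream.format("parquet")
    .option("path", "/data/out")                        # hypothetical paths
    .option("checkpointLocation", "/data/checkpoints")
    .start()
)
query.awaitTermination()
```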
Module 7: Monitoring, Debugging and Production Readiness
- Spark UI and Spark History Server
- Logging, metrics, and integration with Prometheus/Grafana (configuration sketch after this list)
- Debugging common errors and failures (out-of-memory errors, stage failures)
- Deploying Spark applications: cluster management, resource isolation, cost optimization
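
A configuration sketch for the monitoring topics above: enabling event logs for the History Server and, on Spark 3.0+, the built-in Prometheus endpoint. The log directory is a placeholder.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("prod-demo")
    # Expose executor metrics in Prometheus format on the UI port (Spark 3.0+).
    .config("spark.ui.prometheus.enabled", "true")
    # Keep event logs so the History Server can replay finished applications.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")  # placeholder directory
    .getOrCreate()
)

spark.range(10).count()  # a trivial job so the logs contain something
spark.stop()

# To deploy the same application to a cluster, one would typically run:
#   spark-submit --master yarn --deploy-mode cluster app.py
```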
Module 8: Hands-On Project – Real World Data Pipeline
- Designing a data ingestion pipeline: ingest raw data, process it, store results (end-to-end sketch after this list)
- Building Spark job using DataFrame/Dataset API
- Applying optimizations, caching, and performance tuning
- Deploying job to cluster, monitoring execution, validating results
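
A compact end-to-end sketch of the kind of pipeline built in this module: ingest raw CSV, clean and aggregate, write partitioned Parquet, and validate the output. All paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Ingest: read raw CSV files (hypothetical path and schema).
raw = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("/data/raw/sales/*.csv")
)

# Process: basic cleansing plus a daily aggregation.
daily = (
    raw.dropna(subset=["order_id"])
    .withColumn("day", F.to_date("order_ts"))
    .groupBy("day", "store_id")
    .agg(F.sum("total").alias("revenue"))
)

# Store: partition the output by day for efficient downstream reads.
daily.write.mode("overwrite").partitionBy("day").parquet("/data/curated/daily_revenue")

# Validate: re-read the output and spot-check a few rows.
spark.read.parquet("/data/curated/daily_revenue").orderBy("day").show(5)

spark.stop()
```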