Overview
This advanced Apache Spark course takes a deep look at the framework's internals, focusing on performance, scalability, integration with Big Data ecosystems, and the development of high-performance Spark applications. The training covers everything from a review of fundamental concepts to advanced tuning, Spark SQL, Spark Streaming, Machine Learning, and integrations with platforms such as Kafka, Cassandra, and Elasticsearch, preparing students to design, operate, and optimize Spark environments in production.
Course Content
Module 1: Review of Core Apache Spark Concepts
- How Spark Works
- RDD Fundamentals
- Spark SQL and DataFrames
- Spark Streaming Concepts
- Machine Learning Basics
Module 2: Understanding Spark Internals for Performance
- Schedulers, Jobs, and Tasks
- Data Structures, Sets and Data Lakes
- Shuffle and Performance
- Understanding Data Sources and Partitions
- Reads, Writes, and Performance
Module 3: New Features of Spark 2
- API Stability
- Core and Spark SQL Changes
- Changes to Packaging and Operations
Module 4: Working with Spark
- Debugging and Troubleshooting Spark Applications
- Developing Data Workflows
- Automated Spark Builds Using Maven
Module 5: Clustering with Spark
- Running a Spark Cluster
- Understanding Cluster Resource Requirements
- Managing Memory on Executors and Workers
- Managing Memory and Cores Across a Spark Cluster
Module 6: Performance Tuning and Best Practices
- Performance Tuning Concepts
- Clarifying Best Practices
- Performance Tuning Process
- Performance Tuning Metrics
- SQL Performance Tuning
- High Performance Caching Strategies
Module 7: Spark Integration
- Implementing Spark on DataStax, Hortonworks and Similar Platforms
- Integrating with Cassandra
- Integrating with Kafka
- Integrating with Elasticsearch
- Integrating with Other Compatible NoSQL Implementations
Module 8: Machine Learning with Spark
- Common Machine Learning Algorithms
- Commonly Used Algorithms with Scala
- Machine Learning Libraries: MLlib and H2O
- Creating Custom Algorithms
Module 9: Advanced Spark SQL and Spark Streaming
- Leveraging the Spark 2 API (SparkSession and Related Components)
- Developing with Spark DataFrames
- Writing Solid Spark Jobs
Module 10: High Performance and Architecture Decisions
- Understanding When to Use Spark and When Not to Use Spark
- High Performance Spark Applications
- Cluster Resource Requirements Review
- Creating Fault-Tolerant Spark Applications