Overview
This Spark Optimization training course is designed to cover advanced Spark topics for tuning applications.
The Spark Optimization course begins with a review of Spark, including its architecture, terminology, and the use of Hadoop with Spark. From there, students learn about the Spark execution environment and YARN; how to work with the right data format; and how to handle Spark partitions. The course ends by exploring Spark's physical execution, the Spark Core API, caching and checkpointing, joins, and optimization.
The Spark Optimization course is offered in the Python and Scala programming languages.
Objective
After this Spark Optimization course, you will be able to:
- Integrate Spark with YARN
- Handle binary data formats
- Identify Spark internals
- Optimize Spark Core and Spark SQL code
- Discuss best practices for writing Spark Core and Spark SQL code
Materials
Portuguese/English + Exercises + Hands-on Lab
Course Content
Spark Overview
- Logical Architecture
- Physical Architecture of Spark
- Common Concepts and Terms in Spark
- Ways to build applications on Spark
- Spark with Hadoop
Understanding Spark Execution Environment – YARN
- About YARN
- Why YARN
- Architecture of YARN
- YARN UI and Commands
- Internals of YARN
- Experience execution of Spark application on YARN
- Troubleshooting and Debugging Spark applications on YARN
- Optimizing Application Performance
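As a reference point for this module, a typical submission to YARN looks like the fragment below. All values are illustrative examples, not recommendations, and `my_app.py` is a hypothetical application script:

```shell
# Illustrative spark-submit invocation for YARN; all values are examples.
# Use --deploy-mode client instead to keep the driver on the submitting machine.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_app.py
```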
Working with Right Data Format
- Why Data Formats are important for optimization
- Key Data Formats
- Comparisons – which one to choose when?
- Working with Avro
- Working with Parquet
- Working with ORC
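The row-versus-columnar distinction behind these formats can be sketched in plain Python (no Parquet/ORC libraries involved); it is a toy illustration of why columnar formats such as Parquet and ORC speed up analytical scans, with made-up sample records:

```python
# Toy illustration: the same records stored row-wise versus column-wise.
# Columnar layout lets a query that needs only one field skip the rest
# of each record.

rows = [
    {"id": 1, "name": "ana",  "amount": 10.0},
    {"id": 2, "name": "bia",  "amount": 25.5},
    {"id": 3, "name": "caio", "amount": 7.25},
]

# Row-oriented (Avro-like): to sum `amount` we must visit every field
# of every record.
row_sum = sum(r["amount"] for r in rows)

# Column-oriented (Parquet/ORC-like): each column is stored contiguously,
# so the scan touches only the `amount` values.
columns = {
    "id": [1, 2, 3],
    "name": ["ana", "bia", "caio"],
    "amount": [10.0, 25.5, 7.25],
}
col_sum = sum(columns["amount"])

assert row_sum == col_sum == 42.75
```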
Dealing with Spark Partitions
- How Spark determines the number of Partitions
- Things to keep in mind while determining Partitions
- Small Partitions Problem
- Diagnosing & Handling Post Filtering Issues (Skewness)
- Repartition vs Coalesce
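The repartition-versus-coalesce distinction can be modeled in plain Python (this is not the Spark API, just a sketch of the movement pattern): `coalesce()` merges whole partitions without a shuffle, while `repartition()` redistributes every record by hash, which is a full shuffle:

```python
# Toy model of the key difference between coalesce() and repartition().

def coalesce(partitions, n):
    """Merge whole partitions into n buckets; no per-record movement."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(partitions, n):
    """Full shuffle: every record is reassigned by hash."""
    out = [[] for _ in range(n)]
    for part in partitions:
        for rec in part:
            out[hash(rec) % n].append(rec)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]
merged = coalesce(parts, 2)      # whole partitions glued together
shuffled = repartition(parts, 2) # records redistributed one by one

# Both preserve all records; only the movement pattern differs.
assert sorted(sum(merged, [])) == sorted(sum(shuffled, [])) == [1, 2, 3, 4, 5, 6, 7]
```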
Spark Physical Execution
- Spark Core Plan
- Modes of Execution
- YARN Client vs YARN Cluster
- Standalone Mode
- Physical Execution on Cluster
- Narrow vs Wide Dependency
- Spark UI
- Executor Memory Architecture
- Key Properties
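For orientation, a few of the key properties covered here are shown in the spark-defaults.conf-style fragment below. The values are examples only, not tuning advice:

```
# Illustrative spark-defaults.conf fragment; values are examples.

# JVM heap per executor
spark.executor.memory           4g

# Off-heap overhead added to the YARN container on top of the heap
spark.executor.memoryOverhead   512m

# Share of heap used for execution + storage (unified memory)
spark.memory.fraction           0.6

# Portion of that region protected for cached (storage) blocks
spark.memory.storageFraction    0.5
```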
Effective Development Using Spark Core API
- Use of groupByKey and reduceByKey
- Using the right datatype in RDD
- How to ensure memory is utilized effectively?
- Performing Data Validation in an optimal manner
- Use of mapPartitions
- Partitioning Strategies
- Hash Partitioner
- Use of Range Partitioner
- Writing and plugging custom partitioner
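The groupByKey-versus-reduceByKey point above can be sketched in plain Python (this is not the RDD API, just the counting argument): reduceByKey pre-aggregates values on the map side, so far fewer records cross the shuffle:

```python
# Toy sketch of map-side combining, with two made-up input partitions.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# groupByKey-style: every (key, value) pair is shuffled as-is.
shuffled_group = [kv for part in partitions for kv in part]

# reduceByKey-style: pre-aggregate within each partition (map-side
# combine), then shuffle only one record per key per partition.
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    shuffled_reduce.extend(local.items())

print(len(shuffled_group))   # 6 records cross the shuffle
print(len(shuffled_reduce))  # 4 records cross the shuffle
```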
Caching and Checkpointing
- When to Cache?
- How Caching helps
- Caching Strategies
- How Spark plans change when Caching is on
- Caching on Spark UI
- Role of Alluxio
- Checkpointing
- How Caching is different from Checkpointing
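The core idea of this module can be sketched in plain Python (a stand-in for Spark's `cache()`/`persist()`, not the real API): without caching, each action re-runs the whole lineage; with caching, the expensive computation happens once:

```python
# Toy sketch: count how many times the "lineage" is really executed.
computations = {"count": 0}

def expensive_lineage():
    """Stands in for re-reading and re-transforming the source data."""
    computations["count"] += 1
    return [x * x for x in range(5)]

# Uncached: two "actions" trigger two full recomputations.
a = expensive_lineage()
b = expensive_lineage()
assert computations["count"] == 2

# Cached: materialize once, then reuse for later actions.
cached = expensive_lineage()   # third (and last) real computation
c, d = cached, cached
assert computations["count"] == 3
```

Checkpointing differs in that the materialized data is written to reliable storage and the lineage is truncated, so recovery after a failure does not recompute at all.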
Joins
- Why optimizing joins is important
- Types of Joins
- Quick Recap of MapReduce MapSide Joins
- Broadcasting
- Bucketing
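The broadcasting item above can be sketched in plain Python (not Spark's `broadcast()`, just the underlying idea of a broadcast hash join, with made-up tables): the small table is shipped whole to every task as a hash map, so the large table is joined with no shuffle at all:

```python
# Toy broadcast hash join: probe a small in-memory map per large-side row.
small = {"BR": "Brazil", "US": "United States"}   # broadcast side

large = [          # streamed side: one partition's worth of rows
    ("BR", 100),
    ("US", 250),
    ("FR", 80),    # no match in the small table
]

# Inner join: look each large-side key up in the broadcast hash map.
joined = [(code, amount, small[code]) for code, amount in large if code in small]

assert joined == [("BR", 100, "Brazil"), ("US", 250, "United States")]
```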
Spark SQL Optimization
- DataFrames vs Datasets
- About Tungsten
- Data Partitioning
- Query Optimizer: Catalyst Optimizer
- Debugging Spark Queries
- Explain Plan
- Partitioning & Bucketing in Spark SQL
- Best Practices for writing Spark SQL code
- Spark SQL with Binary Data formats
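One rewrite the Catalyst optimizer applies automatically — pushing a filter below a join so fewer rows are joined — can be sketched in plain Python (toy data, not Spark SQL):

```python
# Toy sketch of predicate pushdown: same result, less work.
orders = [(1, "BR"), (2, "US"), (3, "BR"), (4, "FR")]   # (order_id, country)
names = {"BR": "Brazil", "US": "United States", "FR": "France"}

def join_then_filter():
    """Naive plan: join all 4 rows, then filter."""
    joined = [(oid, c, names[c]) for oid, c in orders]
    return [row for row in joined if row[1] == "BR"]

def filter_then_join():
    """Pushed-down plan: filter first, join only the 2 surviving rows."""
    kept = [(oid, c) for oid, c in orders if c == "BR"]
    return [(oid, c, names[c]) for oid, c in kept]

assert join_then_filter() == filter_then_join() == [
    (1, "BR", "Brazil"),
    (3, "BR", "Brazil"),
]
```

In Spark SQL you can see whether such rewrites happened by inspecting the plan with `explain()`.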