Overview
In the Cloudera Data Engineering: Developing Applications with Apache Spark course, you will learn the key concepts and skills developers need to use Apache Spark to build high-performance parallel applications on the Cloudera Data Platform (CDP).
Through a combination of hands-on labs and interactive lectures, you will learn to write Spark applications that integrate with core CDP components such as Hive and Kafka. You will also learn how to use Spark SQL to query structured data, how to use Spark Streaming to perform real-time processing on streaming data, and how to work with “big data” stored in a distributed file system.
Course Outline
Introduction to Zeppelin
- Why Notebooks?
- Zeppelin Notes
- Demo: Apache Spark In 5 Minutes
HDFS Introduction
- HDFS Overview
- HDFS Components and Interactions
- Additional HDFS Interactions
- Ozone Overview
- Exercise: Working with HDFS
YARN Introduction
- YARN Overview
- YARN Components and Interaction
- Working with YARN
- Exercise: Working with YARN
Distributed Processing History
- The Disk Years: 2000 -> 2010
- The Memory Years: 2010 -> 2020
- The GPU Years: 2020 ->
Working with RDDs
- Resilient Distributed Datasets (RDDs)
- Exercise: Working with RDDs
Working with DataFrames
- Introduction to DataFrames
Introduction to Apache Hive
- About Hive
Hive and Spark Integration
- Hive and Spark Integration
- Exercise: Spark Integration with Hive
Data Visualization with Zeppelin
- Introduction to Data Visualization with Zeppelin
- Zeppelin Analytics
- Zeppelin Collaboration
- Exercise: AdventureWorks
Distributed Processing Challenges
- Shuffle
- Skew
- Order
Spark Distributed Processing
- Spark Distributed Processing
- Exercise: Explore Query Execution Order
Spark Distributed Persistence
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
- Exercise: Persisting DataFrames
Writing, Configuring, and Running Spark Applications
- Writing a Spark Application
- Building and Running an Application
- Application Deployment Mode
- The Spark Application Web UI
- Configuring Application Properties
- Exercise: Writing, Configuring, and Running a Spark Application
Introduction to Structured Streaming
- Introduction to Structured Streaming
- Exercise: Processing Streaming Data
Message Processing with Apache Kafka
- What is Apache Kafka?
- Apache Kafka Overview
- Scaling Apache Kafka
- Apache Kafka Cluster Architecture
- Apache Kafka Command Line Tools
Structured Streaming with Apache Kafka
- Receiving Kafka Messages
- Sending Kafka Messages
- Exercise: Working with Kafka Streaming Messages
Aggregating and Joining Streaming DataFrames
- Streaming Aggregation
- Joining Streaming DataFrames
- Exercise: Aggregating and Joining Streaming DataFrames
Appendix: Working with Datasets in Scala
- Working with Datasets in Scala
- Exercise: Using Datasets in Scala