Overview
This Cloudera Developer for Spark & Hadoop course delivers the key concepts and expertise developers need to use Apache Spark to build high-performance parallel applications. Participants learn how to use Spark SQL to query structured data and Spark Streaming to perform real-time processing on streaming data from a variety of sources. Developers also practice writing applications that use core Spark to perform ETL processing and iterative algorithms. The Cloudera Developer for Spark & Hadoop course covers how to work with “big data” stored in a distributed file system and how to run Spark applications on a Hadoop cluster. After completing this course, participants will be prepared to tackle real-world challenges and build applications that enable faster, better decisions and interactive analysis, applied to a wide variety of use cases, architectures, and industries.
Course Outline
Introduction to Apache Hadoop and the Hadoop Ecosystem
- Apache Hadoop Overview
- Data Processing
- Introduction to the Hands-On Exercises
Apache Hadoop File Storage
- Apache Hadoop Cluster Components
- HDFS Architecture
- Using HDFS
Distributed Processing on an Apache Hadoop Cluster
- YARN Architecture
- Working With YARN
Apache Spark Basics
- What is Apache Spark?
- Starting the Spark Shell
- Using the Spark Shell
- Getting Started with Datasets and DataFrames
- DataFrame Operations
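To make this module concrete, here is a minimal sketch of the kind of DataFrame work done in the interactive spark-shell, where the SparkSession is already available as `spark`; the JSON path and column names are hypothetical examples.

```scala
// Minimal sketch for the spark-shell; the file path is a hypothetical example.
val peopleDF = spark.read.json("hdfs:///user/training/people.json")

peopleDF.printSchema()          // inspect the inferred schema
peopleDF.select("name", "age")  // DataFrame operations are lazy...
        .where("age > 21")
        .show(5)                // ...until an action such as show() runs
```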
Working with DataFrames and Schemas
- Creating DataFrames from Data Sources
- Saving DataFrames to Data Sources
- DataFrame Schemas
- Eager and Lazy Execution
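A short sketch of the ideas in this module: reading a data source with an explicit schema and saving the result in another format. The schema, file paths, and column names are illustrative assumptions, and `spark` is the shell's predefined SparkSession.

```scala
import org.apache.spark.sql.types._

// Hypothetical CSV layout for illustration.
val schema = StructType(Seq(
  StructField("id",   IntegerType, nullable = false),
  StructField("name", StringType,  nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// Supplying a schema avoids the extra pass needed for schema inference.
val usersDF = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("hdfs:///user/training/users.csv")

// Save to a different data source; Parquet stores the schema with the data.
usersDF.write.mode("overwrite").parquet("hdfs:///user/training/users_parquet")
```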
Analyzing Data with DataFrame Queries
- Querying DataFrames Using Column Expressions
- Grouping and Aggregation Queries
- Joining DataFrames
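A sketch of the query patterns covered here, assuming two hypothetical DataFrames, `accountsDF` and `ordersDF`, that share an `acct_id` column; it runs in the spark-shell, where Spark's implicits (including the `$` column syntax) are already imported.

```scala
import org.apache.spark.sql.functions._

// Hypothetical DataFrames: accounts(acct_id, state), orders(acct_id, total).
val byState = accountsDF
  .join(ordersDF, "acct_id")                  // inner join on a common column
  .groupBy("state")                           // grouping
  .agg(count("*").as("orders"),               // aggregation with column expressions
       round(avg($"total"), 2).as("avg_total"))
  .orderBy($"orders".desc)

byState.show()
```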
RDD Overview
- RDD Overview
- RDD Data Sources
- Creating and Saving RDDs
- RDD Operations
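A minimal RDD sketch for the spark-shell, where the SparkContext is available as `sc`; the HDFS paths are hypothetical.

```scala
// Create an RDD from text files in HDFS (hypothetical path).
val linesRDD = sc.textFile("hdfs:///user/training/weblogs/")

// Transformations are lazy; count() is the action that triggers execution.
val errorsRDD = linesRDD.filter(line => line.contains("ERROR"))
println(errorsRDD.count())

// Save the filtered RDD back to HDFS as text files.
errorsRDD.saveAsTextFile("hdfs:///user/training/weblog_errors/")
```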
Transforming Data with RDDs
- Writing and Passing Transformation Functions
- Transformation Execution
- Converting Between RDDs and DataFrames
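A sketch of passing a named transformation function and converting between RDDs and DataFrames, assuming the spark-shell and a hypothetical comma-separated accounts file.

```scala
// Case class and file layout are illustrative assumptions.
case class Account(id: Int, name: String)

// A named function can be passed to a transformation just like a lambda.
def parseLine(line: String): Account = {
  val fields = line.split(",")
  Account(fields(0).toInt, fields(1))
}

val accountsRDD = sc.textFile("hdfs:///user/training/accounts.csv").map(parseLine)

// Converting between RDDs and DataFrames (spark.implicits is preloaded in the shell).
val accountsDF = accountsRDD.toDF()
val backToRDD  = accountsDF.rdd
```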
Aggregating Data with Pair RDDs
- Key-Value Pair RDDs
- Map-Reduce
- Other Pair RDD Operations
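The classic word-count job illustrates the map-reduce pattern with a pair RDD; the input path is a hypothetical example for the spark-shell.

```scala
// Word count with a key-value pair RDD (hypothetical input path).
val counts = sc.textFile("hdfs:///user/training/shakespeare/")
  .flatMap(line => line.split("\\W+"))     // split lines into words
  .filter(_.nonEmpty)
  .map(word => (word.toLowerCase, 1))      // build key-value pairs
  .reduceByKey(_ + _)                      // aggregate values per key

counts.take(10).foreach(println)
```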
Querying Tables and Views with SQL
- Querying Tables in Spark Using SQL
- Querying Files and Views
- The Catalog API
- Comparing Spark SQL, Apache Impala, and Apache Hive-on-Spark
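A small sketch of querying a temporary view with SQL and listing metadata through the Catalog API; the Parquet path and view name are assumptions, and `spark` is the shell's SparkSession.

```scala
val peopleDF = spark.read.parquet("hdfs:///user/training/people_parquet")

// Register a temporary view so it can be queried with SQL.
peopleDF.createOrReplaceTempView("people")

val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

// The Catalog API exposes metadata about tables and views.
spark.catalog.listTables().show()
```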
Working with Datasets in Scala
- Datasets and DataFrames
- Creating Datasets
- Loading and Saving Datasets
- Dataset Operations
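A sketch of typed Dataset work in Scala, assuming the spark-shell (implicits already imported); the case class, sample values, and JSON path are illustrative.

```scala
case class Name(firstName: String, lastName: String)

// Create a Dataset from a local collection.
val namesDS = Seq(Name("Grace", "Hopper"), Name("Alan", "Turing")).toDS()

// Dataset operations are typed: the lambda receives a Name, not a Row.
namesDS.filter(n => n.lastName.startsWith("H")).show()

// A DataFrame can be converted to a Dataset once a case class describes its rows.
val loadedDS = spark.read.json("hdfs:///user/training/names.json").as[Name]
```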
Writing, Configuring, and Running Spark Applications
- Writing a Spark Application
- Building and Running an Application
- Application Deployment Mode
- The Spark Application Web UI
- Configuring Application Properties
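This module moves from the interactive shell to standalone applications; a minimal application skeleton might look like the sketch below. The object name and argument handling are assumptions, and the build and spark-submit steps are only summarized in the comments.

```scala
// Minimal standalone application sketch. After packaging (e.g. with sbt or Maven),
// it would typically be launched with spark-submit, where options such as
// --deploy-mode and --conf control the deployment mode and application properties.
import org.apache.spark.sql.SparkSession

object NameCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("NameCount")              // appears in the Spark Application Web UI
      .getOrCreate()

    val df = spark.read.json(args(0))    // input path passed on the command line
    println(s"Rows: ${df.count()}")

    spark.stop()
  }
}
```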
Spark Distributed Processing
- Review: Apache Spark on a Cluster
- RDD Partitions
- Example: Partitioning in Queries
- Stages and Tasks
- Job Execution Planning
- Example: Catalyst Execution Plan
- Example: RDD Execution Plan
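One quick way to observe the ideas in this module is to ask Spark for its execution plan and partition count from the spark-shell; the Parquet path is hypothetical.

```scala
val df = spark.read.parquet("hdfs:///user/training/accounts_parquet")

// Catalyst execution plan: logical and physical plans, including shuffles (Exchange).
df.groupBy("state").count().explain(true)

// Number of partitions of the underlying RDD, and hence of tasks per stage.
println(df.rdd.getNumPartitions)
```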
Distributed Data Persistence
- DataFrame and Dataset Persistence
- Persistence Storage Levels
- Viewing Persisted RDDs
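A sketch of explicit persistence with a chosen storage level, assuming a hypothetical DataFrame that several later queries reuse.

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical DataFrame that will be reused by several queries.
val activeDF = spark.read.parquet("hdfs:///user/training/accounts_parquet")
  .where("active = true")

// cache() uses the default storage level; persist() lets you choose one explicitly.
activeDF.persist(StorageLevel.MEMORY_AND_DISK)

activeDF.count()   // the first action materializes and persists the data
activeDF.show(5)   // later actions read the persisted copy

// Persisted data can be inspected on the Storage tab of the Spark Web UI.
activeDF.unpersist()
```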
Common Patterns in Spark Data Processing
- Common Apache Spark Use Cases
- Iterative Algorithms in Apache Spark
- Machine Learning
- Example: k-means
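As an example of an iterative algorithm, here is a hedged sketch of k-means clustering with Spark MLlib; the input data set and its feature columns (`lat`, `lon`) are assumptions.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical input with numeric columns "lat" and "lon".
val points = spark.read.parquet("hdfs:///user/training/devicestatus_parquet")

// MLlib estimators expect a single vector column of features.
val features = new VectorAssembler()
  .setInputCols(Array("lat", "lon"))
  .setOutputCol("features")
  .transform(points)

// Fit k-means, an iterative algorithm that benefits from caching its input.
val model = new KMeans().setK(5).setSeed(42L).fit(features.cache())
model.clusterCenters.foreach(println)
```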
Introduction to Structured Streaming
- Apache Spark Streaming Overview
- Creating Streaming DataFrames
- Transforming DataFrames
- Executing Streaming Queries
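A minimal Structured Streaming sketch using the socket source from the spark-shell; the host, port, and console sink are illustrative choices.

```scala
val linesDF = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Streaming DataFrames support most of the usual transformations.
val shoutedDF = linesDF.selectExpr("upper(value) AS value")

// Executing the query starts continuous processing; it runs until stopped.
val query = shoutedDF.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()
```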
Structured Streaming with Apache Kafka
- Overview
- Receiving Kafka Messages
- Sending Kafka Messages
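A sketch of receiving from and sending to Kafka with Structured Streaming; it assumes the spark-sql-kafka integration package is on the classpath, and the broker, topic, and checkpoint names are hypothetical.

```scala
// Receive messages from a Kafka topic as a streaming DataFrame.
val ordersDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "orders")
  .load()

// Kafka records arrive as binary key/value columns.
val valuesDF = ordersDF.selectExpr("CAST(value AS STRING) AS value")

// Send the results to another Kafka topic; a checkpoint location is required.
val query = valuesDF.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "orders_out")
  .option("checkpointLocation", "hdfs:///user/training/checkpoints/orders")
  .start()
```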
Aggregating and Joining Streaming DataFrames
- Streaming Aggregation
- Joining Streaming DataFrames
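A sketch combining a stream-static join with a windowed, watermarked aggregation, run in the spark-shell; `wordsStreamDF`, `categoriesDF`, and their columns are hypothetical.

```scala
import org.apache.spark.sql.functions._

// Hypothetical streaming DataFrame with columns `timestamp` and `word`,
// and a static DataFrame `categoriesDF` with columns `word` and `category`.
val withCategory = wordsStreamDF.join(categoriesDF, "word")   // stream-static join

// Windowed aggregation; the watermark bounds how late data may arrive.
val counts = withCategory
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"), $"category")
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```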
Conclusion
- Message Processing with Apache Kafka