PySpark Course
24 hours
Overview
The PySpark Course was designed to give developers and/or data analysts a practical, engaging introduction to the Python programming language and Apache PySpark. In this PySpark Course you will learn how to write Spark applications in Python.
Target Audience
- This PySpark Course is recommended for developers and/or data analysts
Prerequisites
- Programming and/or scripting experience in a language other than Python
Materials
English/Portuguese/Hands-on Lab
Course Outline
Introduction to Python
- Data Types and Variables
- Python Collections
- Control Statements and Looping
- Functions in Python
- Working With Data in Python
- Reading and Writing Text Files
- Functional Programming Primer
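To give a sense of the level of this introductory Python module, here is a minimal, self-contained sketch touching the topics listed above; the file name and values are illustrative only, not course materials.

```python
# Illustrative Python basics: types, collections, control flow, functions,
# file I/O, and a functional-programming primer.
from functools import reduce

# Data types, variables, and collections
price = 19.99
tags = ["spark", "python", "data"]      # list
counts = {"spark": 3, "python": 5}      # dict

# Control statements and looping
for tag in tags:
    if counts.get(tag, 0) > 2:
        print(f"{tag} appears often")

# Functions
def total(values):
    return sum(values)

print(total([1, 2, 3]))

# Reading and writing text files ("notes.txt" is a placeholder name)
with open("notes.txt", "w") as f:
    f.write("hello\n")
with open("notes.txt") as f:
    lines = f.readlines()

# Functional programming primer: lambdas, map, filter, reduce
squares = list(map(lambda x: x * x, range(5)))
evens = list(filter(lambda x: x % 2 == 0, range(10)))
product = reduce(lambda a, b: a * b, [1, 2, 3, 4])
```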
Introduction to Apache Spark
- What is Apache Spark?
- A Short History of Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark Streaming (Micro-batching)
- Spark SQL
- Example of Spark SQL
- Spark Machine Learning Library
- GraphX
- Spark vs R
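Much of the module above is conceptual, but the driver process, RDDs, and the spark-submit tool it introduces can be illustrated with a small word-count application. This is only a sketch, assuming a local Spark installation; the script name wordcount.py and the input path are placeholders.

```python
# wordcount.py - a minimal PySpark application.
# Run with:  spark-submit wordcount.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext                                # the driver's SparkContext

    lines = sc.textFile("input.txt")                       # creates an RDD (placeholder path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))       # pair-RDD transformation
    for word, n in counts.take(10):                        # an action triggers execution
        print(word, n)

    spark.stop()
```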
The Spark Shell
- The Spark Shell
- The Spark v.2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
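As a sketch of what an interactive pyspark shell session looks like: the spark session object and the sc context object are predefined in the shell, while the file names below are placeholders.

```python
# Inside the pyspark shell, `spark` (SparkSession) and `sc` (SparkContext) already exist.
df = spark.read.json("people.json")                     # loading a file
df.printSchema()
df.write.mode("overwrite").parquet("people_parquet")    # saving in another format

rdd = sc.textFile("people.json")                        # the same file as an RDD of lines
print(rdd.count())
```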
Spark RDDs
- The Resilient Distributed Dataset (RDD)
- Ways to Create an RDD
- Supported Data Types
- RDD Operations
- RDDs are Immutable
- Spark Actions
- RDD Transformations
- Other RDD Operations
- Chaining RDD Operations
- RDD Lineage
- The Big Picture
- What May Go Wrong
- Checkpointing RDDs
- Local Checkpointing
- Parallelized Collections
- More on parallelize() Method
- The Pair RDD
- Where do I use Pair RDDs?
- Example of Creating a Pair RDD with Map
- Example of Creating a Pair RDD with keyBy
- Miscellaneous Pair RDD Operations
- RDD Caching
- RDD Persistence
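The RDD operations listed above can be sketched in a few lines; this is an illustrative example assuming a local Spark installation, and the checkpoint directory is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Parallelized collection -> RDD
nums = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy; actions trigger computation
doubled = nums.map(lambda x: x * 2)         # transformation
print(doubled.reduce(lambda a, b: a + b))   # action

# Pair RDDs created with map and keyBy
words = sc.parallelize(["spark", "python", "scala", "sql"])
pairs_map = words.map(lambda w: (w[0], w))      # (first letter, word)
pairs_keyby = words.keyBy(lambda w: len(w))     # (length, word)
print(pairs_keyby.groupByKey().mapValues(list).collect())

# Caching and checkpointing
doubled.cache()
sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder directory
doubled.checkpoint()
doubled.count()

spark.stop()
```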
Parallel Data Processing with Spark
- Running Spark on a Cluster
- Data Partitioning
- Data Partitioning Diagram
- Single Local File System RDD Partitioning
- Multiple File RDD Partitioning
- Special Cases for Small-sized Files
- Parallel Data Processing of Partitions
- Spark Application, Jobs, and Tasks
- Stages and Shuffles
- The "Big Picture"
Shared Variables in Spark
- Shared Variables in Spark
- Broadcast Variables
- Creating and Using Broadcast Variables
- Example of Using Broadcast Variables
- Problems with Global Variables
- Example of the Closure Problem
- Accumulators
- Creating and Using Accumulators
- Example of Using Accumulators (Scala Example)
- Example of Using Accumulators (Python Example)
- Custom Accumulators
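A minimal Python sketch of the two kinds of shared variables covered in this module, assuming an existing SparkContext sc; the lookup data is invented for illustration.

```python
# Broadcast variable: a read-only copy shipped once to each executor
lookup = {"US": "United States", "BR": "Brazil"}
bc = sc.broadcast(lookup)

codes = sc.parallelize(["US", "BR", "US", "XX"])
names = codes.map(lambda c: bc.value.get(c, "unknown"))
print(names.collect())

# Accumulator: updated inside tasks, read back on the driver
missing = sc.accumulator(0)

def count_missing(code):
    if code not in bc.value:
        missing.add(1)

codes.foreach(count_missing)
print("codes without a match:", missing.value)
```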
Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Hive Integration
- Hive Interface
- Integration with BI Tools
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark
- Grouping and Aggregation in PySpark
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Examples of Spark SQL / DataFrame (PySpark Example)
- Converting an RDD to a DataFrame Example
- Example of Reading / Writing a JSON File
- Using JDBC Sources
- JDBC Connection Example
- Performance, Scalability, and Fault-tolerance of Spark SQL
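The DataFrame and Spark SQL topics above can be sketched as follows; the sample rows, column names, and output path are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Creating a DataFrame from a local collection
df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)

# Commonly used methods: select, filter, grouping and aggregation
df.select("name", "salary").filter(F.col("salary") > 3200).show()
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# Running plain SQL against the same data
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()

# The DataFrame-to-RDD bridge, and converting an RDD back to a DataFrame
rdd = df.rdd.map(lambda row: (row.dept, row.salary))
df2 = rdd.toDF(["dept", "salary"])

# Reading and writing a JSON file (placeholder path)
df.write.mode("overwrite").json("employees_json")
spark.read.json("employees_json").show()

spark.stop()
```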
Repairing and Normalizing Data
- Repairing and Normalizing Data
- Dealing with Missing Data
- Sample Data Set
- Getting Info on Null Data
- Dropping a Column
- Interpolating Missing Data in pandas
- Replacing the Missing Values with the Mean Value
- Scaling (Normalizing) the Data
- Data Preprocessing with scikit-learn
- Scaling with the scale() Function
- The MinMaxScaler Object
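A compact sketch of the data-repair and normalization steps above, using pandas and scikit-learn with an invented sample data set:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale, MinMaxScaler

# Sample data set with missing values (values are illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [50_000, 62_000, np.nan, 58_000],
    "notes": [None, None, None, None],
})

print(df.isnull().sum())                                  # getting info on null data
df = df.drop(columns=["notes"])                           # dropping a column
df["age"] = df["age"].interpolate()                       # interpolating missing data
df["income"] = df["income"].fillna(df["income"].mean())   # replacing with the mean value

# Scaling (normalizing) the data with scikit-learn
print(scale(df))                                          # zero mean, unit variance
print(MinMaxScaler().fit_transform(df))                   # values rescaled to [0, 1]
```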
Data Grouping and Aggregation in Python
- Data Aggregation and Grouping
- Sample Data Set
- The pandas.core.groupby.SeriesGroupBy Object
- Grouping by Two or More Columns
- Emulating SQL's WHERE Clause
- Pivot Tables
- Cross-Tabulation
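A short pandas sketch of the grouping and aggregation topics above, again with an invented sample data set:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "amount": [100, 150, 200, 120],
})

# The SeriesGroupBy object: grouping one column by another
print(sales.groupby("region")["amount"].sum())

# Grouping by two or more columns
print(sales.groupby(["region", "product"])["amount"].mean())

# Emulating SQL's WHERE clause with boolean indexing
print(sales[sales["amount"] > 120])

# Pivot tables and cross-tabulation
print(sales.pivot_table(values="amount", index="region", columns="product", aggfunc="sum"))
print(pd.crosstab(sales["region"], sales["product"]))
```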