PySpark Course
24 hours
Overview
The PySpark Course was designed to give developers and/or data analysts a practical, engaging introduction to the Python programming language and Apache PySpark. In this PySpark Course you will learn how to write Spark applications in Python.
Target Audience
- This PySpark Course is recommended for developers and/or data analysts
Prerequisites
- Programming and/or scripting experience in a language other than Python
Materials
English/Portuguese/Hands-on Lab
Course Outline
Introduction to Python
- Data Types and Variables
- Python Collections
- Control Statements and Looping
- Functions in Python
- Working With Data in Python
- Reading and Writing Text Files
- Functional Programming Primer
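To give a sense of the level of this introductory Python module, here is a minimal, self-contained sketch touching the topics listed above; the file name and values are illustrative only, not course materials.

```python
# Illustrative Python basics: types, collections, control flow, functions,
# file I/O, and a functional-programming primer.
from functools import reduce

# Data types, variables, and collections
price = 19.99
tags = ["spark", "python", "data"]      # list
counts = {"spark": 3, "python": 5}      # dict

# Control statements and looping
for tag in tags:
    if counts.get(tag, 0) > 2:
        print(f"{tag} appears often")

# Functions
def total(values):
    return sum(values)

print(total([1, 2, 3]))

# Reading and writing text files ("notes.txt" is a placeholder name)
with open("notes.txt", "w") as f:
    f.write("hello\n")
with open("notes.txt") as f:
    lines = f.readlines()

# Functional programming primer: lambdas, map, filter, reduce
squares = list(map(lambda x: x * x, range(5)))
evens = list(filter(lambda x: x % 2 == 0, range(10)))
product = reduce(lambda a, b: a * b, [1, 2, 3, 4])
```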
Introduction to Apache Spark
- What is Apache Spark?
- A Short History of Spark
- Where to Get Spark?
- The Spark Platform
- Spark Logo
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Driver Process
- Spark Applications
- Spark Shell
- The spark-submit Tool
- The spark-submit Tool Configuration
- The Executor and Worker Processes
- The Spark Application Architecture
- Interfaces with Data Storage Systems
- Limitations of Hadoop's MapReduce
- Spark vs MapReduce
- Spark as an Alternative to Apache Tez
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark Streaming (Micro-batching)
- Spark SQL
- Example of Spark SQL
- Spark Machine Learning Library
- GraphX
- Spark vs R
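Much of the module above is conceptual, but the driver process, RDDs, and the spark-submit tool it introduces can be illustrated with a small word-count application. This is only a sketch, assuming a local Spark installation; the script name wordcount.py and the input path are placeholders.

```python
# wordcount.py - a minimal PySpark application.
# Run with:  spark-submit wordcount.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    sc = spark.sparkContext                                # the driver's SparkContext

    lines = sc.textFile("input.txt")                       # creates an RDD (placeholder path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))       # pair-RDD transformation
    for word, n in counts.take(10):                        # an action triggers execution
        print(word, n)

    spark.stop()
```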
The Spark Shell
- The Spark Shell
- The Spark v.2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
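As a sketch of what an interactive pyspark shell session looks like: the spark session object and the sc context object are predefined in the shell, while the file names below are placeholders.

```python
# Inside the pyspark shell, `spark` (SparkSession) and `sc` (SparkContext) already exist.
df = spark.read.json("people.json")                     # loading a file
df.printSchema()
df.write.mode("overwrite").parquet("people_parquet")    # saving in another format

rdd = sc.textFile("people.json")                        # the same file as an RDD of lines
print(rdd.count())
```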
Spark RDDs
- The Resilient Distributed Dataset (RDD)
- Ways to Create an RDD
- Supported Data Types
- RDD Operations
- RDDs are Immutable
- Spark Actions
- RDD Transformations
- Other RDD Operations
- Chaining RDD Operations
- RDD Lineage
- The Big Picture
- What May Go Wrong
- Checkpointing RDDs
- Local Checkpointing
- Parallelized Collections
- More on parallelize() Method
- The Pair RDD
- Where do I use Pair RDDs?
- Example of Creating a Pair RDD with Map
- Example of Creating a Pair RDD with keyBy
- Miscellaneous Pair RDD Operations
- RDD Caching
- RDD Persistence
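The RDD operations listed above can be sketched in a few lines; this is an illustrative example assuming a local Spark installation, and the checkpoint directory is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").getOrCreate()
sc = spark.sparkContext

# Parallelized collection -> RDD
nums = sc.parallelize(range(1, 11), numSlices=4)

# Transformations are lazy; actions trigger computation
doubled = nums.map(lambda x: x * 2)         # transformation
print(doubled.reduce(lambda a, b: a + b))   # action

# Pair RDDs created with map and keyBy
words = sc.parallelize(["spark", "python", "scala", "sql"])
pairs_map = words.map(lambda w: (w[0], w))      # (first letter, word)
pairs_keyby = words.keyBy(lambda w: len(w))     # (length, word)
print(pairs_keyby.groupByKey().mapValues(list).collect())

# Caching and checkpointing
doubled.cache()
sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder directory
doubled.checkpoint()
doubled.count()

spark.stop()
```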
Parallel Data Processing with Spark
- Running Spark on a Cluster
- Data Partitioning
- Data Partitioning Diagram
- Single Local File System RDD Partitioning
- Multiple File RDD Partitioning
- Special Cases for Small-sized Files
- Parallel Data Processing of Partitions
- Spark Application, Jobs, and Tasks
- Stages and Shuffles
- The "Big Picture"
Shared Variables in Spark
- Shared Variables in Spark
- Broadcast Variables
- Creating and Using Broadcast Variables
- Example of Using Broadcast Variables
- Problems with Global Variables
- Example of the Closure Problem
- Accumulators
- Creating and Using Accumulators
- Example of Using Accumulators (Scala Example)
- Example of Using Accumulators (Python Example)
- Custom Accumulators
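A minimal Python sketch of the two kinds of shared variables covered in this module, assuming an existing SparkContext sc; the lookup data is invented for illustration.

```python
# Broadcast variable: a read-only copy shipped once to each executor
lookup = {"US": "United States", "BR": "Brazil"}
bc = sc.broadcast(lookup)

codes = sc.parallelize(["US", "BR", "US", "XX"])
names = codes.map(lambda c: bc.value.get(c, "unknown"))
print(names.collect())

# Accumulator: updated inside tasks, read back on the driver
missing = sc.accumulator(0)

def count_missing(code):
    if code not in bc.value:
        missing.add(1)

codes.foreach(count_missing)
print("codes without a match:", missing.value)
```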
Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Hive Integration
- Hive Interface
- Integration with BI Tools
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark
- Grouping and Aggregation in PySpark
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Examples of Spark SQL / DataFrame (PySpark Example)
- Converting an RDD to a DataFrame Example
- Example of Reading / Writing a JSON File
- Using JDBC Sources
- JDBC Connection Example
- Performance, Scalability, and Fault-tolerance of Spark SQL
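The DataFrame and Spark SQL topics above can be sketched as follows; the sample rows, column names, and output path are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

# Creating a DataFrame from a local collection
df = spark.createDataFrame(
    [("alice", "sales", 3000), ("bob", "sales", 4000), ("carol", "hr", 3500)],
    ["name", "dept", "salary"],
)

# Commonly used methods: select, filter, grouping and aggregation
df.select("name", "salary").filter(F.col("salary") > 3200).show()
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# Running plain SQL against the same data
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, COUNT(*) AS n FROM employees GROUP BY dept").show()

# The DataFrame-to-RDD bridge, and converting an RDD back to a DataFrame
rdd = df.rdd.map(lambda row: (row.dept, row.salary))
df2 = rdd.toDF(["dept", "salary"])

# Reading and writing a JSON file (placeholder path)
df.write.mode("overwrite").json("employees_json")
spark.read.json("employees_json").show()

spark.stop()
```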
Repairing and Normalizing Data
- Repairing and Normalizing Data
- Dealing with Missing Data
- Sample Data Set
- Getting Info on Null Data
- Dropping a Column
- Interpolating Missing Data in pandas
- Replacing the Missing Values with the Mean Value
- Scaling (Normalizing) the Data
- Data Preprocessing with scikit-learn
- Scaling with the scale() Function
- The MinMaxScaler Object
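A compact sketch of the data-repair and normalization steps above, using pandas and scikit-learn with an invented sample data set:

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale, MinMaxScaler

# Sample data set with missing values (values are illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35],
    "income": [50_000, 62_000, np.nan, 58_000],
    "notes": [None, None, None, None],
})

print(df.isnull().sum())                                  # getting info on null data
df = df.drop(columns=["notes"])                           # dropping a column
df["age"] = df["age"].interpolate()                       # interpolating missing data
df["income"] = df["income"].fillna(df["income"].mean())   # replacing with the mean value

# Scaling (normalizing) the data with scikit-learn
print(scale(df))                                          # zero mean, unit variance
print(MinMaxScaler().fit_transform(df))                   # values rescaled to [0, 1]
```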
Data Grouping and Aggregation in Python
- Data Aggregation and Grouping
- Sample Data Set
- The pandas.core.groupby.SeriesGroupBy Object
- Grouping by Two or More Columns
- Emulating SQL's WHERE Clause
- Pivot Tables
- Cross-Tabulation
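A short pandas sketch of the grouping and aggregation topics above, again with an invented sample data set:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "product": ["a", "b", "a", "b"],
    "amount": [100, 150, 200, 120],
})

# The SeriesGroupBy object: grouping one column by another
print(sales.groupby("region")["amount"].sum())

# Grouping by two or more columns
print(sales.groupby(["region", "product"])["amount"].mean())

# Emulating SQL's WHERE clause with boolean indexing
print(sales[sales["amount"] > 120])

# Pivot tables and cross-tabulation
print(sales.pivot_table(values="amount", index="region", columns="product", aggfunc="sum"))
print(pd.crosstab(sales["region"], sales["product"]))
```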