Curso PySpark

24 hours
Overview

The Curso PySpark was designed to give developers and/or data analysts a practical, engaging introduction to the Python programming language and Apache PySpark. In this course, you will learn to write Spark applications in Python.

Target Audience
  • This course is recommended for developers and/or data analysts
Prerequisites
  • Programming and/or scripting experience in a language other than Python
Materials
English/Portuguese/Hands-on Lab
Course Outline

Introduction to Python

  1. Data Types and Variables
  2. Python Collections
  3. Control Statements and Looping
  4. Functions in Python
  5. Working With Data in Python
  6. Reading and Writing Text Files
  7. Functional Programming Primer
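As a taste of the functional-programming primer covered in this module, a minimal sketch (the variable names are illustrative, not from the course materials):

```python
from functools import reduce

# A small list to work with
numbers = [1, 2, 3, 4, 5]

# Filter even values with a list comprehension
evens = [n for n in numbers if n % 2 == 0]

# map/filter/reduce: the functional-programming building blocks
squares = list(map(lambda n: n * n, numbers))
total = reduce(lambda a, b: a + b, numbers)

print(evens)    # [2, 4]
print(squares)  # [1, 4, 9, 16, 25]
print(total)    # 15
```

The same lambda-based style carries directly over to Spark, where transformations such as `map` and `filter` take these functions as arguments.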

Introduction to Apache Spark

  1. What is Apache Spark
  2. A Short History of Spark
  3. Where to Get Spark?
  4. The Spark Platform
  5. Spark Logo
  6. Common Spark Use Cases
  7. Languages Supported by Spark
  8. Running Spark on a Cluster
  9. The Driver Process
  10. Spark Applications
  11. Spark Shell
  12. The spark-submit Tool
  13. The spark-submit Tool Configuration
  14. The Executor and Worker Processes
  15. The Spark Application Architecture
  16. Interfaces with Data Storage Systems
  17. Limitations of Hadoop's MapReduce
  18. Spark vs MapReduce
  19. Spark as an Alternative to Apache Tez
  20. The Resilient Distributed Dataset (RDD)
  21. Datasets and DataFrames
  22. Spark Streaming (Micro-batching)
  23. Spark SQL
  24. Example of Spark SQL
  25. Spark Machine Learning Library
  26. GraphX
  27. Spark vs R

The Spark Shell

  1. The Spark Shell
  2. The Spark 2+ Command-Line Shells
  3. The Spark Shell UI
  4. Spark Shell Options
  5. Getting Help
  6. Jupyter Notebook Shell Environment
  7. Example of a Jupyter Notebook Web UI (Databricks Cloud)
  8. The Spark Context (sc) and Spark Session (spark)
  9. Creating a Spark Session Object in Spark Applications
  10. The Shell Spark Context Object (sc)
  11. The Shell Spark Session Object (spark)
  12. Loading Files
  13. Saving Files

Spark RDDs

  1. The Resilient Distributed Dataset (RDD)
  2. Ways to Create an RDD
  3. Supported Data Types
  4. RDD Operations
  5. RDDs are Immutable
  6. Spark Actions
  7. RDD Transformations
  8. Other RDD Operations
  9. Chaining RDD Operations
  10. RDD Lineage
  11. The Big Picture
  12. What May Go Wrong
  13. Checkpointing RDDs
  14. Local Checkpointing
  15. Parallelized Collections
  16. More on parallelize() Method
  17. The Pair RDD
  18. Where do I use Pair RDDs?
  19. Example of Creating a Pair RDD with Map
  20. Example of Creating a Pair RDD with keyBy
  21. Miscellaneous Pair RDD Operations
  22. RDD Caching
  23. RDD Persistence

Parallel Data Processing with Spark

  1. Running Spark on a Cluster
  2. Data Partitioning
  3. Data Partitioning Diagram
  4. Single Local File System RDD Partitioning
  5. Multiple File RDD Partitioning
  6. Special Cases for Small-sized Files
  7. Parallel Data Processing of Partitions
  8. Spark Application, Jobs, and Tasks
  9. Stages and Shuffles
  10. The "Big Picture"

Shared Variables in Spark

  1. Shared Variables in Spark
  2. Broadcast Variables
  3. Creating and Using Broadcast Variables
  4. Example of Using Broadcast Variables
  5. Problems with Global Variables
  6. Example of the Closure Problem
  7. Accumulators
  8. Creating and Using Accumulators
  9. Example of Using Accumulators (Scala Example)
  10. Example of Using Accumulators (Python Example)
  11. Custom Accumulators

Introduction to Spark SQL

  1. What is Spark SQL?
  2. Uniform Data Access with Spark SQL
  3. Hive Integration
  4. Hive Interface
  5. Integration with BI Tools
  6. What is a DataFrame?
  7. Creating a DataFrame in PySpark
  8. Commonly Used DataFrame Methods and Properties in PySpark
  9. Grouping and Aggregation in PySpark
  10. The "DataFrame to RDD" Bridge in PySpark
  11. The SQLContext Object
  12. Examples of Spark SQL / DataFrame (PySpark Example)
  13. Converting an RDD to a DataFrame Example
  14. Example of Reading / Writing a JSON File
  15. Using JDBC Sources
  16. JDBC Connection Example
  17. Performance, Scalability, and Fault-tolerance of Spark SQL

Repairing and Normalizing Data

  1. Repairing and Normalizing Data
  2. Dealing with Missing Data
  3. Sample Data Set
  4. Getting Info on Null Data
  5. Dropping a Column
  6. Interpolating Missing Data in pandas
  7. Replacing the Missing Values with the Mean Value
  8. Scaling (Normalizing) the Data
  9. Data Preprocessing with scikit-learn
  10. Scaling with the scale() Function
  11. The MinMaxScaler Object
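A minimal sketch of the repair-and-normalize workflow above, assuming pandas and scikit-learn are installed; the `age` column and its values are a made-up sample data set:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A sample data set with one missing value
df = pd.DataFrame({"age": [20.0, None, 40.0]})

# Get info on null data, then replace the missing value with the column mean
null_count = df["age"].isna().sum()             # 1
df["age"] = df["age"].fillna(df["age"].mean())  # mean of 20 and 40 -> 30

# Scale (normalize) the column into the [0, 1] range
scaled = MinMaxScaler().fit_transform(df[["age"]])
print(scaled.ravel().tolist())   # [0.0, 0.5, 1.0]
```

Mean imputation is only one of the strategies covered; dropping the column or interpolating between neighbors are the alternatives when the mean would distort the data.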

Data Grouping and Aggregation in Python

  1. Data Aggregation and Grouping
  2. Sample Data Set
  3. The pandas.core.groupby.SeriesGroupBy Object
  4. Grouping by Two or More Columns
  5. Emulating SQL's WHERE Clause
  6. The Pivot Tables
  7. Cross-Tabulation
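The grouping, filtering, pivot-table, and cross-tabulation topics above can be sketched in pandas; the departments, cities, and salaries are an invented sample data set:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["sales", "sales", "it"],
    "city":   ["sp", "rj", "sp"],
    "salary": [100, 200, 300],
})

# GroupBy: mean salary per department
means = df.groupby("dept")["salary"].mean()

# Emulating SQL's WHERE clause with boolean indexing
sales_only = df[df["dept"] == "sales"]

# A pivot table: salaries summed by department and city
pivot = df.pivot_table(values="salary", index="dept",
                       columns="city", aggfunc="sum")

# Cross-tabulation: counts of dept vs city combinations
xtab = pd.crosstab(df["dept"], df["city"])

print(means["sales"])   # 150.0
```

Grouping by two or more columns is the same call with a list, e.g. `df.groupby(["dept", "city"])`.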

Related Courses

Curso Python Programação Advanced

32 hours

Curso SAP ABAP S4hana Developer Foundation

32 hours

Curso Full Stack and React Essentials

32 hours

Curso Node.js for JavaScript Developers

24 hours

Curso Elasticsearch for Developers

16 hours

Curso Elastic Stack Foundation

24 hours

Curso Apache Spark™ Foundation

8 hours