Python for Big Data with PySpark Course

30 hours
Overview

The Python for Big Data with PySpark course offers a hands-on immersion in processing large volumes of data using the Apache Spark framework with the Python language. PySpark is widely used by large companies to run large-scale data analyses with distributed processing, making it efficient to work with huge datasets. During the course, students will learn to build data pipelines, process and analyze data with PySpark, and optimize Big Data workflows.

Objective

After completing this Python for Big Data with PySpark course, you will be able to:

  • Process large volumes of data using PySpark
  • Build distributed data pipelines with Apache Spark
  • Perform analyses and transform data at scale
  • Optimize processing tasks and manage Spark clusters
Target Audience
  • Data engineers
  • Data scientists
  • Developers who work with large volumes of data
  • IT students and professionals interested in Big Data analytics
Prerequisites
  • Basic knowledge of Python
  • Familiarity with SQL and databases
  • Notions of Big Data and Apache Spark (desirable)
Materials
English + Exercises + Hands-on Lab
Syllabus

Introduction to Big Data and PySpark

  1. Overview of Big Data concepts
  2. Understanding the architecture of Apache Spark
  3. Setting up PySpark in a local and distributed environment

PySpark Essentials

  1. Introduction to Resilient Distributed Datasets (RDDs)
  2. Working with DataFrames in PySpark
  3. PySpark SQL for querying large datasets

Data Ingestion and Preprocessing

  1. Loading data from multiple sources (CSV, JSON, Parquet, etc.)
  2. Cleaning and transforming data with PySpark
  3. Handling missing and inconsistent data at scale

Working with RDDs and DataFrames

  1. Understanding RDD transformations and actions
  2. Manipulating and processing DataFrames in PySpark
  3. Optimizing data transformations for performance

Big Data Analytics with PySpark

  1. Performing aggregations and joins on large datasets
  2. Analyzing structured and unstructured data
  3. Using PySpark SQL for advanced queries

Machine Learning with PySpark (MLlib)

  1. Introduction to PySpark's MLlib library
  2. Building and training machine learning models at scale
  3. Applying supervised and unsupervised learning with PySpark

Real-Time Data Processing with PySpark Streaming

  1. Introduction to Spark Streaming
  2. Processing real-time data streams with PySpark
  3. Integrating PySpark with Kafka and other data sources

Optimizing Spark Jobs

  1. Understanding Spark's execution model
  2. Tuning Spark jobs for performance
  3. Managing and monitoring Spark jobs in a cluster

Building and Deploying Big Data Pipelines

  1. Designing ETL (Extract, Transform, Load) pipelines with PySpark
  2. Deploying Spark applications on a distributed cluster
  3. Managing Spark jobs in a production environment

Case Study and Project Development

  1. Real-world case study: Building a data pipeline with PySpark
  2. Hands-on project: End-to-end Big Data solution using PySpark
  3. Best practices for handling Big Data with PySpark