Overview
The Data Engineering with Databricks course teaches participants how to build, optimize, and manage data pipelines on the Databricks Unified Analytics platform. Participants will learn to use Delta Lake, Spark SQL, and native Databricks features to ingest, process, and analyze data at scale.
In addition, the course explores advanced topics such as incremental data processing, multi-hop architectures, and dashboard production, equipping data engineers to deliver high-performance enterprise solutions.
Course Outline
Module 1: Databricks Workspace and Services
- Introduction to Databricks Architecture
- Understanding Databricks Workspaces
- Databricks Runtime and Clusters
- Managing Users and Groups
- Navigating the Databricks UI
- Introduction to Databricks Utilities and Repos
Module 2: Delta Lake
- Overview of Delta Lake
- The Delta Architecture and ACID Transactions
- Schema Enforcement and Evolution
- Time Travel and Data Versioning
- Optimizing Delta Tables (OPTIMIZE, ZORDER BY, VACUUM)
- Integrating Delta Lake with BI and ML tools
Module 3: Relational Entities on Databricks
- Creating Databases, Tables, and Views
- Managed vs. External Tables
- Partitioning and Bucketing Strategies
- Query Optimization Techniques
- Working with Temporary and Global Temporary Views
- Best Practices for Data Modeling in Databricks
Module 4: ETL With Spark SQL
- Introduction to ETL Concepts
- Reading and Writing Data with Spark SQL
- Using DataFrames and Datasets
- Transformations and Actions in Spark
- Implementing Joins, Aggregations, and Filters
- Performance Tuning and Caching Strategies
Module 5: OPTIONAL - Python for Spark SQL
- Python Integration in Databricks Notebooks
- Using PySpark for Data Transformations
- Working with Spark DataFrames in Python
- UDFs (User Defined Functions) in PySpark
- Managing Dependencies with Databricks Libraries
Module 6: Incremental Data Processing
- Understanding Incremental vs. Batch Processing
- Implementing Change Data Capture (CDC)
- Designing Incremental Pipelines using Delta Lake
- Using MERGE for Upserts and Deletes
- Scheduling Incremental Jobs
Module 7: Multi-Hop Architecture
- Introduction to the Bronze, Silver, and Gold Layers
- Designing a Multi-Hop Data Flow
- Implementing Data Quality Checks
- Data Lineage and Traceability
- Performance Optimization in Multi-Hop Architectures
Module 8: Delta Live Tables
- Overview of Delta Live Tables (DLT)
- Building Reliable and Declarative Data Pipelines
- Managing Dependencies and Auto-Scaling
- Monitoring and Debugging DLT Pipelines
- Integrating DLT with Orchestration Tools
Module 9: Task Orchestration with Jobs
- Introduction to Databricks Jobs
- Configuring and Scheduling Jobs
- Job Clusters vs. Interactive Clusters
- Multi-Task Job Pipelines
- Error Handling and Retry Strategies
- Integration with External Orchestration Systems
Module 10: Running a DBSQL Query
- Introduction to Databricks SQL (DBSQL)
- Creating and Managing SQL Warehouses
- Writing and Executing SQL Queries in DBSQL
- Query Optimization and Execution Plans
- Using Query History and Query Profiles
Module 11: Managing Permissions
- Access Control in Databricks
- Workspace, Table, and Cluster Permissions
- Unity Catalog and Data Governance
- Managing Tokens and Secrets
- Auditing and Monitoring Access
Module 12: Productionalizing Dashboards and Queries in DBSQL
- Creating Dashboards in Databricks SQL
- Managing Alerts and Scheduled Queries
- Embedding and Sharing Dashboards
- Best Practices for Dashboard Performance
- Production Deployment and Maintenance