Curso Pentaho Data Integration Fundamentals
40 horasVisão Geral
Curso Pentaho Data Integration Fundamentals. Uma introdução de 24 horas à transformação de dados usando Pentaho Data Integration (PDI). Do começo ao desenvolvimento de uma estrutura ETL para ingerir arquivos de estrutura variada.
Objetivo
Transformações
- Etapas de entrada e saída
- Transformações de campo
- Junções e pesquisas
- Transformações de conjuntos
- Entradas JSON e XML
- Variáveis e portabilidade
- Registro e desempenho
- Injeção de metadados
Empregos
- Orquestração básica
- Gerenciamento de arquivos e bancos de dados
- Iteração e loop em trabalhos
Publico Alvo
Novos usuários encarregados de usar o Pentaho e/ou usuários existentes que buscam formalizar seus conhecimentos - Analistas de negócios, analistas de dados e desenvolvedores de ETL.
Materiais
Inglês/Português/Lab PráticoConteúdo Programatico
Introduction
- Installing and starting PDI. The user interface
Part I – Transformations
Input and output steps;
- Exploration of the various ways to read data into, and write data out of, PDI: CSV files, Excel files SQL queries, etc. Installing JDBC drivers
- Lab 1: CSV Input, MySQL output
Field transformations
- Overview of various transformation steps: Calculator, string manipulation, adding counters, value mapping, handling nulls, javascript and regular expressions.
Joins and lookups
- Merging two or more data streams and combining the data: managing slowly changing dimensions dimensions, in-memory and database lookups, querying HTTP services/apis, merge joins, row diff, etc.
- Lab 2: Joins and lookups (enriching data stream)
Set transformations
- Operations on groups of rows: sorting, grouping, splitting fields into rows, normalising/denormalising data, cloning, appending.
- Lab 3: Grouping data
JSON and XML inputs
- Reading XML data via Xpath and using the very fast performing StaX parser. JSON parsing via JSONpath
- Lab 4: JSON and XML inputs (Xpath, Stax parser, Jsonpath)
Variables and portability
- Setting and getting variables; global variables, runtime variables, parameters; portable connections, file paths, and other best practices
- Lab 5: Portable transformations
Logging and Performance
- Reading PDI logs; analysing performance and runtime metrics; examples of fast and slow streps, identifying bottlenecks; step copies in parallel
Metadata injection
- Use cases for metadata injection. Modifying metadata in runtime. Advanced metadata injection options.
- Lab 6: Flexible CSV loading
Part II – Jobs
Basic orchestration
- Usage of PDI jobs to orchestrate tasks; overview of job entries :sub-jobs, sub-transformations, SQL, shell scripts, conditions, error handling, getting/putting files, etc. Wrapper jobs.
File and DB management
- Using lock files; downloading and archiving files; checking database connections; conditionally create/drop/modify database structure; error handling; recording execution results
- Lab 7: Building a simple job
Iteration and looping in jobs
- Run job/transformation for each file in folder; handling different file types in one go; iterating over API results; loop until condition met; running .sh or .bat scripts depending on OS
- Lab 8: Developing a powerful ETL framework