Pentaho Data Integration Course

40 hours
Overview

Pentaho Data Integration (PDI) is the data integration (ETL) tool of the Pentaho Open Source BI Suite, a collection of software that supports business decision-making.

Objectives
  • Introduction to PDI
  • Introduction to transformations
  • Manipulating real-world data
  • Filtering, searching, and performing other useful operations on data
  • Transforming the row set
  • Performing advanced operations with databases
  • Creating basic task flows (jobs)
  • Real-time data integration
  • Performance tuning
  • Parallelization, clustering, and partitioning
Prerequisites

Knowledge of ETL concepts

Materials

English

Course Content

MODULE I

  • Getting Started with Pentaho Data Integration
  • Pentaho Data Integration and Pentaho BI Suite
  • Using PDI in real-world scenarios
  • Loading data warehouses or data marts
  • Integrating data
  • Data Cleansing
  • Migrating information
  • Exporting data
  • Integrating PDI along with other Pentaho tools
  • Demo – Installing PDI
  • Demo – starting and customizing Spoon
  • Setting preferences in the Options window
  • Storing transformations and jobs in a repository
  • Demo – creating a hello world transformation
  • Directing the Kettle engine with transformations (a minimal Java sketch follows this module's outline)
  • Exploring the Spoon interface
  • Designing a transformation
  • Running and previewing the transformation
  • Demo – installing MySQL on Windows
  • Getting Started with Transformations
  • Designing and previewing transformations
  • Demo – creating a simple transformation and getting familiar with the design process
  • Understanding the Kettle rowset
  • Seeing the results in the Execution Results pane
  • Demo – generating a range of dates and inspecting the data as it is being created
  • Adding/modifying fields using various PDI steps
  • Demo – avoiding errors while converting the estimated time from string to integer
  • The error handling functionality
  • Demo – configuring the error handling to see the description of the errors
  • Personalizing the error handling
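
For the items on directing the Kettle engine and running transformations, here is a minimal sketch of how a transformation built in Spoon could be executed from Java using the PDI (Kettle) API. It assumes the kettle-engine libraries are on the classpath; the file name hello_world.ktr is a placeholder.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class RunHelloWorld {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment (registers steps and plugins)
            KettleEnvironment.init();

            // Load the transformation created in Spoon (placeholder file name)
            TransMeta transMeta = new TransMeta("hello_world.ktr");

            // Create, execute, and wait for the transformation to finish
            Trans trans = new Trans(transMeta);
            trans.execute(null);
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                System.err.println("The transformation finished with errors.");
            }
        }
    }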

MODULE II

  • Manipulating Real-world Data
  • Reading data from files
  • Demo – reading results of football matches from files
  • Reading several files at once
  • Demo – reading all your files at once using a single Text file input step
  • Regular expressions
  • Troubleshooting reading files
  • Sending data to files
  • Demo – sending the results of matches to a plain file
  • Getting system information
  • Demo – reading and writing matches files with flexibility
  • Running transformations from a terminal window
  • Demo – running the matches transformation from a terminal window
  • Demo – getting data from an XML file with information about countries
  • PDI transformation files
  • Getting data from XML files
  • Kettle variables
  • Filtering, Searching, and Performing Other Useful Operations with Data
  • Sorting data
  • Demo – sorting information about matches with the Sort rows step
  • Calculations on groups of rows
  • Demo – calculating football match statistics by grouping data
  • The Group by step
  • Numeric fields
  • Filtering
  • Demo – counting frequent words by filtering
  • Demo – refining the counting task by filtering even more
  • Filtering rows using the Filter rows step
  • Looking up data
  • Demo – finding out which language people speak
  • The Stream lookup step
  • Data cleaning
  • Demo – fixing words before counting them
  • Cleansing data with PDI
  • Controlling the Flow of Data
  • Splitting streams
  • Demo – browsing new features of PDI by copying a dataset
  • Copying rows
  • Distributing rows
  • Demo – assigning tasks by distributing
  • Splitting the stream based on conditions
  • Demo – assigning tasks by filtering priorities with the Filter rows step
  • PDI steps for splitting the stream based on conditions
  • Demo – assigning tasks by filtering priorities with the Switch/Case step
  • Merging streams
  • Demo – gathering progress and merging it all together
  • PDI options for merging streams
  • Demo – giving priority to Bouchard by using the Append streams step
  • Treating invalid data by splitting and merging streams
  • Demo – treating errors in the estimated time to avoid discarding rows
  • Treating rows with invalid data
  • Transforming Your Data by Coding
  • Doing simple tasks with the JavaScript step
  • Demo – counting frequent words by coding in JavaScript
  • Using the JavaScript language in PDI
  • Using transformation predefined constants
  • Testing the script using the Test script button
  • Reading and parsing unstructured files with JavaScript
  • Demo – changing a list of house descriptions with JavaScript
  • Looping over the dataset rows
  • Doing simple tasks with the Java Class step (a minimal sketch follows this module's outline)
  • Demo – counting frequent words by coding in Java
  • Using the Java language in PDI
  • Sending rows to the next step
  • Data types equivalence
  • Transforming the dataset with Java
  • Demo – splitting the field to rows using Java
  • Avoiding coding by using purpose-built steps
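
To illustrate the Java Class topics above, here is a rough sketch of the kind of code that goes inside a User Defined Java Class step. PDI generates the surrounding class, so only the processRow() method is written; the input field word and the output field word_upper are assumptions and would have to exist in your stream and in the step's Fields tab.

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException {
        // Read one row from the incoming stream; null means there is no more input
        Object[] r = getRow();
        if (r == null) {
            setOutputDone();
            return false;
        }

        // Make room for the extra output field declared in the Fields tab
        Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());

        // Take the assumed input field "word" and write its uppercase form to "word_upper"
        String word = get(Fields.In, "word").getString(r);
        get(Fields.Out, "word_upper").setValue(outputRow, word.toUpperCase());

        // Send the row to the next step and ask for another one
        putRow(data.outputRowMeta, outputRow);
        return true;
    }

This is a simpler task than the word-count demo in the outline, but it shows the same mechanics: getting rows, transforming them, and sending rows to the next step.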

MODULE III

  • Transforming the Row Set
  • Converting rows to columns
  • Demo – enhancing the file by converting rows to columns
  • Converting row data to column data by using the Row Denormalizer
  • Aggregating data with a Row Denormalizer
  • Demo – aggregating football matches data with the Row Denormalizer
  • Using the Row Denormalizer for aggregating data
  • Normalizing data
  • Demo – enhancing the matches file by normalizing the dataset
  • Modifying the dataset with a Row Normalizer
  • Summarizing: PDI steps which operate on sets of rows
  • Generating a custom time dimension dataset using Kettle variables
  • Demo – creating the time dimension dataset
  • Getting variables
  • Demo – parameterizing the start and end date of the time dimension dataset
  • Working with Databases
  • Introducing the Steel Wheels sample database
  • Connecting to the Steel Wheels database
  • Demo – creating a connection to the Steel Wheels database
  • Connecting with Relational Database Management Systems
  • Querying a database
  • Demo – getting data about shipped orders
  • Getting data from the database with the Table input step (a minimal Java sketch follows this module's outline)
  • Generating a new dataset using the SELECT statement
  • Making flexible queries using parameters
  • Demo – getting orders in a range of dates using parameters
  • Adding parameters to your queries
  • Making flexible queries by using Kettle variables
  • Demo – getting orders in a range of dates by using Kettle variables
  • Using Kettle variables in your queries
  • Sending data to a database
  • Demo – loading a table with a list of manufacturers
  • Inserting or updating data by using other PDI steps
  • Demo – inserting new products or updating existing ones
  • Demo – testing the update of existing products
  • Eliminating data from a database
  • Demo – deleting data about discontinued items
  • Performing Advanced Operations with Databases
  • Demo – populating the Jigsaw database
  • Exploring the Jigsaw database model
  • Doing simple lookups
  • Demo – using a Database lookup step to create a list of products to buy
  • Performing complex lookups
  • Demo – using a Database join step to create a list of suggested products to buy
  • Joining data from the database to the stream by using a Database join step
  • Introducing dimensional modeling
  • Loading dimensions with data
  • Demo – loading a region dimension with a Combination lookup/update step
  • Describing data with dimensions
  • Loading Type I SCDs with a Combination lookup/update step
  • Storing history of changes
  • Demo – keeping a history of changes in products by using the Dimension lookup/update
  • Keeping an entire history of data with a Type II SCD
  • Loading Type II SCDs with the Dimension lookup/update step
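
The database items above (querying the Steel Wheels data and making flexible queries with parameters) can be pictured outside of Spoon with a short JDBC sketch. The connection URL, credentials, and the ORDERS table layout are assumptions based on the usual Steel Wheels sample data loaded into a MySQL schema named sampledata; adjust them to your own environment.

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ShippedOrdersInRange {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details for the sample database
            String url = "jdbc:mysql://localhost:3306/sampledata";
            try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
                // The two ? marks play the same role as the positional
                // parameters of a parameterized Table input query
                String sql = "SELECT ORDERNUMBER, ORDERDATE, STATUS "
                           + "FROM ORDERS "
                           + "WHERE STATUS = 'Shipped' AND ORDERDATE BETWEEN ? AND ?";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setDate(1, Date.valueOf("2004-01-01"));
                    ps.setDate(2, Date.valueOf("2004-12-31"));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println(rs.getInt("ORDERNUMBER") + "  " + rs.getDate("ORDERDATE"));
                        }
                    }
                }
            }
        }
    }

In a PDI transformation the equivalent query would use ? placeholders (or ${} Kettle variables) directly in the Table input step, with the values supplied by a previous step or by the environment.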

MODULE IV

  • Creating Basic Task Flows (Jobs)
  • Introducing PDI jobs
  • Demo – creating a folder with a Kettle job
  • Executing processes with PDI jobs
  • Using Spoon to design and run jobs
  • Demo – creating a simple job and getting familiar with the design process
  • Changing the flow of execution on the basis of conditions
  • Looking at the results in the Execution results window
  • Running transformations from jobs
  • Demo – generating a range of dates and inspecting how things are running
  • Using the Transformation job entry
  • Receiving arguments and parameters in a job
  • Demo – generating a hello world file by using arguments and parameters
  • Using named parameters in jobs
  • Running jobs from a terminal window
  • Demo – executing the hello world job from a terminal window
  • Demo – calling the hello world transformation with fixed arguments and parameters
  • Deciding between a command-line argument and a named parameter (a minimal Java sketch follows this module's outline)
  • Creating Advanced Transformations and Jobs
  • Re-using part of your transformations
  • Demo – calculating statistics with the use of a sub-transformation
  • Creating a job as a process flow
  • Demo – generating top average scores by copying and getting rows
  • Using the copy/get rows mechanism to transfer data between transformations
  • Demo – generating custom files by executing a transformation for every input row
  • Executing for each row
  • Enhancing your processes with the use of variables
  • Demo – generating custom messages by setting a variable with the name of the examination file
  • Case Study: Developing and Implementing a Sample Data Mart
  • Exploring the sales data mart
  • Deciding the level of granularity
  • Loading the dimensions
  • Demo – loading the dimensions for the sales data mart
  • Extending the sales data mart model
  • Loading a fact table with aggregated data
  • Demo – loading the sales fact table by looking up dimensions
  • Getting the data from the source with SQL queries
  • Translating the business keys into surrogate keys
  • Obtaining the surrogate key for Type I SCD
  • Obtaining the surrogate key for Type II SCD
  • Obtaining the surrogate key for the Junk dimension
  • Obtaining the surrogate key for the Time dimension
  • Getting facts and dimensions together
  • Demo – loading the fact table using a range of dates obtained from the command line
  • Demo – loading the SALES star schema model
  • Automating the administrative tasks
  • Demo – automating the loading of the sales data mart
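
As a companion to the items on jobs, named parameters, and automating the load of the sales data mart, the sketch below shows how a job could be launched from Java with a named parameter, using the PDI (Kettle) API. The file name load_sales_dm.kjb and the parameter START_DATE are placeholders; the parameter must be declared in the job settings and the kettle-engine libraries must be on the classpath.

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.job.Job;
    import org.pentaho.di.job.JobMeta;

    public class RunSalesJob {
        public static void main(String[] args) throws Exception {
            // Initialize the Kettle environment before loading any job or transformation
            KettleEnvironment.init();

            // Load the job definition and set a named parameter (both are placeholders)
            JobMeta jobMeta = new JobMeta("load_sales_dm.kjb", null);
            jobMeta.setParameterValue("START_DATE", "2024-01-01");

            // Run the job and wait for it to finish
            Job job = new Job(null, jobMeta);
            job.start();
            job.waitUntilFinished();

            if (job.getResult().getNrErrors() > 0) {
                System.err.println("The job finished with errors.");
            }
        }
    }

The same job could equally be launched with Kitchen from a terminal window, passing the named parameter on the command line, which is the approach covered in this module.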

MODULE V

  • Real-Time Data Integration
  • Introduction to Real-Time ETL
  • Real-Time Challenges
  • Requirements
  • Transformation Streaming
  • A Practical Example of Transformation Streaming
  • Debugging
  • Third-Party Software and Real-Time Integration
  • Java Message Service
  • Creating a JMS Connection and Session (a minimal Java sketch follows this module's outline)
  • Performance Tuning
  • Transformation Performance: Finding the Weakest Link
  • Finding Bottlenecks by Simplifying
  • Finding Bottlenecks by Measuring
  • Copying Rows of Data
  • Improving Transformation Performance
  • Using Lazy Conversion for Reading Text Files
  • Single-File Parallel Reading, Multi-File Parallel Reading
  • Configuring the NIO Block Size
  • Changing Disks and Reading Text Files
  • Improving Performance in Writing Text Files
  • Using Lazy Conversion for Writing Text Files
  • Parallel File Writing
  • Improving Database Performance
  • Avoiding Dynamic SQL, Handling Roundtrips
  • Handling Relational Databases
  • Sorting Data, Sorting on the Database, Sorting in Parallel
  • Reducing CPU Usage
  • Optimizing the Use of JavaScript
  • Launching Multiple Copies of a Step
  • Selecting and Removing Values
  • Managing Thread Priorities
  • Adding Static Data to Rows of Data
  • Limiting the Number of Step Copies
  • Avoiding Excessive Logging
  • Improving Job Performance
  • Loops in Jobs
  • Database Connection Pools
  • Parallelization, Clustering, and Partitioning
  • Multi-Threading
  • Row Distribution, Row Merging, and Row Redistribution
  • Data Pipelining
  • Consequences of Multi-Threading
  • Database Connections
  • Order of Execution, Parallel Execution in a Job
  • Using Carte as a Slave Server
  • The Configuration File
  • Defining Slave Servers
  • Remote Execution, Monitoring Slave Servers
  • Carte Security, Services
  • Clustering Transformations
  • Defining a Cluster Schema, Designing Clustered Transformations
  • Execution and Monitoring
  • Metadata Transformations
  • Rules
  • Data Pipelining, Partitioning
  • Defining a Partitioning Schema
  • Objectives of Partitioning, Implementing Partitioning
  • Internal Variables
  • Database Partitions
  • Partitioning in a Clustered Transformation
  • Bonus Topics
  • Working with Repositories
  • Pan/Kitchen – Launching Transformations and Jobs from the Command-Line
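
For the Java Message Service items in this module, the following is a minimal sketch of creating a JMS connection and session and reading one message, which is the kind of stream a real-time transformation would consume. It uses Apache ActiveMQ purely as an example provider; the broker URL and the queue name incoming.sales are placeholders.

    import javax.jms.Connection;
    import javax.jms.MessageConsumer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class JmsSessionExample {
        public static void main(String[] args) throws Exception {
            // Placeholder broker URL; any JMS provider could be used instead of ActiveMQ
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616");

            Connection connection = factory.createConnection();
            connection.start();

            // A non-transacted session with automatic message acknowledgement
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("incoming.sales");

            // Receive a single message, waiting at most five seconds
            MessageConsumer consumer = session.createConsumer(queue);
            TextMessage message = (TextMessage) consumer.receive(5000);
            if (message != null) {
                System.out.println("Received: " + message.getText());
            }

            session.close();
            connection.close();
        }
    }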