Pentaho Data Integration Course
40 hours
Overview
Pentaho Data Integration (PDI) is part of the Pentaho Open Source BI Suite, a collection of software that supports business decision making.
Objectives
- Introduction to PDI
- Introduction to transformations
- Manipulating real-world data
- Filtering, searching, and performing other useful operations with data
- Transforming the row set
- Performing advanced operations with databases
- Creating basic task flows (jobs)
- Real-time data integration
- Performance tuning
- Parallelization, clustering, and partitioning
Prerequisites
Knowledge of ETL concepts
Materials
English
Course Content
MODULE I
- Getting Started with Pentaho Data Integration
- Pentaho Data Integration and Pentaho BI Suite
- Using PDI in real-world scenarios
- Loading data warehouses or data marts
- Integrating data
- Data Cleansing
- Migrating information
- Exporting data
- Integrating PDI along with other Pentaho tools
- Demo – Installing PDI
- Demo – starting and customizing Spoon
- Setting preferences in the Options window
- Storing transformations and jobs in a repository
- Demo – creating a hello world transformation
- Directing the Kettle engine with transformations
- Exploring the Spoon interface
- Designing a transformation
- Running and previewing the transformation
- Demo – installing MySQL on Windows
- Getting Started with Transformations
- Designing and previewing transformations
- Demo – creating a simple transformation and getting familiar with the design process
- Understanding the Kettle rowset
- Seeing the results in the Execution Results pane
- Demo – generating a range of dates and inspecting the data as it is being created
- Adding/modifying fields using various PDI steps
- Demo – avoiding errors while converting the estimated time from string to integer (sketched after this module's topic list)
- The error handling functionality
- Demo – configuring the error handling to see the description of the errors
- Personalizing the error handling
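The error-handling demos above work with an estimated time field that does not always convert cleanly from string to integer. As a rough standalone illustration of the same idea (the field name, sample values, and class name below are invented for this sketch and are not part of the course material), a plain Java program can divert rows that fail the conversion instead of aborting the run, which is what PDI's step error handling lets you configure:

```java
import java.util.ArrayList;
import java.util.List;

public class EstimatedTimeConversion {
    public static void main(String[] args) {
        // Hypothetical incoming values for an "estimated_time" field (strings).
        List<String> estimatedTimes = List.of("90", "120", "n/a", "45", "two hours");

        List<Integer> converted = new ArrayList<>();   // rows that convert cleanly
        List<String> errorRows = new ArrayList<>();    // rows diverted to the error stream

        for (String value : estimatedTimes) {
            try {
                converted.add(Integer.parseInt(value.trim()));
            } catch (NumberFormatException e) {
                // Keep the bad row plus a description instead of failing the whole run,
                // mirroring what the step error-handling configuration does in Spoon.
                errorRows.add(value + " -> " + e.getMessage());
            }
        }

        System.out.println("Converted rows: " + converted);
        System.out.println("Rows sent to error handling: " + errorRows);
    }
}
```

In Spoon the equivalent is defining error handling on the conversion step so that failing rows flow to a separate step, with the error description available as an extra field.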
MODULE II
- Manipulating Real-world Data
- Reading data from files
- Demo – reading results of football matches from files
- Reading several files at once
- Demo – reading all your files at once using a single Text file input step
- Regular expressions
- Troubleshooting reading files
- Sending data to files
- Demo – sending the results of matches to a plain file
- Getting system information
- Demo – reading and writing matches files with flexibility
- Running transformations from a terminal window
- Demo – running the matches transformation from a terminal window
- Demo – getting data from an XML file with information about countries
- PDI transformation files
- Getting data from XML files
- Kettle variables
- Filtering, Searching, and Performing Other Useful Operations with Data
- Sorting data
- Demo – sorting information about matches with the Sort rows step
- Calculations on groups of rows
- Demo – calculating football match statistics by grouping data
- The Group by step
- Numeric fields
- Filtering
- Demo – counting frequent words by filtering
- Demo – refining the counting task by filtering even more
- Filtering rows using the Filter rows step
- Looking up data
- Demo – finding out which language people speak
- The Stream lookup step
- Data cleaning
- Demo – fixing words before counting them
- Cleansing data with PDI
- Controlling the Flow of Data
- Splitting streams
- Demo – browsing new features of PDI by copying a dataset
- Copying rows
- Distributing rows
- Demo – assigning tasks by distributing
- Splitting the stream based on conditions
- Demo – assigning tasks by filtering priorities with the Filter rows step
- PDI steps for splitting the stream based on conditions
- Demo – assigning tasks by filtering priorities with the Switch/Case step
- Merging streams
- Demo – gathering progress and merging it all together
- PDI options for merging streams
- Demo – giving priority to Bouchard by using the Append Stream
- Treating invalid data by splitting and merging streams
- Demo – treating errors in the estimated time to avoid discarding rows
- Treating rows with invalid data
- Transforming Your Data by Coding
- Doing simple tasks with the JavaScript
- Demo – counting frequent words by coding in JavaScript
- Using the JavaScript language in PDI
- Using transformation predefined constants
- Testing the script using the Test script button
- Reading and parsing unstructured files with JavaScript
- Demo – changing a list of house descriptions with JavaScript
- Looping over the dataset rows
- Doing simple tasks with the Java Class
- Demo – counting frequent words by coding in Java (sketched after this module's topic list)
- Using the Java language in PDI
- Sending rows to the next step
- Data types equivalence
- Transforming the dataset with Java
- Demo – splitting the field to rows using Java
- Avoiding coding by using purpose-built steps
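The word-counting demos above are built inside PDI, first with the JavaScript step and then with the User Defined Java Class step. The standalone Java sketch below is not the code pasted into either step; it only illustrates the logic those demos implement (split text into words, drop very short words, count and rank occurrences), and the sample text is invented:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class FrequentWords {
    public static void main(String[] args) {
        // Invented sample text standing in for the lines read from a file.
        String text = "the quick brown fox jumps over the lazy dog the fox";

        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() <= 3) {
                continue; // drop short words, as the filtering demos do
            }
            counts.merge(word, 1, Integer::sum);
        }

        // Print words sorted by descending frequency.
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
              .forEach(e -> System.out.println(e.getKey() + ": " + e.getValue()));
    }
}
```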
MODULE III
- Transforming the Row Set
- Converting rows to columns
- Demo – enhancing the file by converting rows to columns
- Converting row data to column data by using the Row Denormaliser
- Aggregating data with a Row Denormaliser
- Demo – aggregating football matches data with the Row Denormaliser
- Using the Row Denormaliser for aggregating data
- Normalizing data
- Demo – enhancing the matches file by normalizing the dataset
- Modifying the dataset with a Row Normalizer
- Summarizing the PDI steps that operate on sets of rows
- Generating a custom time dimension dataset using Kettle variables
- Demo – creating the time dimension dataset
- Getting variables
- Demo – parameterizing the start and end date of the time dimension dataset
- Working with Databases
- Introducing the Steel Wheels sample database
- Connecting to the Steel Wheels database
- Demo – creating a connection to the Steel Wheels database
- Connecting with Relational Database Management Systems
- Querying a database
- Demo – getting data about shipped orders
- Getting data from the database with the Table input step
- Generating a new dataset using the SELECT statement
- Making flexible queries using parameters
- Demo – getting orders in a range of dates using parameters (sketched after this module's topic list)
- Adding parameters to your queries
- Making flexible queries by using Kettle variables
- Demo – getting orders in a range of dates by using Kettle variables
- Using Kettle variables in your queries
- Sending data to a database
- Demo – loading a table with a list of manufacturers
- Inserting or updating data by using other PDI steps
- Demo – inserting new products or updating existing ones
- Demo – testing the update of existing products
- Eliminating data from a database
- Demo – deleting data about discontinued items
- Performing Advanced Operations with Databases
- Demo – populating the Jigsaw database
- Exploring the Jigsaw database model
- Doing simple lookups
- Demo – using a Database lookup step to create a list of products to buy
- Performing complex lookups
- Demo – using a Database join step to create a list of suggested products to buy
- Joining data from the database to the stream by using a Database join step
- Introducing dimensional modeling
- Loading dimensions with data
- Demo – loading a region dimension with a Combination lookup/update step
- Describing data with dimensions
- Loading Type I SCD with a Combination lookup/update
- Storing history of changes
- Demo – keeping a history of changes in products by using the Dimension lookup/update
- Keeping an entire history of data with a Type II SCD
- Loading Type II SCDs with the Dimension lookup/update step
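The flexible-query demos above pass parameters and Kettle variables into a Table input step. The plain JDBC sketch below shows an equivalent parameterized query outside PDI; the connection URL, credentials, and the ORDERS/ORDERNUMBER/ORDERDATE/STATUS names are assumptions about a local copy of the Steel Wheels sample data and may differ in your environment:

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OrdersInRange {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust to your own Steel Wheels copy.
        String url = "jdbc:mysql://localhost:3306/steelwheels";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {

            // Equivalent of a Table input query with two positional parameters,
            // as in "Demo – getting orders in a range of dates using parameters".
            String sql = "SELECT ORDERNUMBER, ORDERDATE, STATUS "
                       + "FROM ORDERS WHERE ORDERDATE BETWEEN ? AND ?";

            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setDate(1, Date.valueOf("2004-01-01")); // start of the date range
                ps.setDate(2, Date.valueOf("2004-03-31")); // end of the date range

                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("ORDERNUMBER") + " | "
                                + rs.getDate("ORDERDATE") + " | "
                                + rs.getString("STATUS"));
                    }
                }
            }
        }
    }
}
```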
MODULE IV
- Creating Basic Task Flows (Jobs)
- Introducing PDI jobs
- Demo – creating a folder with a Kettle job
- Executing processes with PDI jobs
- Using Spoon to design and run jobs
- Demo – creating a simple job and getting familiar with the design process
- Changing the flow of execution on the basis of conditions
- Looking at the results in the Execution results window
- Running transformations from jobs
- Demo – generating a range of dates and inspecting how things are running
- Using the Transformation job entry
- Receiving arguments and parameters in a job
- Demo – generating a hello world file by using arguments and parameters
- Using named parameters in jobs
- Running jobs from a terminal window
- Demo – executing the hello world job from a terminal window
- Demo – calling the hello world transformation with fixed arguments and parameters
- Deciding between command-line arguments and named parameters
- Creating Advanced Transformations and Jobs
- Re-using part of your transformations
- Demo – calculating statistics with the use of a sub-transformation
- Creating a job as a process flow
- Demo – generating top average scores by copying and getting rows
- Using the copy/get rows mechanism to transfer data between transformations
- Demo – generating custom files by executing a transformation for every input row
- Executing for each row
- Enhancing your processes with the use of variables
- Demo – generating custom messages by setting a variable with the name of the examination file
- Case Study: Developing and Implementing a Sample Data Mart
- Exploring the sales data mart
- Deciding the level of granularity
- Loading the dimensions
- Demo – loading the dimensions for the sales data mart
- Extending the sales data mart model
- Loading a fact table with aggregated data
- Demo – loading the sales fact table by looking up dimensions
- Getting the data from the source with SQL queries
- Translating the business keys into surrogate keys (sketched after this module's topic list)
- Obtaining the surrogate key for Type I SCD
- Obtaining the surrogate key for Type II SCD
- Obtaining the surrogate key for the Junk dimension
- Obtaining the surrogate key for the Time dimension
- Getting facts and dimensions together
- Demo – loading the fact table using a range of dates obtained from the command line
- Demo – loading the SALES star schema model
- Automating the administrative tasks
- Demo – automating the loading of the sales data mart
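Several items above deal with translating business keys into the surrogate keys stored in the dimension tables before the fact table is loaded. In the course this is done with steps such as Database lookup and Dimension lookup/update; the standalone Java sketch below only illustrates the lookup idea, and the region keys and default value are invented:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SurrogateKeyLookup {
    public static void main(String[] args) {
        // Invented dimension content: business (natural) key -> surrogate key.
        Map<String, Integer> regionDimension = new HashMap<>();
        regionDimension.put("US-WEST", 1);
        regionDimension.put("US-EAST", 2);
        regionDimension.put("EMEA", 3);

        // Incoming fact rows carry only the business key.
        List<String> factBusinessKeys = List.of("EMEA", "US-EAST", "APAC");

        for (String businessKey : factBusinessKeys) {
            // Unknown keys fall back to a default surrogate key (here 0 for "unknown"),
            // mirroring the behaviour typically configured in a lookup step.
            int surrogateKey = regionDimension.getOrDefault(businessKey, 0);
            System.out.println(businessKey + " -> surrogate key " + surrogateKey);
        }
    }
}
```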
MODULE V
- Real-Time Data Integration
- Introduction to Real-Time ETL
- Real-Time Challenges
- Requirements
- Transformation Streaming
- A Practical Example of Transformation Streaming
- Debugging
- Third-Party Software and Real-Time Integration
- Java Message Service
- Creating a JMS Connection and Session (see the Java sketch at the end of this outline)
- Performance Tuning
- Transformation Performance: Finding the Weakest Link
- Finding Bottlenecks by Simplifying
- Finding Bottlenecks by Measuring
- Copying Rows of Data
- Improving Transformation Performance
- Using Lazy Conversion for Reading Text Files
- Single-File Parallel Reading, Multi-File Parallel Reading
- Configuring the NIO Block Size
- Changing Disks and Reading Text Files
- Improving Performance in Writing Text Files
- Using Lazy Conversion for Writing Text Files
- Parallel Files Writing
- Improving Database Performance
- Avoiding Dynamic SQL, Handling Roundtrips
- Handling Relational Databases
- Sorting Data, Sorting on the Database, Sorting in Parallel
- Reducing CPU Usage
- Optimizing the Use of JavaScript
- Launching Multiple Copies of a Step
- Selecting and Removing Values
- Managing Thread Priorities
- Adding Static Data to Rows of Data
- Limiting the Number of Step Copies
- Avoiding Excessive Logging
- Improving Job Performance
- Loops in Jobs
- Database Connection Pools
- Parallelization, Clustering, and Partitioning
- Multi-Threading
- Row Distribution, Row Merging, and Row Redistribution
- Data Pipelining
- Consequences of Multi-Threading
- Database Connections
- Order of Execution, Parallel Execution in a Job
- Using Carte as a Slave Server
- The Configuration File
- Defining Slave Servers
- Remote Execution, Monitoring Slave Servers
- Carte Security, Services
- Clustering Transformations
- Defining a Cluster Schema, Designing Clustered Transformations
- Execution and Monitoring
- Metadata Transformations
- Rules
- Data Pipelining, Partitioning
- Defining a Partitioning Schema
- Objectives of Partitioning, Implementing Partitioning
- Internal Variables
- Database Partitions
- Partitioning in a Clustered Transformation
- Bonus Topics
- Working with Repositories
- Pan/Kitchen – Launching Transformations and Jobs from the Command-Line
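
The real-time module includes creating a JMS connection and session. Below is a minimal consumer sketch using the standard javax.jms API with ActiveMQ as an example broker; the broker URL, queue name, and the choice of ActiveMQ itself are assumptions made only for illustration:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Assumed broker URL and queue name; adjust to your environment.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");

        Connection connection = factory.createConnection();
        connection.start();

        // A non-transacted, auto-acknowledged session, as covered in
        // "Creating a JMS Connection and Session".
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("MATCH_RESULTS");
        MessageConsumer consumer = session.createConsumer(queue);

        // Block for up to five seconds waiting for one message.
        TextMessage message = (TextMessage) consumer.receive(5000);
        if (message != null) {
            System.out.println("Received: " + message.getText());
        }

        session.close();
        connection.close();
    }
}
```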