Overview
The Scalable Machine Learning course was designed and developed to give students practical exposure to scalable machine learning (SML). It focuses on using the Hadoop and Spark frameworks to implement SML algorithms in the Scala and Python programming languages.
The course begins with an introduction to SML and why developers use Spark for it. It then goes deeper into data acquisition, data pre-processing for modeling, and working with iterative algorithms, and ends with model evaluation, optimization, and deployment.
Objectives
After completing this Scalable Machine Learning course, you will be able to:
- Describe the role of Spark in machine learning
- Apply machine learning to large datasets
- Demonstrate experience with data acquisition, processing, analysis, and modeling using Hadoop and Spark
- Evaluate several common data types (e.g., CSV, XML, JSON, and social media data) for pre-processing and/or building machine learning models with Spark
- Train, test, and deploy machine learning models
Materials
Portuguese/English + Exercises + Hands-on Lab
Course Outline
Introduction to SML
- What is SML?
- Why is it required?
- Key platforms for performing SML
- SML Project End-to-End Pipeline
- Spark Introduction
- Why Spark for SML?
- Databricks Platform Demo
- Approaches for scaling scikit-learn code
- Hands-on Exercise(s): Experiencing the first notebook (a starter sketch follows)
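As a taste of this first exercise, here is a minimal sketch of an opening PySpark notebook cell: it starts a SparkSession and inspects a small in-memory DataFrame. The application name and sample data are illustrative assumptions, not the course's actual notebook.

```python
# Minimal "first notebook" sketch: start a SparkSession and look at a toy DataFrame.
# The app name and sample rows are illustrative, not from the course materials.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sml-first-notebook").getOrCreate()

df = spark.createDataFrame(
    [(1, "spark", 4.5), (2, "hadoop", 3.8), (3, "kafka", 4.1)],
    ["id", "tool", "rating"],
)
df.printSchema()
df.show()

spark.stop()
```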
Why Spark for SML?
- Problems with Traditional Machine Learning Frameworks
- Machine Learning at Scale – Various options
- Iterative Algorithms
- How Spark Performs Well for Iterative Machine Learning Algorithms (see the sketch after this list)
- Hands-on Exercise(s)
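To make the iterative-algorithms point concrete, the sketch below caches a toy dataset in memory once, so every pass of a simple gradient-descent loop rescans it without recomputation; this is the property that makes Spark attractive for iterative machine learning. The data, step size, and iteration count are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# Toy (x, y) points lying near y = 2x; the data is illustrative.
points = sc.parallelize([(float(x), 2.0 * x + 0.1) for x in range(1000)]).cache()
n = points.count()  # this first action also materializes the in-memory cache

w = 0.0  # single weight, fit by batch gradient descent on squared error
for _ in range(10):
    # Each pass re-scans the cached RDD instead of recomputing it from the source.
    grad = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum() / n
    w -= 2e-6 * grad  # small step size chosen for stability on this toy data

print("fitted weight:", w)  # converges toward 2.0
spark.stop()
```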
SML on Enterprise Platform
- Quick Recap/Introduction to Hadoop
- Logical View of Cloudera Distribution
- Big Data Analytics Pipelines
- Components in Cloudera Distribution for performing SML
- Hands-on Exercise(s)
Data Acquisition at Scale
- Acquiring Structured content from Relational Databases
- Acquiring Semi-structured content from Log Files
- Acquiring Unstructured content from other key sources, such as the Web
- Tools for Performing Data Acquisition at Scale
- Sqoop, Flume, and Kafka: introduction, use cases, and architectures
- Hands-on Exercise(s) (a JDBC ingestion sketch follows)
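As one hedged example of structured-data acquisition, the sketch below pulls a relational table into Spark in parallel through the built-in JDBC data source (the Sqoop, Flume, and Kafka flows are covered separately in class). The URL, table, credentials, and partition bounds are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-acquisition").getOrCreate()

# Parallel read of a relational table via Spark's JDBC data source.
# Requires the matching JDBC driver on the Spark classpath; all connection
# details below are hypothetical placeholders.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")
    .option("dbtable", "customers")
    .option("user", "etl_user")
    .option("password", "secret")
    .option("partitionColumn", "customer_id")  # numeric column to split reads on
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)
customers.show(5)
```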
Data Pre-Processing for Modeling
- Using the Spark Shell
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
- RDD Operations
- Key-Value Pair RDDs
- MapReduce and Pair RDD Operations
- Building and Running a Spark Application
- Performing Data Validation
- Data De-Duplication
- Detecting Outliers
- Hands-on Exercise(s) (a pre-processing sketch follows)
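A minimal sketch of three of the steps above on illustrative in-memory data: a key-value pair RDD aggregation (the MapReduce pattern), de-duplication with distinct(), and a simple standard-deviation outlier rule. The 2-sigma cutoff is an assumed choice, not a course-mandated rule.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocessing-demo").getOrCreate()
sc = spark.sparkContext

# Key-value pair RDD: count events per user, MapReduce-style.
events = sc.parallelize([("alice", 1), ("bob", 1), ("alice", 1), ("alice", 1)])
print(events.reduceByKey(lambda a, b: a + b).collect())

# De-duplication: distinct() drops exact duplicate records.
readings = sc.parallelize([10.0, 10.0, 12.0, 11.5, 9.8, 10.7, 11.2, 95.0])
deduped = readings.distinct()

# Outlier detection: flag values more than 2 standard deviations from the mean
# (an illustrative rule; real pipelines tune this threshold).
stats = deduped.stats()
mu, sigma = stats.mean(), stats.stdev()
print(deduped.filter(lambda x: abs(x - mu) > 2 * sigma).collect())

spark.stop()
```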
Working with Iterative Algorithms
- Dealing with RDD Infinite Lineages
- Caching Overview
- Distributed Persistence
- Checkpointing of an Iterative Machine Learning Algorithm
- Hands-on Exercise(s) (a checkpointing sketch follows)
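The sketch below shows the caching-plus-checkpointing pattern this module teaches: an iterative loop keeps extending the RDD lineage, and a periodic checkpoint truncates it by persisting to reliable storage. The checkpoint directory, loop body, and interval are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path; use HDFS in production

rdd = sc.parallelize(range(1000))
for i in range(20):
    rdd = rdd.map(lambda x: x + 1)  # each transformation extends the lineage graph
    if i % 5 == 0:
        rdd.cache()       # keep the current data in memory for the next passes
        rdd.checkpoint()  # truncate the lineage by persisting to reliable storage
        rdd.count()       # an action forces the checkpoint to actually run

print(rdd.take(5))
spark.stop()
```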
Spark SQL
- Introduction
- DataFrame API
- Performing ad-hoc query analysis using Spark SQL
- Hands-on Exercise(s) (a DataFrame/SQL sketch follows)
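A short sketch of the same ad-hoc aggregation done twice, once with the DataFrame API and once with Spark SQL over a temporary view; the data and view name are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("north", 100.0), ("south", 80.0), ("north", 120.0)],
    ["region", "amount"],
)

# DataFrame API version of the aggregation.
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# Equivalent ad-hoc SQL over a temporary view.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```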
Spark Machine Learning Using MLlib
- Spark ML vs. Spark MLlib
- Data types and key terms
- Feature Extraction
- Linear Regression using Spark MLlib
- Hands-on Exercise(s) (a regression sketch follows)
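A hedged sketch of linear regression with the RDD-based spark.mllib API this module targets. Note that spark.mllib has been deprecated in favor of spark.ml since Spark 2.0, so availability depends on your Spark version; the toy data and SGD settings are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

spark = SparkSession.builder.appName("mllib-linreg").getOrCreate()
sc = spark.sparkContext

# Toy points near y = 2x; LabeledPoint is the classic MLlib data type.
data = sc.parallelize([LabeledPoint(2.0 * x + 0.5, [float(x)]) for x in range(100)])

# RDD-based API, deprecated since Spark 2.0 but shown here because the module
# covers MLlib; the step size is an illustrative choice for this toy data.
model = LinearRegressionWithSGD.train(data, iterations=100, step=0.0001)
print("weights:", model.weights, "intercept:", model.intercept)
print("prediction for x=10:", model.predict([10.0]))

spark.stop()
```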
Spark Machine Learning Using ML
- Spark ML Overview
- Transformers and Estimators
- Pipelines
- Implementing Decision Trees
- K-Means Clustering using Spark ML
- Hands-on Exercise(s) (a pipeline sketch follows)
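A minimal pipeline sketch that chains a transformer (VectorAssembler) with an estimator (DecisionTreeClassifier), as described above; the toy dataset and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("spark-ml-pipeline").getOrCreate()

# Toy labeled data; columns and values are illustrative.
train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 10.0, 1.0), (0.5, 2.0, 0.0), (2.0, 12.0, 1.0)],
    ["f1", "f2", "label"],
)

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")  # transformer
tree = DecisionTreeClassifier(featuresCol="features", labelCol="label")    # estimator

# The Pipeline chains the stages; fit() returns a PipelineModel.
model = Pipeline(stages=[assembler, tree]).fit(train)
model.transform(train).select("features", "label", "prediction").show()

spark.stop()
```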
Decision Trees and Random Forest
- Types – Classification and Regression trees
- Gini Index, Entropy and Information Gain
- Building Decision Trees
- Pruning the trees
- Prediction using Trees
- Ensemble Models
- Bagging and Boosting
- Advantages of using Random Forest
- Working with Random Forest
- Ensemble Learning
- How ensemble learning works
- Building models using Bagging
- Random Forest algorithm
- Random Forest model building
- Fine-tuning hyperparameters
- Hands-on Exercise(s) (a tuning sketch follows)
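A hedged sketch of random forest training with hyperparameter fine-tuning via cross-validation; the parameter grid, fold count, and toy data are illustrative choices rather than course-prescribed settings.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("rf-tuning").getOrCreate()

# Toy labeled data with a pre-assembled features vector; illustrative only.
data = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.0]), 0.0),
     (Vectors.dense([10.0, 1.0]), 1.0),
     (Vectors.dense([2.0, 0.5]), 0.0),
     (Vectors.dense([12.0, 1.5]), 1.0)] * 5,  # repeat rows so CV folds are non-empty
    ["features", "label"],
)

rf = RandomForestClassifier(featuresCol="features", labelCol="label")

# Small illustrative grid over two hyperparameters.
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [10, 50])
        .addGrid(rf.maxDepth, [3, 5])
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="label"),
                    numFolds=3)
best = cv.fit(data).bestModel
print("numTrees:", best.getOrDefault("numTrees"),
      "maxDepth:", best.getOrDefault("maxDepth"))

spark.stop()
```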
Model Evaluation, Optimization and Deployment
- Model Evaluation
- Optimizing a Model
- Deploying a Model (a save/load sketch follows)
- Best Practices
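To tie the module together, here is a hedged end-to-end sketch: fit a pipeline, evaluate it on held-out data, persist the fitted model, and reload it as a serving job would. The paths, metric, and data are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("eval-deploy").getOrCreate()

# Toy data; columns and values are illustrative.
df = spark.createDataFrame(
    [(1.0, 0.0, 0.0), (9.0, 1.0, 1.0), (2.0, 0.2, 0.0), (8.0, 0.9, 1.0)] * 10,
    ["f1", "f2", "label"],
)
train, test = df.randomSplit([0.8, 0.2], seed=42)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train)

# Evaluation: area under ROC on held-out data.
auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(model.transform(test))
print("test AUC:", auc)

# Deployment step 1: persist the fitted model (path is an illustrative placeholder).
model.write().overwrite().save("/tmp/models/demo-pipeline")

# Deployment step 2: a separate serving job reloads and scores new data.
served = PipelineModel.load("/tmp/models/demo-pipeline")
served.transform(test).select("features", "prediction").show()

spark.stop()
```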