Course Outline
Day 01
Overview of Big Data Business Intelligence for Criminal Intelligence Analysis
- Case Studies from Law Enforcement - Predictive Policing
- Big Data adoption rates in law enforcement agencies, and how agencies are aligning their future operations around Big Data predictive analytics
- Emerging technology solutions such as gunshot sensors, surveillance video and social media
- Using Big Data technology to mitigate information overload
- Interfacing Big Data with Legacy data
- Basic understanding of enabling technologies in predictive analytics
- Data Integration & Dashboard visualization
- Fraud management
- Business Rules and Fraud detection
- Threat detection and profiling
- Cost benefit analysis for Big Data implementation
Varieties of Data: Introduction to Data Cleaning issues in Big Data
- RDBMS – static structure/schema; does not promote an agile, exploratory environment
- NoSQL – semi-structured; enough structure to store data without defining an exact schema first
- Data cleaning issues
Hadoop
- When to select Hadoop?
- STRUCTURED - Enterprise data warehouses/databases can store massive data (at a cost) but impose structure (not good for active exploration)
- SEMI-STRUCTURED data – difficult to handle with traditional solutions (DW/DB)
- Warehousing data = HUGE effort and static even after implementation
- For variety & volume of data, crunched on commodity hardware – HADOOP
- Commodity H/W needed to create a Hadoop Cluster
Introduction to MapReduce/HDFS
- MapReduce – distributes computation across multiple servers
- HDFS – makes data available locally to the computing process (with redundancy)
- Data – can be unstructured/schema-less (unlike RDBMS)
- Developer responsibility to make sense of data
- Programming MapReduce = working with Java (pros/cons), manually loading data into HDFS
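A minimal sketch of the map, shuffle and reduce steps named above, written in plain Python rather than Hadoop's Java API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are illustrative assumptions, and everything runs in a single process; on a real cluster the framework would distribute the map and reduce steps across servers and read the input from HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical incident-log lines, for illustration only.
    sample = ["burglary reported downtown", "vehicle theft reported", "burglary suspect"]
    print(word_count(sample))
```

The mapper and reducer hold no shared state, which is what allows a framework to run many copies of each in parallel.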
Day 02
Big Data Ecosystem -- Building Big Data ETL (Extract, Transform, Load) -- Which Big Data Tools to use and when?
- Hadoop vs. Other NoSQL solutions
- For interactive, random access to data
- HBase (column-oriented database) on top of Hadoop
- Random access to data but restrictions imposed (max 1 PB)
- Not good for ad-hoc analytics, good for logging, counting, time-series
- Sqoop - Import from databases to Hive or HDFS (JDBC/ODBC access)
- Flume – Stream data (e.g. log data) into HDFS
Big Data Management System
- Moving parts, compute nodes start/fail: ZooKeeper – for configuration/coordination/naming services
- Complex pipeline/workflow: Oozie – manage workflow, dependencies, daisy chain
- Deploy, configure, cluster management, upgrade etc. (sys admin): Ambari
- In the cloud: Whirr
Predictive Analytics -- Fundamental Techniques and Machine Learning based Business Intelligence
- Introduction to Machine Learning
- Learning classification techniques
- Bayesian Prediction -- preparing a training file
- Support Vector Machine
- KNN p-Tree Algebra & vertical mining
- Neural Networks
- Big Data large variable problem -- Random forest (RF)
- Big Data Automation problem – Multi-model ensemble RF
- Automation through Soft10-M
- Text analytics tool – Treeminer
- Agile learning
- Agent based learning
- Distributed learning
- Introduction to open source tools for predictive analytics: R, Python, RapidMiner, Mahout
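As one concrete instance of the classification techniques listed above, the following is a minimal sketch of a plain k-nearest-neighbours (KNN) classifier using only the Python standard library. It is the textbook algorithm, not the p-Tree variant named in the list; the function name knn_predict, the two features and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    training: list of (features, label) pairs, where features is a tuple of numbers.
    """
    # Sort training points by Euclidean distance to the query.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical training data: (feature_1, feature_2) -> risk label.
    training = [
        ((0, 0), "low"), ((0, 1), "low"), ((1, 0), "low"),
        ((5, 5), "high"), ((5, 6), "high"), ((6, 5), "high"),
    ]
    print(knn_predict(training, (0.5, 0.5)))
    print(knn_predict(training, (5.5, 5.5)))
```

Every prediction scans the whole training set, which is one reason naive KNN becomes expensive on large data.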
Predictive Analytics Ecosystem and its application in Criminal Intelligence Analysis
- Technology and the investigative process
- Insight analytics
- Visualization analytics
- Structured predictive analytics
- Unstructured predictive analytics
- Threat/fraudster/vendor profiling
- Recommendation Engine
- Pattern detection
- Rule/Scenario discovery – failure, fraud, optimization
- Root cause discovery
- Sentiment analysis
- CRM analytics
- Network analytics
- Text analytics for obtaining insights from transcripts, witness statements, internet chatter, etc.
- Technology assisted review
- Fraud analytics
- Real-time analytics
Day 03
Real Time and Scalable Analytics Over Hadoop
- Why common analytic algorithms fail in Hadoop/HDFS
- Apache Hama – for Bulk Synchronous Parallel (BSP) distributed computing
- Apache Spark – for cluster computing and real-time analytics
- CMU GraphLab 2 – graph-based asynchronous approach to distributed computing
- KNN p-Tree -- algebra-based approach from Treeminer for reduced hardware cost of operation
Tools for eDiscovery and Forensics
- eDiscovery over Big Data vs. Legacy data – a comparison of cost and performance
- Predictive coding and Technology Assisted Review (TAR)
- Live demo of vMiner for understanding how TAR enables faster discovery
- Faster indexing through HDFS – Velocity of data
- NLP (Natural Language Processing) – open source products and techniques
- eDiscovery in foreign languages -- technology for foreign language processing
Big Data BI for Cyber Security – Getting a 360-degree view, speedy data collection and threat identification
- Understanding the basics of security analytics -- attack surface, security misconfiguration, host defenses
- Network infrastructure / large data pipe / response ETL for real-time analytics
- Prescriptive vs. predictive – fixed rule-based vs. auto-discovery of threat rules from metadata
Gathering disparate data for Criminal Intelligence Analysis
- Using IoT (Internet of Things) as sensors for capturing data
- Using Satellite Imagery for Domestic Surveillance
- Using surveillance and image data for criminal identification
- Other data gathering technologies -- drones, body cameras, GPS tagging systems and thermal imaging technology
- Combining automated data retrieval with data obtained from informants, interrogation, and research
- Forecasting criminal activity
Day 04
Fraud prevention BI from Big Data in Fraud Analytics
- Basic classification of Fraud Analytics -- rules-based vs predictive analytics
- Supervised vs unsupervised Machine learning for Fraud pattern detection
- Business to business fraud, medical claims fraud, insurance fraud, tax evasion and money laundering
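The contrast between rules-based and data-driven detection can be sketched in a few lines. This is a minimal illustration, assuming a single numeric feature (transaction amount): a fixed threshold stands in for a business rule, and a z-score outlier test stands in for an unsupervised method. The function names, the limit and the amounts are all hypothetical, and a production system would use far richer features and models.

```python
from statistics import mean, pstdev


def rule_based_flags(amounts, limit):
    """Rules-based: flag any transaction above a fixed, hand-chosen limit."""
    return [amount > limit for amount in amounts]


def zscore_flags(amounts, threshold=2.0):
    """Unsupervised: flag amounts more than `threshold` standard deviations from the mean."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        # All amounts are identical, so nothing stands out.
        return [False] * len(amounts)
    return [abs(amount - mu) / sigma > threshold for amount in amounts]


if __name__ == "__main__":
    # Hypothetical claim amounts; the last one is unusually large.
    amounts = [100, 110, 95, 105, 98, 102, 5000]
    print(rule_based_flags(amounts, limit=1000))
    print(zscore_flags(amounts))
```

The rule needs someone to choose the limit in advance, while the z-score test derives its cut-off from the data itself; a supervised approach would instead learn from transactions already labelled as fraudulent or legitimate.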
Social Media Analytics -- Intelligence gathering and analysis
- How Social Media is used by criminals to organize, recruit and plan
- Big Data ETL API for extracting social media data
- Text, image, metadata and video
- Sentiment analysis from social media feed
- Contextual and non-contextual filtering of social media feed
- Social Media Dashboard to integrate diverse social media
- Automated profiling of social media profiles
- Live demo of each analytic using the Treeminer tool
Big Data Analytics in image processing and video feeds
- Image Storage techniques in Big Data -- Storage solution for data exceeding petabytes
- LTFS (Linear Tape File System) and LTO (Linear Tape Open)
- GPFS-LTFS (General Parallel File System - Linear Tape File System) -- layered storage solution for Big image data
- Fundamentals of image analytics
- Object recognition
- Image segmentation
- Motion tracking
- 3-D image reconstruction
Biometrics, DNA and Next Generation Identification Programs
- Beyond fingerprinting and facial recognition
- Speech recognition, keystroke dynamics (analyzing a user's typing pattern) and CODIS (Combined DNA Index System)
- Beyond DNA matching: using forensic DNA phenotyping to construct a face from DNA samples
Big Data Dashboard for quick accessibility and display of diverse data
- Integration of existing application platform with Big Data Dashboard
- Big Data management
- Case Study of Big Data Dashboard: Tableau and Pentaho
- Using Big Data apps to deliver location-based services in government
- Tracking system and management
Day 05
How to justify Big Data BI implementation within an organization
- Defining the ROI (Return on Investment) for implementing Big Data
- Case studies for saving Analyst Time in collection and preparation of Data – increasing productivity
- Savings from lower database licensing costs
- Revenue gain from location-based services
- Cost savings from fraud prevention
- An integrated spreadsheet approach for calculating approximate expenses vs. revenue gain/savings from Big Data implementation
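The spreadsheet-style comparison above reduces to the standard formula ROI = (benefit - cost) / cost. The sketch below applies it to line items that mirror the bullets in this section; every figure is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def roi(total_benefit, total_cost):
    """Return ROI as a fraction: (benefit - cost) / cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost


# Hypothetical annual figures, for illustration only.
benefits = {
    "analyst_time_saved": 400_000,
    "lower_database_licensing": 150_000,
    "location_based_services": 100_000,
    "fraud_prevention": 350_000,
}
costs = {
    "hardware_and_cloud": 300_000,
    "software_and_support": 200_000,
    "staff_and_training": 300_000,
}

total_benefit = sum(benefits.values())  # 1,000,000
total_cost = sum(costs.values())        # 800,000

if __name__ == "__main__":
    print(f"ROI: {roi(total_benefit, total_cost):.1%}")  # prints "ROI: 25.0%"
```

Keeping each benefit and cost as a named line item, as a spreadsheet would, makes it easy to see which assumption drives the result.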
Step-by-step procedure for replacing a legacy data system with a Big Data system
- Big Data Migration Roadmap
- What critical information is needed before architecting a Big Data system?
- What are the different ways of calculating the Volume, Velocity, Variety and Veracity of data?
- How to estimate data growth
- Case studies
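One simple way to approach the data growth estimate mentioned above is a compound-growth projection: volume after n years = current volume × (1 + annual growth rate)^n. The sketch below assumes a constant annual rate, which is a simplification; the function names and the figures are hypothetical.

```python
def projected_volume(current_tb, annual_growth_rate, years):
    """Project data volume (in TB) after `years` of constant compound growth."""
    return current_tb * (1 + annual_growth_rate) ** years


def years_until(current_tb, annual_growth_rate, capacity_tb):
    """Count whole years until the projected volume first exceeds `capacity_tb`."""
    if current_tb <= 0 or annual_growth_rate <= 0:
        raise ValueError("current_tb and annual_growth_rate must be positive")
    years = 0
    volume = current_tb
    while volume <= capacity_tb:
        volume *= 1 + annual_growth_rate
        years += 1
    return years


if __name__ == "__main__":
    # Hypothetical: 100 TB today, growing 50% per year, 300 TB of planned capacity.
    print(projected_volume(100, 0.5, 2))  # 100 -> 150 -> 225 TB
    print(years_until(100, 0.5, 300))     # capacity is exceeded in year 3
```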
Review of Big Data vendors and their products
- Accenture
- Aptean (formerly CDC Software)
- Cisco Systems
- Cloudera
- Dell
- EMC
- GoodData Corporation
- Guavus
- Hitachi Data Systems
- Hortonworks
- HP
- IBM
- Informatica
- Intel
- Jaspersoft
- Microsoft
- MongoDB (formerly 10gen)
- Mu Sigma
- NetApp
- Opera Solutions
- Oracle
- Pentaho
- Platfora
- QlikTech
- Quantum
- Rackspace
- Revolution Analytics
- Salesforce
- SAP
- SAS Institute
- Sisense
- Software AG/Terracotta
- Soft10 Automation
- Splunk
- Sqrrl
- Supermicro
- Tableau Software
- Teradata
- Think Big Analytics
- Tidemark Systems
- Treeminer
- VMware (Part of EMC)
Q/A session