Big Data Science

Course Overview

Opportunities for data scientists, one of today’s most in-demand roles, are growing rapidly in response to the exponentially increasing amounts of data being captured and analyzed. Companies hire data scientists to find insights and to solve meaningful business problems. Data science has become an integral part of making well-informed decisions in business, finance, operations, marketing, technology and many other areas.

Through this course, trainees will learn the concepts, tools and techniques needed to begin practising data science. The course balances theory and practice, and key concepts are taught using case studies. Upon completion, trainees will be able to perform basic data handling tasks, collect and analyze data, and present the results using industry-standard tools.

Course Objectives

Upon completion of this course, trainees will be able to:

  • Identify an appropriate model for different data types
  • Create a data processing and analysis workflow
  • Define and explain the key concepts and models relevant to data science
  • Differentiate the key stages of the data ETL process, from cleaning and processing to visualization
  • Implement algorithms to extract information from a dataset
  • Apply best practices in data science and become familiar with standard tools

Training Methodology

  • Business Understanding
  • Analytic Approach
  • Data Requirements
  • Data Collection
  • Data Understanding
  • Data Preparation
  • Modelling
  • Evaluation
  • Deployment
  • Feedback

Course Content

Day 1

Module 1: Introduction To Data Science

  • What is data?
  • Types of data
  • What is data science?
  • Statistical thinking
  • Extract, Transform and Load (ETL)
  • Data cleansing
  • Aggregation, filtering, sorting, joining
  • Data workflow

Module 2: Data Quality

  • Raw vs Tidy data
  • Key features of data quality
  • Maintenance of data quality
  • Data profiling
  • Data completeness and consistency

Module 3: Life of a Data Scientist

  • Identify problem
  • Define question
  • Define ideal dataset
  • Obtain data
  • Analyse data
  • Interpret results
  • Distribute results

Module 4: Beginning Databases

  • Types of databases
  • Relational databases
  • NoSQL
  • Hybrid database

Day 2

Module 5: Structured Query Language (SQL)

  • Performing CRUD (create, retrieve, update, delete)
  • Designing a real world database
  • Normalizing a table

Module 6: NoSQL Searching And Querying

  • Modeling KV stores
  • Modeling column-family stores
  • Modeling graph DBs
  • Solr/Elasticsearch/Lucene search

Module 7: Data Gathering

  • Obtain data from online repositories
  • Import data from local files (.json, .xml)
  • Import data using a web API
  • Scrape websites for data

Module 8: Exploratory Data Analysis

  • What is EDA?
  • Goals of EDA
  • The role of graphics
  • Handling outliers
  • Dimension reduction

Day 3

Module 9: Introduction To Text Mining

  • What is text mining? Natural language processing
  • Pre-processing text data
  • Extracting features from documents
  • Using Beautiful Soup
  • Measuring document similarity

Module 10: Supervised Learning

  • What is prediction?
  • Sampling, training set, testing set
  • Constructing a decision tree

Module 11: Presenting Data

  • Choosing the right visualization

Module 12: Big Data Landscape

  • What is Small Data?
  • What is Big Data?
  • Big Data Analytics vs Data Science
  • Key Elements in Big Data (3Vs)
  • Extracting Value from Big Data
  • Challenges in Big Data

Module 13: Big Data Tools And Applications

  • Introducing Hadoop and Spark Ecosystem
  • Cloudera vs Hortonworks
  • Real World Big Data Applications