IBM Data Science Certificate

01 Defining Data Science and What Data Scientist Do

Defining Data Science

  • What is Data Science?
  • Fundamentals of Data Science
  • The Many Paths to Data Science
  • Advice for New Data Scientist What do Data Scientists Do
  • A day in the life of a Data Scientist
  • Old problems, new problems, Data Science solutions
  • Data Science Topics and Algorithms
  • Cloud for Data Science

Big Data and Data Mining

  • Foundations of Big Data
    • The V’s of Big Data: Velocity, Volume, Variety, Veracity (quality and origin), and Value
  • What is Hadoop?
  • How Big Data is Driving Digital Transformation
  • Data Science Skills & Big Data
  • Data Scientists at New York University

Deep Learning and Machine Learning

  • What’s the difference?
    • Big Data, Data Mining, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks
  • Neural Networks and Deep Learning?
  • Applications of Machine Learning
    • Recommender Systems, Classification, Cluster Analysis, Market Basket Analysis, Predictive Analysis, Bayesian Analysis
    • Fintech: recommendations
    • Retail: fraud detection Hands on Exercise: Data Science and Exploration Data Science in Business
  • How Data Science is saving lives
  • How Should Companies Get Started in Data Science
  • Applications of Data Science Careers and Recruiting in Data Science
  • How Can Someone Become a Data Scientist?
  • Recruiting for Data Science
  • Careers in Data Science
  • High School Students and Data Science Careers

IBM Skills Network Labs Apache Spark Hadoop Clusters IPython, Jupyter, Decision Sciences @ Stern Precision vs Recall, Oversampling

02 Tools for Data Science

Module 1: Overview of Data Science Tools

Data Science Tools

  • Categories of Data Science Tools
    • Data Science Tasks: Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, Model Monitoring and Assessment
    • Code Asset Management
    • Data Asset Management (DAM platforms)
    • Development Environments
  • Open Source Tools for Data Science - Part 1
    • Data Management - MySQL, PostgreSQL, Hadoop File System
    • Data Integration and Transformation - Apache AirFlow, KubeFlow, Apache Kafka, NodeRED
    • Data Visualization - PixieDuset, Apache Superset, Hue, Kibana
    • Model Deployment - Apache PredictionIO, Seldon, mleap, Tensorflow Serving
    • Model Monitoring - ModelDB, Prometheus, AI Fairness 360, IBM Adversarial Robustness, AI Explainability
    • Data Asset Management - ApacheAtlas, ODPiEgeria, Kylo
  • Open Source Tools for Data Science - Part 2
    • jupyter jupyter lab
    • RStudio, Spyder
    • Apache Spark - cluster execution environment, Flink
    • Ray - enables large scale deep learning training
    • Knime, Orange
  • Commercial Tools for Data Science
    • Data Management - Oracle Database, Mcrosoft SQL server, IBM DB2
    • Data Integration and Transformation - Informatica, IBM InfoSphere datastage, Waston Studio Desktop
    • Data Visualization - tableau, Microsoft Power bi, IBM Cognos Analytics, Watson Studio Desktop
    • Data Mining - SPSS Modeler
    • Model Buiding & Deployment - SPSS Modeler
    • Data Asset Management - Informatica, IBM
    • Waston Studio, Waston Open Source cover entire data science lifecycle!
  • Cloud Based Tools for Data Science

Module 2: Languages of Data Science

  • Languages of Data Science

    • Python, R, SQL, Scala, Java, C++, and Julia
  • Introduction to Python

  • Introduction to R Language

  • Introduction to SQL

  • Other Languages for Data Science

  • Python, Scala, Java, R, Julia, C++

Module 3: Libraries, APIs, Datasets, and Models

  • Libraries for Data Science
    • Python Libraries
      • Scientific Computing - Pandas, Numpy
      • Visualization - Matplotlib, Seaborn
      • Machine and Deep Learning - Scikit-learn, Keras, Tensorflow, PyTorch
    • Apache Spark - cluster computing can be done in Python, R
    • Scala
    • R - ggplot2, can also interface with tensorflow,
  • Application Programming Interfaces (APIs)
  • Data Sets - Powering Data Science
  • Sharing Enterprise Data - Data Asset eXchange
  • Machine Learning Models - Learning from Models to Make Predictions

Module 4: Jupyter Project

Module 5: RStudio and Github

  • Introduction to R and RStudio
    • Popular R Libraries for Data Science
      • dplyr data manipulation
      • stringr string manipulation
      • ggplot data visualization
      • Lattice complex, multi-variable data sets
      • Leaflet interactive plot
      • caret machine learning
  • Optional Reading: Download & Install R and RStudio
  • R Basics with RStudio
  • Plotting in RStudio
  • Getting started with RStudio and Installing packages
  • Creating Data Visualizations using ggplot
  • Plotting with RStudio
  • Overview of Git/Github
  • Introduction to Github
  • GitHub Repositories
  • GitHub - Getting Started
  • Hands-on Lab: Getting Started with GitHub
  • GitHub - Working with Branches
  • Getting Started with Branches using Git Commands
  • **Hands-On Lab: Branching and Merging (Web UI)
  • GitHub Branches

Module 6: Final Project - Create Jupyter Notebook

Module 7: (OPTIONAL) IBM Tools for Data Science

  • Introduction to Waston Studio

03 Data Science Methodology

Data science methodology can be represented with a total of 10 stages, however, it is cyclic and an iterative process!

  1. Business Understanding
  2. Analytic Approach - identifying what type of patterns will be needed to address the question most effectively
  3. Data Requirements - identify the necessary data contents, formats, and sources for initial classification
  4. Data Collection - techniques to assess the content, quality, and initial insights about the data
  5. Data Understanding - is the data you collected representative of the problem to be solved?
  6. Data Preparation - what are the ways in which data is prepared?
  7. Modeling - in what way can the data be visualized to get the answer that is required?
  8. Evaluation - does the model used really answer the initial question or does it need to be adjusted?
  9. Deployment
  10. Feedback

Imagine the case study:

What is hospital readmissions? A situation where you were discharged from the hospital and wind up going back in the same or related care within 30 days.

Modeling

  1. The positive non-readmission
  2. The negative is readmission

TP. Person is correctly classified for non-readmission. FP. Person is incorrectly classified for non-readmission. TN. Person is correctly classified for readmission. FN. Person is incorrectly classified for readmission

If FP occurs, resources are wasted, cost is high. If FN occurs, those that need help do not get it.

04 Python for Data Science, AI & Development

As a next step, you can take the appropriate follow-on Python Project from the list below to apply your new found skills in a real-world scenario.

Note: Successful completion of this course is a pre-requisite for these Python Project courses. You can explore the courses below to further hone and develop your skills for working with Data and Python:

05 Python Project for Data Science

The Problem How can we help Tom determine the best price for his car?

Open dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names

Data Wrangling

  1. Dealing with missing values
    1. check with data collection source
    2. drop the variable, data entry
    3. replace the missing values, with average, frequency, etc
    4. leave it as missing data
  2. Data formatting
  3. Data normalization
      1. simple feature calling
    1. min-max
    2. z-score
  4. Binning
    1. group values into “bins”
    2. converts numeric into categorical variables
    3. group a set of numerical values into a set of bins

Exploratory Data Analysis

It is a preliminary step in data analysis to

  • summarize main characteristics
  • gain better understanding of the data set
  • uncover relationships between variables
  • extract important variables

06 Databases and SQL for Data Science with Python

Learn the basics of sql and relational database model

What is SQL? A language used for relational database.

What is a Relational Database?

DBMS?

DDL? Data definition language are used to define, change, or drop database object. Including CREATE, ALTER, DROP, TRUNCATE DML? Read and modify data. Including INSERT, SELECT, UPDATE, DELETE

Select from table

SELECT * FROM table_name
SELECT col1, col2 FROM table_name
SELECT col1, col2 FROM table_name WHERE col1 = value1
 
SELECT COUNT(*) FROM table_name 
SELECT DISTINCT col1 FROM table_name WHERE col1 = value1
SELECT * FROM table_name LIMIT 1
INSERT INTO table_name 
	(col1, ...)
VALUES
	(val1, ...);
UPDATE table_name
SET col1 = value1, col2 = value2, ...
WHERE condition
DELETE FROM table_name
	WHERE col1 in (val1, val2)

Create table

CREATE TABLE table_name
	(
	col_name datatype optional_parameter
	...
	)

Alter table

ALTER TABLE table_name
	ADD COLUMN col_name datatype;
 
ALTER TABLE table_name CHANGE col_name
	col_name datatype
 
ALTER TABLE table_name
	ALTER COLUMN col_name SET DATATYPE CHAR(20)
 
ALTER TABLE table_name
	DROP COLUMN col_name

Delete table

DROP TABLE table_name;

Delete all rows

TRUNCATE TABLE table_name IMMEDIATE;

String Patterns, Ranges, Sets of Values

... WHERE column LIKE R%
... WHERE pages between ... AND ...
... WHERE country IN (..., ..., ...)

Sorting

... ORDER BY col_name ASC/DESC
... ORDER BY col_num 

Grouping Results Sets

SELECT col_name, COUNT(col_name) FROM table_name GROUP BY col_name AS new_col_name HAVING COUNT(col_name) > X

Built-in Database Functions

  • Aggregate Functions - perform operations on entire cols, AVG, SUM, MIN, MAX
  • Scalar Functions - perform operations on every value: ROUND, LENGTH, UCASE, LCASE
  • Date and Time - special data types for dates and times: YEAR, MONTH, DAY, DAYOFMONTH

Sub-queries and Nested Selects A query inside another query.

  • subqueries to evaluate aggregate functions
  • subqueries in a list of columns
  • subqueries in from clause
SELECT col1 FROM table_name
	WHERE col2 = (SELECT ... FROM table_name)
 

Working with Multiple Tables Access multiple tables in the same query

  1. Sub-queries
  2. Implicit Join
  3. JOIN operators (INNER JOIN, OUTER JOIN, etc)
SELECT column_name [, column_name ]
FROM table1 [, table2 ]
WHERE column_name OPERATOR
   (SELECT column_name [, column_name ]
   FROM table1 [, table2 ]
   WHERE condition);

What is Python DB-API?

What is a Model? A model can be thought of as a mathematical equation used to predict a value given one or more other values. Relating one or more independent variables to dependent variables.

Model Evaluation and Refinement?

08 Data Visualization with Python

**What is data visualization? Why do we build visualizations? What are important types of plots? What are popular plot libraries?

  • matplotlib
  • pandas
  • seaborn
  • folium
  • plotly
  • pywaffle What is Dash-boarding?*
  • Dash

09 Machine Learning with Python

10 Applied Capstone

IBM Resources Mentioned

IBM Waston Open Scale IBM Watson Studio IBM Developer Website

Open Data Sets