IBM Data Science Certificate

01 Defining Data Science and What Data Scientist Do

Defining Data Science

What is Data Science?
Fundamentals of Data Science
The Many Paths to Data Science
Advice for New Data Scientist What do Data Scientists Do
A day in the life of a Data Scientist
Old problems, new problems, Data Science solutions
Data Science Topics and Algorithms
Cloud for Data Science

Big Data and Data Mining

Foundations of Big Data
- The V’s of Big Data: Velocity, Volume, Variety, Veracity (quality and origin), and Value
What is Hadoop?
How Big Data is Driving Digital Transformation
Data Science Skills & Big Data
Data Scientists at New York University

Deep Learning and Machine Learning

What’s the difference?
- Big Data, Data Mining, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks
Neural Networks and Deep Learning?
Applications of Machine Learning
- Recommender Systems, Classification, Cluster Analysis, Market Basket Analysis, Predictive Analysis, Bayesian Analysis
- Fintech: recommendations
- Retail: fraud detection Hands on Exercise: Data Science and Exploration Data Science in Business
How Data Science is saving lives
How Should Companies Get Started in Data Science
Applications of Data Science Careers and Recruiting in Data Science
How Can Someone Become a Data Scientist?
Recruiting for Data Science
Careers in Data Science
High School Students and Data Science Careers

IBM Skills Network Labs Apache Spark Hadoop → Clusters IPython, Jupyter, Decision Sciences @ Stern Precision vs Recall, Oversampling

02 Tools for Data Science

Module 1: Overview of Data Science Tools

Data Science Tools

Categories of Data Science Tools
- Data Science Tasks: Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, Model Monitoring and Assessment
- Code Asset Management
- Data Asset Management (DAM platforms)
- Development Environments
Open Source Tools for Data Science - Part 1
- Data Management - MySQL, PostgreSQL, Hadoop File System
- Data Integration and Transformation - Apache AirFlow, KubeFlow, Apache Kafka, NodeRED
- Data Visualization - PixieDuset, Apache Superset, Hue, Kibana
- Model Deployment - Apache PredictionIO, Seldon, mleap, Tensorflow Serving
- Model Monitoring - ModelDB, Prometheus, AI Fairness 360, IBM Adversarial Robustness, AI Explainability
- Data Asset Management - ApacheAtlas, ODPiEgeria, Kylo
Open Source Tools for Data Science - Part 2
- jupyter → jupyter lab
- RStudio, Spyder
- Apache Spark - cluster execution environment, Flink
- Ray - enables large scale deep learning training
- Knime, Orange
Commercial Tools for Data Science
- Data Management - Oracle Database, Mcrosoft SQL server, IBM DB2
- Data Integration and Transformation - Informatica, IBM InfoSphere datastage, Waston Studio Desktop
- Data Visualization - tableau, Microsoft Power bi, IBM Cognos Analytics, Watson Studio Desktop
- Data Mining - SPSS Modeler
- Model Buiding & Deployment - SPSS Modeler
- Data Asset Management - Informatica, IBM
- Waston Studio, Waston Open Source cover entire data science lifecycle!
Cloud Based Tools for Data Science

Module 2: Languages of Data Science

Languages of Data Science
- Python, R, SQL, Scala, Java, C++, and Julia
Introduction to Python
Introduction to R Language
Introduction to SQL
Other Languages for Data Science
Python, Scala, Java, R, Julia, C++

Module 3: Libraries, APIs, Datasets, and Models

Libraries for Data Science
- Python Libraries
  - Scientific Computing - Pandas, Numpy
  - Visualization - Matplotlib, Seaborn
  - Machine and Deep Learning - Scikit-learn, Keras, Tensorflow, PyTorch
- Apache Spark - cluster computing can be done in Python, R
- Scala
- R - ggplot2, can also interface with tensorflow,
Application Programming Interfaces (APIs)
Data Sets - Powering Data Science
Sharing Enterprise Data - Data Asset eXchange
Machine Learning Models - Learning from Models to Make Predictions

Module 4: Jupyter Project

Module 5: RStudio and Github

Introduction to R and RStudio
- Popular R Libraries for Data Science
  - dplyr data manipulation
  - stringr string manipulation
  - ggplot data visualization
  - Lattice complex, multi-variable data sets
  - Leaflet interactive plot
  - caret machine learning
Optional Reading: Download & Install R and RStudio
R Basics with RStudio
Plotting in RStudio
Getting started with RStudio and Installing packages
Creating Data Visualizations using ggplot
Plotting with RStudio
Overview of Git/Github
Introduction to Github
GitHub Repositories
GitHub - Getting Started
Hands-on Lab: Getting Started with GitHub
GitHub - Working with Branches
Getting Started with Branches using Git Commands
**Hands-On Lab: Branching and Merging (Web UI)
GitHub Branches

Module 6: Final Project - Create Jupyter Notebook

Module 7: (OPTIONAL) IBM Tools for Data Science

Introduction to Waston Studio

03 Data Science Methodology

Data science methodology can be represented with a total of 10 stages, however, it is cyclic and an iterative process!

Business Understanding
Analytic Approach - identifying what type of patterns will be needed to address the question most effectively
Data Requirements - identify the necessary data contents, formats, and sources for initial classification
Data Collection - techniques to assess the content, quality, and initial insights about the data
Data Understanding - is the data you collected representative of the problem to be solved?
Data Preparation - what are the ways in which data is prepared?
Modeling - in what way can the data be visualized to get the answer that is required?
Evaluation - does the model used really answer the initial question or does it need to be adjusted?
Deployment
Feedback

Imagine the case study:

What is hospital readmissions? A situation where you were discharged from the hospital and wind up going back in the same or related care within 30 days.

Modeling

The positive non-readmission
The negative is readmission

TP. Person is correctly classified for non-readmission. FP. Person is incorrectly classified for non-readmission. TN. Person is correctly classified for readmission. FN. Person is incorrectly classified for readmission

If FP occurs, resources are wasted, cost is high. If FN occurs, those that need help do not get it.

04 Python for Data Science, AI & Development

As a next step, you can take the appropriate follow-on Python Project from the list below to apply your new found skills in a real-world scenario.

Note: Successful completion of this course is a pre-requisite for these Python Project courses. You can explore the courses below to further hone and develop your skills for working with Data and Python:

Databases and SQL for Data Science with Python If you are looking to start a career in Data Science, Data Engineering or AI & Application Development, note that this course is part of the following Professional Certificates which are designed to empower you with the skills to become job-ready in these fields.
IBM Applied AI Professional Certificate
IBM Data Analyst Professional Certificate
IBM Data Science Professional Certificate
IBM Data Engineering Professional Certificate
IBM Full Stack Cloud Developer Professional Certificate

05 Python Project for Data Science

The Problem How can we help Tom determine the best price for his car?

Open dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names

Data Wrangling

Dealing with missing values
1. check with data collection source
2. drop the variable, data entry
3. replace the missing values, with average, frequency, etc
4. leave it as missing data
Data formatting
Data normalization
1. 1. simple feature calling
2. min-max
3. z-score
Binning
1. group values into “bins”
2. converts numeric into categorical variables
3. group a set of numerical values into a set of bins

Exploratory Data Analysis

It is a preliminary step in data analysis to

summarize main characteristics
gain better understanding of the data set
uncover relationships between variables
extract important variables

06 Databases and SQL for Data Science with Python

Learn the basics of sql and relational database model

What is SQL? A language used for relational database.

What is a Relational Database?

DBMS?

DDL? Data definition language are used to define, change, or drop database object. Including CREATE, ALTER, DROP, TRUNCATE DML? Read and modify data. Including INSERT, SELECT, UPDATE, DELETE

Select from table

SELECT * FROM table_name
SELECT col1, col2 FROM table_name
SELECT col1, col2 FROM table_name WHERE col1 = value1
 
SELECT COUNT(*) FROM table_name 
SELECT DISTINCT col1 FROM table_name WHERE col1 = value1
SELECT * FROM table_name LIMIT 1

INSERT INTO table_name 
	(col1, ...)
VALUES
	(val1, ...);

UPDATE table_name
SET col1 = value1, col2 = value2, ...
WHERE condition

DELETE FROM table_name
	WHERE col1 in (val1, val2)

Create table

CREATE TABLE table_name
	(
	col_name datatype optional_parameter
	...
	)

Alter table

ALTER TABLE table_name
	ADD COLUMN col_name datatype;
 
ALTER TABLE table_name CHANGE col_name
	col_name datatype
 
ALTER TABLE table_name
	ALTER COLUMN col_name SET DATATYPE CHAR(20)
 
ALTER TABLE table_name
	DROP COLUMN col_name

Delete table

DROP TABLE table_name;

Delete all rows

TRUNCATE TABLE table_name IMMEDIATE;

String Patterns, Ranges, Sets of Values

... WHERE column LIKE R%
... WHERE pages between ... AND ...
... WHERE country IN (..., ..., ...)

Sorting

... ORDER BY col_name ASC/DESC
... ORDER BY col_num

Grouping Results Sets

SELECT col_name, COUNT(col_name) FROM table_name GROUP BY col_name AS new_col_name HAVING COUNT(col_name) > X

Built-in Database Functions

Aggregate Functions - perform operations on entire cols, AVG, SUM, MIN, MAX
Scalar Functions - perform operations on every value: ROUND, LENGTH, UCASE, LCASE
Date and Time - special data types for dates and times: YEAR, MONTH, DAY, DAYOFMONTH

Sub-queries and Nested Selects A query inside another query.

subqueries to evaluate aggregate functions
subqueries in a list of columns
subqueries in from clause

SELECT col1 FROM table_name
	WHERE col2 = (SELECT ... FROM table_name)

Working with Multiple Tables Access multiple tables in the same query

Sub-queries
Implicit Join
JOIN operators (INNER JOIN, OUTER JOIN, etc)

SELECT column_name [, column_name ]
FROM table1 [, table2 ]
WHERE column_name OPERATOR
   (SELECT column_name [, column_name ]
   FROM table1 [, table2 ]
   WHERE condition);

What is Python DB-API?

What is a Model? A model can be thought of as a mathematical equation used to predict a value given one or more other values. Relating one or more independent variables to dependent variables.

Model Evaluation and Refinement?

08 Data Visualization with Python

**What is data visualization? Why do we build visualizations? What are important types of plots? What are popular plot libraries?

matplotlib
pandas
seaborn
folium
plotly
pywaffle What is Dash-boarding?*
Dash

🧠

Explorer

IBM Data Science Professional Certificate