IBM Data Science Certificate
01 Defining Data Science and What Data Scientist Do
Defining Data Science
- What is Data Science?
- Fundamentals of Data Science
- The Many Paths to Data Science
- Advice for New Data Scientist What do Data Scientists Do
- A day in the life of a Data Scientist
- Old problems, new problems, Data Science solutions
- Data Science Topics and Algorithms
- Cloud for Data Science
Big Data and Data Mining
- Foundations of Big Data
- The V’s of Big Data: Velocity, Volume, Variety, Veracity (quality and origin), and Value
- What is Hadoop?
- How Big Data is Driving Digital Transformation
- Data Science Skills & Big Data
- Data Scientists at New York University
Deep Learning and Machine Learning
- What’s the difference?
- Big Data, Data Mining, Artificial Intelligence, Machine Learning, Deep Learning, Neural Networks
- Neural Networks and Deep Learning?
- Applications of Machine Learning
- Recommender Systems, Classification, Cluster Analysis, Market Basket Analysis, Predictive Analysis, Bayesian Analysis
- Fintech: recommendations
- Retail: fraud detection Hands on Exercise: Data Science and Exploration Data Science in Business
- How Data Science is saving lives
- How Should Companies Get Started in Data Science
- Applications of Data Science Careers and Recruiting in Data Science
- How Can Someone Become a Data Scientist?
- Recruiting for Data Science
- Careers in Data Science
- High School Students and Data Science Careers
IBM Skills Network Labs Apache Spark Hadoop → Clusters IPython, Jupyter, Decision Sciences @ Stern Precision vs Recall, Oversampling
02 Tools for Data Science
Module 1: Overview of Data Science Tools
Data Science Tools
- Categories of Data Science Tools
- Data Science Tasks: Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, Model Monitoring and Assessment
- Code Asset Management
- Data Asset Management (DAM platforms)
- Development Environments
- Open Source Tools for Data Science - Part 1
- Data Management - MySQL, PostgreSQL, Hadoop File System
- Data Integration and Transformation - Apache AirFlow, KubeFlow, Apache Kafka, NodeRED
- Data Visualization - PixieDuset, Apache Superset, Hue, Kibana
- Model Deployment - Apache PredictionIO, Seldon, mleap, Tensorflow Serving
- Model Monitoring - ModelDB, Prometheus, AI Fairness 360, IBM Adversarial Robustness, AI Explainability
- Data Asset Management - ApacheAtlas, ODPiEgeria, Kylo
- Open Source Tools for Data Science - Part 2
- jupyter → jupyter lab
- RStudio, Spyder
- Apache Spark - cluster execution environment, Flink
- Ray - enables large scale deep learning training
- Knime, Orange
- Commercial Tools for Data Science
- Data Management - Oracle Database, Mcrosoft SQL server, IBM DB2
- Data Integration and Transformation - Informatica, IBM InfoSphere datastage, Waston Studio Desktop
- Data Visualization - tableau, Microsoft Power bi, IBM Cognos Analytics, Watson Studio Desktop
- Data Mining - SPSS Modeler
- Model Buiding & Deployment - SPSS Modeler
- Data Asset Management - Informatica, IBM
- Waston Studio, Waston Open Source cover entire data science lifecycle!
- Cloud Based Tools for Data Science
Module 2: Languages of Data Science
-
Languages of Data Science
- Python, R, SQL, Scala, Java, C++, and Julia
-
Introduction to Python
-
Introduction to R Language
-
Introduction to SQL
-
Other Languages for Data Science
-
Python, Scala, Java, R, Julia, C++
Module 3: Libraries, APIs, Datasets, and Models
- Libraries for Data Science
- Python Libraries
- Scientific Computing - Pandas, Numpy
- Visualization - Matplotlib, Seaborn
- Machine and Deep Learning - Scikit-learn, Keras, Tensorflow, PyTorch
- Apache Spark - cluster computing can be done in Python, R
- Scala
- R - ggplot2, can also interface with tensorflow,
- Python Libraries
- Application Programming Interfaces (APIs)
- Data Sets - Powering Data Science
- Sharing Enterprise Data - Data Asset eXchange
- Machine Learning Models - Learning from Models to Make Predictions
Module 4: Jupyter Project
Module 5: RStudio and Github
- Introduction to R and RStudio
- Popular R Libraries for Data Science
dplyr
data manipulationstringr
string manipulationggplot
data visualizationLattice
complex, multi-variable data setsLeaflet
interactive plotcaret
machine learning
- Popular R Libraries for Data Science
- Optional Reading: Download & Install R and RStudio
- R Basics with RStudio
- Plotting in RStudio
- Getting started with RStudio and Installing packages
- Creating Data Visualizations using ggplot
- Plotting with RStudio
- Overview of Git/Github
- Introduction to Github
- GitHub Repositories
- GitHub - Getting Started
- Hands-on Lab: Getting Started with GitHub
- GitHub - Working with Branches
- Getting Started with Branches using Git Commands
- **Hands-On Lab: Branching and Merging (Web UI)
- GitHub Branches
Module 6: Final Project - Create Jupyter Notebook
Module 7: (OPTIONAL) IBM Tools for Data Science
- Introduction to Waston Studio
03 Data Science Methodology
Data science methodology can be represented with a total of 10 stages, however, it is cyclic and an iterative process!
- Business Understanding
- Analytic Approach - identifying what type of patterns will be needed to address the question most effectively
- Data Requirements - identify the necessary data contents, formats, and sources for initial classification
- Data Collection - techniques to assess the content, quality, and initial insights about the data
- Data Understanding - is the data you collected representative of the problem to be solved?
- Data Preparation - what are the ways in which data is prepared?
- Modeling - in what way can the data be visualized to get the answer that is required?
- Evaluation - does the model used really answer the initial question or does it need to be adjusted?
- Deployment
- Feedback
Imagine the case study:
What is hospital readmissions? A situation where you were discharged from the hospital and wind up going back in the same or related care within 30 days.
Modeling
- The positive non-readmission
- The negative is readmission
TP. Person is correctly classified for non-readmission. FP. Person is incorrectly classified for non-readmission. TN. Person is correctly classified for readmission. FN. Person is incorrectly classified for readmission
If FP occurs, resources are wasted, cost is high. If FN occurs, those that need help do not get it.
04 Python for Data Science, AI & Development
As a next step, you can take the appropriate follow-on Python Project from the list below to apply your new found skills in a real-world scenario.
- Python Project for Data Science
- Python Project for Data Engineering
- Python Project for AI & Application Development
Note: Successful completion of this course is a pre-requisite for these Python Project courses. You can explore the courses below to further hone and develop your skills for working with Data and Python:
- Databases and SQL for Data Science with Python If you are looking to start a career in Data Science, Data Engineering or AI & Application Development, note that this course is part of the following Professional Certificates which are designed to empower you with the skills to become job-ready in these fields.
- IBM Applied AI Professional Certificate
- IBM Data Analyst Professional Certificate
- IBM Data Science Professional Certificate
- IBM Data Engineering Professional Certificate
- IBM Full Stack Cloud Developer Professional Certificate
05 Python Project for Data Science
The Problem How can we help Tom determine the best price for his car?
Open dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
Data Wrangling
- Dealing with missing values
- check with data collection source
- drop the variable, data entry
- replace the missing values, with average, frequency, etc
- leave it as missing data
- Data formatting
- Data normalization
-
- simple feature calling
- min-max
- z-score
-
- Binning
- group values into “bins”
- converts numeric into categorical variables
- group a set of numerical values into a set of bins
Exploratory Data Analysis
It is a preliminary step in data analysis to
- summarize main characteristics
- gain better understanding of the data set
- uncover relationships between variables
- extract important variables
06 Databases and SQL for Data Science with Python
Learn the basics of sql and relational database model
What is SQL? A language used for relational database.
What is a Relational Database?
DBMS?
DDL? Data definition language are used to define, change, or drop database object. Including CREATE, ALTER, DROP, TRUNCATE DML? Read and modify data. Including INSERT, SELECT, UPDATE, DELETE
Select from table
Create table
Alter table
Delete table
DROP TABLE table_name;
Delete all rows
TRUNCATE TABLE table_name IMMEDIATE;
String Patterns, Ranges, Sets of Values
... WHERE column LIKE R%
... WHERE pages between ... AND ...
... WHERE country IN (..., ..., ...)
Sorting
Grouping Results Sets
Built-in Database Functions
- Aggregate Functions - perform operations on entire cols, AVG, SUM, MIN, MAX
- Scalar Functions - perform operations on every value: ROUND, LENGTH, UCASE, LCASE
- Date and Time - special data types for dates and times: YEAR, MONTH, DAY, DAYOFMONTH
Sub-queries and Nested Selects A query inside another query.
- subqueries to evaluate aggregate functions
- subqueries in a list of columns
- subqueries in from clause
Working with Multiple Tables Access multiple tables in the same query
- Sub-queries
- Implicit Join
- JOIN operators (INNER JOIN, OUTER JOIN, etc)
What is Python DB-API?
What is a Model? A model can be thought of as a mathematical equation used to predict a value given one or more other values. Relating one or more independent variables to dependent variables.
Model Evaluation and Refinement?
08 Data Visualization with Python
**What is data visualization? Why do we build visualizations? What are important types of plots? What are popular plot libraries?
- matplotlib
- pandas
- seaborn
- folium
- plotly
- pywaffle What is Dash-boarding?*
- Dash
09 Machine Learning with Python
10 Applied Capstone
IBM Resources Mentioned
IBM Waston Open Scale IBM Watson Studio IBM Developer Website
Open Data Sets