This Data Analysis with Python course, provided by IBM, is designed to teach future data analysts how to prepare data for analysis, perform simple statistical analysis, create meaningful data visualizations, and predict future trends from data, through a number of lectures, labs, and assignments using Python libraries. The following are the notes I took during this course.
Importing Datasets
Data, Dataset & Attributes in the dataset
Python Packages for Data Science
- Scientific Computing Libraries: Pandas, NumPy, SciPy
- Visualization Libraries: Matplotlib, Seaborn
- Algorithmic Libraries: Scikit-learn, Statsmodels
Importing Data: Format & File Path of dataset
import pandas as pd
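A minimal loading sketch; the file path is a placeholder for the actual location or URL of the dataset:

```python
import pandas as pd

# path is a placeholder for the actual file location or URL of the dataset
path = "path/to/dataset.csv"
df = pd.read_csv(path)   # use header=None if the file has no header row
df.head()                # preview the first five rows
```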
Data Types
df.dtypes
Checks data types
df.describe(include="all")
Returns a statistical summary
df.info()
Provides a concise summary of the dataframe
Accessing Databases with Python (DB-API)
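The typical DB-API workflow is connect, create a cursor, execute queries, fetch results, and close the connection. A minimal sketch, using sqlite3 only as an illustrative DB-API driver with a hypothetical database file and table name:

```python
import sqlite3  # sqlite3 is used here only as an illustrative DB-API driver

conn = sqlite3.connect("example.db")          # open a connection to the database
cursor = conn.cursor()                        # create a cursor object
cursor.execute("SELECT * FROM some_table")    # run a query (table name is hypothetical)
rows = cursor.fetchall()                      # fetch the query results
conn.close()                                  # free resources when finished
```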
Data Wrangling
Pre-processing Data in Python = Data Cleaning = Data Wrangling
Dealing with Missing Values
- Drop:
df.dropna(subset=["column_name"], axis=0, inplace=True)
- Replace:
df.replace(missing_value, new_value)
- Mean:
df["column_name"].mean()
Data Formatting
1 | df["city-mpg"] = 235/df["city-mpg"] |
dataframe.astype()
Converts data type
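For example, a minimal cast of one column (the column name is illustrative):

```python
# Cast a column read in as object/string to a numeric type
df["price"] = df["price"].astype("float")
```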
Data Normalization
- Simple Feature scaling: $x_{new}=\dfrac{x_{old}}{x_{max}}$
df["length"] = df["length"]/df["length"].max()
- Min-Max: $x_{new}=\dfrac{x_{old}-x_{min}}{x_{max}-x_{min}}$
df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
- Z-score (Standard Score): $x_{new}=\dfrac{x_{old}-\mu}{\sigma}$
z-score values typically range between [-3, 3]
df["length"] = (df["length"]-df["length"].mean())/df["length"].std()
Binning: converts numeric variables into categorical variables by grouping them into a set of “bins”
import numpy as np
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
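Continuing from the bins above, a sketch of applying them with pd.cut (the bin labels and the new column name are chosen for illustration):

```python
# Label the three bins created above and assign each price to one of them
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
```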
Turning categorical variables into quantitative variables
- Convert categorical variables into dummy variables (0 or 1)
pd.get_dummies(df['fuel'])
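To actually use the dummy variables in a model, they are usually concatenated back onto the dataframe; a sketch:

```python
# Attach the dummy columns and drop the original categorical column
dummies = pd.get_dummies(df["fuel"])
df = pd.concat([df, dummies], axis=1)
df.drop("fuel", axis=1, inplace=True)
```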
Exploratory Data Analysis (EDA)
EDA is a preliminary step in data analysis used to summarize the main characteristics of the data, gain a better understanding of the dataset, uncover relationships between different variables, and extract important variables for the problem we’re trying to solve.
Descriptive Statistics
- Describe basic features of data
- Give short summaries about the sample and measures of the data
df.describe()
Summarizes statistics
value_counts()
Summarizes the categorical data
wheels_counts = df["wheels"].value_counts().to_frame()
Box Plot
1 | sns.boxplot(x="wheels",y="price",data=df) |
Scatter Plot
1 | y=df["price"] |
GroupBy
- The groupby method is used on categorical variables; it groups the data into subsets according to the different categories of that variable.
df_test = df[['drive-wheels','body-style','price']]
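The grouping step that produces the df_grp used in the pivot table below might look like this (a sketch averaging price per group):

```python
# Group by the two categorical variables and compute the average price per group
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
```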
Pivot table
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
Heatmap
import matplotlib.pyplot as plt
plt.pcolor(df_pivot, cmap='RdBu')
Correlation
- Correlation measures the extent to which different variables are interdependent
- Correlation doesn’t imply causation
- Linear Relationship
import seaborn as sns
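A regression plot is the usual way to inspect a linear relationship; a sketch using illustrative columns from the course's car dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the (assumed) positive linear relationship between engine size and price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()
```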
Correlation - Statistics
- Pearson Correlation: measures the strength of the correlation between two features (gives the correlation coefficient & the P-value)
- Correlation coefficient close to +1: Large Positive relationship, close to -1: Large Negative relationship, close to 0: No relationship
- P-value < 0.001: Strong certainty in the result, < 0.05: Moderate certainty in the result, < 0.1: Weak certainty in the result, > 0.1: No certainty in the result
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
Correlation - Heatmap
Chi-Square: Test for Association
- The Chi-square test is intended to test how likely it is that an observed distribution is due to chance.
- It measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent.
- The Chi-square does not tell you the type of relationship that exists between both variables only that a relationship exists.
- $\chi^{2}=\sum_{i=1}^{n}\dfrac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$
- Degrees of freedom = (rows - 1) * (columns - 1)
scipy.stats.chi2_contingency(cont_table, correction=True)
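A sketch of the full test, where the contingency table is built with pd.crosstab; the two column names are assumptions used only for illustration:

```python
import pandas as pd
from scipy import stats

# Build a contingency table from two categorical columns (column names are illustrative)
cont_table = pd.crosstab(df["fuel"], df["aspiration"])
chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=True)
```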
Model Development
A model can be thought of as a mathematical equation used to predict a value given one or more other values, relating one or more independent variables to a dependent variable.
Usually, the more relevant data you have, the more accurate your model is.
Simple linear regression uses one independent variable to make a prediction.
Multiple linear regression uses multiple independent variables to make a prediction.
Simple Linear Regression (SLR)
- The predictor (independent) variable - x, The target (dependent) variable - y
- $b_{0}$: the intercept, $b_{1}$: the slope
- $y=b_{0}+b_{1}x$
- Fit: $\left(b_{0}, b_{1}\right)$
from sklearn.linear_model import LinearRegression
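A minimal SLR sketch with scikit-learn, assuming highway-mpg as the predictor and price as the target:

```python
from sklearn.linear_model import LinearRegression

# Fit price on highway-mpg (columns from the course's car dataset)
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
Yhat = lm.predict(X)        # predicted values
lm.intercept_, lm.coef_     # b0 and b1
```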
Multiple Linear Regression (MLR)
- One continuous target (Y) variable, two or more predictor (X) variables
- $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{3}+b_{4}x_{4}$
- $b_{0}$: the intercept (the value of $\widehat{Y}$ when all x are 0), $b_{1}$: the coefficient or parameter of $x_{1}$, $b_{2}$: the coefficient or parameter of $x_{2}$, and so on
Z = df[['horsepower','curb-weight','engine-size','highway-mpg']]  # extract predictor variables and store them in the dataframe Z
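Fitting the MLR model on Z then follows the same pattern as SLR; a sketch:

```python
from sklearn.linear_model import LinearRegression

# Fit a multiple linear regression of price on the predictors stored in Z
lm = LinearRegression()
lm.fit(Z, df['price'])
Yhat = lm.predict(Z)
```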
Model Evaluation using Visualization
- Regression plots are a good estimate of the relationship between two variables, the strength of the correlation, and the direction of the relationship (positive or negative)
- Regression plot shows a combination of scatterplot and the fitted linear regression line ($\widehat{Y}$)
import seaborn as sns
A residual plot represents the error between the actual value and the predicted value; if the residuals are randomly spread around the x-axis, a linear model is appropriate for the data.
import seaborn as sns
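A residual-plot sketch with seaborn, again assuming highway-mpg and price as the columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Residual plot of price against highway-mpg
sns.residplot(x="highway-mpg", y="price", data=df)
plt.show()
```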
A distribution plot compares the distribution of the predicted values with the distribution of the actual values
MLR - Distribution plots
import seaborn as sns
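A distribution-plot sketch comparing actual and fitted values; kdeplot is used here as a current alternative to the deprecated distplot, and Yhat is assumed from the MLR sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the distribution of actual prices with the fitted values Yhat
ax = sns.kdeplot(df['price'], color="r", label="Actual Values")
sns.kdeplot(Yhat, color="b", label="Fitted Values", ax=ax)
plt.legend()
plt.show()
```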
Polynomial Regression and Pipelines
- Polynomial regression is a special case of the general linear regression model, useful for describing curvilinear relationships
- Curvilinear relationships are obtained by squaring or setting higher-order terms of the predictor variables
- Quadratic - 2nd order: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}(x_{1})^2$
- Cubic - 3rd order: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}(x_{1})^2+b_{3}(x_{1})^3$
import numpy as np
f = np.polyfit(x, y, 3)  # calculate a polynomial of the 3rd order
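The fitted coefficients f can be wrapped in np.poly1d to print and evaluate the polynomial:

```python
import numpy as np

# Display the fitted 3rd-order polynomial in a readable form
p = np.poly1d(f)
print(p)
```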
We could also have multi-dimensional polynomial linear regression: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{1}x_{2}+b_{4}(x_{1})^2+b_{5}(x_{2})^2+\ldots$
from sklearn.preprocessing import PolynomialFeatures
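For multi-dimensional polynomial features, scikit-learn's PolynomialFeatures transforms the predictor matrix; a sketch using the Z defined earlier:

```python
from sklearn.preprocessing import PolynomialFeatures

# Generate degree-2 polynomial and interaction terms from the predictor matrix Z
pr = PolynomialFeatures(degree=2)
Z_pr = pr.fit_transform(Z)
```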
Pre-processing
from sklearn.preprocessing import StandardScaler
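A scaling sketch with StandardScaler on the predictor matrix Z:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the predictors in Z to zero mean and unit variance
SCALE = StandardScaler()
SCALE.fit(Z)
Z_scaled = SCALE.transform(Z)
```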
Pipelines: transform -> prediction
from sklearn.preprocessing import PolynomialFeatures
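A pipeline sketch chaining the transforms above with a linear model, using Z and price as before:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Chain scaling, polynomial feature generation, and regression into a single estimator
steps = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(steps)
pipe.fit(Z, df['price'])   # all transforms are applied, then the model is trained
yhat = pipe.predict(Z)     # the same transforms are applied again before predicting
```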
Measures for In-Sample Evaluation
- A way to numerically determine how good the model fits on our data
- Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
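A one-line MSE sketch, with Yhat assumed from the earlier fits:

```python
from sklearn.metrics import mean_squared_error

# Compare the actual prices with the model predictions
mse = mean_squared_error(df['price'], Yhat)
```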
R-squared: The coefficient of determination
R-squared is a measure to determine how close the data is to the fitted regression line
R^2: the percentage of variation of the target variable (Y) that is explained by the linear model
$R^{2}=1-\dfrac{\text{MSE of the regression line}}{\text{MSE of the average of the data}}$
lm.fit(X, Y)
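After fitting, R-squared can be obtained directly from the estimator; X and Y are assumed from the SLR sketch above:

```python
# R^2 of the model on the data it was fitted on
lm.score(X, Y)
```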
Generally, the values of R^2 are between 0 and 1.
An R^2 close to 1 indicates the model fits the data well, while an R^2 close to 0 indicates the model performs no better than simply predicting the mean of the data.
Prediction and Decision Making
lm.fit(df[['highway-mpg']], df['price'])  # train the model
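A prediction sketch for a single new input value (the value 30 is chosen only for illustration):

```python
import numpy as np

# Predict the price of a car with 30 highway-mpg
lm.predict(np.array([[30.0]]))
```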
Residual Plot
import numpy as np
Numerical measures for Evaluation
Comparing MLR and SLR: a lower Mean Squared Error does not necessarily imply a better fit.
The Mean Squared Error for a Multiple Linear Regression model will be smaller than the Mean Squared Error for a Simple Linear Regression model, since the errors of the data decrease when more variables are included in the model.
Polynomial regression will also have a smaller Mean Squared Error than regular linear regression.
Model Evaluation and Refinement
In-sample evaluation tells us how well our model will fit the data used to train it
Build and train the model with a training set (e.g., 70% of the dataset)
Use the testing set to assess the performance of the predictive model (the remaining 30% of the dataset)
Function train_test_split(): splits data into random train and test subsets
- x_data: features or independent variables; y_data: dataset target (df['price'])
- x_train, y_train: the parts of the available data used as the training set
from sklearn.model_selection import train_test_split
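A usage sketch, assuming x_data and y_data are defined as described above:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the data for testing; random_state fixes the shuffle for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=0)
```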
Generalization error is a measure of how well our model predicts previously unseen data; the error we obtain using our testing data is an approximation of this error.
Using more data for training gives a better model, but leaves less data for testing, so the estimate of the generalization error is less precise; cross validation addresses this trade-off.
Cross Validation
- Most common out-of-sample evaluation metrics
- More effective use of data (each observation is used for both training and testing)
Function cross_val_score()
from sklearn.model_selection import cross_val_score
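A sketch of scoring a linear model with 3 folds, with x_data and y_data assumed from the split above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=3)   # one R^2 score per fold
np.mean(scores)                                      # average score across the folds
```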
Function cross_val_predict()
- It returns the prediction that was obtained for each element when it was in the test set
from sklearn.model_selection import cross_val_predict  # has a similar interface to cross_val_score()
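A usage sketch, reusing the estimator and data from the cross_val_score example:

```python
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every observation (lr, x_data, y_data as above)
yhat = cross_val_predict(lr, x_data, y_data, cv=3)
```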
Model Selection: y(x)+noise
Underfitting: Where the model is too simple to fit the data.
Overfitting: Where the model is too flexible and fits the noise rather than the function
Ridge regression
- A regression that is employed in a Multiple regression model when Multicollinearity occurs
- Multicollinearity is when there is a strong relationship among the independent variables
- Ridge regression is very common with polynomial regression
from sklearn.linear_model import Ridge
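A ridge regression sketch on the predictor matrix Z from earlier; the alpha value is chosen only for illustration:

```python
from sklearn.linear_model import Ridge

# alpha controls the strength of the regularization
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(Z, df['price'])
Yhat = RidgeModel.predict(Z)
```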
Grid Search allows us to scan through multiple free parameters with a few lines of code.
- Hyperparameters: parameters that are not part of the fitting or training process (e.g., alpha in Ridge regression)
- Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation called Grid Search
- Training data, Validation data, Test data
from sklearn.linear_model import Ridge
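A grid-search sketch over the ridge alpha parameter, with x_data and y_data assumed as before; the candidate alpha values are illustrative:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# GridSearchCV picks the best alpha by cross validation
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000]}]
RR = Ridge()
grid = GridSearchCV(RR, parameters, cv=4)
grid.fit(x_data, y_data)
best_rr = grid.best_estimator_               # ridge model refit with the best alpha
grid.cv_results_['mean_test_score']          # mean validation score for each alpha
```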
Project Case
Determining the market price of a house given a set of features
Final Assignment Notebook Url (May not be accessible from Mainland China)
Assignments
Visit my Github Repository