This Data Analysis with Python course, provided by IBM, is designed to teach future data analysts how to prepare data for analysis, perform simple statistical analysis, create meaningful data visualizations, and predict future trends from data, through a number of lectures, labs, and assignments using Python libraries. The following are the notes I took during this course.
Importing Datasets
Data, Dataset & Attributes in the dataset
Python Packages for Data Science
- Scientific Computing Libraries: Pandas, NumPy, SciPy
- Visualization Libraries: Matplotlib, Seaborn
- Algorithmic Libraries: Scikit-learn, Statsmodels
Importing Data: Format & File Path of dataset
import pandas as pd
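A minimal loading sketch; the file path is a placeholder for the actual location or URL of the dataset:

```python
import pandas as pd

# path is a placeholder for the actual file location or URL of the dataset
path = "path/to/dataset.csv"
df = pd.read_csv(path)   # use header=None if the file has no header row
df.head()                # preview the first five rows
```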
Data Types
df.dtypes
Checks data types
df.describe(include="all")
Returns a statistical summary
df.info()
Provides a concise summary of the dataframe
Accessing Databases with Python (DB-API)
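The typical DB-API workflow is connect, create a cursor, execute queries, fetch results, and close the connection. A minimal sketch, using sqlite3 only as an illustrative DB-API driver with a hypothetical database file and table name:

```python
import sqlite3  # sqlite3 is used here only as an illustrative DB-API driver

conn = sqlite3.connect("example.db")          # open a connection to the database
cursor = conn.cursor()                        # create a cursor object
cursor.execute("SELECT * FROM some_table")    # run a query (table name is hypothetical)
rows = cursor.fetchall()                      # fetch the query results
conn.close()                                  # free resources when finished
```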
Data Wrangling
Pre-processing Data in Python = Data Cleaning = Data Wrangling
Dealing with Missing Values
- Drop:
df.dropna(subset=["column_name"], axis=0, inplace=True)
- Replace:
df.replace(missing_value, new_value)
- Mean:
df["column_name"].mean()
Data Formatting
1 | df["city-mpg"] = 235/df["city-mpg"] |
dataframe.astype()
Converts data type
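For example, a minimal cast of one column (the column name is illustrative):

```python
# Cast a column read in as object/string to a numeric type
df["price"] = df["price"].astype("float")
```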
Data Normalization
- Simple Feature scaling: $x_{new}=\dfrac{x_{old}}{x_{max}}$
df["length"] = df["length"]/df["length"].max()
- Min-Max: $x_{new}=\dfrac{x_{old}-x_{min}}{x_{max}-x_{min}}$
df["length"] = (df["length"]-df["length"].min())/(df["length"].max()-df["length"].min())
- Z-score (Standard Score): $x_{new}=\dfrac{x_{old}-\mu}{\sigma}$
z-score values typically range between [-3, 3]
df["length"] = (df["length"]-df["length"].mean())/df["length"].std()
Binning: converts numeric variables into categorical variables by grouping them into a set of “bins”
import numpy as np
bins = np.linspace(min(df["price"]), max(df["price"]), 4)
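Continuing from the bins above, a sketch of applying them with pd.cut (the bin labels and the new column name are chosen for illustration):

```python
# Label the three bins created above and assign each price to one of them
group_names = ["Low", "Medium", "High"]
df["price-binned"] = pd.cut(df["price"], bins, labels=group_names, include_lowest=True)
```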
Turning categorical variables into quantitative variables
- Convert categorical variables into dummy variables (0 or 1)
pd.get_dummies(df['fuel'])
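To actually use the dummy variables in a model, they are usually concatenated back onto the dataframe; a sketch:

```python
# Attach the dummy columns and drop the original categorical column
dummies = pd.get_dummies(df["fuel"])
df = pd.concat([df, dummies], axis=1)
df.drop("fuel", axis=1, inplace=True)
```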
Exploratory Data Analysis (EDA)
EDA is a preliminary step in data analysis used to summarize the main characteristics of the data, gain a better understanding of the dataset, uncover relationships between different variables, and extract important variables for the problem we’re trying to solve.
Descriptive Statistics
- Describe basic features of data
- Give short summaries about the sample and measures of the data
df.describe()
Summarizes statistics
value_counts()
Summarizes the categorical data
wheels_counts = df["wheels"].value_counts().to_frame()
Box Plot
1 | sns.boxplot(x="wheels",y="price",data=df) |
Scatter Plot
1 | y=df["price"] |
GroupBy
- The groupby method is used on categorical variables; it groups the data into subsets according to the different categories of that variable.
df_test = df[['drive-wheels','body-style','price']]
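The grouping step that produces the df_grp used in the pivot table below might look like this (a sketch averaging price per group):

```python
# Group by the two categorical variables and compute the average price per group
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
```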
Pivot table
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
Heatmap
import matplotlib.pyplot as plt
plt.pcolor(df_pivot, cmap='RdBu')
Correlation
- Correlation measures the extent to which different variables are interdependent
- Correlation doesn’t imply causation
- Linear Relationship
import seaborn as sns
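A regression plot is the usual way to inspect a linear relationship; a sketch using illustrative columns from the course's car dataset:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the (assumed) positive linear relationship between engine size and price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
plt.show()
```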
Correlation - Statistics
- Pearson Correlation: measures the strength of the correlation between two features (gives the correlation coefficient & the P-value)
- Correlation coefficient close to +1: Large Positive relationship, close to -1: Large Negative relationship, close to 0: No relationship
- P-value < 0.001: Strong certainty in the result, < 0.05: Moderate certainty in the result, < 0.1: Weak certainty in the result, > 0.1: No certainty in the result
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
Correlation - Heatmap
Chi-Square: Test for Association
- The Chi-square test is intended to test how likely it is that an observed distribution is due to chance.
- It measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent.
- The Chi-square does not tell you the type of relationship that exists between both variables only that a relationship exists.
- $\chi^{2}=\sum_{i=1}^{n}\dfrac{\left(O_{i}-E_{i}\right)^{2}}{E_{i}}$
- Degrees of freedom = (rows - 1) * (columns - 1)
scipy.stats.chi2_contingency(cont_table, correction=True)
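A sketch of the full test, where the contingency table is built with pd.crosstab; the two column names are assumptions used only for illustration:

```python
import pandas as pd
from scipy import stats

# Build a contingency table from two categorical columns (column names are illustrative)
cont_table = pd.crosstab(df["fuel"], df["aspiration"])
chi2, p_value, dof, expected = stats.chi2_contingency(cont_table, correction=True)
```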
Model Development
A model can be thought of as a mathematical equation used to predict a value given one or more other values, relating one or more independent variables to a dependent variable.
Usually, the more relevant data you have, the more accurate your model is.
Simple linear regression uses one independent variable to make a prediction.
Multiple linear regression uses multiple independent variables to make a prediction.
Simple Linear Regression (SLR)
- The predictor (independent) variable - x, The target (dependent) variable - y
- $b_{0}$: the intercept, $b_{1}$: the slope
- $y=b_{0}+b_{1}x$
- Fit: $\left(b_{0}, b_{1}\right)$
from sklearn.linear_model import LinearRegression
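A minimal SLR sketch with scikit-learn, assuming highway-mpg as the predictor and price as the target:

```python
from sklearn.linear_model import LinearRegression

# Fit price on highway-mpg (columns from the course's car dataset)
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
Yhat = lm.predict(X)        # predicted values
lm.intercept_, lm.coef_     # b0 and b1
```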
Multiple Linear Regression (MLR)
- One continuous target (Y) variable, two or more predictor (X) variables
- $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{3}+b_{4}x_{4}$
- $b_{0}$: the intercept (the value of $\widehat{Y}$ when all x are 0), $b_{1}$: the coefficient or parameter of $x_{1}$, $b_{2}$: the coefficient or parameter of $x_{2}$, and so on
Z = df[['horsepower','curb-weight','engine-size','highway-mpg']]  # extract predictor variables and store them in the dataframe Z
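Fitting the MLR model on Z then follows the same pattern as SLR; a sketch:

```python
from sklearn.linear_model import LinearRegression

# Fit a multiple linear regression of price on the predictors stored in Z
lm = LinearRegression()
lm.fit(Z, df['price'])
Yhat = lm.predict(Z)
```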
Model Evaluation using Visualization
- Regression plots are a good estimate of the relationship between two variables, the strength of the correlation, and the direction of the relationship (positive or negative)
- Regression plot shows a combination of scatterplot and the fitted linear regression line ($\widehat{Y}$)
import seaborn as sns
A residual plot represents the error between the actual value and the predicted value; if the residuals are randomly spread around the x-axis, a linear model is appropriate for the data.
import seaborn as sns
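A residual-plot sketch with seaborn, again assuming highway-mpg and price as the columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Residual plot of price against highway-mpg
sns.residplot(x="highway-mpg", y="price", data=df)
plt.show()
```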
A distribution plot compares the distribution of the predicted values with the distribution of the actual values
MLR - Distribution plots
import seaborn as sns
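A distribution-plot sketch comparing actual and fitted values; kdeplot is used here as a current alternative to the deprecated distplot, and Yhat is assumed from the MLR sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Compare the distribution of actual prices with the fitted values Yhat
ax = sns.kdeplot(df['price'], color="r", label="Actual Values")
sns.kdeplot(Yhat, color="b", label="Fitted Values", ax=ax)
plt.legend()
plt.show()
```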
Polynomial Regression and Pipelines
- Polynomial regression is a special case of the general linear regression model, useful for describing curvilinear relationships
- Curvilinear relationships are obtained by squaring or setting higher-order terms of the predictor variables
- Quadratic - 2nd order: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}(x_{1})^2$
- Cubic - 3rd order: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}(x_{1})^2+b_{3}(x_{1})^3$
import numpy as np
f = np.polyfit(x, y, 3)  # calculate a polynomial of the 3rd order
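The fitted coefficients f can be wrapped in np.poly1d to print and evaluate the polynomial:

```python
import numpy as np

# Display the fitted 3rd-order polynomial in a readable form
p = np.poly1d(f)
print(p)
```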
We could also have multi-dimensional polynomial linear regression: $\widehat{Y}=b_{0}+b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{1}x_{2}+b_{4}(x_{1})^2+b_{5}(x_{2})^2+\ldots$
from sklearn.preprocessing import PolynomialFeatures
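For multi-dimensional polynomial features, scikit-learn's PolynomialFeatures transforms the predictor matrix; a sketch using the Z defined earlier:

```python
from sklearn.preprocessing import PolynomialFeatures

# Generate degree-2 polynomial and interaction terms from the predictor matrix Z
pr = PolynomialFeatures(degree=2)
Z_pr = pr.fit_transform(Z)
```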
Pre-processing
from sklearn.preprocessing import StandardScaler
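A scaling sketch with StandardScaler on the predictor matrix Z:

```python
from sklearn.preprocessing import StandardScaler

# Standardize the predictors in Z to zero mean and unit variance
SCALE = StandardScaler()
SCALE.fit(Z)
Z_scaled = SCALE.transform(Z)
```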
Pipelines: transform -> prediction
from sklearn.preprocessing import PolynomialFeatures
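A pipeline sketch chaining the transforms above with a linear model, using Z and price as before:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Chain scaling, polynomial feature generation, and regression into a single estimator
steps = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(steps)
pipe.fit(Z, df['price'])   # all transforms are applied, then the model is trained
yhat = pipe.predict(Z)     # the same transforms are applied again before predicting
```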
Measures for In-Sample Evaluation
- A way to numerically determine how good the model fits on our data
- Mean Squared Error (MSE)
from sklearn.metrics import mean_squared_error
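A one-line MSE sketch, with Yhat assumed from the earlier fits:

```python
from sklearn.metrics import mean_squared_error

# Compare the actual prices with the model predictions
mse = mean_squared_error(df['price'], Yhat)
```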
R-squared: The coefficient of determination
R-squared is a measure to determine how close the data is to the fitted regression line
R^2: the percentage of variation of the target variable (Y) that is explained by the linear model
$R^{2}=1-\dfrac{\text{MSE of the regression line}}{\text{MSE of the average of the data}}$
lm.fit(X, Y)
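After fitting, R-squared can be obtained directly from the estimator; X and Y are assumed from the SLR sketch above:

```python
# R^2 of the model on the data it was fitted on
lm.score(X, Y)
```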
Generally, the values of R^2 are between 0 and 1.
An R^2 close to 1 indicates the model fits the data well, while an R^2 close to 0 indicates the model performs no better than simply predicting the mean of the data.
Prediction and Decision Making
lm.fit(df[['highway-mpg']], df['price'])  # train the model
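A prediction sketch for a single new input value (the value 30 is chosen only for illustration):

```python
import numpy as np

# Predict the price of a car with 30 highway-mpg
lm.predict(np.array([[30.0]]))
```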
Residual Plot
import numpy as np
Numerical measures for Evaluation
Comparing MLR and SLR: a lower Mean Squared Error does not necessarily imply a better fit.
The Mean Squared Error for a Multiple Linear Regression model will be smaller than the Mean Squared Error for a Simple Linear Regression model, since the errors of the data decrease when more variables are included in the model.
Polynomial regression will also have a smaller Mean Squared Error than regular linear regression.
Model Evaluation and Refinement
In-sample evaluation tells us how well our model will fit the data used to train it
Build and train the model with a training set (e.g., 70% of the dataset)
Use the testing set to assess the performance of the predictive model (the remaining 30% of the dataset)
Function train_test_split(): splits data into random train and test subsets
- x_data: features or independent variables; y_data: dataset target (df['price'])
- x_train, y_train: the parts of the available data used as the training set
from sklearn.model_selection import train_test_split
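A usage sketch, assuming x_data and y_data are defined as described above:

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the data for testing; random_state fixes the shuffle for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=0)
```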
Generalization error is a measure of how well our model predicts previously unseen data; the error we obtain using our testing data is an approximation of this error.
Using more data for training gives a better model, but leaves less data for testing, so the estimate of the generalization error is less precise; cross validation addresses this trade-off.
Cross Validation
- Most common out-of-sample evaluation metrics
- More effective use of data (each observation is used for both training and testing)
Function cross_val_score()
from sklearn.model_selection import cross_val_score
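A sketch of scoring a linear model with 3 folds, with x_data and y_data assumed from the split above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lr = LinearRegression()
scores = cross_val_score(lr, x_data, y_data, cv=3)   # one R^2 score per fold
np.mean(scores)                                      # average score across the folds
```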
Function cross_val_predict()
- It returns the prediction that was obtained for each element when it was in the test set
from sklearn.model_selection import cross_val_predict  # has a similar interface to cross_val_score()
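A usage sketch, reusing the estimator and data from the cross_val_score example:

```python
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every observation (lr, x_data, y_data as above)
yhat = cross_val_predict(lr, x_data, y_data, cv=3)
```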
Model Selection: y(x)+noise
Underfitting: Where the model is too simple to fit the data.
Overfitting: Where the model is too flexible and fits the noise rather than the function
Ridge regression
- A regression that is employed in a Multiple regression model when Multicollinearity occurs
- Multicollinearity is when there is a strong relationship among the independent variables
- Ridge regression is very common with polynomial regression
from sklearn.linear_model import Ridge
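A ridge regression sketch on the predictor matrix Z from earlier; the alpha value is chosen only for illustration:

```python
from sklearn.linear_model import Ridge

# alpha controls the strength of the regularization
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(Z, df['price'])
Yhat = RidgeModel.predict(Z)
```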
Grid Search allows us to scan through multiple free parameters with a few lines of code.
- Hyperparameters: parameters that are not part of the fitting or training process (e.g., alpha in Ridge regression)
- Scikit-learn has a means of automatically iterating over these hyperparameters using cross-validation called Grid Search
- Training data, Validation data, Test data
from sklearn.linear_model import Ridge
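A grid-search sketch over the ridge alpha parameter, with x_data and y_data assumed as before; the candidate alpha values are illustrative:

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# GridSearchCV picks the best alpha by cross validation
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000]}]
RR = Ridge()
grid = GridSearchCV(RR, parameters, cv=4)
grid.fit(x_data, y_data)
best_rr = grid.best_estimator_               # ridge model refit with the best alpha
grid.cv_results_['mean_test_score']          # mean validation score for each alpha
```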
Project Case
Determining the market price of a house given a set of features
Final Assignment Notebook Url (May not be accessible from Mainland China)
Assignments
Visit my Github Repository