This might be a good time to start thinking about what you want to do in your term paper. In this assignment, we will learn how prediction and classification (the two hot topics) can be performed using real data on the PM2.5 concentration in Beijing. We will go through the procedures from loading, inspecting and cleaning data to further analysis, just as in your practical project. Note that although neural networks can also be useful in this context, we will discuss them only in a later assignment. This assignment focuses on techniques discussed in notebooks B2 (Regression), B3 (Cross Validation) and B4 (Classification).
Good dishes cannot be made without good ingredients. In data science, high quality data are the foundation of empirical research. Therefore, when considering your own research idea, the first thing you might want to do is look for interesting and reliable datasets.
In our previous lecture and assignment, we learnt how to scrape data from websites. Compared to second-hand data, such first-hand data can be more up-to-date, and they might best fit your needs when no existing data is available. However, to get your raw materials ready for analysis, you must go through various data-cleaning procedures, which can be tedious, time-consuming and difficult. Besides, you may also encounter obstacles such as being unable to reach the data source, or a website design that discourages scraping, ...
So, we should always check the availability of related datasets before bothering to gather them ourselves. Below I recommend some high quality data sources for machine learning analysis. These data are both reliable and ready to use. For those of you who still have no idea what to do in your term paper, looking at these datasets might help you brainstorm new ideas.
Of course, you are always welcome to use other data sources, but make sure that they are reliable so that your results are solid. As the saying goes, "Garbage in, garbage out": to avoid drawing unreliable or even wrong conclusions, flawed data must be avoided.
When formulating your research idea, you may find that most published papers on machine learning are too advanced for beginners to follow, or even to understand. Therefore, besides our past samples, if you need extra material for brainstorming, you can also take a look at http://cs229.stanford.edu/projects.html, the website for Stanford University's computer science course CS229, which showcases projects done by their previous students. These student projects might give you a better idea of what you can do in your term paper.
In this assignment, we will cover skills mentioned in the following notebooks: B2 (Regression), B3 (Cross Validation) and B4 (Classification).
When writing your answers, you should include the lines to import the libraries needed to finish the tasks yourself.
As mentioned in the introduction, the UCI Machine Learning Repository provides high quality data that are both reliable and ready to use. In this exercise, we will use its Beijing PM2.5 concentration data for our analysis. We have provided the dataset on our course website; it is retrieved from here, and the description of the dataset can be found here.
Download `PRSA_data_2010.1.1-2014.12.31.csv` from the course website and put it in the same folder as your assignment. Load the data and store it with the name `raw_data`.

#Put your answer here
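If you are unsure where to start, a minimal sketch, assuming the CSV file sits in the same folder as the notebook, could look like:

```python
import pandas as pd

# Read the CSV file into a dataframe
raw_data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
```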
Before proceeding further, you must first understand what you have in the dataset. As in your term paper, it is important to give a comprehensive description of your data before proceeding to any analysis. Typically, you would have to answer questions such as: How many observations and variables are there? What does each variable measure? Are there missing values?

Summary statistics, charts, and plots might be helpful for presenting your data to readers.
#Put your answer here
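For instance, a few standard `pandas` calls, one possible way to begin the inspection:

```python
# Dimensions, first rows, and summary statistics of the dataset
print(raw_data.shape)       # (number of observations, number of variables)
print(raw_data.head())      # preview the first 5 rows
print(raw_data.describe())  # summary statistics of the numeric columns
```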
You can see that there are 13 variables in the dataset. Details of the variables can be found in its documentation. In this exercise, only 8 of them will be used: `pm2.5`, `DEWP`, `TEMP`, `PRES`, `cbwd`, `Iws`, `Is` and `Ir`.
Extract the 8 variables `pm2.5`, `DEWP`, `TEMP`, `PRES`, `cbwd`, `Iws`, `Is` and `Ir`. Store the result with another name, `df_8var`.

#Put your answer here
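One possible sketch, selecting the columns by name (the list variable `cols` is my own naming):

```python
# Keep only the 8 variables needed for the analysis
cols = ["pm2.5", "DEWP", "TEMP", "PRES", "cbwd", "Iws", "Is", "Ir"]
df_8var = raw_data[cols]
```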
From the output of Q2.1, you can see that some of the entries are missing. It is not unusual for missing values to occur in a dataset. Depending on the situation, there are multiple ways to deal with them. The most direct and simple way is to drop the observations that are incomplete. In the `pandas` package, we can use `isna()` to detect `NULL` values in a dataframe and `dropna()` to remove these observations.
Count the number of missing values in each column of the `df_8var` that you created in Q2.2.

Hint: For a dataframe `df`, using the `isna()` method from the `pandas` package, `df.isna().sum()` returns the number of `NaN` values in each column.
#Put your answer here
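A sketch that follows the hint directly:

```python
# Count the NaN values in each of the 8 columns
print(df_8var.isna().sum())
```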
You can see that only the variable `pm2.5` suffers from missing values.

Drop the observations with missing values and store the resulting dataframe as `df_8var_NAdrop`.

#Put your answer here
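A minimal sketch with `dropna()`:

```python
# Remove every observation that has at least one missing value
df_8var_NAdrop = df_8var.dropna()
print(df_8var_NAdrop.shape)  # check how many observations remain
```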
Besides, you should also observe that the variable recording the wind direction, `cbwd`, is a categorical variable with 4 different levels: `SE`, `NE`, `NW` and `cv`.
#Nothing is required in this cell
#Print unique values in a variable by using set() in Python
for wind in set(df_8var_NAdrop['cbwd']):
    print(wind)
SE
NE
cv
NW
Categorical variables cannot be used directly in regressions. Therefore, we have to create dummy variables for them.
Create the dummy variables `SE`, `NE` and `NW` (e.g. when `cbwd` == `SE`, `SE` = 1; otherwise it equals 0), so that `cv` serves as the baseline category. Concatenate them with the existing dataframe `df_8var_NAdrop` and name the new dataframe `df_final`.

Hint: You can use the `get_dummies()` function in the `pandas` package to create the dummy variables. Details of the `get_dummies()` function can be found here. For the concatenation, you may want to use the `concat()` function from `pandas`. Read here to learn about the method; you will need to set the `axis` parameter to finish this task.

The resulting table should look like:
| | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir | NE | NW | SE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | … | … | … | … | … | … | … | … | … | … | … |
| 1 | … | … | … | … | … | … | … | … | … | … | … |
| 2 | … | … | … | … | … | … | … | … | … | … | … |
#Put your answer here
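A sketch following the hint above; the intermediate variable name `dummies` is my own choice:

```python
import pandas as pd

# Create dummy columns from the categorical wind-direction variable
dummies = pd.get_dummies(df_8var_NAdrop["cbwd"])

# Keep SE, NE and NW only, so that cv serves as the baseline category
dummies = dummies[["NE", "NW", "SE"]]

# Concatenate column-wise (axis=1) with the existing dataframe
df_final = pd.concat([df_8var_NAdrop, dummies], axis=1)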
Now we have prepared a dataframe, `df_final`, that is ready for the analyses below.
Using the `train_test_split` method, split the data into:

- a training set: `x_in` (independent variables) and `y_in` (dependent variable)
- a test set: `x_out` (independent variables) and `y_out` (dependent variable)

`pm2.5` is the dependent variable, which is our target to predict and classify. `DEWP`, `TEMP`, `PRES`, `Iws`, `Is`, `Ir`, `NE`, `NW` and `SE` are the ingredients used for predicting and classifying the PM2.5 concentration in Beijing, and are therefore the independent variables.

#Put your answer here
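One possible sketch; note that `test_size=0.2` and `random_state=0` are my own assumptions, so use whatever split ratio and seed the course specifies:

```python
from sklearn.model_selection import train_test_split

# Independent variables: drop the target and the raw categorical column
x = df_final.drop(columns=["pm2.5", "cbwd"])
y = df_final["pm2.5"]

# Training/test split; test_size and random_state are assumptions
x_in, x_out, y_in, y_out = train_test_split(x, y, test_size=0.2, random_state=0)
```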
Fit an OLS regression of `y_in` on `x_in`. Save the fitted model as `fitted_ols` and print out the resulting coefficients (with intercept).

#Put your answer here
The OLS coefficient vector $\hat{\beta}$ is obtained by the following formula:

$$ \hat{\beta} = (X'X)^{-1}(X'Y) $$

where $X$ is an $N \times K$ matrix storing the values of the independent variables and $Y$ is an $N \times 1$ matrix of the dependent variable. $N$ is the total number of observations and $K$ is the number of independent variables (plus 1 for the intercept term).
#Nothing is required in this cell
import numpy as np
x_in_const = x_in.copy()
x_in_const["Intercept"] = 1 #Add a column of 1 in the independent variable matrix to get the intercept
X = np.matrix(x_in_const)
Y = np.matrix(y_in).T
Apply the formula above to `X` and `Y` to get the coefficients, and check that they match those from Q3.2.

#Put your answer here
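A sketch of the matrix computation, using the `X` and `Y` prepared in the cell above:

```python
import numpy as np

# beta_hat = (X'X)^{-1} (X'Y)
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)

# The last entry corresponds to the "Intercept" column appended above
print(beta_hat)
```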
Use `fitted_ols` to make predictions on the test set `x_out`, and save the predicted values as `ols_predict`.

#Put your answer here
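A one-line sketch:

```python
# Predict PM2.5 for the out-of-sample observations
ols_predict = fitted_ols.predict(x_out)
```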
In notebook B2 Regression, we learnt to evaluate model performance with the R-squared, which measures the proportion of variation in the dependent variable explained by the independent variables in the model. However, researchers are often more interested in the predictive performance of models than in their explanatory power. In that case, the R-squared is not useful, and the Mean Squared Error (MSE) becomes a popular choice.
The $MSE$ takes the following form:

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^{2} $$

where $y_i$ is the actual observed value, $\hat{y}_i$ is the predicted value, and $N$ is the total number of observations.

The $MSE$ therefore measures the average squared distance between actual and predicted values within the sample. A lower $MSE$ indicates better predictive ability of the model.
Write a function that computes the MSE from two arguments, `predict` and `actual`. Use the function to return the MSE of the predicted values from Q3.4, save it as `MSE_ols`, and print `MSE_ols`.

#Put your answer here
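A sketch of one possible implementation; the helper name `mse` is my own:

```python
import numpy as np

def mse(predict, actual):
    """Return the mean squared error between predicted and actual values."""
    return np.mean((np.asarray(actual) - np.asarray(predict)) ** 2)

MSE_ols = mse(ols_predict, y_out)
print(MSE_ols)
```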
Actually, the `scikit-learn` package already provides the `mean_squared_error()` function to return the MSE. Click here to learn how to use the function.
Recompute the MSE from Q3.5, this time with the `mean_squared_error()` function in `scikit-learn`.

#Put your answer here
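A sketch with the built-in function:

```python
from sklearn.metrics import mean_squared_error

# Should match MSE_ols from the custom function above
print(mean_squared_error(y_out, ols_predict))
```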
Use the `GridSearchCV()` function to fit both a Lasso regression model and a Ridge regression model on the training set, with the cross-validation details specified below. Save the fitted models as `fitted_lasso` and `fitted_ridge` respectively.

In `Lasso()` and `Ridge()`:

In `GridSearchCV()`:

- parameter grid: `{'alpha':[1,5,10,15,20]}`
#Put your answer here
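A sketch of the grid search; only the alpha grid is given above, so `cv=5` and the scoring choice here are my own assumptions, and you should replace them with whatever the question specifies:

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [1, 5, 10, 15, 20]}

# cv and scoring are assumptions; use the settings given in the question
fitted_lasso = GridSearchCV(Lasso(), param_grid, cv=5,
                            scoring="neg_mean_squared_error")
fitted_lasso.fit(x_in, y_in)

fitted_ridge = GridSearchCV(Ridge(), param_grid, cv=5,
                            scoring="neg_mean_squared_error")
fitted_ridge.fit(x_in, y_in)
```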
Using `fitted_lasso` and `fitted_ridge` from Q4.1, make predictions with their best tuned models, `fitted_lasso.best_estimator_` and `fitted_ridge.best_estimator_`, on the test set. Print out each model's best tuned parameter together with the MSE of its predictions.

#Put your answer here
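One possible sketch:

```python
from sklearn.metrics import mean_squared_error

# Evaluate each best tuned model on the test set
for name, fitted in [("Lasso", fitted_lasso), ("Ridge", fitted_ridge)]:
    pred = fitted.best_estimator_.predict(x_out)
    print(name, "| best alpha:", fitted.best_params_["alpha"],
          "| test MSE:", mean_squared_error(y_out, pred))
```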
You can see that, compared to the OLS model, the Ridge regression model provides slightly better predictions on average, while the Lasso model performed the worst.
In notebook B4 Classification, we focused on applying different classification techniques to two-class scenarios (e.g. high-spending group vs. low-spending group, male vs. female, ...). What if we want to classify when there are more than 2 categories? Below we will apply the skills we have learnt to a multi-class scenario.
Suppose now we divide the PM2.5 concentration values into 6 levels as follows:
| PM2.5 ($\mu g/m^3$) | Air Quality Category | Integer Representing the Category |
|---|---|---|
| 0-35 | Excellent | 1 |
| 36-75 | Good | 2 |
| 76-115 | Slight Pollution | 3 |
| 116-150 | Moderate Pollution | 4 |
| 151-250 | Heavy Pollution | 5 |
| >250 | Severe Pollution | 6 |
Write a function that converts the values of `pm2.5` to the integer indicating their air quality category (i.e. if `pm2.5` <= 35, return 1). Apply the function to `y_in` and `y_out` and save the new values as `y_in_classification` and `y_out_classification` respectively.

Hint: You can apply a function `func` to the whole column `col` of a dataframe `df` with the line `df["col"].apply(func)`.
#Put your answer here
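A sketch of the conversion, with the cutoffs taken from the table above; the helper name `to_category` is my own:

```python
def to_category(pm):
    """Map a PM2.5 value to its air quality category (1-6)."""
    if pm <= 35:
        return 1
    elif pm <= 75:
        return 2
    elif pm <= 115:
        return 3
    elif pm <= 150:
        return 4
    elif pm <= 250:
        return 5
    return 6

# Convert the continuous targets into categorical labels
y_in_classification = y_in.apply(to_category)
y_out_classification = y_out.apply(to_category)
```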
#Put your answer here
#Put your answer here