ECON4130 Assignment 2 - PM2.5 Concentration in Beijing

Introduction

This might be a good time for you to start thinking about what you want to do in your term paper. In this assignment, we will learn how prediction and classification (the two hot topics) can be performed using real data on the PM2.5 concentration in Beijing. We will go through the procedures from loading, inspecting and cleaning data to further analysis, just as in your practical project. Note that although neural networks can also be useful in this context, we will discuss them only in a later assignment. This assignment focuses on techniques discussed in notebooks B2 (Regression), B3 (Cross Validation) and B4 (Classification).

Data

Good dishes cannot be made without good ingredients. In data science, high-quality data are the foundation of empirical research. Therefore, when considering your own research idea, the first thing you might want to do is look for interesting and reliable datasets.

In our previous lecture and assignment, we learnt how to scrape data from websites. Compared with second-hand data, such first-hand data can be more up-to-date and may best fit your needs when no existing dataset is available. However, to get your raw materials ready for analysis, you must go through various data-cleaning procedures, which can be tedious, time-consuming and difficult. You may also run into obstacles such as being unable to reach the data source, or a website design that discourages scraping.

So, before bothering to gather data ourselves, we should always check whether related datasets are already available. Below I recommend some high-quality data sources for machine learning analysis; the data they host are both reliable and ready to use. For those of you who still have no idea what to do in your term paper, browsing these datasets might help you brainstorm new ideas.

  1. Kaggle: https://www.kaggle.com/datasets
  2. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

Of course, you are always welcome to use other data sources, but make sure they are reliable so that your results rest on solid ground. As the saying goes, "garbage in, garbage out": to avoid drawing unreliable or even wrong conclusions, flawed data must be avoided.

Research idea

When formulating your research idea, you may find that most published papers on machine learning are too advanced for beginners to follow, or even to understand. Therefore, besides our past samples, if you need extra material for brainstorming, you can also take a look at http://cs229.stanford.edu/projects.html, the website for Stanford University's computer science course CS229, which showcases projects done by previous students. These student projects might give you a better idea of what you can do in your term paper.

Coverage

In this assignment, we will cover skills mentioned in the following notebooks: B2 (Regression), B3 (Cross Validation) and B4 (Classification).

Important Notes

When writing your answers, remember to include the import statements for any libraries needed to finish the tasks yourself.

Part 1. Loading the Data

As mentioned in the introduction, the UCI Machine Learning Repository provides high-quality data that are both reliable and ready to use. In this exercise, we will use its Beijing PM2.5 concentration data for our analysis. We have provided the dataset on our course website; it was retrieved from here, and a description of the dataset can be found here.

Q1.1 Download the data file PRSA_data_2010.1.1-2014.12.31.csv from the course website and put it in the same folder as your assignment. Load the data and store it with the name raw_data.
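A minimal sketch with pandas is shown below; it assumes the CSV file sits in the same folder as your notebook.

```python
# A sketch only: the file name follows the question, and pandas is assumed
# to be the library used throughout the course.
import pandas as pd

raw_data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
print(raw_data.shape)   # quick sanity check on the number of rows and columns
```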

Part 2. Understanding the Data and Data Cleaning

Before proceeding further, you must first understand what we have in the dataset. In your paper, it is likewise important to give a comprehensive description of your data before proceeding to any analysis. Typically, you would have to answer some of the following questions:

Summary statistics, charts and plots can be helpful for presenting your data to readers.

Q2.1 Show the first 30 rows of the raw data.

You can see that the dataset contains 13 variables in total. Details of the variables can be found in its documentation. In this exercise, only 8 of the variables, pm2.5, DEWP, TEMP, PRES, cbwd, Iws, Is and Ir, will be used.

Q2.2 Fetch the columns of the 8 variables pm2.5, DEWP, TEMP, PRES, cbwd, Iws, Is and Ir. Store the result under another name, df_8var.
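One possible way to do this is to select the columns by name, as sketched below (the column labels are assumed to match the documentation exactly, with pm2.5 in lower case).

```python
# A sketch of selecting the 8 variables from raw_data.
cols = ["pm2.5", "DEWP", "TEMP", "PRES", "cbwd", "Iws", "Is", "Ir"]
df_8var = raw_data[cols].copy()   # .copy() avoids chained-assignment warnings later
```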

From the output of Q2.1, you can see that some of the entries are missing. It is not unusual for missing values to occur in a dataset. Depending on the situation, there are multiple ways to deal with them. The most direct and simple way is to drop incomplete observations. In the pandas package, we can use isna() to detect missing values in a dataframe and dropna() to remove these observations.

Q2.3 Check the number of missing entries in each column of df_8var that you created in Q2.2.

Hint: For a dataframe df, the pandas function isna() can be chained with sum(): df.isna().sum() returns the number of NaN values in each column.

You can see that only the variable pm2.5 suffers from missing values.

Q2.4 Drop observations with missing values in any of the columns of df_8var and reset the row index to count from 0 with reset_index(drop=True) in pandas. Name the new dataframe df_8var_NAdrop.

Details of the reset_index() function can be found here.
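A minimal sketch of this step, assuming the df_8var dataframe from Q2.2:

```python
# Drop rows with any missing value, then renumber the index from 0.
df_8var_NAdrop = df_8var.dropna().reset_index(drop=True)
print(df_8var_NAdrop.isna().sum())   # every column should now show zero
```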

Besides, you should also observe that the variable recording the wind direction, cbwd, is a categorical variable with 4 different levels: SE, NE, NW, cv.

Categorical variables cannot be directly used for regressions. Therefore, we have to create dummy variables for it.

Q2.5 Create 3 dummy variables SE, NE and NW (e.g. when cbwd == SE, SE = 1; otherwise it is equal to 0). Concatenate them with the existing dataframe df_8var_NAdrop. Name the new dataframe df_final.

Hint: You can use the get_dummies() function in the pandas package to create the dummy variables; its details can be found here. For concatenation, you may want to use the concat() function from pandas. Read here to learn about the method; you will need to set the axis parameter to finish this task.
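A sketch under the assumption that only the SE, NE and NW dummies are kept, so that cv serves as the omitted base category:

```python
import pandas as pd

# get_dummies() creates one indicator column per level of cbwd;
# we keep only the three requested dummies and join them column-wise.
dummies = pd.get_dummies(df_8var_NAdrop["cbwd"])[["SE", "NE", "NW"]]
df_final = pd.concat([df_8var_NAdrop, dummies], axis=1)   # axis=1 joins columns
```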

The resulting table should look like this (first rows shown, values omitted):

  pm2.5 DEWP TEMP PRES cbwd Iws Is Ir NE NW SE
0
1
2

Now we have prepared a dataframe df_final that is ready for the analyses below.

Part 3. OLS Regression

Q3.1 By using scikit-learn's train_test_split method, split the data into the following sets (a code sketch follows the list below):

- Training set : x_in(independent variables), y_in (dependent variable)

- Test set: x_out (independent variables), y_out (dependent variable)

with a training set size of 0.7 and the random state set to 4130,

where pm2.5 is the dependent variable, which is our target to predict and classify. DEWP, TEMP, PRES, Iws, Is, Ir, NE, NW and SE are the ingredients used to predict and classify the PM2.5 concentration in Beijing, and are therefore the independent variables.
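A minimal sketch of the split; the variable names simply mirror those requested in the question.

```python
from sklearn.model_selection import train_test_split

# Independent variables (features) and dependent variable (target).
X = df_final[["DEWP", "TEMP", "PRES", "Iws", "Is", "Ir", "NE", "NW", "SE"]]
y = df_final["pm2.5"]

# 70% of observations go to the training set; random_state fixes the shuffle.
x_in, x_out, y_in, y_out = train_test_split(
    X, y, train_size=0.7, random_state=4130
)
```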

Q3.2 Perform an Ordinary Least Squares (OLS) regression on the training set, save the fitted model as fitted_ols and print out the resulting coefficients (with intercept).
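One way to do this is sketched below with scikit-learn's LinearRegression; statsmodels' OLS would work equally well if that is what notebook B2 used.

```python
from sklearn.linear_model import LinearRegression

# Fit OLS on the training set; the intercept is included by default.
fitted_ols = LinearRegression().fit(x_in, y_in)
print("Intercept:", fitted_ols.intercept_)
print("Coefficients:", fitted_ols.coef_)
```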

The OLS coefficient vector $\hat{\beta}$ is obtained by the following formula:

$$ \hat{\beta} = (X'X)^{-1}(X'Y) $$

where $X$ is an $N \times K$ matrix storing the values of the independent variables and $Y$ is an $N \times 1$ matrix of the dependent variable. $N$ is the total number of observations and $K$ is the number of independent variables (plus 1 for the intercept term).

Q3.3 Use the matrices X and Y above to compute the coefficients directly with this formula and check whether they match those from Q3.2.
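A sketch of the closed-form solution with NumPy; a column of ones is appended so that the first element of the result corresponds to the intercept.

```python
import numpy as np

X_mat = np.column_stack([np.ones(len(x_in)), x_in])   # N x K design matrix
Y_mat = np.asarray(y_in).reshape(-1, 1)               # N x 1 target matrix

# beta_hat = (X'X)^(-1) (X'Y)
beta_hat = np.linalg.inv(X_mat.T @ X_mat) @ (X_mat.T @ Y_mat)
print(beta_hat.flatten())   # compare with the coefficients from Q3.2
```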

Q3.4 Predict the PM2.5 concentration on the test set with the OLS model fitted in Q3.2. Store it as ols_predict.
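A minimal sketch, reusing fitted_ols from the Q3.2 sketch above:

```python
# Predictions on the held-out test set.
ols_predict = fitted_ols.predict(x_out)
```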

In notebook B2 Regression, we learnt to evaluate model performance with the R-squared, which measures the proportion of variation in the dependent variable explained by the independent variables in the model. However, researchers are often more interested in a model's predictive performance than in its explanatory performance. In that case, the R-squared is not very useful and the Mean Squared Error (MSE) becomes a popular choice.

The $MSE$ takes the following form:

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^{2} $$

where $y$ is the actual value observed, $\hat{y}$ is the predicted value, and $N$ is the total number of observations.

The $MSE$ therefore measures the average squared distance between actual and predicted values within the sample. A lower $MSE$ indicates a better predictive ability of the model.

Q3.5 Build a function that returns the MSE from 2 arguments, predict and actual. Use the function to compute the MSE of the predicted values in Q3.4, save it as MSE_ols and print MSE_ols.
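A sketch of a hand-rolled MSE helper; the argument names follow the question.

```python
import numpy as np

def mse(predict, actual):
    """Average squared distance between predicted and actual values."""
    predict = np.asarray(predict)
    actual = np.asarray(actual)
    return np.mean((actual - predict) ** 2)

MSE_ols = mse(ols_predict, y_out)
print(MSE_ols)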

In fact, the scikit-learn package already provides the mean_squared_error() function for this. Click here to learn how to use it.

Q3.6 Return the MSE in Q3.5 by using the mean_squared_error() function in scikit-learn.
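A minimal check with the built-in helper; it should agree with MSE_ols from the hand-written function above.

```python
from sklearn.metrics import mean_squared_error

print(mean_squared_error(y_out, ols_predict))
```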

Part 4. Regularized Regression

Q4.1 Use the GridSearchCV() function to fit the training set for both the Lasso regression model and the Ridge regression model, with the cross-validation details specified below. Save the fitted models as fitted_lasso and fitted_ridge respectively.

In Lasso() and Ridge():

In GridSearchCV():

Q4.2 For fitted_lasso and fitted_ridge from Q4.1, make predictions with their best tuned models fitted_lasso.best_estimator_ and fitted_ridge.best_estimator_ on the test set. Print out each model's best tuned parameter together with the MSE of its predictions.
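A sketch of Q4.1 and Q4.2 together. Note that the alpha grid and the number of CV folds below are placeholders; substitute the exact settings specified in the question.

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}   # hypothetical grid

# Cross-validated grid search for each regularized model (cv=5 is an assumption).
fitted_lasso = GridSearchCV(Lasso(), param_grid, cv=5,
                            scoring="neg_mean_squared_error").fit(x_in, y_in)
fitted_ridge = GridSearchCV(Ridge(), param_grid, cv=5,
                            scoring="neg_mean_squared_error").fit(x_in, y_in)

# Best tuned parameter and out-of-sample MSE for each model.
for name, model in [("Lasso", fitted_lasso), ("Ridge", fitted_ridge)]:
    pred = model.best_estimator_.predict(x_out)
    print(name, model.best_params_, mean_squared_error(y_out, pred))
```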

You can see that, compared with the OLS model, the Ridge regression model provides slightly better predictions on average, while the Lasso model performs the worst.

Part 5. Classification

In notebook B4 Classification, we focused on applying different classification techniques to two-class scenarios (e.g. high-spending group versus low-spending group, male versus female, ...). What if we want to classify observations into more than 2 categories? Below we will apply the skills we have learnt to a multi-class scenario.

Suppose now we divide the PM2.5 concentration values into 6 levels as follows:

PM2.5 ($\mu g/m^3$)   Air Quality Category   Integer Representing the Category
0-35                  Excellent              1
36-75                 Good                   2
76-115                Slight Pollution       3
116-150               Moderate Pollution     4
151-250               Heavy Pollution        5
>250                  Severe Pollution       6

Q5.1 Define a function that transforms values of the variable pm2.5 into the integers indicating their air quality category (i.e. if pm2.5 <= 35, return 1). Apply the function to y_in and y_out and save the new values as y_in_classification and y_out_classification respectively.

Hint: You can apply a function func on the whole column col of the dataframe df by using the line df["col"].apply(func).
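A sketch of the mapping function, following the six categories in the table above; y_in and y_out are pandas Series here, so apply() works on them directly.

```python
def pm25_to_category(value):
    """Map a PM2.5 reading to its air quality category (1-6)."""
    if value <= 35:
        return 1          # Excellent
    elif value <= 75:
        return 2          # Good
    elif value <= 115:
        return 3          # Slight Pollution
    elif value <= 150:
        return 4          # Moderate Pollution
    elif value <= 250:
        return 5          # Heavy Pollution
    else:
        return 6          # Severe Pollution

y_in_classification = y_in.apply(pm25_to_category)
y_out_classification = y_out.apply(pm25_to_category)
```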

Q5.2 Perform classifications using the following methods (trained on the training set) and save their sets of predictions (on the test set) separately (a code sketch follows the notes below).

1. Logistic regression

2. Naive Bayes

3. Linear Discriminant Analysis (LDA)

4. Support Vector Machine (SVC)

5. Decision Tree

6. Nearest Neighbor

Unless otherwise specified above, use the default parameter values.

Remember that you should use y_in_classification and x_in for training, x_out for predictions, and y_out_classification for assessing the model performances. y_in and y_out are no longer our targets.
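A sketch fitting the six classifiers with scikit-learn, each using its default parameters; collecting the predictions in a dictionary keeps the later reporting compact. The specific estimator classes below (e.g. GaussianNB for Naive Bayes, KNeighborsClassifier for Nearest Neighbor) are assumptions about which variants the course intends.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "Logistic regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Nearest Neighbor": KNeighborsClassifier(),
}

# Train each classifier on the categorized training data and predict the test set.
predictions = {}
for name, clf in classifiers.items():
    clf.fit(x_in, y_in_classification)
    predictions[name] = clf.predict(x_out)
```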

Q5.3 Learn to use the classification_report function from here and print out the reports for the 6 sets of predictions from Q5.2 respectively.
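A minimal sketch, reusing the predictions dictionary from the Q5.2 sketch above.

```python
from sklearn.metrics import classification_report

# One report per classifier, comparing predicted and actual categories.
for name, pred in predictions.items():
    print(name)
    print(classification_report(y_out_classification, pred))
```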

End of assignment