ECON4130 Assignment 2 - PM2.5 Concentration in Beijing

Introduction

This might be a good time for you to start thinking about what you want to do in your term paper. In this assignment, we will learn how prediction and classification (the two hot topics) can be performed using real data on the PM2.5 concentration in Beijing. We will go through the procedures from loading, inspecting and cleaning data to further analysis, just as in your practical project. Note that although neural networks can also be useful in this context, we will discuss them only in a later assignment. This assignment focuses on techniques discussed in notebooks B2 (Regression), B3 (Cross Validation) and B4 (Classification).

Data

Good dishes cannot be made without good ingredients. In data science, high-quality data are the foundation of empirical research. Therefore, when considering your own research idea, the first thing you might want to do is look for interesting and reliable datasets.

In our previous lecture and assignment, we learnt how to scrape data from websites. Compared with second-hand data, such first-hand data can be more up-to-date and may best fit your needs when no existing dataset is available. However, to get your raw materials ready for analysis, you must go through various data-cleaning procedures, which can be tedious, time-consuming and difficult. You may also run into obstacles such as being unable to reach the data source, or a website design that discourages scraping.

So, before bothering to gather data ourselves, we should always check whether related datasets are already available. Below I recommend some high-quality data sources for machine learning analysis; the data they host are both reliable and ready to use. For those of you who still have no idea what to do in your term paper, browsing these datasets might help you brainstorm new ideas.

  1. Kaggle: https://www.kaggle.com/datasets
  2. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

Of course, you are always welcome to use other data sources, but make sure they are reliable so that your results rest on solid ground. As the saying goes, "garbage in, garbage out": to avoid drawing unreliable or even wrong conclusions, flawed data must be avoided.

Research idea

When formulating your research idea, you may find that most published papers on machine learning are too advanced for beginners to follow, or even to understand. Therefore, besides our past samples, if you need extra material for brainstorming, you can also take a look at http://cs229.stanford.edu/projects.html, the website for Stanford University's computer science course CS229, which showcases projects done by previous students. These student projects might give you a better idea of what you can do in your term paper.

Coverage

In this assignment, we will cover skills mentioned in the following notebooks: B2 (Regression), B3 (Cross Validation) and B4 (Classification).

Important Notes

When writing your answers, remember to include the import statements for any libraries needed to finish the tasks yourself.

Part 1. Loading the Data

As mentioned in the introduction, the UCI Machine Learning Repository provides high-quality data that are both reliable and ready to use. In this exercise, we will use its Beijing PM2.5 concentration data for our analysis. We have provided the dataset on our course website; it was retrieved from here, and a description of the dataset can be found here.

Q1.1 Download the data file PRSA_data_2010.1.1-2014.12.31.csv from the course website and put it in the same folder as your assignment. Load the data and store it with the name raw_data.
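A minimal sketch with pandas is shown below; it assumes the CSV file sits in the same folder as your notebook.

```python
# A sketch only: the file name follows the question, and pandas is assumed
# to be the library used throughout the course.
import pandas as pd

raw_data = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")
print(raw_data.shape)   # quick sanity check on the number of rows and columns
```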

Part 2. Understanding the Data and Data Cleaning

Before proceeding further, you must first understand what we have in the dataset. In your paper, it is likewise important to give a comprehensive description of your data before proceeding to any analysis. Typically, you would have to answer some of the following questions:

Summary statistics, charts and plots can be helpful for presenting your data to readers.

Q2.1 Show the first 30 rows of the raw data.

You can see that the dataset contains 13 variables in total. Details of the variables can be found in its documentation. In this exercise, only 8 of the variables, pm2.5, DEWP, TEMP, PRES, cbwd, Iws, Is and Ir, will be used.

Q2.2 Fetch the columns of the 8 variables pm2.5, DEWP, TEMP, PRES, cbwd, Iws, Is and Ir. Store the result under another name, df_8var.
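One possible way to do this is to select the columns by name, as sketched below (the column labels are assumed to match the documentation exactly, with pm2.5 in lower case).

```python
# A sketch of selecting the 8 variables from raw_data.
cols = ["pm2.5", "DEWP", "TEMP", "PRES", "cbwd", "Iws", "Is", "Ir"]
df_8var = raw_data[cols].copy()   # .copy() avoids chained-assignment warnings later
```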

From the output of Q2.1, you can see that some of the entries are missing. It is not unusual for missing values to occur in a dataset. Depending on the situation, there are multiple ways to deal with them. The most direct and simple way is to drop incomplete observations. In the pandas package, we can use isna() to detect missing values in a dataframe and dropna() to remove these observations.

Q2.3 Check the number of missing entries in each column of df_8var that you created in Q2.2.

Hint: For a dataframe df, the pandas function isna() can be chained with sum(): df.isna().sum() returns the number of NaN values in each column.

You can see that only the variable pm2.5 suffers from missing values.

Q2.4 Drop observations with missing values in any of the columns of df_8var and reset the row index to count from 0 with reset_index(drop=True) in pandas. Name the new dataframe df_8var_NAdrop.

Details of the reset_index() function can be found here.
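A minimal sketch of this step, assuming the df_8var dataframe from Q2.2:

```python
# Drop rows with any missing value, then renumber the index from 0.
df_8var_NAdrop = df_8var.dropna().reset_index(drop=True)
print(df_8var_NAdrop.isna().sum())   # every column should now show zero
```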

Besides, you should also observe that the variable recording the wind direction, cbwd, is a categorical variable with 4 different levels: SE, NE, NW, cv.

Categorical variables cannot be directly used for regressions. Therefore, we have to create dummy variables for it.

Q2.5 Create 3 dummy variables SE, NE and NW (e.g. when cbwd == SE, SE = 1; otherwise it is equal to 0). Concatenate them with the existing dataframe df_8var_NAdrop. Name the new dataframe df_final.

Hint: You can use the get_dummies() function in the pandas package to create the dummy variables; its details can be found here. For concatenation, you may want to use the concat() function from pandas. Read here to learn about the method; you will need to set the axis parameter to finish this task.
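A sketch under the assumption that only the SE, NE and NW dummies are kept, so that cv serves as the omitted base category:

```python
import pandas as pd

# get_dummies() creates one indicator column per level of cbwd;
# we keep only the three requested dummies and join them column-wise.
dummies = pd.get_dummies(df_8var_NAdrop["cbwd"])[["SE", "NE", "NW"]]
df_final = pd.concat([df_8var_NAdrop, dummies], axis=1)   # axis=1 joins columns
```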

The resulting table should look like this (first rows shown, values omitted):

  pm2.5 DEWP TEMP PRES cbwd Iws Is Ir NE NW SE
0
1
2

Now we have prepared a dataframe df_final that is ready for the analyses below.

Part 3. OLS Regression

Q3.1 By using scikit-learn's train_test_split method, split the data into the following sets (a code sketch follows the list below):

- Training set : x_in(independent variables), y_in (dependent variable)

- Test set: x_out (independent variables), y_out (dependent variable)

with a training set size of 0.7 and the random state set to 4130,

where pm2.5 is the dependent variable, which is our target to predict and classify. DEWP, TEMP, PRES, Iws, Is, Ir, NE, NW and SE are the ingredients used to predict and classify the PM2.5 concentration in Beijing, and are therefore the independent variables.
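A minimal sketch of the split; the variable names simply mirror those requested in the question.

```python
from sklearn.model_selection import train_test_split

# Independent variables (features) and dependent variable (target).
X = df_final[["DEWP", "TEMP", "PRES", "Iws", "Is", "Ir", "NE", "NW", "SE"]]
y = df_final["pm2.5"]

# 70% of observations go to the training set; random_state fixes the shuffle.
x_in, x_out, y_in, y_out = train_test_split(
    X, y, train_size=0.7, random_state=4130
)
```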

Q3.2 Perform an Ordinary Least Squares (OLS) regression on the training set, save the fitted model as fitted_ols and print out the resulting coefficients (with intercept).
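One way to do this is sketched below with scikit-learn's LinearRegression; statsmodels' OLS would work equally well if that is what notebook B2 used.

```python
from sklearn.linear_model import LinearRegression

# Fit OLS on the training set; the intercept is included by default.
fitted_ols = LinearRegression().fit(x_in, y_in)
print("Intercept:", fitted_ols.intercept_)
print("Coefficients:", fitted_ols.coef_)
```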

The OLS coefficient vector $\hat{\beta}$ is obtained by the following formula:

$$ \hat{\beta} = (X'X)^{-1}(X'Y) $$

where $X$ is an $N \times K$ matrix storing the values of the independent variables and $Y$ is an $N \times 1$ matrix of the dependent variable. $N$ is the total number of observations and $K$ is the number of independent variables (plus 1 for the intercept term).

Q3.3 Use the matrices X and Y above to compute the coefficients directly with this formula and check whether they match those from Q3.2.
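A sketch of the closed-form solution with NumPy; a column of ones is appended so that the first element of the result corresponds to the intercept.

```python
import numpy as np

X_mat = np.column_stack([np.ones(len(x_in)), x_in])   # N x K design matrix
Y_mat = np.asarray(y_in).reshape(-1, 1)               # N x 1 target matrix

# beta_hat = (X'X)^(-1) (X'Y)
beta_hat = np.linalg.inv(X_mat.T @ X_mat) @ (X_mat.T @ Y_mat)
print(beta_hat.flatten())   # compare with the coefficients from Q3.2
```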

Q3.4 Predict the PM2.5 concentration on the test set with the OLS model fitted in Q3.2. Store it as ols_predict.
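A minimal sketch, reusing fitted_ols from the Q3.2 sketch above:

```python
# Predictions on the held-out test set.
ols_predict = fitted_ols.predict(x_out)
```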

In notebook B2 Regression, we learnt to evaluate model performance with the R-squared, which measures the proportion of variation in the dependent variable explained by the independent variables in the model. However, researchers are often more interested in a model's predictive performance than in its explanatory performance. In that case, the R-squared is not very useful and the Mean Squared Error (MSE) becomes a popular choice.

The $MSE$ takes the following form:

$$ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}_i)^{2} $$

where $y$ is the actual value observed, $\hat{y}$ is the predicted value, and $N$ is the total number of observations.

The $MSE$ therefore measures the average squared distance between actual and predicted values within the sample. A lower $MSE$ indicates a better predictive ability of the model.

Q3.5 Build a function that returns the MSE from 2 arguments, predict and actual. Use the function to compute the MSE of the predicted values in Q3.4, save it as MSE_ols and print MSE_ols.
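A sketch of a hand-rolled MSE helper; the argument names follow the question.

```python
import numpy as np

def mse(predict, actual):
    """Average squared distance between predicted and actual values."""
    predict = np.asarray(predict)
    actual = np.asarray(actual)
    return np.mean((actual - predict) ** 2)

MSE_ols = mse(ols_predict, y_out)
print(MSE_ols)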

In fact, the scikit-learn package already provides the mean_squared_error() function for this. Click here to learn how to use it.

Q3.6 Return the MSE in Q3.5 by using the mean_squared_error() function in scikit-learn.
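A minimal check with the built-in helper; it should agree with MSE_ols from the hand-written function above.

```python
from sklearn.metrics import mean_squared_error

print(mean_squared_error(y_out, ols_predict))
```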

Part 4. Regularized Regression

Q4.1 Use the GridSearchCV() function to fit the training set for both the Lasso regression model and the Ridge regression model, with the cross-validation details specified below. Save the fitted models as fitted_lasso and fitted_ridge respectively.

In Lasso() and Ridge():

In GridSearchCV():

Q4.2 For fitted_lasso and fitted_ridge from Q4.1, make predictions with their best tuned models fitted_lasso.best_estimator_ and fitted_ridge.best_estimator_ on the test set. Print out each model's best tuned parameter together with the MSE of its predictions.
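A sketch of Q4.1 and Q4.2 together. Note that the alpha grid and the number of CV folds below are placeholders; substitute the exact settings specified in the question.

```python
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

param_grid = {"alpha": [0.01, 0.1, 1, 10, 100]}   # hypothetical grid

# Cross-validated grid search for each regularized model (cv=5 is an assumption).
fitted_lasso = GridSearchCV(Lasso(), param_grid, cv=5,
                            scoring="neg_mean_squared_error").fit(x_in, y_in)
fitted_ridge = GridSearchCV(Ridge(), param_grid, cv=5,
                            scoring="neg_mean_squared_error").fit(x_in, y_in)

# Best tuned parameter and out-of-sample MSE for each model.
for name, model in [("Lasso", fitted_lasso), ("Ridge", fitted_ridge)]:
    pred = model.best_estimator_.predict(x_out)
    print(name, model.best_params_, mean_squared_error(y_out, pred))
```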

You can see that, compared with the OLS model, the Ridge regression model provides slightly better predictions on average, while the Lasso model performs the worst.

Part 5. Classification

In notebook B4 Classification, we focused on applying different classification techniques to two-class scenarios (e.g. high-spending group versus low-spending group, male versus female, ...). What if we want to classify observations into more than 2 categories? Below we will apply the skills we have learnt to a multi-class scenario.

Suppose now we divide the PM2.5 concentration values into 6 levels as follows:

PM2.5 ($\mu g/m^3$)   Air Quality Category   Integer Representing the Category
0-35                  Excellent              1
36-75                 Good                   2
76-115                Slight Pollution       3
116-150               Moderate Pollution     4
151-250               Heavy Pollution        5
>250                  Severe Pollution       6

Q5.1 Define a function that transforms values of the variable pm2.5 into the integers indicating their air quality category (i.e. if pm2.5 <= 35, return 1). Apply the function to y_in and y_out and save the new values as y_in_classification and y_out_classification respectively.

Hint: You can apply a function func on the whole column col of the dataframe df by using the line df["col"].apply(func).
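A sketch of the mapping function, following the six categories in the table above; y_in and y_out are pandas Series here, so apply() works on them directly.

```python
def pm25_to_category(value):
    """Map a PM2.5 reading to its air quality category (1-6)."""
    if value <= 35:
        return 1          # Excellent
    elif value <= 75:
        return 2          # Good
    elif value <= 115:
        return 3          # Slight Pollution
    elif value <= 150:
        return 4          # Moderate Pollution
    elif value <= 250:
        return 5          # Heavy Pollution
    else:
        return 6          # Severe Pollution

y_in_classification = y_in.apply(pm25_to_category)
y_out_classification = y_out.apply(pm25_to_category)
```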

Q5.2 Perform classifications using the following methods (trained on the training set) and save their sets of predictions (on the test set) separately (a code sketch follows the notes below).

1. Logistic regression

2. Naive Bayes

3. Linear Discriminant Analysis (LDA)

4. Support Vector Machine (SVC)

5. Decision Tree

6. Nearest Neighbor

Unless otherwise specified above, use the default parameter values.

Remember that you should use y_in_classification and x_in for training, x_out for predictions, and y_out_classification for assessing the model performances. y_in and y_out are no longer our targets.
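A sketch fitting the six classifiers with scikit-learn, each using its default parameters; collecting the predictions in a dictionary keeps the later reporting compact. The specific estimator classes below (e.g. GaussianNB for Naive Bayes, KNeighborsClassifier for Nearest Neighbor) are assumptions about which variants the course intends.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "Logistic regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Nearest Neighbor": KNeighborsClassifier(),
}

# Train each classifier on the categorized training data and predict the test set.
predictions = {}
for name, clf in classifiers.items():
    clf.fit(x_in, y_in_classification)
    predictions[name] = clf.predict(x_out)
```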

Q5.3 Learn to use the classification_report function from here and print out the reports for the 6 sets of predictions from Q5.2 respectively.
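A minimal sketch, reusing the predictions dictionary from the Q5.2 sketch above.

```python
from sklearn.metrics import classification_report

# One report per classifier, comparing predicted and actual categories.
for name, pred in predictions.items():
    print(name)
    print(classification_report(y_out_classification, pred))
```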

End of assignment