Background¶
Ka Yan believes that including more variables always makes the model better because $R^2$ goes up. Today, you’ll use a real Wooldridge housing dataset to check that idea and see what can go wrong.
Instructions¶
- Run the setup cell below to load the data (wooldridge.hprice1).
- Keep answers short and clear. Aim to finish in 10–15 minutes.
In [4]:
!pip install wooldridge statsmodels pandas numpy --quiet
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import wooldridge as woo
# Load Wooldridge housing data
df = woo.data('hprice1').dropna().copy()
df.head()
Out[4]:
| | price | assess | bdrms | lotsize | sqrft | colonial | lprice | lassess | llotsize | lsqrft |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 300.0 | 349.100006 | 4 | 6126.0 | 2438 | 1 | 5.703783 | 5.855359 | 8.720297 | 7.798934 |
| 1 | 370.0 | 351.500000 | 3 | 9903.0 | 2076 | 1 | 5.913503 | 5.862210 | 9.200593 | 7.638198 |
| 2 | 191.0 | 217.699997 | 3 | 5200.0 | 1374 | 0 | 5.252274 | 5.383118 | 8.556414 | 7.225482 |
| 3 | 195.0 | 231.800003 | 3 | 4600.0 | 1448 | 1 | 5.273000 | 5.445875 | 8.433811 | 7.277938 |
| 4 | 373.0 | 319.100006 | 4 | 6095.0 | 2514 | 1 | 5.921578 | 5.765504 | 8.715224 | 7.829630 |
Variable descriptions (from Wooldridge hprice1):¶
- price: house price (thousands of USD).
- lotsize: lot size (square feet).
- sqrft: house size (square feet of finished area).
- bdrms: number of bedrooms.
- assess: assessed value of the house (thousands of USD).
- colonial: indicator for colonial‑style house (1 = colonial).
- lprice, lassess, llotsize, lsqrft: natural logs of price, assess, lotsize, sqrft.
Task (10–15 minutes)¶
Q1. Fit Ka Yan’s “everything” model and report fit¶
Use price as the outcome and include at least these regressors: lotsize, sqrft, bdrms.
- Report R².
In [ ]:
# Put your answer here
Q2. Quick diagnostic: correlations & VIF (exclude const)¶
- Print the pairwise correlation matrix for the regressors.
- Compute VIF for each regressor.
In [5]:
# Put your answer here: Correlations among regressors
# Put your answer here: VIFs
Q3. Short write‑up¶
Answer briefly (2-3 sentences):
- Which regressor pairs have the highest correlations, and which regressors have the highest VIFs?
- The single change you would make and why.
- What would you expect to happen to R² if you made that change? (Feel free to check in a code cell but it is not required!)
- What would you tell Ka Yan about her idea that “including more variables always makes the model better”?
Put your answer here:¶
End of Lab Exercise.