!pip install -q numpy pandas statsmodels scipy matplotlib 

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.regression.gmm import IV2SLS

np.random.seed(2025)


# Simulation: generate the data

# Sample size
n = 100000

# Unobserved ability / family background
ability = np.random.normal(size=n)

# Instrument: z = 1 if "near a college", independent of ability
z = np.random.binomial(1, 0.5, size=n)

# Education decision (endogenous, depends on ability and z)
u_educ = np.random.normal(size=n)
educ = 12 + 2 * z + 1.0 * ability + u_educ   # observed education

# Wage equation (true model)
beta_true = 0.10   # true return to education
gamma = 0.5        # effect of ability on wage
eps = np.random.normal(size=n)

wage = beta_true * educ + gamma * ability + eps

# Put into a DataFrame
df = pd.DataFrame({
    "wage": wage,
    "educ": educ,
    "z": z,
    "ability": ability  # unobserved in practice, so excluded from the regressions
})

print("True beta (return to education):", beta_true)
print("\nFirst few rows of the simulated data:")
print(df.head())

True beta (return to education): 0.1

First few rows of the simulated data:
       wage       educ  z   ability
0  2.617478  11.274131  0 -0.177377
1  2.281824  13.605614  0  1.689619
2  1.068528  10.248895  0 -0.727140
3  0.637553  12.198370  0  1.083520
4 -1.311154  10.372619  0 -1.634474
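
Because ability enters both the education and the wage equation, the omitted variable bias formula gives the value that an OLS regression of wage on educ alone should converge to: beta + gamma * Cov(educ, ability) / Var(educ). The sketch below works this out from the parameters of the data-generating process and checks it against sample moments. It also checks instrument relevance through the difference in mean education between z = 1 and z = 0. It regenerates the data with its own generator so that it runs on its own; the seed, the sample size, and all variable names other than those in the simulation above are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed, separate from the notebook's
n = 200_000                     # assumption: illustrative sample size

# Same data-generating process as the simulation above
ability = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
educ = 12 + 2 * z + 1.0 * ability + rng.normal(size=n)

beta_true, gamma = 0.10, 0.5

# Population moments implied by the DGP:
# Var(educ) = 2^2 * Var(z) + Var(ability) + Var(u_educ) = 4 * 0.25 + 1 + 1 = 3
var_educ = 2**2 * 0.25 + 1.0**2 + 1.0
cov_educ_ability = 1.0  # ability enters educ with coefficient 1 and has variance 1

# Omitted variable bias: OLS converges to beta + gamma * Cov / Var, about 0.2667
ols_plim = beta_true + gamma * cov_educ_ability / var_educ

# Sample analogue of the same expression
sample_cov = np.cov(educ, ability)[0, 1]
sample_var = educ.var(ddof=1)
ols_plim_sample = beta_true + gamma * sample_cov / sample_var

# Relevance check: the instrument should shift mean education by about 2 years
first_stage_gap = educ[z == 1].mean() - educ[z == 0].mean()

print(f"Implied OLS limit (population): {ols_plim:.4f}")
print(f"Implied OLS limit (sample)    : {ols_plim_sample:.4f}")
print(f"First-stage gap in educ       : {first_stage_gap:.4f}")
```

Under these parameters the upward bias is gamma / 3, roughly 0.17, which is larger than the true return of 0.10 itself.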

# 5.3 Estimation: OLS and 2SLS

# OLS
X_ols = sm.add_constant(df["educ"])
ols_res = sm.OLS(df["wage"], X_ols).fit()

print("OLS results (wage ~ educ):")
print(ols_res.summary().tables[1], "\n")

# 2SLS / IV: wage on educ, instrumented by z
y = df["wage"]
X = sm.add_constant(df["educ"])   # endogenous regressor(s) + constant
Z = sm.add_constant(df["z"])      # instruments + constant

iv_res = IV2SLS(y, X, Z).fit()

print("2SLS / IV results (wage ~ educ, instrumented by z):")
print(iv_res.summary(), "\n")

print("Comparison of coefficients:")
print(f"  True beta       : {beta_true: .4f}")
print(f"  OLS beta (educ) : {ols_res.params['educ']: .4f}")
print(f"  2SLS beta       : {iv_res.params['educ']: .4f}")

OLS results (wage ~ educ):
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.1974      0.026    -84.925      0.000      -2.248      -2.147
educ           0.2691      0.002    136.270      0.000       0.265       0.273
============================================================================== 

2SLS / IV results (wage ~ educ, instrumented by z):
                          IV2SLS Regression Results                           
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.095
Model:                         IV2SLS   Adj. R-squared:                  0.095
Method:                     Two Stage   F-statistic:                     800.0
                        Least Squares   Prob (F-statistic):          2.60e-175
Date:                Fri, 14 Nov 2025                                         
Time:                        04:01:08                                         
No. Observations:              100000                                         
Df Residuals:                   99998                                         
Df Model:                           1                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0084      0.046     -0.182      0.856      -0.099       0.082
educ           0.1005      0.004     28.285      0.000       0.094       0.108
==============================================================================
Omnibus:                        0.227   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.893   Jarque-Bera (JB):                0.232
Skew:                           0.003   Prob(JB):                        0.891
Kurtosis:                       2.997   Cond. No.                         99.7
============================================================================== 

Comparison of coefficients:
  True beta       :  0.1000
  OLS beta (educ) :  0.2691
  2SLS beta       :  0.1005
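
The 2SLS point estimate can also be reproduced by hand, which makes the two stages explicit. The sketch below regresses educ on z, then regresses wage on the fitted values. It also computes the Wald ratio, which is numerically identical to 2SLS when there is one binary instrument and no other covariates. The helper `ols_coefs`, the seed, and the sample size are illustrative assumptions, and the data are regenerated so the block runs on its own. The manual second stage yields the right coefficient but not valid standard errors, so an IV routine such as `IV2SLS` is still needed for inference.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumption: arbitrary seed, separate from the notebook's
n = 200_000                     # assumption: illustrative sample size

# Same data-generating process as the simulation above
ability = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
educ = 12 + 2 * z + 1.0 * ability + rng.normal(size=n)
wage = 0.10 * educ + 0.5 * ability + rng.normal(size=n)


def ols_coefs(y, x):
    """Hypothetical helper: [intercept, slope] from regressing y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


# Stage 1: project the endogenous regressor on the instrument
a0, a1 = ols_coefs(educ, z)
educ_hat = a0 + a1 * z

# Stage 2: regress the outcome on the fitted values from stage 1
_, beta_2sls = ols_coefs(wage, educ_hat)

# Wald estimator: reduced-form difference divided by first-stage difference
wald = (wage[z == 1].mean() - wage[z == 0].mean()) / (
    educ[z == 1].mean() - educ[z == 0].mean()
)

# Naive OLS for comparison
_, beta_ols = ols_coefs(wage, educ)

print(f"Manual 2SLS beta: {beta_2sls:.4f}")
print(f"Wald estimate   : {wald:.4f}")
print(f"OLS beta        : {beta_ols:.4f}")
```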

Week 11: Critical Thinking & Causal Inference – Endogeneity & Instrumental Variables (IV)

(Completed version)

Learning goals

1. Motivation: education and earnings

Recap: Omitted Variable Bias (OVB) and the “ideal fix”

2. IV example and directed acyclic graph (DAG): distance to college

3. IV assumptions in the DAG (with intuition)

3.1 Relevance: $Z$ really moves $X$

3.2 Independence: $Z$ is “as‑good‑as‑random” w.r.t. $U$

3.3 Exclusion restriction: $Z$ affects $Y$ only through $X$

4. Two-stage least squares (2SLS) – intuition first

5. OLS vs IV in a simple simulation (ability / family background story)

5.1 Data-generating story

5.2 Plan

📦 Required libraries

5.3 Estimation: OLS vs manual 2SLS

5.4 Interpreting the results

References & Acknowledgments