ECON320LAB — Week 7: Omitted Variable Bias (OVB) (Complete version)¶


🔄 Connecting from last lecture — Multicollinearity and Model Specification 🎯¶

  • Last week, we saw that when included variables overlap too much, we get multicollinearity —

    • coefficients become unstable, standard errors get large, and we might choose to simplify the model by dropping redundant variables.
  • This week, we focus on the opposite concern:

    • leaving out a relevant variable (especially one correlated with our regressor of interest) causes Omitted Variable Bias (OVB) —

    • our estimates become biased, even if they appear statistically precise.

  • Multicollinearity: included variables overlap too much → large SEs, but no bias.
  • OVB: a relevant variable is omitted → coefficients become biased.

💡 The art of model specification is to balance both risks — include variables that meaningfully explain $y$, without overloading the model with redundant information.


1) 🧠 Quick Review: What is Omitted Variable Bias (OVB)?¶

Before we start coding, recall one of the key OLS assumptions:

Exogeneity: $$E[u|X] = 0$$
The error term has a mean of zero conditional on the regressors.

💡 Intuition:

  • If a relevant variable is missing from the model and is systematically related to $X$, its effect sneaks into the error term — so the error no longer has mean zero conditional on $X$.
  • That’s when Omitted Variable Bias (OVB) arises.

🧩 Model setup¶

Suppose the true model for wages is
$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, $$
where

  • $y$ = wage (outcome variable)
  • $x_1$ = education (included regressor)
  • $x_2$ = ability (omitted regressor; innate ability — affects wage, correlated with $x_1$, but not caused by education)
  • $u$ = other random factors

Under the true model assumption (exogeneity),
$$ E[u|x_1, x_2] = 0, $$
meaning the unobserved factors $u$ are not systematically related to either regressor in the correctly specified model.


If we omit $x_2$ and estimate the short model
$$ y = \alpha_0 + \alpha_1 x_1 + e, $$
then the new error term is
$$ e = u + \beta_2 x_2. $$

Because $x_2$ (ability) is systematically related to $x_1$ (education), $E[e|x_1] \neq 0$ — violating the exogeneity assumption.


🧮 Why $E[e|x_1] \neq 0$¶

From the short model:
$$ E[e|x_1] = E[u|x_1] + \beta_2 E[x_2|x_1]. $$

Under the true model assumption, $E[u|x_1, x_2] = 0$, which implies $E[u|x_1] = 0$ by the law of iterated expectations, so
$$ E[e|x_1] = \beta_2 E[x_2|x_1]. $$

For exogeneity to hold, this must be zero — which happens only if either

  • $\beta_2 = 0$ (the omitted variable has no effect on $y$), or
  • $E[x_2|x_1] = 0$ (the omitted variable is unrelated to $x_1$).

If neither condition holds (both $\beta_2 \neq 0$ and $E[x_2|x_1] \neq 0$), then $E[e|x_1] \neq 0$ and OLS is biased.

💡 Intuition:
You can think of $E[x_2|x_1]$ as the fitted value from regressing $x_2$ on $x_1$.
Therefore, if the omitted variable both affects $y$ ($\beta_2 \neq 0$) and is related to $x_1$ ($E[x_2|x_1] \neq 0$), its effect ($\beta_2 x_2$) seeps into the error term $e$ — the essence of Omitted Variable Bias.


🔹 The Bias Formula¶

When a relevant variable $x_2$ is omitted, the bias in the estimated coefficient on $x_1$ is:
$$ \text{Bias}(\hat{\alpha}_1) = \beta_2\,\frac{\operatorname{Cov}(x_1,x_2)}{\operatorname{Var}(x_1)}. $$

💡 Intuition:

  • $\beta_2$: how strongly the omitted variable ($x_2$) affects $y$

  • $\displaystyle \frac{\operatorname{Cov}(x_1,x_2)}{\operatorname{Var}(x_1)}$: how closely $x_2$ is related to $x_1$ (often written as $\delta$, the slope from regressing $x_2$ on $x_1$)
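
To make $\delta$ concrete: it is the slope coefficient of the auxiliary regression of the omitted variable on the included one (the same regression that reappears in Section 5):
$$ x_2 = \gamma_0 + \delta\,x_1 + r, \qquad \delta = \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}. $$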


🔹 Direction of the Bias¶

  • If both $\beta_2$ and $\operatorname{Cov}(x_1,x_2)$ have the same sign
    → OLS overstates the effect of $x_1$ (positive bias) ⬆️

  • If they have opposite signs
    → OLS understates or may even reverse the effect (negative bias) ⬇️

Example:

  • If more educated people also tend to have higher ability ($\operatorname{Cov}(x_1,x_2) > 0$) and ability raises wages ($\beta_2 > 0$),

  • then the short model wrongly attributes some of ability’s effect to education —

  • overestimating the return to schooling.
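
As a concrete preview of the simulation below, plug in the values we will use in our DGP: an ability effect of $\beta_2 = 5.0$ and an education-ability slope of $\delta = 0.6$. Then
$$ \text{Bias}(\hat{\alpha}_1) = \beta_2\,\delta = 5.0 \times 0.6 = 3.0 > 0, $$
so the short model overstates the return to education by about 3 wage units.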


📦 Required libraries¶

Now, let's start coding to verify OVB in a simulated dataset.

In [ ]:
!pip install numpy pandas statsmodels --quiet

import numpy as np
import pandas as pd
import statsmodels.api as sm

# For reproducibility
rng = np.random.default_rng(123)

2) Setup & Toy Data — Education, Ability, and Wages¶

We simulate a world where ability affects wages and is positively correlated with education: $$ \begin{aligned} \text{ability} &= \delta\cdot \text{educ} + \varepsilon_z,\quad \delta=0.6 \\ \text{wage} &= \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{ability} + \beta_3\,\text{exper} + u. \end{aligned} $$

We will omit ability in the short model and see the bias on the education coefficient.


🧩 Data Generating Process (DGP)¶

We’ll now simulate the data according to the equations above.

Parameter settings used below:
$$ \beta_0\;\text{(intercept)} = 10.0 $$

$$ \beta_1\;\text{(educ effect)} = 1.0\;\text{ (modest)} $$

$$ \beta_2\;\text{(ability effect)} = 5.0\;\text{ (strong)} $$

$$ \beta_3\;\text{(exper effect)} = 0.5\;\text{ (moderate)} $$

Error terms (distributions):

$$ \varepsilon_z \sim \mathcal{N}(0, 1) $$

$$ u \sim \mathcal{N}(0, 1) $$

In [15]:
# --- Data Generating Process (DGP): wage ~ educ + ability + exper ---

n = 8000

# Core regressors
educ   = rng.integers(10, 25, size=n)                # years of schooling
exper  = rng.integers(0, 50,  size=n)                # years of work experience

# Omitted factor: ability, positively correlated with educ (delta ≈ 0.6)
eps_z = rng.normal(0, 1, n)  # ability noise
delta = 0.6
ability = delta * educ + eps_z

# True DGP parameters (chosen for dramatic OVB contrast)
beta0 = 10.0
beta1 = 1.0   # modest education effect
beta2 = 5.0   # strong ability effect
beta3 = 0.5   # moderate experience effect

# Wage error term
u = rng.normal(0, 1, n)  # wage noise

# Generate wage (levels)
wage = beta0 + beta1*educ + beta2*ability + beta3*exper + u

# Bundle into a DataFrame
df = pd.DataFrame({
    "wage": wage,
    "educ": educ,
    "ability": ability,   # will be omitted in the short model
    "exper": exper
})


df.head()
Out[15]:
         wage  educ    ability  exper
0  104.135775    19  13.557308     12
1  117.799802    22  12.372861     47
2   92.422357    16   9.433475     43
3   88.381337    16   9.481032     28
4   83.330458    17  11.191150      0
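
As a quick sanity check (a small addition, not part of the original lab code), educ and ability should come out strongly, but not perfectly, correlated under this DGP:

In [ ]:
# Sanity check: educ and ability should be strongly (but not perfectly) correlated
df[["educ", "ability"]].corr()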

3) Estimate the short model (omit ability)¶

In [16]:
y = df["wage"]
X_short = sm.add_constant(df[["educ","exper"]])
m_short = sm.OLS(y, X_short).fit()
m_short.params
Out[16]:
const    9.979546
educ     4.004190
exper    0.500020
dtype: float64

4) Estimate the long model (add ability)¶

In [17]:
X_long = sm.add_constant(df[["educ","exper","ability"]])
m_long = sm.OLS(y, X_long).fit()
m_long.params
Out[17]:
const      10.033981
educ        0.992020
exper       0.500760
ability     5.007569
dtype: float64
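
A convenient way to see the contrast is a side-by-side table of the true parameters against both sets of estimates. This is a small sketch added for illustration; `truth` is a helper Series, not part of the original lab:

In [ ]:
# Compare true parameters with short- and long-model estimates.
# `truth` is a helper Series added for illustration; pandas aligns on the index,
# so the short model's missing `ability` row shows up as NaN.
truth = pd.Series({"const": beta0, "educ": beta1, "exper": beta3, "ability": beta2})
pd.DataFrame({"true": truth, "short": m_short.params, "long": m_long.params})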

5) Which assumption does OVB violate?¶

In the short model (when we omit ability), the error term becomes
$$ e = u + \beta_2\,\text{ability}. $$

Because ability is correlated with educ in our DGP, the short-model error is correlated with a regressor:
$$ \mathbb{E}[e \mid \text{educ}] \ne 0. $$

This violates the exogeneity / zero conditional mean assumption of OLS.

Therefore:

  • OLS is biased,
  • and the bias is given by
    $$ \text{Bias}(\hat{\alpha}_1) = \beta_2 \times \delta, $$ where $\delta$ is the slope from the regression of ability on educ: $$ \text{ability} = \gamma_0 + \delta\,\text{educ} + r. $$

In other words, the bias equals the omitted variable’s effect on $y$ ($\beta_2$) times how strongly that variable is related to the included regressor (educ).


The difference between the short- and long-model coefficients on educ should be approximately equal to this theoretical bias. Let's verify that in our simulated data.

In [18]:
# --- Empirical Bias via β2 × δ ---

# 1) Theoretical bias
bias_formula = beta2 * delta
print(f"Theoretical OVB (β2 × δ): {bias_formula:.4f}")

# 2) Empirical difference in estimated coefficients
emp_bias = m_short.params["educ"] - m_long.params["educ"]
print(f"Empirical OVB (short - long): {emp_bias:.4f}")
Theoretical OVB (β2 × δ): 3.0000
Empirical OVB (short - long): 3.0122
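
We can also estimate $\delta$ directly from the auxiliary regression of ability on the included regressors (a short sketch added here, not from the original lab code):

In [ ]:
# Auxiliary regression: ability ~ educ + exper, to estimate delta empirically
aux = sm.OLS(df["ability"], sm.add_constant(df[["educ", "exper"]])).fit()
delta_hat = aux.params["educ"]
print(f"Estimated delta: {delta_hat:.4f}")  # should be close to the true 0.6
print(f"beta2_hat x delta_hat: {m_long.params['ability'] * delta_hat:.4f}")

In sample, $\hat{\beta}_2 \times \hat{\delta}$ reproduces the short-long gap exactly (up to rounding); this is the algebraic version of the bias formula.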

6) ✅ Should we include ability?¶

Goal: estimate the effect of educ on wage as accurately as possible.

Decision guide:

  1. If a variable affects wage and is plausibly correlated with educ → include it.
  2. If it is caused by educ → omit it for the total effect.
  3. If it’s redundant (adds no new info) → you may drop it.

In our example:

  • ability is a true determinant of wage and correlated with education.
  • So, the long model (with ability) gives the unbiased estimate of the effect of education.
  • The short model omits ability and thus suffers OVB — used here only for illustration.

But what if “ability” is shaped by education?

  • If “ability” reflects innate skills or pre-education factors → include it (or a proxy).
  • If “ability” reflects skills developed through schooling → omit it if your goal is the total effect of education.

7) ⚠️ What if it’s perfect multicollinearity?¶

  • Our example has strong but imperfect correlation (ability ≈ 0.6 × educ + noise), so OLS can still estimate separate effects — just with higher standard errors.

  • But if the relationship were perfect (say ability = 0.6 × educ exactly), then educ and ability would carry identical information.

In that case:

  • The model suffers perfect multicollinearity —
    • OLS cannot compute separate coefficients (no unique estimates exist), and you must drop one of the variables, or redefine them, to avoid redundancy.
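
A minimal sketch of what this looks like in code, assuming we rebuild ability without its noise term (`ability_exact` and `X_perf` are names introduced here for illustration):

In [ ]:
# Minimal sketch (assumed setup): rebuild ability with NO noise term.
# The design matrix then loses full column rank, so no unique OLS solution exists.
ability_exact = 0.6 * educ   # perfectly collinear with educ
X_perf = sm.add_constant(pd.DataFrame({"educ": educ, "ability": ability_exact}))
print(np.linalg.matrix_rank(X_perf.to_numpy()))  # 2 rather than 3 -> rank deficient

(In practice, statsmodels falls back on a pseudo-inverse and still prints numbers, so checking the rank or the condition number is the clearer diagnostic.)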

8) Wrap‑up — Balancing model specification ⚖️¶

  • Too many overlapping variables → multicollinearity 😵‍💫 → large SEs, unstable estimates
  • Too few relevant variables → OVB 🎯 → biased and misleading coefficients

When you design or refine a regression model, always keep both Omitted Variable Bias and Multicollinearity in mind 🤝 —

  • Our goal is not just a higher $R^2$, but a well‑specified, interpretable model 💡

In short:

  • A well‑specified model finds the sweet spot — include variables that are relevant and conceptually justified, but avoid redundant overlap that adds noise without insight.
  • Good econometrics is not about “more” or “less” — it’s about clear reasoning and purposeful inclusion.

✅ In conclusion¶

  • Multicollinearity → happens when you include redundant information.

    • Variables overlap in what they explain → coefficients are unstable, but still unbiased.
  • Omitted Variable Bias (OVB) → happens when you omit relevant information.

    • A missing variable affects $y$ and is correlated with an included regressor → coefficients become biased.

💡 Key takeaway:

  • Multicollinearity = too much of the same information.

  • OVB = missing important information.

  • A good model strikes the balance: include variables that truly matter, but not redundant ones.


References & Acknowledgments¶

  • This teaching material was prepared with the assistance of OpenAI's ChatGPT (GPT-5).

End of lecture notebook.