📘 ECON 320 Lab Problem Set 1¶

  • Name: [Your Name]

  • Lab Section: [Your Lab Section Here]

Please submit this problem set on Canvas as an HTML or PDF file.¶


This assignment builds on:¶

  • Week 1: Descriptive Statistics & Basic Python Coding
  • Week 2: Understanding & Presenting Data
  • Data: J.M. Wooldridge (2019), Introductory Econometrics: A Modern Approach, 7th edition, Cengage Learning.

You will practice summary statistics, basic data cleaning, and choosing appropriate visualizations.


🎯 Learning Objectives¶

By the end of this assignment, you should be able to:

  1. Compute and interpret summary statistics.
  2. Perform basic data cleaning.
  3. Choose and create appropriate data visualizations.
  4. Reflect on the difference between correlation and causation.

📝 Grading (Total = 10 points)¶

  • Q1: Summary statistics — 2 pts
  • Q2: Data cleaning — 2 pts
  • Q3 (parts 1–4): Visualizations — 4 pts
  • Q3 (part 5): Critical thinking — 2 pts

🔧 Setup¶

⚠️ Please run the cells in this section before starting the problem set!¶

📦 Download and import required libraries¶
In [ ]:
%pip install wooldridge matplotlib seaborn

import wooldridge as wr
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Libraries ready.")

📥 Load the dataset¶
  • In this problem set, we will use the econmath dataset from the wooldridge package.

  • It contains information on students from a large college course in introductory microeconomics.

In [ ]:
# Show dataset description (variables, source, sample) for the "econmath" dataset
wr.data("econmath", description=True)
In [ ]:
# Load DataFrame for analysis
df_raw = wr.data("econmath")
df_raw.head()

❓ Q1 — Summary Statistics (2 pts, 2 sub-questions)¶

1. Use .describe() to summarize the variables.¶
In [ ]:
# Put your answer here
2. Report mean, std, min, max for score, actmth, and acteng.¶
In [ ]:
# Put your answer here
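
💡 Illustration (toy data, not the econmath dataset): the sketch below shows one way to get selected statistics for selected columns in pandas. The DataFrame `toy` and its values are made up for this example; `.agg()` with a list of statistic names is one option among several (filtering the output of `.describe()` is another).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    "score": [70.0, 80.0, 90.0],
    "actmth": [20.0, 24.0, 28.0],
    "hours": [1.0, 2.0, 3.0],
})

# Select the columns of interest, then compute the requested statistics.
# Rows of the result are statistics; columns are variables.
summary = toy[["score", "actmth"]].agg(["mean", "std", "min", "max"])
print(summary)
```

Note that pandas reports the sample standard deviation (dividing by n − 1) by default.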

❓ Q2 — Cleaning data (2 pts, 3 sub-questions)¶

1. Check for missing values.¶

💡 Hint:

  • You can use <dataframe name>.isna().sum() to count missing values in each column.

  • In this case, the dataframe name is df_raw.
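
💡 Illustration (toy data, not the econmath dataset): a minimal sketch of how `.isna().sum()` behaves. The DataFrame `toy` and its values are invented for this example. `.isna()` marks each cell as True or False, and `.sum()` counts the True values in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "score": [70.0, 80.0, 90.0, 60.0],
    "actmth": [20.0, np.nan, 28.0, 22.0],
    "acteng": [np.nan, np.nan, 25.0, 21.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of missing values in each column (mean of True/False values)
missing_share = toy.isna().mean()
print(missing_share)
```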

In [ ]:
# Put your answer here
2. For this assignment, assume the missing ACT scores are missing at random.¶
  • Drop all rows with missing ACT scores (actmth or acteng).

  • Report how many rows remain after dropping missing values.

In [ ]:
# Solution is provided for this step

df_clean = df_raw.dropna(subset=['actmth', 'acteng'])
print(f"Rows remaining after dropping missing ACT scores: {df_clean.shape[0]}")
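
💡 Illustration (toy data, not the econmath dataset): the sketch below shows what `dropna(subset=...)` does on a small made-up DataFrame. With the default `how="any"`, a row is dropped if any of the listed columns is missing; missing values in columns outside `subset` do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each of the first, second, and fourth rows
# has exactly one missing value, in different columns
toy = pd.DataFrame({
    "score": [70.0, 80.0, 90.0, np.nan],
    "actmth": [20.0, np.nan, 28.0, 22.0],
    "acteng": [np.nan, 23.0, 25.0, 21.0],
})

# Drop rows where actmth OR acteng is missing (default how="any")
toy_clean = toy.dropna(subset=["actmth", "acteng"])
n_dropped = len(toy) - len(toy_clean)

print(f"Rows before: {len(toy)}, after: {len(toy_clean)}, dropped: {n_dropped}")
# The row with a missing score is kept, because score is not in subset
print(toy_clean)
```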
3. Create a new DataFrame (name it df_analysis) based on df_clean to keep only the following relevant variables for the rest of the assignment:¶
  • score (score in the introductory microeconomics course)

  • actmth (ACT math score)

  • acteng (ACT English/verbal score)

In [ ]:
# Put your answer here

❓ Q3 — Visualizing the Data (6 pts, 5 sub-questions)¶

Using the df_analysis DataFrame created in Q2, generate the following plots to explore the data:

1. Histogram of score. (1 pt)¶
In [ ]:
# Put your answer here
# 1) Histogram of score
2. Boxplot of score. (1 pt)¶
In [ ]:
# Put your answer here
# 2) Boxplot of score

3. Scatter plot of score vs actmth (with best-fit line). (1 pt)¶

In [ ]:
# Put your answer here
# 3) Scatter: score vs actmth
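
💡 Illustration (toy data, not the econmath dataset): a "best-fit line" here is assumed to mean the least-squares line from a simple linear fit. The sketch below computes its slope and intercept with `np.polyfit` on made-up arrays `x` and `y`.

```python
import numpy as np

# Hypothetical toy data: x plays the role of an ACT score, y of a course score
x = np.array([18.0, 20.0, 22.0, 24.0, 26.0])
y = np.array([60.0, 66.0, 70.0, 76.0, 78.0])

# Degree-1 polynomial fit returns the coefficients [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted values along the line, one per x value
fitted = slope * x + intercept

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

To draw the result with the libraries imported in Setup, `plt.scatter(x, y)` followed by `plt.plot(x, fitted)` overlays the line on the points. Alternatively, `sns.regplot` draws a scatter plot and a fitted regression line in a single call.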
4. Scatter plot of score vs acteng (with best-fit line). (1 pt)¶
In [ ]:
# Put your answer here
# 4) Scatter: score vs acteng
  5. ✍️ Brief interpretation (2–4 sentences). (2 pts, open-ended)
  • What do you notice about the distribution of score?
  • Are score and ACT scores positively related?
  • What caveats apply, given that correlation does not imply causation?
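
💡 Illustration (toy data, not the econmath dataset): one way to put a number on "positively related" is the Pearson correlation coefficient. The sketch below uses made-up values; `Series.corr` computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the calculation
toy = pd.DataFrame({
    "score": [60.0, 66.0, 70.0, 76.0, 78.0],
    "actmth": [18.0, 20.0, 22.0, 24.0, 26.0],
})

# Pearson correlation between the two columns (ranges from -1 to 1)
r = toy["score"].corr(toy["actmth"])
print(f"Correlation between score and actmth: {r:.3f}")
```

A positive coefficient describes how the two variables move together in the sample. By itself it says nothing about whether one causes the other.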

✍️ Put your answer here.


End of Problem Set.