Background¶
You’re given test scores from two lab sessions (A and B, which are taught by instructors Alice and Bob, respectively).
Students come in with different prep levels (“High” vs “Low”).
Instructions¶
Before answering the questions below, run the code cell to generate the dataset.
The cell below creates a sample dataset for you to work with. Run the cell to generate the data, but do not modify it.
# **************** You don't need to modify anything in this cell.*****************
# **************** Just run it to generate the dataset. *****************
import numpy as np
import pandas as pd
rng = np.random.default_rng(7)
n_A, n_B = 120, 120
prep_A = rng.choice(["High", "Low"], size=n_A, p=[0.8, 0.2])
prep_B = rng.choice(["High", "Low"], size=n_B, p=[0.3, 0.7])
def gen_scores(prep, mean_high_A=85, mean_low_A=72, mean_high_B=88, mean_low_B=74, sd=6, instructor="A"):
    means = np.where(prep=="High",
                     mean_high_A if instructor=="Alice" else mean_high_B,
                     mean_low_A if instructor=="Alice" else mean_low_B)
    return rng.normal(loc=means, scale=sd, size=len(prep))
scores_A = gen_scores(prep_A, instructor="Alice")
scores_B = gen_scores(prep_B, instructor="Bob")
df = pd.DataFrame({
    "instructor": np.r_[np.repeat("Alice", n_A), np.repeat("Bob", n_B)],
    "prep": np.r_[prep_A, prep_B],
    "score": np.r_[scores_A, scores_B]
})
- Take a look at the dataset by running the following code cell.
# **************** You don't need to modify anything in this cell.*****************
# **************** Just run it to preview the first five rows of the dataset. *****************
df.head(5)
Task (15-20 minutes)¶
Q1. Find the number of students in each lab section (A and B)¶
Your task is to calculate how many students are in each lab section (A and B).
Store the results in the variables n_A and n_B.
💡 Hints:
- You can filter the dataframe by instructor, e.g.:
    df[df["instructor"] == "Alice"]
    df[df["instructor"] == "Bob"]
- After filtering, you can count the rows using any of these:
  - len(df_filtered) → counts rows directly
  - df_filtered.shape[0] → returns the number of rows
  - df_filtered["score"].count() → counts non-missing scores only
- Pick whichever method you prefer!
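As a minimal sketch of the three counting methods in the hints, the example below uses a small made-up table. The names toy and toy_alice and all values are assumptions for illustration only; they are not part of the lab dataset.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset); one score is missing on purpose
toy = pd.DataFrame({
    "instructor": ["Alice", "Alice", "Bob", "Bob", "Bob"],
    "score": [80.0, None, 75.0, 90.0, 60.0],
})

# Keep only the rows taught by Alice
toy_alice = toy[toy["instructor"] == "Alice"]

n_rows_len = len(toy_alice)            # counts rows directly
n_rows_shape = toy_alice.shape[0]      # first element of (rows, columns)
n_scores = toy_alice["score"].count()  # counts non-missing scores only

print(n_rows_len, n_rows_shape, n_scores)  # prints: 2 2 1
```

The first two methods agree because both count rows. The third gives a smaller number here only because the toy table contains a missing score; with no missing values all three would match.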
# Put your answer for Q1 here:
# Print the answers (NO NEED TO MODIFY)
print("The number of students in lab A is:", n_A)
print("The number of students in lab B is:", n_B)
You should expect to see the following output after running the code cell above:
The number of students in lab A is: 120
The number of students in lab B is: 120
Q2. Compute the mean test score for each lab section (A and B)¶
Using the dataset provided, compute the mean test score separately for:
- Instructor Alice → Lab Section A
- Instructor Bob → Lab Section B
💡 Hint:
Use the groupby() function in combination with mean() in pandas.
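To illustrate how groupby() and mean() fit together, here is a minimal sketch on a made-up table. The names toy and toy_means and the scores are assumptions for illustration, not the lab data.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset)
toy = pd.DataFrame({
    "instructor": ["Alice", "Alice", "Bob", "Bob"],
    "score": [80.0, 90.0, 70.0, 76.0],
})

# Split the rows by instructor, then average the score column within each group
toy_means = toy.groupby("instructor")["score"].mean()

print(toy_means)
```

The result is a pandas Series indexed by instructor, so a single value can be read with toy_means["Alice"].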
# Put your answer for Q2 here:
Q3. Calculate the average test score by prep level and instructor¶
Now, let’s find the mean test score for each prep level (High vs. Low) within each instructor’s lab section.
💡 Hint:
Use the groupby() function in combination with mean() in pandas, for example:
    df.groupby(["prep", "instructor"])["score"].mean().reset_index()
- The first argument ["prep", "instructor"] groups the data by both columns.
- The ["score"].mean() part calculates the average score for each group.
- The reset_index() part is optional but helps to convert the result back to a DataFrame.
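The effect of grouping by two columns and then calling reset_index() can be seen in the sketch below. It runs on a made-up table; the names toy and toy_grouped and all scores are assumptions for illustration, not the lab data.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset)
toy = pd.DataFrame({
    "prep": ["High", "High", "Low", "High", "Low", "Low"],
    "instructor": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "score": [90.0, 80.0, 70.0, 88.0, 72.0, 76.0],
})

# One group per (prep, instructor) pair; mean() averages the scores in each
toy_grouped = toy.groupby(["prep", "instructor"])["score"].mean().reset_index()

# After reset_index(), the group keys are ordinary columns again
print(toy_grouped)
```

Without reset_index() the result would be a Series with a two-level index (prep, instructor); with it, the result is a flat DataFrame with one row per group, which is often easier to read and compare.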
# Put your answer for Q3 here:
Q4. Count how many high-prep students are in each instructor's lab section¶
Let's see whether the number of high-prep students differs between Instructor Alice's and Instructor Bob's lab sections.
Count the number of students with prep == "High" in each lab section.
💡 Hint:
Use groupby() with a filter:
    df[df["prep"] == "High"].groupby("instructor")["prep"].count().reset_index()
This filters for students with prep == "High" and counts them by instructor.
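The filter-then-count pattern from the hint can be sketched on a made-up table as follows. The names toy, high_counts, and n_high are assumptions for illustration, not part of the lab data.

```python
import pandas as pd

# Hypothetical toy data (NOT the lab dataset)
toy = pd.DataFrame({
    "instructor": ["Alice", "Alice", "Alice", "Bob", "Bob", "Bob"],
    "prep": ["High", "High", "Low", "High", "Low", "Low"],
})

# Step 1: keep only the high-prep rows
toy_high = toy[toy["prep"] == "High"]

# Step 2: count the remaining rows per instructor.
# reset_index(name=...) labels the count column; "n_high" is an illustrative name.
high_counts = toy_high.groupby("instructor")["prep"].count().reset_index(name="n_high")

print(high_counts)
```

Calling reset_index() with no arguments, as in the hint, also works; the count column then keeps the name prep even though it holds counts, so passing name= is only a readability choice.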
# Put your answer for Q4 here:
Q5. From the results above, do you think Instructor Alice is a better teacher than Instructor Bob? Why or why not?¶
Put your answer for Q5 here:¶
End of Lab Exercise.