ECON4130 Assignment 1 - Getting To Know Our Faculty

Introduction

Learning a new programming language is never easy. To help you get comfortable with Python, this assignment gives you an opportunity to practise the basic Python skills covered in previous lectures. We will scrape data about our faculty from the department's website and process it in several ways. We hope this assignment builds your confidence in using Python and also allows you to learn more about the faculty of our department.

Coverage

In this assignment, we will cover skills mentioned in the following notebooks:

Part 1. Data Scraping in One Specific Static Webpage

On our department's website, information about the faculty is stored on the subpage: https://www.econ.cuhk.edu.hk/econ/en-gb/people/faculty.

In contrast to the Hong Kong Jockey Club's page used in our previous lectures, it is a static page rather than a dynamic one. Therefore, we can simply use requests to fetch the webpage, without needing selenium to drive an actual browser.
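
A minimal sketch of this fetching step, assuming the requests and beautifulsoup4 packages are installed. The helper name fetch_soup and the 10-second timeout are illustrative choices, not part of the assignment.

```python
import requests
from bs4 import BeautifulSoup

FACULTY_URL = "https://www.econ.cuhk.edu.hk/econ/en-gb/people/faculty"


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download a static page and parse it (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    # "html.parser" is the parser bundled with Python; other parsers may
    # build a different tree from the same HTML.
    return BeautifulSoup(response.text, "html.parser")


# Usage (needs network access):
# soup = fetch_soup(FACULTY_URL)
```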

As a demonstration, we first try to get the data about each faculty member's name and title.

By right-clicking on the webpage and selecting "View Page Source", or simply using the keyboard shortcut CTRL+U, you can see that this information is enclosed in nested <p> tags of the form:

<p class="faculty-name"><a href="/econ/en-gb/people/faculty?view=faculty&id=ycchow">Dr. CHOW Yan Chi, Vinci&nbsp;周恩誌<p class="faculty-title"><span>Senior Lecturer</span></p></a></p>

The "outer layer" is a <p> tag with the class "faculty-name" while the "inner layer" is another <p> tag with the class "faculty-title".

Therefore, if we use soup.find_all to locate information by the outer <p> tag, it will return both the member's name and title in an entangled form like:

Dr. CHOW Yan Chi, Vinci 周恩誌Senior Lecturer

Since we want to store these 2 pieces of information in separate variables, this entangled form is not ideal for us.

Therefore, we can start with the inner <p> tag to get the faculty member's title and then use previous_sibling to locate the member's name.
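
The sketch below illustrates this idea on the single HTML snippet shown earlier, rather than on the live page. It assumes the html.parser parser, which keeps the <p> tags nested; other parsers may restructure them. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The sample markup from this section, split over several lines.
html = (
    '<p class="faculty-name">'
    '<a href="/econ/en-gb/people/faculty?view=faculty&id=ycchow">'
    'Dr. CHOW Yan Chi, Vinci&nbsp;周恩誌'
    '<p class="faculty-title"><span>Senior Lecturer</span></p>'
    '</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Reading the outer tag gives the name and title stuck together.
outer = soup.find("p", class_="faculty-name")
entangled = outer.text

# Starting from the inner tag, the title is the tag's own text and
# the name is the text node sitting just before the tag.
title_tag = soup.find("p", class_="faculty-title")
title = title_tag.text
name = str(title_tag.previous_sibling)

print(entangled)
print(name, "|", title)
```

Note that &nbsp; is parsed into the character "\xa0" (a non-breaking space), so the name still needs cleaning before it is stored.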

Q1 Scrape and print out the email of each faculty member (if provided), with their name and title on the same row.

Part 2. List Operations on the Scraped Data

Q2.1 Create a list storing titles of the faculty members and print it out.

To be more specific, your answer should look like:

['Department Head/ Associate Professor', 'Assistant Professor',...]

Hint: You may want to create an empty list with list_title = [] first, then use a for loop with list_title.append() to add entries one by one.
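
As a rough illustration of the append pattern, the sketch below uses a hypothetical list of raw strings (raw_titles) in place of the text scraped from the page; in the assignment the loop would run over the scraped tags instead.

```python
# Hypothetical stand-in for the title text scraped from the page.
raw_titles = [
    " Department Head/ Associate Professor ",
    "Assistant Professor\n",
    "Senior Lecturer",
]

list_title = []  # start with an empty list
for raw in raw_titles:
    # Remove surrounding whitespace, then add the entry to the list.
    list_title.append(raw.strip())

print(list_title)
```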

Q2.2 By using the list created in Q2.1, count the number of Assistant Professors in our faculty and print out the message: "The number of Assistant Professors in our faculty is x.", where x is the number of Assistant Professors in the faculty.

There are multiple possible ways to finish the task, for example:

  1. Using the list comprehension skills learnt in lecture, create another list containing only the 'Assistant Professor' entries. Then use len() to count the number of items in the new list.
  2. Use a for loop with an if statement to count the number of matches. Hint: Set the count to 0 first, then add 1 every time a match appears.
  3. Use the list method count(), which counts how many times an element appears in a list. More details about count() can be found at https://www.programiz.com/python-programming/methods/list/count.

You are encouraged to try all of the above, but you will get full marks as long as you finish the task (by any one of the three ways, or any other way you can think of).
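
The three approaches can be sketched side by side as follows. The list contents here are made-up sample data, not the real faculty list, and the variable names are illustrative.

```python
# Made-up sample data standing in for the list built in Q2.1.
list_title = [
    "Department Head/ Associate Professor",
    "Assistant Professor",
    "Professor",
    "Assistant Professor",
    "Senior Lecturer",
]
target = "Assistant Professor"

# Way 1: list comprehension, then len().
matches = [t for t in list_title if t == target]
count_comprehension = len(matches)

# Way 2: for loop with a counter that starts at 0.
count_loop = 0
for t in list_title:
    if t == target:
        count_loop += 1

# Way 3: the built-in list method count().
count_method = list_title.count(target)

print(f"The number of Assistant Professors in our faculty is {count_method}.")
```

All three compare whole strings, so an entry such as "Department Head/ Associate Professor" is not counted as a match for "Assistant Professor".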

Part 3. Creating a Faculty Dictionary

In Part 1, we obtained our faculty members' names, titles, and emails. In this section, we want to build a dictionary with their names, titles, and research interest(s).

Q3.1 Scrape and print out the research interest(s) of each faculty member in a single column.

We can see that the printed result is not suitable for building the dictionary, so formatting changes are needed.

Q3.2 Format the scraped data on the faculty members' research interest(s) following the instructions below and store the formatted data in a list. Print out the list.

Formatting Changes required:

Step 1. Separate different research interests in the same line with /

Hint: We can see that the BeautifulSoup output takes the form of:

<p class="research-field">Research Interest(s):<span><p>Microeconomic Theory<br/>Financial Economics<br/>Industrial Organization</p></span></p>

Therefore, the text we got from the .text attribute in Q3.1 is a combination of 4 separate parts: "Research Interest(s):", "Microeconomic Theory", "Financial Economics" and "Industrial Organization". By default, they are joined with the empty string "". As a result, the different research interests are stuck together, which is why we get a result that looks like:

Research Interest(s):Microeconomic TheoryFinancial EconomicsIndustrial Organization

To fix this, we can use .get_text(separator="/") from BeautifulSoup to specify that we want to use / to join the different pieces of text together.

More about the .get_text() function can be found in the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
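
A small sketch comparing the default joining with separator="/", run on the sample markup from the hint above. It assumes the html.parser parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The sample markup from the hint, split over several lines.
html = (
    '<p class="research-field">Research Interest(s):<span><p>'
    'Microeconomic Theory<br/>Financial Economics<br/>Industrial Organization'
    '</p></span></p>'
)
field = BeautifulSoup(html, "html.parser").find("p", class_="research-field")

# Default: the text pieces are joined with the empty string.
joined_default = field.get_text()

# With a separator: the same pieces are joined with "/".
joined_slash = field.get_text(separator="/")

print(joined_default)
print(joined_slash)
```

The separator is also placed between the label and the first research interest, which is why the label is removed together with its trailing "/" in a later step.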

Step 2. Delete the phrase "\xa0/".

The output generated in Step 1 with .get_text() still contains the unwanted sequence "\xa0/" (a non-breaking space followed by the separator), so we have to delete it.

Hint: .replace("\xa0/","")

Step 3. Delete the phrase "Research Interest(s):" in front of each line.

Hint: .replace("Research Interest(s):/","")
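
Steps 2 and 3 can be chained on one string, as in the sketch below. The raw string is a made-up example of Step 1 output; where exactly "\xa0/" appears on the real page is an assumption here.

```python
# Made-up example of a Step 1 result containing the unwanted "\xa0/".
raw = "Research Interest(s):/\xa0/Microeconomic Theory/Financial Economics"

# Step 2: delete every "\xa0/" sequence.
without_nbsp = raw.replace("\xa0/", "")

# Step 3: delete the leading label together with its trailing "/".
cleaned = without_nbsp.replace("Research Interest(s):/", "")

print(cleaned)
```

In this example the order matters: the label is only followed directly by "/" once the "\xa0/" between them has been removed.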

Q3.3 Create another list storing the faculty's name.

If a faculty member's name contains the symbol "\xa0", please apply .replace("\xa0"," ") again to format it.
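
A short sketch of this cleaning step using a list comprehension. The first name comes from the sample markup in Part 1; the second entry is a purely hypothetical placeholder.

```python
# Names as they might come out of the scraper, with "\xa0" inside.
raw_names = [
    "Dr. CHOW Yan Chi, Vinci\xa0周恩誌",
    "Prof. SAMPLE Name\xa0示例",  # hypothetical placeholder entry
]

# Replace each non-breaking space with an ordinary space.
list_name = [name.replace("\xa0", " ") for name in raw_names]

print(list_name)
```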

Q3.4 Put the information on each faculty member's name, title, and research interest(s) into a nested dictionary (a dictionary of dictionaries), with his/her name as the outer-layer key, and "Title" and "Reseach interest(s)" as the inner-layer keys.

Now we have 3 separate lists of the faculty members' names (Q3.3), titles (Q2.1) and research interest(s) (Q3.2). By using the 3 lists, create a nested dictionary for our faculty.

Items in the nested dictionary should look like:

{..., 'Dr. CHOW Yan Chi, Vinci 周恩誌': {'Title': 'Senior Lecturer', 'Reseach interest(s)': 'Behavioral Economics/Experimental Economics/Machine Learning'}, ...}

Hint: you can create the dictionary by using the line:

faculty_dict = {list_1[i]: {"Title": list_2[i], "Reseach interest(s)": list_3[i]} for i in range(len(list_1))}

to loop through your 3 lists
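
As an alternative to indexing, the same nested dictionary can be sketched with zip(), which walks the three lists in parallel. The lists below hold only the single sample entry shown above, and the names list_name, list_interest and faculty are illustrative.

```python
# One-entry sample lists; the real lists come from Q3.3, Q2.1 and Q3.2.
list_name = ["Dr. CHOW Yan Chi, Vinci 周恩誌"]
list_title = ["Senior Lecturer"]
list_interest = ["Behavioral Economics/Experimental Economics/Machine Learning"]

# zip() silently stops at the shortest list, so check the lengths first.
if not (len(list_name) == len(list_title) == len(list_interest)):
    raise ValueError("The three lists must have the same length.")

# The inner key is spelled exactly as in the expected output above.
faculty = {
    name: {"Title": title, "Reseach interest(s)": interest}
    for name, title, interest in zip(list_name, list_title, list_interest)
}

print(faculty)
```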

Q3.5 Check the title and the research interest(s) of your academic advisor by using the dictionary you created in Q3.4. Print the message: "My academic advisor is x. He/She is a y and his/her research interest(s) are z", where x is your advisor's name in a format that matches the dictionary key, and y and z are the corresponding values from the dictionary.

In our department, each student is matched with an academic advisor once admitted. If you are not sure who your academic advisor is, log in to the CUSIS system and click "Academic Progress", then "Advisors" to check.

For students who come from other departments, please treat our instructor, "Dr. CHOW Yan Chi, Vinci 周恩誌", as your academic advisor in this exercise.
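
A sketch of the lookup for Q3.5, using a one-entry dictionary built from the sample item in Q3.4. The names faculty, advisor and info are illustrative; the dictionary key must match the stored name exactly, including spaces and Chinese characters.

```python
# One-entry sample dictionary in the shape described in Q3.4.
faculty = {
    "Dr. CHOW Yan Chi, Vinci 周恩誌": {
        "Title": "Senior Lecturer",
        "Reseach interest(s)": "Behavioral Economics/Experimental Economics/Machine Learning",
    }
}

advisor = "Dr. CHOW Yan Chi, Vinci 周恩誌"
info = faculty[advisor]  # raises KeyError if the name does not match a key

message = (
    f"My academic advisor is {advisor}. "
    f"He/She is a {info['Title']} and his/her research interest(s) are "
    f"{info['Reseach interest(s)']}"
)
print(message)
```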

End of the assignment