Learning a new programming language is never an easy task. To help you get comfortable working with Python, this assignment gives you an opportunity to practice the basic Python skills learnt in previous lectures. In this assignment, we will scrape data about our faculty from the department's website and process it in several ways. We hope this assignment builds your confidence in using Python and also allows you to learn more about the faculty of our department.
In this assignment, we will cover skills mentioned in the following notebooks:
On our department's website, information about the faculty is stored on the subpage: https://www.econ.cuhk.edu.hk/econ/en-gb/people/faculty.
In contrast to the Hong Kong Jockey Club's page used in our previous lectures, it is a static page rather than a dynamic one. Therefore, we can simply use requests to fetch the webpage, without needing selenium to launch an actual browser.
# You do not need to do anything in this cell
import requests
from bs4 import BeautifulSoup
# URL of the subpage listing information of the faculty
url = "https://www.econ.cuhk.edu.hk/econ/en-gb/people/faculty"
# Access the static webpage with "requests"
# Setting verify=False disables SSL certificate verification.
# This helps to avoid the error that can appear when the same IP address makes excessive requests in a short period of time.
# It also causes the InsecureRequestWarning shown in the output, which is expected here.
page = requests.get(url, verify=False)
# Pass the page content into BeautifulSoup for processing
soup = BeautifulSoup(page.content, "html.parser")
C:\Users\kyche\Anaconda3\lib\site-packages\urllib3\connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)
By right-clicking on the webpage and selecting "View Page Source", or simply using the keyboard shortcut CTRL+U, you can see that each faculty member's name and title are enclosed by a nested <p> tag in the form of:
<p class="faculty-name"><a href="/econ/en-gb/people/faculty?view=faculty&id=ycchow">Dr. CHOW Yan Chi, Vinci 周恩誌<p class="faculty-title"><span>Senior Lecturer</span></p></a></p>
The "outer layer" is a <p> tag with the class "faculty-name" while the "inner layer" is another <p> tag with the class "faculty-title".
Therefore, if we use soup.find_all to locate information by the outer <p> tag, it will return both the member's name and title in an entangled form like:
Dr. CHOW Yan Chi, Vinci 周恩誌Senior Lecturer
Since we want to store these 2 pieces of information in separate variables, this entangled form is not ideal for us.
Instead, we can start with the inner <p> tag to get the faculty member's title and then use previous_sibling to locate the member's name.
# You do not need to do anything in this cell
# Find all p tags with the class "faculty-title"
fac_titles = soup.find_all("p", class_="faculty-title")
# Print out the result
#for fac_title in fac_titles:
#    fac_name = fac_title.previous_sibling
#    print(fac_name.ljust(40),
#          fac_title.text)
# Alternatively, you can achieve the same result with this procedure
# In this case, each faculty member gets an index i from 0 to 37 (38 members in total)
# *You may find this form more convenient to use when answering Q1*
for i in range(len(fac_titles)):
    fac_name = fac_titles[i].previous_sibling
    print(fac_name.ljust(35),
          fac_titles[i].text)
Prof. KWONG Kai Sun, Sunny 鄺啟新      Department Head/ Associate Professor
Prof. BAI Ying 白營                   Assistant Professor
Prof. CHAN Hing Chi, Jimmy 陳慶池      Professor
Prof. CHONG Tai Leung, Terence 莊太量  Associate Professor
Prof. DU Julan 杜巨瀾                  Associate Professor
Prof. GUO Naijia 郭乃嘉                Assistant Professor
Prof. HE Wei 何暐                     Assistant Professor
Prof. HUANG Ji 黃吉                   Assistant Professor
Prof. LI Duozhe 李多哲                 Associate Professor
Prof. LIN Shu 林曙                    Professor
Prof. LYU Dan 呂丹                    Assistant Professor
Prof. LU Xun 陸迅                     Associate Professor
Prof. MENG Lingsheng 孟嶺生            Associate Professor
Prof. NG Ka Ho, Travis 吳嘉豪          Associate Professor
Prof. PEI Guangyu 裴光宇               Assistant Professor
Prof. SHENG Liugang 盛柳剛             Associate Professor
Prof. SHI Ce, Matthew 史冊            Assistant Professor
Prof. SHI Kang 施康                   Associate Professor
Prof. SHI Zhentao 史震濤               Associate Professor
Prof. SONG Zheng, Michael 宋錚        Professor
Prof. WANG Xiaohu 王曉虎               Assistant Professor
Prof. YIP Chong Kee 葉創基             Professor
Prof. ZHANG Junsen 張俊森              Wei Lun Professor of Economics
Prof. ZHANG Yifan 張軼凡               Associate Professor
Prof. LIU Pak Wai 廖柏偉               Emeritus Professor
Prof. SUNG Yun Wing 宋恩榮             Adjunct Professor
Dr. CHOW Yan Chi, Vinci 周恩誌         Senior Lecturer
Dr. CHUNG Chun Kit, Andy 鍾俊傑        Lecturer
Dr. IP Tak Sang, Hugo 葉德生           Lecturer
Ms. LEUNG Yuk Chun, Priscilla 梁玉珍   Lecturer
Dr. MOK Kai Chung, Wallace 莫啟聰      Senior Lecturer
Dr. WOO Wai-chiu 胡偉潮                Lecturer
Dr. YAN Wai Hin 殷偉憲                 Lecturer
Dr. YUNG Chor Wing, Linda 容楚穎       Senior Lecturer
Dr. WANG Xin 王鑫                     Research Associate
Prof. CHO Jin Seo 趙鎮緒               Visiting Scholar
Prof. YAO Feng 姚丰                   Visiting Scholar
Prof. WONG Kam Chau 黃錦就             Part-time Lecturer (Adjunct Associate Professor)
#Put your answer here:
To be more specific, your answer should look like:
['Department Head/ Associate Professor', 'Assistant Professor',...]
Hint: You may want to create an empty list with list_title = [] first, and then use a for loop with list_title.append() to add entries one by one.
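As a rough illustration of the hint, the sketch below fills a list with append() inside a for loop. The sample_titles list and its values are made up for the example; they only stand in for whatever you extract from fac_titles.

```python
# Hypothetical input: a few padded strings standing in for the scraped titles
sample_titles = [" Professor ", "Associate Professor", " Lecturer"]

# Start from an empty list, then add one cleaned entry per loop iteration
list_title = []
for title in sample_titles:
    list_title.append(title.strip())  # strip() removes surrounding whitespace

print(list_title)
```

The same pattern works for any sequence you can loop over, including the result of soup.find_all.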
#Put your answer here:
There are multiple possible ways to finish the task, for example:
len()
to count the number of items in the new list
count()
function designed to count elements of lists. More details about the count() function can be found at https://www.programiz.com/python-programming/methods/list/count
You are encouraged to try all of the above, but you can get full marks as long as you finish the task (by any one of the three ways, or any other way that you can think of).
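The counting approaches can be compared on a toy list. This is only a sketch: the titles list and the target value are invented, and the plain loop with a counter is an assumption added for comparison rather than something taken from the list above.

```python
# Hypothetical list of titles; the real one comes from your earlier answer
titles = ["Professor", "Lecturer", "Professor", "Assistant Professor", "Professor"]
target = "Professor"

# Approach A (assumed): loop over the list and increase a counter on each match
counter = 0
for t in titles:
    if t == target:
        counter += 1

# Approach B: build a new list holding only the matches, then measure it with len()
matches = [t for t in titles if t == target]
count_by_len = len(matches)

# Approach C: let list.count() count the exact matches
count_by_method = titles.count(target)

print(counter, count_by_len, count_by_method)
```

All three compare whole strings, so "Assistant Professor" is not counted as "Professor" here.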
#Put your answer here:
#Method 1:
#Put your answer here:
#Method 2:
#Put your answer here:
#Method 3:
#Put your answer here:
#Other method:
In Part 1, we have already tried to obtain information about our faculty members on their names, titles, and emails. In this section, we want to build a dictionary with their names, titles, and research interest(s).
#Put your answer here:
We can see that the printed result is not suitable for preparing the dictionary, so formatting changes are needed.
Formatting changes required:
Step 1. Separate different research interests in the same line with /
Hint: We can see that the BeautifulSoup
output takes the form of:
<p class="research-field">Research Interest(s):<span><p>Microeconomic Theory<br/>Financial Economics<br/>Industrial Organization</p></span></p>
Therefore, the text we got from the .text attribute in Q3.1 is a combination of 4 separate parts: "Research Interest(s):", "Microeconomic Theory", "Financial Economics" and "Industrial Organization". By default, they are joined with the empty string "". As a result, the different research interests are stuck together, which is why we get a result that looks like:
Research Interest(s):Microeconomic TheoryFinancial EconomicsIndustrial Organization
To fix this, we can use .get_text(separator="/") from BeautifulSoup to specify that we want to use / to join the different pieces of text together.
More about the .get_text() function can be found in the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
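The difference between the two calls can be seen on the sample tag from the hint. This is a minimal sketch that parses the snippet on its own rather than the live page; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The sample tag from the hint above, parsed as a stand-alone snippet
html = ('<p class="research-field">Research Interest(s):<span><p>Microeconomic Theory'
        '<br/>Financial Economics<br/>Industrial Organization</p></span></p>')
snippet = BeautifulSoup(html, "html.parser")
field = snippet.find("p", class_="research-field")

# .text joins the text pieces with an empty string
plain = field.text
# get_text(separator="/") joins the same pieces with "/"
joined = field.get_text(separator="/")

print(plain)
print(joined)
```

Note that the label "Research Interest(s):" is itself one of the joined pieces, so it ends up followed by a "/" as well.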
Step 2. Delete the phrase "\xa0/".
The output generated in Step 1 with .get_text() still contains the unwanted symbol "\xa0/" (a non-breaking space followed by the separator), so we have to delete it.
Hint: .replace("\xa0/","")
Step 3. Delete the phrase "Research Interest(s):" at the front of each line.
Hint: .replace("Research Interest(s):/","")
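Steps 2 and 3 can be chained, since each .replace() call returns a new string. The sketch below uses invented raw strings; in particular, where "\xa0/" appears in the real output is an assumption made for the example.

```python
# Hypothetical Step 1 output; the position of "\xa0/" is assumed for illustration
raw_fields = [
    "Research Interest(s):/\xa0/Microeconomic Theory/Financial Economics",
    "Research Interest(s):/Econometrics",
]

# Store the cleaned strings in a list instead of only printing them
list_interest = []
for raw in raw_fields:
    cleaned = raw.replace("\xa0/", "").replace("Research Interest(s):/", "")
    list_interest.append(cleaned)

print(list_interest)
```

Strings are immutable in Python, so the result of .replace() has to be assigned or appended somewhere; calling it without keeping the return value changes nothing.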
#Put your answer here:
#*Remember to store your formatted data into a list, instead of simply printing it out.
If a faculty member's name also contains the symbol "\xa0", apply .replace("\xa0"," ") to it as well.
#Put your answer here:
Now we have 3 separate lists holding the faculty members' names (Q3.3), titles (Q2.1) and research interest(s) (Q3.2). Using these 3 lists, create a nested dictionary for our faculty.
Items in the nested dictionary should look like:
{...,
'Dr. CHOW Yan Chi, Vinci 周恩誌': {'Title': 'Senior Lecturer',
'Research interest(s)': 'Behavioral Economics/Experimental Economics/Machine Learning'},
...}
Hint: you can loop through your 3 lists and create the dictionary with a line such as the following (avoid naming the variable dict, since that would shadow Python's built-in dict type):
faculty_dict = {list_1[i]:
    {"Title": list_2[i], "Research interest(s)": list_3[i]} for i in range(0, len(list_1))}
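To see the hinted pattern in action, here is a small sketch on invented data. The three lists, their values and the variable names are all hypothetical; only the shape of the comprehension follows the hint. The zip() variant is an alternative added for comparison.

```python
# Hypothetical parallel lists: position i in each list describes the same person
names = ["Person A", "Person B"]
titles = ["Professor", "Lecturer"]
interests = ["Microeconomics/Game Theory", "Econometrics"]

# Index-based dictionary comprehension, following the hint
faculty = {
    names[i]: {"Title": titles[i], "Research interest(s)": interests[i]}
    for i in range(len(names))
}

# Equivalent variant: zip() pairs up the items so no index is needed
faculty_zip = {
    n: {"Title": t, "Research interest(s)": r}
    for n, t, r in zip(names, titles, interests)
}

# Look up one person, then one field of that person's inner dictionary
print(faculty["Person B"]["Title"])
```

Both versions rely on the three lists having the same length and the same ordering, so it is worth checking len() on each list before building the dictionary.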
#Put your answer here:
In our department, each student is matched with an academic advisor once admitted. If you are not sure who your academic advisor is, log in to the CUSIS system and click "Academic Progress", and then "Advisors" to check.
For students who come from other departments, please treat our instructor, "Dr. CHOW Yan Chi, Vinci 周恩誌", as your academic advisor in this exercise.
#Put your answer here: