In the previous notebook, we learnt several basic Natural Language Processing (NLP) techniques for dealing with English text data. However, natural language is never just English: there is a wide range of interesting research questions concerning text data in other languages.
In this notebook, we will discuss methods for handling Chinese text data. You do not need to understand Chinese to use this notebook. This is the power of Machine Learning: we teach machines to learn (or, more likely, use machines that others have taught), so that we can get answers from them without having to learn the material ourselves.
Packages used in this notebook:
1. googletrans
pip install googletrans==4.0.0-rc1
2. ltp (Language Technology Platform)
Online demo: http://ltp.ai/demo.html (Try it!)
Extensive tables listing their tag classifications: https://github.com/HIT-SCIR/ltp/blob/master/docs/appendix.rst
3. snownlp
4. opencc (Open Chinese Convert)
All of the above are applicable to both Simplified and Traditional Chinese characters.
5. stopwordsiso
stopwordsiso is only available for Simplified Chinese characters.
The install commands are collected below for convenience.
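The following is a sketch of the installs in one place; exact PyPI package names can vary across environments (for example, more than one OpenCC wrapper exists on PyPI), so adjust as needed:
pip install googletrans==4.0.0-rc1
pip install ltp
pip install snownlp
pip install opencc
pip install stopwordsiso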
When encountering Chinese text data, there are (at least) two ways to deal with it:
Method 1: Translate it into English, then use English-based NLP libraries like the ones introduced in the earlier notebook
Method 2: Use Chinese-based NLP libraries developed in Python
Let us start with two simple examples.
Positive statement: 我最喜歡的經濟課就是 ECON4130 了!
Negative statement: 我最討厭的經濟課就是 ECON4130 了!
(In this notebook we use Traditional Chinese characters for demonstration.)
#Store the statements into variables
Pos_statement = "我最喜歡的經濟課就是 ECON4130 了!"
Neg_statement = "我最討厭的經濟課就是 ECON4130 了!"
statements_examples = [Pos_statement, Neg_statement]
Let's have a look at which languages googletrans supports:
import pandas as pd
import googletrans

#Extracting the whole list of languages supported by the library (it comes in the form of a dictionary)
lang_list = googletrans.LANGCODES

#Organize the dictionary into a table, and display the whole table
with pd.option_context("display.max_rows", 1000): #Without adjusting the option, only part of the table would be shown
    display(pd.DataFrame(lang_list.items(), columns = ["Language", "Short Form/Code"]))
 | Language | Short Form/Code |
---|---|---|
0 | afrikaans | af |
1 | albanian | sq |
2 | amharic | am |
3 | arabic | ar |
4 | armenian | hy |
5 | azerbaijani | az |
6 | basque | eu |
7 | belarusian | be |
8 | bengali | bn |
9 | bosnian | bs |
10 | bulgarian | bg |
11 | catalan | ca |
12 | cebuano | ceb |
13 | chichewa | ny |
14 | chinese (simplified) | zh-cn |
15 | chinese (traditional) | zh-tw |
16 | corsican | co |
17 | croatian | hr |
18 | czech | cs |
19 | danish | da |
20 | dutch | nl |
21 | english | en |
22 | esperanto | eo |
23 | estonian | et |
24 | filipino | tl |
25 | finnish | fi |
26 | french | fr |
27 | frisian | fy |
28 | galician | gl |
29 | georgian | ka |
30 | german | de |
31 | greek | el |
32 | gujarati | gu |
33 | haitian creole | ht |
34 | hausa | ha |
35 | hawaiian | haw |
36 | hebrew | he |
37 | hindi | hi |
38 | hmong | hmn |
39 | hungarian | hu |
40 | icelandic | is |
41 | igbo | ig |
42 | indonesian | id |
43 | irish | ga |
44 | italian | it |
45 | japanese | ja |
46 | javanese | jw |
47 | kannada | kn |
48 | kazakh | kk |
49 | khmer | km |
50 | korean | ko |
51 | kurdish (kurmanji) | ku |
52 | kyrgyz | ky |
53 | lao | lo |
54 | latin | la |
55 | latvian | lv |
56 | lithuanian | lt |
57 | luxembourgish | lb |
58 | macedonian | mk |
59 | malagasy | mg |
60 | malay | ms |
61 | malayalam | ml |
62 | maltese | mt |
63 | maori | mi |
64 | marathi | mr |
65 | mongolian | mn |
66 | myanmar (burmese) | my |
67 | nepali | ne |
68 | norwegian | no |
69 | odia | or |
70 | pashto | ps |
71 | persian | fa |
72 | polish | pl |
73 | portuguese | pt |
74 | punjabi | pa |
75 | romanian | ro |
76 | russian | ru |
77 | samoan | sm |
78 | scots gaelic | gd |
79 | serbian | sr |
80 | sesotho | st |
81 | shona | sn |
82 | sindhi | sd |
83 | sinhala | si |
84 | slovak | sk |
85 | slovenian | sl |
86 | somali | so |
87 | spanish | es |
88 | sundanese | su |
89 | swahili | sw |
90 | swedish | sv |
91 | tajik | tg |
92 | tamil | ta |
93 | telugu | te |
94 | thai | th |
95 | turkish | tr |
96 | ukrainian | uk |
97 | urdu | ur |
98 | uyghur | ug |
99 | uzbek | uz |
100 | vietnamese | vi |
101 | welsh | cy |
102 | xhosa | xh |
103 | yiddish | yi |
104 | yoruba | yo |
105 | zulu | zu |
We can see that the library supports a total of 106 languages. Since this notebook focuses on conversion between English and Chinese, only the three below will be needed.
#From the dictionary ``lang_list`` that we created, get the codes of:
# english, chinese (traditional), and chinese (simplified)
lang_interested = ["english", "chinese (traditional)", "chinese (simplified)"]
for lang in lang_interested:
    print(lang, ":", lang_list[lang])
english : en
chinese (traditional) : zh-tw
chinese (simplified) : zh-cn
Now we are ready to proceed. Let's translate the above Chinese statements (written in Traditional Chinese characters) into English.
#Do the translation
from googletrans import Translator

translator = Translator() #Instantiate the translator provided by the library
eng_result = [] #Create an empty list to store the translation results
for example in statements_examples:
    #Translate each statement; dest: destination language, src: source language
    #(If you don't specify 'dest' and 'src', the library will automatically
    #detect the input language and translate it to English by default)
    result = translator.translate(example, dest='en', src='zh-tw')
    eng_result.append(result.text)
print(eng_result)
['My favorite economic class is ECON4130!', 'My most annoying economic class is ECON4130!']
The translation actually works quite well!
After that, if you want to transfer these results to another notebook for further processing, you can use Python's pickle module.
By pickling (dump) and unpickling (load), we can serialize and de-serialize a Python object structure, which you can think of as "saving" and "loading" a Python object.
import pickle

with open("translation_result.txt", "wb") as p: #Pickling; "wb": "w" means write and "b" means binary mode
    pickle.dump(eng_result, p)

#You can have the following lines in another notebook (created in the same directory) to import the list
with open("translation_result.txt", "rb") as p: #Unpickling; "rb": "r" means read and "b" means binary mode
    imported_result = pickle.load(p)
imported_result
['My favorite economic class is ECON4130!', 'My most annoying economic class is ECON4130!']
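As a side note, pickle files are Python-specific. If you want the results to be readable by other tools (or by humans), json is a common alternative for a simple list of strings like ours; a minimal sketch:
import json

#Save the translated statements as human-readable JSON
with open("translation_result.json", "w", encoding="utf-8") as f:
    json.dump(eng_result, f, ensure_ascii=False)

#...and load them back (here or in another notebook)
with open("translation_result.json", "r", encoding="utf-8") as f:
    print(json.load(f))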
For cases as simple as the previous examples, translating seems to work very well. However, in real-world applications things are rarely this clean. English and Chinese are two very different language systems, each with its own grammar, so translating may cause information loss.
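One quick, illustrative way to see such loss is a round trip: translate the Chinese statements to English and back, then compare with the originals. This sketch reuses the translator from above (actual outputs depend on the translation service):
#Round-trip sketch (not part of the original pipeline): zh-tw -> en -> zh-tw
for example in statements_examples:
    to_en = translator.translate(example, dest='en', src='zh-tw').text
    back = translator.translate(to_en, dest='zh-tw', src='en').text
    print("Original:", example)
    print("English :", to_en)
    print("Back    :", back)
    print("")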
With growing interest in analysing Chinese data, research effort has been devoted to building Chinese-based NLP libraries. In this notebook, we will introduce two of them: ltp and snownlp.
Let's start with ltp:
#Tokenization and POS tagging with ltp
from ltp import LTP

ltp = LTP()
for example in statements_examples:
    seg, hidden = ltp.seg([example]) #seg: segmented words; hidden: internal states reused by other ltp tasks
    pos = ltp.pos(hidden) #Part-of-speech tags for the segmented words
    print(example)
    print(seg)
    print(pos)
    print("")
我最喜歡的經濟課就是 ECON4130 了!
[['我', '最', '喜歡', '的', '經濟課', '就是', 'ECON4130', '了', '!']]
[['r', 'd', 'v', 'u', 'n', 'd', 'nz', 'u', 'wp']]

我最討厭的經濟課就是 ECON4130 了!
[['我', '最', '討厭', '的', '經濟課', '就是', 'ECON4130', '了', '!']]
[['r', 'd', 'v', 'u', 'n', 'd', 'nz', 'u', 'wp']]
From the table provided in their appendix (https://github.com/HIT-SCIR/ltp/blob/master/docs/appendix.rst):
Tag | Description |
---|---|
r | pronoun |
d | adverb |
v | verb |
u | auxiliary |
n | general noun |
nz | other proper noun |
wp | punctuation |
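For more readable output, we can pair each token with its tag description. The dictionary below is just an illustrative helper built from the appendix table above (it is not part of ltp itself), and it reuses seg and pos from the last iteration of the loop:
#Illustrative helper: map the tag codes we encountered to plain-English descriptions
tag_descriptions = {
    "r": "pronoun", "d": "adverb", "v": "verb", "u": "auxiliary",
    "n": "general noun", "nz": "other proper noun", "wp": "punctuation",
}

for word, tag in zip(seg[0], pos[0]):
    print(word, tag, tag_descriptions.get(tag, "unknown"))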
We can see that ltp did a perfect job on both word segmentation and part-of-speech tagging! How about snownlp?
#Tokenization on snownlp (Traditional)
from snownlp import SnowNLP

for example in statements_examples:
    s = SnowNLP(example)
    print(example)
    print(list(s.tags))
    print("")
我最喜歡的經濟課就是 ECON4130 了!
[('我', 'r'), ('最', 'd'), ('喜', 'v'), ('歡', 'i'), ('的', 'u'), ('經', 'Rg'), ('濟', 'Rg'), ('課', 'Rg'), ('就', 'd'), ('是', 'v'), ('ECON4130', 'y'), ('了', 'y'), ('!', 'w')]

我最討厭的經濟課就是 ECON4130 了!
[('我', 'r'), ('最', 'd'), ('討', 'Rg'), ('厭', 'Rg'), ('的', 'u'), ('經', 'Rg'), ('濟', 'Rg'), ('課', 'Rg'), ('就', 'd'), ('是', 'v'), ('ECON4130', 'y'), ('了', 'y'), ('!', 'w')]
The result from snownlp is terrible! It simply splits the sentence into individual Chinese characters, without regard to meaningful multi-character words (grouping which should be the goal of tokenization). For example, we successfully get 喜歡 (like) and 經濟課 (economics class) from ltp, but snownlp does not keep them together.
Since snownlp performs so poorly here, we suspect its model was trained on little or no Traditional Chinese text data.
Therefore, below we demonstrate how to convert Traditional Chinese text into Simplified Chinese characters using the .han attribute of snownlp. We will also show that using the Simplified versions of the statements substantially improves snownlp's tokenization.
#snownlp
from snownlp import SnowNLP

statements_ex_sim = [] #Create an empty list to store the converted statements

#Conversion from Traditional Chinese to Simplified Chinese
for example in statements_examples:
    s = SnowNLP(example)
    sim_ver = s.han
    statements_ex_sim.append(sim_ver)
    print("Original statement (in Traditional Chinese):")
    print(example)
    print("Simplified Chinese version:")
    print(sim_ver)
    print("")
Original statement (in Traditional Chinese):
我最喜歡的經濟課就是 ECON4130 了!
Simplified Chinese version:
我最喜欢的经济课就是 ECON4130 了!

Original statement (in Traditional Chinese):
我最討厭的經濟課就是 ECON4130 了!
Simplified Chinese version:
我最讨厌的经济课就是 ECON4130 了!
#Tokenization on snownlp (Simplified)
from snownlp import SnowNLP

#Now use the new list "statements_ex_sim" created in the last cell,
#which stores the Simplified Chinese versions of the statements
for example in statements_ex_sim:
    s = SnowNLP(example)
    print(example)
    print(list(s.tags))
    print("")
我最喜欢的经济课就是 ECON4130 了!
[('我', 'r'), ('最', 'd'), ('喜欢', 'v'), ('的', 'u'), ('经济', 'n'), ('课', 'n'), ('就', 'd'), ('是', 'v'), ('ECON4130', 'y'), ('了', 'y'), ('!', 'w')]

我最讨厌的经济课就是 ECON4130 了!
[('我', 'r'), ('最', 'd'), ('讨厌', 'v'), ('的', 'u'), ('经济', 'n'), ('课', 'n'), ('就', 'd'), ('是', 'v'), ('ECON4130', 'y'), ('了', 'y'), ('!', 'w')]
snownlp's tokenization improves when we switch to Simplified Chinese: expressions like 喜欢 (like) and 讨厌 (hate) are now grouped. Still, ltp does a better job: for example, 经济课 (economics class) and 就是 (is) are better kept as single tokens, as ltp does.
While ltp is clearly the more professional NLP library linguistically, snownlp offers useful and interesting functions that are not available in ltp. Besides converting Traditional Chinese into Simplified Chinese, two more examples are shown below.
To better demonstrate snownlp's keywords and summary functions, let's use a paragraph instead. Below is an extract from the department's website (https://www.econ.cuhk.edu.hk/econ/zh-tw/about-us):
#Special functions on snownlp: keywords and summary
s = SnowNLP(u"香港中文大學經濟學系致力傳授學生理性分析的技巧,鼓勵他們思考經濟問題,\
並提高他們獨立分析經濟行為的能力。為了達到這樣的目標,平衡的課程設計是一大\
要素。本系學生在經濟學理論、數學及統計學等基礎學科上均打下堅實基礎。他們亦\
要將理論知識運用於廣泛的而實際的經濟分析上。本系的課程設置理論與實踐並重,\
給本科生提供了大量不同類型的課程:既有經濟史這樣的質化研究領域課程,亦有計\
量經濟學這樣的量化研究類課程,有餘力的學生甚至可以去選修研究院課程。另外,\
經濟系還提供了一些專門課程幫助學生瞭解香港獨有的問題。") #u: unicode
s_sim = SnowNLP(s.han) #Convert into Simplified Chinese for better performance in snownlp
keywords_sim = s_sim.keywords(5) #Extract keywords; the number in the parentheses indicates how many keywords you want
summary_sim = s_sim.summary(5) #Extract a summary; the number in the parentheses indicates how many summary sentences you want
print(keywords_sim)
print(summary_sim)
['课程', '学生', '经济', '理论', '分析']
['他们亦 要将理论知识运用于广泛的而实际的经济分析上', '给本科生提供了大量不同类型的课程:既有经济史这样的质化研究领域课程', '经济系还提供了一些专门课程帮助学生了解香港独有的问题', '并提高他们独立分析经济行为的能力', '香港中文大学经济学系致力传授学生理性分析的技巧']
Notice that snownlp only provides conversion from Traditional Chinese into Simplified Chinese, not the other way round. Therefore, if we want to transform the results we get from snownlp back into Traditional Chinese, we need another library: opencc.
#OpenCC: Converting Simplified Chinese back into Traditional Chinese
import opencc

keywords_trad = []
summary_trad = []
converter = opencc.OpenCC('s2hk.json') #s2hk.json: Simplified Chinese to Traditional Chinese (Hong Kong variant)
for k in keywords_sim:
    keywords_trad.append(converter.convert(k))
for summ in summary_sim:
    summary_trad.append(converter.convert(summ))
print(keywords_trad)
print(summary_trad)
['課程', '學生', '經濟', '理論', '分析']
['他們亦 要將理論知識運用於廣泛的而實際的經濟分析上', '給本科生提供了大量不同類型的課程:既有經濟史這樣的質化研究領域課程', '經濟系還提供了一些專門課程幫助學生了解香港獨有的問題', '並提高他們獨立分析經濟行為的能力', '香港中文大學經濟學系致力傳授學生理性分析的技巧']
After this tour of snownlp's features, let's come back to our simple example!
Besides tokenization, detecting and eliminating stop words is another important task in NLP: stop words carry little meaning on their own and can add noise when training models or performing analysis. Here is how we can do it with stopwordsiso.
#Stop words with stopwordsiso
from stopwordsiso import stopwords

stopwords_chi_list = stopwords(["zh"]) #Get the full set of Chinese stop words ("zh" is the language code)
Let's have a look at what counts as a stop word in Chinese:
stopwords_chi_list
{'、', '。', '〈', '〉', '《', '》', '一', '一个', '一些', '一何', '一切', '一则', '一方面', '一旦', '一来', '一样', '一种', '一般', '一转眼', '七', '万一', '三', '上', '上下', '下', '不', '不仅', '不但', '不光', '不单', '不只', '不外乎', '不如', '不妨', '不尽', '不尽然', '不得', '不怕', '不惟', '不成', '不拘', '不料', '不是', '不比', '不然', '不特', '不独', '不管', '不至于', '不若', '不论', '不过', '不问', '与', '与其', '与其说', '与否', '与此同时', '且', '且不说', '且说', '两者', '个', '个别', '中', '临', '为', '为了', '为什么', '为何', '为止', '为此', '为着', '乃', '乃至', '乃至于', '么', '之', '之一', '之所以', '之类', '乌乎', '乎', '乘', '九', '也', '也好', '也罢', '了', '二', '二来', '于', '于是', '于是乎', '云云', '云尔', '五', '些', '亦', '人', '人们', '人家', '什', '什么', '什么样', '今', '介于', '仍', '仍旧', '从', '从此', '从而', '他', '他人', '他们', '他们们', '以', '以上', '以为', '以便', '以免', '以及', '以故', '以期', '以来', '以至', '以至于', '以致', '们', '任', '任何', '任凭', '会', '似的', '但', '但凡', '但是', '何', '何以', '何况', '何处', '何时', '余外', '作为', '你', '你们', '使', '使得', '例如', '依', '依据', '依照', '便于', '俺', '俺们', '倘', '倘使', '倘或', '倘然', '倘若', '借', '借傥然', '假使', '假如', '假若', '做', '像', '儿', '先不先', '光', '光是', '全体', '全部', '八', '六', '兮', '共', '关于', '关于具体地说', '其', '其一', '其中', '其二', '其他', '其余', '其它', '其次', '具体地说', '具体说来', '兼之', '内', '再', '再其次', '再则', '再有', '再者', '再者说', '再说', '冒', '冲', '况且', '几', '几时', '凡', '凡是', '凭', '凭借', '出于', '出来', '分', '分别', '则', '则甚', '别', '别人', '别处', '别是', '别的', '别管', '别说', '到', '前后', '前此', '前者', '加之', '加以', '区', '即', '即令', '即使', '即便', '即如', '即或', '即若', '却', '去', '又', '又及', '及', '及其', '及至', '反之', '反而', '反过来', '反过来说', '受到', '另', '另一方面', '另外', '另悉', '只', '只当', '只怕', '只是', '只有', '只消', '只要', '只限', '叫', '叮咚', '可', '可以', '可是', '可见', '各', '各个', '各位', '各种', '各自', '同', '同时', '后', '后者', '向', '向使', '向着', '吓', '吗', '否则', '吧', '吧哒', '含', '吱', '呀', '呃', '呕', '呗', '呜', '呜呼', '呢', '呵', '呵呵', '呸', '呼哧', '咋', '和', '咚', '咦', '咧', '咱', '咱们', '咳', '哇', '哈', '哈哈', '哉', '哎', '哎呀', '哎哟', '哗', '哟', '哦', '哩', '哪', '哪个', '哪些', '哪儿', '哪天', '哪年', '哪怕', '哪样', '哪边', '哪里', '哼', '哼唷', '唉', '唯有', '啊', '啐', '啥', '啦', '啪达', '啷当', '喂', '喏', '喔唷', '喽', '嗡', '嗡嗡', '嗬', '嗯', '嗳', '嘎', '嘎登', '嘘', '嘛', '嘻', '嘿', '嘿嘿', '四', '因', '因为', '因了', '因此', '因着', '因而', '固然', '在', '在下', '在于', '地', '基于', '处在', '多', '多么', '多少', '大', '大家', '她', '她们', '好', '如', '如上', '如上所述', '如下', '如何', '如其', '如同', '如是', '如果', '如此', '如若', '始而', '孰料', '孰知', '宁', '宁可', '宁愿', '宁肯', '它', '它们', '对', '对于', '对待', '对方', '对比', '将', '小', '尔', '尔后', '尔尔', '尚且', '就', '就是', '就是了', '就是说', '就算', '就要', '尽', '尽管', '尽管如此', '岂但', '己', '已', '已矣', '巴', '巴巴', '年', '并', '并且', '庶乎', '庶几', '开外', '开始', '归', '归齐', '当', '当地', '当然', '当着', '彼', '彼时', '彼此', '往', '待', '很', '得', '得了', '怎', '怎么', '怎么办', '怎么样', '怎奈', '怎样', '总之', '总的来看', '总的来说', '总的说来', '总而言之', '恰恰相反', '您', '惟其', '慢说', '我', '我们', '或', '或则', '或是', '或曰', '或者', '截至', '所', '所以', '所在', '所幸', '所有', '才', '才能', '打', '打从', '把', '抑或', '拿', '按', '按照', '换句话说', '换言之', '据', '据此', '接着', '故', '故此', '故而', '旁人', '无', '无宁', '无论', '既', '既往', '既是', '既然', '日', '时', '时候', '是', '是以', '是的', '更', '曾', '替', '替代', '最', '月', '有', '有些', '有关', '有及', '有时', '有的', '望', '朝', '朝着', '本', '本人', '本地', '本着', '本身', '来', '来着', '来自', '来说', '极了', '果然', '果真', '某', '某个', '某些', '某某', '根据', '欤', '正值', '正如', '正巧', '正是', '此', '此地', '此处', '此外', '此时', '此次', '此间', '毋宁', '每', '每当', '比', '比及', '比如', '比方', '没奈何', '沿', '沿着', '漫说', '点', '焉', '然则', '然后', '然而', '照', '照着', '犹且', '犹自', '甚且', '甚么', '甚或', '甚而', '甚至', '甚至于', '用', '用来', '由', '由于', '由是', '由此', '由此可见', '的', '的确', '的话', '直到', '相对而言', '省得', '看', '眨眼', '着', '着呢', '矣', '矣乎', '矣哉', '离', '秒', '称', '竟而', '第', '等', '等到', '等等', '简言之', '管', '类如', '紧接着', '纵', '纵令', '纵使', '纵然', '经', '经过', '结果', '给', '继之', '继后', '继而', '综上所述', '罢了', '者', '而', '而且', '而况', '而后', 
'而外', '而已', '而是', '而言', '能', '能否', '腾', '自', '自个儿', '自从', '自各儿', '自后', '自家', '自己', '自打', '自身', '至', '至于', '至今', '至若', '致', '般的', '若', '若夫', '若是', '若果', '若非', '莫不然', '莫如', '莫若', '虽', '虽则', '虽然', '虽说', '被', '要', '要不', '要不是', '要不然', '要么', '要是', '譬喻', '譬如', '让', '许多', '论', '设使', '设或', '设若', '诚如', '诚然', '该', '说', '说来', '请', '诸', '诸位', '诸如', '谁', '谁人', '谁料', '谁知', '贼死', '赖以', '赶', '起', '起见', '趁', '趁着', '越是', '距', '跟', '较', '较之', '边', '过', '还', '还是', '还有', '还要', '这', '这一来', '这个', '这么', '这么些', '这么样', '这么点儿', '这些', '这会儿', '这儿', '这就是说', '这时', '这样', '这次', '这般', '这边', '这里', '进而', '连', '连同', '逐步', '通过', '遵循', '遵照', '那', '那个', '那么', '那么些', '那么样', '那些', '那会儿', '那儿', '那时', '那样', '那般', '那边', '那里', '都', '鄙人', '鉴于', '针对', '阿', '除', '除了', '除外', '除开', '除此之外', '除非', '随', '随后', '随时', '随着', '难道说', '零', '非', '非但', '非徒', '非特', '非独', '靠', '顺', '顺着', '首先', '︿', '!', '#', '$', '%', '&', '(', ')', '*', '+', ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '>', '?', '@', '[', ']', '{', '|', '}', '~', '¥'}
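The set is large. As a quick sanity check (a small sketch using the set we just loaded), we can inspect its size and test membership of a couple of tokens from our statements:
print(len(stopwords_chi_list)) #Total number of entries in the Chinese stop word set
print("我" in stopwords_chi_list) #"I" appears in the list above, so this should be True
print("喜欢" in stopwords_chi_list) #"like" is a content word, so we expect False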
After that, we should remove the stop words from our sample statements before proceeding to further training or analysis. Since the tokenization results from ltp are better, we will use them rather than those from snownlp.
#Removing stop words
#ltp for tokenization
from ltp import LTP

ltp = LTP()
cleaned_statements = []
for example in statements_ex_sim:
    seg, hidden = ltp.seg([example])
    print('Before cleaning:')
    print(seg[0])
    words_clean = [word for word in seg[0] if word not in stopwords_chi_list] #Remove words that are in the stop word list
    cleaned_statements.append(words_clean)
    print('After cleaning:')
    print(words_clean)
    print("")
Before cleaning:
['我', '最', '喜欢', '的', '经济课', '就是', 'ECON4130', '了', '!']
After cleaning:
['喜欢', '经济课', 'ECON4130']

Before cleaning:
['我', '最', '讨厌', '的', '经济课', '就是', 'ECON4130', '了', '!']
After cleaning:
['讨厌', '经济课', 'ECON4130']
Since we also know that ECON4130 is a proper noun and will not be useful for sentiment analysis, we can drop it too.
#Manually cleaning the data
#Create a list of the phrases that you know will be useless; in this case it's just "ECON4130"
remove_list = ["ECON4130"] #You can always include multiple entries in the list

final_input = []
for statement in cleaned_statements:
    words_clean = [word for word in statement if word not in remove_list]
    final_input.append(words_clean)
    print("Before cleaning:")
    print(statement)
    print("After cleaning:")
    print(words_clean)
Before cleaning:
['喜欢', '经济课', 'ECON4130']
After cleaning:
['喜欢', '经济课']
Before cleaning:
['讨厌', '经济课', 'ECON4130']
After cleaning:
['讨厌', '经济课']
After all this preprocessing, our ultimate goal is to turn the statements into useful information; one such piece of information is sentiment.
Again using snownlp, we can perform sentiment analysis on our sample statements. In snownlp, sentiment is measured on a scale from 0 (most negative) to 1 (most positive).
#Performing sentiment analysis with snownlp
for statement in final_input:
    for phrase in statement:
        s = SnowNLP(phrase) #Get the sentiment of each phrase
        print(phrase)
        print(s.sentiments)
喜欢
0.6994590939824207
经济课
0.7901315522156727
讨厌
0.5128205128205127
经济课
0.7901315522156727
To get one representative number for each statement instead of one per phrase, we can average the phrase-level scores.
#Average the phrase-level sentiments for each statement
for statement in final_input:
    sent_list = []
    for phrase in statement:
        s = SnowNLP(phrase) #Get the sentiment of each phrase
        sent_list.append(s.sentiments)
    print("Sentiment for the statement is:")
    print(sum(sent_list)/len(sent_list)) #Average
    print("")
Sentiment for the statement is:
0.7447953230990467

Sentiment for the statement is:
0.6514760325180926
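As a cross-check (a sketch, not part of the pipeline above), we can also score each whole Simplified-Chinese sentence directly and compare with the averaged phrase-level numbers. The two approaches need not agree, since here snownlp sees the full sentence context rather than isolated phrases:
#Sentence-level sentiment, for comparison with the averaged phrase-level scores
from snownlp import SnowNLP

for example in statements_ex_sim:
    s = SnowNLP(example)
    print(example)
    print(s.sentiments)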