ECON4130 - Natural Language Processing in Chinese

In the previous notebook, we learned several basic Natural Language Processing (NLP) techniques for English text data. However, natural language is never just English: there is a wide range of interesting research questions concerning text data in other languages.

In this notebook, we discuss methods for handling Chinese text data. You do not need to understand Chinese to use this notebook. This is the power of Machine Learning: we teach machines to learn or, more likely, we use machines that have already been taught, so that we can obtain answers from them without learning the language ourselves.

Prerequisites

Packages used in this notebook:

1. googletrans

2. ltp (Language Technology Platform)

3. snownlp

4. opencc (Open Chinese Convert)

All of the above work with both Simplified and Traditional Chinese characters.

5. stopwordsiso

stopwordsiso is only available for Simplified Chinese characters.

Preprocessing

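When encountering Chinese text data, there are (at least) 2 ways to deal with it: translate it into English and reuse English NLP tools (Method 1), or process it directly with Chinese-based NLP libraries (Method 2).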

Examples

Let us start with 2 simple examples:

Positive statement:

我最喜歡的經濟課就是 ECON4130 了!
("ECON4130 is my favourite economics course!")

Negative statement:

我最討厭的經濟課就是 ECON4130 了!
("ECON4130 is the economics course that I hate the most!")

In this notebook, we use Traditional Chinese characters for demonstration.

Method 1: Google translation in Python

Let's have a look at which languages googletrans supports:

We can see that the library supports a total of 106 languages. Since this notebook focuses on conversion between English and Chinese, only 3 of them are useful here: English, Simplified Chinese and Traditional Chinese.

Now we are ready to proceed. Let's translate the above Chinese statements (which are written in Traditional Chinese characters) into English.
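A minimal sketch of this translation step is given below. It assumes a googletrans release in which Translator.translate is a regular synchronous call returning an object with a .text attribute (some newer releases expose it as a coroutine instead), and it assumes "zh-tw" and "en" as the language codes. The helper name translate_statements and its translator parameter are hypothetical, introduced only so that the function can be tried without network access.

```python
def translate_statements(statements, src="zh-tw", dest="en", translator=None):
    """Translate each statement and return the translated strings."""
    if translator is None:
        # Imported lazily: googletrans is third-party and needs network access.
        from googletrans import Translator
        translator = Translator()
    return [translator.translate(s, src=src, dest=dest).text for s in statements]


# The two sample statements from this notebook.
statements = [
    "我最喜歡的經濟課就是 ECON4130 了!",
    "我最討厭的經濟課就是 ECON4130 了!",
]

# Uncomment to send the statements to Google Translate:
# english = translate_statements(statements)
# print(english)
```

Passing your own translator object (anything with a compatible translate method) makes it easy to swap in another service or a stub for offline testing.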

The translation actually works quite well!

After that, if you want to transfer these results to another notebook for further processing, you can use Python's pickle module.

By pickling (dump) and unpickling (load), we can serialize and de-serialize a Python object structure, which you can think of as "saving" and "loading" a Python object.
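The dump and load round trip can be sketched as follows. The dictionary is a stand-in for the translation results (it reuses the English glosses given earlier, not actual googletrans output), and the file name translations.pkl is a hypothetical choice.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the translation results we want to pass to another notebook.
translations = {
    "我最喜歡的經濟課就是 ECON4130 了!": "ECON4130 is my favourite economics course!",
    "我最討厭的經濟課就是 ECON4130 了!": "ECON4130 is the economics course that I hate the most!",
}

# Hypothetical file location; any writable path works.
path = Path(tempfile.gettempdir()) / "translations.pkl"

# Pickling (dump): serialize the object to a binary file.
with open(path, "wb") as f:
    pickle.dump(translations, f)

# Unpickling (load): rebuild an equal object from the file.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == translations)  # True
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.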

Method 2: Use Chinese-based NLP libraries for Python

For cases as simple as the previous examples, translation seems to work very well. However, in real-world applications, things are rarely that perfect. English and Chinese are two very different language systems, each with its own grammar, so translation may cause information loss.

With growing interest in analysing Chinese data, research effort has been devoted to building Chinese-based NLP libraries. In this notebook, we introduce two of them: ltp and snownlp.

Let's start with ltp:

From the table provided in their appendix (https://github.com/HIT-SCIR/ltp/blob/master/docs/appendix.rst):

Tag   Description
r     pronoun
d     adverb
v     verb
u     auxiliary
n     general noun
nz    other proper noun
wp    punctuation

We can see that ltp did a perfect job on both word segmentation and part-of-speech tagging! How about snownlp?

The result from snownlp is terrible! It simply splits the whole sentence into single Chinese characters, without grouping them into meaningful words (which should be the goal of tokenization). For example, we successfully get 喜歡 (like) and 經濟課 (economics class) from ltp, but snownlp does not treat either of them as a single unit.

Because snownlp performs so poorly here, we suspect that its model might have been trained on little Traditional Chinese text data.

Therefore, below we demonstrate how to convert Traditional Chinese text into Simplified Chinese characters by using the .han attribute of snownlp. We also show that using the Simplified version of the statements improves snownlp's tokenization considerably.

The tokenization performance of snownlp improves when we switch to Simplified Chinese. Now expressions like 喜歡 (like) and 討厭 (hate) are grouped. However, ltp still does a better job than snownlp at tokenization: for example, 經濟課 (economics class) and 就是 (is) would each be better kept as one token.

While ltp is apparently the more professional NLP library from a linguistic point of view, snownlp offers several useful and interesting functions that are not available in ltp. Besides transforming Traditional Chinese into Simplified Chinese, two more examples are shown below.

To better demonstrate the following two snownlp functions, keywords and summary, let's use a paragraph instead. Below is an extract from the department's website (https://www.econ.cuhk.edu.hk/econ/zh-tw/about-us):

Notice that snownlp only converts Traditional Chinese into Simplified Chinese, not the other way round. Therefore, if we want to transform the results from snownlp back into Traditional Chinese, we need another library, opencc.

After introducing the features of snownlp, let's come back to our simple example!

Other than tokenization, detecting and eliminating stop words is also an important task in NLP. Stop words carry little meaning on their own and can add noise when training models or performing analysis. Here is how we can do it by using stopwordsiso.

We can have a look at what is included as stop words in Chinese:

After that, we should remove stop words from our sample statements before proceeding to further training or analysis. Since the tokenization results from ltp are better, we use them rather than those from snownlp.

Since we also know that ECON4130 is a proper noun that should not be useful for sentiment analysis, we can drop it too.
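The two filtering steps above can be sketched in plain Python. The (token, tag) pairs are written by hand for illustration and are not actual ltp output, and STOP_WORDS is a small hand-picked stand-in for the much longer set that stopwordsiso.stopwords("zh") would return. The names filter_tokens, STOP_WORDS and DROP_TAGS are hypothetical.

```python
# Hand-written (token, part-of-speech tag) pairs for the positive statement.
# Illustrative only; real values would come from the ltp output.
tagged = [
    ("我", "r"), ("最", "d"), ("喜歡", "v"), ("的", "u"), ("經濟課", "n"),
    ("就是", "v"), ("ECON4130", "nz"), ("了", "u"), ("!", "wp"),
]

# Small stand-in for the stop word set of a stop word library.
STOP_WORDS = {"我", "最", "的", "就是", "了", "!"}

# Tags to drop entirely; "nz" (other proper noun) removes ECON4130.
DROP_TAGS = {"nz"}


def filter_tokens(tagged_tokens, stop_words=STOP_WORDS, drop_tags=DROP_TAGS):
    """Keep tokens that are neither stop words nor carry a dropped tag."""
    return [
        token
        for token, tag in tagged_tokens
        if token not in stop_words and tag not in drop_tags
    ]


print(filter_tokens(tagged))  # ['喜歡', '經濟課']
```

Filtering on the part-of-speech tag rather than on the literal string ECON4130 means other proper nouns would be dropped in the same way, which may or may not be what a given analysis wants.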

Sentiment Analysis

After all this preprocessing, our ultimate goal is to turn the text into useful information, one kind of which is sentiment.

Again, by using snownlp, we can perform sentiment analysis on our sample statements. In snownlp, sentiments are measured from 0 (most negative) to 1 (most positive).

To get a single representative number for each statement, instead of one for each of its phrases, we can average the numbers.
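One way to sketch this averaging step is shown below. By default the helper scores each phrase with SnowNLP(text).sentiments, which assumes snownlp is installed. The helper name average_sentiment and its scorer parameter are hypothetical, and the scores in the demonstration are made up purely to show the arithmetic; they are not real snownlp output.

```python
from statistics import mean


def average_sentiment(phrases, scorer=None):
    """Return the mean sentiment score (0 to 1) over the given phrases."""
    if scorer is None:
        # Imported lazily so the helper can be tried without snownlp installed.
        from snownlp import SnowNLP

        def scorer(text):
            return SnowNLP(text).sentiments

    # statistics.mean raises StatisticsError if phrases is empty.
    return mean(scorer(p) for p in phrases)


# Made-up scores, only to demonstrate the averaging.
made_up_scores = {"喜歡": 0.75, "經濟課": 0.25}
print(average_sentiment(["喜歡", "經濟課"], scorer=made_up_scores.get))  # 0.5
```

A plain mean weights every phrase equally; a long statement with many neutral phrases will therefore drift towards the middle of the scale.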