How many Chinese words are needed for Natural Language Processing?

In the first Chilin blog entry we noted that OOV (Out-Of-Vocabulary) words are a major problem for natural language processing in general. There are other important challenges as well. One is the size of the lexicon and how relevant its words are to the subjects being covered. The actual requirements depend on the target users and usage. In the case of weather forecasting, for example, the number of words needed would be quite small. A weather forecast lexicon only needs to cover place names, temperature ranges, and some descriptive words such as “sunny, stormy, cold”, etc. For chatbots that handle, for example, cinema reservations, the usual range would also be small, except that allowance would have to be made for the open-ended titles of movies. These can be updated manually each week, unless the chatbot is meant to serve as a film archive. On the other hand, medical enquiries become progressively more complex as they move from making appointments to medical consultations.

It was also noted in the first Chilin blog entry that 3,000 common Chinese words are needed to get a good grasp of local newspaper content. Such threshold requirements can vary greatly. In kindergarten, the number of words involved may be in the hundreds, but by the end of primary school it would go up into the thousands. For professionals such as doctors, scientists, and well-educated academics, it could go up to several hundred thousand words. So a computer system that is to act as a competent surrogate in handling the Chinese language will need a suitable range of words. LiVac shows that the Chinese news media have made use of a vocabulary base of more than 2 million words in the last 21 years. This means that if an encyclopedia of Chinese were produced to cover this 21-year period, it would have well over a million worthwhile entries. A computer system doing the same job would have to satisfy similar requirements. In science and technology, the repertoire would be even greater, and the special domain words would overlap only partially with LiVac.

Modern computer systems can easily manage large lexicons. The challenge for those working on natural language processing in Chinese is to monitor the introduction of new words and to have access to a lexicon of suitable size. Chilin’s LiVac continuously monitors the pan-Chinese news media by tracking language usage and new words. As noted here, Chilin recently published a list of the top new Chinese buzzwords for 2020. This list is just a sample of the top new terms. LiVac also monitors proper names and their frequency of occurrence in the media. As noted here, Chilin has also published a list of the top Pan-Chinese Newsmakers for 2020.
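
To make the idea of lexicon coverage concrete, here is a minimal Python sketch of how one might measure the OOV rate of a text against a lexicon and list frequent unknown words as new-word candidates. It is only an illustration and not how LiVac works: the function names `oov_rate` and `new_word_candidates`, the `min_count` threshold, the tiny lexicon, and the sample words are all assumptions made for this example, and the text is assumed to be already segmented into words.

```python
from collections import Counter


def oov_rate(tokens, lexicon):
    """Return the fraction of token occurrences that are not in the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in lexicon)
    return missing / len(tokens)


def new_word_candidates(tokens, lexicon, min_count=2):
    """List (word, count) pairs for unknown words seen at least min_count times."""
    counts = Counter(token for token in tokens if token not in lexicon)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Hypothetical toy data: a four-word weather lexicon and a pre-segmented text.
lexicon = {"今天", "天气", "晴朗", "寒冷"}
tokens = ["今天", "天气", "晴朗", "直播带货", "今天", "直播带货", "寒冷", "内卷"]

print(oov_rate(tokens, lexicon))             # share of tokens outside the lexicon
print(new_word_candidates(tokens, lexicon))  # repeated unknown words
```

With a small domain lexicon such as the weather example, the OOV rate on general news text would be high. A larger lexicon lowers it, but words that appear after the lexicon was compiled will still turn up as candidates, which is why ongoing monitoring matters.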

For more information on Chilin lexicons and LiVac, please contact us here.

Abel @Chilin