New words and language processing by computer

Two popular products of LIVAC are the annual New Chinese Buzzwords Roster and the annual Newsmakers Roster. They reflect the life and times of the Chinese communities over the previous year. The 2020 Chinese New Buzzword Roster has been released.

It is important to realize that new words can baffle speakers from another Chinese community, and that they pose an even more serious challenge for the computer processing of natural language. Each year, around 5% to 6% of the words found in each community’s verbal repertoire are new, and many are not shared with other groups. These are called OOV (Out-of-Vocabulary) words, and they pose a serious and under-recognized problem when computers are used to deal with human language.

In the case of the Chinese language, a person needs about 3,000 of the most common words to get a good grasp of the newspapers she or he reads. Most individuals do not have more than fifty thousand words in their active vocabulary, but our LIVAC database has collected more than 1 million words used in Chinese media over 21 years. This means many Chinese words are potential OOV words for the average person. They include scientific and technical terms, among them medical ones. Some medical terms have popular equivalents in the colloquial language, for example "blood clot in the brain" for subdural hematoma. While a computer could easily take on a much bigger vocabulary, and could contain all of the words in LIVAC, its vocabulary would be no bigger than the lexicon the programmer had put into the system. In chess, the number of possible moves is finite, though the number of sequences is very large. Language as represented by its vocabulary, however, is open-ended: there can always be additional items which cannot be predicted, and these are OOV words.
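To make the idea of a fixed lexicon concrete, here is a minimal Python sketch of how an OOV rate might be measured. It assumes the text has already been split into words (for Chinese, that means a word segmentation step has been run first), and the function name `oov_rate`, the toy lexicon and the sample tokens are all hypothetical illustrations, not LIVAC data or LIVAC's own method.

```python
def oov_rate(tokens, lexicon):
    """Return (rate, oov_tokens) for a list of words checked against a fixed lexicon.

    rate is the share of tokens that do not appear in the lexicon.
    """
    if not tokens:
        return 0.0, []
    oov = [t for t in tokens if t not in lexicon]
    return len(oov) / len(tokens), oov


# Hypothetical toy lexicon: the only words this "system" knows.
lexicon = {"the", "patient", "has", "a", "blood", "clot", "in", "brain"}

# Hypothetical pre-segmented input that uses the technical term instead.
tokens = ["the", "patient", "has", "a", "subdural", "hematoma"]

rate, oov = oov_rate(tokens, lexicon)
print(f"OOV rate: {rate:.1%}, OOV words: {oov}")
```

However large the lexicon is made, the same check applies: any word coined after the lexicon was compiled will be counted as OOV until someone adds it.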

In many applications of computer processing of human language, the limitation of the base vocabulary is often overlooked. The percentage of OOV words involved is small, and it is easily tolerated in a system that is relatively imperfect to begin with, as most localized products are.

We will return to issues such as these in the future. Meanwhile, we welcome your feedback here.

Barry @Chilin HK