Improvement of English-to-Chinese MT of Pharmaceutical Texts

In our last blog, we discussed how Chilin offered English-Chinese parallel sentence pairs on the TAUS Data Marketplace. Our first offerings are related to the Pharmaceutical-Biotechnology domain. 


Does this data actually improve machine translation? The answer is “yes,” as we show below.


Chilin conducted a test using the parallel sentence data and the AutoML Translation tool on the Google Cloud Platform. The study was based on:

  • 8,000 English-Chinese sentence pairs for TRAINING.
  • 2,000 sentence pairs for VALIDATION.
  • 2,000 sentence pairs for TESTING.
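For readers who want to prepare a similar split, here is a minimal Python sketch. It is not Chilin's actual tooling: the function names, the fixed random seed, and the tab-separated output layout are all assumptions made for illustration, so check the AutoML Translation documentation for the file formats it currently accepts.

```python
import random


def split_pairs(pairs, n_train=8000, n_valid=2000, n_test=2000, seed=0):
    """Shuffle sentence pairs and cut them into three disjoint sets."""
    needed = n_train + n_valid + n_test
    if len(pairs) < needed:
        raise ValueError(f"need {needed} pairs, got {len(pairs)}")
    shuffled = list(pairs)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:needed]
    return train, valid, test


def to_tsv(pairs):
    """Render pairs as tab-separated lines (assumed layout: English<TAB>Chinese)."""
    return "".join(f"{en}\t{zh}\n" for en, zh in pairs)


if __name__ == "__main__":
    # Placeholder data standing in for a real parallel corpus.
    demo = [(f"English sentence {i}", f"中文句子 {i}") for i in range(12000)]
    train, valid, test = split_pairs(demo)
    print(len(train), len(valid), len(test))  # prints: 8000 2000 2000
```

Shuffling before cutting keeps the three sets disjoint while drawing each from the same distribution, which is what makes the held-out score meaningful.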

Once the data is selected and loaded, AutoML makes it relatively simple to train a custom machine translation model. This simplicity is by design, as described in this interview with Francesco Bombassei of Google on Slator.com – Inside Google’s Custom Neural Machine Translation—AutoML Translate. “Under the hood, he explained that AutoML works with transfer learning and neural architecture search. Transfer learning is a way to use machine learning models as a basis for training others.” Domain specific training data is used to improve the base Google NMT system for that specific domain. 


The Results: The Chilin Custom MT showed a BLEU score improvement of almost 5 points. 
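For readers unfamiliar with BLEU: it measures how many word sequences (n-grams) in the machine output also appear in a human reference translation, with a penalty for output that is too short. The sketch below is a simplified, single-reference version in plain Python. Splitting Chinese into individual characters is our assumption for illustration, and this is not necessarily how AutoML computes the score it reports; established tools such as sacreBLEU are the better choice for real evaluations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale, one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        # Assumption: treat each non-space character as one token.
        h = [c for c in hyp if not c.isspace()]
        r = [c for c in ref if not c.isspace()]
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output shorter than the reference.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    reference = ["离心分离沉淀的糖原,用70%乙醇洗涤,用水再溶解,用LSC测定掺入糖原中的[14C]葡萄糖。"]
    candidate = ["离心分离沉淀的糖原,用70%乙醇洗涤,再溶解在水中,用LSC测定[C]葡萄糖掺入糖原中。"]
    print(round(corpus_bleu(candidate, reference), 1))
```

A score computed on a single sentence, as in the usage lines above, is noisy; BLEU is normally reported over a whole test set.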


The testing was performed on patent-based parallel sentences outside of the Training, Validation, and Test sets. Here are some sample results:
 

English to Chinese Example #1

Original English
MMPs are known to be synthesized as latent precursor enzymes that can be activated by limited proteolysis, but the exact mechanism by which this activation takes place in vivo is largely unknown.

A. Google Standard Machine Translation
已知MMP可以作为潜在的前体酶合成,可以通过有限的蛋白水解作用来激活,但是在体内发生这种激活的确切机制在很大程度上尚不清楚。

B. Google MT augmented with Chilin data
已知MMP被合成为可通过有限的蛋白水解激活的潜在前体酶,但是在体内发生这种激活的确切机制在很大程度上是未知的。

C. Human Translated Chinese
目前,已知MMPs是作为能通过限制性蛋白水解作用活化的潜在前体酶合成的,但是体内发生该活化作用的确切机制还不为人知。

English to Chinese Example #2

Original English
The precipitated glycogen was separated by centrifugation, washed with 70% ethanol, and redissolved in water, and the incorporation of [C] glucose into the glycogen was determined by LSC.

A. Google Standard Machine Translation
通过离心分离沉淀的糖原,用70%乙醇洗涤,然后再溶解在水中,并通过LSC确定[C]葡萄糖向糖原中的掺入。

B. Google MT augmented with Chilin data
离心分离沉淀的糖原,用70%乙醇洗涤,再溶解在水中,用LSC测定[C]葡萄糖掺入糖原中。

C. Human Translated Chinese
离心分离沉淀的糖原,用70%乙醇洗涤,用水再溶解,用LSC测定掺入糖原中的[14C]葡萄糖。

English to Chinese Example #3

Original English
There was no significant difference in the frequency of apoptosis in tumor cells in the treated xenografts, and no clear effect on angiogenesis as measured by microvascular density (MVD) via immunohistochemical staining for the endothelial cell marker, CD31.

A. Google Standard Machine Translation
经治疗的异种移植物中,肿瘤细胞凋亡的频率没有显着差异,并且对内皮细胞标记物CD31进行免疫组织化学染色,通过微血管密度(MVD)测得对血管生成没有明显影响。

B. Google MT augmented with Chilin data
在经处理的异种移植物中,在肿瘤细胞中的凋亡频率没有显着差异,并且如通过微血管密度(MVD)通过对内皮细胞标记物CD31的免疫组织化学染色所测量的,对血管发生没有明显的影响。

C. Human Translated Chinese
在被治疗的异种移植物中肿瘤细胞的细胞凋亡频率无显著差别,如通过对内皮细胞标记CD31进行免疫组织化学染色的微血管密度(MVD)所测量的那样发现对血管发生无显著影响。

While the A and B outputs are similar, we have found that the enhanced model produces output that is more accurate and more readable, as one would expect with a BLEU score improvement of almost 5 points.

Questions: In the above three examples, which of the A, B, and C translations is the best? Is human translation always the best, and error-free? We welcome your views!

If you would like details on the test and the results, contact us. 
 

In our next blog, we will test a Chinese-to-English model.

L. Cady