
This is because the Mandarin translation uses the correct quantifier 块 (kuài), but 块 (kuài) corresponds to both 蚊 (man1, for money) and 粒 (nap1, for sugar cubes) in Cantonese.

All the examples above demonstrate that existing commercial English-to-Cantonese machine translation systems produce unsatisfactory results. In some cases, the system cannot even fully convert Mandarin into Cantonese: some words are translated into Cantonese while others remain in Mandarin, producing an incongruous hybrid of the two. Such a result is offensive to Cantonese speakers and confusing to other users. Therefore, there is an urgent need to improve machine translation from English to Cantonese.

Current State of Neural Machine Translation

Currently, the most advanced approach to machine translation is neural machine translation, in which Transformer-based models are the most widely adopted architecture. To train such a model, researchers need to prepare a bilingual parallel corpus from the source language to the target language. During training, the model takes a sentence in the source language as input and learns to output the corresponding sentence in the target language. However, this common approach is not suitable for Cantonese machine translation: Cantonese is a low-resource language with only a relatively small English-Cantonese parallel corpus.

In the above example, Baidu Translate unexpectedly chose the wrong quantifier 蚊 (man1) instead of 粒 (nap1) to describe a sugar cube.
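The quantifier problem can be illustrated with a minimal sketch. It assumes a context-free substitution table of the kind a rule-based Mandarin-to-Cantonese converter might use; the names `NAIVE_RULES`, `QUANTIFIER_BY_NOUN_CLASS`, `naive_convert` and `context_aware_convert` are hypothetical, and nothing here reflects how Baidu Translate is actually implemented. The only facts taken from this document are that 块 maps to 蚊 for money and to 粒 for a sugar cube.

```python
# Hypothetical context-free rule table: each Mandarin word has exactly one
# Cantonese replacement. Illustrative only, not taken from any real system.
NAIVE_RULES = {"块": "蚊"}

# Hypothetical context-aware table: the Cantonese quantifier depends on
# what is being counted.
QUANTIFIER_BY_NOUN_CLASS = {
    ("块", "money"): "蚊",
    ("块", "sugar cube"): "粒",
}


def naive_convert(quantifier: str) -> str:
    """Replace a Mandarin quantifier without looking at its noun."""
    return NAIVE_RULES.get(quantifier, quantifier)


def context_aware_convert(quantifier: str, noun_class: str) -> str:
    """Replace a Mandarin quantifier, taking the counted noun into account."""
    return QUANTIFIER_BY_NOUN_CLASS.get((quantifier, noun_class), quantifier)


if __name__ == "__main__":
    # The naive rule always yields the money quantifier, even for sugar cubes.
    print(naive_convert("块"))
    print(context_aware_convert("块", "sugar cube"))
```

Because a one-to-one table cannot represent a one-to-many correspondence, any fix has to look at context, which is hard to cover with hand-written rules alone.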


Although both translators are closed-source and their internal architecture cannot be inspected, the translation results make it clear that both first translate English into Mandarin and then use a rule-based system to convert Mandarin into Cantonese:

Source: The motor has a broken piston.

Motivation

Current State of English-to-Cantonese Machine Translation

Cantonese is a language spoken by 85.6 million people worldwide. Both Baidu Translate and Bing Translate have launched commercial English-to-Cantonese machine translation services. However, the results produced by these two translators are not satisfactory.

python generate.py atomic-thunder-15-7.dat

Results

Model
