"Neural machine translation uses deep learning models to translate languages, overcoming limitations of rule-based approaches. The transformer model, with its self-attention mechanism, has revolutionized neural machine translation by allowing the model to focus on relevant parts of the source sentence when generating the translation. This has led to significant improvements in translation accuracy and fluency."- Gemini 2024
Common machine translation tasks
There are thousands of human languages; they may share some structural similarities, but they differ in many ways.
A key challenge in translating text from a source language to a target language is that the two languages may not agree in terms of the order or number of words required for an accurate translation.
The following example illustrates just some of the problems that can arise:
He saw a black cat under a ladder.
Il a vu un chat noir sous une échelle.
Even this trivial example shows several differences in sentence structure:
- "saw" → "a vu": one English word becomes two in French (the passé composé requires an auxiliary verb)
- "a" → "un": the article takes the masculine gender of "chat"
- "black cat" → "chat noir": the adjective follows the noun in French
- "a" → "une": the article takes the feminine gender of "échelle"
Another example concerns word count and word order in negation:
I am not tired.
Je ne suis pas fatigué.
Here the single English word "not" becomes the two French words "ne ... pas", which wrap around the verb. These and other differences in linguistic typology contribute to translation challenges. The World Atlas of Language Structures (WALS) catalogs such typological features across the world's languages.
The standard architecture for machine translation is a sequence-to-sequence model, or more precisely, an encoder-decoder network architecture.
Before we can use an encoder-decoder, we need to train a model on parallel text, or start from a checkpoint that has already been pretrained on such data.
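To make the encoder-decoder concrete, here is a minimal sketch that loads a pretrained MarianMT checkpoint (the same Helsinki-NLP/opus-mt-en-fr model used in the back-translation example below) and translates our earlier example sentence; the exact output shown is an assumption and may vary with the model version.

# Translating with a pretrained encoder-decoder (MarianMT)
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# The encoder reads the source sentence; the decoder then generates the
# target sentence token by token, attending to the encoder's states.
inputs = tokenizer('He saw a black cat under a ladder.', return_tensors='pt')
output_ids = model.generate(**inputs)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Il a vu un chat noir sous une échelle.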
A deeper dive: Transformers-based Encoder-Decoder Models
An ongoing research question is how to produce quality translations when a source or target language lacks a large corpus of parallel training text.
Solving this low-resource problem requires creative approaches. Here we list some common techniques.
In data augmentation we aim to generate new synthetic data from the available natural data. In the techniques that follow, it is important to consider the language pair in question and to avoid over-augmentation: choosing the wrong technique or augmenting too much may produce nonsensical training data.
# Back-translation with Hugging Face transformer pipelines
from transformers import pipeline

# English -> French (target) and French -> English (source) models
target = pipeline('translation', model='Helsinki-NLP/opus-mt-en-fr')
source = pipeline('translation', model='Helsinki-NLP/opus-mt-fr-en')

sentence = 'This is an English sentence.'
translated = target(sentence)[0]['translation_text']
back_trans = source(translated)[0]['translation_text']

print(sentence)
print(translated)
print(back_trans)

# Outputs
# This is an English sentence.
# C'est une phrase anglaise.
# It's an English phrase.
In addition to providing data augmentation, back-translation can be used to surface translation errors: in the example above, the round trip turns "sentence" into "phrase", a divergence worth inspecting.
In bilingual translation, a model is trained to translate from one language to another. A model that is instead trained on parallel sentences in many languages is a multilingual model. Multilingual training particularly benefits low-resource languages that are related to higher-resource languages, since the model can share representations across them.
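As a sketch of how a single multilingual model serves many language pairs, the example below uses the facebook/nllb-200-distilled-600M checkpoint with the Hugging Face translation pipeline; the model choice and language codes are assumptions (any multilingual checkpoint, such as M2M-100, works similarly).

# One multilingual checkpoint, many language pairs
from transformers import pipeline

translator = pipeline(
    'translation',
    model='facebook/nllb-200-distilled-600M',  # assumed checkpoint
    src_lang='eng_Latn',  # FLORES-200 code for English
    tgt_lang='fra_Latn',  # FLORES-200 code for French
)
print(translator('I am not tired.')[0]['translation_text'])
# Je ne suis pas fatigué.

Swapping tgt_lang is enough to translate into a different language; no separate bilingual model is needed.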
Evaluation concerns in machine translation
Evaluation options
Pros and Cons of Human vs. Automated Evaluation of Machine Translation
The best approach to evaluation often involves a combination of human and automated methods. Automated evaluation can be used for initial screening and large datasets, while human evaluation can be used for more in-depth analysis and final judgment.
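As a concrete instance of automated evaluation, the sketch below computes a corpus-level BLEU score (the metric discussed in the further reading below) with the sacreBLEU library; the toy hypotheses and references are hypothetical.

# Corpus-level BLEU with sacreBLEU (pip install sacrebleu)
import sacrebleu

# Hypothetical system outputs
hypotheses = [
    'Il a vu un chat noir sous une échelle.',
    'Je ne suis pas fatiguée.',
]
# references[0][i] is the reference for hypotheses[i]; additional
# reference sets can be appended to the outer list.
references = [[
    'Il a vu un chat noir sous une échelle.',
    'Je ne suis pas fatigué.',
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f'BLEU = {bleu.score:.1f}')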
Shortcomings
Further Reading: BLEU on Google Cloud