Lost in Translation: Why Native Language NLP Wins Out Over NLP Built on Machine-Translated Text



“The vodka is good, but the meat is rotten.” According to internet lore, that’s what you get when you translate “the spirit is willing, but the flesh is weak” into Russian and back again. Maybe it’s apocryphal, but it makes a good point: meaning and nuance get lost in translation, particularly when a computer does the translating (twice). Even the best machine translation (MT) services, which do a decent job of the essentials, struggle with nuance, ambiguity, idioms, intent – such as irony, humor, or satire – and anything requiring context, especially in less well-documented languages.

When you’re working with Natural Language Processing (NLP), your results are predicated on your AI parsing massive corpora of text. Flaws in your data – along with too-small datasets – have a huge impact. Poor-quality data is machine learning’s number one enemy, affecting both how your AI is trained and the decisions it makes once it’s pushed into production. When your entire dataset consists of content that a machine has translated from its source language into English, problems will arise – and compound over time.

Multilingual NLP: Where everyone is a Babel fish

We live in a global, multilingual world where large-scale success means reaching markets outside our home country or language community. As a result, NLP solutions increasingly must handle data in languages other than (American) English, and more and more NLP companies offer polyglot solutions. Or so they say. Their “multilingual” solutions often simply run non-English data through a machine translator and then parse the resulting English.

There are reasons for this, of course. Rather than building a separate NLP model for every language your company wants to support, this approach lets you build just one. Given that there are some 6,900 languages globally (plus dialects and regional variants), we can see the appeal. We get it: solid corpora for non-English languages are generally harder to come by, especially for “low-resource” or minority languages – which is most of them. It can be expensive or impractical to bring on board a native speaker to mark up data or correct problems in a machine’s translations. And building a ton of models is time-consuming and expensive.

There’s just one problem: Google Translate and the like are getting pretty good, but they’re no native speaker. Translated French just isn’t quite French, especially when a computer is doing the translating. As a result, context gets lost, nuances get missed, and cultural differences get overlooked. It’s a bit like doing a complex calculation and plugging in a rounded value for pi part-way through – you’ll get something close to the right result, but not the right result.

And when you have an AI chewing through huge amounts of data to learn, analyze, and identify trends, those miscommunications, misunderstandings, and flat-out mistranslations compound and magnify.
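To put a rough shape on that compounding, here’s a minimal sketch in Python. It assumes, purely for illustration, that every lossy step (a translation pass, a model trained on translated text, and so on) independently preserves a fixed fraction of the original meaning. The function name `retained_fidelity` and the 0.95 figure are made up for the example; they are not measurements of any real MT service.

```python
def retained_fidelity(per_stage_fidelity: float, stages: int) -> float:
    """Fraction of the original meaning left after `stages` lossy steps.

    Assumes each step independently keeps `per_stage_fidelity` of whatever
    it receives, so the losses multiply. This is a toy model, not a metric.
    """
    if not 0.0 <= per_stage_fidelity <= 1.0:
        raise ValueError("per_stage_fidelity must be between 0 and 1")
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return per_stage_fidelity ** stages


# Hypothetical figure: each step keeps 95% of the meaning it is given.
for stages in (1, 2, 3, 5):
    print(stages, retained_fidelity(0.95, stages))
```

Under that simplifying assumption the loss is multiplicative: every extra translated layer between the original text and your model leaves less of the original intact, even when each individual step looks “pretty good” on its own.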

I said, do you speak my language?

Things only get worse when you realize that AIs trained on machine-translated non-English text are often learning from the output of an MT service that was itself trained on machine-translated text, which came from an MT service trained on…you get the (recursive) picture.

Take the Cebuano Wikipedia, which was written almost entirely by a single bot rather than by Cebuano speakers. Or the Scots Wikipedia, a large share of which was written by an American teenager who did not speak Scots, in a style that only vaguely resembles the language. Training your bot on language created by a bot (or a non-speaker) and not necessarily corrected or vetted by native speakers can result in ever-compounding biases and inconsistencies.

Even if you’re working with a solidly documented and supported language with a robust online presence, translation has its pitfalls. In one example, the accuracy of sentiment analysis dropped significantly when Dutch text was first translated into English and then analyzed – even when the translation was perfect. And let’s face it, if you’ve ever clicked “translate this Tweet,” you know that the feature often can’t even identify the correct language, let alone find the right words. For some highly structured projects, this might be fine. But “near enough is good enough” doesn’t fly when your business’s future decision-making rests on the data your NLP spits out.

Keep it close to home: native-language NLP is the way to go

Creating multiple native-language models is more time-consuming and resource-intensive than forging “one model to rule them all,” but it delivers significantly better results. Unless your project involves very pragmatic, factual language and tasks such as closed questions, you’ll likely see your native-language models outperform your translated ones.

We’ve found in our work that results are significantly better when we process text in its native language rather than through machine translators. This is especially true for low-resource languages and specialized domains such as pharma and healthcare. Thus, it’s essential to focus on developing highly trained models for a (growing) handful of languages rather than relying on the translation approach. It’s a method that takes time, care, and extensive input from ML experts, linguists, and domain experts, but it delivers better results than running everything through current AI translators.
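One way to picture the “handful of native models” approach is as a simple router: send text to a native-language model when one exists, and record how each result was produced. The Python below is only a sketch under stated assumptions. `make_router`, the stub models, and the fallback to a translate-then-analyze path are all hypothetical and are not a description of any particular product; the stubs return fixed labels just so the example runs.

```python
from typing import Callable, Dict

# A "model" here is just a function from text to a label. Real models would
# be trained per language; everything below is a hypothetical stand-in.
Analyzer = Callable[[str], str]
Translator = Callable[[str, str], str]  # (text, source_lang) -> English text


def make_router(
    native_models: Dict[str, Analyzer],
    translate_to_english: Translator,
    english_model: Analyzer,
) -> Callable[[str, str], Dict[str, str]]:
    """Build an analyze(text, lang) function that prefers native models."""

    def analyze(text: str, lang: str) -> Dict[str, str]:
        native = native_models.get(lang)
        if native is not None:
            # Native path: the text is never translated.
            return {"label": native(text), "path": "native"}
        # Assumed fallback: translate first, then use the English model,
        # and flag the result so it can be treated with more caution.
        english_text = translate_to_english(text, lang)
        return {"label": english_model(english_text), "path": "translated"}

    return analyze


# Stub components so the sketch runs; they do no real analysis.
router = make_router(
    native_models={"nl": lambda text: "positive", "fr": lambda text: "negative"},
    translate_to_english=lambda text, lang: text,
    english_model=lambda text: "neutral",
)

print(router("Wat een goed product", "nl")["path"])
print(router("some Cebuano text", "ceb")["path"])
```

Recording the `path` alongside each label is the point of the design: anyone consuming the results can tell which ones came from a native-language model and which ones passed through a translator first.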

If you’re working in the NLP space, it’s time to think beyond English and create native-language models that will better serve your clients – and AI more generally. If your business is investing in NLP solutions, make sure you know whether your data is coming straight from the horse’s mouth or via the interpretive lens of an MT service, and what that means for your project. Because “results may vary” isn’t something you want to apply to your bottom line.

