Marta Costa-Jussà, researcher: “Language has many subtleties that AI cannot capture” | Technology

In 2022, Meta unveiled a machine translator capable of handling 200 languages. It translates in real time and with quality far above average. “To give an idea of the scale of the program, the 200-language model has more than 50 billion parameters. We have trained it using the Research SuperCluster, one of the fastest supercomputers in the world,” said the company’s CEO and founder, Mark Zuckerberg, when it was presented.

Behind this pioneering development is Marta Costa-Jussà (Sabadell, 42), a researcher on the FAIR team (Facebook Artificial Intelligence Research), one of the most powerful artificial intelligence (AI) laboratories in the world. Costa-Jussà is one of the thirty scientists (engineers like her, but also linguists, data scientists, sociologists and ethics experts) who developed the model, called NLLB-200 (short for No Language Left Behind). The Catalan researcher is one of the coordinators of a paper recently published with her colleagues in the journal Nature, in which they explain the details of their tool.

Costa-Jussà has been working at FAIR since 2022. A telecommunications engineer from the Universitat Politècnica de Catalunya (UPC), she earned her PhD at the same university and then did postdoctoral stays in Paris, São Paulo, Mexico City, Singapore and Edinburgh, always focused on her topic: machine translation. When she settled in Barcelona, where she had finally obtained a permanent position at the UPC, she received an email from Meta. They wanted her for their NLLB-200 project. “It caught me just when I had managed to get where I had always wanted to be, but after doing the interviews, I didn’t hesitate: the team was great and the project was very interesting,” she explains by video call from Paris, where she has lived since then. In addition to research, Costa-Jussà enjoys telling stories to her three children, which led her to publish a young adult novel last year in which she mixes adventure with popular science about AI.

Question. What is special about your translator compared to others?

Answer. We have developed the first real-time translation system that works in 200 languages. The beauty of it is that translations can be made between any pair of languages from those 200, without having to go through English, as is usually the case. And the quality of the translation is the best that can be obtained today. Even today, after two years, our system is used as a reference in many scientific articles.

Q. How did you do it?

A. In short, the system works by processing parallel translations. Let me explain. You have documents in many language pairs, aligned at the sentence level. For example, I have a sentence in Catalan and its corresponding translation in English or Mandarin. When you have a large number of these texts, you feed them into a deep learning neural model and the algorithm extracts patterns. From there, the system learns to generalize. Then an extraordinary process occurs: a kind of knowledge emerges after having seen so much data, and that allows, for example, direct translations from Catalan to Yoruba, even if we do not have parallel texts in those two particular languages, so the system could not have learned that translation directly. This is possible because the tool learns to generalize between pairs of texts and to extrapolate to other cases for which it has no examples.
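As a rough illustration of what "aligned at the sentence level" means, the Python sketch below stores a few toy sentence pairs and checks which translation directions have direct examples. It is only a sketch under stated assumptions: the sentences, the `ParallelPair` type and the `has_direct_data` helper are hypothetical, the language codes merely follow the FLORES-200 naming style, and none of this is the team's actual data format.

```python
from typing import NamedTuple


class ParallelPair(NamedTuple):
    """One sentence and its translation (hypothetical record layout)."""
    src_lang: str
    tgt_lang: str
    src_text: str
    tgt_text: str


# Toy corpus invented for illustration; every pair goes through English.
CORPUS = [
    ParallelPair("cat_Latn", "eng_Latn", "Bon dia.", "Good morning."),
    ParallelPair("cat_Latn", "eng_Latn", "Gràcies.", "Thank you."),
    ParallelPair("yor_Latn", "eng_Latn", "Ẹ káàárọ̀.", "Good morning."),
    ParallelPair("zho_Hans", "eng_Latn", "早上好。", "Good morning."),
]


def covered_directions(corpus):
    """Return every (source, target) direction that has at least one example.

    A pair can be read in both directions, so each one adds two entries.
    """
    directions = set()
    for pair in corpus:
        directions.add((pair.src_lang, pair.tgt_lang))
        directions.add((pair.tgt_lang, pair.src_lang))
    return directions


def has_direct_data(corpus, src_lang, tgt_lang):
    """True if the corpus contains parallel sentences for this direction."""
    return (src_lang, tgt_lang) in covered_directions(corpus)


if __name__ == "__main__":
    print(has_direct_data(CORPUS, "cat_Latn", "eng_Latn"))  # True
    print(has_direct_data(CORPUS, "cat_Latn", "yor_Latn"))  # False
```

In this toy corpus, Catalan to Yoruba has no direct examples; that is the kind of direction the interview describes the model as handling by generalizing from the pairs it has seen.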

Q. How is this done?

A. With a lot of data, a lot of computational power, and a mathematical algorithm capable of combining all of this. Basically, you have an input sentence, from which you make a mathematical representation. You transform the sentences into mathematical vectors, and those mathematical vectors are transformed into output sentences. Everything goes through a highly multidimensional space. Obviously, you need a lot of computational power because, for the system to be able to generalize, it needs millions and millions of parallel sentences. Our original contribution has been to develop a tool capable of digesting all those examples.

Q. You say that you need millions of parallel sentences. But what happens when there are no such extensive corpora, as in Swahili or other less digitized languages?

A. We have scoured the internet and developed an algorithm that is capable of aligning texts, that is, of finding among the open data on the internet which texts are translations of others. This data extraction phase is automatic. Apart from that, as you say, there are language pairs for which we do not have a corpus, and we have had to develop it ourselves: we have paid translators to translate certain phrases for certain languages.
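One common way to frame this kind of mining is to map sentences from both languages into a shared vector space and pair up the ones that land closest together. The Python sketch below does that with cosine similarity over hand-written toy vectors. The vectors, the `mine_pairs` helper and the 0.9 threshold are assumptions made for illustration; a real system would get its vectors from a trained multilingual sentence encoder, and the interview does not describe the team's algorithm at this level of detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hand-written toy "embeddings" (assumption: stand-ins for encoder output).
CATALAN = {
    "Bon dia.": [0.9, 0.1, 0.0],
    "Gràcies.": [0.1, 0.9, 0.1],
    "On és l'estació?": [0.0, 0.1, 0.9],
}
ENGLISH = {
    "Thank you.": [0.12, 0.88, 0.1],
    "Good morning.": [0.88, 0.12, 0.02],
    "It is raining.": [0.5, 0.5, 0.5],
}


def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its nearest target sentence.

    A pair is kept only if the similarity reaches the threshold, so
    sentences with no real translation in the target set are dropped.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_score >= threshold:
            pairs.append((s_text, best_text))
    return pairs


if __name__ == "__main__":
    for source, target in mine_pairs(CATALAN, ENGLISH):
        print(source, "<->", target)
```

With these toy vectors, the greeting and the thanks are paired with their English counterparts, while the question about the station finds no close match and is left out.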

Q. Where did you get your linguistic corpus from? Did you only use open sources?

A. One of the things I like about FAIR is that our research is open, and you can see our sources. They are specified in the article and in our repository: European Parliament, UN… These are available sources that the translation community has been using for a long time. Wikipedia has parallel texts, but we use parallel sentences. All in all, we have learned a lot from there.

Q. What is the next step?

A. Now we want to make the jump beyond text-to-text translation. We are also already working on voice-to-voice translators, which we introduced last year. They not only translate, but also maintain your tone of voice and your expressiveness. At the moment, they cover 100 input languages and around thirty output languages.

Q. How far can these systems go? Will they ever overcome language barriers?

A. These systems are very useful in many situations, for example if you are lost in China and no one speaks English. But we offer translation, not interpretation. The magic of interpreters is that they take your message, summarize it and translate it into another language with complete fluency. We are still far from interpretation. Language has many subtleties and emotions that we cannot cover at the moment.

Q. In recent months, multimodal generative AI tools capable of recognizing objects in their environment through computer vision have been presented. What prospects does this open up for machine translation?

A. Yes, we are moving in that direction, towards systems that are completely multimodal (that process text, images, video and audio). We have that with Llama 3 (Meta’s latest generative AI model). Knowledge of the world, of cultures, of specific vocabulary, of context… that is what interpreters have, but not machines. Our translations are limited to the text or voice that we insert.

Q. Are there plans to add more languages?

A. We have published guidelines for adding new languages to the model, which is open. We don’t necessarily have to do it ourselves; the scientific community can do it. We make sure that whoever wants to can do it.
