SomosNLP: The long march of a group of volunteers to achieve a Hispanic ChatGPT: “A model trained in Spanish would be incredible”

“The question is ‘give me a typical recipe from Peru’, and then I’ll give it to you,” says María Grandury, founder of the volunteer organization SomosNLP. Grandury is describing a routine task for the new chatbots built with artificial intelligence (AI). That simple request, however, rests on an enormous amount of prior work, most of it automated but much of it human.

That question needs at least three basic elements: first, a database, drawn from the internet, that includes Peruvian recipes in Spanish. Second, a list of questions and answers that lets the model learn what to answer when asked about a Peruvian recipe. And third, a review step that lets someone check the answer and say whether it is correct.

This simple three-step explanation hides a huge variety of options, where financing is key. The big Silicon Valley companies, and the English language, dominate by far. What is being done in other languages? Attempts are under way, at different levels. Spanish ought to be a dominant language too, but in practice it is not. The challenge of making a machine learn to answer any question in Spanish (not just a handful focused on a single topic) is enormously complex.

The key first step is to gather massive amounts of text to train what is called a foundational model. “We don’t have a lot of text, but there has been more in the last three years, the community has been growing and initiatives have emerged from the Government,” says Grandury. She is referring above all to Alia, a model promoted by the Government of Spain, of which Minister José Luis Escrivá said in EL PAÍS that “it will open the doors to a new generation of technological products enriched with the vast linguistic heritage of Spanish and the co-official languages of Spain.”

Building the model takes a lot of original data but also a lot of computing power. That is why the Government’s agreement with the Barcelona Supercomputing Center and with IBM is essential. But that yields only a model capable of continuing text from an initial phrase, not of answering a question, and the chats that have become popular are precisely questions and answers. Such instruction data does not exist in Spanish, at least not publicly. That is where SomosNLP comes in (NLP stands for natural language processing), which tries to gather resources to improve the presence of Spanish in AI: “Of the instruction databases, the only public ones are in English. What is usually done is to take them and translate them,” says Grandury. “What we are going to do is surely create the largest open corpus of instructions in Spanish so far,” she adds.

Grandury, 26 years old and from Ponferrada, already has experience of the thorny path of setting up a viable model in Spanish. After graduating in mathematics and physics from the University of Oviedo and working briefly in Berlin, she joined Clibrain. In the summer of 2023 Clibrain “wanted to be the world reference for AI in Spanish” and its co-founder, Elena González-Blanco, was “the world reference for AI in Spanish,” according to press headlines. They even released a model with a name as Spanish as Lince. Today Clibrain has closed.

“Lince worked well. It needed to be made more accessible, for example with an interface. Although that is also expensive, having it available for people to use,” says Grandury, referring to the computing power required to keep a model available on the internet to answer users’ questions.

France already has its leader

Meanwhile, in Mistral, France has a national company that competes on a global scale. “The European AI ‘champion’ sets its sights on US technology giants,” ran a New York Times headline in April. Its chief executive, 31-year-old Frenchman Arthur Mensch, a former Google employee, said: “These models shape our cultural understanding of the world, and French values and American values differ in subtle but important ways.”

The financing gap remains enormous: OpenAI has raised $13 billion in investment; Mistral, 540 million. Mistral’s model is in English, but there is apparently an effort to add more content in French: at least 19th-century French literature, which is no longer under copyright, according to the New York Times.

Grandury met with people from Mistral shortly after the model was launched. “They weren’t saying much by then. I asked them if they had trained with text in French or Spanish. ‘It could be,’ they told me,” without clarifying any details. “People don’t talk,” she adds.

French President Emmanuel Macron receives Mensch. The Spanish president, Pedro Sánchez, announced the new Alia model and met with the Spaniard Darío Gil, a vice president at IBM. In the absence of powerful companies, well-placed Spaniards can be of help. It probably helps France that one of the “fathers of AI” is the Frenchman Yann LeCun, chief AI scientist at Meta.

The advantage of English is that the internet is in English. Spanish and French must seek out and negotiate with many institutions to feed their models, as must smaller languages such as Spain’s co-official languages or the pre-Columbian languages of Latin America.

SomosNLP does not have the capacity to train these models, but it can organize volunteer efforts, such as a hackathon, to gather pairs of general questions and answers. What motivates hundreds of volunteers to make these efforts to improve AI in Spanish? “You join a large international community of people who share your interests and you know that, while you are learning and gaining visibility, you are contributing your grain of sand to a common goal: helping to preserve your language and culture,” Grandury says.

About 20 teams of five people created 200,000 instructions in a few days. It is feasible to create questions and answers with code from databases on specific topics. “There are PDFs, websites on legal or refugee issues, and conversations in open Telegram groups. When you have a lot of data, you can automatically create pairs of questions and answers about that text. Then you send it to a writing space and now humans, the people from each team, check to see if they make sense. It is much faster because you no longer have to read and search for a question and its answer,” says Grandury. Humans act like the chatbots’ language teachers, pointing out errors and successes and correcting them so that the answers improve.
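The workflow Grandury describes, generating candidate pairs automatically and then having people check them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SomosNLP’s actual code: the names `Candidate`, `split_passages`, `template_generator`, `build_candidates` and `review` are all hypothetical, and the generator is a trivial template standing in for whatever method the teams really used to produce questions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """One automatically generated question/answer pair awaiting review."""
    question: str
    answer: str
    source: str                      # document the passage came from
    approved: Optional[bool] = None  # stays None until a human has checked it


def split_passages(text: str) -> List[str]:
    """Split a document into non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def template_generator(passage: str) -> Tuple[str, str]:
    """Hypothetical stand-in generator: asks about the passage's first sentence.

    A real pipeline would plug in its own question generator here.
    """
    topic = passage.split(".")[0].strip()
    return (f"¿Qué dice el texto sobre «{topic}»?", passage)


def build_candidates(
    docs: Dict[str, str],
    generate: Callable[[str], Tuple[str, str]] = template_generator,
) -> List[Candidate]:
    """Automatic step: one candidate pair per passage of every document."""
    candidates = []
    for source, text in docs.items():
        for passage in split_passages(text):
            question, answer = generate(passage)
            candidates.append(Candidate(question, answer, source))
    return candidates


def review(
    candidates: List[Candidate],
    judge: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Human step: record each verdict and keep only the approved pairs."""
    for candidate in candidates:
        candidate.approved = judge(candidate)
    return [c for c in candidates if c.approved]
```

In use, `build_candidates` would be fed the text extracted from the PDFs, websites and chat logs, and `judge` would be a person clicking accept or reject rather than a function. The time saving Grandury mentions comes from reviewers only having to verify a pair instead of reading the source and writing one from scratch.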

SomosNLP’s goal is to create 10 million original questions and answers in Spanish. “It would still be a third of the largest synthetic corpus in English,” Grandury says. At SomosNLP, the work is currently strictly voluntary. The only support is sponsorship for its activities, such as server time or prizes, which comes from, among others, Hugging Face, a company built around a community that works on AI in an open way.

The large companies in Silicon Valley do not reveal how they carry out this process. In January 2023 it became known that OpenAI had paid thousands of workers in Kenya to flag answers that were too toxic, so that the chatbot would learn not to give them. But there are hardly any more details: “We don’t know to what extent they automate the creation of questions,” says Grandury. “Then there is a large human part, where we do not even handle the same amount of data. Imagining how many people are registered there is unthinkable.”

Meta has just released its new model, Llama 3. In a document titled “Our responsible approach to Meta AI and Meta Llama 3”, the company spends 3,000 words explaining the steps it takes, often in collaboration with humans, so that the model does not give politically incorrect answers. But it does not say how it carried out the entire preceding process.

Why not use ChatGPT in Spanish

A recurring question is why not use the models that already exist and that respond well in Spanish. Beyond strategic, cultural and openness concerns, it is difficult for a model originally created in English to distinguish between the dialect variants of Spanish.

“The trick would be not to take a multilingual model and adapt it, but to take one that is trained in Spanish and then adapt it with data in Spanish. It would be incredible,” says Grandury. How would the differences be noticed? “There are more subtle things in language, for example how you express yourself if you have a C1 or a C2 level, or whether you use colloquial or more elaborate expressions.”

The immediate objective of the model promoted by the Government is to give companies and organizations something very Spanish for their specific needs: it is easy to fine-tune a model so that it responds only to questions about work-related accidents in Spain, car insurance or enrollment at a given university. “The trend is to go towards specialized models: a legal one, for example, so that it also learns to speak more in that type of language,” says Grandury.

Although the ultimate goal is to move towards a general chat like ChatGPT, it will not be an easy path: “We are not going to do it alone,” she clarifies, just in case.
