ChatGPT’s launch in November 2022 amazed the world with how well it wrote in any language. That success obscured the fact that a model able to answer any question carries a set of values behind its correct grammar and syntax. As time goes by, more and more studies highlight the importance of training models on different languages and values: “We need the technical infrastructure to encourage the training of AI models with French and European cultural data,” says a French government report from March, which insists that without its own AI, Europe risks “losing control of the future.”
It is not surprising that the French government attaches importance to cultural data. “When one talks about Spanish models, one is referring to the linguistic aspect, but language models include a geographic position, values. Models like ChatGPT have values similar to those of a man in his 30s, white, who went to university, born on the West Coast of the United States,” says Luciana Benotti, a computational linguist at the National University of Córdoba (Argentina).
To broaden this Anglocentric perspective, the Spanish government announced its Alia language model project. At least 20% of the total texts with which it will be trained will correspond to languages spoken in Spain, while ChatGPT and its competitors do not reach 5% in Spanish. This will make it more reliable for Spanish speakers, since typical problems such as biases will be corrected: the use of the masculine and feminine genders is different in Spanish compared to English, for example.
Chile’s National Artificial Intelligence Center is also working on “a large open model of language by Latin Americans for Latin Americans” currently called LLM Latino. Although the computing power is lower than in the Spanish model, the goal is similar, more focused on the region. There are associations of volunteer specialists who are also working to achieve better corpora and resources in Spanish.
The Alia model is more accessible and useful for Spanish speakers than those trained primarily in English: “There is a huge gap between the amount of resources and language models for English and for Spanish. Supporting each other as Spanish-speaking countries will help us advance more quickly,” says Jocelyn Dunstan, a researcher at the Pontifical Catholic University of Chile. But in Spain, the language is still seen as something different: “The RAE includes 80% of words from Spain and 20% from Latin America, meaning we are under-represented,” says Dunstan.
The weight of Spanish
Latin America has tended to view technological innovations from afar. But with this new development, it has a basic tool that is close to home: Spanish. “Here we are never the main market. People think that the power of ChatGPT is incredible because it gives them, for example, a menu with calories and they think it can solve everything,” adds Dunstan, who describes a project that tested ChatGPT on the Rapa Nui language: the model seemed to speak it, but its output was erratic or it invented phonemes.
One way to gauge the gap between the United States and Latin America in this sector is the association that brings together academics who work in computational linguistics: NAACL, the North American Chapter of the Association for Computational Linguistics. At the last NAACL meeting there were about 50 Latin American researchers and another 50 Latino Americans, out of a total of about 2,000 participants.
This overwhelming difference obviously influences the fact that the language most analyzed in scientific articles is English. “When a natural language processing article works only in Spanish, it is very difficult for it to be accepted at a top-level conference. It is expected to be a multilingual study and to include English, Italian, French, and others. This requirement does not apply to English, where the amounts of text are also enormous. People who work in English can do so only in that language and no one complains,” says Dunstan.
Cheap and old data
Benotti works with the Vía Libre Foundation, with international funding from the Mozilla Foundation, to explain how the biases and risks of these models vary depending on their origin and training: “Since the models are trained with large volumes of cheap and old data from the internet, they often absorb existing prejudices. This can lead to results that reinforce stereotypes such as ‘Mapuche people are drunk’ or ‘women go to the kitchen’. There is a lot of work in our area of research to reduce these biases and to steer these models away from a values perspective rooted in the global north,” explains this linguist.
The variants of Spanish spoken in Latin America are often less well established in popular perception. Some may find it odd that a language model would use them without taking context into account: “We are very used to standard Spanish being what is right, and using these more regional words is frowned upon. Using more neutral language seems to give it greater authority and knowledge,” says Benotti.
In recent years, research has been carried out on how these models respond to questions from different fields: what kind of words they use, what they understand about the different dialects of a large language, or how well they handle the details of smaller languages. This work is still in its early stages. Dunstan has just finished, for example, an article with researchers from the BSC (Barcelona Supercomputing Center) examining whether the models developed in Spain are useful in the Chilean context, though in one very specific domain: oncological language. They found that the models could be used, but with a caveat: oncological reports tend to be written in a more measured manner than those in other specialties. “This does not imply that the text for emergencies or with abbreviations will work the same,” says Dunstan.