May 18 2023 GM


Glitter Meetup is the weekly town hall of the Internet Freedom community at the IF Square on the TCU Mattermost, at 9am EDT / 1pm UTC. Do you need an invite? Learn how to get one here.

Languages Left Behind: Automated Content Analysis in Non-English Languages

Aliya Bhatia and Gabriel Nicholas from the Center for Democracy & Technology will talk about their forthcoming research into the limits and capabilities of automated content analysis in languages other than English.

We will discuss how the lack of non-English data for automated content analysis systems can have significant geopolitical implications, like content moderation, and how academics, tech companies, and civil society are trying to close the computational gaps between English and other languages.

  • Aliya Bhatia is a policy analyst on the Center for Democracy & Technology (CDT)'s Free Expression team, which works to promote users' free expression rights in the United States and around the world.
  • Gabriel Nicholas is a Research Fellow at CDT, where his research focuses on automated content moderation and data governance. He is also a joint fellow at the NYU School of Law Information Law Institute and the NYU Center for Cybersecurity.


Could you introduce yourself a bit more to the folks and tell us about the work you've been doing?
  • Gabe: I'm Gabe and I'm a Research Fellow at the Center for Democracy & Technology. I'm a software engineer by training and work on issues around AI, content moderation, and competition between tech platforms.
  • Aliya: I'm Aliya Bhatia and I'm a policy analyst at the Center for Democracy & Technology. I work on free expression issues at CDT, which includes looking at content moderation, child safety, and ensuring our online services are accessible to everyone trying to access and share information. I work predominantly on US issues, but also on global ones, particularly in India.
What exactly are automated content analysis systems and how would one encounter them on a day-to-day basis?
  • Aliya says: As mentioned, our paper is out on Tuesday (23rd). It looks at a specific type of machine learning technology that promises to be able to analyze text from multiple languages. We encounter automated content analysis systems almost daily! From systems that rank results in our search engines, to systems that sort or filter comments on a newspaper's website, to, of course, tools that moderate, promote, demote, and even take down content on social media feeds.
  • Gabe adds: To put it more dryly, in our paper we define "content analysis" as "the inference and extraction of information, themes, and concepts from text".
And from the outset, what have you noticed about the strengths and limitations of these systems? How are they trained in languages?
  • So long, long ago (i.e., the late 2010s), if a developer wanted to build a content analysis system — say, to detect whether reviews of a business are positive or negative — they would create an algorithm specifically for that task. In this case, they'd take lots of data on positive and negative reviews, train an algorithm on it, and have a program that can do that one task. For a long time, this is how content analysis systems were built — one task at a time.
  • Now, however, those systems are often built using language models (the same core technology as ChatGPT, PaLM, and others that have filled the news). Language models scan large volumes of text data and learn the patterns contained within. They can then use those patterns to be "finetuned" on different tasks. The advantage is that they can leverage a sort of "understanding" of language. If you built your system for detecting positive and negative business reviews the old-fashioned way and a word came up that didn't appear in your data set, it would have no idea what to do. That's not the case with a language model, which has already been trained to recognize a great deal of language.
  • For instance, there's a smelly restaurant near my house. A language model would do better at recognizing that the word "smelly" is a bad thing, even if no other business has been described as smelly.
  • Beyond just the use of LLMs, automated content analysis systems are critical for online services to analyze and manage content at scale. Platforms are deciding every day whether or not to host content based on what they think their audience wants to see and on what keeps the platform usable and valuable, e.g., removing spam or hate speech.
  • But at the same time, automated content analysis systems can fall short because they can't discern tone and intentionality like humans can! So, like Gabe says, a system may understand a word but not know whether "smelly" was used ironically or jokingly. This is particularly concerning when these systems are used to moderate content automatically, say by taking down images of violence, which can be posted to glorify violence or to document it, as human rights advocates do.
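The older, task-specific approach Gabe describes can be sketched as a toy bag-of-words classifier. This is a minimal illustration in pure Python with made-up reviews (the data and function names are hypothetical, not from the paper): it shows how such a system has no signal at all for a word like "smelly" that never appeared in its training data, whereas a language model already has some notion of what "smelly" means.

```python
# A minimal sketch of the older, task-specific approach: a
# bag-of-words sentiment classifier that only knows words it
# has seen in its (toy, made-up) training reviews.
from collections import Counter

TRAIN = [
    ("great food friendly staff", 1),   # 1 = positive review
    ("loved the service", 1),
    ("terrible food rude staff", 0),    # 0 = negative review
    ("awful experience", 0),
]

# Count how often each word appears in positive vs. negative reviews.
pos, neg = Counter(), Counter()
for text, label in TRAIN:
    (pos if label else neg).update(text.split())

def score(review: str) -> str:
    """Classify by summing per-word evidence; words never seen in
    training contribute nothing (Counter returns 0 for them)."""
    s = sum(pos[w] - neg[w] for w in review.split())
    return "positive" if s > 0 else "negative" if s < 0 else "unknown"

print(score("great service friendly staff"))  # → "positive"
print(score("smelly"))  # never seen in training → "unknown"
```

A language-model-based classifier, by contrast, would arrive already having seen "smelly" in its pretraining text, which is the advantage described above.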
When it comes to languages, which are more dominant? And why does the availability of training data and language-specific software tools vary so widely across different languages?
  • Gabe responds: To answer the first question — it certainly is English! By a pretty wide margin. Not only are there multiple orders of magnitude more data available than in any other language, but that data is better organized (into datasets on different topics, and into datasets that can be used to test the capabilities of different AI tools) and higher quality (less often the machine-translated or misclassified text that is common for other languages), and a huge amount of research goes into English-language AI.
  • Here's a great link that illustrates the gap clearly
  • And these are only papers that mention the language in them! Most papers don't even mention what language they're about because people assume they're about English. It's called the "Bender Rule" and you can learn more about it here
  • But definitely, as Gabe mentioned, English is extremely high-resourced, which is perhaps not very surprising.
  • It's really a matter of two forces working together. One is colonialism: English is the official or de facto official language of many of the world's countries because of British imperialism, which means many of the world's documents are in English and were digitized first, and countries around the world continue to produce English-language text, creating a preponderance of text to train models on. There were also several forces during the time of British imperialism that actively discouraged the production of text in "native" languages.
  • The second force is this new phase of American tech hegemony, in which American tech companies have achieved global power and have perpetuated the same dominance of the English language by creating and scaling English-language products and services for a global audience.
The question this then raises for me is the kind of implications these limitations can have. Aliya, you mentioned it can hamper research/ scrutiny. What else have you both noticed in terms of the impact?
  • Aliya responds: Well, one is that the dominance of the English language means that most models are trained and tested predominantly on English-language data, at the expense of other languages. But I say it hampers research and scrutiny also because the assumption that English is language-neutral, or a stand-in for language-agnostic tools, may skew incentives away from the development of language-specific tools and research.
  • There are other limitations we find with the development of models that are trained predominantly on English that we uncover in our paper too.
  • Gabe adds: As we mentioned before, language models learn from the data they train on, so they're also limited by that data. Our work focuses on a new type of language model called "multilingual models", which train on dozens or hundreds of languages and learn connections between them. That way, developers claim, they can learn to understand languages they have less data for — say, Amharic — by uncovering connections to languages they have more data for — say, English.
  • But that approach still has limitations! If a language model's understanding of one language comes from another, that means it also imports the biases and worldviews of that other language. That limits how well it works in more local and context-specific applications, like, as you mentioned, Astha, content moderation.
  • Aliya builds on these reflections: If a model has not seen enough instances of text in a certain language, it is likely to fall short at content moderation, which is such a deeply context-specific task. Companies claim that these models can be repurposed for moderation-related tasks by learning language-agnostic rules, or even language-specific semantic and syntactic rules, but they still fall short if a model has not seen enough examples of text in a specific language, especially examples of how people actually speak. As Gabe mentioned, these "universal" rules of language could very much be values and assumptions encoded in the English language and in English-language training datasets, which we then apply to different languages that may have very different norms. For example, "uso", meaning dove in the Basque language, is not a symbol of peace as it is in other languages. In fact, it is a derogatory word. So if a model was trained to believe that the English association of doves with peace holds across all languages, it could fall short in analyzing Basque-specific uses of the word.
  • Even when models are trained on text from a certain language, because so much of the available training data is in English, they may be exposed to poorer-quality training data in languages other than English. Researchers have shown that these training datasets often include gibberish, pornographic or offensive material, or poorly machine-translated content. This means these models are less likely to be exposed to the words native speakers actually use when they speak online. All of this hampers these models' ability to moderate content effectively.
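The "uso" example above can be made concrete with a toy sketch of naive cross-lingual transfer. Everything here is a hypothetical illustration (the tiny lexicons and the function are made up, not how any real multilingual model works internally): sentiment learned for an English word is applied to its Basque gloss, so "uso" inherits the positive English association of "dove" even though in Basque it is derogatory.

```python
# Toy illustration (hypothetical data): transferring an English
# sentiment lexicon to Basque via word-for-word glosses.
EN_SENTIMENT = {"dove": "positive"}  # English: dove = symbol of peace
EU_TO_EN = {"uso": "dove"}           # toy Basque-to-English gloss table

def transferred_label(basque_word: str) -> str:
    """Label a Basque word by looking up its English gloss, wrongly
    assuming sentiment carries over across languages."""
    gloss = EU_TO_EN.get(basque_word)
    return EN_SENTIMENT.get(gloss, "unknown")

# The label inherits the English association, but in Basque "uso"
# is a derogatory word, so this transferred label is wrong.
print(transferred_label("uso"))  # → "positive"
```

Real multilingual models transfer far subtler statistical patterns than a one-word lookup, but the failure mode is the same: the association travels with the language that had more training data.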
On the one hand, it's quite something to think that one model can train in hundreds of different languages. But given those limitations you mention around biases and cultural differences, are there ways that academics or tech companies or civil society groups (like CDT!) are trying to close the computational gaps between English and other languages?
  • Gabe: So on the one hand, you can close the gap between "high" and "low" resource languages by going out and collecting more data. That's both easier said than done, and also, it raises really hairy questions about privacy and consent from those language communities — do language speakers really want AI working better in their language? What are the larger ramifications of that?
  • Language models also have to deal with something of a zero-sum game called "the curse of multilinguality" — the more different languages they train on, the worse they perform in each one, since they fail to learn each language's specific nuances. So even increasing the amount of data a model is trained on in say, Tagalog, might hurt its performance in say, English and Hindi.
Could you also talk about the current and existing models of software tools that ARE collecting non-English data?
  • Aliya says that she sees these two questions as combined! Lots of groups are thinking about bridging the computational gaps, but as Gabe said, it's super hard. Some of the communities leading the charge are the individual natural language processing research communities around the world that are grappling with the tradeoffs Gabe just articulated. They include ARBML, which works on digitizing Arabic-language text and is thinking about what representation of such a vast and polyglossic language looks like (polyglossic essentially means a language with multiple different variations); Masakhane, which works on African languages; IndoNLP, on Indonesian languages; and AmericasNLP, on American indigenous languages. These communities are using a mix of tools and methods to digitize these languages with deep language- and community-specific knowledge, while grappling with those tradeoffs as well. Ultimately, our paper also defers to these groups: they are the ones with the knowledge and experience to actually close these gaps, to digitize and annotate text in a way that is rooted in an understanding of how language communities actually speak, and to think about proprietorship and ownership of language in a novel way. They should be funded and deferred to in the development of tools that work in multiple languages.
We can end with you both sharing with us what CDT's new technical primer will look at. How can it help folks working in the digital rights community?
  • CDT's primer will explain how these "multilingual large language models" work, how they are used, and what their limitations are. It's great for any digital rights person looking at automated systems in languages other than English.
  • The paper will be out this upcoming Tuesday, and as Aliya said, we're having a launch event on Wednesday at 10:00 AM ET that you should all attend! We'll be going into a lot more detail there and talking to NLP experts who can give the inside scoop on these systems.
  • You can contact us at and