    The Dire Defect of ‘Multilingual’ AI Content Moderation

    23 May 2023

    Three parts Bosnian text. Thirteen parts Kurdish. Fifty-five parts Swahili. Eleven thousand parts English.

    This is part of the data recipe for Facebook’s new large language model, which the company claims is able to detect and rein in harmful content in over 100 languages. Bumble uses similar technology to detect rude and unwanted messages in at least 15 languages. Google uses it for everything from translation to filtering newspaper comment sections. All have comparable recipes and the same dominant ingredient: English-language data.
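
    To put those "parts" in perspective, here is a minimal back-of-the-envelope sketch in Python that treats the four figures above as relative weights in a single training mix; the mix itself is an illustration drawn only from the numbers quoted, not a published corpus breakdown.

```python
# Treat the quoted "parts" as relative weights in one training mix
# (an illustration based only on the four figures above, not an
# official breakdown of any company's corpus).
parts = {"Bosnian": 3, "Kurdish": 13, "Swahili": 55, "English": 11_000}
total = sum(parts.values())

for lang, weight in parts.items():
    print(f"{lang}: {weight / total:.2%} of the mix")

# English works out to roughly 99.4 percent; Bosnian, Kurdish, and
# Swahili together account for well under 1 percent.
```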

    For years, social media companies have focused their automatic content detection and removal efforts more on content in English than the world’s 7,000 other languages. Facebook left almost 70 percent of Italian- and Spanish-language Covid misinformation unflagged, compared to only 29 percent of similar English-language misinformation. Leaked documents reveal that Arabic-language posts are regularly flagged erroneously as hate speech. Poor local language content moderation has contributed to human rights abuses, including genocide in Myanmar, ethnic violence in Ethiopia, and election disinformation in Brazil. At scale, decisions to host, demote, or take down content directly affect people’s fundamental rights, particularly those of marginalized people with few other avenues to organize or speak freely.

    The problem is in part one of political will, but it is also a technical challenge. Building systems that can detect spam, hate speech, and other undesirable content in all of the world’s languages is already difficult. Making it harder is the fact that many languages are “low-resource,” meaning they have little digitized text data available to train automated systems. Some of these low-resource languages have few speakers and internet users, but others, like Hindi and Indonesian, are spoken by hundreds of millions of people, multiplying the harms created by errant systems. Even if companies were willing to invest in building individual algorithms for every type of harmful content in every language, they may not have enough data to make those systems work effectively.

    A new technology called “multilingual large language models” has fundamentally changed how social media companies approach content moderation. Multilingual language models—as we describe in a new paper—are similar to GPT-4 and other large language models (LLMs), except they learn more general rules of language by training on texts in dozens or hundreds of different languages. They are designed specifically to make connections between languages, allowing them to extrapolate from those languages for which they have a lot of training data, like English, to better handle those for which they have less training data, like Bosnian.
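
    To make the idea of cross-lingual transfer concrete, here is a minimal sketch using the open-source Hugging Face transformers library and the public xlm-roberta-base multilingual checkpoint; the two-label "harmful or not" setup and the untrained classifier head are illustrative assumptions, not any platform's production moderation system.

```python
# A minimal sketch of cross-lingual transfer for moderation-style
# classification, assuming the Hugging Face `transformers` library and
# the public multilingual checkpoint `xlm-roberta-base`. The two-label
# setup is an illustrative assumption, not a real production pipeline.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # encoder pretrained on text in ~100 languages

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

# In practice the classification head would be fine-tuned on labeled
# (mostly English) moderation data; the multilingual pretraining is what
# lets the same weights then be applied to Bosnian or Swahili posts for
# which few or no labels exist.
def harm_probability(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(harm_probability("An English sentence like the millions it was trained on."))
print(harm_probability("Rečenica na bosanskom, jeziku koji je model jedva vidio."))
```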

    These models have proven capable of simple semantic and syntactic tasks in a wide range of languages, like parsing grammar and analyzing sentiment, but it’s not clear how capable they are at the far more language- and context-specific task of content moderation, particularly in languages they are barely trained on. And besides the occasional self-congratulatory blog post, social media companies have revealed little about how well their systems work in the real world.

    Why might multilingual models be less able to identify harmful content than social media companies suggest?

    One reason is the quality of data they train on, particularly in lower-resourced languages. In the large text data sets often used to train multilingual models, the least-represented languages are also the ones that most often contain text that is offensive, pornographic, poorly machine translated, or just gibberish. Developers sometimes try to make up for poor data by filling the gap with machine-translated text, but even then the model will have difficulty understanding language the way people actually speak it. For example, if a language model has only been trained on text machine-translated from English into Cebuano, a language spoken by 20 million people in the Philippines, the model may not have seen the term “kuan,” slang used by native speakers that has no comparable term in other languages.
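
    As a rough illustration of that gap-filling approach, here is a minimal sketch using the transformers translation pipeline; the English-to-Cebuano checkpoint name is a hypothetical placeholder, since the point is only that synthetic data inherits the translator's blind spots.

```python
# A minimal sketch of the gap-filling approach described above:
# machine-translating English examples into a low-resource language to
# create synthetic training data. The checkpoint name is a hypothetical
# placeholder for whatever English-to-Cebuano system a team might use.
from transformers import pipeline

translator = pipeline("translation", model="example-org/opus-mt-en-ceb")  # hypothetical

english_examples = [
    "This message contains harassment.",
    "Thanks, see you tomorrow!",
]

# Synthetic Cebuano text inherits the translator's blind spots: native
# slang such as "kuan", which has no English counterpart, will simply
# never appear in the generated training data.
synthetic_cebuano = [out["translation_text"] for out in translator(english_examples)]
print(synthetic_cebuano)
```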
