🔆 How Many Languages Does Your Tech Speak? Linguistic Diversity & LLMs

Breaking down Natural Language Processing, Microsoft and Apple backing away from the OpenAI board, and how AI pronounces your name.

Lumi Era
July 18, 2024

🗞️ Issue 27 // ⏱️ Read Time: 7 min

Hello 👋

👀 Have you ever had AI technology say your name or the name of a friend and butcher it beyond recognition? You are not alone. This week, Team Lumiera looks at the intersection of language and technology, exploring how Large Language Models interpret and generate human language through Natural Language Processing.

Linguistic diversity in AI goes beyond just understanding different languages. It involves comprehending various dialects, accents, and forms of communication influenced by economic diversity and disabilities that affect speech and writing. Keep reading to learn more!

In this week's newsletter

What we’re talking about: Linguistics in Large Language Models (LLMs) - discussing its foundations, implications, and the potential it holds for businesses and communities worldwide.

How it’s relevant: Today, most AI systems are trained primarily on English data, potentially excluding a significant portion of the global population. This current lack of linguistic diversity in AI has consequences such as excluding people from healthcare due to language barriers.

Why it matters: As we develop more linguistically diverse NLP, we're opening up possibilities such as smoother global teamwork, preserving indigenous language and culture, building tech that works for people with different dialects and accents, and enabling better customer service and localisation for global companies.

Big tech news of the week…

🖥️ The US Senate Committee on Commerce, Science, and Transportation convened a hearing to discuss the implications of AI for data privacy and protection. Thursday's session touched on growing concerns from both industry and experts about the patchwork of state laws for AI and data privacy that continues to expand in the absence of national standards.

🌍 SenseTime SenseNova 5.5: China’s first real-time multimodal AI model. The upgraded SenseNova 5.5 boasts a 30% improvement in overall performance compared to its predecessor, SenseNova 5.0, which was released just two months earlier.

⚖️ Microsoft and Apple back away from OpenAI board amid regulatory scrutiny. Microsoft has dropped its seat as an observer on the board of OpenAI, and Apple has backed out of joining OpenAI’s nonprofit board. Microsoft says its observer role is no longer needed; however, some see the move as a reaction to growing concern among competition regulators.

The Building Blocks: Natural Language Processing and AI

Large Language Models (LLMs) can tackle a wide range of tasks, from translation to sentiment analysis. This versatility is made possible by Natural Language Processing (NLP), the engine behind how LLMs interpret and generate human language. In our last newsletter, we covered the concepts of chunking and vector databases. This week we build on that foundation to understand how Natural Language Processing works.

Words and phrases become numerical vectors
Similar concepts cluster together in this mathematical space
This allows LLMs to grasp relationships between ideas and context

This approach enables LLMs to tackle a wide range of language tasks that many of us use in our day-to-day lives.

source: Financial Times

While this image shows a 2D representation for simplicity, real vector embeddings typically exist in hundreds of dimensions. Each dimension captures some aspect of meaning. Words with similar meanings or usage patterns tend to cluster together. Check out this interactive article to dive deeper into the mechanics!
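To make the idea of "clustering" concrete, here is a minimal sketch in Python. It assumes tiny, hand-made 3-dimensional vectors: the numbers and the helper names (EMBEDDINGS, cosine_similarity, most_similar) are invented for illustration and do not come from any real model. It compares the vectors using cosine similarity, one common way to measure how close two embeddings are.

```python
import math

# Toy 3-dimensional "embeddings". These values are made up for illustration;
# real models learn vectors with hundreds of dimensions from large text corpora.
EMBEDDINGS = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.88, 0.82, 0.12],
    "apple":  [0.10, 0.20, 0.90],
    "banana": [0.12, 0.18, 0.88],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, embeddings=EMBEDDINGS):
    """Return the other word whose vector points in the most similar direction."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    # Print each word next to its nearest neighbour in the toy space.
    for word in EMBEDDINGS:
        print(word, "->", most_similar(word))
```

In this toy space, "king" lands next to "queen" and "apple" next to "banana" because their vectors point in nearly the same direction. A language with little training data gives a model fewer examples from which to learn such neighbourhoods, which is one way to think about the resource gap discussed below.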

A Challenge for Linguistic Diversity in AI

While the potential of linguistically diverse AI is immense, there is one significant hurdle: the disparity in language resources available for training LLMs. This gap poses challenges for creating truly inclusive AI systems. Here are some key insights from a report by the Center for Democracy & Technology.

Quality and Quantity Mismatch: Languages with less available data often suffer from lower-quality data. For many low-resource languages, up to 95% of web data is mislabeled. Most AI systems are trained primarily on English data, despite there being approximately 7,000 languages spoken worldwide.
Global Inequity: Languages most affected by poor data quality are disproportionately those written in non-Latin scripts (e.g., Urdu, Japanese, Arabic) and those spoken in regions such as Africa, Latin America and the Caribbean, Asia, and South America.
Limited Data Sources: For low-resource languages, available data often comes from a narrow range of sources like Wikipedia, religious texts, or official proceedings. This lack of diversity can lead to biased or unrepresentative language models.
Speakers vs. Resources Mismatch: The resources available for a language rarely correlate with its number of speakers. For instance, Hindi, Bengali, and Indonesian, each with hundreds of millions of speakers, are considered medium-resource languages in AI development.
Digital Divide: Despite over 600 million internet users across Africa, nearly all African languages remain low-resourced in AI development. The gap extends to the infrastructure needed to run AI systems, meaning that many people can't access AI-driven services and opportunities.

Context and Cultural Nuances

Listen to Pelonomi Moiloa, CEO and co-founder of Lelapa AI, a South African startup building artificial intelligence tools tailored to local languages, talk about linguistic diversity and LLMs, and about the importance of contextual knowledge and cultural awareness when developing LLMs: building with communities, instead of for communities.

In this video, Pelonomi asks the audience:

Have you had to change the tone of your voice, or accent, in order for AI technology to understand what you have to say?
Have you ever had AI technology say your name or the name of a friend and butcher it beyond recognition?
Have you ever texted someone in your family and had autocorrect make you send an embarrassing message?

As the above questions show, many important nuances of human communication get lost in technology.

Developing linguistic diversity in AI goes far beyond simple translation. Today's LLMs are grappling with:

Context: Understanding meaning based on cultural and situational cues
Idioms: Making sense of expressions that aren't meant to be taken literally
Tone: Picking up on the differences between formal and casual language
Cross-lingual concepts: Handling ideas that don't translate directly between languages

Take the differences between European Portuguese and Brazilian Portuguese—small on paper, but significant in cultural weight. Consider how our shift to remote work has moved most interactions online, stripping away the nuances of in-person communication. AI, primarily trained on text data, struggles to capture these subtleties, potentially affecting sensitive workplace interactions like giving feedback.

“Talk to people, listen deeply and learn from every voice. Ask bold questions, connect with people from different countries, industries, and backgrounds. Make sure your AI systems are not just smart, but inclusive and truly helpful.”

The Newsroom’s Jenny Romano reflects on language, LLMs and journalism

Lumiera's Take: Navigating the Complexities

At Lumiera, we see linguistic diversity in AI as more than a technical challenge—it brings value to the human experience, is a critical factor in addressing key societal issues, and is a business opportunity. We also see how AI could be used to enhance understanding across the diverse landscape of human languages and cultures.

The challenges we covered underscore why we need a multifaceted approach to AI development: one that blends strategic leadership that prioritises linguistic diversity, subject matter experts with linguistic expertise and cultural awareness, and a strong focus on ethical considerations.

Until next time.
On behalf of Team Lumiera

Emma - Business Strategist
Sarah - Policy Specialist
Allegra - Data Specialist

Lumiera has gathered the brightest people from the technology and policy sectors to give you top-quality advice so you can navigate the new AI Era.

Follow the carefully curated Lumiera podcast playlist to stay informed and challenged on all things AI.