Europe is a continent rich in linguistic and cultural diversity, home to more than 200 languages spoken across its territory. While this represents a profound cultural treasure, it also poses a significant challenge in the realm of artificial intelligence, particularly in the training of large language models (LLMs).
In a recent blog post, Microsoft President Brad Smith highlighted the issue, stating: “As the world becomes increasingly digital, much of Europe’s linguistic and cultural diversity risks being left behind.” The main reason? A vast majority of the internet’s content is in English, and much of it reflects an American perspective. In other words, the primary data sources feeding LLMs are overwhelmingly biased toward one language and one cultural context.
The Risk of Linguistic Inequality
Smith warns that “an artificial intelligence that does not understand the languages, histories, and values of Europe cannot fully serve its citizens, its businesses, or its future.” The concern goes beyond mere cultural representation—it touches commercial relevance, inclusion, and innovation equity.
Microsoft’s own products, such as its Windows operating system, support more than 90 languages, including co-official regional languages like Basque, Catalan, Galician, Valencian, and Luxembourgish. Still, representation in user interfaces does not necessarily translate into adequate training data for AI models.
According to data from Common Crawl cited by Microsoft, some official EU languages like Danish, Finnish, Swedish, and Greek represent less than 0.6% of all web content. This underrepresentation makes it harder for LLMs to learn these languages effectively, let alone capture the regional and cultural nuances that make them unique.
The Cost of Limited Data
Smith emphasizes that while general-purpose, large-scale language models can technically manage multiple languages, they often “miss linguistic subtleties, cultural context, and regional depth” required for truly inclusive applications. In other words, AI models trained on limited data are:
- Less accurate
- More prone to hallucinations and factual errors
- Weaker in vocabulary handling
- More biased
A compelling example comes from the open-source model Llama 3.1. Microsoft points out a significant performance gap of more than 25 points when comparing the model’s capabilities in English versus Latvian. This disparity showcases the extent to which multilingual performance still lags behind.
Script Complexity: A Technical Hurdle
Another key issue raised by Smith is the structural complexity of certain scripts. “Cyrillic characters, the Greek alphabet, and the cursive script of Arabic each have distinct properties,” he notes. Standard tokenizers—tools that break text into smaller components for AI models—often segment these writing systems suboptimally. This reduces the model’s ability to:
- Accurately learn spelling
- Understand long-range context
- Preserve grammatical structure
This becomes a major hurdle for expanding AI capabilities in languages using non-Latin scripts, particularly those spoken across Eastern Europe and the Middle East.
Microsoft’s Commitment to Multilingual AI
To address this imbalance, Microsoft has announced the deployment of dedicated teams from its innovation hubs in Strasbourg, France. Their mission: support the development of truly multilingual LLMs tailored to the linguistic diversity of Europe.
This initiative is part of Microsoft’s European Digital Commitments, launched earlier this year. It includes efforts to enhance AI development and promote digital sovereignty in Europe. Some of the measures include:
- Expanding Azure’s AI capabilities to support multilingual data processing
- Developing LLMs with regional contexts in mind
- Collaborating with local partners and research institutions
- Ensuring European data stays within Europe through sovereign cloud solutions
These cloud solutions encompass:
- Public Sovereign Cloud
- Private Sovereign Cloud
- Partner Local Clouds
Each of these ensures that sensitive European data is stored and processed according to regional data sovereignty requirements.
The Road Ahead
The European AI landscape is at a crossroads. As LLMs continue to revolutionize sectors from education and healthcare to law and commerce, linguistic inclusion is no longer optional—it is essential. Without robust multilingual support, the continent risks reinforcing digital inequality between dominant and minority languages.
Microsoft’s strategic pivot toward inclusive AI development in Europe marks a significant step in the right direction. But the journey will require collaboration between governments, academia, tech companies, and local communities. Only through such alliances can Europe ensure that its linguistic richness is reflected, respected, and reinforced in the digital age.
In the words of Brad Smith: “A multilingual Europe needs a multilingual AI.”