The Next Billion AI Users Won't Speak English. That's a Bigger Problem Than You Think.

The Language Gap

AI systems are trained primarily on English text and perform substantially worse in other languages, and some of the languages where the gap is widest are spoken by hundreds of millions of people.

The internet transformed information access. It also replicated, nearly perfectly, the existing global hierarchy of whose information was worth transmitting. In 2023, roughly 56% of all websites were in English — a language spoken natively by approximately 5% of the world’s population. French, German, and Russian together accounted for another 18%. Mandarin Chinese — spoken natively by more people than any other language on earth — accounted for about 1.5%.

AI is doing the same thing. And we’re letting it happen again, at speed, with much higher stakes.

The Measurement Problem

The performance gap between English and other languages in large language models is real, measurable, and larger than most people working in AI publicly acknowledge.

The MMLU (Massive Multitask Language Understanding) benchmark, one of the standard measures of model capability, shows GPT-4 scoring around 86% accuracy in English and dropping to approximately 62-71% in languages like Arabic, Bengali, and Swahili, depending on the specific task and measurement methodology. A gap of 15 to 24 points sounds technical and abstract until you think about what it means in practice: a model that answers benchmark questions correctly roughly 86% of the time becomes one that does so perhaps 65% of the time, which means it is wrong more than twice as often. In a medical context, in an educational context, in a legal context, that gap is the difference between useful and dangerous.
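The arithmetic behind those figures can be made explicit. The sketch below is illustrative only: it uses the approximate scores quoted above, and the helper names are hypothetical. It restates an accuracy gap two ways, as a difference in percentage points and as a ratio of error rates, which is often the more intuitive number.

```python
def error_rate(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into an error percentage."""
    return 100.0 - accuracy_pct


def gap_summary(english_acc: float, other_acc: float) -> dict:
    """Compare two accuracy figures (in percent) as a point gap and an error ratio."""
    return {
        "point_gap": english_acc - other_acc,
        "error_ratio": error_rate(other_acc) / error_rate(english_acc),
    }


# Approximate figures quoted in the text: ~86% in English, ~62-71% elsewhere.
ENGLISH = 86.0
for other in (71.0, 62.0):
    s = gap_summary(ENGLISH, other)
    print(f"{other:.0f}%: gap of {s['point_gap']:.0f} points, "
          f"{s['error_ratio']:.1f}x as many wrong answers")
```

Read this way, the lower end of the quoted range corresponds to nearly three wrong answers for every one the model gives in English.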

Google’s Gemini models score 91.1% on MMLU in English. For Hindi, a language with over 600 million speakers once second-language speakers are counted, the comparable scores across available benchmarks are substantially lower, typically by 15-25 percentage points depending on task type. This is not because Google hasn’t tried. It is because the fundamental input to these models, their training data, is systematically skewed.

The web has vastly more text in English than in any other language. Common Crawl, one of the primary internet text datasets used for training, contains roughly 45% English content even after various filtering attempts. Swahili, spoken by roughly 200 million people across East Africa, represents a fraction of a percent of Common Crawl. A model can only learn the relationships that exist in its training data. If it sees 100x more English examples of a given linguistic phenomenon, it will not handle English exactly 100x better, because returns on additional data diminish. But it will handle English far more reliably, and because the training loss is dominated by English tokens, English is what optimization favors.

Who Actually Gets Hurt

This is where the analysis has to stop being polite.

The languages with the worst AI performance are not the languages of populations that already have good access to information infrastructure. They’re the languages of populations at critical development junctures, where the gap between having access to accurate, responsive information systems and not having it can determine health outcomes, educational trajectories, and economic opportunity.

Swahili (East Africa, ~200 million speakers). Bengali (Bangladesh, India, ~230 million speakers). Hausa (West Africa, ~70-80 million speakers). Tamil (India, Sri Lanka, ~80 million speakers). Amharic (Ethiopia, ~22 million speakers, but the official language of Africa’s second-most-populous country). These are not obscure linguistic edge cases. These are the primary languages of hundreds of millions of people, precisely the population that proponents of AI as a democratizing technology claim it will help.

Consider the specific applications: an AI system that helps a Bangladeshi farmer understand soil nutrient recommendations, or a Swahili-speaking nurse navigate drug interaction information, or an Amharic-speaking student access college application guidance. These are not hypotheticals — they’re the actual use cases that organizations like Translators Without Borders, Digital Green, and various CGIAR affiliates are trying to build. They’re running into the language gap at every step.

What Would Actually Fix This

Fixing the language gap is technically understood. The problem is not that nobody knows how to do it.

You need more high-quality training data in the target languages. You need linguists and native speakers to curate and annotate that data, because raw internet text in low-resource languages is often noisy, code-switched, or domain-specific in ways that don’t generalize. You need evaluation benchmarks in those languages, built by native speakers, that test what actually matters, rather than translations of English-language benchmarks, which carry over English-specific assumptions about how tasks should be structured. And you need deliberate choices during training, such as how data from each language is sampled and weighted, that prioritize multilingual performance rather than optimizing purely for English accuracy.
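One common way to weight multilingual performance during training is to resample languages so that small ones appear more often than their raw share of the corpus would allow. The sketch below shows exponent-smoothed sampling, where each language's sampling probability is proportional to its corpus share raised to a power `alpha`. This is one technique used in multilingual pretraining, not necessarily the approach any particular lab takes. The token counts and the function name are hypothetical, chosen only to illustrate the skew.

```python
def smoothed_sampling_probs(token_counts: dict[str, int], alpha: float) -> dict[str, float]:
    """Return per-language sampling probabilities proportional to share**alpha.

    alpha = 1.0 reproduces the raw corpus proportions; smaller values
    flatten the distribution and upweight low-resource languages.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts (not real data), chosen to mimic a skewed corpus.
corpus = {"en": 4_500_000, "hi": 100_000, "sw": 5_000}

raw = smoothed_sampling_probs(corpus, alpha=1.0)
flat = smoothed_sampling_probs(corpus, alpha=0.3)
for lang in corpus:
    print(f"{lang}: raw {raw[lang]:.4f} -> smoothed {flat[lang]:.4f}")
```

The trade-off is that the smallest corpora get repeated many times, so smoothing shifts attention toward low-resource languages without adding any new information about them. It complements the data collection described above but cannot replace it.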

None of this is cheap. The annotation work alone — which requires actual linguistic expertise, not just bilingual speakers — costs money that is significantly harder to justify to investors than the next benchmark improvement in English. The calculus is simple and brutal: English-speaking users, especially professional users in developed markets, have higher willingness to pay. Optimizing for them produces better return on investment in the near term. Optimizing for Swahili speakers in Kenya produces better outcomes for Kenyans and generates essentially zero near-term revenue for an American AI company.

Who’s Working on It (And With What)

The organizations seriously working on low-resource language AI represent a small slice of total AI research investment.

Masakhane, a grassroots, community-driven research effort whose contributors are distributed across Africa, has been building NLP datasets and benchmarks for African languages since 2019. Their work is legitimately good: they’ve produced datasets for over 50 African languages, contributed to open-source model development, and published research that has influenced how the field thinks about low-resource multilingual training. They operate largely on grant funding and academic partnerships.

AI4Bharat at IIT Madras has been doing similar work for Indian languages — building voice datasets, machine translation models, and text corpora for languages like Tamil, Telugu, Kannada, and Odia. Their IndicBERT and IndicBART models are legitimate contributions to the state of the art for their target languages. They are funded primarily by government grants and are working at a tiny fraction of the resources available to any of the major AI labs.

Meta has been the most significant exception among large AI companies — their No Language Left Behind (NLLB) project, released in 2022, produced translation models covering 200 languages, including many that had no machine translation capability at all. This is real work and worth acknowledging. It is also primarily a translation system, not a general-purpose language model, and it was trained on a different objective than the generation-focused models that are actually being deployed as user-facing AI products.

The pattern is consistent: the serious work on multilingual equity is happening at underfunded academic institutions and nonprofits, while the commercially significant investment continues to flow toward improving English-language performance.

The Systematic Deprioritization

The most honest way to state this: it’s not that AI companies are malicious about the language gap. It’s that the incentive structure they operate under does not reward fixing it.

Revenue comes from enterprise customers and individual subscribers in high-income markets. High-income markets speak English, German, Japanese, French, and Mandarin (though even Mandarin is significantly under-served relative to its speaker population). The performance gap in Hausa doesn’t show up in NPS scores from American enterprise customers. It doesn’t come up in analyst calls. It doesn’t affect quarterly earnings.

The exception would be regulation. If governments in countries with large populations of low-resource-language speakers required AI products to meet performance standards in local languages as a condition of market access, the calculus would change. Nigeria, Ethiopia, Bangladesh, and Tanzania all have significant tech sectors and regulatory ambitions. Whether they will move toward language performance requirements for AI products is unclear; the regulatory capacity and the political will to enforce such requirements are both uncertain.

In the absence of regulatory pressure or a significant shift in where AI companies seek revenue, the gap will persist. The models will get better at Swahili, incrementally, because the leading models are also getting better at everything else incrementally, and low-resource language performance improves as a side effect of general scaling. But the gap won’t close. English will keep improving at least as fast, so the distance between the two will stay roughly what it is today. That means the population of people who can use AI effectively will map very closely to the population of people who already have better access to information, education, and economic opportunity than the rest of the world.

Code-Switching and the Hidden Degradation

There’s a specific failure mode that gets almost no attention in discussions of multilingual AI performance: code-switching.

In communities that are bilingual or multilingual — which describes a substantial fraction of the low-resource language speaker population — natural language use mixes languages within a single utterance. A Swahili speaker in Nairobi might write a message that shifts between Swahili, English, and Sheng (a local urban dialect that itself blends multiple languages) within a single paragraph. This is how people actually communicate in these communities. AI systems trained on clean, monolingual text handle it very poorly. The model encounters text that doesn’t match its learned patterns for either language clearly, and performance degrades substantially below even its baseline performance in either language individually.
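How mixed a given message is can be put on a rough numerical scale. The sketch below follows the code-mixing index used in the code-switching literature: the share of tokens that fall outside the utterance's dominant language, ignoring language-neutral tokens such as names and numbers. It assumes each token has already been tagged with a language by an annotator or a separate tagger, which is itself a hard problem for languages like Sheng. The example tags are invented for illustration.

```python
from collections import Counter


def code_mixing_index(lang_tags: list[str], neutral: str = "other") -> float:
    """Code-mixing index for one utterance, from per-token language tags.

    CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of
    tokens and u the number tagged as language-neutral (names, numbers,
    punctuation). Returns 0.0 for monolingual or fully neutral input.
    """
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    tagged = sum(counts.values())  # this is n - u
    if tagged == 0:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / tagged)


# Hypothetical, hand-assigned tags for a mixed Swahili/English/Sheng message.
tags = ["sw", "sw", "en", "sheng", "sw", "en", "other", "sw"]
print(f"CMI: {code_mixing_index(tags):.1f}")
```

A monolingual message scores 0, and the score rises as the dominant language's share falls. Reporting model accuracy separately by score bucket would be one way to make this failure mode visible in evaluation.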

The practical implication: in exactly the communities where AI could provide the most value — multilingual urban centers in developing economies, where educational and economic opportunity is most unequal — the AI works least well, because the actual communication patterns of real users in those communities differ most from the training data.

This is not an exotic edge case. It’s the modal use case for hundreds of millions of people.

The Actual Irony

The standard narrative about AI is that it will democratize access to capability. The lawyer who costs $500 an hour becomes an AI assistant anyone can afford. The tutor who costs $100 an hour becomes a free chatbot. The information that used to require institutional access now lives in a public interface.

This narrative is substantially true for English speakers. For the next billion potential AI users — in Sub-Saharan Africa, in South Asia, in parts of Southeast Asia — it’s backwards. The AI assistant that works well in English works notably worse in Swahili or Bengali or Hausa. The gap between “people who benefit from AI” and “people who don’t” risks tracking almost perfectly onto existing inequalities in information access, economic opportunity, and institutional capability.

That’s not democratization. That’s acceleration of an existing gap with a shiny new interface on top.

The technology exists to do better. The datasets can be built, the models can be fine-tuned, the benchmarks can be created. The question is whether the incentive structure of the AI industry will ever produce the will to actually build them at the scale that’s needed — or whether this will be one more area where the technology’s benefits accumulate primarily to people who already have advantages.

The answer, on the current trajectory, is not encouraging.