The Language Gap That AI Hasn't Closed

AI language tools in Yoruba, Bengali, Swahili, and Tamil exist — but the quality gap between these and English-language tools is wide enough to constitute a form of technological exclusion.

There are roughly 7,000 languages spoken on earth. GPT-5 handles about 50 of them reasonably well. The gap between “reasonably well” and “as well as English” is large for most of those 50, and for the other 6,950 or so, the situation ranges from barely functional to completely absent.

This is not a temporary problem on the verge of solution. It is a structural feature of how large language models are built, and understanding why it persists — despite genuine effort from major labs — is necessary for evaluating what AI access actually means for the two-thirds of the world’s population who primarily speak a language that AI handles poorly.

Large language models learn from text. Their capabilities in any given language are roughly proportional to the quantity and quality of text in that language in the training corpus, subject to some architectural factors around cross-lingual transfer. The distribution of text on the internet is wildly unequal: English represents something like 56 percent of all indexed web content despite being the first language of roughly 5 percent of the global population. Chinese, Spanish, and Arabic are distant but meaningful presences. Yoruba, which has approximately 45 million speakers and is one of the three most widely spoken languages in Nigeria, represents a fraction of a percent of web content.
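The mismatch between a language's share of web text and its share of speakers can be expressed as a simple ratio. The sketch below is illustrative only: the function name `representation_ratio` and the two example languages are hypothetical, and the shares are made-up round numbers rather than measurements of any real language.

```python
def representation_ratio(web_share: float, speaker_share: float) -> float:
    """Return web-content share divided by speaker share.

    A value above 1 means the language has more web text than its speaker
    population alone would predict; a value below 1 means it has less.
    Both arguments are fractions between 0 and 1.
    """
    if not 0 <= web_share <= 1:
        raise ValueError("web_share must be between 0 and 1")
    if not 0 < speaker_share <= 1:
        raise ValueError("speaker_share must be greater than 0 and at most 1")
    return web_share / speaker_share


# Hypothetical languages with made-up shares, for illustration only.
EXAMPLES = {
    "language_a": (0.50, 0.10),    # half of web text, a tenth of speakers
    "language_b": (0.001, 0.005),  # 0.1% of web text, 0.5% of speakers
}

if __name__ == "__main__":
    for name, (web, speakers) in EXAMPLES.items():
        ratio = representation_ratio(web, speakers)
        print(f"{name}: representation ratio {ratio:.2f}")
```

A ratio well above 1 describes a language like English in the web corpus; a ratio well below 1 describes the position of most low-resource languages. The ratio says nothing about text quality, which matters as much as quantity for training.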

The consequence is that an AI model trained on the internet’s text learns English thoroughly and Yoruba barely. The architectural innovations that allow some cross-lingual transfer — the fact that a model trained heavily on English can perform some tasks in Yoruba better than the Yoruba training data alone would suggest — partially compensate for this disparity, but they do not eliminate it. The cross-lingual transfer is strongest for languages that are structurally similar to well-represented languages and weakest for languages with different morphology, syntax, and writing systems.

Swahili benefits somewhat from its relatively large web presence and from the substantial vocabulary it has borrowed from Arabic. Bengali, with over 230 million speakers, has a larger model presence than its web footprint might suggest because of deliberate investment by researchers at institutions like AI4Bharat. Tamil, with 78 million speakers across India and Sri Lanka, has active academic and industry efforts to improve model quality. These are languages where the quality gap to English is real but diminishing through deliberate effort.

The gap is most severe, and the effort most limited, for languages whose speakers have the least bargaining power in the global AI ecosystem. Hausa, with 150 million speakers across West and Central Africa. Amharic, the working language of the Ethiopian government and a mother tongue for roughly 30 million people. Oromo, with 40 million speakers in Ethiopia and Kenya. Tigrinya. Igbo. Kinyarwanda. Each of these languages has speakers who need information, access to services, and the ability to communicate with institutions. AI could potentially address those needs but currently addresses them poorly.

The reason for the gap is not malice or indifference on the part of AI labs, though the incentive structures are certainly not aligned with addressing it. Training data for low-resource languages is scarce for three reasons. Little text is produced in these languages. Much of the text that does exist has not been digitized. And the annotation pipelines that produce high-quality training data (human evaluators rating AI outputs, providing corrections, generating preference data) require speakers of the language who can be paid for their time, which in turn requires organizational infrastructure that does not exist for most of these languages.

Masakhane, a community-led research initiative focused on African language AI, has done more than any single company to address this problem, building datasets, training models, and publishing benchmarks for dozens of African languages. The work is real and important. The scale is constrained by the volunteer and grant-funded structure that community efforts depend on. A comparable effort funded at the level of a major AI lab’s research budget would look very different.

The quality gap has consequences that are measurable but often go unmeasured. Medical AI tools deployed where the primary language of communication is not one of the well-supported major languages perform worse than they do in English contexts. A diagnostic decision-support tool that works well for an English-speaking patient’s description of symptoms and is mediocre for a Wolof-speaking patient’s description is not equally useful to both patients. It is actively providing worse service to the person who already has less access to healthcare.

The same dynamic applies to educational AI, agricultural advice tools, legal information services, and any other AI application that depends on language understanding. If the AI works well in the user’s language, it is a tool. If it works poorly, it can be worse than nothing: an AI that confidently gives wrong answers in a user’s language may be more harmful than no AI at all, because users who trust the interface may not calibrate their skepticism appropriately.

This is not a hypothetical risk. Reports from African AI deployment projects have documented cases where the outputs of AI tools, once translated back into English for evaluation, showed consistent errors that reflected the degraded quality of the underlying language model rather than errors in the underlying task. The evaluation layer (English) was not surfacing the quality problems that the deployment layer (Hausa or Amharic or Tigrinya) was producing.

The market incentive structure does not naturally produce solutions to this problem. AI companies invest in language coverage in proportion to the commercial opportunity, meaning the willingness and ability of users to pay for the service in that language. English speakers in wealthy countries represent the largest commercial opportunity per user. The market therefore invests in English. Chinese speakers represent a large and growing commercial opportunity, and investment follows. Hindi, Arabic, and Spanish have commercial markets significant enough to attract investment.

Below that tier, the commercial case is weak, and investment is correspondingly thin. This is not a failure of market logic. It is market logic operating correctly within a very narrow definition of value.

The institutions that could address this with non-market logic — governments of low-income countries, development banks, international NGOs, bilateral aid programs — have generally not prioritized AI language infrastructure as a development investment at the scale the problem requires. There are exceptions: the government of Rwanda has made AI a stated priority and has invested in Kinyarwanda language AI development. India’s government has funded multilingual AI efforts at a scale that reflects the political salience of Hindi and regional languages. But these are partial responses to a systemic problem.

The technical picture is not entirely bleak. Several developments are changing the cost curve for low-resource language AI in ways that may shift the economics over the next few years. Multilingual foundation models — models designed from the start to handle many languages simultaneously — achieve better cross-lingual transfer than models that attempt to add language support after the fact. The research community’s investment in multilingual model architectures has produced models that can serve low-resource languages at quality levels that were not achievable five years ago, even without massive increases in low-resource language training data.

Retrieval-augmented generation — a technique that allows models to look up information from external databases rather than relying solely on training data — reduces the language-specific quality problems that arise from sparse training data, because the retrieval can work across languages while the generation is handled by the language model separately. For specific, information-dense applications (medical symptom lookup, legal information, government service navigation), this architecture can deliver acceptable quality in lower-resource languages that would be inadequate for open-ended generation.
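The separation between retrieval and generation can be sketched in a few lines. This is a toy under stated assumptions, not a production design: the `CONCEPTS` glossary stands in for the multilingual embedding model a real system would use, the two documents are invented examples, and `build_prompt` only assembles the text that would be handed to a language model rather than calling one. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Toy glossary mapping words in two languages to shared concept IDs.
# Assumption: a real system would use multilingual embeddings instead.
CONCEPTS = {
    "fever": "FEVER", "homa": "FEVER",      # English / Swahili
    "cough": "COUGH", "kikohozi": "COUGH",  # English / Swahili
    "rash": "RASH",
}


@dataclass
class Document:
    doc_id: str
    text: str  # stored in the well-resourced language (English here)


# Invented knowledge-base entries, for illustration only.
DOCS = [
    Document("flu", "Fever and cough together often indicate influenza."),
    Document("skin", "A rash without fever is frequently an allergic reaction."),
]


def concepts(text: str) -> Set[str]:
    """Map a text in either language to the set of concept IDs it mentions."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {CONCEPTS[w] for w in words if w in CONCEPTS}


def retrieve(query: str, docs: List[Document]) -> Optional[Document]:
    """Return the document sharing the most concepts with the query, if any."""
    query_concepts = concepts(query)
    best = max(docs, key=lambda d: len(query_concepts & concepts(d.text)))
    return best if query_concepts & concepts(best.text) else None


def build_prompt(query: str, doc: Document, target_language: str) -> str:
    """Assemble the grounded prompt that a language model would receive."""
    return (
        f"Reference: {doc.text}\n"
        f"Question: {query}\n"
        f"Answer in {target_language} using only the reference."
    )


if __name__ == "__main__":
    question = "Nina homa na kikohozi"  # Swahili: "I have a fever and a cough"
    hit = retrieve(question, DOCS)
    if hit is not None:
        print(build_prompt(question, hit, "Swahili"))
```

The point of the design is that the factual content comes from the retrieved reference, so the model's weaker command of the user's language affects phrasing more than substance. The final generation step still depends on the model's fluency in the target language, which is why this approach suits narrow lookup tasks better than open-ended generation.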

Voice interfaces are separately important: in many low-resource language communities, people are more comfortable speaking the language than reading or writing it, and voice AI tools reduce the writing-system barrier that text-based tools impose. Speech recognition for low-resource languages has benefited significantly from OpenAI’s Whisper model (released in 2022) and its successors, which demonstrated that multilingual speech recognition could work at reasonable quality levels for dozens of languages with modest training data.

None of this resolves the fundamental inequality. A Yoruba speaker using AI in 2027 has a significantly worse experience than an English speaker using the same underlying technology. The gap is smaller than it was in 2022. It has not closed.

Whether the market, aided by targeted public investment and community research initiatives, can close it over the next decade is the development question that deserves more attention than it currently receives.