AI Quality

When the Model Speaks Your Language Badly

A confidently wrong AI response in your native language is more dangerous than no response at all — the quality gap in multilingual AI is a safety issue as much as an equity one.

By Jakub Jirák Apr 26, 2027 6 min read

multilingual-ai, ai-quality, language, safety, development

There is a specific failure mode in multilingual AI that is worse than having no AI at all. When a model has enough training data in a language to generate fluent text but not enough to reason accurately in that language, it produces responses that are stylistically convincing and factually unreliable. The confident tone of a well-trained language model carries over even when the underlying knowledge is thin. Users who are calibrated for the English-language version of a model, where fluency and accuracy are reliably correlated, may not recalibrate for the Tamil or Hausa version, where they are not.

This is not hypothetical. It is a documented failure mode that researchers studying multilingual AI have been describing since at least 2021, when analyses of GPT-3’s multilingual performance found that the model produced more confident-sounding hallucinations in low-resource languages than in high-resource ones. The problem persists in more capable models because the underlying cause is the training data distribution, and better architecture cannot fully compensate for sparse training data.

The risk is highest in high-stakes domains where users are likely to act on AI-generated information without independent verification. Medical information is the clearest case. A user asking a multilingual AI chatbot about symptoms, drug interactions, or treatment options in Swahili or Bengali is likely to trust a confident response — because the response sounds authoritative and because the alternative is no information at all, or information that requires navigating language barriers to access.

Documented cases of medical misinformation from multilingual AI are difficult to aggregate because they are scattered across incident reports from healthcare workers, anecdotes shared in communities, and research papers that use controlled testing rather than field observation. What the controlled testing shows is consistent: models perform worse on medical accuracy in lower-resource languages, often by substantial margins, and the performance drop is not signaled to users through any change in the model’s apparent confidence or fluency.
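One way to make that unsignaled drop visible is to compare, per language, how confident the model sounds with how often it is actually right. The sketch below is a minimal illustration, not any research group's published method. The `GradedAnswer` record, the `overconfidence_by_language` function, and the toy records are all assumptions invented here, and the sketch presumes you already have graded answers with some confidence estimate attached.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    language: str       # language code of the question, e.g. "en" or "sw"
    confidence: float   # model-reported or estimated confidence in [0, 1]
    correct: bool       # whether a grader marked the answer correct


def overconfidence_by_language(answers):
    """Return {language: mean confidence minus accuracy}.

    A positive value means the model sounds more certain than its
    accuracy justifies in that language.
    """
    grouped = defaultdict(list)
    for answer in answers:
        grouped[answer.language].append(answer)

    gaps = {}
    for language, rows in grouped.items():
        accuracy = sum(r.correct for r in rows) / len(rows)
        mean_confidence = sum(r.confidence for r in rows) / len(rows)
        gaps[language] = mean_confidence - accuracy
    return gaps


# Toy records invented purely for illustration; these are not measured results.
toy = [
    GradedAnswer("en", 0.9, True),
    GradedAnswer("en", 0.9, True),
    GradedAnswer("en", 0.9, True),
    GradedAnswer("en", 0.9, False),
    GradedAnswer("sw", 0.9, True),
    GradedAnswer("sw", 0.9, False),
    GradedAnswer("sw", 0.9, False),
    GradedAnswer("sw", 0.9, False),
]
gaps = overconfidence_by_language(toy)
```

In the toy data the stated confidence is identical across languages while accuracy is not, so the gap is larger for the lower-resource language. That is the pattern the controlled testing describes: the user sees the same tone either way.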

The same failure mode has been documented in Indic languages. AI models trained primarily on English internet text learn to generate grammatically correct text in Hindi, Telugu, or Bengali without accurately understanding the cultural and clinical context that makes medical information useful. Researchers at AI4Bharat, the Indian language AI initiative at IIT Madras, have documented the pattern: their benchmarks show that flagship models from major labs score 15 to 25 percentage points lower on medical question answering in Indian languages than in English, despite presenting their answers with similar linguistic confidence.

The legal domain has a similar risk profile with different specific failure modes. Legal AI tools that provide information about tenant rights, labor rights, or family law face the challenge that the law differs by country and jurisdiction, and that AI models trained primarily on English legal text may generate responses that are legally accurate for one jurisdiction while being incorrect for another. A user in Nairobi asking about tenant eviction rights in Swahili may receive information that is accurate for a different East African jurisdiction, or that reflects Kenya’s formal law rather than the customary practices and procedural realities that actually govern tenant-landlord disputes in informal settlements.

The problem is more fundamental than localization. Legal AI accuracy requires not just language capability but jurisdiction-specific knowledge, and jurisdiction-specific legal knowledge for small jurisdictions is sparse in any AI training corpus. Kenya’s legal system is documented in English, and an English-language AI query about Kenyan law will receive a more accurate answer than a Swahili-language AI query about Kenyan law, even though Swahili is Kenya’s national language. The irony — that AI tools are better at the language of the former colonial power for queries about the legal system of the formerly colonized country — captures something real about the training data distribution and its consequences.

Several approaches to this problem are being pursued with varying degrees of success. The most direct is simply building better training data for low-resource languages. This is a labor-intensive process: finding, digitizing, and cleaning text in the target language, building annotation pipelines with speakers of the language, and creating evaluation benchmarks that allow quality to be tracked over time. Masakhane has done this for African languages; AI4Bharat has done it for Indic languages; similar efforts exist for Southeast Asian languages through the SEACrowd and SEALD projects.

The data work is necessary but insufficient on its own. Even with substantially larger training corpora, the inherent challenge of low-resource language AI is that the feedback loops that improve model quality — users interacting with the model, identifying errors, providing corrections — are weaker for low-resource languages because the user bases are smaller and the evaluation infrastructure is less developed.

Retrieval-augmented generation offers a partial solution for specific domains. Instead of relying on the model’s parametric knowledge (what was encoded during training), RAG systems retrieve relevant documents from a database and use the model to synthesize them. For legal information, this means connecting the model to a database of the actual law — the specific statutes, case law, and regulations of the specific jurisdiction — and using the model to translate that into accessible language rather than to recall from training. The quality depends on the quality of the document database, not on the language balance in the training corpus.
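The retrieval step described above can be sketched in a few lines. This is a deliberately naive illustration under stated assumptions, not a production design. The `Passage` record, the jurisdiction codes, the `retrieve` and `build_prompt` helpers, and the placeholder corpus are all hypothetical. Ranking uses plain word overlap, which only works when the query and the documents share a language; a real system would need cross-lingual retrieval or a translation step before this point. The call to an actual model is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    jurisdiction: str   # hypothetical tag, e.g. "KE" for Kenya
    source: str         # citation label for the statute or ruling
    text: str


def retrieve(query, passages, jurisdiction, k=2):
    """Rank passages from one jurisdiction by naive word overlap with the query."""
    query_words = set(query.lower().split())
    candidates = [p for p in passages if p.jurisdiction == jurisdiction]
    scored = [(len(query_words & set(p.text.lower().split())), p) for p in candidates]
    scored = [(score, p) for score, p in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(question, retrieved, answer_language):
    """Ask the model to restate retrieved text rather than recall law from training."""
    if not retrieved:
        return None  # nothing to ground on, so the caller should decline to answer
    context = "\n".join(f"[{p.source}] {p.text}" for p in retrieved)
    return (
        f"Answer in {answer_language} using ONLY the sources below. "
        "Cite the source label for every claim. "
        "If the sources do not answer the question, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )


# Placeholder corpus: the texts are stand-ins, not real legal provisions.
corpus = [
    Passage("KE", "placeholder-KE-1", "placeholder text on notice period before tenant eviction"),
    Passage("TZ", "placeholder-TZ-1", "placeholder text on notice period before tenant eviction"),
    Passage("KE", "placeholder-KE-2", "placeholder text on registration of companies"),
]
hits = retrieve("tenant eviction notice", corpus, jurisdiction="KE")
prompt = build_prompt("tenant eviction notice", hits, answer_language="Swahili")
```

Filtering by jurisdiction before ranking is the design choice that matters here: a passage from a neighboring country can match the query word for word and still never reach the model. Returning `None` when nothing is retrieved keeps the system from falling back on parametric recall, which is the failure the retrieval step was meant to avoid.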

The governance question — who is responsible for the harms caused by low-quality multilingual AI — has not been cleanly resolved. AI companies have terms of service that limit their liability for incorrect information. Regulators in most developing countries lack the capacity to enforce quality standards on AI products deployed by foreign companies. Users bear the practical risk of acting on bad information without adequate recourse.

The European Union’s AI Act, which entered into force in August 2024, requires that high-risk AI systems (which include certain medical and legal applications) demonstrate conformity with quality standards before deployment. The practical effect will differ between Europe, where regulators have greater capacity to evaluate compliance, and markets where the same products are deployed without the same regulatory scrutiny. The same AI product might face quality requirements in France that are not enforced in Côte d’Ivoire, even though French is spoken in both countries and the same model serves both markets.

This regulatory arbitrage is not unique to AI — it describes the general condition of international technology deployment, where products are designed for regulatory environments with enforcement capacity and deployed in environments without it. The harm in this specific case falls on users who have less power to demand quality and less access to recourse when quality fails.

The honest assessment is that multilingual AI quality is improving slowly, unevenly, and without the urgency that its safety implications warrant. The improvement is real: models have genuinely gotten better at Indian languages, at African languages, and at Southeast Asian languages, and the trajectory is positive. But the pace of improvement is slower than the pace of deployment, which means that users in low-resource language communities are adopting AI tools faster than the quality of those tools is improving for their specific use cases.

The safety framing — that confidently wrong AI in your language is worse than no AI — is the most important contribution that researchers in this field have made to the policy conversation. It reframes the question from “how do we provide AI access to more people?” to “how do we ensure that AI access provides benefit rather than harm?” The second question requires higher standards and more investment than the first.

Meeting those higher standards requires treating multilingual AI quality as a safety issue, with the resourcing and oversight that safety issues receive, rather than as an equity issue, with the more modest resourcing that equity issues typically attract. The populations most affected by poor multilingual AI quality are the populations least positioned to demand that treatment. Someone needs to demand it for them.
