When the DMV Gets a Large Language Model
The deployment of chatbots powered by large language models in government services began in earnest in 2023 and accelerated substantially through 2024 and 2025. By the beginning of 2027, some version of an LLM-assisted interface is available at most federal agencies in the United States, at the majority of UK local councils, across most of the Australian government’s citizen-facing services, and at scale throughout the EU member states with developed digital government infrastructure. Singapore’s virtual civil servant, Ask Jamie, has been running since 2014 and was upgraded to GPT-4-class infrastructure in late 2023. India’s MyGov chatbot handles more than 600,000 queries monthly.
The premise is sound: most queries to government contact centers are repetitive, answerable, and do not require human judgment. “What documents do I need to renew my driving license?” “When is the recycling collection in my area?” “How do I apply for Universal Credit?” These questions have deterministic answers. They should not require a human being to spend five minutes on the phone confirming information that is available on a website. Automating this frees human staff for the queries that actually require human engagement.
The problem is not the premise. The problem is that LLMs fail in government contexts in specific, predictable ways that differ from the failure modes most critics anticipated — and that have already caused measurable harm to the people least able to absorb it.
The hallucination problem is not what you think
The failure that received the most attention in early LLM deployment discussions was hallucination — the tendency of language models to generate confident-sounding but false information. In consumer contexts this produces annoyance and distrust. In government contexts, the concern was that an LLM chatbot would tell someone they were entitled to a benefit they weren’t eligible for, or that a form was the right one when it wasn’t, or that a deadline was different from what it actually was.
This has happened. In 2024, journalists found that MyCity, New York City’s AI-powered chatbot, had been telling small business owners and landlords to violate city housing laws, advice directly contrary to the statutes the system was supposed to explain. The city had deployed the system without adequate domain-specific grounding, relying on the LLM’s general knowledge of housing concepts rather than retrieval-augmented generation anchored in the actual municipal code.
But this kind of factual error, while real, has turned out to be easier to detect and fix than expected. It shows up quickly in testing, or in early user complaints, and can be addressed by proper grounding against authoritative documents. The systems that have been deployed carefully — with RAG against verified regulatory text, with human review of outputs during deployment, and with regular audits — produce factual errors at rates that are substantially lower than the feared worst case.
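The grounding idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Passage` type, the `retrieve` and `grounded_answer` functions, the keyword-overlap scoring (a crude stand-in for a real retrieval model), and the two corpus entries, which are invented placeholders rather than real regulatory text. The point is only the control flow: answer from a verified passage and cite it, or decline and refer the person on.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    citation: str  # hypothetical identifier for a section of verified text
    text: str


# Invented placeholder entries, not real regulations.
CORPUS = [
    Passage("Example Code s.1", "Recycling bins are collected every second Tuesday."),
    Passage("Example Code s.2", "Licence renewal requires proof of identity and proof of address."),
]

STOPWORDS = {"a", "an", "the", "is", "are", "when", "what", "how", "do",
             "i", "in", "my", "to", "of", "and", "for", "can"}


def tokens(s):
    """Lowercased content words of a string, as a set."""
    return {w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS}


def retrieve(query, corpus, min_overlap=2):
    """Return the passage sharing the most content words with the query,
    or None if no passage shares at least min_overlap words."""
    q = tokens(query)
    best, best_score = None, 0
    for passage in corpus:
        score = len(q & tokens(passage.text))
        if score > best_score:
            best, best_score = passage, score
    return best if best_score >= min_overlap else None


def grounded_answer(query, corpus=CORPUS):
    """Answer only from a retrieved passage; otherwise refer to a human."""
    passage = retrieve(query, corpus)
    if passage is None:
        return {"action": "refer_to_human", "answer": None, "citation": None}
    return {"action": "answer", "answer": passage.text, "citation": passage.citation}
```

In a real deployment the retrieved passage would be handed to the model as context rather than returned verbatim, but the refusal branch is the part that matters: with no supporting text, the system does not fall back on general knowledge.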
The failures that have turned out to be more persistent and more harmful are subtler.
The complexity cliff
Government services have a bimodal query distribution. The simple queries (what are the office hours, where do I send this form) are genuinely automatable and LLMs handle them adequately. The complex queries — someone in a complicated benefits situation, someone whose case falls into an exception to the standard rule, someone whose circumstances span multiple agencies and multiple systems — are often the queries where human judgment matters most and where the stakes of failure are highest.
LLMs in government contexts tend to handle complex queries by matching them to the most common pattern they resemble and confidently answering that pattern rather than the specific question. Someone who asks about their Universal Credit claim when they are a self-employed person with variable income and caring responsibilities will often receive an answer that is accurate for a standard employed claimant but wrong for their situation. The answer sounds authoritative. The person follows it. The advice turns out to be incorrect for their circumstances. The harm is not a dramatic hallucination but a plausible-sounding error that the recipient has little way of identifying as wrong.
The Population Health Management chatbot deployed across several NHS integrated care boards in England in 2025 produced this pattern at scale. Patients with comorbidities asking about medication interactions received answers that were accurate for either condition in isolation but did not account for the combination. The system had been tested on single-condition queries and performed well; it had not been adequately tested on the complex multi-condition cases that represent a substantial fraction of NHS patient burden.
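The testing gap described above, good coverage of single conditions and none of their combinations, is easy to make visible before launch. The following Python sketch is illustrative only: `untested_pairs` is a hypothetical helper, and the condition labels are abstract placeholders rather than anything taken from the NHS deployment.

```python
from itertools import combinations


def untested_pairs(conditions, tested_cases):
    """List every pair of conditions that no two-condition test case covers.

    conditions:   iterable of condition labels
    tested_cases: iterable of tuples, each naming the conditions in one test case
    """
    tested = {frozenset(case) for case in tested_cases if len(case) == 2}
    return [
        pair
        for pair in combinations(sorted(conditions), 2)
        if frozenset(pair) not in tested
    ]


# Placeholder labels; a test suite with three single-condition cases and one pair.
conditions = ["condition_a", "condition_b", "condition_c"]
tested_cases = [
    ("condition_a",),
    ("condition_b",),
    ("condition_c",),
    ("condition_a", "condition_b"),
]
gaps = untested_pairs(conditions, tested_cases)
```

Here `gaps` holds the two pairs involving `condition_c`. The number of pairs grows quadratically with the number of conditions, which is one reason combination testing tends to be skipped and one reason it needs to be planned for explicitly.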
The trust architecture problem
There is a more fundamental issue that technical improvements cannot fully address: the relationship between a citizen and a government service is not the same as the relationship between a user and a consumer product. When a consumer chatbot gives bad advice, the person can ignore it, complain to customer service, or choose a different product. The option to exit exists.
Citizens interacting with government services often cannot exit. There is one HMRC, one Social Security Administration, one state driver licensing authority. When these services tell you something, the power differential between the advice-giver and the recipient is far greater than in a consumer relationship. Government advice carries an implicit authority that commercial chatbot advice does not. People assume that what a government system tells them is correct in a way they do not when dealing with a commercial service.
This means that the error tolerance for government AI systems should be substantially lower than for commercial AI systems — but most deployed government systems are evaluated using the same or similar benchmarks used for commercial applications. The calibration is wrong.
Finland’s approach has been more careful than most. When the Finnish Immigration Service deployed an AI assistant for visa and residence permit queries in 2024, it explicitly framed the system to users as a “starting point” rather than an authoritative answer, directed users to human staff for any query that involved their specific case, and designed the system to escalate to human review for any query mentioning exceptional circumstances. The system handles simple queries and routes complex ones rather than attempting to answer them. The volume handled by human staff has not decreased substantially. The AI is a pre-filter, not a replacement.
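The pre-filter pattern can be sketched in a few lines of Python. This is not the Finnish Immigration Service's implementation, which the source does not describe in technical detail. The `Route` enum, the `route` function and both marker lists are hypothetical, and a production system would need something more robust than substring matching.

```python
from enum import Enum


class Route(Enum):
    ANSWER = "answer_with_starting_point_notice"
    HUMAN = "escalate_to_human"


# Hypothetical markers suggesting the query concerns the person's own case.
CASE_SPECIFIC = ("my application", "my case", "my permit", "my claim")

# Hypothetical markers suggesting exceptional circumstances.
EXCEPTIONAL = ("appeal", "exception", "refused", "expired", "urgent")


def route(query):
    """Send case-specific or exceptional queries to a human; answer the rest."""
    q = query.lower()
    if any(marker in q for marker in CASE_SPECIFIC + EXCEPTIONAL):
        return Route.HUMAN
    return Route.ANSWER
```

The design choice worth noticing is the direction of the default: the assistant answers only when nothing suggests the query is personal or unusual, and a false escalation costs staff time rather than leaving someone with wrong advice.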
This is less impressive as an efficiency story. It produces better outcomes. The efficiency-at-all-costs framing that drives most government AI procurement is doing a lot of damage that shows up in places that efficiency metrics don’t capture: in wrong decisions followed, in benefits missed, in people who don’t know enough to know they were wrongly advised.
The language and literacy gap
One failure mode that hasn’t received adequate attention is the interaction between LLM outputs and the literacy levels and first-language distribution of government service users. Government services serve the whole population, not the technology-comfortable demographic that dominates AI benchmark testing. A system that performs well with fluent English-language queries and technology-literate users will not necessarily perform well with users who are asking in second languages, who are elderly, who have cognitive disabilities, or who express their queries in vernacular or dialectal language.
The UK Home Office piloted an AI-assisted asylum seeker support chatbot in 2025. Internal evaluation data obtained by the charity Detention Action under Freedom of Information showed that the system had an effective comprehension failure rate of approximately 35% for users whose first language was not English. These were not failures to find an answer but failures to correctly interpret what was being asked. The system would respond to the nearest English-language interpretation of an ambiguous query, and the user, not knowing what the system had understood the query to be, would receive an answer to a question they hadn’t asked.
The populations most likely to interact with government services in non-standard language are also among those for whom the consequences of misadvice are highest: asylum seekers, recent immigrants, elderly residents of non-English-speaking communities. The AI systems being deployed in their service are not adequately evaluated on the queries those users actually ask.
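One way to surface this kind of gap is to report failure rates per user group instead of a single aggregate. The Python sketch below assumes each evaluation record is a `(group, understood)` pair. The function names, the group labels and the sample data are all invented for illustration and are not drawn from any of the evaluations mentioned above.

```python
from collections import defaultdict


def failure_rate_by_group(records):
    """records: iterable of (group, understood) pairs, where understood is a bool.
    Returns a dict mapping each group to its comprehension failure rate."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for group, understood in records:
        totals[group] += 1
        if not understood:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


def groups_over_threshold(rates, max_rate):
    """Names of groups whose failure rate exceeds max_rate, sorted."""
    return sorted(group for group, rate in rates.items() if rate > max_rate)


# Invented sample: 10 queries per group.
records = (
    [("first_language_english", True)] * 9
    + [("first_language_english", False)] * 1
    + [("first_language_other", True)] * 6
    + [("first_language_other", False)] * 4
)
rates = failure_rate_by_group(records)
overall = sum(1 for _, ok in records if not ok) / len(records)
flagged = groups_over_threshold(rates, max_rate=0.2)
```

In this made-up sample the overall failure rate is 25%, which hides a split of 10% for one group and 40% for the other. Only the per-group view flags `first_language_other`.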
None of this argues against using AI in government services. It argues for deploying it with genuine attention to who the users are, what the failure modes are for those users specifically, and what the stakes of failure are in the specific service context. These questions should precede procurement. They mostly don’t.