The Intelligence Illusion: Why What AI Does Is Not Thinking
Here is the thing about the word “intelligence”: it’s doing enormous, mostly invisible damage to every serious conversation about AI. Not because it’s imprecise — lots of useful words are imprecise. Because it’s imprecise in a specific direction that makes problems harder to see and easier to ignore.
When we call a language model “intelligent,” we smuggle in decades of assumptions about what intelligence involves — intentionality, understanding, the capacity to be wrong in ways that matter. A system that is “intelligent” can be expected to behave sensibly in novel situations. It has goals. It models the world. It can, in principle, explain its reasoning. It fails in ways we can predict, because intelligent beings fail in ways that have internal logic.
Large language models do none of these things. What they do is significantly weirder, and significantly more limited, and understanding that difference matters enormously for everything from product decisions to safety policy to the regulatory frameworks being written right now.
Let’s start with what actually happens when you prompt a language model. A transformer architecture processes a sequence of tokens. At each step, the model outputs a probability distribution over which token might come next, conditioned on everything that came before. The model was trained on text (enormous quantities of text, essentially the written output of human civilization for the past several decades) to minimize the prediction error on that next token. That’s the whole mechanism. There is no world model being consulted. There is no reasoning process running in the background. There is a statistical function, learned over an enormous corpus, that produces plausible-sounding continuations.
Calling this “thinking” is like calling a very elaborate autocomplete “writing.” Technically adjacent. Substantively misleading.
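To make the mechanism concrete, here is a deliberately tiny sketch of next-token prediction. It is not how a transformer is implemented: it is a bigram counter that conditions on only the single previous token, whereas a real model conditions on the whole context through a learned neural function. The corpus and the function names are made up for illustration. What it shares with the real thing is the shape of the output: a probability distribution over what comes next, derived entirely from what followed what in the training text.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model trains on vastly more text.
CORPUS = (
    "polly wants a cracker . "
    "polly wants a cracker . "
    "polly wants a nap ."
).split()


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Return P(next | prev) as relative frequencies (empty if prev is unseen)."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}


counts = train_bigram(CORPUS)
dist = next_token_distribution(counts, "a")
print(dist)
```

In this toy corpus, “cracker” follows “a” two times out of three and “nap” once, so those are the probabilities the function returns. Nothing in the table says what a cracker is. It records only what tends to come next.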
The confusion isn’t accidental. It traces back to Alan Turing, who in 1950 famously asked “Can machines think?” and then immediately sidestepped the question by proposing the Imitation Game — what we now call the Turing Test. Pass that test, and we’ll call it thinking. Turing was a brilliant man, and this was a brilliant dodge. He knew the underlying question was philosophically intractable, so he operationalized it as a behavioral test.
The problem is that the behavioral test was never actually a proof of anything. It was a conversational endpoint. Turing himself acknowledged, in the same paper, that he was proposing an operational definition that avoided the philosophical question rather than answering it. The entire field of AI has been haunted by that substitution ever since, replacing “what does intelligence actually require?” with “does this thing look intelligent enough to fool someone?” The two questions are not equivalent, and treating them as equivalent has produced more than seventy years of confused thinking.
John Searle’s Chinese Room argument from 1980 is relevant here, though not in the way it’s usually deployed. The argument: imagine a person locked in a room with a comprehensive rulebook for responding to Chinese symbols in Chinese. People outside pass in Chinese questions. The person follows the rules mechanically, passes back Chinese answers. From outside, it looks like Chinese comprehension. From inside, there’s no comprehension — just symbol manipulation according to rules.
Critics of the Chinese Room spend enormous energy on the “systems reply” (maybe the room as a whole understands even if the person inside doesn’t) and other rebuttals. Fine. Those philosophical objections deserve to be taken seriously. But they miss the more important operational point: Searle was identifying that equivalent behavior does not imply equivalent understanding. A system that produces correct outputs by one mechanism is not doing the same thing as a system that produces correct outputs by another mechanism. The difference matters, particularly when the first mechanism fails in ways the second doesn’t.
LLMs fail at things that a genuine understanding-based system wouldn’t fail at. They “hallucinate,” producing fluent, confident nonsense, because next-token prediction doesn’t consult truth; it consults the probability distribution over what usually follows the preceding tokens. They lose track of simple logical constraints across long contexts. They’re easily confused by surface rephrasing of problems they appeared to “solve” when the wording matched their training data. A model that correctly solves a math problem stated one way can fail on the same problem stated differently, not because the math is harder, but because the token statistics differ. These aren’t bugs to be patched. They’re direct consequences of what the mechanism actually is.
The “stochastic parrot” framing, coined by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell in their 2021 paper, is closer to correct than its critics want to admit. The critics complained it undersells what LLMs can do. That’s a fair complaint about the framing’s comprehensiveness. But it captures something essential: these systems are doing sophisticated pattern matching over vast linguistic corpora, and the outputs are not grounded in the kind of causal model of the world that would make “understanding” a coherent description. The parrot doesn’t know what “cracker” means. It knows what utterances follow the utterance “Polly wants a.” The scale of the pattern matching in a large language model is orders of magnitude larger. The fundamental nature of the operation is similar.
Here’s where the imprecision becomes genuinely dangerous. When you believe a system is intelligent, you start assigning it properties it doesn’t have. You assume it will generalize sensibly to situations outside its training distribution. You assume its explanations reflect its actual processing. You assume that if it says something with confidence, confidence is a reasonable signal for correctness.
None of these assumptions hold for LLMs. But they routinely inform product deployments, and the consequences are not academic.
An AI system gets deployed in a medical context because it produces impressive outputs on the cases someone thought to evaluate. Nobody systematically mapped its failure modes, because “intelligent systems” are expected to have human-like failure modes: gradual, detectable, roughly proportional to task difficulty, with errors that make interpretable sense in retrospect. LLM failures don’t work that way. They’re sharp and discontinuous. A model that handles 999 similar questions correctly can hallucinate badly on the 1000th for reasons that have nothing to do with the difficulty of the question. The surface features of the failure look like an intelligent system having a bad day. The underlying cause is something else entirely.
The MD Anderson Cancer Center’s partnership with IBM Watson illustrates this at scale. Watson was described as a cognitive computing system. Hospital administrators and clinicians believed they were getting something that understood oncology, something that had, in some meaningful sense, internalized the relevant medical knowledge and could reason about patient cases the way a knowledgeable physician would. What they were getting was a sophisticated information retrieval system that had been trained on clinical notes and academic papers. When it confidently recommended treatments that conflicted with established clinical practice (which it did, documented in the 2017 audit), the problem wasn’t that Watson was dumb. It was that treating Watson as a medical reasoner, rather than as a pattern-matching system over medical text, was a category error that made the failure both predictable and invisible until it had cost $62 million.
What would “understanding” actually require? This is a question the AI field has mostly avoided since the connectionist revolution of the 1980s, because the answer is philosophically contentious and doesn’t generate tidy research programs. But some candidate requirements are worth stating, because avoiding the question doesn’t make it less important.
Understanding probably requires something like a causal model of the world — not just correlational patterns, but some representation of why things happen, what would be different if conditions changed, what interventions produce what effects. This is what Judea Pearl has been arguing for decades, and the argument is well-founded. A system that knows that A and B always co-occur does not thereby know whether A causes B, whether B causes A, whether they share a common cause, or whether the co-occurrence is entirely spurious. Current LLMs have sophisticated correlational knowledge and something that looks, sometimes, like causal reasoning. Whether it is causal reasoning in any robust sense remains genuinely unclear, and the evidence from systematic failures suggests it mostly isn’t.
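The gap between co-occurrence and causation is easy to show in a few lines. The sketch below is a toy of my own construction, not an example taken from Pearl’s work, and every name in it is hypothetical. It builds two worlds: in one, A causes B; in the other, a hidden common cause C drives both. Observed passively, the two worlds look identical. Only forcing A to a value (an intervention, written do(A) in Pearl’s notation) tells them apart.

```python
import random


def world_a_causes_b(rng, do_a=None):
    """A is a coin flip (or forced by intervention); B simply copies A."""
    a = (rng.random() < 0.5) if do_a is None else do_a
    b = a
    return a, b


def world_common_cause(rng, do_a=None):
    """A hidden coin flip C drives both A and B; forcing A leaves B untouched."""
    c = rng.random() < 0.5
    a = c if do_a is None else do_a
    b = c
    return a, b


def p_b_given_a(world, n=10_000, seed=0):
    """Observational estimate of P(B | A): just watch and count."""
    rng = random.Random(seed)
    samples = [world(rng) for _ in range(n)]
    b_when_a = [b for a, b in samples if a]
    return sum(b_when_a) / len(b_when_a)


def p_b_under_do_a(world, n=10_000, seed=0):
    """Interventional estimate of P(B | do(A)): force A on, then count B."""
    rng = random.Random(seed)
    samples = [world(rng, do_a=True) for _ in range(n)]
    return sum(b for _, b in samples) / n


for world in (world_a_causes_b, world_common_cause):
    print(world.__name__, p_b_given_a(world), p_b_under_do_a(world))
```

Both worlds give an observational P(B | A) of 1.0: whenever A is seen, B is seen too. Under intervention they split. Forcing A makes B certain in the first world and leaves it at roughly a coin flip in the second. A system that has only ever seen the observational data has nothing to distinguish the two cases.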
Understanding probably requires intentionality — the capacity to represent things as being about something beyond the immediate token stream. When I think about a chair, my thought has a subject matter: actual chairs, or the concept of chairs, or the chair in my office. When an LLM generates text about chairs, it’s producing tokens conditioned on the probability of token sequences in the training data. Whether there’s anything it’s “about” in the relevant sense is the hard problem of intentionality, and claiming that LLMs have solved it because their outputs are coherent is circular.
Understanding probably requires the kind of error correction that comes from a system checking its outputs against a model of reality, not just against a learned distribution over text. This is why LLMs hallucinate in ways that knowledgeable humans don’t: a human who “knows” oncology and gets a clinical case wrong can typically recognize the error when shown counter-evidence, because their knowledge is connected to something beyond linguistic patterns. An LLM that generates an incorrect drug interaction doesn’t have the internal structure to detect that it’s wrong. It can only generate the next probable token.
Current LLMs have sophisticated approximations of some of these properties, which is exactly what makes them both useful and dangerous. Useful because the approximations work extremely well on tasks that resemble training-distribution inputs. Dangerous because you can’t reliably tell when you’re outside the distribution until after the failure.
The policy consequences of this conflation are significant and underappreciated.
Debates about AI regulation cluster in two places: near-term harms (bias, hallucination, privacy) and long-term existential risk (superintelligence, misalignment). Both framings are downstream of assumptions about what AI systems are. The existential risk framing presupposes systems with goals, values, and the capacity for strategic deception, properties that current LLMs don’t have and that would require something other than next-token prediction to produce in any robust sense. The near-term harm framing is more grounded but still often anthropomorphizes in ways that obscure the actual mechanism.
The regulatory questions that would actually address the specific failure modes of current systems — liability for hallucinated outputs in professional contexts, audit requirements before high-stakes deployment, minimum evaluation standards that reflect how the system actually fails rather than how intelligent systems would fail — get less attention because they’re less cinematic. They require understanding the mechanism. That’s technically harder than having strong feelings about robot takeover scenarios.
There’s a version of this argument that gets made carelessly, and I want to be precise about where I’m not going. I’m not saying LLMs are simple. They’re not — the emergent capabilities that appear at scale are genuinely surprising and not well understood. I’m not saying they’re useless. They’re transformatively useful for specific task classes. And I’m not saying the question of machine consciousness is settled, because it isn’t.
What I’m saying is simpler and more specific. The word “intelligence” is load-bearing in our conversations about these systems, and it’s carrying assumptions that the systems cannot support. Every time a researcher says a model “understands” a task because it scores well on a benchmark, they’re making a claim that doesn’t follow from the evidence. Every time a company deploys a model in a high-stakes context based on impressive demo performance, they’re importing assumptions about failure modes that the mechanism doesn’t warrant.
Turing was right that the philosophical question is hard. He was wrong — or at least, the field was wrong to take him as settling the question — to treat behavioral performance as sufficient evidence for the underlying capacity.
The next time someone tells you AI is becoming intelligent, ask them what they mean by that, precisely. If they can’t answer without falling back on “well, look what it can do,” you’ve found the problem. What it can do is impressive. What it is doing, mechanically, is something quite different from thinking. That difference is not academic. It’s the gap where most AI disasters will happen.