The Last Human Skill: Why Judgment Cannot Be Automated Away
The history of automation is, in large part, a history of humans being wrong about what only humans can do. In the nineteenth century, it seemed obvious that weaving complex patterns required human creativity. The Jacquard loom proved otherwise. In the 1950s, it seemed obvious that playing chess required human intelligence. Deep Blue and its successors proved otherwise. In the 2000s, it seemed obvious that translating poetry required human sensitivity to language. Neural machine translation hasn’t solved the problem but has reduced it dramatically. Each time a capability was relocated from the “uniquely human” column to the “automatable” column, people adjusted their definition of what made humans special.
We are in another such moment. Large language models write code, draft legal briefs, compose music, and generate medical diagnoses that in controlled studies are sometimes indistinguishable from expert human work. The “uniquely human” column is shrinking fast. But one capacity keeps appearing in these discussions as the last redoubt: judgment.
The word gets used loosely. It’s worth being precise about what we mean, because the precision matters for whether the claim is right or wrong.
Judgment, in the sense that matters here, is not expertise. Expertise is the accumulated knowledge of domain-specific patterns, and expertise can absolutely be automated; indeed, it is being automated at speed. A radiologist’s expertise in reading chest X-rays has been substantially replicated, for specific and well-defined tasks, by convolutional neural networks. A financial analyst’s expertise in reading balance sheets has been encoded into models that outperform human analysts on certain metrics. When people say “doctors exercise judgment,” they often mean expertise, which is a different thing.
Judgment, properly understood, is something more specific: the capacity to navigate situations where the relevant features are unclear, where the rules conflict, where values are in tension, and where the consequences of error are asymmetric and context-dependent. It is decision-making under genuine uncertainty, not probabilistic uncertainty — not “I don’t know the exact probability” but “I’m not even sure which probability distribution to apply, or which outcome even counts as success.”
A judge sentencing a first-time offender exercises judgment when she weighs retribution against rehabilitation, general deterrence against the particular human in front of her, the statutory guidelines against her sense of proportionality. A surgeon exercises judgment when she decides mid-operation to change her approach because something about the patient’s presentation suggests a complication the pre-operative scans didn’t show. A general exercises judgment when he decides to disobey an order because he believes the order reflects an incomplete understanding of the battlefield situation. In each case, the decision cannot be decomposed into a rule or a set of rules that would generalize to other situations. The judgment is embedded in the particularity of the case.
This is not merely a practical limitation of current AI systems. It reflects something structural about what AI systems are and how they work. The frame problem, first identified in AI research in the late 1960s by John McCarthy and Patrick Hayes, captures part of what’s at stake. The problem is this: when you act in the world, how do you know which facts are relevant to your decision? In a formal system, this question has to be answered explicitly — you have to specify the frame, the set of things that matter. But in reality, relevance is context-dependent in ways that are nearly impossible to specify exhaustively in advance.
Human judges, surgeons, and generals navigate the frame problem through something we might call situational awareness: a holistic, embodied sense of which features of a situation matter. This sense is not just a fast heuristic applied to a large pattern database. It involves genuinely noticing something, bringing it into focus, and making a judgment about its significance — and that “making a judgment about significance” is itself a judgment, not a computation.
Chess computers are instructive here, though not in the way that’s usually discussed. The canonical story is that Deep Blue defeated Kasparov in 1997, which proved that machines could match human strategic thinking. But this gets the story wrong in an important way. Deep Blue did not play chess the way humans play chess. Humans evaluate positions through something like aesthetic intuition — sensing that a position feels strong or feels wrong before they can articulate why. Deep Blue searched an enormous game tree at extraordinary speed, evaluating positions through a hand-coded evaluation function. The outcome was the same, but the process was categorically different.
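The "search plus hand-coded evaluation" process described above can be illustrated on a much smaller game. What follows is a minimal sketch, not Deep Blue's implementation: it assumes a toy stone-taking game (remove 1 to 3 stones, whoever takes the last stone wins), and the names `MOVES`, `evaluate`, `negamax`, and `best_move` are hypothetical. The point is only that the program chooses a move by enumerating continuations and scoring the leaves with a rule a human wrote down, with nothing resembling intuition involved.

```python
MOVES = (1, 2, 3)  # toy rule (an assumption): a player removes 1, 2, or 3 stones


def evaluate(stones: int) -> int:
    """Hand-coded heuristic, used only when the search depth runs out.

    Scores the position for the player about to move. In this toy game a
    pile that is a multiple of 4 is lost for the mover, so it scores -1.
    """
    return -1 if stones % 4 == 0 else 1


def negamax(stones: int, depth: int) -> int:
    """Return +1 if the player to move is winning, -1 if losing."""
    if stones == 0:
        return -1  # the opponent just took the last stone and won
    if depth == 0:
        return evaluate(stones)  # stop searching and trust the heuristic
    # Try every legal move; the opponent's best score is our worst.
    return max(-negamax(stones - m, depth - 1) for m in MOVES if m <= stones)


def best_move(stones: int, depth: int = 10) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -negamax(stones - m, depth - 1))


if __name__ == "__main__":
    print(best_move(7))  # takes 3, leaving the opponent a pile of 4
```

Every bit of "chess knowledge" in a program of this shape lives in the evaluation function and the move generator; the rest is exhaustive bookkeeping. That is the sense in which the process differs from a human player sensing that a position feels wrong.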
More revealing is what happens when chess engines face genuinely novel situations. For modern engines that learn their evaluations from data, rather than relying on a hand-coded function as Deep Blue did, this means positions outside the distribution of their training data. Humans adapt by reasoning from principles, by considering what kind of situation this resembles and what implications follow. Learned engines don’t reason from principles in this way; they pattern-match against learned evaluations. When the position has no close analogs in their training data, their performance degrades in ways that often look qualitatively different from human errors. Human errors in novel situations tend to be principled mistakes: errors of reasoning from wrong premises. Machine errors in novel situations tend to be more chaotic, reflecting the absence of any framework for dealing with something genuinely new.
This matters enormously for AI system design, because the situations that require real judgment are disproportionately the novel ones: the cases that don’t fit any established template. High-stakes, low-frequency, high-novelty situations are precisely where judgment is most needed and where automated systems are most likely to fail in dangerous ways. When an AI confidently applies a learned pattern to a situation that only superficially resembles its training data, human judgment is most valuable as a check. It is also often least likely to be exercised, because the AI’s confidence discourages second-guessing.
There is also the values dimension of judgment, which is perhaps even more fundamental than the novelty dimension. Judgment requires knowing what you’re optimizing for, and knowing this in a way that allows you to recognize when explicit objectives have been specified incorrectly. A system told to “maximize patient outcomes” needs to know that this metric doesn’t capture everything that matters — that a patient might prefer a worse expected outcome with a lower variance, that dignity and autonomy matter independent of measurable health indicators, that a physician’s obligations to other patients constrain what she can do for this one.
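The point about a patient preferring a worse expected outcome with lower variance can be made concrete with a small piece of arithmetic. The sketch below is illustrative only: the outcome scores and probabilities are invented, and the mean-variance scoring rule with a `risk_aversion` weight is an assumption chosen for simplicity, not a claim about how clinical preferences should be modeled. All names are hypothetical.

```python
# Each treatment is a list of (probability, outcome score) pairs.
# The numbers are made up for illustration; higher scores are better.
TREATMENT_A = [(0.5, 100.0), (0.5, 20.0)]  # better on average, but a gamble
TREATMENT_B = [(1.0, 55.0)]                # slightly worse, but certain


def mean(lottery):
    """Expected outcome score."""
    return sum(p * x for p, x in lottery)


def variance(lottery):
    """Probability-weighted squared deviation from the mean."""
    m = mean(lottery)
    return sum(p * (x - m) ** 2 for p, x in lottery)


def risk_adjusted(lottery, risk_aversion):
    """Assumed scoring rule: expected outcome minus a penalty on variance."""
    return mean(lottery) - risk_aversion * variance(lottery)


def preferred(risk_aversion):
    """Return which treatment scores higher for a given risk attitude."""
    a = risk_adjusted(TREATMENT_A, risk_aversion)
    b = risk_adjusted(TREATMENT_B, risk_aversion)
    return "A" if a > b else "B"


if __name__ == "__main__":
    print(mean(TREATMENT_A), variance(TREATMENT_A))  # 60.0 1600.0
    print(preferred(0.0))   # A: a pure expected-outcome maximizer picks the gamble
    print(preferred(0.01))  # B: a mildly risk-averse patient picks the sure thing
```

The objective "maximize expected outcome" silently fixes `risk_aversion` at zero. Nothing inside the optimization can reveal that this was a choice, or that this particular patient would have chosen differently.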
AI systems can have values specified to them, but they cannot question their value specifications in the way that humans can. They cannot notice that an objective has been poorly formulated and refuse to pursue it on those grounds. They can be trained to flag anomalies or express uncertainty, but this is different from genuinely recognizing that the game itself is wrongly designed. This capacity is sometimes called “meta-level reasoning,” meaning reasoning about the framework of reasoning itself, and it is perhaps the deepest sense in which judgment resists automation.
This has practical implications for how we should design AI systems. The argument is not that AI can’t help with judgment-laden decisions — it clearly can, by providing better information, identifying relevant precedents, flagging considerations that a human might overlook. The argument is that the final act of judgment, the integration of competing considerations in light of values that the judge actually holds, cannot be fully delegated. This is not because AI systems lack compute or data. It’s because the judgment involves taking responsibility for a decision in a way that requires a genuinely responsible agent — someone who can be held accountable, who has stakes in the outcome, who exists in a web of relationships and obligations that give the decision its moral weight.
Automation has consistently reduced the scope of decisions that require human judgment, and this trend will continue. But reducing scope is not the same as eliminating the need. The cases that remain — the genuinely hard ones, the novel ones, the values-laden ones — are precisely the cases where the cost of error is highest. The last human skill might not be any particular domain of expertise. It might be the capacity to take responsibility for decisions that machines can help analyze but cannot, in any meaningful sense, make.
There is a version of this story that ends in despair: machines get better, the domain of judgment shrinks to a vanishing point, and we eventually conclude that what we called judgment was just pattern matching on a longer timescale. That conclusion should not be dismissed. It is possible. But it is also possible that judgment names something real — not a mystical property of biological neurons, but a genuine functional capacity that emerges from embodiment, from having stakes, from existing in a social world with real consequences. If that’s right, then the question isn’t whether to automate judgment but how to build systems that augment it rather than bypass it. The answer to that question will shape not just our technology but what kind of decisions we think we are responsible for making.
The accountability argument deserves particular weight. Judgment, as a concept, is bound up with the idea of responsibility. When a judge sentences someone, we hold the judge accountable for the sentence — not the legal texts she consulted, not the previous cases she drew on, but the judge herself, who exercised discretion and made a choice. When an algorithm makes a consequential decision, the question of accountability becomes murky in ways that matter practically. Who is responsible when an algorithmic parole recommendation turns out to be wrong — the algorithm’s developers, the court that adopted it, the county that purchased it? The diffusion of responsibility that accompanies algorithmic decision-making is not merely a legal inconvenience. It represents a fundamental change in how we relate to consequential decisions, and that change has costs that are easy to undercount.
What would it mean to design AI systems that genuinely augment rather than bypass judgment? It would mean building systems that inform human decision-makers rather than replacing them, that make the basis for their recommendations transparent and contestable, and that actively preserve the human decision-maker’s ability to override and learn from the interaction. It would mean being more selective about which decisions we allow to be fully automated — reserving the domain of genuine judgment for humans while allowing AI to handle the tasks that are genuinely reducible to optimization. That line is not always clear, and drawing it will require exactly the kind of contextual, values-laden deliberation that constitutes judgment. In other words, deciding how much to automate judgment is itself a judgment call — one that cannot, by its nature, be delegated to the machines whose deployment is the subject of the decision.
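One way to picture the design properties listed above is as a small data structure. The following is a rough sketch under stated assumptions, not a recommended architecture: the class names `Recommendation` and `Decision`, the function `finalize`, and the example values are all hypothetical. It shows three things from the paragraph: the system exposes the reasons behind its suggestion so they can be contested, a named human makes the final call, and overrides are recorded so that both sides can learn from them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """What the system suggests, plus the stated basis for it."""
    option: str
    reasons: tuple  # human-readable reasons, exposed so they can be contested


@dataclass(frozen=True)
class Decision:
    """The final outcome, always attributed to a named person."""
    option: str
    decided_by: str
    overrode_system: bool
    note: str = ""


def finalize(rec, reviewer, chosen_option=None, note=""):
    """Record the reviewer's choice; the system's option is only a default."""
    final = rec.option if chosen_option is None else chosen_option
    return Decision(final, reviewer, final != rec.option, note)


def overrides(decisions):
    """Collect the cases where the human disagreed, for later review."""
    return [d for d in decisions if d.overrode_system]


if __name__ == "__main__":
    rec = Recommendation("approve", ("matches 40 similar past cases",))
    accepted = finalize(rec, "reviewer-1")
    rejected = finalize(rec, "reviewer-2", "escalate", "new facts not in the file")
    print(len(overrides([accepted, rejected])))  # 1
```

Note what the sketch does not do: it does not decide which kinds of decisions should pass through `finalize` at all and which can be fully automated. That boundary is set outside the code, which is the essay's closing point.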





