Why 'Alignment' Is the Wrong Word for What We're Actually Worried About

Language Shapes Thinking

The AI safety field named its central problem in a way that systematically obscures what the problem actually is

The word “alignment” was chosen because it sounds like an engineering problem. You have a target — human values — and you have a system — an AI — and your job is to align one to the other. The framing implies that the target is known, fixed, and well-specified. It implies that once you develop the right technical methods, alignment is achievable. It implies that the hard part is the engineering.

Every one of these implications is wrong.

“Human values” is not a coherent object. Whose values? Americans’? Which Americans: the 34% who supported overturning Roe, or the 61% who opposed it? The billion-plus people who believe blasphemy should carry legal penalties, or the people who believe blasphemy laws are themselves a human rights violation? The question of whose values an AI should be aligned to is not a technical question with a technical answer. It is a political question, one of the hardest a society can face, and calling it “alignment” systematically obscures this.

Think about it from the perspective of the word’s internal logic. “Alignment” presupposes that there is an axis along which alignment can be measured — that there is a north, and you need the compass pointing there. But for any non-trivial value question, there is no single north. There are a billion different norths, each belonging to a person or a culture with its own history, in genuine conflict with other norths. Calling the problem “alignment” treats the existence of a shared target as a given. It’s not a given. It’s the entire problem.

The AI safety field coalesced around this framing in the mid-2010s, largely through the influence of Nick Bostrom’s “Superintelligence” (2014) and the cluster of researchers at MIRI (Machine Intelligence Research Institute), CHAI (Center for Human-Compatible AI), and later OpenAI’s safety team. The core problem they were concerned with was real and worth worrying about: powerful AI systems that pursue objectives misaligned with what their operators actually want could cause serious harm, up to and including catastrophic harm. This concern was legitimate. The vocabulary they chose to describe it was not.

The problem with the vocabulary shows up immediately when you ask: aligned to what? The field’s answer, broadly, is “human values” or “human preferences” or “what humans would want on reflection.” These phrases do work in a philosophy seminar — you can spend a semester dissecting what “on reflection” means in the context of preference theory. They do no work in the real world. They presuppose that there exists some set of values that represents humanity’s interests in a coherent, aggregable way. There is no such thing. What exists is billions of people with partially overlapping and frequently contradictory values, embedded in cultures with different histories, with different stakes in different AI applications, with different amounts of political power to make their preferences matter.
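The aggregation point can be made concrete with a toy case. The sketch below is only an illustration, not a model of real value conflicts: the three ballots and the option labels A, B, and C are hypothetical, and pairwise majority vote is just one of many possible aggregation rules. It shows the classic Condorcet cycle, in which every voter has a perfectly coherent ranking and the majority still prefers A to B, B to C, and C to A.

```python
from itertools import combinations

# Hypothetical ballots: each voter ranks three options, most preferred first.
ballots = [
    ("A", "B", "C"),
    ("B", "C", "A"),
    ("C", "A", "B"),
]

def majority_prefers(x, y, ballots):
    """Return True if a strict majority of ballots rank x above y."""
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

def pairwise_winners(ballots):
    """Map each pair of options to the majority's choice (None on a tie)."""
    options = sorted(ballots[0])
    result = {}
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, ballots):
            result[(x, y)] = x
        elif majority_prefers(y, x, ballots):
            result[(x, y)] = y
        else:
            result[(x, y)] = None
    return result

def condorcet_winner(ballots):
    """Return the option that beats every other head to head, or None."""
    options = sorted(ballots[0])
    for x in options:
        if all(majority_prefers(x, y, ballots) for y in options if y != x):
            return x
    return None

print(pairwise_winners(ballots))   # A beats B, C beats A, B beats C
print(condorcet_winner(ballots))   # None: no option beats all the others
```

With three voters and three options there is already no group ranking for a system to be pointed at. Picking a different aggregation rule produces an answer, but choosing the rule is itself the contested decision.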

The framing’s most damaging effect is on who the field attracts and who it repels.

“Alignment” reads, to a technically minded person, like a problem that calls for technically minded people. It reads like something you solve by thinking carefully about reward functions, preference learning, inverse reinforcement learning, constitutional AI, RLHF. These are real techniques with real value. But they’re techniques for implementing a solution to a political problem that hasn’t been solved — and can’t be solved by engineers. The engineers keep building better implementation machinery, but no one is building the political machinery to define what actually needs to be implemented.

The people who are actually needed for this problem — political philosophers, democratic theorists, international lawyers, governance specialists, anthropologists studying how different cultures relate to authority and technology — are not drawn to the word “alignment.” The word sounds like it’s already been solved conceptually, and what remains is execution. Political scientists know that the conceptual part is the hard part. They don’t apply.

Look at the institutional landscape. The major AI safety organizations — Anthropic’s alignment team, DeepMind’s safety team, OpenAI’s superalignment effort (or what remains of it), ARC Evals, Apollo Research — are overwhelmingly staffed by people with backgrounds in machine learning, mathematics, and philosophy of mind. The governance teams tend to be small, underfunded, and treated as adjacent rather than central to the mission. This staffing pattern is not an accident. It’s the direct consequence of naming the problem in a way that makes it sound like a CS problem.

Stuart Russell, who has thought about this more carefully than most, reframes the problem as building AI systems that are “uncertain about human preferences” — systems that defer to humans because they don’t presume to know what humans want. This is a better framing, but it still sidesteps the political core. Whose humans? Which preferences? Deferred to through what mechanism? A system that defers to its principal hierarchy is only as good as the principal hierarchy, and “the AI company’s leadership” is not an answer most of the world finds acceptable.

The EU AI Act — whatever its flaws — actually does something useful by naming the political structure explicitly. It says: AI systems used in certain contexts must meet certain requirements, those requirements will be determined by specific political processes, enforcement will be done by specific regulatory bodies. This isn’t “alignment.” It’s governance. It’s messier, slower, and less philosophically elegant than alignment theory. It’s also the actual answer.

The governance frame makes explicit that the question “what should AI do in this situation” is answered by a political process, not a mathematical one. Different countries will have different answers. Different contexts will have different answers. The answers will change over time. This is not a bug — it’s the normal operation of democratic legitimacy applied to a new technology.

The alignment field sometimes responds to this by retreating to a subset of values: “alignment to broadly beneficial values” or “alignment that avoids catastrophic outcomes.” The retreat is sensible: there probably are some things most people agree are bad, like AI systems that kill people without authorization, and those are a reasonable starting point. But the retreat quietly abandons the ambitious framing and doesn’t acknowledge that it’s doing so. If “alignment” means “avoid the very worst outcomes by consensus,” that’s governance. Call it governance. The word “alignment” implies more than that, attracts people who think they’re solving more than that, and creates confusion about what actually needs to be done.

Here’s a specific failure case that illustrates the problem. Large language models deployed commercially will, in various situations, be asked questions where the “correct” answer is politically contested: questions about abortion, about the existence of God, about whether a particular political party’s economic platform is sound, about historical atrocities. What should the model say?

The alignment literature’s answer is something like: “the model should express calibrated uncertainty and present multiple perspectives without taking a position.” This sounds sensible but is itself a political choice — a liberal-proceduralist political choice that privileges epistemic pluralism over substantive engagement. Many people and cultures do not share this value. A devout Catholic user asking about abortion doesn’t want “calibrated uncertainty” — they want an answer consistent with their understanding of moral reality. The alignment answer assumes a political framework for how these questions should be handled that is not universally shared.

The governance answer is: these questions are politically contested, and the answers to how AI handles them need to be decided through political processes with democratic legitimacy. Different jurisdictions may make different choices. This requires politics, not mathematics. The word “alignment” has been doing active work to prevent this obvious conclusion from being reached.

The name also shapes funding. “AI alignment research” sounds fundable to technically oriented philanthropists — and the field is substantially funded by technically oriented billionaire philanthropists: Dustin Moskovitz, Jaan Tallinn, the Open Philanthropy ecosystem. “AI governance research” sounds like political science, which technically oriented philanthropists are less inclined to fund. The name choice has directed hundreds of millions of dollars toward technical approaches to a problem that is primarily political.

This doesn’t mean the technical work is worthless — RLHF is genuinely useful, interpretability research is genuinely important, red-teaming and evaluation matter. But these are tools for implementing decisions that need to be made through political processes. Building better hammers while refusing to engage with the architectural decisions is how you get a very efficient construction of the wrong building.

The field needs a better name. “AI governance” is closer to right, though it undersells the philosophical depth of the questions involved. “Value specification” is accurate but impenetrable. What the right name definitely isn’t is “alignment,” a word that implies the target exists, is knowable, and just needs to be hit with the right mathematical tool.

Forty years from now, the fact that the field named its central problem incorrectly will look like a significant intellectual misstep, one that was visible at the time to anyone who thought carefully about what “human values” means and took the political philosophy seriously. The alignment community includes many people who think carefully. The word persists anyway, which tells you something about how much of the field is actually thinking about the problem versus working on technically tractable subproblems that feel safe and publishable.

There’s a comparison worth making to economics in the 1970s. Economists had developed sophisticated mathematical models of market behavior that were technically impressive and internally coherent. The models assumed rational agents, efficient markets, and preferences that could be treated as fixed and measurable. Economists knew these assumptions were wrong (behavioral anomalies had been documented for decades), but the mathematical framework was so effective at generating publishable papers and attracting students that the wrong assumptions persisted for far longer than they should have. It took the financial crisis of 2008 to force a genuine reckoning.

The alignment field doesn’t need a crisis to force its reckoning. The problem is right there in the name. “Human values” does not name a thing that exists in a form that mathematical alignment techniques can target. Deciding that it does, and building a field on that decision, is an error. Better to name the error now and build a field that can actually solve the problem — which requires political scientists and democratic theorists in the room from the start, not added as an afterthought once the engineers have built the hammer and are looking for a nail.