AI Governance

Which Safety Frameworks Actually Prevented Harm

Separating frameworks that changed outcomes from frameworks that changed press releases.

By Jakub Jirák | Apr 2, 2029 | 7 min read

ai-safety, governance, policy, accountability, risk-management

There is a question I’ve been trying to answer honestly for the past eighteen months, and I’m going to try to answer it here: which AI safety frameworks actually prevented harm, and which ones existed primarily to create the impression that harm was being prevented?

The honest version of this question is uncomfortable to ask because most AI safety work — including work done by people I respect and institutions I think are valuable — contains elements of both. A framework can be partially genuine and partially performative. The performative elements often evolve into genuine ones over time, as institutions develop muscle memory and accountability structures force real engagement with findings. Conversely, a framework that started as genuine safety work can calcify into ritual as the pressure to ship accelerates and the institutional knowledge of why the framework existed in the first place diffuses out of the organization.

So the answer isn’t a simple taxonomy of good frameworks and bad frameworks. But some things can be said with reasonable confidence.

The Honest Accounting of What Worked

Structured uncertainty quantification worked. Not perfectly, and not everywhere, but organizations that built explicit uncertainty communication into their systems — surfacing when a model’s outputs fell outside its reliable operating range — had measurably better incident records on the categories of harm that come from misplaced user trust. The specific mechanism matters: a model that says “I’m not confident in this” is useful. A model that applies a generic disclaimer to 40% of its outputs trains users to ignore the disclaimer. The organizations that got this right built calibrated confidence signaling that meant something, which required tracking the correlation between stated uncertainty and actual error rates across deployment domains. Most didn’t do this. The ones that did had better outcomes.
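To make “tracking the correlation between stated uncertainty and actual error rates across deployment domains” concrete, here is a minimal Python sketch of a per-domain calibration check. It is an illustration under stated assumptions, not any organization’s actual tooling: the record format, the function name `calibration_gap_by_domain`, and the choice of a five-bin expected calibration error are all made up for this example.

```python
from collections import defaultdict


def calibration_gap_by_domain(records, n_bins=5):
    """Compute a per-domain expected calibration error (ECE).

    `records` is an iterable of (domain, stated_confidence, was_correct)
    tuples, where stated_confidence is in [0, 1]. This record shape is a
    hypothetical format chosen for the sketch.
    """
    # For each domain, keep one [count, confidence_sum, correct_count] per bin.
    bins = defaultdict(lambda: [[0, 0.0, 0] for _ in range(n_bins)])
    for domain, confidence, was_correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)
        bucket = bins[domain][index]
        bucket[0] += 1
        bucket[1] += confidence
        bucket[2] += int(was_correct)

    result = {}
    for domain, domain_bins in bins.items():
        total = sum(bucket[0] for bucket in domain_bins)
        ece = 0.0
        for count, confidence_sum, correct_count in domain_bins:
            if count == 0:
                continue
            mean_confidence = confidence_sum / count
            accuracy = correct_count / count
            # Weight each bin's confidence/accuracy gap by its share of traffic.
            ece += (count / total) * abs(mean_confidence - accuracy)
        result[domain] = ece
    return result
```

A gap near zero in one domain and a large gap in another is the situation the paragraph above describes: the confidence signal means something in the first domain and trains users to ignore it in the second.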

Staged deployment with genuine kill switches worked. Again, the word “genuine” is doing serious work there. Every major AI deployment post-2026 claims to have staged rollout protocols. A genuine staged rollout protocol means that the criteria for advancing from one stage to the next are specified in advance, measured against actual production data, and interpreted by people with the authority and willingness to halt deployment when criteria aren’t met. Several major deployments that subsequently caused significant harm had staged rollout protocols; the protocols just didn’t have teeth. The kill switches existed in the documentation but not in the organizational incentive structure. The difference is auditable in principle: deployment timelines and stage-advance decisions leave records.
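The technical half of a stage gate (criteria fixed in advance, checked against production data) can be sketched in a few lines of Python. Everything named here is hypothetical: the `Criterion` class, the `gate_decision` function, and the metric names are assumptions for illustration, and no code can supply the organizational half, which is whether anyone honors a "halt".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One advance criterion, written down before the stage begins."""
    metric: str
    threshold: float
    higher_is_better: bool = True


def gate_decision(criteria, production_metrics):
    """Return ("advance" or "halt", list of failed metric names).

    A metric that is missing from production data counts as a failure,
    so a gap in measurement cannot silently pass the gate.
    """
    failed = []
    for criterion in criteria:
        value = production_metrics.get(criterion.metric)
        if value is None:
            failed.append(criterion.metric)
            continue
        if criterion.higher_is_better:
            passed = value >= criterion.threshold
        else:
            passed = value <= criterion.threshold
        if not passed:
            failed.append(criterion.metric)
    return ("halt" if failed else "advance", failed)
```

For example, with criteria of `Criterion("task_success_rate", 0.95)` and `Criterion("harmful_output_rate", 0.001, higher_is_better=False)`, production metrics of 0.97 and 0.004 yield `("halt", ["harmful_output_rate"])`. Returning the failed metric names alongside the decision is a deliberate choice in this sketch, since it produces the kind of stage-advance record that makes the process auditable.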

Pre-registration of expected behaviors worked in the domains where it was applied seriously. This is a framework borrowed from clinical trial methodology: before you deploy, you specify in writing what you expect the system to do and not do, and you commit to measuring against those specifications. When Anthropic began requiring internal pre-registration for significant capability expansions in late 2026, the effect was to make optimistic assumptions about system behavior costly — if you pre-registered that a capability would be reliable to 99.5% and it came in at 94%, that gap had to be explained and accounted for. Organizations that adopted genuine pre-registration developed better predictive models of their own systems as a side effect, because the practice forced them to be explicit about what they actually knew.
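A minimal sketch of what pre-registration might look like in code follows, using the 99.5% versus 94% example from the paragraph above. It does not describe any real organization’s process: the function names `register` and `unexplained_gaps`, the capability name, the tolerance value, and the use of a SHA-256 digest to make the written commitment tamper-evident are all assumptions introduced for illustration.

```python
import hashlib
import json


def register(expectations):
    """Freeze expected reliabilities before deployment.

    Returns the serialized record plus a digest that can be stored
    somewhere the deploying team cannot quietly edit.
    """
    payload = json.dumps(expectations, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return payload, digest


def unexplained_gaps(payload, digest, observed, tolerance=0.005):
    """Compare observed results against the frozen expectations.

    Returns {capability: (registered, observed)} for every capability
    that fell short by more than `tolerance` or was never measured.
    """
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != digest:
        raise ValueError("pre-registration record was modified after the fact")
    expected = json.loads(payload)
    gaps = {}
    for capability, target in expected.items():
        actual = observed.get(capability)
        if actual is None or target - actual > tolerance:
            gaps[capability] = (target, actual)
    return gaps
```

Registering `{"summarization_reliability": 0.995}` and later observing 0.94 returns that capability with the pair `(0.995, 0.94)`, which is the gap that then has to be explained and accounted for.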

Continuous behavioral monitoring with distributional comparison (matching production behavior against evaluation behavior in real time) worked, though it required substantial infrastructure investment that most smaller players couldn’t afford. The organizations that built this infrastructure caught emergent failures earlier than those that didn’t. The mean time from behavior change to detection dropped from weeks to days in organizations with mature monitoring, which made the difference between catching a problem before it affected millions of users and catching it after.
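As a rough illustration of distributional comparison, the sketch below measures how far the mix of behaviors seen in production has drifted from the mix seen during evaluation. It assumes each response has already been tagged with a coarse behavior label (the labels, the function names, and the 0.1 alert threshold are hypothetical), and it uses total variation distance as one simple choice of metric among many; real monitoring systems would need far more than this.

```python
from collections import Counter


def total_variation_distance(eval_labels, prod_labels):
    """Total variation distance between two categorical label samples.

    Returns a value in [0, 1]: 0 means identical label frequencies,
    1 means the two samples share no labels at all.
    """
    if not eval_labels or not prod_labels:
        raise ValueError("both samples must be non-empty")
    eval_counts = Counter(eval_labels)
    prod_counts = Counter(prod_labels)
    n_eval = len(eval_labels)
    n_prod = len(prod_labels)
    categories = set(eval_counts) | set(prod_counts)
    # Counter returns 0 for labels that never appeared in a sample.
    return 0.5 * sum(
        abs(eval_counts[c] / n_eval - prod_counts[c] / n_prod)
        for c in categories
    )


def drift_alert(eval_labels, prod_labels, threshold=0.1):
    """Flag production behavior that has moved away from evaluation behavior."""
    return total_variation_distance(eval_labels, prod_labels) > threshold
```

If evaluation traffic was 80% `"answer"` and 20% `"refusal"`, and production traffic is 60% `"answer"`, 20% `"refusal"`, and 20% a new `"hedge"` behavior, the distance is about 0.2 and the alert fires. Run over a sliding window of recent production traffic, a check like this is one way the weeks-to-days detection improvement could be approached.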

What Didn’t Work

Voluntary self-reporting commitments didn’t work. I want to be precise about what I mean. Voluntary self-reporting frameworks — commitments made by AI developers to share safety-relevant information with each other and with regulators — generated essentially zero safety-relevant information sharing of consequence. What they generated was a documented practice of sharing, which is different. The information that was shared was information that organizations were comfortable sharing: evaluation methodology, general capability descriptions, broad safety philosophy. The information that mattered — specific failure modes, incident reports, internal red-team findings — was not shared, because sharing it created competitive and legal risk.

This was entirely predictable from first principles. Voluntary self-reporting in industries with competitive incentives has never produced meaningful information sharing on the questions that matter. The histories of financial services, pharmaceutical development, and aviation before the creation of mandatory reporting regimes all demonstrate the same pattern. The AI industry somehow convinced itself that it was different. It wasn’t.

Third-party audits with access limitations produced audit theater. The problem was structural: auditors were given access to evaluation datasets and API-level access to deployed systems. They were rarely given access to training data, fine-tuning datasets, RLHF reward models, or the internal incident records that would have allowed them to verify whether the system’s behavior in controlled testing matched its behavior in production. An auditor who can only see what the audited organization chooses to show has an inherent limitation that no amount of auditor expertise can overcome. Several organizations received clean safety audits from reputable firms within months of incidents that the audit process should have caught.

Red-teaming without escalation mechanisms was worse than useless, because it created false confidence. Red-team findings that go into reports that nobody reads are not safety work. They are documentation that safety work occurred. Several organizations maintained extensive red-team archives that would have identified deployment risks if anyone had acted on them. The findings weren’t wrong. The organizational structure that processed them was broken. Post-incident analyses at these organizations consistently found that relevant red-team findings predated the incident — sometimes by years.

Responsible scaling policies without automatic triggers were systematically violated. A responsible scaling policy that says “we will not deploy systems with capability X without additional safety measures” is only as good as the mechanism for detecting capability X and the enforcement mechanism for pausing deployment. Several published RSPs were effectively unenforceable because capability thresholds were defined in ways that made it easy to argue that a given system didn’t meet them, even when the spirit of the threshold was clearly crossed.

The Surprising Middle Cases

Two frameworks had effects that surprised me, in different directions.

Constitutional AI, and the broader approach of training models on explicit principle sets, had effects that were more robust than critics expected but less robust than proponents claimed. Critics argued that you can’t teach values through training, that models would learn to produce outputs that looked aligned without the underlying structure that generates genuinely aligned behavior. This was partially right: constitutional training did produce some degree of surface alignment that didn’t generalize well to novel situations. But it also produced something unexpected: models that were easier to steer in deployment and that responded more reliably to operator instructions about intended behavior. Whether that’s alignment or sophisticated instruction-following is a philosophical question. The practical effect was that constitutionally trained models had better incident records in the specific domains their constitutions addressed.

The surprise in the other direction: behavioral testing on synthetic adversarial datasets turned out to be remarkably poor at predicting production behavior. The assumption was that if you could generate enough clever adversarial inputs and test against them, you’d catch most of the ways a system could fail. This assumption was wrong, not because adversarial testing is bad but because the adversarial inputs that humans generated didn’t match the adversarial inputs that users and adversarial actors generated in production. Production adversarial inputs tended to be cruder, more specific to narrow use cases, and more effective than the synthetic ones — because real adversaries had the patience to iterate on what worked. The adversarial testing that actually improved production robustness was testing that used production logs of actual adversarial attempts, not synthetic datasets.

The Structural Argument

The framework that would have prevented the most harm is the one that was never built in the United States and only partially built in Europe: mandatory incident reporting with standardized taxonomy, independent investigation authority, and findings that fed back into deployment requirements. Every other safety framework is working around the absence of this.

Aviation safety in its current form is built on mandatory accident and incident reporting, independent investigation by the NTSB (and its international equivalents), and a regulatory feedback loop where investigation findings directly drive airworthiness requirements. It took aviation roughly forty years and several thousand deaths to build this system. AI governance is trying to build the equivalent in a decade and, so far, is failing to get it in place before the harms accumulate.

The EU AI Act created elements of this — mandatory incident reporting for high-risk systems, requirements for conformity assessment, some investigation authority. But the enforcement infrastructure lagged the regulatory framework by two years, and the definitional fights over what constituted a “high-risk” system consumed enormous political energy that could have gone into actually building incident investigation capacity.

What the post-2025 incident record demonstrates, more than any other single lesson, is that safety frameworks designed by safety-conscious engineers inside AI companies, without external enforcement teeth, will systematically miss the organizational and competitive dynamics that drive the most significant failures. This is not a criticism of the people who built those frameworks. It’s a description of their structural limitations.

The next generation of safety governance — which is actively being designed right now, with input from the incident record — needs to start from the assumption that commercial incentives will systematically push against safety investment, and design accordingly. That means mandatory rather than voluntary disclosure, independent investigation rather than self-reporting, and enforcement mechanisms that create costs for safety failures that exceed the costs of the safety investment required to prevent them.

Some organizations will find this threatening. They should. That’s how you know it might work.