The Predictive Policing Reckoning

A decade of deploying AI in law enforcement has produced a body of evidence that most police departments are choosing to ignore.

In 2011, the Santa Cruz Police Department in California became one of the first municipal forces in the United States to pilot PredPol, predictive policing software that used historical crime data to generate daily hotspot maps and direct patrol resources toward areas the algorithm identified as high-risk for property crime. In 2020, Santa Cruz banned it. There was no federal mandate and no court order; the city council looked at the evidence and concluded that the software wasn’t working and was making things measurably worse for specific communities.

Santa Cruz is the exception. As of 2026, some variant of predictive or AI-assisted policing is deployed in law enforcement agencies serving roughly 40% of the US population. The evidentiary situation has not significantly improved since Santa Cruz’s ban. The political situation has, in most jurisdictions, not changed at all. Understanding why requires understanding what predictive policing actually measures and why the validation problem is essentially unsolvable with the data that exists.

What the algorithm is actually predicting

The name “predictive policing” implies that the algorithm predicts where crime will occur. This is not precisely what it does. It predicts where crime that gets reported and recorded will occur — which is a meaningfully different thing.

Most crime is not reported. Of reported crime, a significant fraction is not recorded in ways that generate the data these systems use. What goes into the training data is the historical distribution of police presence and police attention, which is not the same as the historical distribution of actual criminal activity. Areas with heavy historical policing generate more data, more arrests, more records — not because more crime occurred there, but because more enforcement happened there. The algorithm learns this distribution and recommends more enforcement in the same places. Crime rates appear to confirm the prediction because increased enforcement in an area generates more arrests in that area. The validation looks circular because it is circular.
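The circularity can be made concrete with a toy model. The sketch below is a minimal illustration, not a description of any vendor's system: the two districts, the patrol count, and the `detection_per_unit` parameter are all hypothetical. Both districts have identical true crime; only their starting records differ, standing in for uneven historical enforcement.

```python
def simulate_feedback(initial_records, true_daily_crime, days,
                      patrol_units=10.0, detection_per_unit=0.05):
    """Expected-value simulation of a recorded-crime feedback loop.

    Every district has the SAME true daily crime. Only the starting
    records differ (a hypothetical stand-in for past enforcement).
    """
    records = [float(r) for r in initial_records]
    for _ in range(days):
        total = sum(records)
        # "Prediction": patrols follow each district's share of recorded crime.
        allocation = [patrol_units * r / total for r in records]
        for i, units in enumerate(allocation):
            # More patrol presence means a larger fraction of the same
            # underlying crime is observed and written down.
            detected_fraction = min(1.0, detection_per_unit * units)
            records[i] += true_daily_crime * detected_fraction
    return records


def record_shares(records):
    """Each district's share of all recorded crime."""
    total = sum(records)
    return [r / total for r in records]


if __name__ == "__main__":
    start = [60, 40]  # hypothetical historical records
    end = simulate_feedback(start, true_daily_crime=10.0, days=365)
    print("starting shares:", record_shares(start))
    print("shares after a year:", record_shares(end))
```

Under these assumptions the 60/40 split in the records never moves, even though the underlying crime is identical in both districts. New data keeps arriving, and all of it appears to confirm the original allocation.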

This was documented carefully by the AI Now Institute in 2018 and confirmed repeatedly since. The Chicago Strategic Subject List (informally known as the “heat list”), which assigned risk scores to individuals rather than places, was audited in 2020 and found to have essentially no predictive validity for violent crime. People who scored in the top 1% of the risk distribution committed serious violent offenses at a rate barely distinguishable from people who scored at the 50th percentile. The algorithm was predicting prior contact with law enforcement, not future violence. The score correlates with later police contact by construction, because it is built from earlier police contact; that correlation says nothing about who will actually commit violence.

Palantir, which supplies predictive and intelligence analysis systems to dozens of police departments under contracts that are generally confidential, has consistently declined to share validation data with independent researchers. The Los Angeles Police Department used a Palantir system for what it called “predictive” investigative work for years under a contract structure that prevented the department from publishing internal assessments of effectiveness. When ProPublica and the Los Angeles Times finally obtained internal documents in 2021, the assessments were largely absent — the department had never systematically evaluated whether the system was doing what it claimed to do.

The COMPAS problem and its successors

The most extensively studied AI system in criminal justice is COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a recidivism prediction tool used in sentencing decisions across multiple US states. In 2016, ProPublica published an analysis showing that COMPAS was nearly twice as likely to falsely flag Black defendants as high risk (predicting recidivism that did not occur) and nearly twice as likely to falsely rate white defendants as low risk (failing to predict recidivism that did occur). The company behind COMPAS, Northpointe (now Equivant), disputed the methodology. Academic statisticians disagreed about whose framing was more appropriate. Meanwhile, judges kept using it in sentencing.
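The disagreement over framing is easier to follow with numbers. The sketch below uses invented counts for two hypothetical groups (they are not ProPublica's or the vendor's figures) to show how a tool can look fair on one metric and unfair on another at the same time: both groups have the same positive predictive value, yet their false positive and false negative rates differ by a factor of two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # flagged high risk, reoffended
    fp: int  # flagged high risk, did not reoffend
    fn: int  # rated low risk, reoffended
    tn: int  # rated low risk, did not reoffend

    def false_positive_rate(self) -> float:
        # Share of people who did NOT reoffend but were flagged high risk.
        return self.fp / (self.fp + self.tn)

    def false_negative_rate(self) -> float:
        # Share of people who DID reoffend but were rated low risk.
        return self.fn / (self.fn + self.tp)

    def positive_predictive_value(self) -> float:
        # Share of high-risk flags that turned out to be correct.
        return self.tp / (self.tp + self.fp)


# Hypothetical counts, chosen only to illustrate the pattern.
group_a = Confusion(tp=56, fp=28, fn=24, tn=42)
group_b = Confusion(tp=40, fp=20, fn=60, tn=80)

if __name__ == "__main__":
    for name, g in (("A", group_a), ("B", group_b)):
        print(name,
              "FPR:", round(g.false_positive_rate(), 3),
              "FNR:", round(g.false_negative_rate(), 3),
              "PPV:", round(g.positive_predictive_value(), 3))
```

A critic comparing error rates and a defender comparing predictive value can both report their numbers accurately from the same table. Which metric should govern a sentencing tool is a policy question, not one the arithmetic settles.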

The Wisconsin Supreme Court, in State v. Loomis (2016), upheld the use of COMPAS in sentencing over the defendant’s objection that he had no meaningful ability to challenge a score whose methodology was proprietary. The court found that COMPAS was one factor among many and that the sentence would have been the same without it. Whether that’s true in any individual case is unknowable, which is precisely the problem — the defendant can’t know, the judge can’t know, and the appellate court can’t know.

Since 2016, the number of AI tools used in pretrial detention decisions, bail recommendations, parole assessments, and sentence recommendations has grown substantially. Most are proprietary. Most have not been independently validated. Most are deployed under contracts that prevent the agencies using them from publishing performance data. Due process challenges have mostly failed to date, because courts have not yet settled whether algorithmic opacity violates constitutional requirements in criminal proceedings (though the Supreme Court agreed to hear a case on this question in late 2026, which may change the landscape in 2027).

The face recognition disaster

Facial recognition is a distinct technology from predictive analytics, but the accountability failures are structurally similar and the consequences are more dramatically visible. The Detroit Police Department used facial recognition to generate investigative leads in at least three cases that resulted in wrongful arrest. In each case (Michael Oliver in 2019, Robert Williams in 2020, Porcha Woodruff in 2023), the facial recognition algorithm identified a Black person as matching crime scene footage when the actual match quality was questionable and no corroborating evidence was sought before arrest.

Williams was arrested in his driveway in front of his children. Oliver spent days in jail. Woodruff, eight months pregnant, was held for about 11 hours and was treated at a hospital for dehydration after her release. The Detroit Police Department, after the third incident, finally introduced a policy requiring some human verification before arrest on a facial recognition match: three arrests too late, and only after the American Civil Liberties Union had been pressing the issue publicly for years.

The accuracy differential across demographic groups is well documented. NIST’s Face Recognition Vendor Test program has consistently shown that most commercial facial recognition algorithms perform significantly worse on darker-skinned faces, older faces, and female faces than on the demographic profile that dominates training data. In NIST’s 2019 study of demographic effects, false match rates for some groups were 10 to 100 times higher than for others, depending on the algorithm. Police departments using these systems to generate arrest leads are, in practice, deploying a technology that is substantially less reliable for identifying Black suspects than white ones, which, in a context of over-policing of Black communities, compounds existing inequities rather than correcting them.
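A rough calculation shows why a gap in false match rates matters once a probe image is searched against a large gallery. The sketch below is a simplification under stated assumptions: every comparison is treated as independent, a "match" is anything above a fixed threshold, and the baseline rate, gallery size, and multiplier are hypothetical values rather than measurements of any deployed system.

```python
def expected_false_matches(false_match_rate: float, gallery_size: int) -> float:
    """Expected number of wrong people returned by one search."""
    return false_match_rate * gallery_size


def prob_at_least_one_false_match(false_match_rate: float, gallery_size: int) -> float:
    """Chance that one search returns at least one wrong person,
    assuming each comparison is an independent trial (a simplification)."""
    return 1.0 - (1.0 - false_match_rate) ** gallery_size


if __name__ == "__main__":
    baseline_fmr = 1e-6   # hypothetical per-comparison false match rate
    gallery = 500_000     # hypothetical photo database size
    multiplier = 10       # hypothetical demographic differential

    for label, fmr in (("baseline group", baseline_fmr),
                       ("higher-error group", baseline_fmr * multiplier)):
        print(label,
              "expected false matches:", expected_false_matches(fmr, gallery),
              "P(at least one):", round(prob_at_least_one_false_match(fmr, gallery), 3))
```

Under these assumptions, a search that usually returns no wrong candidate for one group almost always returns at least one for the other. That is the situation in which a policy requiring corroboration before arrest does the most work.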

What reform actually looks like

The cities and states that have made real progress, meaning they have changed practice rather than just passed symbolic resolutions, have done specific things. They’ve banned particular applications entirely (San Francisco, Portland, and Boston have banned government use of facial recognition; the EU’s AI Act restricts real-time biometric surveillance in public spaces). They’ve required algorithmic impact assessments before deployment. They’ve mandated publication of validation data as a condition of procurement. They’ve created oversight bodies with genuine investigative authority, not just advisory capacity.

New York City’s Local Law 49 of 2021 required the NYPD to publish an annual report on automated systems used in law enforcement. The first two reports were criticized as incomplete and evasive. The third, published in 2024, was more substantive partly because the oversight body had developed the technical capacity to push back on obfuscation. Oversight without technical capacity is theater; oversight with technical capacity is slow and unglamorous but occasionally produces change.

The reckoning that needs to happen is not primarily technical. The tools are not sophisticated enough to do what they claim to do, and the validation evidence is absent or damning. The reckoning is political: law enforcement agencies have adopted these systems because they create the appearance of objective, data-driven decision-making that is harder to challenge than human judgment. That appearance of objectivity is the feature, not a bug. Dismantling it requires admitting that the emperor has been making arrest decisions based on a vendor’s proprietary methodology that has never been independently shown to work, and that the communities absorbing the cost of that decision have been the ones with the least political power to object.

The evidence has been available for a decade. The question is whether it will be acted on before a Supreme Court ruling forces the issue.