The Replication Crisis Gets an AI Problem

When AI generates the hypothesis, designs the experiment, and analyzes the results, who checks the work?

Science has been living with a replication crisis for over a decade. The crisis has a specific origin story: in 2011, a team at Bayer HealthCare reported that in-house attempts to reproduce 67 published preclinical studies, most of them in oncology, matched the original findings only about 20-25% of the time. In 2015, the Open Science Collaboration replicated 100 psychology studies and found that only 36-39% held up, depending on the criterion used. The problem was real, it was documented, and the response (more pre-registration, more open data requirements, more statistical power analysis) was genuine if slow.

Now add AI.

The replication crisis was, at its root, a story about incentives. Researchers were rewarded for publishable positive results and punished (subtly, structurally) for null results. This created a landscape of false positives, mostly not through fraud but through individually rational practices: running multiple analyses and publishing the one that worked, inflating effect sizes by leaving failed pilot attempts unreported, and stopping data collection at the convenient moment when the p-value dipped below 0.05. The crisis was a crisis of human behavior inside a broken incentive structure.
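
The last of those practices, stopping when the p-value dips, is easy to underestimate, so here is a minimal simulation sketch of it. Everything in it is an illustrative assumption rather than a reconstruction of any real study: the true effect is zero, the data are standard normal with known variance, and the names `z_test_p` and `false_positive_rate` are made up for this example.

```python
import math
import random

def z_test_p(total, n):
    """Two-sided p-value for H0: mean = 0, assuming unit variance is known."""
    z = total / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(peek, n_min=10, n_max=100, trials=2000, seed=0):
    """Fraction of null experiments that end with p < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # the true effect is zero
            # Honest design: test once, at n_max. Peeking design: test after
            # every observation from n_min onward and stop at the first "hit".
            test_now = (n == n_max) or (peek and n >= n_min)
            if test_now and z_test_p(total, n) < 0.05:
                hits += 1
                break
    return hits / trials

fixed_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"single test at n=100: {fixed_rate:.3f}")
print(f"peeking from n=10:    {peeking_rate:.3f}")
```

Under these assumptions the single planned test should land near the nominal 5% false positive rate, while the peeking design should come out well above it, even though no individual step looks like misconduct.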

AI doesn’t have incentives in the human sense. But it is trained on a literature that embodies those incentives, and it operates inside institutions that still have them. The result is something more subtle than AI fraud, and in some ways more dangerous.

How AI Inherits the Bias

Consider how a large language model trained on scientific literature learns about biochemistry. It reads papers. Those papers, as a decade of replication research has established, systematically over-represent positive results, overstate effect sizes, and encode the methodological shortcuts that plagued the pre-reform era. A model trained to generate scientific hypotheses learns what “promising” looks like from this literature. It has no access to the file drawer: the uncounted experiments that didn’t work, the pilots that were abandoned, the graduate student projects that quietly ended.
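
A small simulation sketch can make the file drawer concrete. The numbers are hypothetical (a true effect of 0.2, 20 observations per study, known unit variance), and `simulate_literature` is a name invented for this illustration. Every study estimates the same modest effect, but only the significant positive ones are "published", which is the only part a model reading the literature would ever see.

```python
import math
import random
import statistics

def simulate_literature(true_effect=0.2, n=20, studies=5000, seed=1):
    """Compare the average estimate across all studies with the published subset."""
    rng = random.Random(seed)
    every_study, published = [], []
    for _ in range(studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        estimate = statistics.fmean(sample)
        z = estimate * math.sqrt(n)  # unit variance assumed known
        every_study.append(estimate)
        if z > 1.96:  # only significant positive results get written up
            published.append(estimate)
    return {
        "true_effect": true_effect,
        "mean_all": statistics.fmean(every_study),
        "mean_published": statistics.fmean(published),
        "share_published": len(published) / studies,
    }

result = simulate_literature()
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

In this setup the average over all studies should sit close to the true effect, while the published subset has to average above the significance cutoff of roughly 0.44, more than double the real value. A model trained only on the published column inherits the inflated picture.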

This is not hypothetical. A 2026 analysis by van der Maas and colleagues at the University of Amsterdam systematically compared AI-generated hypotheses in social psychology with the pre-registered replication rates of manually derived hypotheses in the same field. The AI-generated hypotheses, when tested, replicated at rates roughly consistent with the original (pre-reform) social psychology literature — around 40%. Human-generated hypotheses at institutions that had adopted pre-registration and open science practices were replicating at closer to 65%. The AI was learning from the old playbook because the old playbook was what it had to learn from.

The Automation Pipeline Problem

A more acute version of the problem appears in fully automated research pipelines — systems that not only suggest hypotheses but design experiments, generate synthetic data for training further models, and produce draft manuscripts with minimal human intervention. Several pharmaceutical companies and at least four academic groups are running versions of these pipelines in materials science and computational chemistry.

The efficiency gains are real and substantial. An automated pipeline at a major European chemical company reportedly screened 340,000 candidate catalysts in the time a conventional team would have screened perhaps 200. The problem is that the screening criteria were themselves model-generated — the system was filtering for what it had learned to predict was promising, which meant it was filtering for patterns that looked like prior discoveries.

This is circular in a specific way. If your screening model is trained on reactions that were worth reporting in the literature, and your hypothesis generator proposes candidates that match the pattern of previously reported reactions, you have built a system that efficiently rediscovers variations on what has already been found. It just as efficiently ignores the weird, the unexpected, and the genuinely novel, which are, not coincidentally, the things that tend to be missing from training data because they have not been reported before.

The best materials discoveries of the past two years — a handful of genuinely anomalous structures that don’t fit prior categories — came from groups that explicitly used AI to flag outliers, to find the candidates that didn’t look like prior work. This is possible but requires a specific methodological commitment that runs against the natural pressure to optimize for known success patterns.
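
The difference between the two screening strategies can be sketched in a few lines. This is a toy illustration, not the method any of those groups used: the two-dimensional feature vectors, the candidate names, and the function names `distance_to_known` and `rank_candidates` are all hypothetical, and real pipelines work with far richer representations.

```python
import math

def distance_to_known(features, known_hits):
    """Distance from a candidate to its nearest previously reported success."""
    return min(math.dist(features, hit) for hit in known_hits)

def rank_candidates(candidates, known_hits, prefer_novel):
    """Order candidate names by resemblance to prior work, or by lack of it."""
    return sorted(
        candidates,
        key=lambda name: distance_to_known(candidates[name], known_hits),
        reverse=prefer_novel,
    )

# Hypothetical descriptors for reported successes and new candidates.
known_hits = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
candidates = {"A": (1.1, 1.0), "B": (1.0, 1.3), "C": (4.0, 0.2)}

conventional = rank_candidates(candidates, known_hits, prefer_novel=False)
outlier_first = rank_candidates(candidates, known_hits, prefer_novel=True)
print("closest to prior work first:", conventional)
print("least like prior work first:", outlier_first)
```

The same distance score drives both rankings; the only thing that changes is the direction of the sort. That is the methodological commitment in miniature: a pipeline has to decide deliberately to look at candidate C first.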

Peer Review Under Pressure

The peer review system is visibly struggling. This is not new — it was struggling before AI — but AI has changed the nature of the struggle.

The volume problem is one dimension. If a group using AI-assisted research can produce ten papers per year where they previously produced two, then the pool of papers requiring review grows faster than the pool of qualified reviewers. Nature and Science have acknowledged receiving submission volumes in 2026 that are 40-60% higher than 2022 levels, and their desk rejection rates have risen accordingly. The papers that get through to review are, if anything, harder to evaluate because they involve computational methods that fewer reviewers can fully assess.

The expertise problem is another. A reviewer asked to evaluate an AI-assisted drug discovery paper needs to be competent in medicinal chemistry, computational molecular dynamics, and machine learning. This person exists but is rare and busy. In practice, papers are being reviewed by people who can evaluate some of these dimensions but not all, and the portions they cannot evaluate are precisely the portions where AI-specific errors — training data leakage, overfitting to benchmark datasets, incorrect uncertainty quantification — are most likely to hide.

The reproducibility problem, in this context, takes a new form. Traditional replication involves another lab attempting the same experiment. AI-generated results often cannot be replicated without access to the same model, the same training data, or the same computational infrastructure. If a team uses a proprietary model version that no longer exists, or a dataset that was never publicly released, the result is in principle unverifiable. The movement toward model cards, datasheets for datasets, and computational reproducibility standards is real and important, but it is lagging behind the deployment of the systems it is trying to govern.

Where This Lands

There is a version of this story that ends badly: AI accelerates the publication of unreplicable results, these results propagate into clinical trials and industrial processes, and several high-profile failures eventually catalyze a response. There is a version that ends well: the same computational infrastructure that generates research at scale also enables much cheaper replication (you can re-run a computational experiment far more easily than re-running a wet lab experiment), and the community converges on standards that distinguish hypothesis generation from validation.

The field is, honestly, somewhere in between. The pre-registration movement that emerged from the first replication crisis has an AI-era equivalent: registered reports with computational components, papers where the analysis pipeline is committed to before the results are known. Several journals have adopted this for AI-assisted research, and the early evidence suggests it does reduce the share of positive results, which is what one would expect if the format is removing publication bias, and that is exactly what it should do.
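
One way to picture pre-committing a pipeline is as a fingerprint taken before any results exist. The sketch below is an assumption about how that could be done, not a description of any journal's actual workflow: the plan fields, the `example-model-v1` label, and the functions `commit_plan` and `verify_plan` are hypothetical.

```python
import hashlib
import json

def commit_plan(plan):
    """Return a SHA-256 fingerprint of an analysis plan."""
    # Sorting keys makes the fingerprint independent of dictionary order.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_plan(plan, digest):
    """Check that the plan used for the analysis matches the committed one."""
    return commit_plan(plan) == digest

# Hypothetical plan, written down before any data are analyzed.
plan = {
    "hypothesis": "candidate set X outperforms baseline Y",
    "primary_metric": "yield",
    "test": "two-sided t-test",
    "alpha": 0.05,
    "sample_size": 120,
    "model_version": "example-model-v1",
}

digest = commit_plan(plan)
tampered = dict(plan, alpha=0.10)  # threshold loosened after seeing results
print("original plan verifies:", verify_plan(plan, digest))
print("tampered plan verifies:", verify_plan(tampered, digest))
```

Depositing the digest with a journal or registry before the run would let a reviewer confirm later that the threshold, metric, and model version were not adjusted once the results came in. It says nothing about whether the plan was a good one, which remains a human judgment.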

What it cannot do is eliminate the deeper problem: that AI systems trained on biased literatures will generate biased hypotheses, and that the most efficient research pipelines will exploit whatever patterns are most common in training data rather than whatever patterns are most likely to reflect reality. Science has always been a negotiation between speed and rigor. AI has not resolved that tension. It has sharpened it.


The replication crisis was, in the end, a story about what we reward. Nothing about the introduction of AI has changed what we reward. Until that changes, the crisis will continue — faster, more automated, and wearing better-looking charts.