The Hidden Math That Explains Why AI Keeps Getting Smarter
In 2020, a team of researchers at OpenAI published a paper with a title that did not hint at its significance: “Scaling Laws for Neural Language Models.” The paper was dense with mathematics, charts, and careful empirical analysis. It was read by a small number of specialists, mostly ignored by everyone else, and it reshaped how the most consequential technology of our era was built.
The core finding sounds simple: if you train a language model on more data, with more compute, and make the model bigger, its performance improves. Predictably. Reliably. According to a specific mathematical relationship.
This is less obvious than it sounds.
Before Jared Kaplan and his colleagues published their findings, the prevailing assumption in machine learning research was that scaling up models was a reasonable but not guaranteed path to improvement. Researchers knew that bigger models often performed better, but “often” is very different from “always, by a predictable amount, according to a law.” Science runs on laws. Laws allow prediction. Prediction allows planning. The difference between “scaling up usually helps” and “scaling up helps by this specific mathematical relationship” is the difference between intuition and engineering.
What the paper found was a set of power-law relationships: the loss of a model (its cross-entropy on text, a measure of how badly it predicts the next token in a sequence) decreases as a smooth function of the number of parameters in the model, the amount of training data, and the amount of compute used, as long as the other two are not the bottleneck. Double the parameters, and loss falls by a predictable factor. Double the data, and loss falls by another predictable factor. The exponents in these power laws are small, typically somewhere between 0.07 and 0.35 depending on what you are scaling, which means that you need to scale by large multiples to get meaningful improvement. But the relationship is reliable.
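To make "a predictable factor" concrete, here is a minimal Python sketch of a single-variable power law. The functional form L(N) = (N_c / N)^alpha and the constants alpha = 0.076 and N_c = 8.8e13 are illustrative assumptions, with alpha taken from near the low end of the range quoted above; they are not presented as the paper's fitted values, and the function names are made up for this example.

```python
def power_law_loss(n_params: float, alpha: float = 0.076, n_c: float = 8.8e13) -> float:
    """Loss under an assumed pure power law in parameter count: (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha


def loss_ratio_when_scaling(factor: float, alpha: float) -> float:
    """Multiplicative change in loss when the scaled quantity grows by `factor`."""
    return factor ** (-alpha)


def scale_needed_for_loss_ratio(target_ratio: float, alpha: float) -> float:
    """Growth factor required so that the loss is multiplied by `target_ratio`."""
    return target_ratio ** (-1.0 / alpha)


if __name__ == "__main__":
    alpha = 0.076  # assumed exponent, for illustration only
    print("loss ratio after doubling:", loss_ratio_when_scaling(2.0, alpha))
    print("scale-up needed to halve loss:", scale_needed_for_loss_ratio(0.5, alpha))
```

Under these assumptions, each doubling lowers the loss by only about 5 percent, and halving the loss would take a scale-up of roughly 9,000 times. That is what a small exponent means in practice: reliable gains, bought with very large multiples.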
Why was this shocking to researchers who had spent years working on these systems? Because nothing in the history of engineering suggested it should be true. Complex systems generally do not behave this cleanly. Software systems almost never improve smoothly as you make them bigger — more code generally means more bugs, more interaction effects, more opportunity for things to go wrong. Neural networks, before the deep learning era, were understood to have diminishing returns and various failure modes that emerged as they grew larger. The discovery that language models specifically behaved according to clean power laws, across many orders of magnitude of scale, was genuinely surprising.
The practical implication was enormous. If you knew the relationship between compute and loss, you could predict how well a model trained with a given amount of compute would perform before you built it. OpenAI, Google, and other frontier labs were suddenly not doing research in the traditional sense (running experiments to discover whether something new was true) but engineering in a new sense: scaling up a known relationship to extract predicted improvements. What the laws predict directly is loss; how a given loss translates into particular abilities is less exact. Even so, GPT-3, with 175 billion parameters, was less a research experiment whose results were uncertain than an engineering prediction: this size model, trained on this much data, should land about here on the curve. And it did.
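The "predict before you build" workflow can be sketched in a few lines of Python: fit a power law to a handful of small training runs, then extrapolate to a budget nobody has paid for yet. Everything here is hypothetical. The data points are synthetic, generated from a made-up curve, and the function names are invented for this example. Real runs are noisy, and real fits often include an irreducible-loss term that this sketch leaves out.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x ** (-b), done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a: float, b: float, x: float) -> float:
    """Extrapolate the fitted curve to a new compute budget x."""
    return a * x ** (-b)


if __name__ == "__main__":
    # Synthetic "small runs": compute budgets in FLOPs and losses from a made-up curve.
    true_a, true_b = 30.0, 0.05  # hypothetical constants, not measured values
    small_compute = [1e15, 1e16, 1e17, 1e18]
    small_losses = [true_a * c ** (-true_b) for c in small_compute]

    a, b = fit_power_law(small_compute, small_losses)
    print("fitted exponent:", b)
    print("predicted loss at 1e21 FLOPs:", predict_loss(a, b, 1e21))
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating exponent and the extrapolation is exact. With real measurements, the interesting question is how far beyond the fitted range the straight line in log-log space keeps holding.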
GPT-4 followed the same logic. By the time OpenAI announced it, the frontier AI labs had internalized the scaling laws deeply enough that they could plan model development years in advance, make capital allocation decisions based on predicted capability improvements, and in some cases publicly commit to capability targets that they had not yet achieved because the math said they would. The race that followed was not primarily a research race — a competition to discover new things — but a capital race: who could mobilize enough compute, enough data, and enough engineering talent to stay on the scaling curve.
This reframing has consequences that ripple through everything. It partly explains why the major AI companies have been so willing to spend such extraordinary sums of money. If capability reliably improves with scale, then the question is not “will this investment pay off?” but “can we afford to fall behind on the scaling curve?” The latter question has a much more urgent answer, because falling behind on the curve means a competitor achieves capabilities you do not have, and those capabilities translate into products, revenue, and further investment capacity. The scaling laws transformed AI research into something more like a resource extraction race: the prize goes to whoever can mobilize the most resources fastest.
But scaling laws have limits, and understanding those limits is now one of the central questions in AI research. The original Kaplan et al. paper found clean power law behavior over several orders of magnitude of compute and model size. It did not claim that the relationship would hold forever. And there is evidence — genuinely contested, with serious researchers on both sides — that the models being trained today are approaching regions where the returns to simple scaling are diminishing.
The evidence for diminishing returns is partly empirical: successive generations of models have delivered less dramatic qualitative improvement per dollar of compute than earlier generations did. GPT-3 to GPT-4 felt like a large leap. The incremental improvements since then have felt smaller, though this perception is confounded by the difficulty of measuring qualitative capability improvements on tests that the models are now very good at.
A separate and important refinement came from DeepMind in 2022, in a paper introducing the Chinchilla model. The Chinchilla researchers argued that previous scaling laws had been misapplied: labs were training models that were too large for their data budgets, and the optimal strategy was to scale model size and data quantity together, in roughly equal proportion. Chinchilla itself, with 70 billion parameters, outperformed Gopher, DeepMind’s earlier 280-billion-parameter model, on about the same training compute by training on roughly four times as much data. Because smaller models are also cheaper to run for inference, the smaller model trained on more data was more economical in practice. This finding shifted the industry’s scaling strategy and renewed confidence in the scaling paradigm: it was not breaking down, but it had been applied inefficiently.
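The trade-off between model size and data can be sketched in Python under two widely used approximations, both of which are assumptions here rather than anything stated above: training compute is roughly 6 FLOPs per parameter per token (C ≈ 6ND), and the compute-optimal ratio is on the order of 20 training tokens per parameter. The function name and defaults are hypothetical.

```python
import math


def compute_optimal_split(
    compute_flops: float,
    tokens_per_param: float = 20.0,       # assumed rule-of-thumb ratio
    flops_per_param_token: float = 6.0,   # assumed cost model: C ~ 6 * N * D
):
    """Split a training budget into (parameters, tokens) at a fixed tokens-per-parameter ratio."""
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e22, 1e23, 5.88e23):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.2e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, parameters and tokens both grow with the square root of compute, so quadrupling the budget doubles each. A budget of about 5.88e23 FLOPs comes out at roughly 70 billion parameters and 1.4 trillion tokens.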
What comes after scaling? This is the question that everyone in frontier AI is asking and no one can fully answer. One answer is that the next dimension of scaling is inference compute: spending more compute at query time, so that a model thinks longer about each individual problem, rather than spending it all on training. This is the intuition behind “chain-of-thought” reasoning and behind the o1-class models that OpenAI introduced, which can spend more inference compute on harder problems. There is preliminary evidence that scaling inference compute follows its own predictable curve, opening a new dimension of scaling.
Another possibility is that the scaling laws describe a specific regime (text prediction on internet-scale data) and that breaking into genuinely new capability requires different approaches: new modalities, new training objectives, new architectural innovations that change the underlying relationship between compute and capability. This would mean that the smooth, predictable progress of the past half-decade is giving way to a period in which the next breakthrough can no longer be read off the current curve.
Either way, the scaling laws have already done their work. They transformed AI from an unpredictable research frontier into something closer to a planned industrial process. They created the conditions for the extraordinary capital investment that has reshaped the entire technology industry. They explained, to anyone paying attention, why the AI capabilities seen in the past few years were not magic or accident but the predictable consequence of a mathematical relationship being relentlessly exploited by organizations with access to enough capital.
The math was always there. Someone just had to find it. And once found, it could not be unfound.




