Compute Economics

Training vs. Inference: Why AI's Energy Debate Is Focused on the Wrong Thing

Everyone talks about the carbon cost of training GPT-sized models. The real story is what happens after launch.

By Jakub Jirák | Jul 2, 2029 | 5 min read

Tags: inference, training, ai-energy, compute, data-centers

The energy cost of a single training run for a large language model is staggering by any individual measure. A frontier model trained at the scale of those released in 2027 and 2028 consumes somewhere between 50 and 500 gigawatt-hours of electricity: the lower end for smaller frontier models, the higher end for the largest multimodal systems trained across multiple data center clusters over months. These numbers circulate widely in technology journalism, inspire concern, and occasionally inspire press releases about renewable energy commitments. They are also, in a structural sense, not the main event.

The main event is inference. Every time you send a message to an AI assistant, generate an image, run a document through a summarization pipeline, let an automated agent browse the web on your behalf, or interact with any of the thousands of AI-integrated products now embedded in business software — that’s inference. It draws on an already-trained model, processes your input, and produces output. Each individual inference event is cheap. The sum of all inference events globally is not.

Current estimates, drawing on hyperscaler disclosure data and the EU’s mandatory data center reporting framework (in effect since 2026), put inference at roughly 83–87% of total AI electricity consumption. Training accounts for the remaining 13–17%. This ratio has been relatively stable since around 2026, despite the enormous growth in training compute, because inference has grown faster still.

Why does this matter? Because the interventions appropriate for a training-dominated energy problem are completely different from those appropriate for an inference-dominated one.

If training were the primary driver, the right policy levers would be things like: requiring frontier model developers to purchase matching renewable energy (same hour, same grid) for training runs; setting efficiency benchmarks for training hardware; perhaps even creating a registry of large training runs with associated carbon disclosures. These are relatively tractable interventions. A training run is a discrete event with a known location, time, and energy consumption. You can audit it.

Inference is not like that. It’s continuous, distributed across hundreds of data centers globally, responding to real-time demand that fluctuates by orders of magnitude between a quiet Tuesday morning and a viral news event that causes everyone to simultaneously ask an AI for commentary. Inference efficiency is fundamentally about capacity planning, hardware utilization, batching strategies, and model compression — and most of these are operational details that aren’t visible in any public disclosure.

The efficiency of inference has improved substantially. The energy required per token generated has dropped by roughly 60% since 2023, driven by improved hardware (custom silicon from every major hyperscaler, specialized inference chips from a half-dozen startups that have since been acquired), better quantization and distillation techniques that reduce model size without proportionate capability loss, and improved batching that spreads fixed overhead costs across more simultaneous requests. These are real gains.
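The batching effect mentioned above is easy to see in a toy model. The sketch below assumes each batch pays a fixed overhead once plus a marginal cost per request; the function name and both joule figures are hypothetical values chosen only to show the shape of the curve, not measurements from any real system.

```python
def energy_per_request(fixed_joules: float, marginal_joules: float, batch_size: int) -> float:
    """Energy attributed to one request when a batch shares a fixed overhead."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # The fixed overhead is split evenly across the batch; the marginal cost is not.
    return fixed_joules / batch_size + marginal_joules


# Hypothetical numbers (assumptions, not measured data).
FIXED_J = 80.0     # overhead paid once per batch
MARGINAL_J = 5.0   # additional energy for each request in the batch

for batch_size in (1, 4, 16, 64):
    per_request = energy_per_request(FIXED_J, MARGINAL_J, batch_size)
    print(f"batch={batch_size:>2}  energy per request={per_request:.2f} J")
```

Under these assumed numbers the per-request cost falls quickly at first and then flattens toward the marginal cost, which is why batching gains eventually run out.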

But again, total inference energy went up anyway, because the volume of inference grew faster than per-token efficiency improved.

There’s a specific pattern in how AI companies communicate about this that is worth understanding. Sustainability reports tend to feature per-unit efficiency metrics prominently: energy per query, energy per token, energy per dollar of revenue. These metrics are improving. They are presented as evidence of environmental progress. They are not evidence of declining total impact, because the unit volume isn’t disclosed alongside the efficiency metric.
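The gap between an improving intensity metric and a rising total is simple arithmetic: total energy is energy per unit multiplied by unit volume. The sketch below works that out using the roughly 60% per-token reduction cited earlier. The function names and the tenfold volume growth figure are illustrative assumptions, since actual volumes are exactly what goes undisclosed.

```python
def breakeven_volume_growth(per_unit_reduction: float) -> float:
    """Volume multiple at which total energy is unchanged after a per-unit cut."""
    if not 0.0 <= per_unit_reduction < 1.0:
        raise ValueError("per_unit_reduction must be in [0, 1)")
    return 1.0 / (1.0 - per_unit_reduction)


def total_energy_ratio(per_unit_reduction: float, volume_growth: float) -> float:
    """New total energy divided by old total energy."""
    return (1.0 - per_unit_reduction) * volume_growth


REDUCTION = 0.60               # per-token reduction cited in the article
ASSUMED_VOLUME_GROWTH = 10.0   # hypothetical, for illustration only

print(f"break-even volume growth: {breakeven_volume_growth(REDUCTION):.2f}x")
print(f"total energy ratio: {total_energy_ratio(REDUCTION, ASSUMED_VOLUME_GROWTH):.2f}x")
```

With a 60% per-unit reduction, any volume growth beyond 2.5 times the baseline means total consumption has risen, however good the intensity figure looks on its own.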

This is a known pattern in corporate sustainability communication. Airlines did something similar for years with fuel efficiency per seat-mile while overall jet fuel consumption grew. Fast fashion brands have done it with water consumption per garment while total production volumes increased. The per-unit metric is real; the presentation of it as a proxy for total impact is misleading.

The EU’s AI Environmental Disclosure Regulation, which took effect in January 2028, requires absolute consumption reporting rather than intensity-only reporting. US disclosure requirements remain voluntary for most AI operators, with the exception of facilities over 100 MW of IT load, which must now report to the DOE under rules finalized in 2027. The result is that European AI operations are now substantially more transparent than American ones, which means the global estimates we’re working with rest on firmer data for Europe than for the United States.

The inference problem has an additional complication that rarely appears in the energy conversation: the nature of what inference is being used for has changed significantly since 2023.

Early commercial inference was mostly user-facing: chat, search, image generation. These workloads have a natural ceiling — humans have finite attention and finite hours in the day. A user sends a hundred queries, maybe a thousand if they’re a very heavy user. This is bounded consumption.

Agentic AI systems are not bounded in the same way. An automated agent tasked with research, business process management, or software development may make millions of inference calls to complete a multi-day task, most of them invisible to any human. Enterprise AI agents now consume on the order of 10 to 100 times more inference compute than the same organization’s user-facing AI deployment, and this ratio is growing as organizations deploy more autonomous systems. A single AI agent doing week-long autonomous work can require more inference compute than several hundred active human users.
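A rough way to compare the two kinds of workload is to express an agent's call count in "user equivalents". The sketch below is a back-of-the-envelope calculation only: the function name, the one-million-call agent, the thousand-query heavy user, and the relative cost per call are all assumptions picked from the ranges mentioned above, and they are taken to cover the same time period.

```python
def user_equivalents(agent_calls: int, calls_per_user: int,
                     relative_cost_per_call: float = 1.0) -> float:
    """Number of human users whose inference load matches one agent's.

    relative_cost_per_call scales agent calls against user calls
    (1.0 means an agent call costs the same as a user query).
    """
    if calls_per_user <= 0:
        raise ValueError("calls_per_user must be positive")
    return agent_calls * relative_cost_per_call / calls_per_user


# Hypothetical inputs over the same period (assumptions, not data).
AGENT_CALLS = 1_000_000      # "millions of inference calls" at the low end
HEAVY_USER_QUERIES = 1_000   # a very heavy human user

print(user_equivalents(AGENT_CALLS, HEAVY_USER_QUERIES))
print(user_equivalents(AGENT_CALLS, HEAVY_USER_QUERIES, relative_cost_per_call=0.5))
```

Even if each agent call is assumed to cost half as much as a human query, one agent in this sketch still matches hundreds of very heavy users.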

This shift toward agentic workloads is the part of the AI energy story that is most systematically underestimated in current projections. The IEA’s 2028 report acknowledges this uncertainty explicitly, noting that agentic systems represent “a qualitative shift in inference demand characteristics that current modeling frameworks were not designed to capture.”

Concretely, then, what should a realistic policy framework for AI inference energy look like?

First, mandatory absolute consumption reporting across all major markets. The per-unit efficiency metric needs to be accompanied by total volume, so that efficiency claims can be evaluated in context. This is straightforward and overdue.

Second, temporal and locational matching requirements for renewable energy procurement. The practice of buying annual renewable certificates to offset real-time consumption on fossil-heavy grids needs to end. Several EU jurisdictions have already moved in this direction; US federal policy has not.
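The difference between annual certificates and temporal matching can be shown with a toy profile. The sketch below compares a netted match (total renewable supply against total consumption over the period) with an interval-by-interval match, where surplus in one interval cannot cover a deficit in another. The function names and the four-interval profile are hypothetical, and locational matching is not modeled here.

```python
from typing import Sequence


def netted_match(consumption_mwh: Sequence[float], renewable_mwh: Sequence[float]) -> float:
    """Share of consumption covered when supply is netted over the whole period."""
    return min(1.0, sum(renewable_mwh) / sum(consumption_mwh))


def interval_match(consumption_mwh: Sequence[float], renewable_mwh: Sequence[float]) -> float:
    """Share of consumption covered when supply must match within each interval."""
    if len(consumption_mwh) != len(renewable_mwh):
        raise ValueError("profiles must have the same number of intervals")
    # Surplus renewable energy in one interval does not carry over to another.
    matched = sum(min(c, r) for c, r in zip(consumption_mwh, renewable_mwh))
    return matched / sum(consumption_mwh)


# Toy profile (assumed): flat inference load, solar-like supply.
consumption = [10.0, 10.0, 10.0, 10.0]
renewable = [0.0, 20.0, 20.0, 0.0]

print(f"netted:   {netted_match(consumption, renewable):.0%}")
print(f"interval: {interval_match(consumption, renewable):.0%}")
```

In this toy case the operator could claim a 100% match on a netted basis while only half of its consumption coincides with renewable supply.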

Third, meaningful energy efficiency standards for inference hardware sold into commercial markets — analogous to appliance efficiency standards, but for server-class equipment. The efficiency variance between the best and worst inference hardware currently available is large enough that standards could meaningfully shift market incentives without restricting capability development.

None of these interventions requires restricting AI development or capping training runs. I remain genuinely skeptical of that instinct, because it treats the wrong variable as the primary control. Inference is where the consumption is. Inference is where the policy should focus.

The training conversation is not unimportant. But it has absorbed attention that would be better spent on the less dramatic, less photogenic, and considerably larger problem of continuous inference at planetary scale.