The Gap Between Demo and Reality

How the People Who Build AI Actually Use It (It's Very Different from How They Describe It)

The engineers at AI labs use their own tools in ways that would horrify most of the people buying them

By Jakub Jirák, Jul 5, 2026

There is a peculiar thing you notice when you spend time with ML engineers at the serious labs — the people who actually built GPT-4, Claude, Gemini, the systems being sold to enterprises as transformative autonomous capabilities.

They use these tools carefully. Skeptically, even. They verify outputs. They don’t trust confident-sounding answers without checking sources. They know, with precision, what the failure modes are and where the edges of reliable performance sit. When they use an LLM for code generation, they read every line before committing it. When they use it for research summaries, they go back to the primary sources for anything that actually matters. They treat AI as a drafting assistant that requires editorial oversight, not an authority that can be trusted directly for anything with real consequences.

This is not hypocrisy. It’s expertise. They know what the system is doing at a mechanical level, and that knowledge produces appropriate caution.

The products built by these same engineers are sold to enterprises as systems capable of autonomous operation at scale. The gap between those two things — how the builders use the tools and how the tools are marketed — is where most AI disasters will come from.

Let me be specific about what caution looks like at the level of people who understand these systems deeply.

A senior ML engineer at a frontier lab using an LLM for code assistance doesn’t paste in a function description, accept the output, and commit it. They review the logic carefully. They check that the imports are real libraries, because hallucinated imports are a documented failure mode: models sometimes confidently generate a line like import nonexistent_library, with a plausible-sounding name for a package that does not exist. They think about edge cases the model might have missed, because LLMs are strongest on the common cases in their training data and the failures live at the edges. They run tests and look at the actual outputs under conditions the LLM didn’t reason about explicitly. For anything touching security, authentication, or system architecture, they’re often not using LLM suggestions at all, because the failure surface is too unpredictable given the risk profile.
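
The import check is one of the easier parts of that review to automate. What follows is a minimal sketch in Python, assuming the generated code is available as a string and that checking against the local environment is good enough. The function name unresolved_imports is hypothetical. A name that resolves only shows that some package with that name is installed, not that it is the package the model had in mind, so this supplements reading the code and does not replace it.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level modules imported by `source` that are not installed locally."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b.c" -> check the top-level package "a"
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (level > 0); they refer to the local package.
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)


# Hypothetical model output containing one real and one invented import.
generated = "import json\nimport nonexistent_library\nfrom os import path\n"
print(unresolved_imports(generated))
```

Anything the function returns is worth looking up by hand before the code goes any further.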

For research? They’ll use an LLM to generate an initial map of a literature domain — what papers exist, what the main debates are, what terminology is used. Then they go read the papers themselves. They absolutely do not rely on LLM summaries of specific papers for anything that matters. The hallucinated citation problem is too real and too expensive to get wrong. A model that confidently describes a 2019 Nature paper’s findings, where the paper exists but the described findings are a fabrication assembled from nearby training data, produces a failure that looks identical to correct output until you check the actual paper. And not everyone checks.

The AI researcher Andrej Karpathy, a founding member of OpenAI who later led the Autopilot vision team at Tesla, has written publicly about using LLMs as “simulators” or “approximate reasoners” whose outputs are useful starting points requiring human verification. This framing is carefully chosen by someone who helped build the underlying technology. He’s not saying the tools are bad. He’s saying they require a specific posture from the user, one that involves maintaining skepticism and verifying claims against external ground truth.

Now look at how these systems are sold.

“Your AI employee, working 24/7.” “Autonomous agents that handle complex multi-step workflows without human intervention.” “AI that understands your business context and makes decisions independently.” The marketing language across virtually every enterprise AI vendor clusters around autonomy, reliability, and the reduction of human oversight requirements.

This is not a small misrepresentation. It’s categorical. The difference between “useful tool that requires expert human oversight to function reliably” and “autonomous system capable of independent operation” is not a marketing nuance. It’s the entire question of where risk lives and who is responsible for outcomes when things go wrong.

When the people who build these systems use them with heavy human oversight, and the marketing language implies minimal human oversight is required, the gap is structural and it doesn’t get fixed by adding a disclaimer to the bottom of a sales deck.

Why does this happen? Because the economic incentives don’t reward honesty about failure modes. If Company A’s sales team accurately describes the rate at which their model hallucinates in domain-specific enterprise contexts, and Company B’s sales team doesn’t disclose this, Company A loses contracts to Company B. Systematically. Over time, accurate disclosure is selected against in the market. This is a classic information asymmetry problem, and it plays out exactly as economic theory predicts: the sellers with the most optimistic claims get the deals, and the costs of those claims land on buyers rather than sellers.

The problem is compounded by the organizational structure of how enterprise AI gets bought.

The decision-makers — CTO, CDO, VP of AI Strategy — have seen the demos. The demos are genuine; frontier models are impressively capable in controlled demonstrations on representative tasks. What the demos don’t show is performance variance across the actual distribution of tasks the system will face in production. They don’t show what happens when inputs are ambiguous, poorly formatted, or outside the distribution the vendor used for evaluation. They don’t show the failure modes that appear only at scale, under conditions that weren’t in the demo script.
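
One way to see what a demo hides is to break evaluation results out by input type instead of reporting a single headline number. The sketch below assumes each evaluated example has been tagged with a slice label. The slice names and the results are made up for illustration and are not measurements of any real system.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """Compute accuracy per slice from (slice_name, was_correct) pairs."""
    counts = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_name, was_correct in results:
        counts[slice_name][0] += int(was_correct)
        counts[slice_name][1] += 1
    return {name: correct / total for name, (correct, total) in counts.items()}


# Made-up evaluation log: clean inputs dominate, as they would in a demo script.
results = (
    [("clean", True)] * 18 + [("clean", False)] * 2
    + [("ambiguous", True)] * 2 + [("ambiguous", False)] * 2
)

overall = sum(ok for _, ok in results) / len(results)
per_slice = accuracy_by_slice(results)

print(f"overall: {overall:.2f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.2f}")
```

With this toy data the overall figure sits above 0.8 while the ambiguous slice is at 0.5. If production traffic is mostly ambiguous inputs, the headline number says little about what the buyer will actually experience.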

The people who understand these failure modes — ML engineers, AI researchers — are rarely in the room when the enterprise contracts get signed. They’re not the decision-makers. There’s a structural information asymmetry: the people with the most relevant technical knowledge are not the people making the buying decisions, and the vendors have strong incentives to ensure that asymmetry persists.

I’ve talked to engineers at companies that have deployed LLM systems in consequential roles — reviewing medical documentation, screening legal filings, analyzing financial reports. These engineers privately describe their own systems with a level of caution that would alarm anyone who had read the external product descriptions. The phrases recur: “We never let it make final decisions on anything that actually matters.” “There’s always a human review step, which we don’t advertise.” “We’ve been careful about which use cases we actually run it on — the ones where it’s reliable.” This is sensible engineering practice from people who understand the failure modes. It is completely invisible to the clients being sold the autonomous vision.

There is a telling signal in how the AI companies structure their own internal quality processes.

Anthropic, which makes Claude, has an alignment team and a trust and safety team staffed by humans who review model outputs, identify failure modes, run adversarial evaluations, and continuously monitor for ways the system can be induced to fail. OpenAI has a safety systems team. Google DeepMind has internal red-teaming operations. These teams exist because the organizations building these systems understand that the systems require sustained human oversight to function safely — not just at deployment time, but continuously.

None of that internal oversight infrastructure is sold to enterprise clients as part of the standard deployment package. The client gets the model, some documentation, an API, and a support contract. The human oversight apparatus that makes internal deployments safer is a cost the client is expected to build themselves — without necessarily knowing that they need to, because the sales process didn’t explain the operational requirements clearly enough to make that need obvious.

This isn’t a conspiracy. It’s a market failure with a specific, identifiable structure. The information about what’s required to deploy AI safely is systematically withheld from buyers because disclosing it would reduce sales and slow deployment timelines. The costs of not disclosing it land on buyers rather than sellers. Standard vendor moral hazard, applied to technology where the consequences of overconfident deployment can be large, diffuse, and long-delayed.

The fix is not technically complicated to describe. Require vendors to disclose accuracy rates on representative task distributions for the domains they’re selling into. Require disclosure of known failure modes. Require disclosure of the minimum human oversight recommended for safe deployment in high-stakes contexts. Something like a nutritional label — not perfect, because standardization across diverse use cases is genuinely hard, but better than nothing.
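
To make the label idea concrete, here is a rough sketch of what such a disclosure might look like as a data structure, with a helper that tells a buyer which parts are still missing. Every name in it (DisclosureLabel, its fields, missing_fields) is an assumption made for illustration. No existing standard or vendor format is implied.

```python
from dataclasses import dataclass, field


@dataclass
class DisclosureLabel:
    """Hypothetical vendor disclosure; all field names are assumptions."""
    domain: str
    accuracy_by_task: dict[str, float] = field(default_factory=dict)
    known_failure_modes: list[str] = field(default_factory=list)
    minimum_oversight: str = ""

    def missing_fields(self) -> list[str]:
        """List the disclosures a buyer would still need to ask for."""
        missing = []
        if not self.accuracy_by_task:
            missing.append("accuracy_by_task")
        if not self.known_failure_modes:
            missing.append("known_failure_modes")
        if not self.minimum_oversight:
            missing.append("minimum_oversight")
        return missing


# A vendor that names a failure mode but gives no accuracy or oversight guidance.
label = DisclosureLabel(
    domain="contract review",
    known_failure_modes=["fabricated citations"],
)
print(label.missing_fields())
```

The value of a structure like this is less in the code than in the checklist it forces: an empty field is a question the buyer has not yet had answered.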

The reason this hasn’t happened isn’t technical. It’s political. The major AI vendors have lobbied against specific disclosure requirements, arguing that the technology changes too fast for regulatory mandates to keep up. There’s a real point here — evaluation frameworks do become stale quickly in a fast-moving field, and poorly designed mandates could create compliance theater rather than real transparency. But the alternative, which is the current status quo, is enterprise deployments made on the basis of vendor claims that are systematically more optimistic than the empirical performance of the systems justifies.

The gap between how builders use these tools and how the tools are sold is not a secret. The engineers at the labs know it. The researchers studying AI deployments know it. The people writing case studies of AI failures know it. The only people who don’t reliably know it are the enterprise buyers signing the contracts — and that information asymmetry is not an accident. It’s the business model.

There’s a further dimension worth stating. The same engineers who use these tools cautiously, with verification at every step, are also the ones who eventually get called in when an enterprise deployment fails. They’re the people who do the post-mortem. And every time, they find the same thing: the deployment assumed levels of reliability that the system doesn’t have, and assumed levels of human oversight that the deployment didn’t include. The post-mortem findings are clear. They change nothing about the next enterprise sales cycle, because the post-mortem findings never make it into the next sales deck.

Part of what makes this hard to fix is that the people best positioned to close the gap — the engineers who understand the systems — don’t control the sales process and often don’t have organizational authority to constrain deployment decisions. A senior ML engineer who tells a potential enterprise client “here’s the honest failure rate on your use case, and here’s the minimum human oversight you’ll need to make this safe” is undermining the sales pitch. The incentive structure inside the companies selling AI doesn’t reward that kind of honesty, even from people who genuinely want to provide it.

The fix, if one comes, will not come from inside the companies. It will come from enterprise buyers who start insisting on evaluation evidence before signing contracts, from regulators who require disclosure, and from the accumulation of visible, attributable failures that make the gap impossible to ignore. That process takes time. In the meantime, the gap persists, and most enterprise clients are on their own to figure out what they’re actually getting.

When the failures accumulate enough to be publicly visible and financially attributable, the reaction from the industry will be surprise. It shouldn’t be. The engineers at the labs have been quietly clear-eyed for years about the gap between what they know their systems can do and how they’re being sold. The gap just hasn’t been visible enough, yet, to force a reckoning.
