Agents in the Wild

What Autonomous Agents Are Actually Being Used For

Strip away the conference-stage demos and you find a much narrower, more interesting set of real deployments

By Jakub Jirák Jan 1, 2027 9 min read

ai-agents automation enterprise-ai production-ai agentic-ai

If you watched the product launches and research demos of 2025 and 2026, you came away believing that autonomous AI agents were about to manage your calendar, negotiate your contracts, run your software deployments, and possibly file your taxes. The agents in those demos were impressive. They traversed the web, wrote code, called APIs, caught their own mistakes, and narrated their reasoning in calm, confident prose. The demos were real. What they showed was technically possible.

What they did not show was what actually happens when you try to run that same agent in a production environment, against real data, with real edge cases, under real organizational policies, at real scale.

By early 2027, enough enterprises have moved past the pilot phase that a pattern has emerged. It is not the one predicted in the breathless 2025 coverage. It is narrower, more specific, and frankly more interesting for what it reveals about the gap between what agents can do in controlled conditions and what they can reliably do in the wild.

The clearest success story, repeated across financial services, logistics, and healthcare administration, involves what practitioners have started calling “bounded document work.” This is a category of tasks where the inputs are structured and predictable (invoices, insurance claims, regulatory filings, shipping manifests), the required outputs follow known templates, the rules governing transformations can be codified, and the cost of a wrong output is recoverable — meaning a human can catch it before it propagates. In these contexts, agents built on large language models have proven genuinely reliable. JPMorgan Chase reportedly processes north of two million routine compliance documents per month through agent pipelines with human-review rates below eight percent. That is not glamorous, but it is real.
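To make the shape of bounded document work concrete, here is a minimal Python sketch of the kind of gate that decides whether an agent's output passes straight through or goes to a human. Everything in it is an assumption for illustration: the field names, the `review_gate` function, and the one-cent tolerance are hypothetical and not drawn from any deployment described here.

```python
from dataclasses import dataclass, field

# Hypothetical required fields for an invoice; a real schema would differ.
REQUIRED_FIELDS = ("invoice_id", "vendor", "total", "line_items")


@dataclass
class ReviewDecision:
    needs_human: bool
    reasons: list = field(default_factory=list)


def review_gate(extracted: dict, tolerance: float = 0.01) -> ReviewDecision:
    """Check an agent's extracted invoice against codified rules.

    Any rule failure sends the document to human review instead of
    letting it propagate downstream.
    """
    reasons = []

    # Rule 1: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if extracted.get(name) in (None, "", []):
            reasons.append(f"missing field: {name}")

    # Rule 2: line items must add up to the stated total.
    # Only checked when all fields are present.
    if not reasons:
        line_sum = sum(item["amount"] for item in extracted["line_items"])
        if abs(line_sum - extracted["total"]) > tolerance:
            reasons.append("line items do not sum to total")

    return ReviewDecision(needs_human=bool(reasons), reasons=reasons)


decision = review_gate({
    "invoice_id": "INV-1",
    "vendor": "Acme",
    "total": 160.0,
    "line_items": [{"amount": 100.0}, {"amount": 50.5}],
})
print(decision.needs_human, decision.reasons)
```

The design choice worth noticing is that the gate never asks the agent how sure it is. It applies rules that exist outside the agent, which is what makes a wrong output recoverable.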

The second category that has earned genuine production status is internal knowledge retrieval and synthesis. Every large organization has the same problem: enormous archives of institutional knowledge trapped in wikis, Slack threads, email chains, old PowerPoint decks, and the heads of people who left three years ago. Traditional search fails here because the questions people actually ask require synthesis, not just keyword matching. “What was our legal team’s position on indemnification clauses in SaaS contracts before the 2025 policy update?” is not a search query. It is a research task. Agents that can read across a corpus, reason about temporal order, identify conflicting documents, and produce a coherent answer have found willing buyers — not because the technology is perfect but because the alternative is asking a junior associate to spend three days doing the same work.

Code review assistance has graduated from “AI suggests comments” to something more substantive. Agents that can run a pull request through static analysis, check it against a company’s architectural decision records, identify where the change intersects with known fragile systems, and produce a prioritized list of concerns have become routine at companies large enough to have an internal developer productivity team. The agents do not replace the reviewer. They reduce the cognitive load on the reviewer by doing the mechanical parts, which frees the human to focus on the judgment parts. This division of labor is more stable and less controversial than “the agent reviews the code.” It is also, practically speaking, more useful.

Notably absent from the production deployments: open-ended research agents. The vision of an agent that can be handed a vague strategic question (“What should our market position be in Southeast Asia over the next five years?”) and produce a useful answer remains stubbornly a demo artifact. The problem is not that the agent cannot generate a plausible-sounding answer. The problem is that it cannot reliably tell the difference between the parts of that answer that are well-supported and the parts that it has confabulated from plausible-sounding priors. In a bounded document task, the agent’s outputs can be checked against ground truth. In open-ended strategic reasoning, there is no ground truth to check against, and the errors are not random noise: they are coherent, confident-sounding falsehoods that propagate through any downstream work that relies on them.

The financial industry learned this the hard way in 2025 when a major investment bank’s research agent produced a market analysis that cited three academic papers that did not exist. The papers had plausible titles, plausible authors, plausible journal names, and plausible abstracts — all generated. The human analyst who relied on the output without verification incorporated the fictional research into a client report. The client report was wrong in ways that eventually cost money. Not catastrophic money, but enough to establish a clear policy: no agent-produced citations without independent verification. That policy, sensible as it is, essentially eliminates the use case that was most exciting on paper.
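A verification policy of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bank's actual process: `trusted_index` is a hypothetical dictionary standing in for whatever independently maintained catalog a firm trusts, and the DOIs and titles below are placeholders.

```python
def split_citations(citations, trusted_index):
    """Partition citations into (verified, unverified).

    A citation counts as verified only if its DOI exists in the trusted
    index AND the title recorded there matches the title the agent gave.
    """
    verified, unverified = [], []
    for cite in citations:
        doi = (cite.get("doi") or "").strip().lower()
        record = trusted_index.get(doi)
        claimed_title = (cite.get("title") or "").strip().lower()
        if record and record["title"].strip().lower() == claimed_title:
            verified.append(cite)
        else:
            unverified.append(cite)
    return verified, unverified


def can_publish(citations, trusted_index):
    """Block publication if any citation could not be verified."""
    _, unverified = split_citations(citations, trusted_index)
    return not unverified


# Placeholder catalog; in practice this would be maintained independently
# of the agent.
trusted_index = {
    "10.0000/example.1": {"title": "A Real Paper About Markets"},
}

agent_citations = [
    {"doi": "10.0000/example.1", "title": "A Real Paper About Markets"},
    {"doi": "10.0000/example.999", "title": "A Plausible Paper That Does Not Exist"},
]

print(can_publish(agent_citations, trusted_index))
```

Matching on both identifier and title is deliberate: a fabricated citation can carry a real-looking identifier, so a lookup on one field alone would be a weaker check.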

Customer service automation tells a more complicated story. Agents handling routine, transactional customer interactions — tracking a shipment, processing a return, resetting an account — have replaced humans at significant scale. The economics are straightforward and the customer satisfaction numbers, for routine queries, are comparable. Where agents fail is at the boundary cases: the customer whose situation is genuinely unusual, the person who is angry in ways that require de-escalation, the interaction where what the customer says they want and what they actually need are different things. Handling the routine queries at scale frees human agents to handle the hard ones, which is the intended model. In practice, the routing between routine and non-routine has proven harder to get right than expected. Agents confident enough to handle the routine queries tend to be overconfident about their ability to handle the hard ones too.

This overconfidence problem — technically called calibration failure — runs through almost every production deployment story. The agents that work are the ones operating in domains where calibration has been externally enforced by the task structure: the output is either correct or incorrect by a standard that does not require the agent itself to judge. The agents that fail or get pulled back to pilot status are the ones operating in domains where the agent must decide how confident to be. Asking an agent to assess its own uncertainty is, it turns out, approximately as reliable as asking a twenty-three-year-old to assess their own driving ability.
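One common way to put a number on calibration is expected calibration error: group the agent's answers by stated confidence, then compare each group's average confidence with how often it was actually right. The Python sketch below assumes you have logged a confidence score between 0 and 1 and a correctness label for each answer. The function name and the ten-bin default are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer.
    correct:     booleans, True where the answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many answers fell in it.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# An agent that claims 90% confidence but is right only half the time.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, False, False]
)
print(round(overconfident, 3))
```

A perfectly calibrated agent scores zero. The overconfident one in the example scores about 0.4, the gap between its claimed 90 percent and its actual 50 percent.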

There is a structural observation here that the industry is beginning to absorb, if slowly. The early framing of “autonomous agents” suggested a new paradigm where AI systems operate with broad latitude, making decisions and taking actions independently across long chains of reasoning. The production reality looks more like a set of highly specialized, carefully constrained tools that are autonomous only within very narrow corridors. The corridors where they work reliably happen to contain genuinely valuable work — routine document processing, internal search, code review assistance, transactional customer service. But calling them “autonomous agents” in the sense the demos implied is a category error.

The more accurate frame is task automation with an LLM reasoning layer. That is less exciting to put on a slide. It is also what actually ships.

The honest summary of where we are: agents are earning their keep in large organizations by doing specific, bounded, high-volume tasks that were previously done by humans at significant cost. They are not managing projects, making strategic decisions, or acting as general-purpose cognitive assistants. The demos suggested the latter. The production deployments are, almost universally, the former.

That gap — between the demo and the deployment — is not a sign that the technology failed. It is a sign that the technology is real, useful, and substantially narrower than its proponents claimed. Which is, in the long arc of technology adoption, more or less how every transformative technology has entered the world. The printing press did not immediately produce the encyclopedia. It produced Bibles and indulgences and pamphlets, and those were enough to change civilization.

The question worth asking is not whether agents will get better — they clearly will. The question is whether the path from “bounded document automation” to “general cognitive assistant” is a smooth gradient or involves a discontinuity that current approaches cannot cross. Everything we know from the production deployments suggests the latter. The interesting work is in figuring out what that discontinuity actually is.

One category that deserves more attention than it typically receives is scheduling and workflow orchestration: not the AI-managed-calendar fantasy, but the narrower problem of coordinating complex multi-step processes across multiple human participants and systems. A pharmaceutical company’s clinical trial management agent that can track protocol deviations, generate compliance reports, remind investigators of upcoming submission deadlines, and flag potential regulatory issues before they become violations is doing something genuinely useful. The task is well-defined and the rules are written down (FDA regulations are nothing if not explicit). Because errors are costly, this tracking would otherwise demand extensive manual review, so the agent saves significant human time without needing to make any novel judgments.

Procurement automation — matching purchase orders against invoices, flagging discrepancies, routing approvals based on amount and category — is another area of quiet, unglamorous success. The work is repetitive enough that human performance degrades over time and important enough that errors create downstream accounting problems. Agents handling this work at mid-sized companies have produced measurable reductions in both processing time and error rates. The procurement teams that deployed these agents did not generate press releases about “autonomous AI transformation.” They updated their process documentation and moved on.
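The matching-and-routing logic described above is simple enough to sketch in Python. What follows is a minimal illustration, not any vendor's implementation: the `PurchaseOrder` and `Invoice` types, the category names, the approval limits, and the routing labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrder:
    po_number: str
    vendor: str
    category: str
    amount: float


@dataclass(frozen=True)
class Invoice:
    po_number: str
    vendor: str
    amount: float


# Hypothetical per-category limits; anything above goes to a manager.
APPROVAL_LIMITS = {"office_supplies": 1_000.0, "software": 10_000.0}
DEFAULT_LIMIT = 500.0


def match_invoice(po, invoice, tolerance=0.01):
    """Return a list of discrepancies; an empty list means a clean match."""
    issues = []
    if po.po_number != invoice.po_number:
        issues.append("po_number mismatch")
    if po.vendor != invoice.vendor:
        issues.append("vendor mismatch")
    if abs(po.amount - invoice.amount) > tolerance:
        issues.append("amount mismatch")
    return issues


def route(po, invoice):
    """Flag discrepancies first, then route clean matches by amount and category."""
    if match_invoice(po, invoice):
        return "flag_for_review"
    limit = APPROVAL_LIMITS.get(po.category, DEFAULT_LIMIT)
    return "auto_approve" if invoice.amount <= limit else "manager_approval"


po = PurchaseOrder("PO-1", "Acme", "software", 2500.0)
print(route(po, Invoice("PO-1", "Acme", 2500.0)))
print(route(po, Invoice("PO-1", "Acme", 2600.0)))
```

Nothing here requires a language model. In a sketch like this the agent's contribution would sit upstream, turning messy invoice documents into the structured `Invoice` records that these rules can check.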

The distribution of deployments across industries reveals something interesting about who actually benefits from agents versus who is featured in the case studies. The companies with the most sophisticated agent deployments are not, for the most part, the technology companies or the AI-forward startups. They are large, process-heavy organizations in industries with high transaction volumes, clear compliance requirements, and relatively stable business rules: insurance, logistics, large professional services firms, financial institutions. These organizations have the scale to make the economics work, the process discipline to define tasks clearly enough for agents to execute reliably, and the regulatory familiarity to navigate the compliance requirements that come with automated decision-making.

The AI-forward startups are doing interesting things with agents, but they tend to be building for themselves and selling to the technology industry. The large-enterprise deployments, where most of the work agents actually process takes place, receive much less coverage because they are not photogenic. A logistics company reducing its invoice-processing headcount by forty percent through agent automation is a significant economic event. It produces no conference presentation, no breathless article, no investor deck. It produces an updated organizational chart and a better quarterly margin.

What should you conclude from this pattern? The technology that gets the attention is not the technology doing the most work. The agents featured at product launches are the sophisticated, general-purpose orchestrators capable of traversing the web and managing multi-step research tasks. The agents doing the most economic work are the narrow, boring, process-automation agents that no one finds interesting enough to write about.

This is not an anomaly — it is the standard pattern of technology maturation. The exciting applications drive investment and attention; the mundane applications generate the economic returns that actually justify the investment. Email was exciting when it was a novelty; it became economically important when corporations used it to replace paper correspondence at scale. The internet was exciting when it was a platform for new kinds of social interaction; it became economically important when retailers used it to sell things more efficiently than physical stores. Agents will follow the same arc. The mundane applications are already here. They are just not what anyone is writing about yet.
