
The number that should change the conversation
In August 2025, MIT's NANDA group published "The State of AI in Business 2025" — a research report based on 300+ enterprise AI pilot reviews. The headline finding became one of the most-shared statistics in the enterprise AI space: 95% of enterprise GenAI pilots produce no measurable P&L impact despite an estimated $30-40 billion invested in such pilots over the prior 18 months.
The report has been criticized in some circles for definitional sleight-of-hand — the bar for "measurable P&L impact" is high, and a pilot that improves employee satisfaction or reduces a 6-hour task to 90 minutes might not register as a P&L line item even though it is genuinely valuable. Fair criticism. The 95% number is probably overstated.
But even research more sympathetic to enterprise AI reaches a similar conclusion. BCG's 2025 "Closing the AI Impact Gap" study finds that only 5% of organizations are capturing AI value at scale, and roughly 60% are generating no material AI value at all. The methodologies differ, but the broad finding holds: most enterprise AI is not paying for itself.
The interesting question is not "is this number right?" The interesting question is "what do the 5% have in common?" Because the answer to that question is replicable.
Cross-referencing MIT NANDA, BCG 2025, McKinsey's State of AI 2025, the Wharton GBK AI Adoption Year-Three report, Microsoft's Work Trend Index 2025, and the Forrester Total Economic Impact study on Microsoft 365 Copilot, five patterns appear consistently in the success cases:
Pattern 1: The workload was narrow and mechanical
Stripe's 10,000-line Scala-to-Java migration in four days. Pfizer's internal R&D document search saving an estimated 16,000 research hours per year. Lyft's customer support resolution time dropping by roughly 87%. Intercom Fin's average ~66% Tier-1 ticket deflection across 6,000+ customers.
What these wins share: the workload was bounded, repetitive, and had a cheap success signal. Did the tests pass? Did the user click "resolved"? Did the document come back? The AI did not have to be brilliant. It had to beat a specific baseline (a manual migration timeline, SharePoint search, the existing support queue).
The wins almost never come from "let's deploy AI broadly across the business and see what happens." They come from "let's deploy AI specifically against this one workflow that is currently expensive."
Pattern 2: There was a top-down mandate plus structured adoption
NVIDIA enabled Cursor for 30,000+ engineers as a daily-driver IDE. Coinbase made it mandatory across its entire engineering organization. Stripe rolled Claude Code out to roughly 1,370 engineers. Box pushed Cursor to 85% daily-active use across engineering and built a structured mentorship program that drove adoption up 75% in six weeks.
What these deployments share: the rollout was top-down with explicit organizational commitment. Not "we have a 12-seat pilot in the engineering productivity team." Not "anyone who wants a license can request one through a form." A mandate. With change management treated as a first-class problem, not an afterthought.
The opt-in pilot model — which is what most mid-market enterprises default to because it feels safe — produces the "we have 14 active users out of 200 licenses" outcome. The structured-rollout model produces the productivity wins.
Pattern 3: The data underneath was already clean
Salesforce Agentforce works because Salesforce records exist. Customers with structured CRM data and well-defined sales motions get value out of the agentic workflows. Customers with messy or duplicated records get the AI equivalent of "garbage in, garbage out."
Pfizer's R&D corpus was already structured before Claude was deployed against it. The success came from putting a better retrieval layer on top of well-organized content, not from AI cleaning up the underlying data on the fly.
The MIT NANDA finding maps almost perfectly to this pattern in reverse: the 95% of pilots that return nothing tend to be deployments against unstructured, inconsistent, or fundamentally messy underlying data. The AI does not fix that. The AI exposes it.
The implication for mid-market deployment: spend the first phase of any AI rollout cleaning the underlying data the AI will work with. Document tagging, sensitivity labels, organization schemes. Not glamorous. Necessary.
Pattern 4: The framing was augmentation, not replacement
The Klarna case is the canonical lesson here. In February 2024, Klarna's CEO publicly claimed an OpenAI-powered customer service agent was doing "the work of 700 human agents." The story became one of the most-cited examples of AI displacing knowledge work. By mid-2025, the same CEO publicly admitted the strategy went too far. Klarna resumed hiring human agents. CSAT had dropped on complex tickets. The 700-jobs replacement number was never independently verified — it appears to have been mostly attrition reframed as AI displacement.
Commonwealth Bank of Australia announced 45 AI-driven contact-center role eliminations in mid-2025 and reversed the decision weeks later when call volumes went up instead of down. The "AI replaces humans" framing kept producing the same outcome: the deployment looked great in the press release and underperformed in production.
The deployments that stuck — Lyft's support model, Intercom Fin's customer base, Salesforce Agentforce's 18,000+ customers — all framed AI as routing routine work to automation while escalating nuance to humans. Augmentation. Not replacement.
The corollary for organizational design: do not restructure headcount on the basis of AI productivity claims for at least the first 12 months of deployment. The Klarna and Commonwealth Bank reversals are what happens when leadership commits to headcount cuts based on three months of pilot data.
Pattern 5: They bought from a specialist vendor instead of building
The single most useful statistic in the MIT NANDA research, for an MSP audience, is this: buying from specialist vendors succeeds approximately 67% of the time. Internal builds succeed about a third as often.
The pattern is consistent. The enterprises that try to build their own AI stack end up rebuilding what a vendor already ships, six months later, at higher cost. The "we'll build it ourselves to maintain control" instinct is well-intentioned and almost always wrong at the unit economics level.
The exception is when the workflow is genuinely proprietary and high-volume enough to justify custom investment — which is rare in mid-market. Bloomberg built BloombergGPT because Bloomberg has a unique financial-news corpus that no vendor model is trained on. Most mid-market companies do not have that justification. They have a slightly customized version of a workflow that ChatGPT Enterprise or Microsoft Copilot or Claude already handles.
The buy-vs-build calculation in 2026 is much more lopsided toward "buy" than it was in 2022.

The diagnostic checklist
Take any AI deployment you are considering and run it against the five patterns:
Workload check. Is the workload bounded and repetitive, with a cheap success signal? If yes → continue. If no → narrow the scope until yes.
Mandate check. Is there top-down commitment with a real rollout plan, or is this an opt-in pilot? If opt-in → expect the "14 active users" outcome unless the pilot is specifically designed to convert into a mandate.
Data check. Is the underlying data the AI will work with already in good shape? If no → spend the first 60-90 days on data cleanup before the AI deployment.
Framing check. Is the success metric augmentation (output quality, time savings, throughput) or replacement (headcount reduction)? If replacement → revisit. The wins are augmentation-framed.
Buy-vs-build check. Is there a specialist vendor that already does this? If yes → buy from them unless there's a defensible reason not to. The internal-build success rate is roughly a third of the buy success rate.
If your deployment passes all five checks, you are in the 5% pattern. If it fails on two or more, the math says you will be in the 95%.
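For readers who want to make the checklist operational, for example by scoring a backlog of candidate workloads in an internal tool, here is a minimal sketch of what that scoring could look like. The field names, the wording of the verdicts, and the structure are illustrative assumptions, not part of the MIT NANDA or BCG research; only the "two or more failures" rule is taken from the checklist above.

```python
from dataclasses import dataclass

@dataclass
class DeploymentCandidate:
    workload_bounded: bool             # bounded, repetitive, cheap success signal
    top_down_mandate: bool             # real rollout plan, not an opt-in pilot
    data_already_clean: bool           # structured, deduplicated, labeled source data
    augmentation_framed: bool          # metric is quality/time saved, not headcount cut
    specialist_vendor_available: bool  # buying rather than building internally

def diagnose(c: DeploymentCandidate) -> str:
    """Apply the five-pattern checklist and report which checks fail."""
    checks = {
        "workload": c.workload_bounded,
        "mandate": c.top_down_mandate,
        "data": c.data_already_clean,
        "framing": c.augmentation_framed,
        "buy-vs-build": c.specialist_vendor_available,
    }
    failures = [name for name, passed in checks.items() if not passed]
    if not failures:
        return "5% pattern: all five checks pass"
    if len(failures) >= 2:
        return "95% pattern: fix " + ", ".join(failures) + " before spending"
    return "Borderline: resolve the " + failures[0] + " check first"

# Example: an opt-in pilot running against messy data fails two checks.
print(diagnose(DeploymentCandidate(
    workload_bounded=True,
    top_down_mandate=False,
    data_already_clean=False,
    augmentation_framed=True,
    specialist_vendor_available=True,
)))
```

Nothing about the sketch is sophisticated. The point is that the gating decision can be made explicit and applied to every proposed workload before any budget is committed.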
What this means for mid-market
Three implications for the mid-market business owner reading this in 2026:
1. The competition is not winning at AI by being smart. They are winning by being disciplined. The companies producing real AI ROI are following a pattern that is fully knowable. The pattern requires operational discipline more than technical brilliance. This is good news — it means catching up is a project plan, not a moonshot.
2. The wrong question is "should we deploy AI?" The right question is "which workload first, and how do we deploy it well?" The 95% failure rate is concentrated in deployments that skipped the workload-narrowing step. Pick the right workload, and buy it from a specialist vendor rather than building, and you have moved the success probability from the ~5% baseline toward the ~67% MIT NANDA observed for vendor-delivered deployments, before any other variable comes into play.
3. Buy the playbook, do not invent one. The patterns above are now well-understood. The MSPs (and consultancies, and analyst firms) that operate AI deployments at scale have repeatable playbooks. The 3x success-rate advantage of buying over building applies to advisory as well as to product. Hire someone who has done this.

The work, and the offer
The free 90-minute IT health check we run for prospective clients includes an AI deployment readiness assessment. We score your situation against the five-pattern checklist, identify the 1-2 workloads where AI is most likely to produce ROI in your specific environment, and give you a 12-month deployment roadmap. Yours to keep either way.
The full case-study gallery — including all the wins and failures referenced above — lives at /ai/case-studies. The 6-point governance framework that fits around any AI deployment is at /ai/governance. The fear-vs-power tension piece is at /ai/the-balance.
The 95% stat is real. The 5% pattern is replicable. The work is operational, not magical.



