This is not a technology failure. It is a measurement failure — and it is nearly universal.
Deloitte, 2025
MIT GenAI Divide Study
The industry has accepted a convenient fiction: that AI ROI can be understood at the portfolio level — a blend of productivity estimates, adoption metrics, and cost-per-seat calculations that produces an impressive-looking slide for the board deck. It cannot. Portfolio-level ROI measurement is how organizations convince themselves they are winning when they have no idea which bets are paying off and which are quietly burning cash.
The discipline that is missing is feature-level unit economics. And the executive who should own it is the CTO.
What Is Actually Being Measured — And Why It Falls Short
The current state of AI measurement exists on a spectrum from vague to misleading.
At the broadest level, organizations track aggregate AI spend against qualitative outcomes: "our developers are more productive," "customer satisfaction improved," "time-to-market accelerated." These statements may be true. They are not measurement. They are sentiment.
One level more specific, FinOps teams track cloud and token spend — a genuine improvement. But cloud cost dashboards show that AI spending increased; they cannot tell you which product feature caused the increase, whether that feature was worth it, or whether the team that built it understood the economics before they shipped.
The most sophisticated organizations have moved to productivity proxies: developers save X hours per week, support agents resolve tickets Y% faster. These metrics are useful. They are also systematically incomplete. They count the value of AI-assisted output without counting the downstream cost of checking, correcting, or reworking that output. When a developer uses an AI coding tool to generate a component in two hours instead of eight, the productivity gain is only real if the component doesn't require three days of architectural remediation three months later.
The question being asked is "What does AI cost us?" The question that should be asked is "What did this specific AI-powered feature cost to build and operate, and what did it actually return?"
The Anatomy of What a Feature Actually Costs
To measure feature-level ROI, organizations first need an honest accounting of what a feature costs. This is harder than it sounds and significantly more expensive than most project estimates suggest.
Build Cost is the starting point, and most teams get it reasonably right: engineering hours, model selection, prompt engineering, evaluation cycles, and integration testing. What gets underestimated is the iteration cost: the number of cycles required to bring AI output to production-grade reliability. Unlike traditional software, AI features rarely work well on the first build. The prompt engineering, retrieval tuning, and evaluation infrastructure required to move from "impressive demo" to "production-ready" are substantial.
Inference Cost is where the math gets dangerous. Token consumption at scale is non-linear and deeply sensitive to feature design. A system prompt that runs 2,000 tokens fires on every single user interaction. A retrieval-augmented generation pipeline that pulls three documents per query at 1,500 tokens each adds 4,500 tokens to every call. Multiply that by daily active users, and what looked like a manageable cost in the pilot environment becomes a significantly different number in production.
A pilot customer-service chatbot costing $5,000 per month can become $50,000 or more once deployed across the enterprise — with no change in vendor pricing. The only variable is usage. This is what analysts call the consumption trap, and it catches most organizations off guard.
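The arithmetic behind the consumption trap can be sketched in a few lines. The example below reuses the 2,000-token system prompt and the three 1,500-token retrieved documents from the text; the user and output token counts, the per-1,000-token prices, and the call volumes are illustrative assumptions, not vendor figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Token footprint of a single call to an AI feature."""
    system_prompt_tokens: int
    retrieved_docs: int
    tokens_per_doc: int
    user_tokens: int      # assumed average user message size
    output_tokens: int    # assumed average response size

    @property
    def input_tokens(self) -> int:
        # Every call pays for the system prompt and all retrieved context.
        return (self.system_prompt_tokens
                + self.retrieved_docs * self.tokens_per_doc
                + self.user_tokens)


def monthly_cost(profile: CallProfile, calls_per_month: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated monthly token spend in dollars for one feature."""
    per_call = (profile.input_tokens / 1000 * input_price_per_1k
                + profile.output_tokens / 1000 * output_price_per_1k)
    return per_call * calls_per_month


# Figures from the text: 2,000-token system prompt, 3 documents x 1,500 tokens.
# Everything else below is a hypothetical placeholder.
profile = CallProfile(system_prompt_tokens=2000, retrieved_docs=3,
                      tokens_per_doc=1500, user_tokens=200, output_tokens=300)

pilot = monthly_cost(profile, calls_per_month=10_000,
                     input_price_per_1k=0.003, output_price_per_1k=0.015)
enterprise = monthly_cost(profile, calls_per_month=100_000,
                          input_price_per_1k=0.003, output_price_per_1k=0.015)
print(f"pilot: ${pilot:,.2f}  enterprise: ${enterprise:,.2f}")
```

With prices held constant, a tenfold increase in call volume produces a tenfold increase in spend, which is the same shape as the $5,000 to $50,000 jump described above.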
Hidden Operational Costs extend beyond token spend. They include retry logic, tool call overhead in agentic workflows, context window bloat as conversation history accumulates, and the human review loops organizations install to catch AI errors before they reach customers. That last cost is particularly underreported. If an AI feature requires a human to review 20% of outputs to maintain quality standards, the labor cost of that review belongs in the feature's cost model.
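As a rough sketch of how the review loop enters the cost model, the function below converts a review rate into monthly labor cost. The 20% review rate comes from the example above; the call volume, minutes per review, and fully-loaded hourly rate are hypothetical inputs.

```python
def monthly_review_cost(calls_per_month: int, review_rate: float,
                        minutes_per_review: float, hourly_rate: float) -> float:
    """Labor cost of human review for one AI feature, in dollars per month."""
    if not 0.0 <= review_rate <= 1.0:
        raise ValueError("review_rate must be between 0 and 1")
    reviewed_outputs = calls_per_month * review_rate
    review_hours = reviewed_outputs * minutes_per_review / 60
    return review_hours * hourly_rate


# 20% of outputs reviewed (from the text); other inputs are assumptions.
cost = monthly_review_cost(calls_per_month=100_000, review_rate=0.20,
                           minutes_per_review=2.0, hourly_rate=40.0)
print(f"human review: ${cost:,.2f} per month")
```

Under these assumed inputs the review labor can easily exceed the token bill for the same feature, which is why it belongs in the run cost rather than in a general operations budget.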
The AI Tax on Existing Features deserves its own category. When AI capabilities are embedded into existing workflows — an AI triage layer added to a ticketing system, a generative summary baked into a SIEM alert — the cost is layered on top of existing infrastructure with no separate cost identity. The result is that AI spending becomes structurally invisible, pooled into shared infrastructure that serves multiple teams and products. Without deliberate attribution, it shows up as undifferentiated compute on the cloud invoice.
Maintenance and Drift close out the true cost picture. Models are updated by providers. Prompts that performed well in one model version degrade in the next. Retrieval pipelines require ongoing tuning as data volumes and schemas evolve. The organizations that account for these costs honestly in their feature economics models are a small minority.
The ROI Side Is Equally Murky
If the cost side of the feature P&L is underestimated, the return side is overestimated — and often significantly.
The most common measurement failure is what might be called the phantom productivity problem. Organizations measure AI adoption and draft speed as proxies for productivity gain, without accounting for the downstream cost of validating, correcting, or reversing AI-assisted decisions. The metric shows that developers are writing code faster. It does not show whether that code required more architectural review, introduced more regressions, or created technical debt that consumed engineering capacity months later.
This is not a theoretical concern. An experiment placing AI coding tools directly in the hands of non-developers resulted in exactly this outcome: impressive initial velocity metrics, followed by architectural inconsistency, cost overruns from poorly structured AI calls, and a remediation effort that consumed more engineering time than the original work would have required.
There is also a structural timeline problem. Traditional enterprise technology follows a predictable 7–12 month payback cycle. Most organizations achieve satisfactory AI returns within 2–4 years, roughly three to four times longer than conventional technology deployments, yet AI programs are still budgeted on conventional timelines. This mismatch means AI investments are evaluated against benchmarks they were never designed to meet, and are either declared failures prematurely or declared successes before enough time has passed to know.
Why the Accountability Gap Persists
Understanding why this gap exists requires examining the incentive structure around AI deployment — and it does not reflect well on how technology organizations are currently structured.
Product owns the roadmap and is incentivized to ship AI features. Engineering builds and ships them, measured on velocity and delivery. FinOps owns the cloud bill but lacks the product context to attribute costs to features meaningfully. Finance measures overall AI spend as a budget category. Nobody sits at the intersection with ownership of a feature's complete economic profile from build through production.
The CTO occupies the natural position to own this — with visibility across engineering cost, infrastructure spend, and product delivery. However, CTOs are rarely incentivized to slow feature deployment in service of measurement rigor. In PE-backed environments particularly, the pressure to demonstrate AI progress to the board creates a strong bias toward shipping over accounting. The cost of not measuring is diffuse and deferred. The cost of not shipping is immediate and visible.
Vendor pricing complexity compounds the problem. The shift from per-seat licensing to consumption-based and hybrid models has made AI cost forecasting genuinely difficult. A budget built on per-seat assumptions applied to a consumption model is a budget built on sand.
What Leading Organizations Are Starting to Do
A small percentage of organizations have moved beyond the portfolio fiction and begun treating AI like a business unit, with a chart of accounts that tracks attributable revenue, allocated costs, and margin at the workload level.
The architectural shift that enables this is request-level attribution: tagging every LLM call at the point of execution with metadata identifying the feature, the team, the user segment, and the deployment stage. This is not a complex engineering problem. It is a discipline problem. Organizations that build attribution into their AI infrastructure from day one have the data to answer the boardroom question.
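A minimal sketch of request-level attribution might look like the wrapper below. The `Attribution` fields mirror the metadata named above (feature, team, user segment, deployment stage). The `AttributedClient` class, the in-memory ledger, and the `fake_model` stand-in are all hypothetical; a real system would wrap its actual LLM client and write to a durable store.

```python
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Attribution:
    """Metadata attached to every LLM call at the point of execution."""
    feature: str
    team: str
    user_segment: str
    stage: str  # e.g. "pilot" or "production"


class AttributedClient:
    """Wraps a model-calling function so no call can be made without tags."""

    def __init__(self, call_model: Callable[[str], Tuple[str, int, int]]):
        # call_model returns (text, input_tokens, output_tokens); assumed shape.
        self._call_model = call_model
        self.ledger: List[Dict] = []

    def complete(self, prompt: str, tags: Attribution) -> str:
        text, input_tokens, output_tokens = self._call_model(prompt)
        # Record usage alongside the attribution tags for later cost rollups.
        self.ledger.append({**asdict(tags),
                            "input_tokens": input_tokens,
                            "output_tokens": output_tokens})
        return text


def fake_model(prompt: str) -> Tuple[str, int, int]:
    """Stand-in for a real model call; counts words as a proxy for tokens."""
    return "ok", len(prompt.split()), 1


client = AttributedClient(fake_model)
tags = Attribution(feature="ticket_triage", team="support-platform",
                   user_segment="enterprise", stage="production")
client.complete("summarize this ticket", tags)
print(client.ledger[0])
```

Making the tags a required argument is the design choice that matters here: attribution becomes a property of the call path rather than something teams remember to add afterward.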
Model routing is the second lever. Not every AI task requires frontier-model capability. Large frontier models cost 17–25 times more per token than capable smaller models. Organizations that route by task complexity rather than defaulting to the most powerful model available are finding meaningful cost reductions without sacrificing output quality.
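One way to picture routing by task complexity is the sketch below. The model names, the set of "simple" task types, and the 70% routing share are assumptions for illustration; the 20x price multiple is simply a value inside the 17–25x range cited above.

```python
SMALL_MODEL = "small-model"        # hypothetical model name
FRONTIER_MODEL = "frontier-model"  # hypothetical model name

# Task types assumed to be within reach of a capable smaller model.
SIMPLE_TASKS = {"classification", "extraction", "short_summary"}


def route(task_type: str) -> str:
    """Pick a model by task complexity instead of defaulting to the largest."""
    return SMALL_MODEL if task_type in SIMPLE_TASKS else FRONTIER_MODEL


def blended_relative_cost(share_to_small: float, frontier_multiple: float) -> float:
    """Average per-token cost in units of the small model's price."""
    if not 0.0 <= share_to_small <= 1.0:
        raise ValueError("share_to_small must be between 0 and 1")
    return share_to_small * 1.0 + (1.0 - share_to_small) * frontier_multiple


print(route("classification"))       # routed to the small model
print(route("multi_step_planning"))  # routed to the frontier model
print(blended_relative_cost(share_to_small=0.7, frontier_multiple=20.0))
```

Under these assumptions, sending 70% of traffic to the smaller model cuts the blended per-token cost by roughly two thirds compared with sending everything to the frontier model. Whether output quality holds up still has to be verified per task type.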
Pre-deployment cost modeling — estimating token volume and call patterns before a feature ships rather than discovering costs on the next cloud invoice — is the third distinguishing practice. This requires product, engineering, and finance to collaborate at the design stage in a way that most organizations have not built into their development process.
A Framework: The AI Feature P&L
The organizations that will win the next phase of enterprise AI are not the ones deploying the most features. They are the ones that know which features are worth deploying. The mechanism is a feature-level P&L — applied before a feature ships and reviewed quarterly once it is in production.
| Dimension | What to Measure | Where It Lives | Who Owns It |
|---|---|---|---|
| Build Cost | Engineering hours + model eval cycles + prompt engineering + integration testing | Engineering | CTO / VP Engineering |
| Run Cost | Tokens × call volume + ops overhead + human review labor + maintenance | FinOps / Cloud | CTO + Finance |
| Hard Return | Time saved × fully-loaded cost, error reduction, revenue attribution, deflection rate | Finance + Product | CFO |
| Soft Return | CSAT delta, adoption rate, decision quality, time-to-market acceleration | Product | CPO |
The governing principle is straightforward: if this table cannot be completed before a feature ships, the feature is not ready to ship. The build cost should be known. The run cost should be modeled from expected call volume. The return should be defined in terms that finance can validate, not as productivity proxies that engineering self-reports.
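The ship gate can be expressed as a small data structure. The sketch below is one possible encoding of the table's four rows; the field names, the `payback_months` helper, and the dollar figures are hypothetical and only illustrate the "cannot complete the table, cannot ship" rule.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FeaturePnL:
    """One row set of the AI Feature P&L for a single feature."""
    name: str
    build_cost: Optional[float] = None           # dollars, one-time
    monthly_run_cost: Optional[float] = None     # dollars per month
    monthly_hard_return: Optional[float] = None  # dollars per month, finance-validated
    soft_return_metric: Optional[str] = None     # e.g. "CSAT delta"

    def missing(self) -> List[str]:
        """Names of P&L entries that have not been filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def ready_to_ship(self) -> bool:
        # The gate: every entry must be completed before the feature ships.
        return not self.missing()

    def payback_months(self) -> Optional[float]:
        """Months to recover build cost; None if incomplete or never pays back."""
        if not self.ready_to_ship():
            return None
        net_monthly = self.monthly_hard_return - self.monthly_run_cost
        if net_monthly <= 0:
            return None
        return self.build_cost / net_monthly


draft = FeaturePnL(name="alert_summary", build_cost=120_000.0)
print(draft.ready_to_ship(), draft.missing())

complete = FeaturePnL(name="ticket_triage", build_cost=120_000.0,
                      monthly_run_cost=8_000.0, monthly_hard_return=18_000.0,
                      soft_return_metric="CSAT delta")
print(complete.ready_to_ship(), complete.payback_months())
```

The same object can be re-evaluated at each quarterly review with actual rather than modeled run cost and return.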
For features already in production without this discipline applied, the retrofit process starts with instrumentation: adding attribution tags to existing LLM calls, pulling cost data for the trailing 90 days, and mapping that cost against the business outcome metrics the feature was built to move. Most organizations will find surprises in both directions — features that looked expensive but are returning strong value, and features that looked cheap but are consuming resources against negligible return.
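Once calls carry attribution tags, the trailing 90-day rollup is a simple aggregation. The sketch below assumes each cost record is a dictionary with `day`, `feature`, and `cost_usd` keys; that record shape and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable


def trailing_cost_by_feature(records: Iterable[dict], as_of: date,
                             window_days: int = 90) -> Dict[str, float]:
    """Sum attributed cost per feature over the trailing window ending at as_of."""
    cutoff = as_of - timedelta(days=window_days)
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        # Keep only records inside the window (after cutoff, up to as_of).
        if cutoff < record["day"] <= as_of:
            totals[record["feature"]] += record["cost_usd"]
    return dict(totals)


# Hypothetical attributed cost records.
records = [
    {"day": date(2025, 6, 1), "feature": "ticket_triage", "cost_usd": 100.0},
    {"day": date(2025, 5, 1), "feature": "ticket_triage", "cost_usd": 50.0},
    {"day": date(2025, 3, 1), "feature": "ticket_triage", "cost_usd": 999.0},  # outside window
    {"day": date(2025, 6, 15), "feature": "alert_summary", "cost_usd": 25.0},
]

print(trailing_cost_by_feature(records, as_of=date(2025, 6, 30)))
```

The resulting per-feature totals are what gets mapped against the outcome metrics each feature was built to move.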
The Call to Action for Technology Executives
This is a leadership discipline problem before it is a tooling problem. The tools to instrument AI costs, attribute them to features, and model unit economics exist and are maturing rapidly. What is missing is the organizational will to use them — and a CTO willing to make feature-level economic accountability a first-class concern.
The companies that establish this discipline now will have a compounding advantage: the data to make better deployment decisions, the credibility to answer the board's ROI question with specificity, and the economic model to deploy AI at scale without the consumption surprises that are derailing many enterprise programs today.
AI ROI is not a portfolio metric. It never was. The executives who internalize that earliest will define the next phase of enterprise AI — not by deploying the most, but by understanding what they have deployed.