Behind the 90% Failure Rate of AI Projects: Prompt Debt, Retrieval Debt, and Evaluation Debt Are Dragging Down Enterprise Deployments

In 2025, 42% of enterprises scrapped multiple AI initiatives, roughly two and a half times the previous year's 17%. The problem isn't that models aren't powerful enough. It's that a new kind of technical debt is silently accumulating within enterprise AI infrastructure: prompt debt, model dependency debt, retrieval debt, and evaluation debt.

The data shows that AI failure isn't an occasional phenomenon but a systemic problem. An MIT study from the same year found that 95% of AI pilots never actually reach production or generate quantifiable business value. According to S&P Global Market Intelligence, these failures are typically attributed to insufficient model capabilities, poor data quality, or hard-to-justify ROI. But Vikram, head of Cota Capital, argues the real cause is more hidden: a new form of technical debt is quietly accumulating in the prompt layer, the model dependency layer, and the evaluation layer of AI systems. It is entirely different from traditional code debt, yet equally lethal.

Traditional technical debt resides in codebases, where bugs can be reproduced, tested, and fixed. AI debt has fundamentally different characteristics. It is distributed, scattered across prompts, model APIs, data pipelines, and various layers of infrastructure. It is intermittent, because AI systems are inherently probabilistic: the same input doesn't guarantee the same output. And it is nearly invisible, because the system "appears" to be functioning normally right up until the moment it collapses entirely.

Prompt Debt is the most visible of the four.
It includes undocumented ad-hoc adjustments, prompt changes made without version control, and "prompt stuffing": cramming large amounts of irrelevant background information into a prompt in an attempt to make the model understand more. The result is that prompts become a kind of informal code with no types, tests, or version management. Every tweak is made to an opaque system, and as tweaks pile up, the system's fragility compounds.

Model Dependency Debt stems from enterprises' heavy reliance on external foundation-model APIs. Application logic is built on calls to external models, but updates to those models are beyond the enterprise's control. When a vendor silently upgrades a model version, prompts carefully tuned for the older version may simply stop working, or output behavior may drift unpredictably. Reproducibility disappears.

Retrieval Debt emerges within the RAG (retrieval-augmented generation) architecture that most enterprise AI deployments adopt. The issue is that the underlying knowledge bases are often packed with messy data, duplicate files, and outdated information. The answers the AI returns may once have been correct; they just no longer apply. This is harder to detect than hallucination, because the answers look entirely reasonable and can even get past a typical tester.

Evaluation Debt is the most easily underestimated of the four new types of AI debt. Most existing AI benchmarks focus on narrow, point-in-time results that fail to reflect real-world performance after deployment. The vast majority of enterprises lack consistent testing standards, benchmark datasets, and real-time monitoring for deployed models. Traditional software development has long relied on mature CI/CD (continuous integration/continuous delivery) workflows, but AI deployment still has no equivalent "prompt continuous integration" mechanism.
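What a minimal "prompt continuous integration" check could look like is sketched below in Python. It is an illustration under stated assumptions, not a standard tool or any vendor's API: `PromptVersion`, `GoldenCase`, `pass_rate`, `regression_gate`, the model identifier string, and `fake_model` (a deterministic stand-in for a real model call) are all hypothetical. The idea is to treat each prompt as a versioned artifact pinned to one model snapshot, run it against a small set of golden test cases, and block a change whose pass rate falls below the previous version's.

```python
from dataclasses import dataclass
from typing import Callable

# A model call takes (model_id, prompt_text) and returns the model's text output.
ModelCall = Callable[[str, str], str]


@dataclass(frozen=True)
class PromptVersion:
    """A prompt treated as code: named, versioned, and pinned to one model snapshot."""
    name: str
    version: str
    model: str      # hypothetical pinned model identifier; changing it is a reviewed change
    template: str   # str.format template filled from each test case's inputs


@dataclass
class GoldenCase:
    """One regression case: template inputs plus a check the output must pass."""
    inputs: dict
    check: Callable[[str], bool]


def pass_rate(prompt: PromptVersion, cases: list[GoldenCase], call_model: ModelCall) -> float:
    """Fraction of golden cases whose model output passes its check."""
    if not cases:
        raise ValueError("at least one golden case is required")
    passed = sum(
        case.check(call_model(prompt.model, prompt.template.format(**case.inputs)))
        for case in cases
    )
    return passed / len(cases)


def regression_gate(old: PromptVersion, new: PromptVersion, cases: list[GoldenCase],
                    call_model: ModelCall, tolerance: float = 0.0) -> bool:
    """Allow the change only if the new version does not score worse than the old one."""
    return pass_rate(new, cases, call_model) + tolerance >= pass_rate(old, cases, call_model)


def fake_model(model: str, prompt: str) -> str:
    """Deterministic stand-in for a vendor API call (hypothetical, for illustration only)."""
    label = "positive" if "great" in prompt.lower() else "negative"
    if "reply with one word" in prompt.lower():
        return label
    return f"I think the sentiment is probably {label}."


cases = [
    GoldenCase({"review": "The product is great"}, lambda out: out.strip() == "positive"),
    GoldenCase({"review": "It broke after a day"}, lambda out: out.strip() == "negative"),
]
v1 = PromptVersion("sentiment", "v1", "vendor-model-2025-01-01",
                   "Classify the sentiment of this review: {review}\n"
                   "Reply with one word: positive or negative.")
# v2 "tidies up" the prompt by dropping the output-format instruction.
v2 = PromptVersion("sentiment", "v2", "vendor-model-2025-01-01",
                   "Classify the sentiment of this review: {review}")

print(pass_rate(v1, cases, fake_model))            # 1.0
print(pass_rate(v2, cases, fake_model))            # 0.0
print(regression_gate(v1, v2, cases, fake_model))  # False: the change is blocked
```

Pinning the model identifier inside `PromptVersion` also speaks to model dependency debt. When a vendor ships a new model version, the same gate can be rerun with only the `model` field changed, so the upgrade becomes a reviewed change instead of a silent one.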
In plain terms: when an engineer merges code, automated tests show what broke; when a prompt is modified, no system raises a real-time alert. As a result, CIOs and CTOs have no visibility into how the model is actually performing and no way to tell whether that performance is deteriorating.

These four new types of debt stack on top of existing code-level technical debt, so the interest compounds faster. To make matters worse, ownership of AI systems is itself fragmented: engineering, product, data, and business teams each own different parts of the system, and when something goes wrong, accountability is often unclear.

Stronger models won't solve this problem. Vikram's argument is direct: the persistently high failure rate has nothing to do with model accuracy; the root causes are flaws in system design, integration controls, and organizational culture. Specifically, prompts must be treated as code: brought under version control, documented, and rigorously tested across every configuration they will run in, both before and after deployment. Evaluation needs to be embedded throughout the AI infrastructure stack, with continuous evaluation pipelines that cover both technical and business metrics, integrated with AI observability systems that monitor output quality, failure rates, model drift, and data drift. In addition, every AI output should be explainable by default: the data sources, models used, and steps executed must be traceable, so that outputs can be audited and systemic errors remediated quickly. Paying this down requires explicit AI debt-reduction programs with dedicated budgets, the way enterprises previously invested in security hardening or cloud modernization, driven personally by C-suite leadership.

After all of this, the pattern should be clear: the 95% failure rate may not be because AI isn't smart enough. It is because AI systems are still built as if they were black-box API calls, rather than complex systems that demand serious engineering discipline. Technical debt never disappears on its own; it only comes due later, at a higher interest rate.
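The continuous evaluation and default explainability practices described above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not Cota Capital's method or any observability vendor's API. `TracedOutput`, `QualityMonitor`, the model and prompt identifiers, the window size, the 0.05 threshold, and the sample answer and scores are all hypothetical. The sketch also assumes each answer already receives a quality score between 0 and 1 from some existing evaluation, such as human review, a rubric, or an automated grader. Each output is logged together with the sources, model, prompt version, and steps behind it. A rolling window of scores is then compared against a baseline frozen at deployment time.

```python
from collections import deque
from dataclasses import asdict, dataclass, field
from statistics import mean


@dataclass
class TracedOutput:
    """An AI answer that carries its own audit trail: sources, model, prompt, steps."""
    answer: str
    model: str            # hypothetical identifier of the model that produced the answer
    prompt_version: str   # which versioned prompt was used
    sources: list[str]    # documents returned by the retrieval step
    steps: list[str] = field(default_factory=list)  # pipeline steps, in order


class QualityMonitor:
    """Flags drift when the rolling mean of evaluation scores drops below a fixed baseline."""

    def __init__(self, baseline_scores: list[float], window: int = 50,
                 max_drop: float = 0.05) -> None:
        if not baseline_scores:
            raise ValueError("baseline_scores must not be empty")
        if window < 1:
            raise ValueError("window must be at least 1")
        self.baseline = mean(baseline_scores)  # frozen at deployment time
        self.recent: deque = deque(maxlen=window)
        self.max_drop = max_drop

    def record(self, score: float) -> bool:
        """Add one evaluation score in [0, 1]; return True if an alert should fire."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # window not yet full: too little data to compare
        return self.baseline - mean(self.recent) > self.max_drop


# Every answer is logged with its provenance so a bad output can be traced back.
out = TracedOutput(
    answer="Refunds are accepted within 30 days.",  # made-up sample answer
    model="vendor-model-2025-01-01",
    prompt_version="support-answer/v3",
    sources=["kb/returns-policy.md"],
    steps=["retrieve", "generate", "grade"],
)
print(asdict(out)["sources"])  # ['kb/returns-policy.md']

monitor = QualityMonitor(baseline_scores=[0.9, 0.9, 0.9], window=3, max_drop=0.05)
alerts = [monitor.record(s) for s in [0.9, 0.88, 0.9, 0.6, 0.6]]
print(alerts)  # [False, False, False, True, True]
```

Comparing a rolling mean with a frozen baseline is about the simplest drift check possible. It catches a gradual decline in output quality. It does not detect a shift in the input distribution, which would need a separate data-drift measure computed on the inputs themselves.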