What makes AI projects succeed -- and why most fail
Most AI project failures are not technology failures. They are scope failures, data failures, and expectation failures. Here is what the organisations that succeed do differently.
Across the projects we have worked on, the pattern of AI success and failure has been remarkably consistent. The technology -- whether you are building a classifier, a forecasting model, or a retrieval-augmented generation pipeline -- is rarely the reason a project fails. The reasons are almost always upstream: bad problem definition, insufficient data quality, or misaligned expectations about what "working" means.
This post is about what the organisations that succeed do differently, and what the ones that struggle have in common.
The failures have a pattern
Most AI projects that fail share one or more of these characteristics.
The problem was not defined precisely enough. "Use AI to improve customer service" is not a problem definition. "Reduce the average handle time for tier-1 support tickets by 20% without reducing customer satisfaction scores" is. The precision matters because it determines what data you need, what a success metric looks like, and what "good enough" means for a model to go into production.
The data was not ready. We have never worked with an organisation whose data was as clean as they believed it was. Training a model on inconsistent, biased, or poorly labelled data produces a model that reflects those problems -- sometimes subtly, in ways that only become visible when the model is in production and causing real decisions to go wrong.
The deployment plan was an afterthought. A model that produces accurate results in a Jupyter notebook and a model that creates value in production are different things. The gap between them -- serving infrastructure, monitoring for drift, integration with existing workflows, user adoption -- is underestimated in almost every AI project we have seen fail.
What the successful ones do differently
They start with the decision, not the data. The most successful AI projects we have worked on started with a very specific question: what decision does this model need to support, and what would a 10% improvement in that decision be worth? Starting from the decision lets you work backwards to the data you need and the performance threshold at which the model is worth deploying.
They treat data quality as a first-class project. A meaningful portion of every successful AI project we have delivered has been data work -- cleaning, labelling, deduplicating, and sometimes simply deciding that a data source that looked useful is not actually usable. Organisations that invest in this work upfront spend less time in the confused middle stage where the model is not working and nobody knows exactly why.
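To make that concrete, here is a minimal sketch of the kind of upfront check we mean, in Python with pandas. The column names (ticket_id, text, label) and the input file are placeholders for whatever your dataset actually contains; the point is that these questions get answered before modelling starts, not halfway through it.

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Summarise the issues that most often derail modelling later."""
    return {
        "rows": len(df),
        # Duplicate records quietly inflate apparent accuracy
        "duplicate_rows": int(df.duplicated(subset=["ticket_id"]).sum()),
        # Unlabelled or empty rows are unusable for supervised training
        "missing_labels": int(df["label"].isna().sum()),
        "empty_text": int((df["text"].str.strip() == "").sum()),
        # A heavily skewed label distribution changes what "accurate" means
        "label_distribution": df["label"].value_counts(normalize=True).to_dict(),
    }

if __name__ == "__main__":
    df = pd.read_csv("tickets.csv")  # hypothetical input file
    print(data_quality_report(df))
```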
They define "production-ready" before they start. What accuracy threshold makes the model worth deploying? What false positive rate is acceptable? How often should the model be retrained? These are decisions that need to be made -- with the people who will live with the model's outputs -- before the modelling work begins.
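One lightweight way to force that conversation is to write the criteria down as an explicit, reviewable artefact rather than leaving them implicit. The sketch below is illustrative only; the numbers and field names are assumptions, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentCriteria:
    min_precision: float            # below this, the model does not ship
    max_false_positive_rate: float  # the cost the business has agreed to tolerate
    retraining_cadence_days: int    # how often the model is refreshed
    owner: str                      # who signs off, and who lives with the outputs

# Illustrative values only -- agree yours with the people affected by the outputs.
TIER1_TRIAGE = DeploymentCriteria(
    min_precision=0.85,
    max_false_positive_rate=0.05,
    retraining_cadence_days=90,
    owner="support-operations",
)

def ready_to_deploy(precision: float, fpr: float, c: DeploymentCriteria) -> bool:
    return precision >= c.min_precision and fpr <= c.max_false_positive_rate
```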
They build for failure. Every ML model degrades over time as the world changes and the training data becomes less representative of current conditions. The organisations that get sustained value from AI projects are the ones that built monitoring from day one -- watching for drift in input distributions, tracking prediction confidence over time, and building the retraining cadence into their operational process.
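A drift check does not need heavy infrastructure to be useful. As a rough illustration, the sketch below computes a population stability index for a single numeric feature, comparing a training-time sample against recent production inputs; the thresholds quoted in the comment are common rules of thumb, not universal constants.

```python
import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Compare one feature's distribution between training data and production inputs."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf          # catch values outside the training range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)       # avoid log(0) when a bin is empty
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 worth watching, > 0.25 investigate.
```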
A note on LLMs specifically
Large language models have changed what is possible for organisations with limited ML infrastructure -- you can build genuinely useful AI applications without training a model from scratch. But the failure modes are the same: undefined success criteria, bad input data, and no plan for what happens when the model produces a wrong or harmful output.
The additional failure mode with LLMs is evaluation difficulty. It is harder to define a precise accuracy metric for a text generation task than for a binary classifier, which means it is easier to convince yourself that something is "working" when it is not.
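Being disciplined about it is still possible. One workable pattern is a fixed evaluation set with scoring rules agreed in advance, tracked as a single number over time. The sketch below is deliberately simple; the generate() callable and the pass/fail checks are placeholders for your own model call and your own criteria, not a complete evaluation strategy.

```python
from typing import Callable

# A small, fixed set of cases drawn from real usage and reviewed by the
# people who own the outputs. The cases here are purely illustrative.
EVAL_SET = [
    {"prompt": "Summarise this refund policy in two sentences.",
     "must_include": ["refund"], "must_not_include": ["guarantee"]},
    # ... more cases covering the behaviours and failure modes you care about
]

def score(output: str, case: dict) -> bool:
    """Pass/fail against criteria agreed before the build, not improvised after."""
    ok = all(term.lower() in output.lower() for term in case["must_include"])
    return ok and not any(term.lower() in output.lower() for term in case["must_not_include"])

def run_eval(generate: Callable[[str], str]) -> float:
    """Return the pass rate for the current model; track this number over time."""
    passed = sum(score(generate(case["prompt"]), case) for case in EVAL_SET)
    return passed / len(EVAL_SET)
```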
Our recommendation is the same regardless of the model type: define what good looks like before you build, measure it consistently, and treat the first deployment as the beginning of an operational process, not the end of a project.