'Most AI projects fail to reach deployment': Eric Siegel

Siegel’s book “The AI Playbook” explains what it takes to get traditional and advanced artificial intelligence projects from idea to execution.

Most banks are using and testing various forms of traditional and advanced artificial intelligence, including machine learning, deep learning and generative AI. But according to Eric Siegel, a former professor at Columbia University and data scientist, most AI projects fail to reach deployment.

Siegel, who has had a lifetime obsession with predictive analytics and AI – so much so that he wrote and performed a music video about predictive analytics – has just written a book called “The AI Playbook.” In an interview, he shared some of his thoughts on how to get practical results from advanced AI.

First of all, what inspired you to create a music video about predictive analytics?

ERIC SIEGEL: Well, I’ll do anything to help educate and ramp up the world on this technology. It’s fascinating: learning from data to predict, and then using those predictions to improve any and all of the large-scale operations that make the world go round, including targeting marketing, fraud detection, credit score management, insurance pricing and selection, and so many other application areas. It’s fascinating and it’s mandatory, and that’s the message in my book, “The AI Playbook”: We need to bridge a gap between the buzzwords and the tech, and bridging that gap requires business professionals to ramp up on a certain semi-technical understanding so they can collaborate deeply in a meaningful way.

Right now, most new enterprise machine learning projects actually fail to reach deployment and it’s due to this gap and a lack of rigorous business side deployment planning.

That was going to be one of my key questions for you, this idea that most machine learning projects fail to deploy. But let me go back to the idea that machine learning is mandatory. Why do you say it’s mandatory? Because companies can’t really compete or stay relevant if they don’t use it?

Just to clarify, it’s mandatory to learn about it. But that’s because it’s mandatory to use it. What’s one of the last remaining points of differentiation as large-scale enterprise processes become commoditized and everyone’s doing largely the same thing, and products have largely the same look, touch and feel? This is what it means to improve business with science. Prediction is the holy grail for improving decisions. Business is a numbers game and this is the way that you tip the odds in your favor and play that numbers game more effectively. We don’t have clairvoyance, we don’t have magic crystal balls, but using data and learning from it to predict means that you can predict better than guessing. So marketing’s more effectively targeted, credit risk is more effectively assessed and fraud is more effectively detected.

So when you say that most machine learning projects fail to deploy, would you say in a way that that’s appropriate because not everything lends itself to machine learning and some machine learning models are not designed to do certain things? Or do you see this as a problem that needs to be overcome?

I’m referring to a problem that needs to be overcome. I’m talking about projects where it’s already been broadly sussed out: Hey look, this is an opportunity where our fraud auditors could be looking at a more well-chosen pool of transactions to audit: those more likely, significantly more likely than average, to be fraudulent. Therefore, much better use of their precious and costly time. Places like that, where we have a very clear-cut use case and value proposition for predictive analytics, predictive AI, enterprise machine learning, whatever you want to call it. Machine learning generates models that predict.

So the idea is already sussed out. The data scientist does the number crunching, uses the machine learning software and churns out a predictive model, with the intention that it would be deployed to improve those operations. But then the stakeholders ultimately get cold feet or things just haven’t been prepared rigorously enough from a technical standpoint because the focus was on that technology, which is the cool rocket science part, rather than on the enterprise operations improvement. On the business side of it, that change to operations, things weren’t planned rigorously enough, stakeholders weren’t ramped up well enough and didn’t participate in enough details. So if business stakeholders don’t get their hands dirty, their feet will get cold, and that’s the syndrome. So these models get made, they are potentially very valuable. The value is not captured because it’s not deployed, it’s not acted upon.

And is that happening because of fear or because of lack of understanding or because of corporate bureaucracy and permafrost?

Yeah, it’s happening because of fear, bureaucracy and lack of understanding. First of all, it’s change management like any other. So here’s the bad news. You can’t just use this incredible rocket science and do the core number crunching, which is by the way, really amazing. It’s the reason I got into the field more than 30 years ago, machine learning, and I dare say it’s the reason why most data scientists get into it. The bad news is that doing that science doesn’t deliver value.

It doesn’t capture or realize value. It generates potential value. Only by acting on it do you capture that value: You’re only going to get enterprise value when operations change. Change management isn’t anything new, but the focus with these projects, where everyone’s kind of fetishizing the core technology, isn’t on change management. It’s like people are forgetting, wait a minute, we’re trying to improve the business. This is a business project first, an operations improvement project that uses machine learning as a necessary but not sufficient component. As part of the project, we now need to implement, deploy, operationalize it, change operations according to its predictions in order to improve them.

So in financial services, as you mentioned, there is quite a bit of use of machine learning in making lending decisions, in fraud detection, in cybersecurity analysis and in marketing and areas like that. And in some of those areas there is some risk, like for instance, where banks use machine learning in lending decisions, their regulators, like Rohit Chopra, who’s the director of the Consumer Financial Protection Bureau, frequently warn banks that when they use AI models, they can’t be a black box, they have to be explainable, they have to be transparent, there can’t be any bias and the decisions must be fair and not have a disparate impact on protected groups. And we hear these warnings over and over again. Based on what you know about how machine learning models generally work, do you think those kinds of worries are overblown or merited?

I think they’re mostly merited. There’s certain ways in which they’re overblown. Let me go through some of them. First of all, the issues with responsible AI, responsible machine learning, the ethical considerations, I actually take those more seriously than your average data scientist. In fact, the second chapter of my first book, “Predictive Analytics,” is on ethics. But my pet causes are discriminatory models and machine bias, and I try to break that down. Models make or at least inform very potentially consequential decisions about whether you’re approved for credit or even in the case of law enforcement, whether you’re approved for parole. So when the model makes a mistake, you could be unjustly left in jail for an extended period of time or withheld from getting credit approval.

And these are just a couple examples. The problem is that we don’t have a magic crystal ball. We can’t predict whether somebody’s going to commit a crime again after release with extremely high confidence. But we can predict better than guessing, even though there are going to be errors. The problem is when those errors that limit access to resources are higher for one protected group, such as a certain race, than for another. That difference in what’s called false positive rates, where those costly errors are incurred more by one group than another, is often referred to as machine bias. I call it a discriminatory model when the model explicitly makes decisions based on a protected class like race. So that’s a whole issue. I think it’s extremely important. And yes, you need visibility into how the model is making its decisions to suss those out.

I think the understandability of models, and the requirement of that transparency, gets overblown in a couple ways. One is there’s a sense that hey, we need to understand the model in order to trust it. But there’s a limit to our understanding in general. Most of these models are created over found data. There’s no experimental design, there’s no control group. So we’re not actually getting causality. But that doesn’t mean it’s not predictive. So it predicts, but it’s hard to understand exactly why. For one ad targeting project, students who had indicated interest in the military were more likely than average to respond to an ad for the art institute. And you can explain that in a bunch of different ways. What’s their family background? Are people interested in the military more well-balanced? There’s a million ways you could explain it. But we do not know unless we do additional experiments. We don’t need to do those experiments for business value. We aren’t doing sociology, we’re not trying to understand what makes humans tick. We’re just trying to decide which ad to show the person that they’re most likely to click on. So that’s the mythology there about the degree to which we need to understand the model, but we do need transparency, at least for the ethical considerations.
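The false positive rate comparison Siegel describes can be made concrete with a small calculation. The following is a minimal sketch, not taken from the interview or the book: the function name `false_positive_rates`, the group labels and the toy records are all hypothetical. It computes, for each group, the share of people who did not have the bad outcome but were flagged by the model anyway.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Return the false positive rate per group.

    records: iterable of (group, predicted_positive, actual_positive).
    A false positive is a case the model flagged (predicted_positive)
    that did not actually have the outcome (not actual_positive).
    Groups with no actual negatives are left out of the result.
    """
    false_pos = defaultdict(int)   # flagged, but outcome did not occur
    negatives = defaultdict(int)   # all cases where outcome did not occur
    for group, predicted, actual in records:
        if not actual:
            negatives[group] += 1
            if predicted:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}

# Hypothetical toy data: (group, flagged as high risk, actually defaulted)
records = [
    ("A", True, False), ("A", False, False), ("A", False, False),
    ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", True, False), ("B", False, False),
    ("B", False, False), ("B", True, True),
]

rates = false_positive_rates(records)
gap = abs(rates["A"] - rates["B"])
print(rates, gap)
```

In this made-up data, group B’s non-defaulters are wrongly flagged twice as often as group A’s (2 of 4 versus 1 of 4). A gap of that kind between groups is what the interview refers to as machine bias; whether a given gap is acceptable is a policy and legal question that the arithmetic alone does not settle.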

So obviously the buzz over the last several months has been about generative AI and large language models. And I just wonder, what do you think are some of the most useful or practical use cases for large language models?

Basically it makes first drafts – of writing, of computer code, of images. So I think that there’s a false promise in the general public narrative, which is that this thing is going to become capable of human-level activities in general. And there’s a lot of hype about it. What it does is absolutely incredible. I spent six years in the natural language processing research group at Columbia in the ’90s, and believe me, I never thought I’d see what these things can do now. But the ability to create such seemingly humanlike copy or text, to respond in an often coherent way, a meaningful way across topics, the human use of language with metaphors and all that, is amazing. But those core large language models are trained on a per-word basis (technically it’s per token, but at that level of detail).

So they create this seemingly humanlike aura and as a side effect have exhibited a lot of capabilities, but were not designed in and of themselves, unless there’s additional layers on top, to meet higher-order human goals such as being correct or always knowing the right answer. And if you’re trying to get the thing to really be human level, they call that artificial general intelligence, and I like to call it artificial humans. I don’t think that we are headed in that direction actively, even if it may theoretically be possible someday. If you’re churning out a hundred letters a day to customers for customer service, the amount of time that takes could be cut in half.

It depends on the very particular scope of your task, who you are and the exact language model you’re using. And it’s an empirical thing. You’ve got to try it out and see how well it helps and how much time it saves. It potentially can be a huge time saver, but there always has to be the human in the loop. You have to review everything that it generates. You can’t just trust it blindly.

Predictive AI is the type of machine learning that you turn to if you want to improve any of your existing large-scale operations. It can automatically and instantly decide which credit card transaction to hold as potentially fraudulent, without a human in the loop. Predictive AI is older, but it’s not old school by any means. The potential has only barely been tapped, and it’s where there’s an improvement track record. There are still a lot more resources thrown at it than at generative, but it’s not a competition, not a zero-sum game. And generative is a whole new world. There are probably new ways to use it. I’m not sure that we’re ever going to come across the killer app.

It’s a little hard to manage the expectations without overblowing them.

A lot of what you said jibes with what we’re seeing in financial services where all of the hype and curiosity about generative AI has brought about an increase in interest and use of more traditional forms of AI like machine learning and natural language processing and such. I feel like the title of your book is appealing. I think a lot of companies would like to be given an AI playbook that just says, here, do this, this, and this, and you’ll have a machine learning or an AI deployment. But I suspect that the playbook would need to be a little bit different for each organization, each use case, each team. Do you think that is so, or do you think there are certain principles that everybody needs to use when they are trying to deploy AI?

There’s some principles that may not be fully sufficient. Every project has its own ins and outs, whether it’s machine learning or any other kind of project. But there are some principles that are routinely missing, and that’s why new machine learning projects routinely fail to deploy. What I offer in the book “The AI Playbook” is a six-step playbook framework that I call bizML – business practice for running machine learning projects. And the last step is actually deployment. So it culminates with actually getting the thing integrated and operationalized so that operations are actually being changed. The first step is to plan for that from the get-go.

But the broader theme is that across those six steps, we need a deep collaboration between the data scientist and the business stakeholder, the data scientist’s client, maybe the manager in charge of the operations meant to be improved with a predictive model. And that’s generally missing, and that’s what I’m trying to issue here, a clarion call to the world that, hey, look, the business stakeholders need to collaborate deeply, and to do so, they need to ramp up on some semi-technical understanding, which I can outline. Basically, you need to understand three things for any given project: what’s predicted, how well and what’s done about it. So let’s predict which transactions are fraudulent in order to target auditor activity or to automatically hold or block a transaction. Let’s predict which customer’s going to respond to marketing in order to decide who to spend $2 sending a glossy brochure to. Let’s predict who’s going to be a bad debtor.

And this is a standard use of a credit score in order to decide whether to approve an application for a credit card or any other kind of loan. The how well part is, how good is it? And that’s often a key missing ingredient to these questions. How good is AI? How do you quantify it? What are the pertinent metrics? Right now, the disconnect is as follows: The data scientists in most cases only measure the pure predictive performance, which only tells you how well the model predicts relative to a baseline like random guessing. That is helpful to see and tells you it’s potentially valuable. But we also need business metrics like profit, ROI, number of customers saved, number of dollars saved. That is to say, what are the pertinent business metrics that could be improved and how much could they be improved?

Then the stakeholder is ready to participate. It’s sort of like to drive a car, I don’t need to understand what’s under the hood. And in fact, I’ve personally never changed a spark plug and I don’t know where they are in my car. I’ve only looked under the hood of my car once. But I know how to drive, rules of the road, how the car operates and the mutual expectations of drivers. That’s a lot of expertise. You analogously need that expertise to drive a machine learning project if it’s meant to successfully deploy and deliver value.
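The gap Siegel describes between a purely predictive metric and a business metric can be illustrated with the brochure example. What follows is a minimal sketch under stated assumptions, not a method from the book: the function `evaluate_campaign`, the toy scores and the $5 profit per response are hypothetical, and only the $2 cost per brochure comes from the interview. It reports lift (response rate among the contacted prospects relative to the overall response rate, which is what random targeting would be expected to achieve) alongside the campaign’s profit.

```python
def evaluate_campaign(scored, contact_fraction, cost_per_contact, profit_per_response):
    """Evaluate mailing the top-scored fraction of prospects.

    scored: list of (model_score, responded) pairs, one per prospect,
            taken from historical data where the outcome is known.
    Returns lift (a predictive metric) and profit (a business metric).
    """
    # Rank prospects from highest to lowest predicted likelihood to respond.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_contacted = max(1, round(len(ranked) * contact_fraction))
    contacted = ranked[:n_contacted]

    responses = sum(1 for _, responded in contacted if responded)
    overall_rate = sum(1 for _, responded in ranked if responded) / len(ranked)
    targeted_rate = responses / n_contacted

    # Lift: how much better than the overall (random-targeting) response rate.
    lift = targeted_rate / overall_rate if overall_rate else float("nan")
    # Profit: revenue from responders minus the cost of every brochure sent.
    profit = responses * profit_per_response - n_contacted * cost_per_contact
    return {"lift": lift, "profit": profit,
            "contacted": n_contacted, "responses": responses}

# Hypothetical toy data: ten prospects with model scores and known outcomes.
scored = [(0.90, True), (0.80, True), (0.70, False), (0.60, True), (0.50, False),
          (0.40, False), (0.30, False), (0.20, False), (0.10, False), (0.05, False)]

# $2 per brochure is from the interview; $5 profit per response is assumed.
top_30 = evaluate_campaign(scored, 0.3, cost_per_contact=2.0, profit_per_response=5.0)
everyone = evaluate_campaign(scored, 1.0, cost_per_contact=2.0, profit_per_response=5.0)
print(top_30)
print(everyone)
```

With these made-up numbers, mailing only the top 30% yields a lift of about 2.2 and a $4 profit, while mailing everyone yields a lift of 1.0 and a $5 loss. The lift figure shows the model beats guessing; the profit figure is the one a business stakeholder can act on, and it depends on assumptions such as cost and margin that the model itself knows nothing about.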

A lot of financial companies, especially small community banks, don’t have a staff of data scientists, programmers and other technology specialists. They might have two or three tech people and that’s about it. So companies like that are really dependent on vendors who prepackage these things for them. Do you have any advice on choosing the right AI-related vendors and vetting their products and working with them when you might be their smallest client?

Don’t fall for the software sales pitches. This is a consulting gig, not a solution plugin. By definition, a machine learning project is not just the technical number crunching part, it’s the actual change to operations. And that’s what this practice is about. You can participate in the practice, you do need data scientists, and you can go external. The size of the company, by the way, is not in itself a determining factor for whether there’s a potential viable project. If you are sending marketing to a million prospects just once a year, you might be a pretty small company, but you’ve collected enough historical data in terms of who did and didn’t respond in the past from which to learn.

So if the operation’s big enough that tweaking it could deliver a huge benefit to the bottom line, then by virtue of the size of that operation, you’ve probably collected and aggregated enough historical learning examples. That’s called the training data. Now it’s a business practice: How would I change my operations in terms of targeting marketing or changing decisions about loan application processing, insurance pricing and selection, fraud detection? How could that operation potentially be changed? That’s where you’re starting, it’s reverse planning. To that end, what exactly would I need to predict? OK, then what kind of data do I need to pull together? And it’s the involvement. If it’s an external service provider doing the analytics part, you’re still the stakeholder. It’s still a collaboration across these steps. It’s not plug and play. There’s this notion of a citizen data scientist, and some of these machine learning software tools try to simplify things so much. I call it a PHD tool – push here, dummy. It does everything for you. So you’re protected from the technical details and from deciding too much about the parameters when you’re setting it up to hit go. But it still requires data science expertise and it requires your business expertise. The core number crunching itself is literally step five out of six in the way I formulated it, and the world needs to learn that lesson: That alone is not sufficient to deliver value.