Beware of the risks surrounding AI training


Making AI usable for businesses requires a critical activity known as annotation: the process of labelling data to train AI models. Open-source generative AI solutions and foundation models, which have been pre-trained on labelled general data sets such as the internet, provide a good head start, but these general AI models must be tuned for industry-specific applications. This is where annotation by industry experts becomes vital to ensuring that AI models are reliable, consistent and highly accurate. Generative AI solutions also require upfront effort in prompt engineering so that the correct data can be pulled from the models.

In 1848, during the California gold rush, while many people focused on mining for gold, others recognized that providing crucial tools and equipment for prospectors was equally important. These enterprising individuals became known as “picks and shovels” providers and were vital for prospectors in their search for gold.

Digitizing the insurance experience
Fast forward to 2023 and, as artificial intelligence (AI) captures public attention, executives and boards across all industries are prioritizing AI investments and driving a ‘gold rush’ in the AI space. In fact, according to recent research by Accenture, 40% of all working hours could be impacted by AI and be transformed into more productive activity through augmentation and automation.

Because data annotation and model training are so important to realizing the promise of AI, they have themselves become big business. Just as the picks and shovels providers supported the search for gold in 1848, companies such as Scale AI have become the backbone of AI development by offering essential annotation services used for training the AI models of OpenAI, Microsoft and others.

So, what does all this mean for the insurance industry? Well, according to research, insurance will be the second most impacted industry in the AI era, with an estimated 62% of all working hours impacted.

For insurers, especially commercial insurers, the key to unlocking this transformative opportunity is the ability to extract and structure data from unstructured communications, documents and other types of digital content accurately, reliably and cost-effectively. Data extraction is the foundation on which the industry will build new AI-powered business models and is the most impactful development in the digitalization of insurance in the last two decades.

It is perhaps no surprise, then, that insurance carriers around the world are investing heavily in automated data ingestion, turning broker underwriting submissions and claims communications into structured data records that they can use in their systems.

However, building AI models that can accurately ingest, for example, complex commercial underwriting submissions, is no simple task. A global carrier can have hundreds of product variations, in multiple languages, with each requiring tens or hundreds of data items to enable pricing, underwriting and claims decisions — all of which need to be extracted from submissions where the data is wholly unstructured and in potentially millions of formats.

Challenges of data interpretation
Reading and interpreting this data are very specialized tasks, requiring industry knowledge and experience, and utilizing highly skilled resources. Annotating insurance data sets to build AI models that can accurately replicate this is equally specialized and highly labor-intensive work, with an almost infinite variety of data entities to label across a vast array of sources.

This is becoming known as the ‘bottomless pit of AI training’, and many in the insurance industry have fallen into it without realizing. For the carriers, brokers and MGAs caught in it, cost and time are running to a staggering 10-50x the initial investment.

Insurers often underestimate the complexity of AI model development, whether building models in-house or using Intelligent Document Processing (IDP) platforms that streamline the process but still require insurers to build and test their own models. Building these extraction models can cost tens to hundreds of millions of dollars and span multiple years, leading to many abandoned projects. The problem is exacerbated by vendors who falsely claim to provide highly accurate ‘out-of-the-box’ solutions, often by leveraging foundation LLMs in pilot projects. Such models are unsuitable for a scaled production environment in which regulated insurance data must be extracted reliably, consistently and accurately. They may provide an illusion of accuracy in a simple proof of concept, but cannot deliver at scale.

So how can you avoid falling into the ‘bottomless pit’? Design your selection process to test vendors’ real ability to accurately extract and map any data entity from any source, for any class. The answer can often be found in a vendor’s willingness (or not) to commit commercially to the performance of their model.
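One way to make such a test concrete is to score each vendor's output, field by field, against a set of submissions your own experts have labelled. The short Python sketch below illustrates the idea under stated assumptions: the field names, the sample records and the simple normalization rule are all hypothetical, and a real evaluation would need a far larger and more varied sample across classes, languages and formats.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case (an assumed, deliberately simple rule)."""
    return " ".join(value.split()).casefold()


def field_accuracy(gold: List[Record], extracted: List[Record]) -> Dict[str, float]:
    """Share of documents in which each expert-labelled field was extracted correctly."""
    if len(gold) != len(extracted):
        raise ValueError("gold and extracted must cover the same documents")
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for expected_doc, vendor_doc in zip(gold, extracted):
        for field, expected in expected_doc.items():
            total[field] = total.get(field, 0) + 1
            # A missing field counts as an error, not as a skipped item.
            if normalize(vendor_doc.get(field, "")) == normalize(expected):
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / n for field, n in total.items()}


# Hypothetical expert-labelled submissions and one vendor's output for them.
gold = [
    {"insured_name": "Acme Ltd", "class_of_business": "Property", "limit": "5000000"},
    {"insured_name": "Borealis GmbH", "class_of_business": "Marine", "limit": "2000000"},
]
extracted = [
    {"insured_name": "ACME Ltd", "class_of_business": "Property", "limit": "5000000"},
    {"insured_name": "Borealis GmbH", "class_of_business": "Cargo", "limit": "2000000"},
]

scores = field_accuracy(gold, extracted)
for field, score in scores.items():
    print(f"{field}: {score:.0%}")
```

Per-field scores of this kind, measured on documents the vendor has not seen in advance, give a firmer basis for a commercial performance commitment than a headline accuracy figure from a proof of concept.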

The good news is that the technology has come of age and one or two vendors have successfully developed highly accurate and scalable solutions. Insurers who follow the right approach and select the right partner can avoid those hidden 10-50x AI implementation costs and the delays in execution. They are now able to digitize unstructured data across underwriting submissions and claims at a global, enterprise level in a few months, with very low implementation costs.

With the right picks and shovels provider, insurers can now strike gold.