How to develop a data master plan
Insurance companies large and small are coming to the same conclusion: Our enterprise data warehouses can’t keep up with demand. Experts see this and are looking for a solution. Will it be Data Mesh? Data Fabric? Data Virtualization? (Throughout, I use ‘data warehouse’ broadly to include the data warehouse, data lake, delta lake, and cloud data warehouse.)
To understand the path forward, it helps to unpack why the data warehouse is falling behind. There are three clear reasons: first, one data producer can’t keep up with many data consumers; second, data intelligence lives close to the source systems (e.g., claims, premiums), not in the back-end data warehouse; and finally, the data warehouse is an extra hop. The limitations of the data warehouse are about skills and organization design more than technology.
With the data analytics era just getting started and the centralized data factories already clogged up, it’s time to consider a new approach. To unleash the talent and energy of their knowledge workers, insurance companies need to embrace decentralization and self-service. Governance and control are nonetheless non-negotiable. Therefore, a future-facing data master plan rests on four cornerstones: a data mesh, repurposed data warehouses, distributed data governance, and a data catalog.
1. Data mesh
Data mesh is an emerging approach to managed decentralization. A data mesh is a network of data products created by data producers and used by data consumers. Who’s a data producer? Anyone who cares to be, but there are guardrails on data products to ensure they’re safe and workable. There is no fixed limit on the number of data producers or consumers, so the mesh can scale horizontally as participants are added, and it’s light on shared services. A good analogy is Airbnb: there are property suppliers, guests, an operating model, a platform, and governance. It works, it scales, and everyone’s happy.
In a data mesh the primary actors are domain owners, data producers, shared services, and data consumers. Domain owners are accountable for creating, maintaining, and governing the data products of their business domain (e.g., claims, losses, marketing), which any other domain can use. Shared services provide self-service tooling so that data producers and consumers can provision development and run-time environments (compute, storage, SQL engine, and so on).
Data mesh is peer-to-peer decentralization based on self-service. Are enterprises ready for Airbnb-style collaboration? Well, why not? Most of today’s cloud-based data lakes are grandchildren of the 1990s data warehouse, enhanced with a few big improvements (e.g., MPP, schema on read, infrastructure as code, blob storage). The enterprise data landscape, however, is wildly different from what it was then. Most notably, data supply and demand have grown exponentially in volume and variety, data literacy is far higher, low-code and no-code tools are everywhere, and data center readiness is a non-issue. Yes, enterprises are ready for decentralization and self-service.
A lot of experimenting is going on to determine how to incent domains to create data products for others to use, how to ensure that data products are well-behaved, and how to make data products available to data consumers.
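What might a guardrailed, well-behaved data product look like in practice? Here is a minimal sketch in Python of a data product contract a domain could publish alongside its data; the field names (owner_domain, refresh_sla_hours, and so on) are illustrative assumptions, not a standard.

    from dataclasses import dataclass

    @dataclass
    class DataProductContract:
        """Illustrative contract a domain publishes alongside its data product."""
        name: str                  # e.g. "claims.open_claims_daily"
        owner_domain: str          # accountable domain, e.g. "claims"
        description: str           # business meaning, surfaced in the catalog
        schema: dict               # column name -> data type
        refresh_sla_hours: int     # how fresh consumers can expect the data to be
        classification: str = "internal"   # privacy/security classification

        def validate(self) -> list:
            """Return a list of guardrail violations (empty list = well-behaved)."""
            problems = []
            if not self.description:
                problems.append("missing description")
            if not self.schema:
                problems.append("missing schema")
            if self.refresh_sla_hours <= 0:
                problems.append("refresh SLA must be positive")
            return problems

    contract = DataProductContract(
        name="claims.open_claims_daily",
        owner_domain="claims",
        description="Open claims snapshot, refreshed daily",
        schema={"claim_id": "string", "policy_id": "string", "reserve_amount": "decimal"},
        refresh_sla_hours=24,
    )
    assert contract.validate() == []

The point of such a contract is that the guardrails are machine-checkable: shared services can reject a data product that fails validation before it ever reaches a consumer.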
2. Integration hubs, a.k.a. repurposed data warehouses
A question that data mesh orthodoxy does not answer well is what to do with our data warehouses. Data mesh purists even propose that data warehouses be dismantled. In my opinion, that’s too radical.
Data warehouses do heavy lifting that need not be replicated. For example, cross-domain data integration (linking premiums with losses, orders with campaigns, agents with claims) requires complex job pipelines, rich with code to cleanse data and conform dimensions. Another example is the star schema of facts and dimensions that powers high-performance queries. The work to get operational data into that shape was heady and laborious.
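To make “conform dimensions” concrete, here is a minimal sketch (using pandas, with made-up column names) of joining premiums and losses once the two domains agree on a shared policy key:

    import pandas as pd

    # Hypothetical extracts from two domains
    premiums = pd.DataFrame({
        "policy_id": ["P-1", "P-2", "P-3"],
        "written_premium": [1200.0, 800.0, 1500.0],
    })
    losses = pd.DataFrame({
        "policy_no": ["p-1", "p-2"],     # different key name and casing
        "paid_loss": [300.0, 950.0],
    })

    # Conform the dimension: agree on one key name and one format
    losses = losses.rename(columns={"policy_no": "policy_id"})
    losses["policy_id"] = losses["policy_id"].str.upper()

    # Cross-domain fact table: premium and loss side by side, with a loss ratio
    fact = premiums.merge(losses, on="policy_id", how="left").fillna({"paid_loss": 0.0})
    fact["loss_ratio"] = fact["paid_loss"] / fact["written_premium"]
    print(fact)

Multiply this by dozens of source systems and hundreds of dimensions and you have the pipeline estate the warehouse already maintains, which is exactly why it should not be thrown away.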
Going forward, the data warehouse should pare back its mission to core services, including integration of cross-domain data. New data sets should be created by the domains, with the data warehouse as a last resort. Additionally, the data warehouse should divest itself of BI (dashboards, reports, etc.) and data science activity. As the warehouse shifts into maintenance (vs. growth) mode, IT headcount (data engineers, BI specialists, BAs, data modelers, and business liaisons) can be moved from shared services to the domains.
There is, nonetheless, an important role for shared services in providing infrastructure. Shared services will build the self-service tools end users need to provision workspaces (CPU, storage, SQL engine, BI tool, etc.) for building and/or using data products. This ensures compliance, workability, manageability, and cost savings.
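As a rough illustration of what such guardrailed self-service could look like, here is a hypothetical workspace request in Python; the size tiers, approved tool list, and cost-center convention are assumptions made for the example, not any particular platform’s API.

    ALLOWED_SIZES = {"small": 4, "medium": 16, "large": 64}   # vCPUs per size tier

    def request_workspace(domain: str, size: str, tools: list) -> dict:
        """Validate a self-service workspace request against shared-services guardrails."""
        if size not in ALLOWED_SIZES:
            raise ValueError(f"size must be one of {sorted(ALLOWED_SIZES)}")
        approved_tools = {"sql_engine", "notebook", "bi_tool", "object_storage"}
        unknown = set(tools) - approved_tools
        if unknown:
            raise ValueError(f"unsupported tools requested: {sorted(unknown)}")
        return {
            "domain": domain,
            "vcpus": ALLOWED_SIZES[size],
            "tools": tools,
            "cost_center": f"{domain}-analytics",   # charged back to the requesting domain
        }

    plan = request_workspace("claims", "small", ["sql_engine", "notebook"])

The design point is that the domain gets its environment in minutes, while shared services keep control of what can be provisioned and who pays for it.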
3. Distributed data governance
Data governance comprises the policies and procedures that ensure data quality, privacy, and security. The need for data governance won’t go away with decentralization; if anything, it will grow. Without governance, data decentralization is a non-starter.
How do we do data governance in a peer-to-peer network? Here’s a framework that places responsibilities with shared services, domain owners, and product owners. Shared services provision governance tools (e.g., tools for data catalog, quality, lineage, and stewardship) and promulgate enterprise policies, such as data privacy and security requirements. Domain owners promulgate business unit policies and enforce compliance with enterprise and business policies for their data products. Finally, product owners within a domain are responsible for code-level enforcement of governance policies from the enterprise and domain levels.
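One way to picture this layering is as enterprise and domain policies that product owners evaluate in their pipelines before publishing. The sketch below is illustrative Python with invented policy names, not a reference to any particular governance product.

    # Each policy is a function: asset metadata -> violation message or None
    def no_plain_pii(asset):
        if asset.get("contains_pii") and not asset.get("pii_masked"):
            return "enterprise policy: PII must be masked or tokenized"

    def classification_required(asset):
        if not asset.get("classification"):
            return "enterprise policy: every asset needs a classification"

    def claims_retention(asset):
        if asset.get("domain") == "claims" and asset.get("retention_years", 0) < 7:
            return "claims domain policy: retain claim data at least 7 years"

    ENTERPRISE_POLICIES = [no_plain_pii, classification_required]
    DOMAIN_POLICIES = {"claims": [claims_retention]}

    def check_compliance(asset: dict) -> list:
        """Product owners run this before publishing a data product."""
        policies = ENTERPRISE_POLICIES + DOMAIN_POLICIES.get(asset.get("domain"), [])
        return [msg for policy in policies if (msg := policy(asset))]

    violations = check_compliance({
        "domain": "claims",
        "contains_pii": True,
        "pii_masked": True,
        "classification": "confidential",
        "retention_years": 10,
    })
    assert violations == []

Shared services own the enterprise list, domain owners own their domain list, and product owners wire the check into their build so a non-compliant data product never ships.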
4. Data catalog
A data catalog is the sine qua non for data producers and consumers and the key enabler of a well-governed, decentralized, self-service data mesh.
For data producers, the catalog provides a massive leap in productivity by automating the collection of metadata for all enterprise data, model, business intelligence, and API resources. It then applies advanced algorithms, including AI/ML, to auto-classify, name, profile, and quality-score assets and to construct lineage. With this foundation of knowledge at their fingertips, producers can easily find key assets, identify experts, share queries, endorse each other’s work, and publish their domain’s data products.
For consumers, the catalog serves as an easily accessible marketplace where they can find, understand, and access reliable data products. In addition, it’s a knowledge base that provides important context such as term glossaries, metrics and KPI definitions, and business process descriptions.
The common thread between the work of producers and consumers is the attachment of corporate data governance policies (privacy, security, sharing, lifecycle, ethics) to every cataloged asset, including data products. This ensures that policies are communicated clearly and transparently, and that they can be monitored against actual data usage and enforced.
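As a toy illustration of attaching policies to cataloged assets, here is a minimal in-memory catalog sketch in Python; real catalogs automate metadata collection at scale, and the policy tags shown are invented for the example.

    class DataCatalog:
        """Minimal in-memory stand-in for a data catalog (illustrative only)."""
        def __init__(self):
            self._assets = {}

        def register(self, name, owner_domain, description, tags, policies):
            self._assets[name] = {
                "owner_domain": owner_domain,
                "description": description,
                "tags": set(tags),
                "policies": set(policies),   # governance policies attached to the asset
            }

        def search(self, keyword):
            """Consumers find data products by keyword in name, description, or tags."""
            kw = keyword.lower()
            return [name for name, meta in self._assets.items()
                    if kw in name.lower()
                    or kw in meta["description"].lower()
                    or any(kw in t.lower() for t in meta["tags"])]

    catalog = DataCatalog()
    catalog.register(
        name="claims.open_claims_daily",
        owner_domain="claims",
        description="Open claims snapshot, refreshed daily",
        tags=["claims", "operations"],
        policies=["privacy:mask-pii", "sharing:internal-only", "lifecycle:retain-7y"],
    )
    assert catalog.search("claims") == ["claims.open_claims_daily"]

Because the policies travel with the asset, a consumer who finds the data product in the catalog also sees exactly how it may be used.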
If you’re serious about self-service and driving a data culture, you must be serious about data catalog adoption. It is the new corporate commons where employees meet and become data citizens.
Conclusion
Many modernizing insurance companies have outgrown their data warehouse, even though the data era has just begun. The problem is not technical so much as organizational. Enterprises like Airbnb show the low-friction scalability of well-governed, peer-to-peer decentralization and self-service. That can be the model for the modern data enterprise.