Making machine intelligence profitable
Part 2: Data Engineering – Enterprise-scale, end-to-end streams
Transforming data into actionable insight is key to its value. Leaders at companies across a variety of industries that are materially investing in technology architectures want a return on that investment, and they want it now. Unfortunately, many of these leaders are missing a crucial first step that precedes generating useful analyses.
A range of challenges is putting the brakes on leaders' efforts to extract value from their data and system investments. First, data used in analysis have to be clean, and data ingestion and curation efforts are time-consuming – the process (still quite labour-intensive) requires 70%–90% of a typical data professional's time. Most companies hire data professionals for their specialised and expensive analytical skills, not for their ability to clean data. Second, technology architectures lack end-to-end data pipelines, data ontologies and taxonomies. Instead of performing data ingestion and curation in a centralised manner, firms process the same data simultaneously in multiple departments. The result is fragmentation, which limits the ability to spread and use data insights across the organisation. It's just not efficient.
The current state of data engineering has become a key drag on making machine intelligence (MI) profitable in the insurance industry.
Data Engineering - key to extracting value from data investments
We define data engineering as the process of refining data into fuel: clean, easily accessible, and with its insights available for use anywhere in an organisation. At-scale success requires an automated end-to-end data pipeline performing ingestion, curation, transformation, visualisation and, finally, wide-scale distribution of data. Think of this pipeline as a production line, similar to what one might find in a factory.
Usefully transformed data provide competitive advantage in the insurance industry. Firms can differentiate themselves by identifying, ingesting and curating novel (not generally available) data and, further, by developing more useful derived analytics and insights from these data. These data-driven differentiation opportunities can be found at nearly every step in the risk-transfer value chain – from product development through core underwriting and claims management to portfolio analysis.
Historically, firms mastered traditional data sources at the scale needed for success in one or more of those steps. However, the ingestion, curation and transformation of data within each step became tailored to the particular needs and data types of the department processing the data. That tactically fragmented data engineering landscape has a cost today and places a significant cap on data transformation efforts for tomorrow.
Most insurers today have multiple bespoke ingestion and curation processes, embedded in various analytical processes. None is optimised for cost-efficient engineering, but rather for the desired target analysis. The costs add up. As different areas try to leverage each other's predictive analytics, data professionals, underwriters, actuaries and claims managers must reconcile how data have been differently engineered in different areas before moving on to analysis. This carries not only time and effort costs, but often prevents cross-department leverage from occurring at all.
Looking further, it is the missing ability to engineer data "at scale" that prevents firms from extracting value from their investments in data scientists and data analytics tools. A good analogy: those investments are powerful engines purchased by the company, but the company either cannot deliver fuel to them, or the fuel it does deliver is full of sand, which materially degrades engine performance. In the insurance industry, traditional data are already riddled with regulatory and privacy challenges, and the need for decision certainty and traceability already slows the exploration of non-traditional data. Fragmented internal data engineering capabilities compound the problem.
Looking specifically through the lens of "making MI profitable", at-scale data engineering gaps prevent promising MI projects from becoming successful operational implementations. Making MI systems operational at scale from the beginning, versus more siloed project-scale efforts, requires access to high-quality, always ready-for-use data. Project solutions relying on brute force, i.e. high-cost manual coordination of fragmented data engineering capabilities, cannot scale into production, regardless of how promising the algorithms and models are. New MI techniques, driven by exponentially growing sources of data, will impact competitive advantage in every step of the risk-transfer value chain. However, right now, useful and relevant information often does not find its way into underwriting decisions, and more granular profitability analysis does not steer the business. This will have to change.
The nuts-and-bolts elements of data engineering to make data work
Data engineering can be broken out into a series of steps. New data must first be discovered and sourced, and on an ongoing basis. We used to boil these steps down to the "ETL" acronym: extract, transform, load. But the world has changed.
Today, enterprises should have pipes attached to rivers of data so that sourced data are nearly continuous, not "extracted" ad hoc from a static datastore. No longer should we aim merely to "transform" data into a single format for a given use case; instead, we should apply a collection of analytical transformations to curated data. Thus, ETL becomes something different: continuous ingestion, curation, transformation and visualisation (ICTV). These steps should inform and evolve the data-value-chain management process. Indeed, validation, robust privacy preservation (including differential privacy) and meta-data tracking become integral to a more robust data-value-chain management process.
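The ICTV flow can be sketched as a chain of streaming stages. The following minimal illustration, in Python, is a conceptual sketch only: the field names (e.g. `premium`) and banding rule are invented for the example, not drawn from any real pipeline.

```python
from datetime import datetime, timezone

def ingest(source):
    """Continuous ingestion: attach provenance metadata as records arrive."""
    for raw in source:
        yield {"payload": raw,
               "meta": {"ingested_at": datetime.now(timezone.utc).isoformat()}}

def curate(records):
    """Curation: drop malformed records and normalise types."""
    for r in records:
        if "premium" not in r["payload"]:
            continue  # a real pipeline would quarantine rather than silently drop
        r["payload"]["premium"] = float(r["payload"]["premium"])
        yield r

def transform(records):
    """Transformation: derive analytical attributes on curated data."""
    for r in records:
        r["payload"]["premium_band"] = (
            "high" if r["payload"]["premium"] > 1000 else "standard")
        yield r

# Each stage is a generator, so records flow through continuously rather
# than being extracted in one ad hoc batch.
stream = [{"premium": "1500"}, {"malformed": True}, {"premium": "200"}]
result = list(transform(curate(ingest(stream))))
```

Because every stage is lazy, the same chain works unchanged whether `stream` is a three-element list or a long-running feed from a message queue.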
A challenge threaded through these steps is excessive dependence on manual and script-driven processes. This can happen when processes are designed around manual steps and project code originally developed for experimental, interactive analysis. Multiple (or all) steps might involve a manual trigger, intervention or review. Manual auditing, versioning and checking for errors and anomalies reduce not only efficiency, but also confidence in downstream modelled outcomes. Lack of a holistic, standardised data engineering process increases operational risk as data scientists tailor existing processes or create brand new ones for their needs. The potential for error rates in both data and derived analytics grows.
The use of relational databases as storage solutions limits speed and scale. Traditional database management, which relies on rigid technology infrastructure, is by itself insufficient to handle the rising number of heterogeneous formats that must be processed to unlock value from data. The lack of enterprise-wide data ontologies (definitions of the relationships among data) also limits the ability to scale and operationalise data across the organisation. Infrequent release iterations of data, coupled with limited CI/CD (Continuous Integration / Continuous Deployment) of both data and MI models, fail to adapt to changes in the data environment, or to changes in the metadata that describe it.
Looking forward, running tomorrow's at-scale, operational data capability will require a pipeline-driven approach to data engineering – similar to how automotive companies implement their production lines. A single rules-based data pipeline is needed that covers the entire data engineering lifecycle. Most processes in the pipeline are algorithmic, supervised by humans. The automated pipeline connects and streams relevant data from sources as, and when, they become available. Depending on the data format, an algorithmic template approach transforms and extracts data from unstructured and semi-structured documents (e.g. PDFs) in real time. Data are tagged automatically according to rules prescribed by business requirements. The data ontology and data dictionary (data terms and definitions) – key to operationalising the data across the organisation – are built dynamically and kept up to date as new data become available. A single, flexible document-based data lake (well fed by various data rivers) houses all structured, semi-structured and unstructured data types. Automated, high-integrity versioning and traceability of data runs throughout the pipeline as part of scheduled batch processing.
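Rule-driven tagging of the kind described above can be expressed as a small table of predicates. The sketch below is illustrative only: the rules, tags and field names are assumptions for the example, not a real insurer's taxonomy, and in production the rule table would live in business-maintained configuration rather than code.

```python
# Illustrative rule table: predicate -> tag. Each rule encodes one
# business requirement; a document may match several rules.
RULES = [
    (lambda d: d.get("doc_type") == "policy", "underwriting"),
    (lambda d: "claim_id" in d, "claims"),
    (lambda d: d.get("premium", 0) > 10_000, "large-risk"),
]

def tag(document):
    """Apply every matching rule and return the resulting tags."""
    return [t for predicate, t in RULES if predicate(document)]

tags = tag({"doc_type": "policy", "premium": 25_000})
```

Because the rules are data rather than bespoke scripts, adding a new business requirement means appending one entry, not rewriting a departmental process.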
Exhibit A: Designing Insurers’ Data Engineering Pipeline
Invest in a data engineering capability to help speed transforming data into value
An array of plug-and-play solutions built on open-source technologies – with humans still intelligently in the loop – can speed up the evolution of data engineering into a fully integrated, end-to-end data pipeline process.
- Data sourcing and discovery should be done using algorithms that scan for new datasets from relevant sources associated with the domain. Indexing of downloadable content and comparison of new files against existing files should also be automated
- On the data transformation front, custom OCR tools – developed by combining computer-vision-based algorithms with PDF parsers that allow template-based training on new document structure types – are critical to performing this process at scale. Also important is creating logical and physical structures to enable automated tagging of documents against business requirements
- Data credentialing and parametrisation done on the fly, where appropriate tags are applied to the dataset in the form of unique data attributes, and data-range calculations are performed to identify any anomalies
- Dynamic data ontology model and data stored in networked data rivers, where a highly flexible ontology model – supervised by data architects – is implemented that is capable of modelling almost any user-defined data model; the ontology operates as a lossless data abstraction layer, and each data property and relationship can be traced back to its original source documents
- High integrity associated with data versioning and traceability where an automated process retains the complete history of data that enters the pipeline
- Ready-for-analysis data served in the form of microservices with a REST API
- Modularised code with reusable components that are shareable across the pipeline including across both development and production environments
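The sourcing-and-discovery item above – comparing new files against files already indexed – is often done by content hash rather than filename, so renamed duplicates are not re-ingested. A minimal sketch follows; the file names and contents are made up for illustration.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Content hash, so identical files are recognised regardless of name."""
    return hashlib.sha256(content).hexdigest()

def select_new(incoming, index):
    """Return names of incoming files whose content is not yet indexed,
    updating the index as a side effect."""
    fresh = []
    for name, content in incoming.items():
        h = fingerprint(content)
        if h not in index:
            index.add(h)
            fresh.append(name)
    return fresh

index = set()
first = select_new({"report_2023.pdf": b"annual figures"}, index)
# Same bytes arriving under a new name: nothing new to ingest.
second = select_new({"report_copy.pdf": b"annual figures"}, index)
```

In a production pipeline the index would be persisted (and would also record source, version and timestamp), but the content-addressed comparison is the core of automated discovery.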
Building a scalable data engineering capability has now become table stakes for insurers to not only extract value from their data investments but also expedite the speed of data transformation across the organisation. As one data leader put it: "Data initiatives are a team sport." That is, executives, business leaders, technology architects and data staff have to work together to deliver productive transformation. Insurers that make the right design and implementation choices with regard to data engineering-related transformation will succeed in the future.
How to avoid MI pitfalls
In the past few years, myriad powerful tools, systems, open-source algorithms and vendors have appeared to address many (if not all) of these challenges. Even so, any organisation that haphazardly attempts to implement a mix of new capabilities risks becoming lost in the thicket of choices, leading to Frankenstein systems that only breed more inefficiencies. What can well-informed institutions do?
- Develop a comprehensive data strategy
- Create process and organization to maintain dynamic data taxonomies and ontologies
- Find and hire quality data-system architects (these are not data scientists or developers)
- Emphasize end-to-end, enterprise-scale use cases and plans, not targeted algorithm/model pilots and prototypes
- Keep humans in the process loop in the context of organizational re-designs that account for opportunities afforded by newer MI
- Plug into data rivers—even when it is not clear how the data will be used
- Hire designers (these are also not data scientists or developers) to design compelling data visualization
- Implement dynamic data visualization and train non-technical executives on how to consume the derived output so that it regularly leads to actionable insight
Following all these recommendations is a tall order and impractical for many institutions in the near term. However, some of them can be quickly realised. For example, almost all insurers already invest heavily in "conventional" MI such as generalised linear models. This conventional MI informs risk selection, risk pricing, capital allocation and risk management – albeit at too high a cost due to poor data-system architectures. Thus, well-placed data engineering investment will almost immediately improve returns on the conventional MI investment many firms have already made. Even if comprehensive re-engineering of data architectures is not possible now, targeted efforts to improve data ingestion and curation using cost-effective, end-to-end systems (a few particularly good ones are now available) and investing in well-designed data visualisation are two practical recommendations almost all institutions can follow today. Comprehensive data strategy development and data engineering efforts will likely require more time and budget.
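For concreteness, the "conventional" MI mentioned above – a generalised linear model – is cheap to fit once its input data are clean. The sketch below fits a Poisson GLM with a log link (a standard choice for claim-frequency modelling) by iteratively reweighted least squares, for one covariate plus an intercept; the data are synthetic and the whole example is illustrative, not any firm's actual pricing model.

```python
import math
import random

def rpois(lam):
    """Knuth's Poisson sampler (stdlib-only, just for the synthetic data)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def fit_poisson_glm(xs, ys, iters=30):
    """Poisson GLM with log link (intercept + one covariate), fitted by
    iteratively reweighted least squares (IRLS)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate the 2x2 weighted normal equations X'WX beta = X'Wz;
        # for Poisson with log link the IRLS weight equals the fitted mean mu.
        s00 = s01 = s11 = t0 = t1 = 0.0
        for x, y in zip(xs, ys):
            eta = b0 + b1 * x
            mu = math.exp(eta)
            z = eta + (y - mu) / mu  # working response
            s00 += mu; s01 += mu * x; s11 += mu * x * x
            t0 += mu * z; t1 += mu * x * z
        det = s00 * s11 - s01 * s01
        b0 = (s11 * t0 - s01 * t1) / det
        b1 = (s00 * t1 - s01 * t0) / det
    return b0, b1

random.seed(42)
xs = [random.gauss(0.0, 1.0) for _ in range(4000)]
ys = [rpois(math.exp(0.5 + 0.3 * x)) for x in xs]  # true coefficients: 0.5, 0.3
b0, b1 = fit_poisson_glm(xs, ys)
```

In practice an insurer would use an established GLM library rather than hand-rolled IRLS; the point is that the fitting step itself is inexpensive – the returns on it hinge on the quality of the data feeding it.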
As previously commented, and perhaps counter-intuitively, the path to MI success in the risk-transfer industry does not focus on the algorithms and models that ultimately generate value-adding insights and predictions. Rather today's challenge is efficiently fuelling that growing analytical capability and enabling business leaders to digest its growing output. This introductory blog will be followed by a three-part series offering our more detailed views on tackling those engineering and consumption challenges.