Understanding medical risk: a network-based approach
How can data science help us understand medical risk? In a life and health context, many studies aim to identify trends and forecast costs by leveraging new data science techniques. This usually involves a machine learning exercise in which next year's expected cost (claims), or mortality, is forecast as a function of the information available a priori, be it demographic or medical. Given a large training data set, state-of-the-art techniques will provide strong evidence that distinct cost profiles can be drawn, so the model demonstrates some forecasting power. This approach, however, often relies heavily on sensitive model assumptions (for instance, to mimic the underwriting data), on the inner complexity of the insurance value chain, and on strong in-house expertise.
Stepping away from this cost forecasting problem, data science can also be used at deeper levels to better understand the medical risk and describe the specificities of a given population. What are the key health risks that a teenager is exposed to versus an adult? How does the medical risk of someone living in an urban area compare to the broader population?
How data science helps us visualize disease
Strong medical expertise can answer these questions, but data science can also help to structure this tacit knowledge. By leveraging large data sources of medical claims, we can draw the diseasome – the network of diagnoses and drugs – for any target population, which enables us to visualize co-morbidities. Each diagnosis is represented as a node, and the co-occurrence of two diagnoses in the same person is represented as an edge between the two nodes (Figure 1).
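As a rough illustration of how such a network could be assembled, the sketch below counts diagnoses (nodes) and pairs of diagnoses recorded for the same member (edges). It is a minimal example under stated assumptions, not the method used for the figures: the input layout of (member_id, diagnosis_code) pairs, the function name build_cooccurrence_network and the toy claims are all hypothetical, and the ICD-10 codes are used purely as labels.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_network(claims):
    """Count nodes and edges of a diagnosis co-occurrence network.

    claims: iterable of (member_id, diagnosis_code) pairs (assumed layout).
    Returns (node_counts, edge_counts, n_members), where counts are numbers
    of distinct members, not numbers of claim lines.
    """
    # Collect the distinct diagnoses recorded for each member.
    diagnoses_by_member = {}
    for member_id, code in claims:
        diagnoses_by_member.setdefault(member_id, set()).add(code)

    node_counts = Counter()
    edge_counts = Counter()
    for codes in diagnoses_by_member.values():
        node_counts.update(codes)
        # Sorting gives each undirected edge a single canonical key.
        for a, b in combinations(sorted(codes), 2):
            edge_counts[(a, b)] += 1
    return node_counts, edge_counts, len(diagnoses_by_member)


# Toy data, for illustration only.
claims = [
    ("m1", "E11"), ("m1", "I10"),
    ("m2", "E11"), ("m2", "I10"), ("m2", "L03"),
    ("m3", "I10"),
    ("m1", "E11"),  # a repeated claim line does not inflate the counts
]
nodes, edges, n_members = build_cooccurrence_network(claims)
print(nodes)
print(edges)
```

Counting members rather than claim lines is a design choice: it keeps a frequently billed chronic condition from dominating the edge weights.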
Considering the granularity of the diagnosis codes – up to 15,000 at the ICD level – such a network can be hard to visualize and impossible to interpret. For a large population, the network can be nearly fully connected, meaning that almost every pair-wise co-morbidity can be found somewhere in the population. To go further, a set of filtering techniques has to be applied, ranging from the theory of Bayesian networks to traditional heuristics that remove irrelevant edges and nodes. The last step is to differentiate the diseasome of the target population from that of the benchmark population (Figure 2).
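To make the filtering and differencing steps more concrete, here is a minimal sketch of one possible heuristic. It is not the Bayesian-network approach mentioned above, nor necessarily what was used for the figures: it simply keeps edges whose lift (observed co-occurrence divided by the co-occurrence expected if the two diagnoses were independent) exceeds a threshold, then retains the edges that are markedly more prevalent in the target population than in the benchmark. All function names, thresholds and numbers are hypothetical.

```python
def edge_lift(node_counts, edge_counts, n_members):
    """Lift of each edge: observed co-occurrence / expected under independence."""
    lifts = {}
    for (a, b), n_ab in edge_counts.items():
        expected = node_counts[a] * node_counts[b] / n_members
        lifts[(a, b)] = n_ab / expected
    return lifts


def filter_edges(node_counts, edge_counts, n_members, min_count=2, min_lift=1.2):
    """Drop rare edges and edges no stronger than chance (assumed thresholds)."""
    lifts = edge_lift(node_counts, edge_counts, n_members)
    return {
        edge: lift
        for edge, lift in lifts.items()
        if edge_counts[edge] >= min_count and lift >= min_lift
    }


def differential_edges(target_prev, benchmark_prev, min_ratio=1.5):
    """Keep edges over-represented in the target population.

    Both arguments map an edge to its prevalence, i.e. the share of members
    carrying both diagnoses. Edges absent from the benchmark are kept.
    """
    kept = {}
    for edge, p_target in target_prev.items():
        p_bench = benchmark_prev.get(edge, 0.0)
        if p_bench == 0.0 or p_target / p_bench >= min_ratio:
            kept[edge] = p_target
    return kept


# Toy target population of 100 members with three diagnoses A, B and C.
target_nodes = {"A": 40, "B": 30, "C": 50}
target_edges = {("A", "B"): 24, ("A", "C"): 20, ("B", "C"): 1}
n_target = 100

kept = filter_edges(target_nodes, target_edges, n_target)
target_prev = {edge: target_edges[edge] / n_target for edge in kept}
benchmark_prev = {("A", "B"): 0.10}  # hypothetical benchmark prevalence
specific = differential_edges(target_prev, benchmark_prev)
print(kept)
print(specific)
```

In this toy case only the A-B edge survives: A-C co-occurs exactly as often as chance would predict, and B-C is too rare. The thresholds min_count, min_lift and min_ratio are the kind of parameters that would need tuning against the business question at hand.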
This differentiation step identifies the diagnoses and co-morbidities that specifically characterize the target population compared to the benchmark population. As a first application, Figures 3 and 4 show the diseasomes of a diabetic population with total claims below 5,000 USD and above 5,000 USD, respectively (calibrated on US medical claim data). The level of co-morbidity is naturally higher in the second case, where skin- and cardio-related conditions appear.
While in this example a specific medical condition and severity are used to define the cohort of interest, the analysis can also be done purely at a population level. Figures 5 and 6 show the diseasomes of kids and teenagers, respectively. Kids are characterized by a cluster of mood disorders and epilepsy-related conditions. Teenagers are still characterized by mood disorders, but at a different level of severity that includes schizophrenia and suicide, and also by various limb fractures.
Finally, the methodology can be used to analyse a given geography. For instance, how does the medical risk for male adults living in the Southern US differ from that of the rest of the male adults in the country? Figure 7 shows that this cohort is particularly exposed to a range of conditions, from circulatory problems such as hypertension to neoplasms of the genitourinary system.
Visualizing the network of diagnoses can yield many insights that help us better understand the risk, but the methodology used to reach them should be carefully mapped to the precise business requirements. Many parameters can be tuned to make the model focus on the most relevant risks, or it can be left as open as it is in the examples above.