In recent years, there has been growing interest in using artificial intelligence (AI) to uncover patterns and correlations of disease in large datasets, especially in projects funded by the European Union’s research and innovation programmes Horizon 2020 and Horizon Europe, which support cutting-edge research in various fields. One such project is EXIMIOUS, which is part of the European Human Exposome Network (EHEN) and, like the other eight projects in the network, has been running since 2020.
EXIMIOUS aims to deliver a novel approach to evaluating the human exposome – the cumulative exposure to various environmental factors over a lifetime – and elucidate its link to immune-mediated diseases. We spoke with Kenneth Kastaniegaard, CEO at Biogenity and partner in EXIMIOUS, to learn about his experiences working with AI in the exposome context. His team aims to discover potential pathways from occupational exposure towards development of autoimmune disease by analyzing nationwide data with AI technologies, focusing on rheumatoid arthritis.
Exposome: all the environmental exposures a person can encounter over the life course, from the prenatal period onwards. It includes diet, lifestyle, stress, pollution, and the many elements naturally present in the surrounding environment.
Job exposure matrices: Matrices of workplace exposures and job titles that enable assigning a value of exposure to each job title.
Neural network: A complex computational system made up of “artificial neurons”, inspired by biology.
Kenneth’s team has trained a series of neural networks combining nationwide health and job data with job exposure matrices derived from occupational measurement data. The data used in the project comes from the National Patient Register of Denmark, occupational history databases and job exposure matrices, which collectively provide information about patients’ workplaces and the exposures they may have encountered over the last 40 years. First, the team identified the occupational history of rheumatoid arthritis patients and accumulated data on their exposures in the workplace. Neural networks were then trained to look for patterns that would allow them to differentiate those patients from a healthy group with similar characteristics.
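The linkage step described above, combining each person's job history with a job exposure matrix to accumulate exposures, can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline: the job titles, exposure names, intensity values and the `cumulative_exposure` helper are all hypothetical, and real matrices are far larger.

```python
from collections import defaultdict

# Hypothetical job exposure matrix: job title -> {exposure: intensity per year}.
# All titles, exposures and values are invented for illustration.
JOB_EXPOSURE_MATRIX = {
    "welder": {"welding_fumes": 1.0, "job_strain": 0.4},
    "office_clerk": {"job_strain": 0.6},
}


def cumulative_exposure(job_history, jem):
    """Sum intensity x years for every exposure across a person's job history.

    job_history is a list of (job_title, years_in_job) pairs; job titles
    missing from the matrix contribute nothing.
    """
    totals = defaultdict(float)
    for job_title, years in job_history:
        for exposure, intensity in jem.get(job_title, {}).items():
            totals[exposure] += intensity * years
    return dict(totals)


# One invented person: the last job title is not covered by the matrix.
history = [("welder", 10), ("office_clerk", 5), ("astronaut", 2)]
profile = cumulative_exposure(history, JOB_EXPOSURE_MATRIX)
print(profile)
```

Under these assumptions, each person's exposure profile could then serve as one input row for a classifier that tries to separate patients from the comparison group.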
Many assumptions were tested, and early on the team realized that the models couldn’t be trained to achieve the desired accuracy. However, the models did seem to identify patterns in the data that could provide insights for researchers. This is possible thanks to Explainable AI, a growing area of research that, while still in its early stages, provides context for the decisions made by AI and allows researchers to explore which exposures were considered in the models’ decision-making process.
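One widely used, model-agnostic technique of this kind is permutation importance: shuffle one input at a time and measure how much the model's accuracy drops. The sketch below illustrates the idea only. The article does not state which explainability tools the team used, and `toy_model`, the feature names and the data are hypothetical stand-ins for a trained neural network and real register data.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, rng):
    """Drop in accuracy after shuffling one feature across all rows."""
    baseline = accuracy(model, rows, labels)
    shuffled_values = [r[feature] for r in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled_values)]
    return baseline - accuracy(model, shuffled_rows, labels)


def toy_model(row):
    """Hypothetical stand-in for a trained network: it only looks at 'solvents'."""
    return 1 if row["solvents"] > 0.5 else 0


rng = random.Random(0)
# Invented data: the label is fully determined by 'solvents'; 'noise' is irrelevant.
rows = [{"solvents": i % 2, "noise": rng.random()} for i in range(200)]
labels = [row["solvents"] for row in rows]

for feature in ("solvents", "noise"):
    drop = permutation_importance(toy_model, rows, labels, feature, rng)
    print(feature, round(drop, 3))
```

In this toy setting, shuffling the input the model relies on lowers accuracy, while shuffling the ignored input leaves it unchanged. With real, sparse exposure data the picture is much less clear-cut.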
This strategy allowed them to concentrate on a handful of job exposure factors and generate hypotheses that could then be studied with traditional epidemiological register analysis methods. As a result, a few psychosocial and chemical exposures were suggested to their Danish partners in EXIMIOUS, who will conduct the register studies of rheumatoid arthritis to investigate the relevance of the neural network models’ findings. This approach could lead not only to novel insights into autoimmune diseases, but also to getting more out of existing national data sources that can be hard to exploit with traditional methods. Kenneth’s team observed some intriguing patterns related to physical and psychosocial working conditions and chemical exposures, which they will publish in the near future.
Challenges in understanding complex diseases
Technical and scientific advances have provided a better understanding of many diseases, which has been fundamental for the development of new therapies and prevention measures. In turn, this has improved health conditions and life expectancy. However, the mechanisms and causes of some diseases, particularly diseases related to the immune system such as autoimmunity and allergies, are still unclear. A better understanding of the relationship between an individual’s biological factors and their interaction with environmental and social factors could be the key to painting a clearer picture. We believe that the use of AI can help us improve such understanding and possibly prevent new cases.
Challenges and lessons learned in using AI for disease prediction
Using AI technologies to explore patients’ health data and job exposures in order to generate hypotheses on the origins of disease is a new approach that has not been tested before. Its pioneering nature inevitably brings challenges and lessons to learn, but this is part of the journey, and it gives us the opportunity to look back and share recommendations based on our experience.
Kastaniegaard and his team quickly realized that they needed to adjust their expectations for prediction accuracy. Instead of aiming for 70-80%, they had to work with a lower accuracy of around 55-60%. This was a hard lesson to learn, according to Kastaniegaard. He emphasized the need to accept this limitation when working with data of such low quality and resolution, and to modify the questions posed to the models accordingly in order to formulate new hypotheses. The models cannot be used to predict disease outcomes because of the poor resolution and structure of the data, which had been anticipated. The question was whether the models could instead be used for hypothesis generation and data filtering. This is an issue that everyone working with AI needs to address: Is the data structured enough to provide the answer I am looking for, and can I trust it? Trust is critical when using these models for any purpose, and it is essential to align expectations with the model’s outcomes.
Another important obstacle when working with job exposure matrices and patients’ data is the uneven coverage of job exposures. For example, only a small fraction of the working population is potentially exposed to welding fumes, whereas everyone will experience some dimensions of the psychosocial work environment. Some job exposures are rare, so that as little as 2% of the population may be covered. In contrast, other exposures cover close to 100% of the population.
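The coverage gap described above can be made concrete with a small calculation. The sketch below assumes each person is represented by a dictionary of exposure values. The `exposure_coverage` helper and the population of 50 workers are invented for illustration and were chosen only to mirror the 2% versus 100% contrast.

```python
def exposure_coverage(profiles):
    """Return, per exposure, the fraction of people with a non-zero value."""
    exposures = sorted({name for p in profiles for name in p})
    n = len(profiles)
    return {e: sum(1 for p in profiles if p.get(e, 0) > 0) / n for e in exposures}


# Hypothetical population of 50 workers: everyone reports some job strain,
# but only one person has any exposure to welding fumes.
profiles = [{"job_strain": 0.5} for _ in range(49)]
profiles.append({"job_strain": 0.7, "welding_fumes": 3.0})

coverage = exposure_coverage(profiles)
print(coverage)
```

An input that is zero for 98% of the rows gives a model very few examples to learn from, which is one reason rare exposures are hard to evaluate.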
Furthermore, while more and more tools are available to evaluate which inputs different types of AI models deem important, less is known about interpreting the outputs of explainable AI for rare events and very sparse inputs. In particular, little is known about extracting useful information from a model that shows a globally low classification accuracy but might be successful on a more local scale (i.e. for specific subgroups).
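The difference between global and local accuracy can be illustrated by scoring predictions per subgroup. The following is a minimal sketch with invented numbers: the subgroup names, the prediction records and the `accuracy_overall_and_by_group` helper are hypothetical and do not reflect results from the project.

```python
from collections import defaultdict


def accuracy_overall_and_by_group(records):
    """records: list of (subgroup, true_label, predicted_label) tuples."""
    hits, counts = defaultdict(int), defaultdict(int)
    for group, truth, predicted in records:
        counts[group] += 1
        hits[group] += int(truth == predicted)
    overall = sum(hits.values()) / sum(counts.values())
    per_group = {g: hits[g] / counts[g] for g in counts}
    return overall, per_group


# Invented predictions: 9 of 10 correct for welders, 51 of 90 for everyone else.
records = (
    [("welders", 1, 1)] * 9 + [("welders", 1, 0)]
    + [("others", 0, 0)] * 51 + [("others", 0, 1)] * 39
)
overall, per_group = accuracy_overall_and_by_group(records)
print(overall, per_group)
```

Here the overall accuracy is 60%, yet the small subgroup is classified correctly 90% of the time. Whether such a local result is meaningful or a chance finding in a small group is exactly the open question raised above.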
One way to address this complication could have been to conduct a pilot study by synthesizing data and testing different AI models and explainability tools. This would have been a valid approach to gain insights into the various methods and their behaviors related to the challenges with the data. Yet it is important to keep in mind that pilot studies can only be as representative as their ability to mimic the exact challenges in the data.
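Such a pilot could start from synthetic data in which a known effect is planted, so that any method can be checked against the truth. The sketch below is one possible setup under stated assumptions: the prevalence and risk values are invented, and a simple risk difference stands in for the AI model and explainability tool that a real pilot would test.

```python
import random


def simulate_population(n, rng, rare_rate=0.02, common_rate=0.9,
                        base_risk=0.05, exposed_risk=0.30):
    """Synthetic people: only the rare exposure raises disease risk (planted)."""
    people = []
    for _ in range(n):
        rare = rng.random() < rare_rate
        common = rng.random() < common_rate
        risk = exposed_risk if rare else base_risk
        people.append({"rare": rare, "common": common, "case": rng.random() < risk})
    return people


def risk_difference(people, exposure):
    """Disease rate among the exposed minus the rate among the unexposed."""
    exposed = [p["case"] for p in people if p[exposure]]
    unexposed = [p["case"] for p in people if not p[exposure]]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


# Fixed seed so the synthetic population is reproducible.
population = simulate_population(20000, random.Random(42))
for exposure in ("rare", "common"):
    print(exposure, round(risk_difference(population, exposure), 3))
```

Because the true effect is known by construction, one can check whether a given tool flags the rare exposure and ignores the common one. As noted above, the exercise is only as informative as the simulation is faithful to the real data.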
With challenge comes opportunity
Although applying it is a challenging and explorative endeavour, AI offers exposome research the ability to make use of heterogeneous, unstructured data collected for different purposes, and to find a guided way through what is relevant and what is not. Much more research must be done on which models and tools to select for this type of work and how to evaluate the outcome correctly, but the potential to take vast amounts of data and narrow them down to something of interest for further study is a valuable opportunity.
AI has a great ability to identify patterns in data that are otherwise impossible to highlight, and may indicate relationships never explored before. However, unlike more traditional methods, the theoretical and mathematical foundations of these approaches are not fully formalized yet. As technologists and researchers, we need to find the right balance between exploration and thoroughness when adopting and exploring novel tools; ultimately, the goal is to provide tools that yield compelling results, and to address the gaps in theoretical foundations as the technology matures. One possibility that we are very excited about is assisting AI by providing context and prior knowledge to models that are otherwise purely data-driven. For example, we can tell the algorithms that a marker belongs to a gene that is part of certain metabolic pathways, or specify that some metabolic pathways are important in the metabolism of certain molecules. In this way, we can use context and relationships to optimize the outcome, so that the AI’s response is based not only on the data but also on existing knowledge about which mechanisms are related to inflammatory diseases, exploring these relationships at different levels.
Best practices for data-driven AI research projects
“Companies, organizations, and scientists holding databanks and other valuable data need to prioritize identifying which questions and answers are within reach and what kind of proof is needed to trust the models’ outcome. We need to work towards an aim where those who deliver the data can rely on the results that AI returns. These models have enormous strengths in some areas but also a lot of weaknesses. Therefore, we need to be more specific on the questions we want answered to ensure we don’t waste too much time and resources, especially when pioneering the field”, says Kastaniegaard.
For a funding agency like the European Commission, it could be important to consider investing in different categories of AI development. On the one hand, there is the development of classifiers to determine biological stages, such as diagnostic models, where thresholds for data size and quality are easier to define because high predictive accuracy is required. On the other hand, we wish to enable research projects like EXIMIOUS to use AI models to filter through enormous amounts of data and identify relevant associations. Here, too many restrictions and regulations could inhibit the research and the pioneering of the field.
As Kastaniegaard sums up, “We want researchers to start testing many different types of approaches to evaluate AI models, and we need to assess the limitation on the explainable tools as well”. He urges against limiting such exploration and instead calls for a thorough understanding of which questions suit the different categories and uses of AI models.