AI has a transformative role to play in healthcare, from improving diagnosis and care planning, to addressing conditions that have been historically difficult to treat. The integration of AI into healthcare is already well underway, with notable and increasing adoption of AI technologies evident across the industry.
AI is a powerful and innovative force to improve healthcare practices and outcomes, but the book on testing AI is not yet written, let alone for safety-critical applications.
Some of the key approaches to testing AI-infused applications can be drawn from traditional testing and AI data-collection practices. However, other challenges require us to innovate our Quality Engineering practices in areas unique to the world of AI-infused healthcare applications.
A significant concern in AI systems is their susceptibility to bias and the unfair outcomes it can produce.
These are valid concerns, but there are times – particularly in the medical world – when a dataset that may appear biased is actually truly representative of the real world, and this needs to be factored in. Just think of the myriad conditions that affect one demographic, gender or ethnicity more than others; treatment plans that – due to financial, legal or regulatory reasons – are only available in certain geographic regions; and conditions that occur as a side-effect of, or are remediated by, other treatments and routines.
We need Quality Engineering approaches that can identify unfair bias in an AI without penalizing software in cases where there is a legitimate imbalance in the data.
A key challenge in AI-infused systems across industries is explainability. How do we know how an AI has made its decision? For many AI architectures, including some of the deep learning and generative models that we tailor to the most challenging problems, there’s an inherent difficulty in understanding how large inputs, with potentially millions of parameters, drive a given output.
Whilst we can show statistically that, in many problem domains and even in highly specialized tasks, AI decision-making power can outstrip our own, if we can't explain the decisioning process behind something that may have a critical impact on a patient's wellbeing, there can be legal implications and ethical blocks to using that output.
Optimal performance and ultra-reliability are critical considerations in healthcare AI contexts to ensure accurate diagnostics, timely treatment decisions and patient safety.
From computer viruses to human viruses: a malicious or terrorist attack on an AI in charge of a patient's health and treatment could have catastrophic, life-changing impacts, or even result in loss of life.
Healthcare AIs must therefore be dependable to safeguard sensitive patient data, maintain confidentiality, and prevent unauthorized access. The attack surface on an AI powered system may be very different to that of a traditional software offering but must still be thoroughly tested to ensure integrity of the solution.
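As an illustration of what such testing might look like, here is a minimal sketch of a robustness check. It assumes a scikit-learn-style classifier with a `predict()` method and measures how often predictions stay stable when inputs are perturbed with small amounts of random noise – a crude proxy for resilience to noisy or maliciously altered inputs, not a complete security test.

```python
import numpy as np

def prediction_stability(model, X, noise_scale=0.01, n_trials=20, seed=0):
    """Fraction of cases whose prediction never changes under small perturbations.

    Assumes `model` exposes a scikit-learn style predict() and that features
    are standardized, so `noise_scale` is a small relative perturbation.
    """
    rng = np.random.default_rng(seed)
    baseline = model.predict(X)
    stable = np.ones(len(X), dtype=bool)
    for _ in range(n_trials):
        # Add small Gaussian noise and check whether the prediction flips
        noisy = X + rng.normal(scale=noise_scale, size=X.shape)
        stable &= model.predict(noisy) == baseline
    return stable.mean()
```

A low stability score here would prompt deeper adversarial and penetration testing rather than acting as a pass/fail gate on its own.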
One of the key problems for developing fair, reliable, high-quality AI-infused healthcare applications is balancing the AI-centric considerations with those applicable to any software, without jeopardizing patient safety, our ability to deliver software or our ability to continuously improve outcomes.
These errors are present in the foundational data that we build our AI from. Whether caused by null, missing or corrupted rows; violations of logical constraints in our data; poor sampling; or poor labelling, these are the errors encountered at the start of our AI development journey. To identify them, an in-depth understanding of our data, problem domain and pipelines is required.
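By way of illustration, here is a minimal sketch of such low-level data checks using pandas; the column names (`age`, `diagnosis_label`) and constraints are hypothetical stand-ins for a real clinical schema.

```python
import pandas as pd

def check_data_integrity(df: pd.DataFrame) -> list[str]:
    """Run basic integrity checks on a training dataset; return a list of issues."""
    issues = []

    # Null or missing values in any column
    null_counts = df.isnull().sum()
    for col, count in null_counts[null_counts > 0].items():
        issues.append(f"{count} null values in column '{col}'")

    # Duplicate rows, a common sign of ingestion errors
    dup = df.duplicated().sum()
    if dup:
        issues.append(f"{dup} duplicated rows")

    # Logical constraints, e.g. age must be plausible (assumed range)
    if "age" in df.columns:
        bad_age = df[(df["age"] < 0) | (df["age"] > 120)]
        if len(bad_age):
            issues.append(f"{len(bad_age)} rows violate the age constraint (0-120)")

    # Labels must come from the expected set (hypothetical label values)
    if "diagnosis_label" in df.columns:
        bad = ~df["diagnosis_label"].isin({"positive", "negative"})
        if bad.any():
            issues.append(f"{bad.sum()} rows with unexpected labels")

    return issues
```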
Even with great data we can still see poorly performing AI. Whether that's because of under- or over-fitting a model; low stability or poor ability to adapt to new data; algorithmic or learned bias; or the use of sub-optimal features, these errors will prevent an AI initiative's success. To detect and avoid them, we need to understand how our Data Scientists and Engineers choose algorithms, parameters and training approaches; identify test strategies that push the boundaries of our model and expose weak behavior; and contrast our model's performance with real-world users and existing systems.
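One simple signal for over-fitting is the gap between training and validation performance. A minimal sketch using scikit-learn on a synthetic dataset (standing in for curated clinical data), with an assumed gap threshold for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for a curated clinical dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = RandomForestClassifier(random_state=0)
scores = cross_validate(model, X, y, cv=5, scoring="roc_auc",
                        return_train_score=True)

train_auc = scores["train_score"].mean()
valid_auc = scores["test_score"].mean()
print(f"train AUC={train_auc:.3f}, validation AUC={valid_auc:.3f}")

# A large train/validation gap is a classic over-fitting signal
if train_auc - valid_auc > 0.05:  # threshold is an assumption, tune per problem
    print("Warning: possible over-fitting; review features, regularization and data volume")
```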
As with any component of a system, a strong-performing AI model that is not effectively integrated into our processes or software will cause issues with our day-to-day operation. These issues may include a model introducing performance or security bottlenecks, or an over-reliance on the model and a susceptibility to automation bias. We need careful end-to-end tests of AI systems and acceptance checks of the processes around them to ensure that, when we use intelligent components, we use them in the optimal ways for our healthcare services rather than weakening our overall healthcare provision.
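As one example of such an end-to-end acceptance check, the sketch below exercises a hypothetical AI-backed triage endpoint, asserting both output sanity and a latency budget. The URL, payload, response fields and two-second budget are all assumptions for illustration.

```python
import time
import requests

# Hypothetical endpoint and payload for an AI-backed triage service
ENDPOINT = "https://example.internal/triage/predict"
PAYLOAD = {"age": 54, "symptoms": ["chest pain", "shortness of breath"]}

def test_prediction_latency_and_shape():
    """End-to-end acceptance check: response is valid and fast enough for clinical use."""
    start = time.monotonic()
    response = requests.post(ENDPOINT, json=PAYLOAD, timeout=5)
    elapsed = time.monotonic() - start

    assert response.status_code == 200
    body = response.json()
    # The model must return a risk score in a usable range, plus an explanation field
    assert 0.0 <= body["risk_score"] <= 1.0
    assert "explanation" in body
    # Performance budget agreed with clinical stakeholders (assumed here: 2 seconds)
    assert elapsed < 2.0
```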
These kinds of challenges need to be met not only by organizations' and healthcare providers' internal assurance practices, but also factored into the static and dynamic checks required by regulatory, compliance and formal verification and validation (V&V) frameworks. Whilst big questions remain on how to Quality Engineer AI in line with these frameworks as they stand today, they will almost certainly be updated with more and more AI-specific considerations as the uptake of these technologies in healthcare grows.
Whilst there are specific challenges to Quality Engineering AI, and nuances to be accounted for, it's not all bad news. Professionals around the world are working hard to understand these challenges, identify where they are introduced, how to mitigate them, and how to make the path to safe, responsible AI as smooth as possible.
To learn more about testing of AI-infused applications, visit Forrester for Diego Lo Giudice's video on minimizing risk and bias here. *
* registration required.
Data is the foundational component of an AI. To ensure reliability, fairness and robustness, the training data used for healthcare AI models needs to be validated as diverse and representative of the population it serves; data that is a good representation of the population helps mitigate the risk of poor model fit, low stability and biased outcomes.
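One way to validate representativeness is a goodness-of-fit test comparing the demographic mix of the training data against the proportions of the served population. A minimal sketch with SciPy, using hypothetical counts and population proportions:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical demographic counts in the training data (e.g. three age bands)
observed = np.array([5200, 3100, 1700])

# Expected counts if the data matched the served population (assumed proportions)
population_props = np.array([0.45, 0.35, 0.20])
expected = population_props * observed.sum()

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.05:
    print("Training data mix differs significantly from the served population")
```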
AI training-data impact assessments should be deployed to evaluate how variations in training data affect overall performance; data should be subjected to low-level tests to confirm its integrity; and models should be stressed with samples and datasets curated both to push the model's boundaries and to maximize its performance in the real world.
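A training-data impact assessment can be as simple as retraining on deliberately degraded variants of the data and tracking held-out performance. A sketch with scikit-learn, using a synthetic dataset and random row-dropping as the (assumed) degradation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a clinical training set
X, y = make_classification(n_samples=5000, n_features=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

def auc_after_dropping(fraction, seed=1):
    """Retrain after removing a random fraction of training rows; return test AUC."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(X_train)) >= fraction
    model = LogisticRegression(max_iter=1000).fit(X_train[keep], y_train[keep])
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

for frac in (0.0, 0.25, 0.5, 0.75):
    print(f"dropped {frac:.0%} of training data -> AUC {auc_after_dropping(frac):.3f}")
```

A sharp performance cliff as data is removed flags a model that is fragile to shifts in data volume or coverage.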
In implementing healthcare AI there will be clinically- and problem-relevant biases as well as undesirable ones. As datasets are curated and models implemented, we need to work closely with SMEs to design and perform rigorous performance analysis, assess model variability and reliability, and review security impacts across demographics, ensuring we build and operate in line with the nuances and context-dependencies of bias, mitigating risk and promoting fairness.
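In practice, this can start with reporting headline metrics per demographic subgroup, so SMEs can judge which gaps reflect legitimate clinical differences and which signal unfair bias. A minimal sketch, with hypothetical labels, predictions and group codes:

```python
import pandas as pd
from sklearn.metrics import recall_score

# Hypothetical held-out results: true labels, model predictions, demographic group
results = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 1],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 1],
    "group":  ["A", "A", "A", "B", "B", "B", "B", "A"],
})

# Per-group sensitivity (recall): gaps here prompt SME review, not automatic failure
for group, subset in results.groupby("group"):
    sensitivity = recall_score(subset["y_true"], subset["y_pred"])
    print(f"group {group}: sensitivity = {sensitivity:.2f}")
```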
We need to understand how changes in variables affect model outputs, conducting sensitivity analyses to evaluate how sensitive performance, reliability and security are to those variables. As we identify weaknesses or extremes of sensitivity, we need to develop tests and mitigations to ensure safety and stability across inputs and data.
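A minimal one-at-a-time sensitivity analysis might look like the following sketch; it assumes standardized numeric features and a scikit-learn-style `predict_proba`, and reports how much each feature shifts the predicted risk:

```python
import numpy as np

def one_at_a_time_sensitivity(model, X, delta=0.1):
    """Mean absolute change in predicted risk when each feature is nudged by `delta`.

    Assumes standardized numeric features and a scikit-learn style predict_proba.
    Unusually large values flag features the model is fragile around.
    """
    baseline = model.predict_proba(X)[:, 1]
    sensitivities = {}
    for j in range(X.shape[1]):
        X_shifted = X.copy()
        X_shifted[:, j] += delta          # nudge one feature, hold the rest fixed
        shifted = model.predict_proba(X_shifted)[:, 1]
        sensitivities[j] = np.mean(np.abs(shifted - baseline))
    return sensitivities
```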
Implement transparent and explainable AI models to understand how decisions are made wherever possible, assuring fairness and reliability, and establish new methods to improve explainability. Explainable AI (XAI) is essential for clinicians to trust and interpret the results and apply the outputs in their daily work.
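Full explainability is model-dependent, but a model-agnostic starting point is permutation importance: shuffling each feature on held-out data and measuring the drop in performance. A sketch with scikit-learn on a synthetic dataset standing in for clinical features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# How much does held-out performance degrade when each feature is shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.4f}")
```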
Implementation teams need to:
Provide education and training for healthcare professionals on how to interpret AI outputs, understand the limitations of models, and recognize potential biases. This enhances collaboration between AI systems and healthcare practitioners.
Provide education on interpreting and understanding performance metrics, enabling healthcare professionals to make informed decisions on the use of AI and to avoid automation and uptake bias in their work. We should further train healthcare professionals to assess and contribute to the reliability of AI models in clinical settings, and implement programs to raise awareness of the importance of security in AI applications.
In navigating the transformative potential of AI in healthcare, quality practitioners need to take on not just the technical considerations in which these technologies differ from traditional software, but also the business and disciplinary considerations of intelligent systems that support safety-critical, potentially life-altering decisions. As AI continues to revolutionize medical practices, addressing nuanced considerations in dynamic bias, performance, usability, cultural sensitivity and localization may become as important as raw technical performance.
Further, developing new compliance and V&V frameworks and approaches will require a combination of technological advancement and a commitment to ethical practices across industry and regulatory bodies.
As AI encroaches more and more into healthcare systems, safe and successful uptake will require the expertise of emerging high-skill roles dedicated to ensuring the effectiveness, fairness, and ethical use of AI in the evolving landscape of healthcare.