In October, I was fortunate to attend the Health AI Systems Thinking for Equity (HASTE) workshop as a mentor, hosted by the Temerty Centre for AI Research and Education in Medicine (T-CAIREM – I know it’s all quite a mouthful!) at UofT and Dr. Leo Celi from MIT. This was a workshop designed for computer scientists, engineers, and healthcare experts alike with an interest in how AI systems and technologies are being integrated into safety-critical settings like hospitals or the healthcare system more generally. We discussed a variety of papers and articles, which I’ll link at the end, and how they relate to key concepts like fairness, accountability, and transparency.
We split into groups, each considering one such paper, so I’ll go into a bit more detail about how the discussion of our paper went, followed by some key takeaways at the end.
Our discussion
Our group covered the paper by Griot et al. (see Resources) about how language models do not actually “learn” from the content of their data but rather from its structure. This was in the context of learning medical information and diagnoses, similar to the written tests used to place medical students in residency, which consist entirely of multiple-choice questions (MCQs). The study used existing generative AI to create MCQs about an invented medical diagnosis. The paper claims that the LLMs are simply pattern matching on the structure of the questions to determine the answer, as opposed to “learning” the material and being able to reason. However, it’s also important to note that MCQs are typically surface-level assessments for humans as well.
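To make the setup concrete, here is a rough sketch of the core idea. This is not the authors’ code, and the condition name, questions, and options below are invented purely for illustration. Because the condition is fictional, no real medical knowledge can help; yet a purely structural heuristic like “pick the longest, most detailed option” can still score above chance, which is the kind of shortcut the paper argues LLMs exploit.

```python
import random

# A toy sketch (not the authors' code): MCQs about a fictional condition, so no
# real medical knowledge applies, answered by a purely structural heuristic.

# Hypothetical questions; the condition and its "facts" are invented for illustration.
QUESTIONS = [
    {
        "stem": "Which marker is most elevated in acute Glianorex syndrome?",
        "options": ["GX-12", "A transiently raised serum glianorexin level", "KL-3", "TP-9"],
        "answer": 1,  # the "correct" option is deliberately the longest, most detailed one
    },
    {
        "stem": "First-line management of a Glianorex flare is:",
        "options": ["Rest", "Ice", "Early initiation of targeted glianorexin-blocking therapy", "Heat"],
        "answer": 2,
    },
]


def structural_guess(options: list[str]) -> int:
    """A 'model' with zero medical knowledge: always pick the longest option."""
    return max(range(len(options)), key=lambda i: len(options[i]))


def chance_guess(options: list[str]) -> int:
    """Random baseline for comparison."""
    return random.randrange(len(options))


def accuracy(guesser) -> float:
    correct = sum(guesser(q["options"]) == q["answer"] for q in QUESTIONS)
    return correct / len(QUESTIONS)


if __name__ == "__main__":
    random.seed(0)
    print("structural heuristic:", accuracy(structural_guess))  # 1.0 on this toy set
    print("random baseline:     ", accuracy(chance_guess))
```

The actual study, of course, evaluates real LLMs rather than a hand-coded heuristic; this toy example is only meant to show why above-chance accuracy on fictional content suggests pattern matching rather than understanding.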
Worst-case scenarios
Nonetheless, we wanted to properly consider how the claims from this paper could affect the healthcare system as LLMs are integrated into it. Although the paper discussed medical implications, it did not specify the tasks these LLMs would be performing. As a group, we made some assumptions about the contexts in which these LLMs would operate: for example, an app-based diagnosis tool that replaces check-ins with a regular doctor, doctors who might use a similar tool themselves, or a hospital or healthcare system that verifies and trusts such an app. From there, we considered the “worst-case” scenarios, which would include receiving an incorrect diagnosis that results in loss of life, or a societal loss of trust in the healthcare system.
Ideas and points of discussion
It’s not enough to simply catastrophize about the possible implications of this paper; we should also consider what can be done about them! The majority of our discussion was about the pros and cons of certain procedures that might help avert our worst-case scenarios.
Accountability
- Some governmental regulation might be beneficial to decide the minimum level of “checking” needed to ensure the model has not degraded and its performance is still within an acceptable threshold (see the sketch after this list).
- Institutions should be given incentives to comply with these “checking” requirements; one idea is financial rewards.
- The goal is to ensure model longevity and encourage the integration of research into society.
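As a very rough illustration of what this kind of “checking” could look like in practice, here is a minimal sketch. The threshold, benchmark, and model call are all placeholders I made up; they are not from the paper or any existing regulation.

```python
# A minimal sketch of periodic performance "checking" against a fixed threshold.
# The threshold value, benchmark cases, and toy model below are all hypothetical.

ACCEPTABLE_ACCURACY = 0.90  # hypothetical regulatory floor


def run_check(model, benchmark) -> tuple[float, bool]:
    """Re-run a fixed, audited benchmark and flag whether the model still passes."""
    correct = sum(model(case["input"]) == case["expected"] for case in benchmark)
    accuracy = correct / len(benchmark)
    return accuracy, accuracy >= ACCEPTABLE_ACCURACY


if __name__ == "__main__":
    # Toy stand-ins: a "model" that uppercases its input, and a two-case benchmark.
    toy_model = lambda prompt: prompt.upper()
    benchmark = [
        {"input": "triage: chest pain", "expected": "TRIAGE: CHEST PAIN"},
        {"input": "triage: headache", "expected": "TRIAGE: HEADACHE"},
    ]
    accuracy, passed = run_check(toy_model, benchmark)
    print(f"accuracy={accuracy:.2f}, within threshold: {passed}")
```

In a real deployment the benchmark would be a held-out, audited clinical evaluation set and the check would be scheduled (and reported) on a regular basis, but the basic shape would be similar.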
Fairness
- The paper also shows that the models perform worse in a non-English language like French, a problem that is well known within the ML field; however, representation of different languages is very important in healthcare spaces, so mandating tests on data in different languages might help address this issue.
- Another consideration is which subsets of the population would be more or less willing to try new or less-tested tools, as well as which subsets are underrepresented in the training data: for example, differences in financial stability, whether patients come from a rural or urban background, and other disenfranchised groups.
Transparency
- The entity providing the service should clearly disclose to the public, or whoever is using the service, that it is AI; at this point in time, many people are aware of the dangers and inconsistencies of AI.
- Some aspects of the results (e.g., blind trials, different tests, model results) and any key limitations (e.g., of the data) should be made available in a manner that the public can access and understand.
- Some form of data persistence available to users, such as conversation history in case of legal issues, might be beneficial. However, patient privacy is an important factor to consider as well.
- Another interesting question to consider: if LLMs are learning the structure rather than the content, and it’s established that MCQs are surface-level evaluations for humans, is this the best way to evaluate medical students for important decisions like residency placements?
Other discussions
Although I was not part of any of the other groups’ discussions, we did get to share our key takeaways and any other interesting points. Some of the other papers discussed how bad actors can extract gigabytes of training data from LLMs without prior knowledge of the training dataset, or how AI can infer sensitive attributes from images based on features invisible to humans. Other groups suggested areas of focus such as open-source health code and developing guidelines and analyses for building datasets. Another interesting suggestion was the idea of a “model card” that lets the public quickly grasp the key metrics and quantify the bias of a given model or application.
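Since the “model card” suggestion stuck with me, here is a small, hypothetical sketch of what such a card could capture in code. The field names, example values, and the simple bias measure (the largest metric gap across reported subgroups) are my own assumptions for illustration, not any established standard.

```python
from dataclasses import dataclass, field

# A hypothetical, minimal "model card" structure; all fields and example values
# below are illustrative assumptions, not a standard format.


@dataclass
class ModelCard:
    name: str
    intended_use: str
    overall_metrics: dict[str, float]               # e.g., {"accuracy": 0.91}
    subgroup_metrics: dict[str, dict[str, float]]   # metrics broken down by subgroup
    known_limitations: list[str] = field(default_factory=list)

    def largest_gap(self, metric: str) -> float:
        """Quantify bias as the spread of a metric across the reported subgroups."""
        values = [m[metric] for m in self.subgroup_metrics.values() if metric in m]
        return max(values) - min(values) if values else 0.0


if __name__ == "__main__":
    card = ModelCard(
        name="example-diagnosis-assistant",
        intended_use="Decision support only; not a replacement for a clinician.",
        overall_metrics={"accuracy": 0.91},
        subgroup_metrics={
            "English prompts": {"accuracy": 0.93},
            "French prompts": {"accuracy": 0.84},
        },
        known_limitations=["Evaluated on multiple-choice benchmarks only."],
    )
    print(f"Accuracy gap across subgroups: {card.largest_gap('accuracy'):.2f}")
```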
Conclusion
As this workshop was attended by medical practitioners, students, and academics, we had a wide variety of perspectives represented. Of course, there is always room for more, such as people with expertise in policy, but nonetheless, this exchange of ideas amongst people with different experiences and knowledge was a key component of the workshop. I imagine that it is hard to create any actionable change or to handle problems as they arise without interdisciplinary collaboration between experts from different fields, which is one benefit of attending workshops like these.
All in all, I believe the integration of ML research into safety-critical contexts like healthcare requires careful thought and consideration. Especially as computer science students – with some of us perhaps going on to work or study in related areas – it might be interesting to think about the implications of our work.
I’ve included all the links to the papers we discussed below. A proceedings paper with more detail from all the group leaders is also in progress, so stay tuned if you’re curious, and I hope this gave you some enjoyable food for thought!
Resources
- Chen, Lingjiao, Matei Zaharia, and James Zou. “How is ChatGPT’s behavior changing over time?” arXiv preprint arXiv:2307.09009 (2023).
- Gichoya, Judy Wawira, et al. “AI recognition of patient race in medical imaging: a modelling study.” The Lancet Digital Health, vol. 4, no. 6, e406–e414.
- Goodman, K. E., P. H. Yi, and D. J. Morgan. “AI-Generated Clinical Summaries Require More Than Accuracy.” JAMA 331(8) (2024): 637–638. doi:10.1001/jama.2024.0555
- Griot, Maxime, et al. “Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data.” arXiv preprint arXiv:2406.02394 (2024).
- Gurovich, Y., Y. Hanani, O. Bar, et al. “Identifying facial phenotypes of genetic disorders using deep learning.” Nature Medicine 25 (2019): 60–64. https://doi.org/10.1038/s41591-018-0279-0
- Nasr, Milad, et al. “Scalable extraction of training data from (production) language models.” arXiv preprint arXiv:2311.17035 (2023).
- Wang, Y., and M. Kosinski. “Deep neural networks are more accurate than humans at detecting sexual orientation from facial images.” Journal of Personality and Social Psychology 114(2) (2018): 246–257. https://doi.org/10.1037/pspa0000098
- Wilson, F. P., M. Martin, Y. Yamamoto, C. Partridge, E. Moreira, T. Arora, et al. “Electronic health record alerts for acute kidney injury: multicenter, randomized clinical trial.” BMJ 372 (2021): m4786. doi:10.1136/bmj.m4786
- Zack, Travis, et al. “Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study.” The Lancet Digital Health, vol. 6, no. 1 (2024): e12–e22. https://doi.org/10.1016/S2589-7500(23)00225-X
- Hospitals struggle to validate AI-generated clinical summaries. ‘It’s a bit chaotic’
- Does AI Help or Hurt Human Radiologists’ Performance? It Depends on the Doctor