The internet has long been a source for individuals seeking to understand their symptoms, a practice often frowned upon by medical professionals who cite the unreliability of “Dr. Google.” However, a new paradigm is emerging in self-diagnosis: artificial intelligence (AI) chatbots. Tools like OpenAI’s ChatGPT, Microsoft’s Bing (powered by OpenAI), and Google’s Med-PaLM are leveraging large language models (LLMs) to provide surprisingly sophisticated answers to health-related queries. As healthcare systems grapple with workforce shortages, the potential for AI in health diagnosis to assist patients and alleviate pressure on medical professionals is gaining serious attention. Initial research indicates that these AI programs surpass traditional search engines in accuracy, leading some experts to predict imminent collaborations between major medical centers and LLM chatbots for patient interaction and preliminary diagnosis.
Yale neurologist and ethicist Benjamin Tolchin observes this shift firsthand. He notes that patients are already using ChatGPT to self-diagnose symptoms and research medication side effects. While acknowledging that the technology is still in its early stages, Tolchin finds the responses generated by these AI tools “reasonable” and “impressive,” highlighting their “future potential” in healthcare.
The Double-Edged Sword of AI-Assisted Diagnosis
Despite the promising advancements in AI-assisted health diagnosis, concerns persist. Tolchin and other experts caution against potential pitfalls, including questions about the accuracy of AI-generated medical information, privacy violations, and the perpetuation of racial and gender biases embedded in the training data. Furthermore, the way patients interpret AI-provided information and the potential for over-reliance pose new risks that go beyond the limitations of basic online searches or symptom checkers.
The integration of technology into healthcare has been accelerating, particularly evident in the surge of patient-physician digital communication during the COVID-19 pandemic. While simpler chatbots are already employed for administrative tasks in healthcare, LLM chatbots have the potential to fundamentally transform doctor-AI collaboration and redefine the diagnostic process. Nina Singh, a medical student at New York University specializing in AI in medicine, emphasizes the rapidly evolving nature of this field, underscoring the need for careful consideration and responsible implementation.
A study, not yet peer-reviewed, by Harvard University epidemiologist Andrew Beam and colleagues provides compelling evidence of AI’s diagnostic capabilities. The researchers presented OpenAI’s GPT-3 with 48 symptom-based patient descriptions. The AI included the correct diagnosis among its top three suggestions in 88% of cases. That figure, while slightly lower than the 96% accuracy achieved by physicians given the same prompts, significantly surpasses the 54% accuracy of people without medical training. It also dramatically outperforms traditional online symptom checkers, which, according to earlier studies, list the correct diagnosis among their top three possibilities only 51% of the time.
Beam expresses surprise at the “out of the box” symptom-checking prowess of these “autocomplete things.” He emphasizes the user-friendly interface of chatbots, which lets patients describe their experiences in natural language, a stark contrast to the rigid formats of statistical symptom checkers. The bots’ ability to ask follow-up questions, mirroring a doctor’s approach, further enhances their diagnostic potential. However, Beam acknowledges limitations, noting that the study’s carefully crafted symptom descriptions may not reflect the complexities and potential inaccuracies of real-world patient accounts.
Navigating the Pitfalls: Accuracy, Bias, and Trust in AI Health Tools
Misinformation poses a significant challenge for AI in health diagnosis. LLMs, trained on vast datasets of online text, risk assigning equal credibility to sources of widely varying reliability, from reputable institutions such as the CDC to potentially misleading online forums. While OpenAI says it “pretrains” its models to steer responses toward what users intend, the specifics of how sources are weighted remain unclear. The risk of “hallucinations,” in which the AI fabricates information, has prompted disclaimers against using ChatGPT for serious diagnoses or life-threatening conditions.
The potential for malicious actors to manipulate future AI responses by flooding the internet with misinformation, particularly on sensitive topics such as vaccines, is a growing concern. Google’s continuously learning chatbots are especially vulnerable to such manipulation. N.Y.U. computer engineer Oded Nov warns of this “new front of attempts to channel the conversation.”
While Microsoft Bing’s approach of linking to sources offers a potential solution, LLMs have been shown to fabricate nonexistent sources, shifting the burden of verification onto the user. Alternative approaches include curating the material an AI draws on or adding fact-checking mechanisms, though scalability remains a concern.
Google’s Med-PaLM adopts a different strategy, drawing from curated datasets of real patient-provider interactions and medical licensing exams. In preprint testing, Med-PaLM aligned with medical consensus 92.6% of the time, statistically comparable to human clinicians’ 92.9%. While chatbot answers were slightly more prone to content omissions, they exhibited a marginally lower risk of causing harm.
The ability of AI to pass medical licensing exams, demonstrated by Med-PaLM and ChatGPT, is noteworthy. However, Google’s Alan Karthikesalingam emphasizes that real-world healthcare transcends multiple-choice scenarios, requiring a nuanced understanding of patient, provider, and social contexts. Med-PaLM’s training on real-world patient data aims to address this complexity.
The rapid pace of AI deployment in healthcare also raises regulatory concerns: MIT computer scientist Marzyeh Ghassemi warns that the technology is being deployed faster than regulatory frameworks can adapt.
Addressing Bias and Fostering Equitable AI in Healthcare
Ghassemi is particularly concerned about AI perpetuating existing biases in medicine and society. Models trained on human-generated data inevitably inherit societal prejudices, such as disparities in pain medication prescriptions for women and racial biases in mental health diagnoses. Beam’s unpublished research indicates that ChatGPT exhibits biases in how much it trusts symptom descriptions depending on race and gender. OpenAI has not yet responded to inquiries about how it mitigates bias in medical applications.
While eradicating bias from the internet is unrealistic, Ghassemi suggests proactive bias audits and interventions. In one of her studies, an intentionally “evil” LLM chatbot gave discriminatory advice; users, including medical professionals, were more likely to follow that advice when it was framed as an instruction rather than as a statement of fact, highlighting how much presentation style influences trust.
Karthikesalingam emphasizes the role of diverse development teams in mitigating bias in Med-PaLM. He acknowledges bias mitigation as an ongoing process contingent on real-world usage and feedback.
Building trust in AI for health diagnosis is paramount. It remains unclear whether sifting through search engine results fosters greater user discernment than receiving a direct answer from a chatbot. Tolchin worries that the conversational nature of chatbots could lead to over-trust and the disclosure of sensitive personal information. Privacy concerns are amplified by data collection practices, as outlined in OpenAI’s disclaimers.
Public acceptance of AI in healthcare remains uncertain. A Pew Research Center survey revealed that approximately 60% of Americans are uncomfortable with AI playing a diagnostic or treatment-recommending role in their healthcare. The blurring lines between human and AI interaction, highlighted by studies demonstrating the difficulty in distinguishing ChatGPT from physicians, further complicate the issue of trust.
Devin Mann, an NYU Langone Health physician, suggests that AI’s detailed and patient explanations may be beneficial for some users. However, trust levels decrease with increasing question complexity and perceived risk.
Mann anticipates the eventual integration of AI into diagnostic and treatment processes, emphasizing the critical need for readily available human physician access as a safety net. The imminent announcement of AI chatbot collaborations by major medical centers raises crucial questions regarding service charges, data protection, and liability in cases of AI-related harm. Nov emphasizes the need to train healthcare providers for effective collaboration in a three-way AI-doctor-patient interaction.
In the interim, cautious implementation, potentially within clinical research settings, is advocated to allow for thorough vetting and refinement. Tolchin finds reassurance in ChatGPT’s consistent recommendation for physician evaluation, suggesting a built-in safety mechanism.
This article is part of an ongoing series on generative AI in medicine.
*Editor’s Note (4/3/23): This sentence has been updated to clarify how OpenAI pretrains its chatbot model to provide more reliable answers.