The internet has long been the first port of call for individuals experiencing unfamiliar symptoms. While doctors have traditionally cautioned against the perils of “Dr. Google,” citing its lack of context and unreliable sources, a new era of self-diagnosis is dawning. Artificial intelligence (AI) chatbots, powered by sophisticated large language models (LLMs), are emerging as significantly more potent tools for understanding health concerns. Programs like OpenAI’s ChatGPT, Microsoft’s Bing (utilizing OpenAI’s technology), and Google’s Med-PaLM are being explored for their potential to revolutionize how individuals approach preliminary diagnosis and health information seeking. Trained on vast datasets of internet text, these AI systems can predict word sequences to answer questions in a remarkably human-like and informative manner. In a healthcare landscape grappling with workforce shortages, the promise of AI chatbots to assist individuals with their health inquiries is attracting considerable attention from researchers and medical professionals alike. Initial studies indicate that these AI programs demonstrate a notable leap in accuracy compared to conventional search engine results, sparking predictions that major medical institutions may soon integrate LLM chatbots into patient interaction and diagnostic processes.
Dr. Benjamin Tolchin, a neurologist and ethicist at Yale University, has observed this shift in patient behavior firsthand. Even at this early stage of public availability, he notes, patients are using AI chatbots such as ChatGPT to investigate their symptoms or to research medication side effects. While acknowledging that these observations are preliminary, Dr. Tolchin says the responses generated by these AI tools have been impressively coherent and relevant. “It’s very impressive, very encouraging in terms of future potential,” he remarks.
The Rise of AI-Assisted Diagnostic Tools
Digital tools have been working their way into medical practice for years, a trend accelerated by the COVID-19 pandemic, during which digital patient portal messages to physicians increased by over 50%. Basic chatbots are already employed within healthcare systems for administrative tasks such as appointment scheduling and delivering general health information. The advent of well-informed LLM chatbots, however, could mark a step change, deepening doctor-AI collaboration and extending AI into the diagnostic process itself.
A compelling study, not yet peer-reviewed, was posted to the preprint server medRxiv in February. Researchers led by epidemiologist Andrew Beam of Harvard University formulated 48 prompts describing patients’ symptoms and fed them into OpenAI’s GPT-3, the algorithm underpinning ChatGPT at the time. The LLM included the correct diagnosis among its top three candidates in 88% of cases. For context, human physicians achieved a 96% success rate with the same prompts, while individuals without medical training reached only 54%.
“It’s crazy surprising to me that these autocomplete things can do the symptom checking so well out of the box,” Beam states, emphasizing the unexpected proficiency of these language models in symptom analysis. Prior research has indicated that traditional online symptom checkers, reliant on statistical algorithms, achieve a correct diagnosis within the top three possibilities only 51% of the time. This stark contrast further highlights the advancement represented by AI chatbots.
The user-friendliness of chatbots also presents a significant advantage over conventional symptom checkers. Instead of navigating rigid, statistically driven programs, individuals can describe their health concerns in natural language. “People focus on AI, but the breakthrough is the interface—that’s the English language,” Beam points out, underscoring the intuitive nature of interacting with these AI systems. Furthermore, AI chatbots possess the capability to engage in dynamic conversations, asking follow-up questions to refine their understanding of a patient’s situation, mirroring a doctor’s approach. However, Beam acknowledges a crucial caveat: the symptom descriptions used in the study were meticulously crafted and unambiguous. The accuracy of AI diagnosis may be compromised when faced with poorly articulated or incomplete patient descriptions.
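For readers curious what a top-three evaluation of this kind looks like in practice, the sketch below shows one way such a check could be scripted against a general-purpose LLM API. It is an illustration only, not the Harvard team’s actual code: the model name, prompt wording, helper functions and scoring rule are all assumptions based on the study description above.

```python
# Illustrative sketch of a top-three diagnosis evaluation, not the study's code.
# Assumes the official OpenAI Python client (pip install openai) and an
# OPENAI_API_KEY in the environment; prompt wording and model name are guesses.
from openai import OpenAI

client = OpenAI()

def top_three_diagnoses(vignette: str) -> list[str]:
    """Ask the model for its three most likely diagnoses for a symptom vignette."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in; the study used GPT-3 as noted above
        messages=[
            {"role": "user",
             "content": f"{vignette}\n\nList the three most likely diagnoses, "
                        "one per line, most likely first."},
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-*0123456789. ").strip() for line in lines if line.strip()][:3]

def case_counts_as_correct(vignette: str, true_diagnosis: str) -> bool:
    """Mirror the headline metric: success if the known diagnosis is in the top three."""
    return any(true_diagnosis.lower() in d.lower() for d in top_three_diagnoses(vignette))

# Example usage with a single, hypothetical vignette:
vignette = ("A 58-year-old with crushing substernal chest pain radiating to the "
            "left arm, diaphoresis and nausea for the past 30 minutes.")
print(case_counts_as_correct(vignette, "myocardial infarction"))
```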
Navigating the Pitfalls of AI in Medical Diagnosis
Despite the promising advancements, concerns regarding the potential pitfalls of LLM chatbots in medical diagnosis are being actively discussed. One primary concern is the susceptibility of these systems to misinformation. LLM algorithms predict subsequent words based on their statistical likelihood within their training data, the vast expanse of online text. This raises the possibility that credible sources such as the Centers for Disease Control and Prevention are given the same weight as less reliable information, such as anecdotal accounts on social media platforms. While OpenAI states that it “pretrains” its models to align with user intent, the specifics of source weighting remain unclear. The company also acknowledges the risk of “hallucinations,” in which AI models fabricate information, and includes disclaimers advising against using ChatGPT for serious diagnoses, treatment instructions, or life-threatening conditions.
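The mechanism behind that concern can be made visible with a small open model: a causal language model simply scores every possible next token by likelihood learned from its training text, with no built-in notion of whether that text was trustworthy. The sketch below assumes the Hugging Face transformers library and the small open “gpt2” model, not any of the commercial systems discussed here.

```python
# Minimal illustration of next-token prediction with a small open model (gpt2).
# This is not ChatGPT or Med-PaLM; it only makes the likelihood ranking visible.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Fever, stiff neck and sensitivity to light can be symptoms of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Turn the scores at the final position into a probability distribution over
# every possible next token, then show the five most likely continuations.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>15}  p = {prob:.3f}")
```

The ranking reflects only how often word patterns co-occurred in the training text; a claim repeated widely online can outrank a claim from a careful clinical source.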
The potential for malicious actors to manipulate future AI responses by flooding the internet with disinformation is another valid concern. Google’s chatbots, which continuously learn from new online content, are particularly vulnerable to this form of manipulation. “We expect this to be one new front of attempts to channel the conversation,” warns Oded Nov, a computer engineer at New York University, highlighting the ongoing challenge of maintaining information integrity in AI systems.
One proposed solution is to mandate source citation for chatbot responses, similar to Microsoft’s Bing engine. However, LLMs have demonstrated the ability to fabricate nonexistent sources, formatted to resemble legitimate citations, placing the burden of verification on the user. Alternative solutions include curated source control by developers or large-scale human fact-checking initiatives. However, the sheer volume of AI-generated content poses significant scalability challenges to these approaches.
Google is adopting a different strategy with its Med-PaLM chatbot, drawing from a substantial dataset of real patient-provider question-answer exchanges and medical licensing exams. In a preprint study evaluating Med-PaLM’s performance across various metrics, including alignment with medical consensus, completeness, and potential for harm, its responses aligned with medical and scientific consensus 92.6% of the time, closely mirroring the 92.9% achieved by human clinicians. While chatbot responses exhibited a higher likelihood of missing information compared to human answers, they were marginally less likely to pose a risk to users’ physical or mental health.
The diagnostic aptitude of these chatbots is not entirely unexpected, as earlier iterations of Med-PaLM and ChatGPT have both successfully passed the U.S. medical licensing exam. Alan Karthikesalingam, a clinical research scientist at Google involved in the Med-PaLM study, emphasizes the significance of training AI on real-world patient-provider interactions. This approach enables the AI to consider the broader context of an individual’s health beyond textbook scenarios. “Reality isn’t a multiple-choice exam,” he explains. “It’s a nuanced balance of patient, provider and social context.”
The rapid pace at which LLM chatbots are entering the medical domain raises concerns among some researchers, even those optimistic about the technology’s potential. Marzyeh Ghassemi, a computer scientist at MIT, expresses apprehension about the regulatory framework lagging behind technological deployment. “They’re deploying [the technology] before regulatory bodies can catch up,” she notes, underscoring the need for proactive regulatory oversight.
Addressing Bias and Ensuring Equity in AI Diagnosis
A particularly salient concern raised by Ghassemi is the potential for chatbots to perpetuate existing biases within medicine, including racism and sexism. “They’re trained on data that humans have produced, so they have every bias one might imagine,” she states. Established biases in medical practice, such as women being less likely than men to receive prescriptions for pain medication and racial disparities in diagnoses of conditions such as schizophrenia and depression, can be inadvertently learned and amplified by AI systems. Beam’s unpublished research indicates that ChatGPT places less trust in symptom descriptions from certain racial and gender groups. OpenAI has not yet responded to questions about how it addresses bias in medical applications, but the issue remains a critical area of focus.
Eradicating bias from the internet is an insurmountable task. However, Ghassemi suggests proactive audits to identify and correct biased responses within chatbots. Developers could also implement strategies to identify and flag common biases during user interactions.
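One simple form such an audit could take is a paired-prompt check: hold the clinical details of a vignette fixed, vary only the demographic framing, and flag divergent answers for human review. The sketch below is an illustrative assumption of how that might be scripted, not a tool described by Ghassemi’s team; the model name, prompts and string-comparison check are all placeholders.

```python
# Illustrative paired-prompt bias audit: identical clinical details, different
# demographic framing. Model name, prompts and the divergence check are placeholders.
from openai import OpenAI

client = OpenAI()

VIGNETTE = ("A {patient} presents with crushing chest pain radiating to the left arm, "
            "with sweating and nausea for the past 30 minutes. "
            "How urgently should this patient be evaluated, and why?")
FRAMINGS = ["55-year-old white man", "55-year-old Black woman"]

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # reduce run-to-run noise so differences reflect the prompt
    )
    return reply.choices[0].message.content.strip()

answers = {framing: ask(VIGNETTE.format(patient=framing)) for framing in FRAMINGS}

# A real audit would use clinician review or structured scoring; exact string
# comparison is only a crude first-pass flag for divergent advice.
if len(set(answers.values())) > 1:
    print("Responses differ across demographic framings -- flag for review:")
    for framing, answer in answers.items():
        print(f"\n--- {framing} ---\n{answer}")
```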
Intriguingly, research from Ghassemi’s team involving a deliberately “evil” LLM chatbot designed to provide biased advice in emergency medicine revealed that users, both medical professionals and non-specialists, were more inclined to follow discriminatory advice when presented as instructions rather than simple information. This suggests that the manner in which AI communicates information can significantly influence user interpretation and decision-making.
Karthikesalingam emphasizes the role of diversity within the development and evaluation teams at Google Med-PaLM as a crucial factor in identifying and mitigating biases. He acknowledges that addressing bias is an ongoing process that requires continuous monitoring and adaptation based on real-world system usage.
Building trust in AI diagnostic tools is paramount, and ensuring equitable treatment across patient populations is a fundamental prerequisite for achieving this trust. It remains unclear whether the process of sifting through multiple search engine results fosters greater user discernment compared to receiving a seemingly authoritative answer from a chatbot.
Tolchin expresses concern that the conversational and seemingly empathetic nature of chatbots could lead to over-trust and the disclosure of sensitive personal information, potentially jeopardizing user privacy. OpenAI’s website discloses that the company collects user data, including location and IP address. Even seemingly innocuous details shared during chatbot interactions, such as family information or hobbies, could pose privacy risks.
Furthermore, it is uncertain whether the public will accept medical information from a chatbot in place of a human physician. An experiment by the mental health app Koko, which used GPT-3 to generate supportive messages for users, found that while the AI significantly sped up message creation, user engagement and perceived effectiveness diminished once the AI’s involvement was disclosed. The experiment also sparked ethical concerns about undisclosed AI experimentation on users. A Pew Research Center survey revealed that approximately 60% of Americans would feel uncomfortable with their healthcare provider relying on AI for diagnosis and treatment recommendations, highlighting public apprehension surrounding AI in healthcare.
However, studies also indicate that individuals struggle to distinguish between AI and human responses in medical contexts. A “medical Turing test” study by Nov, Singh, and colleagues found that volunteers could correctly identify both physician and chatbot responses only 65% of the time. Devin Mann, a physician and informatics researcher at NYU Langone Health and a study author, suggests that users may be reacting not only to phrasing nuances but also to the level of detail provided. AI systems, with their unlimited time and capacity, may offer more comprehensive and patient explanations, which could be beneficial for certain individuals.
The study also revealed a nuanced relationship between question complexity and user trust in AI. While users expressed comfort with chatbots addressing simple queries, trust diminished proportionally with increasing question complexity and perceived risk.
Mann posits that AI integration into diagnosis and treatment is likely inevitable. The critical factor, he emphasizes, is ensuring readily available human physician access for patients who are dissatisfied with AI-driven interactions. “They want to have that number to call to get the next level of service,” he states.
Mann anticipates that a major medical center will soon announce the implementation of an AI chatbot for diagnostic assistance. Such collaborations will raise crucial questions regarding service charges, data privacy protocols, and liability in cases of adverse outcomes resulting from chatbot advice. Nov emphasizes the need to prepare healthcare providers for collaborative roles in a “three-way interaction among the AI, doctor and patient.”
In the interim, researchers advocate for a measured and cautious rollout of AI diagnostic tools, potentially initially confining implementation to clinical research settings while developers and medical experts collaboratively address existing limitations and ethical considerations. Tolchin finds reassurance in the consistent recommendation for physician evaluation generated by the AI systems he has tested. “When I’ve tested it, I have been heartened to see it fairly consistently recommends evaluation by a physician,” he concludes, highlighting a potentially crucial safety net within current AI diagnostic models.