The internet age has empowered patients with unprecedented access to health information. For years, doctors have cautioned against the perils of “Dr. Google,” citing unreliable sources and a lack of contextual understanding. However, a new paradigm is emerging in the realm of self-diagnosis: artificial intelligence chatbots. Tools like OpenAI’s ChatGPT, Microsoft’s Bing AI (powered by OpenAI), and Google’s Med-PaLM are rapidly changing how individuals approach their health concerns. These sophisticated large language models (LLMs), trained on vast datasets of text, offer human-like responses to medical queries, sparking both excitement and apprehension within the medical community.
Amidst a growing global shortage of healthcare professionals, the potential of AI chatbots to assist with preliminary medical guidance is undeniable. Early research indicates that these AI programs surpass traditional search engines in diagnostic accuracy, raising the prospect of widespread adoption in healthcare settings. Some experts predict that major medical institutions will soon integrate LLM chatbots into patient interactions and diagnostic processes.
While still in its nascent stages, the use of chatbots for medical self-diagnosis is already being observed in patient behavior. Dr. Benjamin Tolchin, a neurologist and ethicist at Yale University, notes that patients are increasingly using tools like ChatGPT to investigate symptoms and medication side effects. His initial observations suggest that the AI responses are often surprisingly coherent and relevant, highlighting the technology’s promising future in healthcare.
Despite the encouraging initial findings, the integration of AI chatbots into medical diagnosis is not without significant challenges. Concerns regarding information accuracy, patient privacy, algorithmic bias, and the potential for misinterpretation of AI-generated advice are paramount. The very features that make these chatbots powerful, such as their ability to generate seemingly authoritative answers, also present new risks that were less pronounced with basic online symptom checkers.
The Evolution of AI in Medical Assistance
The healthcare landscape has been gradually shifting towards digital solutions, a trend accelerated by the COVID-19 pandemic. The volume of patient-physician communication through digital portals surged during this period, underscoring the increasing reliance on online health interactions. Many healthcare systems already employ simpler chatbots for administrative tasks like appointment scheduling and disseminating general health information. As Nina Singh, a medical student at New York University specializing in AI in medicine, points out, “It’s a complicated space because it’s evolving so rapidly.”
However, LLM chatbots represent a major advance in AI capabilities within medicine, one that could take doctor-AI collaboration, and even the diagnostic process itself, to a new level. A pre-print study posted on medRxiv by epidemiologist Andrew Beam and his colleagues at Harvard University investigated the diagnostic capabilities of OpenAI’s GPT-3. The study presented the AI with 48 symptom-based prompts. Remarkably, GPT-3 included the correct diagnosis within its top three suggestions in 88% of cases. This performance, while below the 96% accuracy achieved by physicians given the same prompts, significantly outperformed individuals without medical training (54%) and vastly surpassed traditional online symptom checkers, which typically achieve around 51% accuracy for top-three diagnoses.
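For readers curious how such a top-three accuracy figure might be computed, the sketch below is a minimal illustration, not the Harvard team’s actual code: the case data, prompt wording, grading by string match, and the model and client library (here the OpenAI Python SDK with a stand-in model name) are all assumptions made for the example.

```python
# Illustrative sketch only; not the study's methodology or code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical vignettes: each pairs a symptom description with a known diagnosis.
cases = [
    {"vignette": "A 45-year-old reports crushing chest pain radiating to the left arm...",
     "diagnosis": "myocardial infarction"},
    # ...additional cases would follow (the study used 48 prompts)
]

def top3_contains_diagnosis(vignette: str, diagnosis: str) -> bool:
    """Ask the model for its three most likely diagnoses and check for a match."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name, chosen only for illustration
        messages=[{"role": "user",
                   "content": f"{vignette}\nList the three most likely diagnoses."}],
    )
    answer = response.choices[0].message.content.lower()
    # Naive string match stands in for whatever grading procedure a real study would use.
    return diagnosis.lower() in answer

hits = sum(top3_contains_diagnosis(c["vignette"], c["diagnosis"]) for c in cases)
print(f"Top-3 accuracy: {hits / len(cases):.0%}")
```

In practice, judging whether a free-text answer “contains” the correct diagnosis is itself a nontrivial grading problem, which is one reason such evaluations lean on carefully crafted, unambiguous vignettes.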
“It’s crazy surprising to me that these autocomplete things can do the symptom checking so well out of the box,” remarks Beam. He emphasizes the enhanced usability of chatbots compared to earlier symptom checkers. Instead of navigating rigid, algorithm-driven interfaces, users can interact with chatbots in natural language, describing their symptoms in their own words. “People focus on AI, but the breakthrough is the interface—that’s the English language,” Beam explains. Furthermore, these AI systems can engage in interactive dialogues, asking follow-up questions to gather more detailed information, mimicking a doctor’s approach. Beam acknowledges, however, that the study’s symptom descriptions were carefully crafted and unambiguous. The accuracy of chatbot diagnoses could potentially decrease with poorly articulated or incomplete patient descriptions.
Navigating the Pitfalls of AI-Driven Medical Advice
Despite their diagnostic promise, LLM chatbots are not immune to limitations and potential risks, most notably the propagation of misinformation. These algorithms predict subsequent words based on statistical probabilities derived from their training data, which encompasses a vast but potentially uneven landscape of online text. This raises the concern that information from credible sources like the CDC could be given equal weight to less reliable information found on social media or online forums.
An OpenAI spokesperson stated that the company “pretrains” its models to align with user intent but did not specify whether certain sources are prioritized for credibility. The company also acknowledges the risk of “hallucinations,” where the AI fabricates information or sources. To mitigate these risks, OpenAI includes disclaimers advising against using ChatGPT for diagnosing serious conditions or managing life-threatening situations.
The challenge of misinformation is further complicated by the evolving nature of online content. While ChatGPT’s training data is currently limited to pre-September 2021 information, other chatbots, such as Google’s, draw on real-time internet content. This reliance on live content could be exploited by malicious actors seeking to disseminate false health information. Oded Nov, a computer engineer at New York University, warns, “We expect this to be one new front of attempts to channel the conversation.”
One proposed solution is to mandate chatbots to cite their sources, similar to Microsoft Bing’s approach. However, LLMs have been shown to fabricate citations, making source verification a significant burden for users. Alternative solutions include curating the data sources used by chatbots or employing human fact-checkers to identify and correct misinformation, although the scalability of these approaches remains a concern given the sheer volume of AI-generated content.
Google’s Med-PaLM adopts a different strategy, drawing upon a curated dataset of real-world patient-provider questions and answers, along with medical licensing exam materials. In a pre-print study evaluating Med-PaLM’s performance, its responses aligned with medical consensus 92.6% of the time, comparable to the 92.9% achieved by human clinicians. While chatbot responses were more prone to content omissions, they were slightly less likely to cause potential harm to users’ physical or mental health.
The ability of these chatbots to pass medical licensing exams and answer complex medical questions is not entirely unexpected. However, Alan Karthikesalingam, a clinical research scientist at Google and Med-PaLM study author, emphasizes the importance of training AI on real-world patient interactions. This approach allows the AI to understand the broader context of a patient’s health beyond simple multiple-choice scenarios. “Reality isn’t a multiple-choice exam,” he states. “It’s a nuanced balance of patient, provider and social context.”
The rapid pace of LLM chatbot deployment in healthcare raises concerns among researchers, even those optimistic about the technology’s potential. Marzyeh Ghassemi, a computer scientist at MIT, points out, “They’re deploying [the technology] before regulatory bodies can catch up.”
Addressing Bias and Ensuring Equitable Access
A particularly pressing concern is the potential for AI chatbots to perpetuate and amplify existing biases in healthcare. Ghassemi emphasizes that these systems are trained on human-generated data, inherently reflecting societal prejudices related to race, gender, and other demographics. For example, documented biases in healthcare, such as women being less likely to receive adequate pain medication and racial disparities in mental health diagnoses, can be inadvertently encoded into AI algorithms. Beam’s unpublished research indicates that ChatGPT may exhibit biases in trusting symptom descriptions based on the perceived race and gender of the patient.
While eliminating bias from the internet entirely is unrealistic, developers can implement strategies to mitigate its impact. Preemptive audits to identify biased AI responses, along with interventions to correct them, are crucial. Research by Ghassemi’s team using a deliberately biased, “evil” chatbot found that users, including medical professionals, were more likely to follow its discriminatory advice when it was framed as an instruction rather than as neutral information, underscoring how much the framing of AI advice matters.
Karthikesalingam notes that Google’s diverse Med-PaLM development team helps identify and address potential biases. However, he stresses that bias mitigation is an ongoing process that requires continuous monitoring and adaptation as the system is used in real-world settings.
Building trust in chatbot medical diagnosis is paramount for widespread adoption, and equitable treatment is a cornerstone of that trust. It also remains unclear whether the interactive nature of chatbots fosters more or less critical evaluation by users than simply sifting through search engine results.
Tolchin expresses concern that the conversational and seemingly empathetic nature of chatbots might lead users to over-trust the AI and disclose sensitive personal information, potentially compromising their privacy. OpenAI’s privacy policy, for instance, indicates that user inputs may be collected and used to improve the company’s services, a practice worth understanding before sharing personal health details.
Furthermore, patient acceptance of AI-driven medical advice in place of human interaction is uncertain. An experiment with the mental health app Koko, which used GPT-3 to generate supportive messages, revealed that user engagement decreased once users realized the messages were AI-generated, highlighting the complex relationship between empathy and technology in healthcare. A Pew Research Center survey indicated that approximately 60% of Americans are uncomfortable with healthcare providers relying on AI for diagnosis and treatment recommendations.
Despite these reservations, studies suggest that people may struggle to distinguish between AI and human medical advice. A “medical Turing test” study by Nov, Singh, and colleagues found that volunteers correctly identified physicians and ChatGPT only 65% of the time. Devin Mann, a physician and informatics researcher at NYU Langone Health, suggests that the AI’s capacity for detailed, unhurried explanations, in contrast with a potentially rushed human consultation, might be welcomed by some patients, particularly for simpler queries. However, trust in AI diagnosis diminishes as the complexity and risk of the medical issue increase.
Mann anticipates the eventual integration of AI systems into routine diagnosis and treatment pathways. He emphasizes the critical importance of ensuring human oversight and readily accessible pathways to human clinicians when patients require or desire a higher level of care. “They want to have that number to call to get the next level of service,” he states.
The widespread adoption of chatbot medical diagnosis is likely imminent, with Mann predicting major medical centers will soon announce AI-driven diagnostic tools. This integration will necessitate careful consideration of ethical, legal, and practical implications, including data privacy, service costs, liability for inaccurate advice, and the evolving roles of both AI and human healthcare providers in this new paradigm. Nov emphasizes the need to prepare healthcare professionals for “a three-way interaction among the AI, doctor and patient.”
In the interim, researchers advocate a cautious, phased rollout that prioritizes clinical research and rigorous evaluation, allowing shortcomings to be identified and addressed before widespread implementation. Tolchin concludes with a note of cautious optimism: “When I’ve tested it, I have been heartened to see it fairly consistently recommends evaluation by a physician,” suggesting that AI chatbots, at least for now, are more likely to guide users toward appropriate medical care than to replace it entirely.