Examining the Efficacy of Buoy Diagnosis and Mobile Self-Diagnosis Apps: An In-Depth Analysis

Self-diagnosis mobile applications are increasingly prevalent, leveraging complex algorithms and artificial intelligence (AI) to offer users preliminary health assessments. These tools, including platforms like Buoy Diagnosis, Ada, Babylon, and Your.MD, operate on sophisticated, often undisclosed, algorithms, sparking both excitement and scrutiny within the medical community. The US Food and Drug Administration (FDA) is actively adapting its regulatory framework to differentiate between AI-driven apps with learning capabilities and those with static algorithms, recognizing the former as potential medical devices. However, comprehensive, real-world testing of these self-diagnosis apps, particularly in specialized fields like ophthalmology, remains limited.

This article delves into a critical evaluation of several prominent self-diagnosis apps, with a specific focus on Buoy Diagnosis and its contemporaries. Building upon prior scientific literature, we examine their diagnostic performance over a defined period. Our objective is twofold: to assess the consistency and accuracy of these apps in diagnosing ophthalmological conditions and to explore whether observed variations in results suggest the use of dynamic, “nonlocked” learning algorithms. By analyzing the performance of Buoy Diagnosis alongside other leading apps, we aim to provide a clearer understanding of their current capabilities, limitations, and the implications for patient care and regulatory oversight in the evolving landscape of digital health.

The Rise of AI in Mobile Health Diagnosis: Understanding Buoy and Its Competitors

Algorithms and machine learning (ML) have profoundly reshaped numerous facets of modern life, from personalized search engine results to the advent of self-driving vehicles [1, 2]. This technological revolution has extended into healthcare, with the emergence of self-diagnosis apps designed to empower patients in understanding their symptoms [3, 4]. While ML applications are becoming commonplace in areas like radiology image analysis [5], the integration of AI and ML in healthcare is tempered by concerns regarding trust, stringent regulatory demands, and the imperative for thorough validation [3, 6]. Despite their growing popularity, rigorous independent testing of these self-diagnosis tools has been limited. Early investigations, such as Semigran et al.’s 2015 study, evaluated symptom checker apps but did not specifically address the role of ML [7]. A 2019 scoping review by Aboueid et al. identified a range of such apps [3], but only a fraction have undergone functional diagnostic testing [8-10].

The regulatory landscape is also evolving. Initially, the US FDA exempted “symptom checker” apps from stringent medical device regulations [11]. However, the FDA has since released a white paper proposing a revised regulatory approach for self-diagnosis apps, particularly distinguishing between “locked” algorithms and AI/ML-driven learning algorithms. The latter category is now subject to more rigorous oversight [12]. The definition of these categories remains somewhat ambiguous, relying on manufacturer disclosures rather than definitive technical criteria. Therefore, repeated assessments of these apps are crucial to determine the stability of their diagnostic outputs over time. Variations in performance would strongly suggest the incorporation of “learning” AI technologies, differentiating them from simple symptom checkers. Thus, the primary aim of this study is to evaluate apps previously cited in scientific literature, including Buoy Diagnosis, across a set of medical conditions and to track their performance changes over a two-year interval. Significant shifts in results could indicate the utilization of learning algorithms by app developers.

Furthermore, the field of ophthalmology, the author’s area of expertise with over a decade of clinical experience, has been notably absent from prior app testing. Therefore, a secondary objective is to assess the diagnostic efficacy of self-diagnosis apps in this specialized domain. By challenging these apps with three common ophthalmological diagnoses representing varying levels of urgency, we aim to evaluate their diagnostic accuracy and treatment recommendations against the expertise of a seasoned ophthalmologist. This includes a detailed look at Buoy Diagnosis in comparison to other available platforms.

Methodology: Testing Buoy Diagnosis and Other Mobile Health Apps

Study Design and Setup

To rigorously evaluate the diagnostic capabilities of mobile health applications like Buoy Diagnosis, we conducted a comparative study using Android 9 and Android 10 mobile platforms, as well as Google Chrome on OSX for web-based applications. All tests were performed within Germany, utilizing the English language user interfaces of the apps. We ensured that the most current versions of each application from the Google Play Store or official websites were used, aligning with the specified dates for each diagnostic assessment (detailed in Multimedia Appendix 1).

The selection of apps for this study was based on their prior mention in scientific literature. This included Ada [3, 8, 13], Babylon Health or Babylon Check [3, 9, 13], Buoy Health [14], and Your.MD [13]. Certain apps were excluded: Baidu Doctor [15] due to its exclusive availability in Chinese, and K Health [3], which presented download and compatibility issues potentially related to regional restrictions. Notably, while an Android-based app for Buoy Diagnosis has been reported in development, at the time of testing, it was exclusively accessible as a web application [16].

We summarized the fundamental operational principles of each app, drawing from developer descriptions and recent literature, including gray literature where scientific publications were scarce. While standardized methodologies for testing AI-based apps are still evolving, previous studies have employed virtual diagnoses combined with patient vignettes to assess symptom checker apps [7]. This approach was also used in recent studies evaluating Babylon Health [10] and Ada Health [8] in specific medical fields. Our study adopted a simplified version of this methodology, involving a single physician who created a virtual patient profile for each of three distinct ophthalmological diagnoses.

We selected three diagnoses representing varying levels of medical urgency within ophthalmology:

  1. Acute Angle Closure Glaucoma: Representing an absolute emergency requiring immediate intervention. The virtual patient presented with symptoms typical of a glaucoma attack: a painful, red eye of approximately two hours duration, blurred vision, headache, and other symptoms as elicited by each app’s questioning process (detailed symptom walkthroughs are available in Multimedia Appendix 1).
  2. Retinal Tear: A relative emergency necessitating same-day treatment.
  3. Dry Eye Syndrome: A condition not requiring immediate medical attention and often amenable to self-treatment.

While definitive, universally accepted clinical guidelines are lacking for these conditions, regional societies provide general recommendations [17-19]. The symptomology of these diagnoses is well-established in ophthalmology, as detailed in the American Academy of Ophthalmology’s Basic and Clinical Science Course, aligning with the author’s clinical expertise [20]. Given the absence of strict guidelines, the critical benchmark for these apps was to avoid underestimating the urgency of the patient’s condition, irrespective of whether a precise diagnosis was reached.

Scoring and Evaluation

The primary diagnoses and treatment recommendations provided by each app were evaluated by the author, and a scoring system was applied:

  • 1 point: Awarded for a correct diagnosis and appropriate treatment recommendation.
  • 0.5 points: Awarded for a partially correct diagnosis or treatment recommendation. This was applied when a fully correct response was not provided, but the app’s output was not misleading and did not minimize the urgency (e.g., recommending consultation with a physician even without a specific diagnosis).
  • 0 points: Awarded for diagnoses or treatment recommendations that failed to meet the above criteria, indicating incorrect or potentially harmful advice.

Specific scoring criteria were established for each diagnosis: For glaucoma, any recommendation less urgent than emergency treatment received 0 points. For retinal tear, treatment recommendations ranging from “instantly” to within a few days received 1 point. For dry eyes, urgent treatment recommendations received 0 points, while recommending self-treatment prior to physician consultation received 0.5 points.
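
To make the rubric concrete, the sketch below encodes it as a small Python function. This is purely illustrative and is not the evaluation code used in the study; the condition labels, urgency categories, and the mapping of "self-treatment before physician consultation" to 0.5 points reflect our reading of the criteria above.

```python
# Illustrative sketch of the scoring rubric described above; not the study's actual code.
# Scores: 1.0 = correct, 0.5 = partially correct and not urgency-minimizing, 0.0 = incorrect.

URGENCY = {"self_care": 0, "routine_visit": 1, "within_days": 2, "same_day": 3, "emergency": 4}

def score_treatment(condition: str, recommendation: str) -> float:
    """Apply the condition-specific treatment criteria from the Methods section."""
    rank = URGENCY[recommendation]
    if condition == "acute_angle_closure_glaucoma":
        # Anything less urgent than emergency treatment scores 0.
        return 1.0 if recommendation == "emergency" else 0.0
    if condition == "retinal_tear":
        # Recommendations from "instantly" down to within a few days score 1.
        return 1.0 if rank >= URGENCY["within_days"] else 0.0
    if condition == "dry_eye_syndrome":
        # Urgent-care advice scores 0; self-treatment before a physician visit scores 0.5;
        # recognizing the condition as self-treatable scores 1 (our reading of the rubric).
        if rank >= URGENCY["same_day"]:
            return 0.0
        return 1.0 if recommendation == "self_care" else 0.5
    raise ValueError(f"unknown condition: {condition}")

# Example: same-day treatment advice for a retinal tear earns full credit.
assert score_treatment("retinal_tear", "same_day") == 1.0
```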

To mitigate potential biases from device-specific data (e.g., phone type, GPS location), a virtual, anonymous patient profile was used for all diagnoses. A two-year interval was deliberately chosen between test administrations (2018 and 2020). This timeframe was selected based on the assumption of gradual user base growth for these apps, leading to a slow but continuous accumulation of data that could potentially refine their learning algorithms. No human subjects, other than the author, were involved in this research process. P values were calculated using a Student t test for independent samples with SPSS (version 16.0; IBM Corp).
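
As an aside on the statistical step, the snippet below shows how such an independent-samples comparison could be reproduced outside SPSS, here with SciPy. The arrays are hypothetical placeholders standing in for per-diagnosis question counts; the study's actual raw numbers are in Multimedia Appendix 3.

```python
# Illustrative only: independent-samples Student t test (the study used SPSS 16.0;
# SciPy stands in here). The arrays are hypothetical, not the study's raw data.
from scipy import stats

questions_2018 = [31, 28, 23]  # hypothetical question counts for the three diagnoses, 2018
questions_2020 = [33, 31, 29]  # hypothetical question counts for the three diagnoses, 2020

t_stat, p_value = stats.ttest_ind(questions_2018, questions_2020, equal_var=True)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```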

App Descriptions: Ada, Babylon, Buoy Health, and Your.MD

Ada: Developed in Berlin, Ada initially gained traction in New Zealand in 2016 before broader release [21, 22]. It employs a chatbot interface to gather user data, prompting users with symptom lists based on free-text input and adapting subsequent questions based on prior responses. The final report can be shared with a physician on behalf of the user.

Babylon: Based in London and primarily focused on the UK market, Babylon began in 2013 as an online physician consultation service. Since 2016, it has incorporated a chatbot for symptom assessment, utilizing simple and multiple-choice questions [23]. While explicit details on its ML algorithms are not publicly available, gray literature suggests the potential use of recurrent neural networks (RNNs) for deep learning, with Python as a possible primary programming language [24]. Ni et al. have also mentioned Bayesian networks, although without a specific source [25].

Buoy Health: Originating from Harvard Medical School in 2014, Buoy Diagnosis is presented as a smart symptom checker with an undisclosed algorithm, purportedly drawing on natural language processing (NLP) of data from 18,000 clinical papers [26]. According to its CEO, Buoy Diagnosis avoids decision trees, instead “dynamically picking” from 30,000 questions based on minimizing diagnostic uncertainty. This approach does not necessarily imply the use of neural networks, and the claimed diagnostic certainty ranges from 90.9% to 98%, without detailed methodological explanation [27].
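
Buoy's "uncertainty minimization" is not documented, but the idea can be illustrated. Assuming the engine maintains a probability distribution over candidate diagnoses, the sketch below scores a single yes/no question by the expected reduction in Shannon entropy of that distribution; the diagnoses, probabilities, and the question itself are hypothetical.

```python
# Conceptual sketch only: scoring one yes/no question by expected entropy reduction.
# All diagnoses, probabilities, and the example question are hypothetical;
# Buoy's actual algorithm is undisclosed.
import math

def entropy(dist):
    """Shannon entropy (in bits) of a diagnosis probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def normalize(dist):
    total = sum(dist.values())
    return {d: p / total for d, p in dist.items()}

def expected_entropy_after(prior, likelihood_yes):
    """Expected posterior entropy after asking a yes/no question whose
    per-diagnosis probability of a 'yes' answer is given by likelihood_yes."""
    p_yes = sum(prior[d] * likelihood_yes[d] for d in prior)
    post_yes = normalize({d: prior[d] * likelihood_yes[d] for d in prior})
    post_no = normalize({d: prior[d] * (1 - likelihood_yes[d]) for d in prior})
    return p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no)

# Hypothetical beliefs over three candidate diagnoses and one candidate question
# ("Is your vision blurred?"); a real engine would score many candidate questions
# and ask the one with the largest expected information gain.
prior = {"angle_closure_glaucoma": 0.40, "conjunctivitis": 0.35, "cluster_headache": 0.25}
likelihood_yes = {"angle_closure_glaucoma": 0.9, "conjunctivitis": 0.2, "cluster_headache": 0.6}

gain = entropy(prior) - expected_entropy_after(prior, likelihood_yes)
print(f"Expected information gain from asking about blurred vision: {gain:.3f} bits")
```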

Your.MD: Founded in Oslo, Norway, in 2012 and now headquartered in London [28, 29], Your.MD allows users to input free-text questions via its chatbot, responding with simple or multiple-choice questions. The algorithms used are proprietary; however, public information suggests Python as a primary programming language and the potential use of Bayesian networks, as indicated by its CEO [30].

Results: Performance Analysis of Buoy Diagnosis and Comparable Apps

Comparative Diagnostic Accuracy: 2018 vs 2020

The diagnostic performance of each app across the three ophthalmological conditions in both 2018 and 2020 is summarized in Table 1 and Table 2. Detailed walkthroughs and raw output data for each app are available in Multimedia Appendix 1, with score overviews in Multimedia Appendix 2.

Table 1. App Performance in 2018: Diagnosis and Treatment Scores

App        Glaucoma    Retinal tear    Dry eyes
Ada        (•)/(•)     (•)/(•)         (•)/(-)
Babylon    (-)/(•)     (-)/(-)         (-)/(-)
Buoy       (-)/(•)     (-)/(-)         (-)/(∗)
Your.MD    (•)/(•)     (•)/(•)         (•)/(•)

Each cell shows the diagnosis score/treatment score, where (•) = 1 point (correct), (∗) = 0.5 points (partially correct), and (-) = 0 points (incorrect).

Table 2. App Performance in 2020: Diagnosis and Treatment Scores

App        Glaucoma    Retinal tear    Dry eyes
Ada        (-)/(-)     (•)/(•)         (•)/(∗)
Babylon    (-)/(•)     (-)/(-)         (-)/(-)
Buoy       (-)/(-)     (-)/(•)         (-)/(∗)
Your.MD    (•)/(•)     (-)/(•)         (∗)/(-)

Scoring symbols as in Table 1.

Ada: In 2018, Ada correctly diagnosed angle closure glaucoma but misdiagnosed it as a cluster headache in 2020, omitting glaucoma from its differential diagnosis.

Babylon: For glaucoma, Babylon consistently recommended emergency treatment after five questions (triggered by “severe pain” input) in both years but provided no specific diagnosis. Retinal tear remained undiagnosed (“insufficient information”), with a recommendation to consult an online or in-person general practitioner. Dry eyes also lacked a diagnosis and were classified as a relative emergency (same-day medical treatment advised). Performance was consistent between 2018 and 2020.

Buoy Health: Buoy Diagnosis did not yield a correct diagnosis in either 2018 or 2020. In 2018, for dry eyes, “Blepharitis” was suggested as a secondary diagnosis, which could be considered partially relevant [31]. However, for retinal tear in 2018, the results were significantly inaccurate, suggesting “Cataract” or “Bone disease” as possible causes. No improvement in diagnostic accuracy was observed in 2020.

Your.MD: Your.MD accurately diagnosed all three conditions in 2018, requiring fewer questions than Ada and Buoy Diagnosis. It also correctly categorized treatment urgency, recognizing dry eyes as self-treatable. It was the only app to correctly identify angle closure glaucoma in 2020 (in 2018, it stated simply “Glaucoma”). However, in 2020, it failed to diagnose retinal tear, which it had correctly identified in 2018. For dry eyes, the 2020 recommendation shifted from appropriate self-treatment advice in 2018 to an inappropriate emergency care recommendation.

Technology and Algorithm Transparency

All tested apps rely on an active internet connection for diagnostic functionality. They all utilize chatbot interfaces, likely based on NLP, and process discrete user responses. However, significant variations exist in how information is processed, the types of questions asked, and the diagnostic conclusions drawn (see Multimedia Appendix 1). Substantial information regarding the specific algorithms used by these apps remains unavailable.

Performance Trends: 2018-2020 Comparison

The average number of questions asked by each app remained relatively stable between 2018 and 2020. For Ada, it changed from 27.3 to 31 (P=.38); Babylon from 11 to 9 (P=.64); Buoy Diagnosis from 31.3 to 30.3 (P=.63); and Your.MD from 10 to 10.3 (P=.84) (Multimedia Appendix 3). No significant difference was found in the average number of questions between Ada and Buoy Diagnosis (P=.41) or between Babylon and Your.MD (P=.93). However, significant differences were observed between Ada and Babylon (P<.001), Ada and Your.MD (P<.001), Buoy Diagnosis and Babylon (P<.001), and Buoy Diagnosis and Your.MD (P<.001).

The average scores for diagnosis/treatment (out of a maximum of 3 points per year) varied over time. Ada’s per-year scores fell from 3/2 in 2018 to 2/1.5 in 2020 (an average of 2.5/1.75 across both years; P=.37/.73), and Your.MD’s fell from 3/3 to 1.5/2 (P=.16/.37). Babylon and Buoy Diagnosis remained unchanged at 0/1 and 0/1.5 per year, respectively (Table 1 and Table 2).

The total scores across both years were 5/3.5 for Ada, 0/2 for Babylon, 0/3 for Buoy Diagnosis, and 4.5/5 for Your.MD. Summing total points, no significant differences were found between Ada and Your.MD (P=.70) or between Babylon and Buoy Diagnosis (P=.56). Significant differences were observed between Ada and Babylon (P=.02), Ada and Buoy Diagnosis (P=.03), Babylon and Your.MD (P=.01), and Buoy Diagnosis and Your.MD (P=.01).

Discussion: Implications of Findings for Buoy Diagnosis and Self-Diagnosis Apps

Several noteworthy observations emerged during the app testing process. Ada, for instance, appeared to ask repetitive questions, such as re-querying about eye pain after it had been initially identified as the primary symptom. This behavior could potentially serve to enrich the app’s diagnostic database. Ada attempts to address the “black box” issue prevalent in ML [32] by providing a visual representation of the statistical likelihood of a suggested diagnosis based on symptom input. Interestingly, the statistics provided for dry eyes seemed to indicate a reduced association between the entered symptoms and the diagnosis (“8 in 10 people” in 2018 vs “5 in 10 people” in 2020), suggesting potential challenges in integrating accumulated data over time. The statistical outputs from Ada suggest the possible use of Bayesian probabilities, as artificial neural network (ANN) outputs typically do not correspond directly to such statistical values unless the displayed figures are interpolated from raw network outputs. This aligns with Ada Health’s published information referencing the use of Bayesian networks [33].
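
To illustrate why explicit population-style statistics point toward Bayesian-style inference, the sketch below computes a naive Bayes posterior over diagnoses from reported symptoms and renders it as an "X in 10 people" figure. The priors, likelihoods, and symptom names are hypothetical and are not taken from Ada.

```python
# Illustrative sketch: a naive Bayes posterior over diagnoses given reported symptoms,
# rendered as an "X in 10 people" statistic. All priors, likelihoods, and symptom
# names are hypothetical and are not taken from Ada.
from math import prod

priors = {"dry_eye_syndrome": 0.6, "conjunctivitis": 0.3, "keratitis": 0.1}
# P(symptom reported | diagnosis) for each symptom the virtual patient enters
likelihoods = {
    "dry_eye_syndrome": {"burning": 0.8, "foreign_body_sensation": 0.7},
    "conjunctivitis":   {"burning": 0.4, "foreign_body_sensation": 0.5},
    "keratitis":        {"burning": 0.3, "foreign_body_sensation": 0.6},
}
reported = ["burning", "foreign_body_sensation"]

unnormalized = {dx: priors[dx] * prod(likelihoods[dx][s] for s in reported) for dx in priors}
total = sum(unnormalized.values())
posteriors = {dx: p / total for dx, p in unnormalized.items()}

for dx, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{dx}: about {round(10 * p)} in 10 people with these symptoms")
```

In a design like this, the displayed frequency is a direct read-out of the posterior, whereas a plain ANN score would first need to be calibrated or interpolated into such a statistic.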

Buoy Diagnosis posed questions that seemed tangential to the presenting symptoms, such as inquiries about health insurance. In 2018, Buoy Diagnosis presented users with images of medical conditions for comparison, which may be unsuitable for non-medical users. For example, in the dry eye scenario, users were asked to compare their cornea to a microscopy image to identify Horner-Trantas dots and were shown an image of patellar reflex testing. These images were not presented in the 2020 assessment for the same symptoms. Both Babylon and Buoy Diagnosis consistently failed to provide useful diagnoses and offered limited treatment recommendations. Some results were significantly inaccurate, such as Buoy Diagnosis suggesting “Bone issue” or “Non-bacterial brain inflammation” for the retinal tear case. This contributed to Ada and Your.MD outperforming Babylon and Buoy Diagnosis in overall diagnostic accuracy.

The variation in treatment recommendations across apps for identical symptom sets is also notable. Ada tended to recommend emergency care broadly, potentially shifting responsibility to the patient but diminishing the value of nuanced medical guidance. Babylon seemed to direct any patient reporting “severe pain” to emergency care, appropriate for glaucoma but lacking a specific diagnosis and offering generic recommendations otherwise. In 2020, Buoy Diagnosis advised glaucoma patients to seek medical advice within three days as a primary recommendation, followed by “emergency treatment” as secondary and tertiary options, which could confuse users regarding the urgency of their condition. Your.MD provided the most clinically appropriate recommendations in this study but also showed a decline in performance for dry eyes between 2018 and 2020, inappropriately shifting from self-treatment advice to emergency care.

While the number of questions asked by the apps remained relatively consistent, the temporal variations in diagnoses and treatment recommendations across all four apps, including Buoy Diagnosis, suggest the use of learning algorithms. This indicates that the algorithms governing history-taking and diagnostic computation are evolving, potentially falling under the FDA’s proposed regulatory framework for “nonlocked” algorithms. Regarding diagnostic effectiveness for ophthalmological conditions, the results were mixed and trended towards worsening. Notably, no app demonstrated improved history-taking or diagnostic outcomes over time. Instead, Ada and Your.MD showed declining diagnostic performance, while Babylon and Buoy Diagnosis remained consistently low-performing. This deterioration in diagnostic performance appears to contradict the intended benefits of “learning” algorithms and warrants further investigation. Interestingly, the number of questions asked did not correlate with result quality; the app with the highest overall score asked the second-fewest questions. This highlights significant differences in diagnostic approaches and efficacy across these platforms, all deserving of systematic evaluation.

The undisclosed algorithms of these apps, including Buoy Diagnosis, conceptually resemble the adaptive feedforward neural network-based mobile diagnosis engine proposed by the author in 2016 [34], mirroring the classic AI “20 Questions” game [35]. Both frameworks utilize separate neural networks (or analogous algorithms) to calculate current diagnoses and determine optimal next questions. While these examples used simpler ANNs, the tested apps may employ more advanced architectures like RNNs, Bayesian networks, or convolutional neural networks [15], accessed via chatbot interfaces.
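
To make that architectural analogy concrete, the skeleton below sketches a "20 Questions"-style diagnostic loop with two pluggable components: one estimates diagnosis probabilities from the answers gathered so far, the other selects the next question. It is a generic illustration of the concept in [34, 35], not a reconstruction of any of the tested apps.

```python
# Generic "20 Questions"-style diagnostic loop: one component estimates diagnoses,
# another picks the next question. Purely illustrative; not any vendor's algorithm.
from typing import Callable, Dict, Tuple

Answers = Dict[str, bool]                                       # question id -> yes/no answer
DiagnosisModel = Callable[[Answers], Dict[str, float]]          # answers -> P(diagnosis)
QuestionSelector = Callable[[Answers, Dict[str, float]], str]   # answers, beliefs -> next question

def run_session(
    diagnose: DiagnosisModel,
    next_question: QuestionSelector,
    ask_user: Callable[[str], bool],
    max_questions: int = 20,
    confidence_threshold: float = 0.9,
) -> Tuple[str, float]:
    """Alternate between estimating diagnoses and asking questions until the top
    diagnosis is confident enough or the question budget is exhausted."""
    answers: Answers = {}
    for _ in range(max_questions):
        beliefs = diagnose(answers)
        top_dx, top_p = max(beliefs.items(), key=lambda kv: kv[1])
        if top_p >= confidence_threshold:
            return top_dx, top_p
        question = next_question(answers, beliefs)
        answers[question] = ask_user(question)
    beliefs = diagnose(answers)
    return max(beliefs.items(), key=lambda kv: kv[1])
```

In such a design, `diagnose` could be a Bayesian network or an ANN, and `next_question` an uncertainty-minimizing selector like the sketch in the Buoy Health description; the apps' differing behavior over time would then reflect how those components are trained and updated.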

Limitations and Future Directions

This study has several limitations. First, the input cases included questions that the author considered irrelevant to diagnosis, such as diabetes prevalence, smoking history, and other seemingly unrelated inquiries. However, it is possible that from the perspective of a large, unbiased database, these questions are relevant, and a biased physician’s input could potentially skew the algorithms. Second, the result evaluation is subjective, mirroring the subjectivity of symptom entry. This could assess the apps’ ability to mimic a potentially flawed physician rather than their capacity for accurate diagnosis. Future studies could incorporate systematic evaluations within randomized controlled trials. Semigran et al. previously used human input and output on randomized diagnoses to assess self-diagnosis apps [7]. New methodologies, possibly including automation, may be necessary to effectively evaluate AI-driven apps, considering their vast data processing capabilities and dynamic algorithms. A simpler approach could involve multiple physician evaluations and averaged assessments [10]. The potential for manufacturers to adapt to known question sets (e.g., from this study) should also be considered in future research. Third, the sample size is limited, and larger-scale investigations are needed.

Fraser et al. (2018) have advocated for standardized and transparent evaluation procedures for these technologies [36]. Kelly et al. (2019) emphasized peer-reviewed studies to build trust in AI devices and highlighted the opportunity for large-scale prospective studies using data collected from consumer-oriented technology, contingent on data transparency [37]. They also noted the development of an extension to the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement [38], which has provided guidelines for evaluating diagnostic prediction models since 2015, to cover ML algorithms [39]. Furthermore, the World Health Organization and the International Telecommunication Union are developing benchmarking frameworks for AI tools in healthcare [40]. These initiatives can guide future scientific exploration but require funding and resources.

The companies behind these apps employ physicians, often with medical informatics or related training, in prominent roles, including co-founder positions in some cases [41-44]. Now that AI has entered patient-facing diagnostics, it is crucial to determine whether scientific evaluation and public performance assessments will parallel the physician oversight within these companies, or whether essential performance data will remain proprietary, a common practice in the commercial sector due to conflicts of interest. Given the potential impact of these apps, including Buoy Diagnosis, on public health [4], transparency and the public interest should be prioritized. AI-powered physician support through accessible apps holds immense promise, provided these tools demonstrably learn, improve, and, most importantly, meet the critical healthcare imperatives of efficiency and safety.

Abbreviations

AI: Artificial Intelligence
ANN: Artificial Neural Network
FDA: US Food and Drug Administration
ML: Machine Learning
NLP: Natural Language Processing
RNN: Recurrent Neural Network
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

Appendix

Multimedia Appendix 1: Walkthrough through all apps and diagnoses.
jmir_v22i12e18097_app1.docx (35.3KB, docx)

Multimedia Appendix 2: Additional results tables with scores.
jmir_v22i12e18097_app2.docx (18.2KB, docx)

Multimedia Appendix 3: Additional tables (no. of questions asked, time taken).
jmir_v22i12e18097_app3.docx (14.7KB, docx)

Footnotes

Authors’ Contributions: The author AC is currently not affiliated with any institution, but is an Independent Scholar.

Conflicts of Interest: None declared.

References

[References]
