Evaluating the Effectiveness of Diagnosis Checkers: A Comparative Study Using Simulated Patient Encounters

Introduction

Healthcare systems worldwide face increasing pressure from overcrowding, creating challenges for timely patient care and resource allocation [1-6]. In response, digital health tools such as symptom checkers (SCs), more accurately termed diagnosis checkers, have emerged as potential solutions to enhance patient navigation and triage [12, 13]. These tools aim to streamline the initial stages of medical consultation by using questionnaires or chatbot interfaces to gather patient information, mimicking the crucial questioning phase of a clinical examination [14]. By providing potential diagnoses and guidance on seeking appropriate care, diagnosis checkers hold promise for improving healthcare efficiency. However, rigorous evaluation is essential to ensure their effectiveness and responsible implementation.

Drawing inspiration from the standardized and objective methods used to assess medical students, such as Objective Structured Clinical Examinations (OSCEs), this study explores a novel approach to evaluate diagnosis checkers. OSCEs utilize simulated patient scenarios to assess clinical skills in a controlled and reproducible manner [17, 18]. This method addresses the limitations of traditional evaluation approaches that rely on retrospective clinical cases, which may lack standardization and real-world applicability [15, 16]. The increasing prevalence of teleconsultations, mirroring the rapid questioning process in emergency call centers, further supports the relevance of simulated patient encounters in evaluating diagnosis checkers [19-21].

This study aims to assess the effectiveness of a diagnosis checker in comparison to experienced emergency physicians, using OSCE methodology with simulated patients. By evaluating diagnostic accuracy and triage capabilities, this research seeks to provide valuable insights into the potential and limitations of diagnosis checkers in a simulated real-world setting.

Methods

To comprehensively evaluate the diagnostic performance of a diagnosis checker against emergency physicians, we employed a comparative study using simulated patient encounters, mirroring the OSCE approach. This methodology ensured that both the diagnosis checker and the physicians received identical patient information, limited to the patient’s verbal responses, akin to a teleconsultation scenario. We adhered to the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement, as extended to simulation-based research, to maintain rigor and transparency in our study design [22].

Simulated Patient Development

The foundation of our evaluation rested on the creation of standardized and realistic simulated patient scenarios. A panel of expert physicians, including general practitioners, emergency physicians, and internists, identified 44 of the most prevalent diseases encountered in unscheduled care settings. Crucially, the expert panel was blinded to the specific diseases coded within the diagnosis checker to ensure an unbiased evaluation, reflecting real-world conditions.

Each expert physician developed a clinical case for each of the 44 selected diseases, following a consensus-based template to ensure consistency and reproducibility. These simulated patient charts were meticulously crafted to meet specific quality criteria:

  1. Clinical Concordance: Symptoms and medical history aligned with the primary diagnosis.
  2. Role Portrayal: Scenarios were designed for effective interaction with both software and physicians.
  3. Non-Critical Condition: Patients were not in a life-threatening condition, so as to allow clear communication.

Furthermore, each simulated patient scenario included standardized questions and pre-defined answers. In instances where information was not explicitly provided, actors were instructed to respond with “I don’t know,” mirroring patient interactions in real-life consultations. Actors were rigorously trained to embody their roles as standardized patients, following established OSCE protocols.
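
To illustrate this standardization concretely, the following is a minimal, hypothetical Python sketch; neither the chart content (an invented cystitis presentation) nor the representation is taken from the actual study materials, which specify only that unscripted questions were answered with “I don’t know.”

    # Hypothetical representation of a standardized patient chart: scripted
    # question-answer pairs, with any unscripted question answered "I don't know".
    # The content below is invented for illustration only.
    cystitis_chart = {
        "What brings you in today?": "It burns when I urinate.",
        "Do you have a fever?": "No.",
        "Have you noticed blood in your urine?": "A little, yes.",
    }

    def standardized_answer(chart, question):
        # Actors answer from the script; anything not covered gets the default reply.
        return chart.get(question, "I don't know")

    print(standardized_answer(cystitis_chart, "Do you have back pain?"))  # -> "I don't know"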

Study Design and Procedure

Our study adopted a prospective, randomized, non-inferiority design. Each of the 220 simulated patient cases was enacted twice by a trained actor: first with the diagnosis checker, then with an experienced emergency physician. This paired design allowed a direct comparison within the same clinical scenario (Fig 1). Emergency physician consultations were conducted via conference calls, allowing evaluators to observe the interaction between the physician and the simulated patient. The order of clinical cases was randomized to minimize bias. Cases for which the diagnosis checker did not return a diagnosis were repeated to rule out technical issues and ensure result reliability.
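
To make this case flow concrete, the sketch below, written under our own assumptions, walks each case through the two arms in a randomized order and repeats checker runs that return no diagnosis; the run_checker and run_physician callables are hypothetical stand-ins for the real interactions, not part of the study.

    # Minimal sketch of the paired, randomized procedure described above.
    # run_checker and run_physician are hypothetical placeholders for the two arms.
    import random

    def run_study(cases, run_checker, run_physician, seed=42):
        random.seed(seed)
        order = list(cases)
        random.shuffle(order)  # randomize case order to limit ordering bias
        results = []
        for case in order:
            checker_dx = run_checker(case)
            if checker_dx is None:  # no diagnosis returned: repeat once to rule out a technical issue
                checker_dx = run_checker(case)
            physician_dx = run_physician(case)  # the same case is then seen by a physician
            results.append({"case": case["id"], "checker": checker_dx, "physician": physician_dx})
        return results

    # Hypothetical usage with stub interactions standing in for the real arms
    demo = run_study([{"id": i} for i in range(3)],
                     run_checker=lambda case: "checker diagnosis",
                     run_physician=lambda case: "physician diagnosis")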

Fig 1. Study design workflow comparing diagnosis checker and emergency physician performance using simulated patients.

Outcome Measures and Statistical Analysis

The primary outcome measure was the accuracy of the main diagnosis provided by the diagnosis checker and the emergency physician, compared to the gold standard diagnosis for each simulated patient case. Secondary outcomes included:

  • Accuracy in identifying both primary and secondary diagnoses.
  • Duration of the patient interview.
  • Number and nature of questions asked by each method.
  • Assessment of patient triage (urgent vs. non-urgent).

We hypothesized that the diagnosis checker would be non-inferior to emergency physicians in diagnostic effectiveness. Statistical analysis used McNemar tests for paired qualitative variables and paired Wilcoxon signed-rank tests for quantitative variables. A superiority analysis was pre-planned, to be performed if non-inferiority was not established, in order to determine whether physicians exhibited superior diagnostic performance.
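
The sketch below illustrates how such paired tests can be run on per-case outcomes. It is our own illustration only: the data are simulated placeholders, not study results, and the use of SciPy and statsmodels is an assumption rather than the analysis software actually employed.

    # Paired comparison on simulated placeholder data: McNemar's test for the
    # binary "correct main diagnosis" outcome, Wilcoxon signed-rank test for
    # a paired quantitative outcome such as interview duration.
    import numpy as np
    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_cases = 220  # one record per simulated patient case

    # Placeholder paired binary outcomes: 1 = correct main diagnosis, 0 = incorrect
    physician_correct = rng.binomial(1, 0.8, n_cases)
    checker_correct = rng.binomial(1, 0.3, n_cases)

    # 2x2 table of paired agreement/disagreement for McNemar's test
    table = np.array([
        [np.sum((physician_correct == 1) & (checker_correct == 1)),
         np.sum((physician_correct == 1) & (checker_correct == 0))],
        [np.sum((physician_correct == 0) & (checker_correct == 1)),
         np.sum((physician_correct == 0) & (checker_correct == 0))],
    ])
    print(mcnemar(table, exact=True))  # tests whether the two accuracies differ

    # Placeholder paired interview durations (minutes) for the Wilcoxon test
    physician_minutes = rng.normal(8.0, 2.0, n_cases)
    checker_minutes = rng.normal(8.0, 2.0, n_cases)
    print(wilcoxon(physician_minutes, checker_minutes))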

Results

Simulated Patient Characteristics and Diagnosis Coverage

The simulated patient cases encompassed a diverse range of medical conditions relevant to unscheduled care settings, as detailed in Table 1. However, during the evaluation we found that the diagnosis checker did not recognize 4 of the 44 conditions, corresponding to 20 simulated patient cases (Table 2). This limitation is important to consider when interpreting the overall performance of the diagnosis checker.

Table 1. Demographic and clinical characteristics of the simulated patient cohort.

Table 2. Diagnoses not recognized by the evaluated diagnosis checker software.

Comparative Diagnostic Performance

The diagnosis checker demonstrated non-inferiority to emergency physicians in terms of interview duration. However, significant differences emerged in diagnostic accuracy and triage assessment. Emergency physicians significantly outperformed the diagnosis checker in identifying the primary diagnosis (81% vs. 30%) and in accurately suggesting both primary and secondary diagnoses (92% vs. 52%). Furthermore, physicians exhibited superior accuracy in patient triage, correctly identifying vital emergencies in 96% of cases, compared to 71% for the diagnosis checker (Table 3).

Table 3. Comparative analysis of diagnosis checker and emergency physician performance metrics.

Diagnostic Concordance and Discordance Analysis

Further analysis of diagnostic performance across specific pathologies revealed a small number of conditions for which the diagnosis checker outperformed the physicians, including cystitis, acute viral pericarditis, asthma attack, and arterial hypertension. Overall, however, diagnostic performance varied considerably across diseases, as illustrated in Fig 2.

Fig 2. Diagnostic performance comparison between diagnosis checker and emergency physicians across various medical conditions.

Physician Evaluator Feedback

Physician evaluators, through satisfaction questionnaires, largely affirmed the realism and relevance of the simulated patient encounters. Both evaluators strongly agreed that the clinical situations mirrored unscheduled care scenarios and were similar to real-life patient encounters. They also agreed that the simulated patient interactions effectively simulated their daily practice environment.

Discussion

This study, pioneering the use of simulated and standardized patients in evaluating diagnosis checkers, provides valuable insights into their current capabilities and limitations compared to human physicians. Our findings underscore that while diagnosis checkers can achieve comparable interview times, their diagnostic accuracy and triage capabilities currently fall significantly short of experienced emergency physicians.

The OSCE-based methodology employed in this study offers a robust and reproducible framework for evaluating not only diagnosis checkers but also other telehealth modalities like teleconsultations and telephone triage. The realism of simulated patient encounters, validated by physician feedback, strengthens the ecological validity of our findings. This approach addresses the limitations of solely relying on clinical case reviews, offering a more dynamic and representative evaluation method.

Diagnosis checkers and the broader adoption of remote consultation tools hold significant promise for public health. They offer potential solutions for streamlining patient flow, standardizing initial patient assessments, and improving access to care, particularly in resource-constrained settings [25]. Countries like Sweden have already integrated diagnosis checker systems to optimize paramedic dispatch and telephone triage services [26-28]. By acting as a preliminary gatekeeper, diagnosis checkers can potentially guide patients to the appropriate level of care, enhancing healthcare system efficiency and patient compliance [29].

However, our study highlights critical areas for improvement in diagnosis checker technology. The limited diagnostic scope of the evaluated diagnosis checker, evidenced by its inability to recognize certain conditions, significantly impacts its overall effectiveness. Furthermore, the study emphasizes the complexity of medical diagnostic reasoning, which encompasses patient history, nuanced symptom interpretation, and clinical experience – aspects that current diagnosis checkers struggle to fully replicate [31-34]. To enhance their utility, future development should focus on expanding diagnostic databases, incorporating patient history and treatment information, and integrating more sophisticated diagnostic reasoning algorithms.

Limitations

While this study offers valuable insights, certain limitations should be considered. Firstly, this is an exploratory study focusing on the feasibility and relevance of the OSCE methodology for diagnosis checker evaluation. Secondly, the absence of four of the selected diagnoses from the evaluated diagnosis checker’s database may lead us to underestimate its potential performance; however, this also reflects real-world conditions, in which software may not encompass the entirety of medical knowledge. Thirdly, our evaluation focused on a single diagnosis checker utilizing neural network technology; future research should evaluate a broader range of diagnosis checker technologies to provide a more comprehensive assessment. Finally, further research is warranted to explore the educational applications of remote OSCEs in training healthcare professionals in the effective use of diagnosis checkers and other telehealth tools.

Conclusions

This exploratory study demonstrates the feasibility and value of employing simulated and standardized patients, within an OSCE framework, to evaluate the diagnostic performance of diagnosis checkers and compare them to physician expertise in scenarios mimicking telephone consultations or medical triage. Our findings suggest that while diagnosis checkers offer potential benefits in healthcare delivery, their current diagnostic accuracy and triage capabilities require further refinement. The OSCE-based evaluation method provides a valuable tool for objectively assessing and improving these digital health tools, ultimately contributing to their responsible and effective integration into healthcare systems. Future research should extend this evaluation framework to diverse diagnosis checker platforms and explore its educational applications to optimize the synergy between digital tools and healthcare professionals.

Acknowledgments

The authors express their gratitude to Nicolas Desrumaux for his contribution as an actor in this study and to the developers of the diagnosis checker software for providing access to their platform.
