Principles for Medical Diagnosis: A Counterfactual Approach

Medical diagnosis is a complex task that has traditionally relied on associative reasoning to link symptoms to diseases. However, a more robust approach involves understanding the causal relationships between diseases and symptoms. This article explores the principles of diagnostic reasoning rooted in causal attribution, focusing on counterfactual inference to enhance the accuracy and reliability of medical diagnosis.

Associative diagnosis often falls short because it merely identifies correlations without delving into causation. To truly diagnose, we need to assess the probability that a disease D is the actual cause of a patient’s symptoms S. This requires a diagnostic measure, denoted as \(\mathcal{M}(D,\mathcal{E})\), which ranks the likelihood of a disease causing observed symptoms \(\mathcal{E}\). Any effective diagnostic measure should adhere to the following fundamental principles:

  1. Consistency: The diagnostic likelihood of a disease should align with its posterior probability. Mathematically, this means \(\mathcal{M}(D,\mathcal{E}) \propto P(D=T \mid \mathcal{E})\). In simpler terms, a disease more likely to be present given the symptoms should also be considered a more probable diagnosis.
  2. Causality: A disease that cannot cause any of the patient’s symptoms, either directly or indirectly, should not be considered a valid diagnosis. Therefore, \(\mathcal{M}(D,\mathcal{E}) = 0\) if there’s no causal link between the disease and the symptoms. Diagnosis must be grounded in causal mechanisms.
  3. Simplicity: In line with Occam’s razor, diagnoses involving fewer diseases that explain a greater number of symptoms are preferred. This principle favors parsimonious explanations, avoiding unnecessarily complex diagnostic scenarios.

While posterior probability satisfies the consistency principle, it often fails to incorporate causality and simplicity effectively. To address these limitations, we delve into counterfactual diagnosis.
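
The causality principle, in particular, lends itself to a mechanical check. The sketch below assumes a hypothetical causal DAG and illustrative disease and symptom names; it simply zeroes out any candidate diagnosis that has no directed path to an observed symptom:

```python
from collections import deque

def has_causal_path(dag, disease, evidence_symptoms):
    """Return True if any observed symptom is reachable from `disease`
    by following directed edges in the causal DAG."""
    frontier, seen = deque([disease]), {disease}
    while frontier:
        node = frontier.popleft()
        if node in evidence_symptoms:
            return True
        for child in dag.get(node, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return False

# Toy DAG: edges point from causes to effects (risk factor -> disease -> symptom).
dag = {
    "smoking": ["bronchitis"],
    "bronchitis": ["cough", "wheeze"],
    "migraine": ["headache"],
}
evidence = {"cough"}

# The causality principle: score 0 for diseases that cannot cause any observed symptom.
for d in ["bronchitis", "migraine"]:
    print(d, "admissible" if has_causal_path(dag, d, evidence) else "score 0")
```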

Counterfactual Reasoning in Medical Diagnosis

Counterfactual inference provides a powerful framework to assess causal relationships in medical diagnosis. It allows us to explore “what if” scenarios: what would have happened if a certain condition were different? In the context of diagnosis, counterfactuals help us determine if a symptom would disappear if a suspected disease were eliminated.

Given a patient’s symptoms \(\mathcal{E} = e\), counterfactuals enable us to calculate the likelihood of observing a different symptom outcome \(\mathcal{E} = e'\) if a hypothetical intervention were applied. This counterfactual likelihood is expressed as \(P(\mathcal{E} = e' \mid \mathcal{E} = e, \mathrm{do}(X = x))\), where do(X = x) signifies an intervention setting variable X to value x.

In medical diagnosis, we can use counterfactuals to quantify how well a disease hypothesis D = T explains a symptom S = T. We calculate the probability that the symptom would not be present if we were to intervene and “cure” the disease, represented by \(P(S = F \mid S = T, \mathrm{do}(D = F))\). A high probability suggests that D = T is a strong causal explanation for the symptom. This counterfactual probability, contrasting with standard posterior probabilities, directly addresses the causal link between disease and symptom.
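
As a concrete illustration, the sketch below evaluates \(P(S = F \mid S = T, \mathrm{do}(D = F))\) for a single symptom with one parent disease and a leak term, using a noisy-OR structural equation. The failure and leak probabilities are made up, and the brute-force enumeration of the noise terms (abduction, then replaying the mechanism under the intervention) is for illustration only:

```python
from itertools import product

# Noisy-OR SCM for one symptom S with one parent disease D plus a leak:
#   S = (D and not U_fail) or U_leak
# Exogenous noise: U_fail ~ Bernoulli(lam) blocks D's activation,
#                  U_leak ~ Bernoulli(leak) activates S spontaneously.
lam, leak = 0.3, 0.1

def counterfactual_relief(lam, leak):
    """P(S = F | S = T, D = T, do(D = F)): enumerate the noise terms consistent
    with the factual observation, then replay the mechanism with D forced to F."""
    num = den = 0.0
    for u_fail, u_leak in product([0, 1], repeat=2):
        w = (lam if u_fail else 1 - lam) * (leak if u_leak else 1 - leak)
        s_factual = (1 and not u_fail) or u_leak          # D = T in the factual world
        if not s_factual:                                  # keep noise consistent with S = T
            continue
        s_counterfactual = (0 and not u_fail) or u_leak    # do(D = F), same noise
        den += w
        num += w * (not s_counterfactual)
    return num / den

print(counterfactual_relief(lam, leak))
# matches the closed form (1 - leak) * (1 - lam) / (1 - lam * (1 - leak))
print((1 - leak) * (1 - lam) / (1 - lam * (1 - leak)))
```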

Inspired by this approach, we introduce two counterfactual diagnostic measures: Expected Disablement and Expected Sufficiency. Both measures, as we will demonstrate, satisfy the crucial principles of consistency, causality, and simplicity for effective medical diagnosis.

Expected Disablement: Quantifying Necessary Cause

Definition 1 (Expected Disablement): The expected disablement of a disease D quantifies the number of present symptoms expected to cease if we were to cure disease D. It is mathematically defined as:

$$\mathbb{E}_{\mathrm{dis}}(D,\mathcal{E}) := \sum_{\mathcal{S}'}\left|\mathcal{S}_{+}\setminus \mathcal{S}_{+}'\right|\,P(\mathcal{S}' \mid \mathcal{E},\mathrm{do}(D=F))$$

Here, \(\mathcal{E}\) represents the observed symptoms, and \(\mathcal{S}_{+}\) is the set of currently present symptoms. The summation considers all possible symptom states \(\mathcal{S}'\) under the counterfactual intervention do(D = F) – curing disease D. \(\mathcal{S}_{+}'\) denotes the present symptoms in this counterfactual scenario. The term \(\left|\mathcal{S}_{+}\setminus \mathcal{S}_{+}'\right|\) calculates the number of symptoms that were present but are no longer present after curing D.

Expected disablement is rooted in the concept of necessary cause. Disease D is considered a necessary cause of symptom S if S only occurs when D is present. Therefore, expected disablement assesses how well disease D alone explains the patient’s symptoms and the likelihood that treating D will alleviate those symptoms.

Expected Sufficiency: Assessing Sufficient Cause

Definition 2 (Expected Sufficiency): The expected sufficiency of disease D measures the number of symptoms expected to persist even if all other potential causes are eliminated, except for disease D. It is defined as:

$$\mathbb{E}_{\mathrm{suff}}(D,\mathcal{E}) := \sum_{\mathcal{S}'}\left|\mathcal{S}_{+}'\right|\,P(\mathcal{S}' \mid \mathcal{E},\mathrm{do}(\mathsf{Pa}(\mathcal{S}_{+})\setminus D = F))$$

In this definition, \(\mathsf{Pa}(\mathcal{S}_{+})\setminus D\) represents all direct causes of the present symptoms, excluding disease D. The intervention \(\mathrm{do}(\mathsf{Pa}(\mathcal{S}_{+})\setminus D = F)\) sets all these causes to false (inactive). Expected sufficiency then calculates the expected number of remaining symptoms \(\left|\mathcal{S}_{+}'\right|\) in these counterfactual scenarios.

Expected sufficiency is based on the notion of sufficient cause. Disease D is a sufficient cause of symptom S if the presence of D can lead to S, but S can have other causes as well. By removing all other potential causes, expected sufficiency isolates the effect of disease D as a sufficient cause in our diagnostic model. If the assumption of disease as a sufficient cause is questionable, expected disablement is the more appropriate measure.
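
To make Definitions 1 and 2 concrete, the sketch below evaluates both measures on a tiny two-disease, two-symptom noisy-OR network by enumerating the exogenous noise terms directly. The structure, priors, and failure/leak probabilities are invented for illustration, and treating the leak terms as additional parents to be switched off for expected sufficiency is an assumption on my part rather than something stated above:

```python
from itertools import product

# Toy noisy-OR network (risk factors folded into the disease priors).
priors = {"D1": 0.10, "D2": 0.05}                    # P(D_i = T)
edges  = {("D1", "S1"): 0.2, ("D1", "S2"): 0.4,      # lambda_{D_i,S_j}: failure probs
          ("D2", "S2"): 0.3}
leaks  = {"S1": 0.05, "S2": 0.05}                    # spontaneous activation probs
evidence = {"S1": True, "S2": True}                  # observed symptom states
s_plus = [s for s, v in evidence.items() if v]

diseases, symptoms, edge_list = list(priors), list(leaks), list(edges)

def symptom_states(d_state, fails, leak_fires, forced_off=()):
    """Evaluate each symptom's structural equation for one noise setting,
    holding every cause listed in `forced_off` at F (the intervention)."""
    out = {}
    for s in symptoms:
        active = leak_fires[s] and ("leak", s) not in forced_off
        for d in diseases:
            if (d, s) in edges and d not in forced_off:
                active = active or (d_state[d] and not fails[(d, s)])
        out[s] = active
    return out

def counterfactual_expectation(forced_off, count):
    """E[count(factual, counterfactual)] over noise settings consistent with evidence."""
    num = den = 0.0
    for d_bits in product([False, True], repeat=len(diseases)):
        d_state = dict(zip(diseases, d_bits))
        w_d = 1.0
        for d in diseases:
            w_d *= priors[d] if d_state[d] else 1 - priors[d]
        for f_bits in product([False, True], repeat=len(edge_list)):
            fails = dict(zip(edge_list, f_bits))
            w_f = w_d
            for e in edge_list:
                w_f *= edges[e] if fails[e] else 1 - edges[e]
            for l_bits in product([False, True], repeat=len(symptoms)):
                leak_fires = dict(zip(symptoms, l_bits))
                w = w_f
                for s in symptoms:
                    w *= leaks[s] if leak_fires[s] else 1 - leaks[s]
                factual = symptom_states(d_state, fails, leak_fires)
                if any(factual[s] != evidence[s] for s in evidence):
                    continue                     # abduction: keep consistent noise only
                cf = symptom_states(d_state, fails, leak_fires, forced_off)
                den += w
                num += w * count(factual, cf)
    return num / den

# Definition 1: expected disablement of D1 -- cure D1, count present symptoms that cease.
e_dis = counterfactual_expectation(
    forced_off=("D1",),
    count=lambda f, cf: sum(1 for s in s_plus if f[s] and not cf[s]))

# Definition 2: expected sufficiency of D1 -- switch off every other cause of the
# present symptoms (D2 and, by assumption, the leak terms), count symptoms that persist.
e_suff = counterfactual_expectation(
    forced_off=("D2", ("leak", "S1"), ("leak", "S2")),
    count=lambda f, cf: sum(1 for s in s_plus if cf[s]))

print(f"E_dis(D1)  = {e_dis:.4f}")
print(f"E_suff(D1) = {e_suff:.4f}")
```

Enumerating every noise configuration is exponential in the network size, which is exactly why the article turns to twin networks and the closed-form expressions of Theorem 2 below.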

Theorem 1 (Diagnostic Properties): Both Expected Disablement and Expected Sufficiency adhere to the three fundamental desiderata for diagnostic measures: consistency, causality, and simplicity.

Structural Causal Models for Enhanced Medical Diagnosis

To implement and test these counterfactual diagnostic measures, we utilize structural causal models (SCMs). SCMs extend Bayesian networks (BNs), which are powerful tools for representing the relationships between diseases, risk factors, and symptoms. BNs are widely used in medical diagnosis because they are interpretable and can explicitly encode causal links, which is essential for causal and counterfactual analysis.

These models typically represent diseases, symptoms, and risk factors as binary nodes (true or false). A directed acyclic graph (DAG) defines the relationships, with arrows indicating causal direction. For example, a risk factor might cause a disease, which in turn causes a symptom.

Fig. 2: Generative structure of our diagnostic Bayesian networks.

(a) Three-layer Bayesian network depicting risk factors \(R_i\), diseases \(D_j\), and symptoms \(S_k\). (b) Illustration of a noisy-OR CPT. Symptom S is activated by the Boolean OR of its parents, with each parent having an independent probability \(\lambda_i\) of failing to activate it.


SCMs go beyond BNs by explicitly modeling each variable as a deterministic function of its direct causes and unobserved noise. This allows for simulating interventions and computing counterfactuals. Existing diagnostic BNs, particularly noisy-OR networks, can be naturally represented as SCMs.

Noisy-OR Twin Diagnostic Networks for Efficient Computation

Noisy-OR models are frequently employed in medical diagnosis due to their intuitive representation of disease-symptom relationships and computational efficiency. In a noisy-OR model, a parent disease \(D_i\) activates a symptom S if the disease is present and the activation isn’t blocked by random failure. The failure probability, \(\lambda_{D_i,S}\), is specific to each disease-symptom pair. A symptom is activated if at least one of its parent diseases successfully activates it, reflecting an OR logic.
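
A minimal sketch of the noisy-OR conditional probability: the symptom stays off only if every active parent independently fails to activate it. The leak term and parameter values here are illustrative assumptions:

```python
from math import prod

def noisy_or_p_symptom_true(parent_states, failure_probs, leak=0.0):
    """Noisy-OR CPT: the symptom stays off only if every present parent
    independently fails to activate it and the (assumed) leak does not fire.

    parent_states: dict {disease: bool}
    failure_probs: dict {disease: lambda_{D,S}}  -- per-edge failure probabilities
    """
    p_off = (1 - leak) * prod(
        failure_probs[d] for d, present in parent_states.items() if present
    )
    return 1 - p_off

# Example: two candidate parent diseases for one symptom.
print(noisy_or_p_symptom_true({"flu": True, "pneumonia": False},
                              {"flu": 0.3, "pneumonia": 0.1}, leak=0.05))
# = 1 - 0.95 * 0.3 = 0.715
```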

To efficiently compute expected disablement and sufficiency in noisy-OR models, we use the twin-networks method. This approach constructs a “twin network” that simultaneously represents both the actual and counterfactual scenarios within a single SCM. This dramatically reduces the computational cost compared to traditional abduction methods, making counterfactual reasoning practical for large-scale medical diagnosis models. We refer to these models as twin diagnostic networks.

Theorem 2: For three-layer noisy-OR BNs, the expected sufficiency and expected disablement of disease Dk can be calculated using simplified expressions:

$$\frac{\sum_{\mathcal{Z}\subseteq \mathcal{S}_{+}}{(-1)}^{|\mathcal{Z}|}\,P(\mathcal{S}_{-}=0,\mathcal{Z}=0,D_{k}=1\mid \mathcal{R})\,\tau(k,\mathcal{Z})}{P(\mathcal{S}_{\pm}\mid \mathcal{R})},$$

where for expected sufficiency:

$$\tau(k,\mathcal{Z})=\sum_{S\in \mathcal{S}_{+}\setminus \mathcal{Z}}(1-\lambda_{D_{k},S}),$$

and for expected disablement:

$$\tau(k,\mathcal{Z})=\sum_{S\in \mathcal{Z}}\left(1-\frac{1}{\lambda_{D_{k},S}}\right).$$

Here, \(\mathcal{S}_{\pm}\) denotes the observed symptoms (positive and negative), \(\mathcal{R}\) represents the risk factors, and \(\lambda_{D_k,S}\) is the noisy-OR parameter for disease \(D_k\) and symptom S.
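
The following sketch implements the two Theorem 2 expressions as displayed above on a small illustrative network. The conditioning on risk factors is folded into made-up disease priors, the leak parameterization is an assumption, and the joint marginals are computed by brute-force enumeration over disease states, so this shows the structure of the formulas rather than an efficient inference procedure:

```python
from itertools import combinations, product
from math import prod

# Illustrative three-layer noisy-OR network; P(D_i = 1 | R) is folded into `priors`.
priors = {"D1": 0.10, "D2": 0.05}
lam    = {("D1", "S1"): 0.2, ("D1", "S2"): 0.4, ("D2", "S2"): 0.3}  # lambda_{D_k,S}
leak_fail = {"S1": 0.95, "S2": 0.95}     # assumed lambda_{leak,S}: prob the leak fails
s_plus, s_minus = ["S1", "S2"], []       # positively / negatively observed symptoms

diseases = list(priors)

def p_symptom_off(s, d_state):
    """P(S = 0 | D) for a noisy-OR symptom: leak failure times the failure
    probabilities of the present parent diseases."""
    return leak_fail[s] * prod(lam[(d, s)] for d in diseases
                               if d_state[d] and (d, s) in lam)

def joint(zero_symptoms, require=None):
    """P(all symptoms in `zero_symptoms` are 0, optional D=1 constraint),
    by enumerating disease states (tractable only for tiny networks)."""
    total = 0.0
    for bits in product([0, 1], repeat=len(diseases)):
        d_state = dict(zip(diseases, bits))
        if require and d_state[require] != 1:
            continue
        w = prod(priors[d] if d_state[d] else 1 - priors[d] for d in diseases)
        total += w * prod(p_symptom_off(s, d_state) for s in zero_symptoms)
    return total

def p_evidence():
    """P(S+ all 1, S- all 0) via inclusion-exclusion over subsets of S+."""
    return sum((-1) ** len(z) * joint(list(s_minus) + list(z))
               for r in range(len(s_plus) + 1)
               for z in combinations(s_plus, r))

def theorem2(d_k, measure):
    """Expected sufficiency or disablement of disease d_k via the Theorem 2 sums."""
    num = 0.0
    for r in range(len(s_plus) + 1):
        for z in combinations(s_plus, r):
            if measure == "suff":
                tau = sum(1 - lam.get((d_k, s), 1.0) for s in s_plus if s not in z)
            else:  # "dis"
                tau = sum(1 - 1 / lam[(d_k, s)] for s in z if (d_k, s) in lam)
            num += (-1) ** len(z) * joint(list(s_minus) + list(z), require=d_k) * tau
    return num / p_evidence()

print("E_suff(D1) =", theorem2("D1", "suff"))
print("E_dis(D1)  =", theorem2("D1", "dis"))
```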

Experimental Validation in Medical Diagnosis

To evaluate the effectiveness of counterfactual medical diagnosis, we conducted experiments comparing expected disablement and sufficiency with traditional posterior inference.

Diagnostic Model and Datasets: Clinical Vignettes

Validating diagnostic algorithms in medical diagnosis is challenging due to the difficulty in establishing ground truth diagnoses in real-world Electronic Health Records (EHRs). Diagnostic errors, incomplete data, and clinician biases can confound EHR-based validation.

To overcome these challenges, we utilized clinical vignettes – simulated patient cases that present typical disease symptoms, medical history, and demographics. Clinical vignettes are a gold standard for assessing diagnostic accuracy, offering a controlled and unbiased evaluation method, widely used for evaluating human doctors and symptom checker algorithms.

Our test set comprised 1671 clinical vignettes, created by a panel of experienced doctors. These vignettes were designed to be realistic diagnostic scenarios, with symptoms and risk factors aligned with our disease model. For each vignette, the true disease was masked, and our algorithms (posterior, expected disablement, and expected sufficiency) were tasked with ranking potential diagnoses. Doctors also provided independent diagnoses for the same vignettes, allowing for a direct comparison with algorithm performance in medical diagnosis.

Counterfactual vs. Associative Diagnostic Rankings

Our first experiment compared the diagnostic accuracy of counterfactual algorithms (expected disablement/sufficiency) against the associative algorithm (posterior probability). For each vignette, we assessed the top-k accuracy – the fraction of cases where the true disease was within the top-k ranked diagnoses, for k ranging from 1 to 20.
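
For reference, top-k accuracy takes only a few lines to compute; the rankings and disease names below are purely illustrative:

```python
def top_k_accuracy(ranked_diagnoses, true_diseases, k):
    """Fraction of cases whose true disease appears among the first k ranked diagnoses."""
    hits = sum(1 for ranking, truth in zip(ranked_diagnoses, true_diseases)
               if truth in ranking[:k])
    return hits / len(true_diseases)

# Illustrative usage: one ranking per vignette, highest-scoring disease first.
rankings = [["flu", "pneumonia", "asthma"],
            ["migraine", "tension headache"],
            ["asthma", "copd", "flu"]]
truths = ["pneumonia", "migraine", "flu"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, truths, k))
# k=1: 1/3, k=2: 2/3, k=3: 3/3
```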

The results, shown in Figure 3, demonstrate that expected sufficiency (and expected disablement, which performed virtually identically) outperforms the associative algorithm across all top-k rankings, with the largest gains at higher values of k.

Fig. 3: Top k accuracy of Bayesian and counterfactual algorithms.

The figure illustrates the top k error rate (1 − accuracy) for counterfactual (green) and associative (blue) algorithms across 1671 vignettes, plotted against k. Shaded areas represent 95% confidence intervals. The dashed black line indicates the percentage reduction in error when using the counterfactual algorithm compared to the associative one, \(1 - e_c/e_a\). Results are shown for k = 1 to 15.


For top-1 accuracy, the counterfactual algorithm showed a 2.5% improvement. However, the performance gap widened significantly for higher k values. For k > 5, the counterfactual algorithm reduced misdiagnoses by approximately 30% compared to the associative approach. This suggests that while posterior probability effectively identifies the most likely diagnosis, counterfactual ranking excels at identifying subsequent likely diagnoses, crucial for differential medical diagnosis, triage, and treatment planning.

Comparing the ranking of the true disease, the counterfactual algorithm ranked the true disease higher in 24.7% of cases and lower in only 1.9% of cases compared to the associative algorithm. The average rank of the true disease improved from 3.81 ± 5.25 (associative) to 3.16 ± 4.4 (counterfactual).

Table 1: Position of true disease in ranking stratified by rareness of disease.

Stratifying by disease prevalence (Table 1), the counterfactual algorithm showed marked improvements, particularly for rare and very rare diseases. For these categories, the counterfactual approach achieved a higher ranking in 29.2% and 32.9% of vignettes, respectively. This is especially significant as rare diseases are diagnostically challenging and often represent serious conditions where accurate medical diagnosis is paramount.

Benchmarking Against Doctors in Medical Diagnosis

In our second experiment, we compared the algorithms’ performance to a cohort of 44 doctors. Each doctor diagnosed at least 50 vignettes, providing a partially ranked list of diagnoses for each one. We evaluated the accuracy of the doctors, the associative algorithm, and the counterfactual algorithm by comparing their diagnoses against the true disease for each vignette. For a fair comparison, the algorithms were configured to return a diagnosis list of the same size as each doctor’s.

Fig. 4: Mean accuracy of each doctor compared to Bayesian and counterfactual algorithms.

This figure compares the average diagnostic accuracy of each of the 44 doctors against the accuracy of the posterior ranking (top) and expected sufficiency ranking (bottom) algorithms when diagnosing the same set of vignettes. The y = x line serves as a reference: points above the line indicate doctors with lower accuracy than the algorithm (blue), points on the line represent equal accuracy (red), and points below the line indicate doctors with higher accuracy (green). The correlation observed is due to variations in the difficulty of vignette sets across doctors.


The results (Figure 4, Table 2) revealed that the associative algorithm performed comparably to the average doctor (72.52% vs. 71.4% mean accuracy). However, the counterfactual algorithm achieved a significantly higher mean accuracy of 77.26%, surpassing the average doctor and placing in the top quartile of the doctor cohort. Notably, the counterfactual algorithm tended to outperform doctors in more challenging diagnostic cases (vignettes with lower overall accuracy).

Table 2: Group mean accuracy of doctors and algorithms.

Conclusion: Advancing Medical Diagnosis with Counterfactual Reasoning

Our findings demonstrate that counterfactual reasoning offers a substantial improvement in diagnostic accuracy compared to traditional associative methods in medical diagnosis. The counterfactual approach, embodied in expected disablement and expected sufficiency, not only aligns with fundamental diagnostic principles but also shows superior performance, particularly for rare diseases and complex cases. While associative algorithms perform at the level of an average physician, counterfactual algorithms reach the performance of top-tier clinicians. This highlights the potential of counterfactual AI in augmenting and enhancing medical diagnosis, leading to more accurate, efficient, and ultimately, better patient care.
