Decoding Diagnosis Code 285.9: Understanding Anemia, Unspecified, and Machine Learning in Medical Coding

The application of machine learning to healthcare is rapidly evolving, with significant strides being made in areas like automated medical coding. The MIMIC-III dataset [7], a freely accessible critical care database, has become a cornerstone for research in this domain, particularly for multi-label classification tasks [10, 29]. Focusing on discharge summaries from MIMIC-III, which encapsulate a patient’s hospital stay in a single document, researchers are exploring methods to automatically assign ICD-9 codes. These codes are crucial for detailing the diagnoses and treatments administered during a patient’s admission. This article examines the performance of a novel model, SWAM (Shallow and Wide Attention Model), in predicting these codes, with a particular focus on Diagnosis Code 285.9, representing “Anemia, unspecified,” and the challenges it presents to machine learning models.

MIMIC-III Dataset and the SWAM Model

The study leverages the MIMIC-III dataset, training and evaluating the SWAM model on discharge summaries. The dataset is filtered to include instances with at least one of the 50 most frequent ICD-9 codes. To ensure data integrity and prevent patient-specific correlations from skewing results, the dataset is split by patient ID, guaranteeing that no patient appears in more than one of the training, validation, and testing sets. This rigorous split results in 8,067 summaries for training, 1,574 for validation, and 1,730 for testing, providing a robust foundation for model evaluation. Further details on dataset statistics can be found in Table 2.
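For readers who want to reproduce this kind of split, the sketch below shows one way to do it with scikit-learn’s GroupShuffleSplit; the DataFrame layout and the SUBJECT_ID column name (borrowed from MIMIC-III’s tables) are assumptions, not the authors’ actual code.

```python
# A minimal sketch of a patient-level split, assuming a DataFrame `df` with
# one row per discharge summary and a SUBJECT_ID column (as in MIMIC-III's
# NOTEEVENTS table). GroupShuffleSplit keeps all rows sharing a SUBJECT_ID
# on the same side of each split.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df, test_size=0.15, val_size=0.15, seed=42):
    # First carve out the test set, grouping rows by patient ID.
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(gss.split(df, groups=df["SUBJECT_ID"]))
    trainval, test = df.iloc[trainval_idx], df.iloc[test_idx]

    # Then split the remainder into train and validation, again by patient.
    rel_val = val_size / (1.0 - test_size)
    gss2 = GroupShuffleSplit(n_splits=1, test_size=rel_val, random_state=seed)
    train_idx, val_idx = next(gss2.split(trainval, groups=trainval["SUBJECT_ID"]))
    return trainval.iloc[train_idx], trainval.iloc[val_idx], test

# Sanity check: no patient should appear in more than one split.
# train, val, test = split_by_patient(df)
# assert not set(train.SUBJECT_ID) & set(test.SUBJECT_ID)
```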

Data Preprocessing and Baselines

Preprocessing steps are crucial to refine the text data for machine learning models. Tokens lacking alphabetic characters, such as standalone numbers, are removed, while tokens appearing infrequently (fewer than 3 times in the training documents) are replaced with an ‘UNK’ token to manage vocabulary size and focus on semantically relevant terms. Discharge summaries are truncated to a maximum length of 2,500 tokens, which accommodates roughly 90% of summaries given the dataset’s long-tailed length distribution.
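A rough Python sketch of this pipeline is shown below; the tokenizer itself is an assumption, since the article does not specify one, but the filtering, UNK replacement, and truncation follow the rules just described.

```python
# A rough sketch of the preprocessing described above; the regex tokenizer
# is an assumption, not the authors' exact implementation.
import re
from collections import Counter

MAX_LEN = 2500   # maximum document length in tokens
MIN_COUNT = 3    # minimum training-set frequency to keep a token

def tokenize(text):
    # Lowercase, split on non-word characters, and drop tokens that
    # contain no alphabetic character (e.g. standalone numbers).
    tokens = re.split(r"\W+", text.lower())
    return [t for t in tokens if any(c.isalpha() for c in t)]

def build_vocab(train_docs):
    counts = Counter(t for doc in train_docs for t in tokenize(doc))
    # Keep only tokens seen at least MIN_COUNT times in the training set.
    return {t for t, c in counts.items() if c >= MIN_COUNT}

def preprocess(doc, vocab):
    tokens = [t if t in vocab else "UNK" for t in tokenize(doc)]
    return tokens[:MAX_LEN]  # truncate to the maximum length
```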

The SWAM model, described as a versatile CNN architecture, is implemented in two variations, drawing inspiration from existing models: “SWAM-textCNN” [25] and “SWAM-CAML” [12]. These variations differ in their attention layer implementations. The baselines for comparison include a bag-of-words logistic regression model and the original CAML model [12]. SWAM models and baselines are initialized with pre-trained word2vec vectors, while the logistic regression model utilizes unigram bag-of-words features.
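While the exact SWAM implementation is not reproduced here, the following minimal PyTorch sketch illustrates the shared idea behind both variants: a single (shallow) convolutional layer with many (wide) filters, followed by per-label attention in the spirit of CAML [12]. Layer names and sizes are illustrative, not the authors’ configuration.

```python
# A minimal sketch of a shallow-and-wide attention CNN; dimensions and
# activations are assumptions chosen for illustration.
import torch
import torch.nn as nn

class ShallowWideAttnCNN(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100,
                 num_filters=500, kernel_size=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # One ("shallow") convolutional layer with many ("wide") filters.
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)
        # One attention vector and one output vector per label.
        self.attn = nn.Linear(num_filters, num_labels, bias=False)
        self.out = nn.Linear(num_filters, num_labels)

    def forward(self, x):                           # x: (batch, seq_len)
        h = torch.tanh(self.conv(self.embed(x).transpose(1, 2)))
        h = h.transpose(1, 2)                       # (batch, seq_len, filters)
        # Per-label attention over token positions.
        alpha = torch.softmax(self.attn(h), dim=1)  # (batch, seq_len, labels)
        m = torch.einsum("bsl,bsf->blf", alpha, h)  # label-specific contexts
        logits = (self.out.weight * m).sum(-1) + self.out.bias
        return logits                               # (batch, labels), pre-sigmoid
```

The attention weights alpha are also what makes the model interpretable: for each code, they indicate which positions in the document drove the prediction, which is how the “informative snippets” discussed later are recovered.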

Hyperparameter Tuning and Evaluation Metrics

To optimize SWAM model performance, hyperparameters such as learning rate (η), filter size (k), number of filters (d_c), and dropout probability (q) are tuned using grid search, guided by ranges and empirical insights from prior research [12, 25, 31]. A fixed batch size of 16 and early stopping (patience of 10 epochs based on f1-macro improvement) are employed during training.
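The sketch below illustrates what such a tuning loop might look like; the grid values and the helper functions (build_model, train_one_epoch, macro_f1_on_validation) are placeholders, not the paper’s actual setup.

```python
# An illustrative grid-search loop with f1-macro early stopping; the
# hyperparameter ranges here are placeholders, not the paper's exact grid.
import itertools

grid = {
    "lr": [1e-4, 3e-4, 1e-3],        # learning rate (eta)
    "kernel_size": [4, 8, 10],       # filter size (k)
    "num_filters": [50, 200, 500],   # number of filters (d_c)
    "dropout": [0.2, 0.4, 0.6],      # dropout probability (q)
}

def train_with_early_stopping(config, patience=10, max_epochs=100):
    best_f1, best_state, epochs_no_improve = 0.0, None, 0
    model = build_model(config)  # assumed factory for the SWAM model
    for epoch in range(max_epochs):
        train_one_epoch(model, lr=config["lr"], batch_size=16)  # assumed helper
        f1 = macro_f1_on_validation(model)                      # assumed helper
        if f1 > best_f1:
            best_f1, best_state, epochs_no_improve = f1, model.state_dict(), 0
        else:
            epochs_no_improve += 1
            if epochs_no_improve >= patience:
                break  # no f1-macro improvement for `patience` epochs
    return best_f1, best_state

results = {}
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    # results[tuple(values)] = train_with_early_stopping(config)
```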

Model evaluation focuses primarily on Macro-averaged F1 and Precision at n (P@n), specifically chosen to assess both per-label performance and the model’s practical utility as a decision support tool. Macro-averaged F1 reflects the average performance across different diagnostic categories, while P@n evaluates the accuracy within the top n predicted codes. Additional metrics, including AUC (Area Under the ROC Curve) and micro-averaged F1, are also reported for broader comparison with existing and future work. Macro-averaged values are computed by averaging per-label metrics, and micro-averaged values treat each document-code pair as an individual prediction.
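The toy example below makes the distinction concrete, computing macro- and micro-averaged F1, macro AUC, and P@n on small arrays with scikit-learn and NumPy; the arrays are invented for illustration.

```python
# How the reported metrics relate, on toy data: `y_true` and `y_score` are
# (num_docs, num_labels) binary labels and predicted probabilities.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.6], [0.1, 0.8, 0.3], [0.7, 0.4, 0.2]])
y_pred = (y_score >= 0.5).astype(int)

# Macro: average the per-label F1 scores, so each code counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro: pool every document-code pair into one prediction set.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_auc = roc_auc_score(y_true, y_score, average="macro")

def precision_at_n(y_true, y_score, n):
    # Fraction of each document's top-n scored codes that are correct.
    top_n = np.argsort(-y_score, axis=1)[:, :n]
    hits = np.take_along_axis(y_true, top_n, axis=1)
    return hits.mean()

p_at_2 = precision_at_n(y_true, y_score, n=2)
```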

Quantitative Results and Ablation Study: The Case of Diagnosis Code 285.9

Quantitative evaluation on predicting the 50 most frequent ICD-9 codes from MIMIC-III discharge summaries demonstrates the superior performance of SWAM models (both SWAM-textCNN and SWAM-CAML) across all metrics, particularly in Macro-averaged F1. This highlights SWAM’s strength in average performance across diverse diagnostic codes. The wide architecture of SWAM is credited for this improvement, enabling the model to learn more nuanced features specific to individual codes.

To investigate the impact of network width, an ablation study compares a wide SWAM model (wide-SWAM, 500 filters) against a narrow SWAM model (narrow-SWAM, 50 filters). The per-label precision comparison, visualized in Figure 1, reveals that the narrow model struggles with certain labels, achieving zero precision for 5 codes. In contrast, the wide model significantly improves performance for 4 of these 5 labels, achieving an average precision of 0.53 on those previously problematic codes, while also enhancing overall model performance. Intriguingly, diagnosis code 285.9, “Anemia, unspecified,” exhibits zero precision under both the narrow and wide models, prompting further investigation in the subsequent analysis.

Figure 1: Performance comparison between wide-SWAM and narrow-SWAM models, highlighting the impact of network width on per-label precision.
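Reproducing this kind of ablation analysis is straightforward once per-label predictions are available; the sketch below assumes binary prediction matrices preds_narrow and preds_wide for the 50 codes, which are not part of the original article.

```python
# A sketch of the per-label precision comparison behind the ablation;
# `y_true`, `preds_narrow`, and `preds_wide` are assumed (num_docs, 50)
# binary arrays.
import numpy as np
from sklearn.metrics import precision_score

def per_label_precision(y_true, y_pred):
    # zero_division=0 assigns 0.0 precision to codes with no correct
    # predictions, matching the "zero precision" labels discussed above.
    return precision_score(y_true, y_pred, average=None, zero_division=0)

# p_narrow = per_label_precision(y_true, preds_narrow)
# p_wide = per_label_precision(y_true, preds_wide)
# zero_labels = np.where(p_narrow == 0)[0]   # e.g. the 5 failing codes
# recovered = p_wide[zero_labels]            # how the wide model fares on them
```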

Secondary Evaluation: Informative Snippets and Factors Influencing Learning

Further analysis delves into the “informative snippets” extracted by the narrow and wide models to understand the performance discrepancies. Focusing on code 276.1 (Hyposmolality and/or hyponatremia), which shows improved precision in the wide model, and code 285.9 (Anemia, unspecified), which remains problematic, the study examines the snippets learned by each model.

Table 5 showcases informative snippets extracted for code 276.1. The wide-SWAM model effectively identifies “hyponatremia,” a term present in both the document and the code description, as crucial for prediction. Conversely, the narrow-SWAM model extracts less informative snippets, correlating with its zero precision for this code. This observation supports the hypothesis that the performance difference stems from the wide model’s ability to learn more specific, non-generic informative snippets.
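As a hedged illustration of how such snippets can be recovered from a per-label attention model like the sketch shown earlier: take the most-attended position for a given code and read off the tokens under the filter’s receptive field. The variable names here are assumptions, not the authors’ extraction procedure.

```python
# Recovering an "informative snippet" for one document and one code;
# `alpha` is the (seq_len, num_labels) attention matrix for that document.
import numpy as np

def top_snippet(tokens, alpha, label_idx, kernel_size=10):
    pos = int(np.argmax(alpha[:, label_idx]))  # most-attended position
    lo = max(0, pos - kernel_size // 2)
    return " ".join(tokens[lo:lo + kernel_size])

# e.g. top_snippet(doc_tokens, alpha, label_idx=code_to_idx["276.1"])
```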

To understand which “non-generic snippets” a narrow network might fail to learn, the study investigates the influence of training data distribution. By shuffling the training dataset with a different random seed and retraining the models, the researchers explore how the order of snippet appearance during training affects learning. Table 6 reveals that shuffling the data alters the labels with zero precision in the narrow model. However, diagnosis code 285.9 remains unpredicted even in the wide model after shuffling, suggesting that its poor performance is not solely due to data distribution or local optima.

The Challenge of Diagnosis Code 285.9: Anemia, Unspecified

The persistent failure to predict diagnosis code 285.9, “Anemia, unspecified,” across both model widths and data shuffles necessitates a deeper analysis. Manual examination reveals a key challenge: the ICD-9 system contains more than 50 anemia codes, each tied to a specific cause. This implies that while snippets related to “anemia” are necessary for predicting code 285.9, they are not sufficient. Predicting “Anemia, unspecified” requires not just identifying mentions of anemia, but also inferring that no specific cause is mentioned anywhere in the patient’s record.

This presents a significant hurdle for current machine learning models. These models are adept at learning from the presence of indicative features but struggle with inferences based on missing information. The accurate prediction of diagnosis code 285.9 demands a more sophisticated approach capable of reasoning about the absence of specific details, a capability that remains a blind spot in current machine learning methodologies for medical coding.

Conclusion: Shallow and Wide Attention CNNs for Medical Coding and the Ongoing Challenge of Unspecified Diagnoses

This research demonstrates the effectiveness of the SWAM model, a shallow and wide attention CNN, in automated medical coding. Compared to other methods, SWAM significantly enhances the prediction accuracy for the most challenging diagnostic codes while achieving superior overall performance. The model’s strength lies in its ability to learn a broad spectrum of local and low-level features, making it well-suited for multi-label text classification in medical contexts where informative snippets are label-specific. Furthermore, SWAM offers interpretability by establishing a link between informative snippets and convolution filters, providing insights into the model’s decision-making process.

However, the persistent difficulty in predicting diagnosis code 285.9, “Anemia, unspecified,” underscores a crucial limitation of current machine learning models. Accurately coding unspecified diagnoses requires models to move beyond feature detection and develop the capacity to reason about the absence of information. Addressing this challenge is essential for advancing the reliability and comprehensiveness of automated medical coding systems and fully realizing the potential of AI in healthcare.

References

[7] Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3(1):1–9.

[10] Wang S, Chang X, Li X, Long G, Yao L, Sheng QZ. Diagnosis code assignment using sparsity-based disease correlation embedding. IEEE Trans Knowl Data Eng. 2016;28(12):3191–202.

[29] Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1(1):1–10.

[25] Kim Y. Convolutional Neural Networks for Sentence Classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1746–1751.

[12] Mullenbach J, Wiegreffe S, Duke J, Sun J, Eisenstein J. Explainable prediction of medical codes from clinical text. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); 2018. p. 1101–1111.

[31] Aghaebrahimian A, Cieliebak M. Hyperparameter tuning for deep learning in natural language processing. In: Proceedings of 4th Swiss Text Analytics Conference (SwissText); 2019. p. 1–7.
