Automatic Diagnosis in Electrocardiography: A Robust Dataset for Deep Learning Applications

Electrocardiography (ECG) is a vital tool for diagnosing cardiac abnormalities, and the advent of automatic diagnosis systems promises to enhance its efficiency and accessibility. This study details the creation and validation of a large-scale ECG dataset, crucial for training and evaluating deep learning models designed for automatic ECG interpretation. The dataset, acquired through the Telehealth Network of Minas Gerais (TNMG) in Brazil, represents a significant resource for advancing automatic diagnosis in cardiology.

The TNMG, a pioneering public telehealth system, facilitates access to specialized healthcare across Minas Gerais and, since 2017, other Brazilian states, including telediagnostic ECG interpretation. Electrocardiograms, primarily 12-lead ECGs (S12L-ECGs), were recorded in primary care facilities using devices from Tecnologia Eletrônica Brasileira or Micromed Biotecnologia. These recordings, lasting 7 to 10 seconds and sampled at 300 to 600 Hz, were captured using custom software that enabled the secure upload of ECG tracings, along with patient clinical data, to the TNMG analysis center. There, experienced cardiologists analyzed each ECG, and their reports were made available online to the requesting health services.

Since December 2017, the University of Glasgow (Uni-G) ECG analysis program has been integrated into the TNMG system to enhance automatic diagnosis capabilities. This program automatically identifies ECG waves, calculates key parameters, performs rhythm analysis, and provides diagnostic interpretations, contributing to more efficient and standardized automatic diagnosis. The Uni-G program also generates Minnesota codes, a standard ECG classification system widely used in epidemiological research. From April 2018, the automatic measurements generated by the Uni-G software have been presented to cardiologists to aid in their reporting process. All clinical data, digital ECG tracings, and cardiologist reports were stored in a comprehensive database. Furthermore, historical data were retrospectively analyzed using the Uni-G software to ensure automatic diagnosis and measurements were available for all exams from the database’s inception. The CODE study was established to standardize and consolidate this extensive database for clinical and epidemiological research. For this study, data from patients over 16 years of age, collected between 2010 and 2016, were used for the training and validation sets, while data from April to September 2018 formed the test set.

Labeling Methodology for Training Data Using Text Reports

For the training and validation datasets, diagnostic labels were derived from the textual reports created by cardiologists, a critical step in creating a dataset for automatic diagnosis model training. A three-stage process was employed to extract these labels. Initially, the text reports underwent preprocessing, which included removing stop-words and generating n-grams. Subsequently, the Lazy Associative Classifier (LAC), trained on a 2800-sample dictionary of real diagnosis reports, was applied to these n-grams. Finally, the output from the LAC was used in a rule-based classifier to disambiguate classes and generate the final text label. The performance of this classification model was rigorously tested against 4557 manually labeled medical reports, where a certified cardiologist selected labels from predefined classes based on free-text reports. This text classification step demonstrated strong performance in recovering the true medical labels, achieving macro F1 scores of 0.729 for 1dAVb, 0.849 for RBBB, 0.838 for LBBB, 0.991 for SB, 0.993 for AF, and 0.974 for ST. This automated labeling process was essential for efficiently processing the large volume of textual reports.
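
As an illustration of the preprocessing stage, the sketch below lowercases and tokenizes a report, removes stop-words, and emits n-grams. The stop-word list, tokenizer, and maximum n-gram order are assumptions made for illustration; the study does not publish its exact pipeline.

```python
import re

# Assumed subset of Portuguese stop-words; the study's list is not published.
STOPWORDS = {"de", "da", "do", "e", "com", "em", "o", "a"}

def preprocess_report(report: str, max_n: int = 3) -> list[str]:
    """Lowercase and tokenize a free-text report, drop stop-words,
    and emit all 1-grams up to max_n-grams."""
    tokens = [t for t in re.findall(r"\w+", report.lower())
              if t not in STOPWORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        ngrams += [" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1)]
    return ngrams

# e.g. preprocess_report("bloqueio de ramo direito")
# -> ['bloqueio', 'ramo', 'direito', 'bloqueio ramo', 'ramo direito',
#     'bloqueio ramo direito']
```

The resulting n-grams are what the Lazy Associative Classifier consumes in the second stage.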

Annotation of Training and Validation Sets: Combining Expert and Automatic Diagnosis

The annotation of the training and validation datasets leveraged a combination of (i) Uni-G statements and Minnesota codes from automatic analysis (automatic diagnosis), (ii) automatic measurements from Uni-G software, and (iii) text labels extracted from expert reports (medical diagnosis). Recognizing that both automatic and medical diagnoses can contain errors, a hybrid approach was adopted to improve dataset quality. Automatic classification accuracy is inherently limited, and text labels are subject to both cardiologist errors and limitations of the labeling methodology. Therefore, integrating expert annotations with automatic analysis was crucial for obtaining reliable ground truth labels.

The procedure for establishing ground truth annotation involved a multi-step process:

  1. Initial Agreement:
    a. A diagnosis was accepted (abnormality considered present) if both the expert cardiologist and either the Uni-G statement or Minnesota code from automatic diagnosis concurred on the same abnormality.
    b. A diagnosis was rejected (abnormality considered absent) if only one automatic classifier indicated an abnormality, disagreeing with both the cardiologist and the other automatic classifier.

  2. Rule-Based Rejection: After the initial step, discrepancies remained where (i) both automatic classifiers indicated an abnormality not noted by the expert, or (ii) only the expert indicated an abnormality missed by both classifiers. Rules based on the automatic measurements were then applied to reject certain diagnoses (sketched in code after this list):
    a. ST diagnoses were rejected if the heart rate was below 100 bpm (8376 medical diagnoses and 2 automatic diagnoses).
    b. SB (sinus bradycardia) diagnoses were rejected if the heart rate was above 50 bpm (7361 medical diagnoses and 16,427 automatic diagnoses).
    c. LBBB and RBBB diagnoses were rejected if the QRS interval duration was below 115 ms (9313 medical diagnoses for RBBB and 8260 for LBBB).
    d. 1st-degree AV block (1dAVb) diagnoses were rejected if the PR interval duration was below 190 ms (3987 automatic diagnoses).

  3. Sensitivity Analysis and Rule-Based Acceptance: Sensitivity analysis of 100 manually reviewed exams per abnormality informed the rules for accepting remaining diagnoses:
    a. For RBBB, 1dAVb, SB, and ST, all medical diagnoses were accepted (26,033, 13,645, 12,200, and 14,604 diagnoses, respectively).
    b. For AF (atrial fibrillation), acceptance required both cardiologist classification and a standard deviation of NN intervals greater than 646 (14,604 diagnoses accepted).

This sensitivity analysis indicated that the false positive rate introduced by this procedure was less than 3% of total exams.

  4. Manual Review: The remaining 34,512 exams, where diagnoses were neither accepted nor rejected by the above rules, underwent manual review by medical students supervised by experienced cardiologists using the Telehealth ECG diagnostic system. This manual review process was extensive, taking several months.
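
To make the rejection rules in step 2 concrete, here is a minimal sketch. The function signature and field names are hypothetical; only the thresholds follow the text, with measurements assumed to come from the Uni-G software (heart rate in bpm, QRS and PR durations in ms).

```python
def apply_rejection_rules(labels: dict, hr: float,
                          qrs_ms: float, pr_ms: float) -> dict:
    """Reject label flags that contradict the automatic measurements.

    `labels` maps {'1dAVb', 'RBBB', 'LBBB', 'SB', 'ST', 'AF'} to booleans;
    `hr` is heart rate (bpm); `qrs_ms`/`pr_ms` are durations (ms).
    """
    out = dict(labels)
    if hr < 100:
        out["ST"] = False                  # sinus tachycardia needs HR >= 100 bpm
    if hr > 50:
        out["SB"] = False                  # sinus bradycardia needs HR <= 50 bpm
    if qrs_ms < 115:
        out["RBBB"] = out["LBBB"] = False  # bundle branch block needs a wide QRS
    if pr_ms < 190:
        out["1dAVb"] = False               # 1st-degree AV block needs a long PR
    return out

# Example: a 'sinus bradycardia' flag at 72 bpm is rejected.
print(apply_rejection_rules(
    {"1dAVb": False, "RBBB": False, "LBBB": False,
     "SB": True, "ST": False, "AF": False},
    hr=72, qrs_ms=96, pr_ms=160))
```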

It is important to note that previous medical reports and automatic measurements were utilized solely for establishing ground truth for training and validation sets and were not used in subsequent DNN training stages.

Test Set Annotation by Expert Cardiologists

The test dataset was independently annotated by two certified cardiologists with expertise in electrocardiography to ensure high-quality evaluation of automatic diagnosis systems. Inter-rater agreement was assessed using kappa coefficients: 0.741 for 1dAVb, 0.955 for RBBB, 0.964 for LBBB, 0.844 for SB, 0.831 for AF, and 0.902 for ST, indicating substantial to almost perfect agreement for most classes. In cases of agreement, the shared diagnosis was considered ground truth. Disagreements were resolved by a third senior specialist, aware of the initial annotations. The American Heart Association standardization guidelines were followed for classification. Crucially, the annotation process for the test set was conducted using an upgraded TNMG software version that presented automatic measurements from the Uni-G program to the specialists. This allowed cardiologists to directly select ECG diagnoses from predefined abnormality classes, eliminating the need for textual report extraction, as was necessary for the training and validation sets. This streamlined process ensured direct coding of diagnoses into predefined categories.
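
For reference, the per-class inter-rater agreement statistic reported above can be computed with scikit-learn's cohen_kappa_score; the two annotation vectors below are synthetic stand-ins for the cardiologists' binary labels.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
rater_a = rng.integers(0, 2, size=500)          # hypothetical binary labels
rater_b = rater_a.copy()
flip = rng.choice(500, size=25, replace=False)  # disagree on 5% of exams
rater_b[flip] = 1 - rater_b[flip]
print(cohen_kappa_score(rater_a, rater_b))      # kappa for one class
```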

Neural Network Architecture and Training for Automatic ECG Diagnosis

A convolutional neural network (CNN) architecture, similar to residual networks but adapted for one-dimensional signals, was used for automatic ECG diagnosis. This architecture, incorporating skip connections, facilitates efficient training of deep neural networks. The residual block modification proposed in a prior study was adopted for this network.

All ECG recordings were resampled to 400 Hz and zero-padded to a uniform length of 4096 samples per lead, serving as input to the neural network. The network comprised a convolutional layer followed by four residual blocks, each containing two convolutional layers. The output from the final block was fed into a fully connected (Dense) layer with a sigmoid activation function. The sigmoid function was chosen because ECG abnormalities are not mutually exclusive, allowing for multiple diagnoses in a single exam. Batch normalization and rectified linear activation units (ReLU) were applied after each convolutional layer’s output, with dropout regularization implemented after the ReLU nonlinearity.
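
As a concrete illustration of this input preprocessing, the sketch below resamples a 12-lead recording to 400 Hz and zero-pads it to 4096 samples per lead. The choice of scipy's polyphase resampler and padding at the end of the signal are assumptions, since the text specifies neither.

```python
import numpy as np
from scipy.signal import resample_poly

def prepare_ecg(signal: np.ndarray, fs: int,
                target_fs: int = 400, target_len: int = 4096) -> np.ndarray:
    """Resample a (n_samples, n_leads) recording to target_fs and zero-pad
    (or truncate) each lead to target_len samples."""
    resampled = resample_poly(signal, up=target_fs, down=fs, axis=0)
    out = np.zeros((target_len, signal.shape[1]), dtype=np.float32)
    n = min(target_len, resampled.shape[0])
    out[:n] = resampled[:n]
    return out

# A 10 s, 300 Hz, 12-lead recording becomes 4000 samples at 400 Hz,
# and zero-padding brings it to 4096:
x = prepare_ecg(np.random.randn(3000, 12), fs=300)
print(x.shape)  # (4096, 12)
```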

Convolutional layers utilized filters of length 16, beginning with 64 filters in the first layer and first residual block, increasing by 64 filters every second residual block, and subsampling by a factor of 4 within each residual block. Max pooling and 1×1 convolutional layers were integrated into the skip connections to keep their dimensions consistent with the signals in the main branch. The average cross-entropy was minimized using the Adam optimizer with default parameters and a learning rate of 0.001, which was reduced by a factor of 10 whenever the validation loss showed no improvement for seven consecutive epochs. Neural network weights were initialized according to an established method, and biases were initialized to zero. Training spanned 50 epochs, with the model exhibiting the best validation performance selected as the final model.
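
The description above can be turned into a rough sketch. The PyTorch code below is an illustration under stated assumptions, not the study's implementation: the exact channel progression, the use of pooling (rather than strided convolutions) for subsampling, the dropout rate, and the global pooling before the dense layer all reflect one reading of the text.

```python
import torch
import torch.nn as nn

class ResBlock1d(nn.Module):
    """Two conv layers (kernel 16) with batch norm, ReLU, and dropout, plus a
    skip branch using max pooling and a 1x1 conv to match shapes."""
    def __init__(self, in_ch, out_ch, kernel=16, down=4, dropout=0.8):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel, padding="same")
        self.bn1 = nn.BatchNorm1d(out_ch)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel, padding="same")
        self.bn2 = nn.BatchNorm1d(out_ch)
        self.drop = nn.Dropout(dropout)  # rate is an assumption
        self.pool = nn.MaxPool1d(down)   # subsample main branch by 4
        self.skip = nn.Sequential(nn.MaxPool1d(down),
                                  nn.Conv1d(in_ch, out_ch, 1))

    def forward(self, x):
        y = self.drop(torch.relu(self.bn1(self.conv1(x))))
        y = self.pool(self.drop(torch.relu(self.bn2(self.conv2(y)))))
        return y + self.skip(x)

class EcgNet(nn.Module):
    """Input (batch, 12, 4096); sigmoid output, one probability per class."""
    def __init__(self, n_classes=6):
        super().__init__()
        widths = [64, 64, 128, 128, 192]  # "+64 every 2nd block" (assumed)
        self.stem = nn.Sequential(nn.Conv1d(12, 64, 16, padding="same"),
                                  nn.BatchNorm1d(64), nn.ReLU())
        self.blocks = nn.Sequential(*[ResBlock1d(widths[i], widths[i + 1])
                                      for i in range(4)])
        self.head = nn.Linear(widths[-1], n_classes)

    def forward(self, x):
        y = self.blocks(self.stem(x)).mean(dim=-1)  # assumed global pooling
        return torch.sigmoid(self.head(y))

# Training configuration as described in the text:
model = EcgNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=7)  # lr / 10 after 7 stagnant epochs
loss_fn = nn.BCELoss()                  # average cross-entropy
```

Training would then run for 50 epochs, calling scheduler.step(validation_loss) once per epoch and keeping the checkpoint with the best validation performance.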

Hyperparameter Tuning for Optimized Performance

The final network architecture and hyperparameter configuration were the result of approximately 30 iterations of manual tuning: (i) training the neural network weights on the training set, (ii) evaluating performance on the validation set, and (iii) manually adjusting hyperparameters and architecture based on insights from previous iterations. The initial hyperparameters and architecture were taken from a previous study on arrhythmia detection. Because tuning proceeded concurrently with improvements in dataset quality, some of the earlier iterations informing these choices were evaluated on slightly different versions of the dataset.

Hyperparameters explored included: residual networks with {2, 4, 8, 16} blocks, kernel sizes {8, 16, 32}, batch sizes {16, 32, 64}, initial learning rates {0.01, 0.001, 0.0001}, optimizers {SGD, ADAM}, activation functions {ReLU, ELU}, dropout rates {0, 0.5, 0.8}, plateau patience between 5 and 10 epochs, and learning-rate reduction factors between 0.1 and 0.5. Additional explorations included: (i) a vectorcardiogram linear transformation to reduce input dimensionality, (ii) LSTM layers inserted before the convolutional layers, (iii) residual networks without the preactivation architecture, (iv) a VGG-style convolutional architecture, and (v) adjustments to the order of the activation and batch normalization layers.
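
For reference, this search space transcribed into a Python dict; the key names are illustrative, not from the study.

```python
# Manually explored hyperparameter space, as listed in the text.
SEARCH_SPACE = {
    "n_residual_blocks": [2, 4, 8, 16],
    "kernel_size": [8, 16, 32],
    "batch_size": [16, 32, 64],
    "initial_learning_rate": [0.01, 0.001, 0.0001],
    "optimizer": ["SGD", "Adam"],
    "activation": ["ReLU", "ELU"],
    "dropout_rate": [0.0, 0.5, 0.8],
    "plateau_patience_epochs": [5, 6, 7, 8, 9, 10],
    "lr_reduction_factor": (0.1, 0.5),  # continuous range, not a grid
}
```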

Statistical and Empirical Analysis of Test Results

Precision-recall curves were computed to evaluate the model’s discriminatory ability for each rhythm class, offering a detailed view of the trade-off between precision and recall at various binary decision thresholds. For imbalanced datasets like the test set, precision-recall curves provide more informative assessments than ROC curves. For further analyses, the DNN threshold was set to maximize the F1 score, the harmonic mean of precision and recall, chosen for its robustness to class imbalance.
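
The threshold-selection rule can be sketched as follows: sweep the precision-recall curve and keep the threshold that maximizes F1. The per-class arrays in the example are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Return the decision threshold that maximizes F1 for one class."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return thresholds[np.argmax(f1[:-1])]  # last PR point has no threshold

# Hypothetical scores for one abnormality class:
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
print(best_f1_threshold(y_true, y_score))
```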

With a fixed DNN threshold, precision, recall, specificity, F1 score, and confusion matrices were calculated for each class for both the DNN and medical residents/students. Bootstrapping (1000 resamples) was used to analyze the empirical distribution of these scores, presented as boxplots. The McNemar test was used to compare misclassification distributions between the DNN and medical professionals, and the kappa coefficient was used to compare inter-rater agreement.
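
A minimal sketch of the bootstrap analysis, assuming hypothetical binary vectors for one class: resample exams with replacement 1000 times and collect the empirical F1 distribution. McNemar's test on the paired misclassifications is available separately in statsmodels.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(y_true, y_pred, n_resamples=1000, seed=0):
    """Empirical F1 distribution over resamples of the test exams."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample exams with replacement
        scores[i] = f1_score(y_true[idx], y_pred[idx], zero_division=0)
    return scores  # e.g. pass to matplotlib's plt.boxplot

# McNemar's test on paired misclassifications (statsmodels):
# from statsmodels.stats.contingency_tables import mcnemar
```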

All misclassified exams were reviewed by an experienced cardiologist and, following interviews with the ECG reviewers, errors were categorized as measurement errors, noise errors, unexplained errors (DNN only), or conceptual/attention errors (medical residents/students only). F1 scores were also evaluated for alternative data splits (90%-5%-5% random, date-ordered, and patient-stratified splits). These were assessed on both the original test set and the additional test splits using bootstrap analysis (1000 and 200 resamples, respectively), with performance distributions visualized as boxplots in the supplementary materials.

Reporting Summary

Further details on research design are available in the Nature Research Reporting Summary linked to the original article.
