Datasets Powering CHIEF in Pathology: A Comprehensive Overview

This study used 18 pathology datasets from large research consortia and institutional collaborations to train and validate the CHIEF diagnostic model. Of these, 16 are publicly accessible and two are available upon request. Knowing what each resource covers and how to obtain it helps researchers who want to reproduce or extend the work.

The publicly available datasets include The Cancer Genome Atlas (TCGA), accessible via the GDC portal, and the Genotype-Tissue Expression (GTEx) project, accessible via the GTEx portal. Both are large-scale initiatives that provide genomic and pathology data. Other public datasets include PAIP, PANDA, BCC, ACROBAT, BCNB, TOC, CPTAC, DROID-Breast, Dataset-PT, Diagset-B, MUV, and PLCO, each hosted on its own platform, such as Kaggle, the AIDA Data Hub, or The Cancer Imaging Archive. Supplementary Table 22 provides direct links to the raw data.

In addition to these public resources, the study used data from PAIP2020 and TissueNet, which can be requested from the organizers of the respective challenges. Both datasets were released as part of computational pathology challenges.

The study also used institutional data for CHIEF pretraining and validation from Dana-Farber Cancer Institute (DFCI), Brigham and Women’s Hospital (BWH), Yale-New Haven Hospital (YH), Shenzhen Maternity & Child Healthcare Hospital (SMCH), Chongqing University Cancer Hospital (CUCH), and the Hospital of the University of Pennsylvania. These data are not publicly available because of patient privacy obligations, institutional review board requirements, and data use agreements, but researchers may request access for non-commercial academic use. Inquiries should be directed to K.-H.Y., who will coordinate the request with the respective institutional data managers.

In conclusion, this research draws on both public and institutional pathology datasets to develop and validate CHIEF. The data are shared as openly as patient privacy and ethical requirements allow. Researchers who want to build on this work can obtain the public datasets directly and follow the access procedures described above for the rest.

Source data are provided with this paper.
