ML Ready Pathology Reports
(ML Path. Rep.)

Dataset Description

This dataset contains 7,187 pathology reports, with their associated site and histology labels, downloaded from the Genomic Data Commons (GDC) platform at the National Cancer Institute.

  • The files in ml_ready_raw_text_pathology_reports.tar.gz were converted from PDF to text using Tesseract, an optical character recognition (OCR) program (refer to the Tesseract link). An example of an original report is available on the GDC archive portal (refer to the GDC link).
  • The file ml_ready_raw_text_histo_metadata.csv contains annotations (such as site and histology) extracted from those reports.
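The two files described above can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset's own tooling: the function names read_reports and read_metadata are hypothetical, and it assumes the archive holds UTF-8 text files and that the CSV has a header row. The actual column names are not documented here, so inspect the header before relying on any of them.

```python
import csv
import tarfile
from pathlib import Path


def read_reports(archive_path):
    """Return {archive member name: report text} for every regular file."""
    reports = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            # OCR output may contain stray bytes, so decode leniently
            # (assumption: the text files are UTF-8 or close to it).
            reports[member.name] = handle.read().decode("utf-8", errors="replace")
    return reports


def read_metadata(csv_path):
    """Return annotation rows as dicts keyed by whatever the CSV header says."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    archive = Path("ml_ready_raw_text_pathology_reports.tar.gz")
    metadata = Path("ml_ready_raw_text_histo_metadata.csv")
    # Only runs when both files have been downloaded to the working directory.
    if archive.exists() and metadata.exists():
        reports = read_reports(archive)
        rows = read_metadata(metadata)
        print(len(reports), "reports;", len(rows), "metadata rows")
        if rows:
            print("metadata columns:", list(rows[0].keys()))
```

Printing the metadata columns first shows which field links a row to its report file, which this description does not specify.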

This dataset is used as input to two models, MT-CNN and HiSan (refer to the GitHub Repository links and Model links).

GDC

https://portal.gdc.cancer.gov/legacy-archive/files/a9a42650-4613-448d-895e-4f904285f508

GitHub Repository HiSan

https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Pathology-Reports-Hierarchical-Self-Attention-Network

GitHub Repository MT-CNN

https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Multitask-Convolutional_Neural_Network

Model HiSan

https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7565752

Model MT-CNN

https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7330732

Tesseract

https://github.com/tesseract-ocr/

Content Type
Pathology Reports
Genomic Data Commons