Asset Version
1.00
Project
Dataset Description
This dataset contains 7,187 pathology reports with the associated site and histology labels downloaded from the Genomic Data Commons Platform at the National Cancer Institute.
- The files in ml_ready_raw_text_pathology_reports.tar.gz were converted from PDF to text using an optical character recognition program (refer to the Tesseract link). An example of a report is available on the GDC archive portal (refer to the GDC link).
- The file ml_ready_raw_text_histo_metadata.csv contains annotations (such as site and histology) extracted from those reports.
This dataset is used as input to the MT-CNN and HiSAN models (refer to the GitHub Repository links and Model links).
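As a rough illustration of how the two files described above might be read together, here is a minimal Python sketch that uses only the standard library. It assumes the archive holds one plain-text file per report and that the CSV has a column naming the report file each row belongs to. The function names and the `filename_column` parameter are hypothetical, and the actual column names should be checked against the CSV header before use.

```python
import csv
import tarfile
from pathlib import Path


def load_reports(archive_path):
    """Return {file name: report text} for every regular file in the archive."""
    reports = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            # OCR output can contain stray bytes, so decode leniently.
            text = handle.read().decode("utf-8", errors="replace")
            reports[Path(member.name).name] = text
    return reports


def load_metadata(csv_path):
    """Return the annotation rows as a list of dicts keyed by the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def attach_labels(reports, rows, filename_column):
    """Pair each metadata row with its report text.

    filename_column is an assumed name for the column that identifies the
    report file; rows whose report is not in the archive are skipped.
    """
    paired = []
    for row in rows:
        text = reports.get(row[filename_column])
        if text is not None:
            paired.append({**row, "text": text})
    return paired
```

A call such as `attach_labels(load_reports("ml_ready_raw_text_pathology_reports.tar.gz"), load_metadata("ml_ready_raw_text_histo_metadata.csv"), "<column name>")` would then yield one record per labeled report, with the placeholder replaced by whichever column actually links rows to report files.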
Content Type
Pathology Reports
Genomic Data Commons