Deep Active Learning for Classifying Cancer Pathology Reports

Publication Type
Journal Article
Publication Year
2021
Authors
De Angeli, Kevin
Gao, Shang
Alawad, Mohammed
Yoon, Hong-Jun
Schaefferkoetter, Noah
Wu, Xiao-Cheng
Durbin, Eric B.
Doherty, Jennifer
Stroup, Antoinette
Coyle, Linda
Penberthy, Lynne
Tourassi, Georgia
Abstract

Background

Automated text classification has many important applications in the clinical setting; however, obtaining labelled data for training machine learning and deep learning models is often difficult and expensive. Active learning techniques may mitigate this challenge by reducing the amount of labelled data required to train a model effectively. In this study, the authors analyzed the effectiveness of 11 active learning algorithms for classifying subsite and histology from cancer pathology reports, using a convolutional neural network as the text classification model.

Results

The authors compared the performance of each active learning strategy using two differently sized datasets and two different classification tasks. On all tasks and dataset sizes, every active learning strategy except the diversity-sampling strategies outperformed random sampling (that is, no active learning). On the large dataset (15K initial labelled samples, with 15K additional labelled samples added at each iteration of active learning), there was no clear winner among the active learning strategies. On the small dataset (1K initial labelled samples, with 1K additional labelled samples added at each iteration), marginal and ratio uncertainty sampling performed better than all other active learning techniques. The authors also found that, compared to random sampling, active learning strongly improves performance on rare classes because it focuses labelling on underrepresented classes.
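The two uncertainty strategies named above can be illustrated with a minimal sketch. The abstract does not give formulas, so the definitions below are assumptions based on the commonly used forms: margin uncertainty scores a sample by how close its two highest predicted class probabilities are, and ratio uncertainty scores it by the ratio of the second-highest probability to the highest. The function names, the `probs` array, and the batch size are hypothetical and are not taken from the paper.

```python
import numpy as np


def margin_scores(probs):
    """Assumed margin uncertainty: 1 - (p_top1 - p_top2); higher is more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]  # two largest probabilities per row, ascending
    return 1.0 - (top2[:, 1] - top2[:, 0])


def ratio_scores(probs):
    """Assumed ratio uncertainty: p_top2 / p_top1; values near 1 are more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 0] / top2[:, 1]


def select_batch(scores, batch_size):
    """Return indices of the batch_size most uncertain samples, most uncertain first."""
    return np.argsort(-scores, kind="stable")[:batch_size]


# Hypothetical softmax outputs for four unlabelled reports over three classes.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.70, 0.20, 0.10],
])

margin_pick = select_batch(margin_scores(probs), batch_size=2)
ratio_pick = select_batch(ratio_scores(probs), batch_size=2)
```

In a pool-based loop, the selected indices would be sent to annotators, added to the labelled set, and the model retrained before the next round of scoring.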

Conclusions

Active learning can save annotation cost by helping human annotators efficiently and intelligently select which samples to label. The results showed that a dataset constructed using effective active learning techniques requires less than half the amount of labelled data to achieve the same performance as a dataset constructed using random sampling.

Citation
Issue
1
Volume
22
Publication Title
BMC Bioinformatics
ISSN
1471-2105
DOI
https://doi.org/10.1186/s12859-021-04047-1
Publication Tags
Automatic Tags
deep learning, convolutional neural networks, active learning, cancer pathology reports, text classification