Active Learning for NLP Systems
(AL-NLP)

Short Description

Offers an active learning framework for natural language processing of pathology reports to reduce the amount of labelled data required to effectively train a model.

Description and Impact
User Community
Impact

Enables rapid annotation of pathology reports via machine learning.

Description

This repository implements an active learning loop for natural language processing (AL-NLP) of pathology reports related to MOSSAIC (Modeling Outcomes Using Surveillance Data and Scalable Artificial Intelligence for Cancer). The framework implements the following methods for embedding extraction from the unstructured text (both are illustrated in the sketch after this list):

  • Bag-of-words with dimensionality reduction methods, and
  • Pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.
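
As an illustration only, here is a minimal sketch of both embedding approaches using scikit-learn and Hugging Face transformers; the repository's actual preprocessing, model choices, and pooling may differ.

```python
import numpy as np
import torch
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from transformers import AutoModel, AutoTokenizer

docs = ["infiltrating ductal carcinoma of the breast",
        "squamous cell carcinoma, left lung, upper lobe"]

# 1. Bag-of-words followed by dimensionality reduction (here, truncated SVD).
bow = CountVectorizer().fit_transform(docs)             # sparse term counts
svd = TruncatedSVD(n_components=2, random_state=0)      # n_components << vocab size
bow_embeddings = svd.fit_transform(bow)                 # (n_docs, 2)

# 2. Pre-trained BERT: mean-pool the final hidden states into one vector per
# document (a common pooling choice; the repository's may differ).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state           # (n_docs, seq_len, 768)
mask = batch["attention_mask"].unsqueeze(-1)            # zero out padding tokens
bert_embeddings = (hidden * mask).sum(1) / mask.sum(1)  # (n_docs, 768)
```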

Deterministic and Bayesian classifiers are available in the classifiers directory (https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Active_learning_NLP/blob/master/classifiers) to predict attributes in the pathology reports.
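
For intuition only, the sketch below contrasts a deterministic classifier with a bootstrap ensemble used as a crude stand-in for Bayesian predictive uncertainty; the repository's Bayesian classifiers may be formulated differently (for example, via Monte Carlo dropout), and every name below is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Toy stand-in for document embeddings and report labels.
X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

# Deterministic: a single model yields one predictive distribution per sample.
det = LogisticRegression(max_iter=1000).fit(X, y)
p_det = det.predict_proba(X[:5])                    # (5, 3)

# Bayesian-flavoured stand-in: a bootstrap ensemble whose member disagreement
# approximates model (epistemic) uncertainty over the predictions.
ens = BaggingClassifier(LogisticRegression(max_iter=1000),
                        n_estimators=25, random_state=0).fit(X, y)
member_probs = np.stack([m.predict_proba(X[:5]) for m in ens.estimators_])
p_mean = member_probs.mean(axis=0)                  # averaged prediction
p_std = member_probs.std(axis=0)                    # disagreement across members
```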

AL-NLP is an algorithm developed to improve the assessment and prediction of attributes in pathology reports.

Hypothesis/Objective

The objective was to develop a method utilizing active learning that can reduce the amount of labelled data required to effectively train a model.

Resource Role

This resource is related to other resources that involve natural language processing such as HiSAN, MT-CNN, P3B1, and P3B2.

Technical Elements
Uniqueness

Active learning is an existing technique in machine learning. This example shows how active learning can be used during ground-truth generation for free-text documents. Other examples perform active learning for NLP; however, the AL-NLP algorithm presented in this repository compares multiple acquisition functions (such as random, entropy, marginal sampling, and abstention) to select the next batch of samples to be labelled, as sketched below.
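
A minimal sketch of how such acquisition functions can score an unlabelled pool, assuming `probs` holds the model's predicted class probabilities; abstention requires a model trained with an explicit abstain option, and the repository's implementations may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=1000)   # stand-in predicted probabilities
k = 10                                         # batch size to label next

# Random: baseline that ignores the model entirely.
random_batch = rng.choice(len(probs), size=k, replace=False)

# Entropy: prefer samples whose predictive distribution is most spread out.
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
entropy_batch = np.argsort(entropy)[-k:]

# Marginal (margin) sampling: prefer samples where the best and second-best
# class probabilities are nearly tied.
top2 = np.sort(probs, axis=1)[:, -2:]
margin_batch = np.argsort(top2[:, 1] - top2[:, 0])[:k]
```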

Usability

To use the software package in this repository, users must meet the following criteria:

  • Possess the basic skills to program and run Python scripts.
  • Understand the input parameters of the AL-NLP algorithm, so that they can set the parameters appropriately to execute the algorithm.

To use the optimization loop (simulation), users must be familiar with natural language processing techniques, training of classifiers, and active learning methods. In the Active Learning Loop's execute method, a user can specify the percentage of data to initially use for training, the size of the test set, and the number of new samples each iteration of the loop selects for labelling. A hypothetical invocation is sketched below.
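
The import path, class name, and parameter names in this sketch are assumptions for illustration, not the repository's documented API; consult the repository for the actual signature.

```python
# Hypothetical usage of the active learning loop; every name below is an
# assumption for illustration, not the repository's documented API.
from active_learning import ActiveLearningLoop  # hypothetical import path

loop = ActiveLearningLoop(classifier="bayesian", acquisition="entropy")
loop.execute(
    initial_train_pct=0.05,     # percentage of data used to train the first model
    test_size=0.20,             # fraction of data held out for evaluation
    samples_per_iteration=100,  # new samples selected for labelling each round
)
```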

Level of Documentation
Minimal
Components

Refer to the Active Learning for NLP Systems repository in GitHub (https://github.com/CBIIT/NCI-DOE-Collab-Pilot3-Active_learning_NLP).

Inputs
  • Type of data required: Free-text documents with labels.
  • Source of data required: One example uses public data of news reports.
  • Public vs. Restricted: Public; the example data (news reports) are publicly available.
Input Data Type
Text
Input Data Format
Tabular
Results and Publications
Results

The authors evaluated 11 different active learning strategies for their effectiveness in classifying cancer subsite and histology from cancer pathology reports, a task characterized by many unique labels with extreme class imbalance. The authors used a convolutional neural network (CNN) as the base classification model and evaluated two different active learning scenarios:

  • A high data availability setting starting with 15K labelled samples and adding 15K more after each iteration of active learning, and
  • A low data availability setting starting with 1K labelled samples and adding 1K more after each iteration of active learning.

In the high data availability setting: 

  • The uncertainty sampling and query-by-committee (QBC) strategies obtained the best overall micro F1 scores, and
  • The QBC Kullback–Leibler Divergence to the Mean strategy obtained the best overall macro F1 score.

Among these, no single strategy was a clear winner in terms of micro F1 score.

In the low data availability setting: 

  • Ratio and marginal sampling achieved the strongest overall micro F1 scores but underperformed slightly in macro F1 scores, and
  • Least confidence, entropy sampling, and the QBC strategies obtained the best macro F1 scores (the sketch after this list gives the standard least-confidence and ratio formulations).
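
For reference, these two strategies score each unlabelled sample from its highest class probabilities; a minimal sketch of the standard formulations, which may differ in detail from the repository's code:

```python
# Standard formulations of two uncertainty scores; sketch only, the
# repository's code may differ.
import numpy as np

def least_confidence(probs):
    # Higher score = less confident top prediction = queried first.
    return 1.0 - probs.max(axis=1)

def ratio_sampling(probs):
    # Ratio of best to second-best probability; values near 1 indicate a
    # near tie between the top two classes, so the smallest ratios are
    # queried first.
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] / (top2[:, 0] + 1e-12)
```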

Ratio and marginal sampling were strong contenders for the overall best active learning strategy based on: 

  • Overall performance in the high and low data availability settings,
  • Performance when additional labelled data is extremely limited, and
  • Low computation cost.

Compared to a model trained on all available data, active learning can obtain similar performance using less than half the data. Furthermore, on tasks with many unique labels with extreme class imbalance, active learning can significantly mitigate the effects of class imbalance and improve performance on the rare classes.

Outputs

After execution, this software stores a report with all the results in the outputs folder, along with a PDF file of plots comparing the different active learning methods and classifiers.