Learning Curves (LC) | Computational Resources for Cancer Research

User Community

Researchers interested in the following topics:

Primary: Cancer biology data modeling
Secondary: Machine learning, bioinformatics, and computational biology

Impact

Learning curves can help researchers decide whether collecting more data would be worthwhile, and they provide a framework for assessing the data scaling behavior of drug response predictors.

Description

A learning curve is an empirical method for determining whether a supervised learning model can be further improved with more training data. The trajectory of each curve provides a forward-looking metric for analyzing prediction performance and can serve as a co-design tool that guides experimental biologists and computational scientists in planning future experiments.
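The idea can be illustrated with a minimal, standard-library-only sketch. It is not the repository's code: the least-squares line fit stands in for LightGBM or a neural network, the data are synthetic, and the function names (`fit_line`, `learning_curve`) are hypothetical. The model is trained on progressively larger subsets of the training data and scored each time on the same held-out test set.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def learning_curve(train, test, sizes):
    """Train on the first m training samples for each m; score on a fixed test set."""
    test_x = [x for x, _ in test]
    test_y = [y for _, y in test]
    curve = []
    for m in sizes:
        subset = train[:m]
        slope, intercept = fit_line([x for x, _ in subset], [y for _, y in subset])
        preds = [slope * x + intercept for x in test_x]
        curve.append((m, mean_absolute_error(test_y, preds)))
    return curve


# Synthetic data (illustrative only): y = 3x + 1 plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(600):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

train, test = data[:500], data[500:]
sizes = [8, 16, 32, 64, 128, 256, 500]  # assumed, roughly geometric spacing
curve = learning_curve(train, test, sizes)

for m, mae in curve:
    print(f"train size {m:4d}  test MAE {mae:.3f}")
```

Keeping the test set fixed across all training sizes means that differences between points on the curve reflect the amount of training data rather than a change in the evaluation samples.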

Hypothesis/Objective

The objective was to develop a practical methodology for generating learning curves of predictive models, using both neural networks (NNs) and classical machine learning (ML) algorithms, to investigate the data scaling trajectory of predictors for each combination of dataset and learning algorithm. This makes it possible to compare model performance across a range of training set sizes.

Resource Role

This resource is related to other drug response resources, such as P1B3, Uno, E-COXEN, and Combo.

Uniqueness

A learning curve is a general method that can be applied to any supervised learning model. Scripts in this repository use this method to generate learning curves for two drug response prediction models: LightGBM regressor and a neural network regressor. These curves can be used to evaluate and compare the data scaling properties of prediction models across a range of training set sizes rather than for a fixed sample size.

Usability

Data scientists experienced in Python and familiar with the domain (cancer biology data modeling) can use the current code.

Level of Documentation

Minimal

Components

This capability provides the following components:

Scripts that implement the learning curve method using two machine learning models: LightGBM and a neural network.
Examples of how to apply the learning curve method to models that predict drug responses using data from the Cancer Therapeutics Response Portal.

Inputs

The primary required input is a dataset in a tabular format (such as a CSV file) containing features and a prediction target. The user may also specify the ML model and lists of IDs that split the dataset into training, validation, and test sets.
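As a rough illustration of this input layout, the sketch below reads a small tabular dataset and partitions it using explicit ID lists. The column names (`sample_id`, `gene_a`, `gene_b`, `response`), the in-memory CSV, and the helper functions are all hypothetical; the repository's actual file layout and arguments may differ.

```python
import csv
import io

# Hypothetical in-memory CSV; a real run would open a file on disk instead.
CSV_TEXT = """sample_id,gene_a,gene_b,response
s1,0.1,1.2,0.50
s2,0.4,0.9,0.61
s3,0.7,0.3,0.72
s4,0.2,1.1,0.55
s5,0.9,0.2,0.80
"""


def load_rows(handle, id_col, target_col):
    """Return {sample_id: (feature_vector, target)} from a CSV file handle."""
    rows = {}
    for record in csv.DictReader(handle):
        sample_id = record.pop(id_col)
        target = float(record.pop(target_col))
        # Every remaining column is treated as a numeric feature.
        features = [float(value) for value in record.values()]
        rows[sample_id] = (features, target)
    return rows


def split_by_ids(rows, id_lists):
    """Select rows for each named split; an ID may belong to only one split."""
    seen = set()
    splits = {}
    for name, ids in id_lists.items():
        overlap = seen.intersection(ids)
        if overlap:
            raise ValueError(f"IDs appear in more than one split: {sorted(overlap)}")
        seen.update(ids)
        splits[name] = [rows[sample_id] for sample_id in ids]
    return splits


rows = load_rows(io.StringIO(CSV_TEXT), id_col="sample_id", target_col="response")
splits = split_by_ids(
    rows,
    {"train": ["s1", "s2", "s3"], "val": ["s4"], "test": ["s5"]},
)
print({name: len(items) for name, items in splits.items()})
```

Rejecting overlapping ID lists up front guards against leakage between the training, validation, and test sets, which would otherwise inflate the scores on the learning curve.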

Input Data Type

RNA-Seq

Input Data Format

Tabular

Results

Learning curves provide intuitive insight into the data scaling behavior of predictors, as opposed to single-value performance measures obtained with the entire set of training samples. The shape of these curves facilitates comparison between ML models by illustrating a global trajectory of model improvement. Thus, learning curves can be used for quantifying the learning capacity of predictors with increasing amounts of training data.
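One common way to summarize a curve's trajectory is to fit a power law, error ≈ a · m^b, to the (training size, error) points, where a negative exponent b indicates that error falls as data grows. The sketch below assumes that functional form for illustration; this page does not state which form, if any, the repository fits. The points are synthetic and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error ~= a * size**b by ordinary least squares in log-log space."""
    log_m = [math.log(m) for m in sizes]
    log_e = [math.log(e) for e in errors]
    n = len(log_m)
    mean_m = sum(log_m) / n
    mean_e = sum(log_e) / n
    sxx = sum((x - mean_m) ** 2 for x in log_m)
    sxy = sum((x - mean_m) * (y - mean_e) for x, y in zip(log_m, log_e))
    b = sxy / sxx                          # slope in log-log space = exponent
    a = math.exp(mean_e - b * mean_m)      # intercept in log space = log(a)
    return a, b


# Synthetic learning-curve points generated from error = 2 * m**-0.5.
sizes = [100, 200, 400, 800, 1600]
errors = [2.0 * m ** -0.5 for m in sizes]

a, b = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, b = {b:.3f}")

# Extrapolate to a training size that has not been measured yet.
projected = a * 3200 ** b
print(f"projected error at m = 3200: {projected:.4f}")
```

Such extrapolations are only as reliable as the assumed functional form; real curves often flatten at very small or very large training sizes, so projections far beyond the measured range should be treated with caution.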

Primary Publication

Learning Curves for Drug Response Prediction in Cancer Cell Lines

Outputs

The output includes:

The raw predictions of the model for each training set size.
The learning curves.

Available on GitHub

https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Learning-Curve