Identifies salient hidden nodes in autoencoders by ranking the nodes in the latent layer according to their capability of performing a learning task.
Users who are interested in the following subjects:
- Primary: Cancer biology data modeling
- Secondary: Machine learning; bioinformatics; computational biology
Explains the unsupervised learning process in autoencoders
The purpose of Autoencoder Node Saliency (ANS) is to identify the saliency of hidden nodes in autoencoders by ranking the hidden nodes in the latent layer according to their capability of performing a learning task.
The objective was to create a novel autoencoder node saliency method that examines whether the features constructed by autoencoders exhibit properties related to known class labels.
This resource is related to other autoencoder-based resources such as P1B1.
The uniqueness of the autoencoder node saliency method lies in ranking hidden nodes in the autoencoder according to their capability of performing a learning task, identifying the specialty nodes that reveal explanatory input features, and suggesting nodes that can be pruned for a more concise network structure.
The repository contains two example scripts that process data from the popular public image classification dataset MNIST (Modified National Institute of Standards and Technology database) and preprocessed data from Genomic Data Commons (GDC) breast cancer cases.
The following components are in the Model and Data Clearinghouse (MoDaC):
- The Single Drug Response Predictor (P1B3) asset contains the untrained and trained models: the model topology file is p1b3.model.json, and the trained model is defined by combining the untrained model (p1b3.model.json) with the trained model weights (p1b3.model.h5). The trained model is used in inference.
- The Cancer Drug Response Prediction Dataset asset contains the processed training and test data.
- Type of data required: Annotated samples as used in a typical machine learning classification problem, plus features from the trained autoencoder as described below. The provided public data should include:
- A: Activation values of each sample
- L: Class labels of each sample
Activation values are generated using the optimal weight and bias terms in the autoencoder. Class labels should be given in datasets. The Data folder contains the A and L values used in the example.
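As an illustration of how the activation matrix A relates to a trained autoencoder, the sketch below computes latent-layer activations from input data and trained encoder parameters. It assumes a sigmoid encoder; the names X, W, and b are hypothetical and not part of this repository:

```python
import numpy as np

def encoder_activations(X, W, b):
    """Compute latent-layer activations A = sigmoid(X W + b).

    X: (n_samples, n_features) input matrix
    W: (n_features, n_hidden) trained encoder weights
    b: (n_hidden,) trained encoder bias terms
    Returns A: (n_samples, n_hidden) activation values in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Toy example with randomly generated stand-in parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 samples, 4 input features
W = rng.normal(size=(4, 3))   # encoder weights for 3 hidden nodes
b = np.zeros(3)               # encoder bias
A = encoder_activations(X, W, b)
```

Each column of A holds one hidden node's activation values across the samples; paired with the label vector L, it is the input the ranking scores are computed from.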
- Source of data required: Two preprocessed datasets were used to test the capability: the public image classification set MNIST, and breast cancer data downloaded and processed from the GDC. The training of the autoencoder is not part of the capability.
- Public vs. Restricted: The data used are public.
The results show that ANS determines whether an autoencoder can handle unbalanced gene expression datasets. Supervised node saliency (SNS) with the binary distribution provides useful rankings of the hidden nodes for different learning tasks, and the NED values verify the learning behaviors of the top-ranked nodes.
Given two labels for the samples (such as ‘0’ and ‘1’), the code generates:
- NED: Normalized entropy difference computed from both labels, 0 and 1.
- NED0: Normalized entropy difference from label 0.
- NED1: Normalized entropy difference from label 1.
- sns_incr: Supervised node saliency with increasing probability distribution.
- sns_bi: Supervised node saliency with binary distribution.
- g0Count: Histogram bar counts for class 0 at each node.
- g1Count: Histogram bar counts for class 1 at each node.
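A minimal sketch of how the per-class quantities above might be computed for a single hidden node from its activation values. It assumes NED is defined as one minus the entropy of the B-bin activation histogram normalized by log2(B); this is an illustrative assumption, not the repository's exact formula:

```python
import numpy as np

def ned(a, B=10):
    """Normalized entropy difference for one node's activations a in [0, 1].

    Illustration only (assumed definition): 1 - H / log2(B), where H is the
    entropy of the B-bin histogram of a. The score is 1 when all activations
    fall in a single bin and 0 when they spread uniformly over all bins.
    """
    counts, _ = np.histogram(a, bins=B, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    H = -(p * np.log2(p)).sum()
    return 1.0 - H / np.log2(B)

def class_scores(a, labels, B=10):
    """Per-class NED plus histogram bar counts for one hidden node."""
    g0Count = np.histogram(a[labels == 0], bins=B, range=(0.0, 1.0))[0]
    g1Count = np.histogram(a[labels == 1], bins=B, range=(0.0, 1.0))[0]
    return ned(a[labels == 0], B), ned(a[labels == 1], B), g0Count, g1Count

# Toy node that separates the classes: class 0 near 0, class 1 near 1
a = np.concatenate([np.full(50, 0.05), np.full(50, 0.95)])
labels = np.concatenate([np.zeros(50, int), np.ones(50, int)])
NED0, NED1, g0Count, g1Count = class_scores(a, labels)
```

For this toy node both per-class NED values reach their maximum, since each class concentrates in a single histogram bin; a node whose activations mix the two classes across many bins would score lower.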