Identifies salient hidden nodes in autoencoders by ranking the nodes in the latent layer according to their capability of performing a learning task.
Users who are interested in the following subjects:
- Primary: Cancer biology data modeling
- Secondary: Machine learning; bioinformatics; computational biology
Explains the unsupervised learning process in autoencoders
The purpose of Autoencoder Node Saliency (ANS) is to identify the saliency of hidden nodes in autoencoders by ranking the hidden nodes in the latent layer according to their capability of performing a learning task.
The objective was to create a novel autoencoder node saliency method that examines whether the features constructed by autoencoders exhibit properties related to known class labels.
This resource is related to other autoencoder-based resources such as P1B1.
The uniqueness of the autoencoder node saliency method lies in ranking hidden nodes in the autoencoder according to their capability of performing a learning task, identifying the specialty nodes that reveal explanatory input features, and suggesting nodes that can be pruned for a more concise network structure.
The repository contains two example scripts that process data from the popular public image classification dataset MNIST (Modified National Institute of Standards and Technology database) and preprocessed data from Genomic Data Commons (GDC) breast cancer cases.
The following components are in the Model and Data Clearinghouse (MoDaC):
- The Single Drug Response Predictor (P1B3) asset contains the untrained and trained models: the model topology file is p1b3.model.json, and the trained model is defined by combining the untrained model (p1b3.model.json) with the trained model weights (p1b3.model.h5). The trained model is used in inference.
- The Cancer Drug Response Prediction Dataset asset contains the processed training and test data.
- Type of data required: Annotated samples as used in a typical machine learning classification problem, plus features from the trained autoencoder as described below. The provided public data should include:
- A: Activation values of each sample
- L: Class labels of each sample
Activation values are generated using the optimal weight and bias terms in the autoencoder. Class labels should be given in datasets. The Data folder contains the A and L values used in the example.
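As an illustration of how the activation matrix A relates to a trained autoencoder, the sketch below computes latent-layer activations from input data and trained encoder parameters. It assumes a sigmoid encoder; the names X, W, and b are hypothetical and not part of this repository:

```python
import numpy as np

def encoder_activations(X, W, b):
    """Compute latent-layer activations A = sigmoid(X W + b).

    X: (n_samples, n_features) input matrix
    W: (n_features, n_hidden) trained encoder weights
    b: (n_hidden,) trained encoder bias terms
    Returns A: (n_samples, n_hidden) activation values in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

# Toy example with randomly generated stand-in parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))   # 5 samples, 4 input features
W = rng.normal(size=(4, 3))   # encoder weights for 3 hidden nodes
b = np.zeros(3)               # encoder bias
A = encoder_activations(X, W, b)
```

Each column of A holds one hidden node's activation values across the samples; paired with the label vector L, it is the input the ranking scores are computed from.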
- Source of data required: Two preprocessed datasets were used to test the capability: the public image classification set MNIST, and breast cancer data downloaded and processed from the GDC. The training of the autoencoder is not part of the capability.
- Public vs. Restricted: The data used are public.
The results show that ANS determines whether an autoencoder can handle unbalanced gene expression datasets. Supervised node saliency (SNS) with the binary distribution provides useful rankings of the hidden nodes for different learning tasks, and the NED values verify the learning behaviors of the top-ranked nodes.
Given two labels for the samples (such as ‘0’ and ‘1’), the code generates:
- NED: Normalized entropy difference computed from both labels, 0 and 1.
- NED0: Normalized entropy difference from label 0.
- NED1: Normalized entropy difference from label 1.
- sns_incr: Supervised node saliency with increasing probability distribution.
- sns_bi: Supervised node saliency with binary distribution.
- g0Count: Histogram bar counts for class 0 at each node.
- g1Count: Histogram bar counts for class 1 at each node.
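A minimal sketch of how the per-class quantities above might be computed for a single hidden node from its activation values. It assumes NED is defined as one minus the entropy of the B-bin activation histogram normalized by log2(B); this is an illustrative assumption, not the repository's exact formula:

```python
import numpy as np

def ned(a, B=10):
    """Normalized entropy difference for one node's activations a in [0, 1].

    Illustration only (assumed definition): 1 - H / log2(B), where H is the
    entropy of the B-bin histogram of a. The score is 1 when all activations
    fall in a single bin and 0 when they spread uniformly over all bins.
    """
    counts, _ = np.histogram(a, bins=B, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    H = -(p * np.log2(p)).sum()
    return 1.0 - H / np.log2(B)

def class_scores(a, labels, B=10):
    """Per-class NED plus histogram bar counts for one hidden node."""
    g0Count = np.histogram(a[labels == 0], bins=B, range=(0.0, 1.0))[0]
    g1Count = np.histogram(a[labels == 1], bins=B, range=(0.0, 1.0))[0]
    return ned(a[labels == 0], B), ned(a[labels == 1], B), g0Count, g1Count

# Toy node that separates the classes: class 0 near 0, class 1 near 1
a = np.concatenate([np.full(50, 0.05), np.full(50, 0.95)])
labels = np.concatenate([np.zeros(50, int), np.ones(50, int)])
NED0, NED1, g0Count, g1Count = class_scores(a, labels)
```

For this toy node both per-class NED values reach their maximum, since each class concentrates in a single histogram bin; a node whose activations mix the two classes across many bins would score lower.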