Autoencoder - P2B1 (Autoencoder) | Computational Resources for Cancer Research

Short Description

The P2B1 capability is an autoencoder that determines a set of features to describe molecular dynamics (MD) simulation data most efficiently.

User Community

Scientists interested in working with efficient representations of MD simulation data.

Impact

Used to generate a tractable set of features from a larger input dataset that can then be fed into additional models for a variety of purposes.

Hypothesis/Objective

The objective was to develop autoencoder software that efficiently determines the best set of features for describing molecular dynamics simulation data.

Resource Role

This resource is related to other protein-protein interaction and molecular dynamics simulation-assessment software such as MuMMI, MemSurfer, and DynIm.

Uniqueness

MD simulation data consist of many descriptors. This capability shows how you can use an autoencoder to compress these descriptors into a minimal set that faithfully describes the data. This enables downstream analysis using a more tractable dataset as input.
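
The compression idea described above can be illustrated with a toy example. The sketch below is not the P2B1 network (which is convolutional); it is a minimal linear autoencoder in plain Python, trained by gradient descent on made-up three-descriptor data that secretly depends on a single underlying factor. All names (encode, decode, train) and the synthetic data are assumptions introduced for illustration only.

```python
def encode(w_enc, x):
    """Project a descriptor vector x onto the latent features (rows of w_enc)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


def decode(w_dec, z):
    """Reconstruct the descriptor vector from the latent features z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in w_dec]


def reconstruction_loss(w_enc, w_dec, data):
    """Mean squared reconstruction error over all samples."""
    total = 0.0
    for x in data:
        x_hat = decode(w_dec, encode(w_enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train(data, n_latent, steps=500, lr=0.1):
    """Full-batch gradient descent on the reconstruction loss."""
    n_in = len(data[0])
    # Constant initialization is enough for a single latent feature;
    # use random initialization when n_latent > 1 to break symmetry.
    w_enc = [[0.1] * n_in for _ in range(n_latent)]  # n_latent x n_in
    w_dec = [[0.1] * n_latent for _ in range(n_in)]  # n_in x n_latent
    for _ in range(steps):
        g_enc = [[0.0] * n_in for _ in range(n_latent)]
        g_dec = [[0.0] * n_latent for _ in range(n_in)]
        for x in data:
            z = encode(w_enc, x)
            residual = [a - b for a, b in zip(decode(w_dec, z), x)]
            # Gradient of the squared error with respect to the latent code.
            g_z = [2.0 * sum(residual[i] * w_dec[i][j] for i in range(n_in))
                   for j in range(n_latent)]
            for i in range(n_in):
                for j in range(n_latent):
                    g_dec[i][j] += 2.0 * residual[i] * z[j]
            for j in range(n_latent):
                for m in range(n_in):
                    g_enc[j][m] += g_z[j] * x[m]
        scale = lr / len(data)
        for i in range(n_in):
            for j in range(n_latent):
                w_dec[i][j] -= scale * g_dec[i][j]
        for j in range(n_latent):
            for m in range(n_in):
                w_enc[j][m] -= scale * g_enc[j][m]
    return w_enc, w_dec


# Synthetic data (hypothetical): three descriptors driven by one hidden factor t.
direction = (1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0)
data = [[(-1.0 + 0.1 * k) * d for d in direction] for k in range(21)]

init_enc, init_dec = train(data, n_latent=1, steps=0)
initial_loss = reconstruction_loss(init_enc, init_dec, data)

w_enc, w_dec = train(data, n_latent=1)
final_loss = reconstruction_loss(w_enc, w_dec, data)
```

After training, encode(w_enc, x) returns one number per sample in place of three descriptors, and decode recovers the original descriptors with negligible error. The same principle, at much larger scale and with a nonlinear network, is what makes the reduced feature set usable as input for downstream analysis.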

Usability

Scientists can train the model on their own data and use the resulting reduced set of features as input for further analysis.

Components

The Autoencoder (also known as P2B1) has the following components in the Model and Data Clearinghouse (MoDaC):

Simulation Input Data: The default dataset is the 3k disordered, three-component system (DPPC-DOPC-CHOL) (https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7654212).
Converged Model: The trained weights (for both the full model and just the encoder; .hdf5 files) and the corresponding model topologies (.json files) are stored in Autoencoder for MD Simulation Data (P2B1) (https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-7681692).
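
As a rough sketch of how the converged model might be restored from a topology file and a weights file: the helper below assumes the .json and .hdf5 files are in Keras format, which this page does not state explicitly. The function name and the file names in the usage comment are hypothetical; check the MoDaC asset and the GitHub repository for the actual names and the framework version required.

```python
from pathlib import Path


def load_converged_model(topology_json, weights_hdf5):
    """Rebuild a model from a JSON topology and load its trained weights.

    Assumes Keras-format files; this is an assumption, not confirmed here.
    """
    # Imported lazily so this helper can be defined without Keras installed.
    from tensorflow.keras.models import model_from_json

    model = model_from_json(Path(topology_json).read_text())
    model.load_weights(str(weights_hdf5))
    return model


# Hypothetical usage (file names are placeholders, not the real asset names):
# encoder = load_converged_model("encoder_topology.json", "encoder_weights.hdf5")
# latent = encoder.predict(batch_of_frames)
```

If the topology uses custom layers, or the installed Keras version differs from the one used for training, loading may need additional arguments or a matching environment.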

Inputs

Data source: MD Simulation output as PDB files (coarse-grained bead simulation)
Input dimensions:
- Long term target: ~1.26e6 per time step (6000 lipids x 30 beads per lipid x (position + velocity + type))
- Current: ~288e3 per time step (6000 lipids x 12 beads per lipid x (position + type))
Output dimensions: 500
Latent representation dimension:
Sample size: O(10^6) for simulation requiring O(10^8) time steps
Notes on data balance and other issues: Unlabeled data with rare events
Source of data required: Molecular dynamics simulations
Public vs. Restricted: The provided data is publicly available on FTP sites
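
The input dimensions listed above can be checked with a few lines of arithmetic. The sketch below assumes three values each for position and velocity and one value for bead type; these per-bead counts are not stated on this page, but they are the decomposition that reproduces the quoted totals.

```python
# Assumed per-bead feature counts (not stated explicitly in the source).
POSITION_DIMS = 3  # x, y, z coordinates
VELOCITY_DIMS = 3  # velocity components
TYPE_DIMS = 1      # bead type label

LATENT_DIM = 500   # output dimension listed above


def input_dimension(n_lipids, beads_per_lipid, features_per_bead):
    """Number of input values per time step."""
    return n_lipids * beads_per_lipid * features_per_bead


# Long-term target: position + velocity + type for 30 beads per lipid.
long_term = input_dimension(6000, 30, POSITION_DIMS + VELOCITY_DIMS + TYPE_DIMS)

# Current: position + type for 12 beads per lipid.
current = input_dimension(6000, 12, POSITION_DIMS + TYPE_DIMS)

# How many input values are summarized by each latent feature.
compression_ratio = current / LATENT_DIM

print(long_term, current, compression_ratio)
```

Under these assumptions the long-term target is 1,260,000 values per time step and the current input is 288,000, so the 500-dimensional output corresponds to a 576-fold reduction of the current input.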

Results

The P2B1 benchmark runs on molecular dynamics (MD) simulation states (including Protein Data Bank files resulting from a coarse-grained bead simulation) of a "disordered," three-component system (DPPC-DOPC-CHOL). The default system consists of 3,000 lipids and 3,000 frames simulated for 10 microseconds. The implemented network is a convolutional neural network that quickly and effectively minimizes the reconstruction loss.

File Links

Weights and Topology in MoDaC

Simulation Input Data in MoDaC

Outputs

Latent representation of the simulation state in 500 dimensions.

AVAILABLE ON GITHUB

https://github.com/CBIIT/NCI-DOE-Collab-Pilot2-Autoencoder_MD_Simulation_Data