Publications
Authors: Nguyen, Kien, López, Cesar A., Neale, Chris, Van, Que N., Carpenter, Timothy S., Di Natale, Francesco, Travers, Timothy, Tran, Timothy H., Chan, Albert H., Bhatia, Harsh, Frank, Peter H., Tonelli, Marco, Zhang, Xiaohua, Gulten, Gulcin, Reddy, Tyler, Burns, Violetta, Oppelstrup, Tomas, Hengartner, Nick, Simanshu, Dhirendra K., Bremer, Peer-Timo, Chen, De, Glosli, James N., Shrestha, Rebika, Turbyville, Thomas, Streitz, Frederick H., Nissley, Dwight V., Ingólfsson, Helgi I., Stephen, Andrew G., Lightstone, Felice C., Gnanakaran, Sandrasegaram
TITLE: Exploring CRD Mobility During RAS/RAF Engagement at the Membrane , Biophysical Journal , 19 , 121 : 3630-3650 , 2022
PUBLICATION DATE: 10-04-2022
ABSTRACT: During the activation of mitogen-activated protein kinase (MAPK) signaling, the RAS-binding domain (RBD) and cysteine-rich domain (CRD) of RAF bind to active RAS at the plasma membrane. The orientation of RAS at the membrane may be critical for formation of the RAS-RBDCRD complex and subsequent signaling. To explore how RAS membrane orientation relates to the protein dynamics within the RAS-RBDCRD complex, the authors perform multiscale coarse-grained and all-atom molecular dynamics (MD) simulations of KRAS4b bound to the RBD and CRD of RAF-1, both in solution and anchored to a model plasma membrane. Solution MD simulations describe dynamic KRAS4b-CRD conformations, suggesting that the CRD has sufficient flexibility in this environment to substantially change its binding interface with KRAS4b. In contrast, when the ternary complex is anchored to the membrane, the mobility of the CRD relative to KRAS4b is restricted, resulting in fewer distinct KRAS4b-CRD conformations. These simulations implicate membrane orientations of the ternary complex that are consistent with nuclear magnetic resonance (NMR) measurements. While a crystal structure-like conformation is observed in both solution and membrane simulations, a particular intermolecular rearrangement of the ternary complex is observed only when it is anchored to the membrane. This configuration emerges when the CRD hydrophobic loops are inserted into the membrane and helices α3–5 of KRAS4b are solvent exposed. This membrane-specific configuration is stabilized by KRAS4b-CRD contacts that are not observed in the crystal structure. These results suggest a modulatory interplay between the CRD and the plasma membrane that correlates with RAS/RAF complex structure and dynamics and potentially influences subsequent steps in the activation of MAPK signaling.
PROJECT: ADMIRRAL
Authors: De Angeli, Kevin, Gao, Shang, Blanchard, Andrew, Durbin, Eric B, Wu, Xiao-Cheng, Stroup, Antoinette, Doherty, Jennifer, Schwartz, Stephen M, Wiggins, Charles, Coyle, Linda, Penberthy, Lynne, Tourassi, Georgia, Yoon, Hong-Jun
TITLE: Using Ensembles and Distillation to Optimize the Deployment of Deep Learning Models for the Classification of Electronic Cancer Pathology Reports , JAMIA Open , 3 , 5 : ooac075 , 2022
PUBLICATION DATE: 10-01-2022
ABSTRACT: Lay Summary: One of the goals of the Surveillance, Epidemiology, and End Results (SEER) program is to estimate incidence, prevalence, and mortality of all cancers. To that end, cancer registries across the country maintain a massive database of cancer pathology reports, which contain rich information for understanding cancer trends. However, these reports are stored as unstructured text, and human annotators are required to read them and extract relevant information. In this article, the authors show that existing deep learning models for automating information extraction from cancer pathology reports can be significantly improved by using ensemble model distillation. The authors found that by training multiple predictive models and transferring their knowledge to a single, low-resource model, they can reduce the number of highly confident wrong predictions. The authors’ results show that their implemented methods could save thousands of manual annotation hours. Objective: The authors aim to reduce overfitting and model overconfidence by distilling the knowledge of an ensemble of deep learning models into a single model for the classification of cancer pathology reports. Materials and Methods: The authors consider a text classification problem that involves five individual tasks. The baseline model consists of a multitask convolutional neural network (MT-CNN), and the implemented ensemble (teacher) consists of 1,000 MT-CNNs. The authors performed knowledge transfer by training a single model (student) with soft labels derived through the aggregation of ensemble predictions. The authors evaluate performance based on accuracy and abstention rates by using softmax thresholding. Results: The student model outperforms the baseline MT-CNN in terms of abstention rates and accuracy, thereby allowing the authors to use the model with a larger volume of documents when deployed. The authors observed the highest boost for subsite and histology, for which the student model classified an additional 1.81% of reports for subsite and 3.33% of reports for histology. Discussion: Ensemble predictions provide a useful strategy for quantifying the uncertainty inherent in labeled data and thereby enable the construction of soft labels with estimated probabilities for multiple classes for a given document. Training models with the derived soft labels reduces model confidence in difficult-to-classify documents, thereby leading to a reduction in the number of highly confident wrong predictions. Conclusions: Ensemble model distillation is a simple tool to reduce model overconfidence in problems with extreme class imbalance and noisy datasets. These methods can facilitate the deployment of deep learning models in high-risk domains with low computational resources where minimizing inference time is required.
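The two mechanisms named in this abstract, soft labels aggregated from ensemble predictions and abstention by softmax thresholding, can be illustrated with a minimal sketch. The abstract does not say how the ensemble predictions are aggregated, so plain averaging is an assumption here, and the function names, the toy probabilities, and the threshold values are all hypothetical.

```python
from typing import List, Optional, Sequence


def soft_labels(ensemble_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities over ensemble members.

    Averaging is an assumed aggregation rule, not necessarily the paper's.
    """
    n_models = len(ensemble_probs)
    n_classes = len(ensemble_probs[0])
    return [sum(p[c] for p in ensemble_probs) / n_models for c in range(n_classes)]


def predict_or_abstain(probs: Sequence[float], threshold: float) -> Optional[int]:
    """Return the most probable class, or None (abstain) if it is below the threshold."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    return best if probs[best] >= threshold else None


# Hypothetical softmax outputs of three teacher models for one document.
teacher_outputs = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
]

# The student would be trained against this distribution instead of a one-hot label.
target = soft_labels(teacher_outputs)

# A lenient threshold yields a prediction; a strict one makes the model abstain.
lenient = predict_or_abstain(target, threshold=0.5)
strict = predict_or_abstain(target, threshold=0.8)
```

Because the teachers disagree on this document, the averaged target spreads probability across classes, which is how soft labels can lower confidence on hard-to-classify reports.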
Authors: Bhatia, Harsh, Thiagarajan, Jayaraman J., Anirudh, Rushil, Jayram, T. S., Oppelstrup, Tomas, Ingólfsson, Helgi I., Lightstone, Felice C., Bremer, Peer-Timo
TITLE: A Biology-informed Similarity Metric for Simulated Patches of Human Cell Membrane , Machine Learning: Science and Technology , 3 , 3 : 35010 , 2022
PUBLICATION DATE: 08-19-2022
ABSTRACT: Complex scientific inquiries rely increasingly upon large and autonomous multiscale simulation campaigns, which fundamentally require similarity metrics to quantify 'sufficient' changes among data and/or configurations. However, subject matter experts are often unable to articulate similarity precisely or in terms of well-formulated definitions, especially when they explore new hypotheses, making it challenging to design a meaningful metric. Furthermore, the key to practical usefulness of such metrics to enable autonomous simulations lies in in situ inference, which requires generalization to possibly substantial distributional shifts in unseen, future data. Here, the authors address these challenges in a cancer biology application and develop a meaningful similarity metric for 'patches': regions of simulated human cell membrane that express interactions between certain proteins of interest and relevant lipids. In the absence of well-defined conditions for similarity, the authors leverage several biology-informed notions about data and the underlying simulations to impose inductive biases on the authors’ metric learning framework, resulting in a suitable similarity metric that also generalizes well to significant distributional shifts encountered during the deployment. The authors combine these intuitions to organize the learned embedding space in a multiscale manner, which makes the metric robust to incomplete and even contradictory intuitions. The authors’ approach delivers a metric that not only performs well on the conditions used for its development and other relevant criteria, but also learns key spatiotemporal relationships without ever being exposed to any such information during training.
Authors: López, Cesar A., Zhang, Xiaohua, Aydin, Fikret, Shrestha, Rebika, Van, Que N., Stanley, Christopher B., Carpenter, Timothy S., Nguyen, Kien, Patel, Lara A., Chen, De, Burns, Violetta, Hengartner, Nicolas W., Reddy, Tyler J. E., Bhatia, Harsh, Di Natale, Francesco, Tran, Timothy H., Chan, Albert H., Simanshu, Dhirendra K., Nissley, Dwight V., Streitz, Frederick H., Stephen, Andrew G., Turbyville, Thomas J., Lightstone, Felice C., Gnanakaran, Sandrasegaram, Ingólfsson, Helgi I., Neale, Chris
TITLE: Asynchronous Reciprocal Coupling of Martini 2.2 Coarse-Grained and CHARMM36 All-Atom Simulations in an Automated Multiscale Framework , Journal of Chemical Theory and Computation , 8 , 18 : 5025-5045 , 2022
PUBLICATION DATE: 08-09-2022
ABSTRACT: The appeal of multiscale modeling approaches depends on the promise of combinatorial synergy. However, the community can realize this promise only when distinct scales are combined with reciprocal consistency. Here, the authors consider multiscale molecular dynamics (MD) simulations that combine the accuracy and macromolecular flexibility accessible to fixed-charge all-atom (AA) representations with the sampling speed accessible to reductive, coarse-grained (CG) representations. AA-to-CG conversions are relatively straightforward because deterministic routines with unique outcomes are achievable. Conversely, CG-to-AA conversions have many solutions due to a surge in the number of degrees of freedom. While automated tools for biomolecular CG-to-AA transformation exist, the authors find that one popular option, called Backward, is prone to stochastic failure and the AA models that it does generate frequently have compromised protein structure and incorrect stereochemistry. Although these shortcomings can likely be circumvented by human intervention in isolated instances, automated multiscale coupling requires reliable and robust scale conversion. Here, the authors detail an extension to Multiscale Machine-learned Modeling Infrastructure (MuMMI), including an improved CG-to-AA conversion tool called sinceCG. This tool is reliable (∼98% weakly correlated repeat success rate), automatable (no unrecoverable hangs), and yields AA models that generally preserve protein secondary structure and maintain correct stereochemistry. The authors describe how the MuMMI framework identifies CG system configurations of interest, converts them to AA representations, and simulates them at the AA scale while on-the-fly analyses provide feedback to update CG parameters. 
The authors discuss application to systems containing the peripheral membrane protein RAS and proximal components of RAF kinase on complex eight-component lipid bilayers with ∼1.5 million atoms in the context of MuMMI.
Authors: Yoon, Hong-Jun, Peluso, Alina, Durbin, Eric B, Wu, Xiao-Cheng, Stroup, Antoinette, Doherty, Jennifer, Schwartz, Stephen, Wiggins, Charles, Coyle, Linda, Penberthy, Lynne
TITLE: Automatic Information Extraction from Childhood Cancer Pathology Reports , JAMIA Open , 2 , 5 : ooac049 , 2022
PUBLICATION DATE: 07-01-2022
ABSTRACT: Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, there has been no development of machine learning models for the ICCC classification. The authors developed deep learning-based information extraction models from cancer pathology reports based on the ICD-O-3 coding standard. In this article, the authors describe extending the models to perform ICCC classification. Materials and Methods: The authors developed two models, ICD-O-3 classification and ICCC recoding (Model 1) and direct ICCC classification (Model 2), and four scenarios subject to the training sample size. The authors evaluated these models with a corpus consisting of 29,206 reports with age at diagnosis between 0 and 19 years from six state cancer registries. Results: The findings suggest that the direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying the uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports. Conclusions: The experimental results suggest that machine learning-based automatic information extraction from childhood cancer pathology reports in the ICCC is a reliable means of supplementing human annotators at state cancer registries by reading and abstracting the majority of the childhood cancer pathology reports accurately and reliably. Lay Summary: ICCC is the coding standard designed to categorize childhood cancers. However, machine learning-based ICCC classification has not been studied extensively, mainly owing to the limited volume of the pediatric cancer corpus; pediatric cancer is much less prevalent than adult cancers. Under the oversight of the National Childhood Cancer Registry project, the authors developed a deep learning-based text comprehension model for classifying ICCC from childhood cancer pathology reports. The authors performed a comparison study between the following approaches: (1) classifying ICD-O-3 codes and then recoding into ICCC, and (2) classifying ICCC codes directly. The authors observed that the direct approach exhibited a substantially higher accuracy score. The authors are aware that low-precision models are not appropriate for this task because they degrade the credibility of model-based decisions. The authors applied an uncertainty quantification algorithm to the ICCC classification model. The authors achieved nearly perfect accuracy scores, while the model abstained on 14.8% of ambiguous cases. This result means the authors’ machine learning model can serve human annotators at state cancer registries by processing 85.2% of the childhood cancer pathology reports automatically.
Authors: Zhu, Yitan, Brettin, Thomas S., Partin, Alexander, Xia, Fangfang, Shukla, Maulik, Yoo, Hyunseung, Cancino, Andrea, Larsen, Brian, Shaxted, Jenna, Salahudeen, Ameen, White, Kevin, Stevens, Rick L.
TITLE: Multifactorial Drug Response Modeling Based on Cancer Organoid Data , Journal of Clinical Oncology , 16_suppl , 40 : e13544 , 2022
PUBLICATION DATE: 06-02-2022
ABSTRACT: Background: Prediction of drug response based on cancer molecular profiles is of paramount importance for precision oncology. Most existing drug response prediction models were built using drug screening data of immortalized cancer cell lines, which usually have altered genomic profiles compared with patient tumors. Patient-derived organoids (PDOs) are emerging as a promising platform that better represents patient tumors. The authors built computational drug response prediction models based on PDO drug screening data, which is the first study of its type to the authors’ knowledge. Methods: The authors successfully developed 27 PDO lines of colorectal cancer and 20 PDO lines of head and neck (H&N) cancer. They generated transcriptomics, copy number variation (CNV), and targeted DNA mutation data for the PDO lines. They screened the PDO lines with 36 drugs of diversified mechanisms. They took the area under the dose response curve as the response measurement. The authors used the LightGBM algorithm to build response prediction models based on cancer molecular data and drug chemical descriptors/fingerprints. To investigate the influence of different factors on the prediction performance, including different cancer types, cancer molecular features, drug features, data preprocessing methods, and others, the authors applied a multifactorial analysis scenario to build and evaluate 3,384 prediction models constructed with all possible combinations of the factors. For example, the authors built prediction models for H&N and colorectal PDOs separately and jointly. Results: A prediction model built for H&N PDOs achieved the highest prediction performance among all prediction models, with an R² of 0.790 in 10-fold cross-validation. The authors built the model using drug descriptors, CNVs, and expressions of “landmark” genes, which represent cellular transcriptomic changes well and were identified in the Library of Integrated Network-Based Cellular Signatures (LINCS) project. The table below includes all the factorial differences that caused an average R² change larger than 1%. All R² changes are statistically significant (p-values < 1×10⁻⁵⁰), evaluated by pairwise t-tests comparing models built with the status of the factor changed. Prediction performance increased from colorectal cancer, to the two cancer types combined, to H&N cancer. Gene expression data, either whole-transcriptome or the subset of LINCS genes, boosted the prediction performance. Between the two different dyes used to stain dead cells, TO-PRO-3 provided a higher prediction performance than Caspase-3/7. Conclusions: The highest drug response prediction performance achieved is an R² of 0.790. Cancer type, dye, and whether the authors used gene expressions in modeling are the factors most influential on prediction performance.
Factor | Comparison | R² change
Cancer type | H&N vs. colorectal | 3.42%
Cancer type | H&N vs. both | 2.30%
Cancer type | Both vs. colorectal | 1.12%
Gene expressions | LINCS genes vs. none | 3.38%
Gene expressions | All genes vs. none | 2.94%
Dye | TO-PRO-3 vs. Caspase-3/7 | 2.59%
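The multifactorial design described in this abstract, one model per combination of factor levels, can be sketched as a Cartesian product. This is a hedged illustration only. The three factors and their levels are the ones named in the abstract's table, the variable names are hypothetical, and the actual study crossed additional factors (for example drug features and preprocessing methods, whose levels are not listed here) to reach 3,384 models.

```python
from itertools import product

# Factor levels taken from the abstract's table. The full study used more
# factors than these three, so this grid is far smaller than 3,384 models.
factors = {
    "cancer_type": ["H&N", "colorectal", "both"],
    "gene_expressions": ["none", "LINCS genes", "all genes"],
    "dye": ["TO-PRO-3", "Caspase-3/7"],
}

# One configuration per combination of levels; each would get its own model.
names = list(factors)
configs = [dict(zip(names, levels)) for levels in product(*factors.values())]

for config in configs[:3]:
    print(config)
```

The effect of one factor can then be estimated by pairing configurations that differ only in that factor and comparing their scores, which is the kind of paired comparison the abstract reports.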
Authors: Partin, Alexander, Brettin, Thomas S., Zhu, Yitan, Shukla, Maulik, Xia, Fangfang, Yoo, Hyunseung, Dolezal, James M, Kochanny, Sara, Pearson, Alexander T., Evrard, Yvonne A., Doroshow, James H., Stevens, Rick L.
TITLE: Drug Response Prediction in Patient-derived Xenografts with Data Augmentation and Multimodal Deep Learning , Journal of Clinical Oncology , 16_suppl , 40 : e13572 , 2022
PUBLICATION DATE: 06-01-2022
ABSTRACT: Background: Prediction of drug response is a critical research area in precision oncology. This area has previously been explored with large drug screening studies of cancer cell lines (CCLs). Patient-derived xenografts (PDXs) are an appealing platform for preclinical drug studies because the in vivo environment of PDXs helps preserve tumor heterogeneity and usually better mimics drug response of patients with cancer compared to CCLs. Methods: The authors investigated a multimodal neural network (NN) and data augmentation for drug response prediction in PDXs. The multimodal NN learned to predict response using drug descriptors, gene expressions (GE), and histology whole-slide images (WSIs), where the multi-modality refers to tumor features only. The NN used late integration, in which separate subnetworks encode the input feature types before the concatenation and prediction layers. The authors assessed median tumor volume per treatment group relative to the control group to create a binary variable representing response. The data included 12 single-drug and 36 drug-pair treatments, resulting in 2,556 single-drug and 2,203 drug-pair response values. The authors used pathology and omics data from 487 PDXs from NCI's Patient Derived Models Repository as tumor feature model inputs. The authors explored whether the integration of WSIs with GE improves predictions as compared with models that use GE alone. The authors used the following methods to address the limited number of response values in the dataset: (1) homogenized drug representations, which allowed the authors to combine single-drug and drug-pair treatments into a single dataset, and (2) augmented drug-pair samples created by switching the order of drug features, which doubles the number of drug-pair samples. These methods enabled the authors to combine single-drug and drug-pair treatments, which resulted in 6,962 responses, allowing the authors to train multimodal and unimodal NNs without changing architectures or the dataset. Results: The authors compared the prediction performance of three unimodal NNs that use GE (um1, um2, and um3) to assess the contribution of data augmentation methods. NN um1, which used the full dataset (the original and augmented drug-pair treatments as well as the single-drug treatments), significantly outperformed (p-values < 0.01) the NNs that ignored either the augmented drug-pairs (um2) or the single-drug treatments (um3). In assessing the contribution of multimodal learning, results showed that the multimodal NN (mm) outperformed both unimodal NNs that ignored either the GE (um4) or the WSIs (um1). However, the improvement of mm over um1 is not statistically significant (p-value < 0.26). Conclusions: The authors' results showed that data augmentation and integration of histology images and GE can help improve prediction performance of drug response in PDXs.
Model | WSI | GE | Single-drug | Drug-pairs | Augmented drug-pairs | MCC
um1 | - | v | v | v | v | 0.29
um2 | - | v | v | v | - | 0.22
um3 | - | v | - | v | v | 0.19
um4 | v | - | v | v | v | 0.21
mm | v | v | v | v | v | 0.31
(v = used, - = not used)
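The drug-pair augmentation described in this abstract, switching the order of the two drugs' features, can be sketched as follows. It also checks the reported sample count: 2,556 single-drug responses plus twice the 2,203 drug-pair responses gives 6,962. The tuple layout, the function name, and the toy feature values are hypothetical, and the abstract does not say how single-drug treatments are encoded in the homogenized representation, so that part is left out.

```python
from typing import List, Tuple

# Hypothetical sample layout: (drug_1_features, drug_2_features, binary_response).
PairSample = Tuple[Tuple[float, ...], Tuple[float, ...], int]


def augment_drug_pairs(samples: List[PairSample]) -> List[PairSample]:
    """Return the original samples plus a copy of each with the drug order swapped."""
    swapped = [(drug_2, drug_1, response) for drug_1, drug_2, response in samples]
    return list(samples) + swapped


# Toy drug-pair samples with made-up two-dimensional drug descriptors.
pairs: List[PairSample] = [
    ((0.1, 0.2), (0.7, 0.8), 1),
    ((0.3, 0.4), (0.5, 0.6), 0),
]
augmented = augment_drug_pairs(pairs)

# Counts reported in the abstract: augmentation doubles only the drug-pair samples.
n_single, n_pair = 2556, 2203
total_responses = n_single + 2 * n_pair
```

Swapping the order leaves the response unchanged, on the assumption that a treatment's outcome does not depend on which drug is listed first.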
Authors: Mohammad Mirzaei, Navid, Tatarova, Zuzana, Hao, Wenrui, Changizi, Navid, Asadpoure, Alireza, Zervantonakis, Ioannis K., Hu, Yu, Chang, Young Hwan, Shahriyari, Leili
TITLE: A PDE Model of Breast Tumor Progression in MMTV-PyMT Mice , Journal of Personalized Medicine , 5 , 12 : 807 , 2022
PUBLICATION DATE: 05-17-2022
ABSTRACT: The evolution of breast tumors greatly depends on the interaction network among different cell types, including immune cells and cancer cells in the tumor. This study takes advantage of newly collected rich spatio-temporal mouse data to develop a data-driven mathematical model of breast tumors that considers cells’ location and key interactions in the tumor. The results show that cancer cells have a minor presence in the area with the most overall immune cells, and the number of activated immune cells in the tumor is depleted over time when there is no influx of immune cells. Interestingly, in the case of the influx of immune cells, the highest concentrations of both T cells and cancer cells are at the boundary of the tumor, as the authors use the Robin boundary condition to model the influx of immune cells. In other words, the influx of immune cells causes a dominant outward advection for cancer cells. The authors also investigate the effect of cells’ diffusion and immune cells’ influx rates on the dynamics of cells in the tumor micro-environment. Sensitivity analyses indicate that the diffusion rates of cancer cells and adipocytes are the most sensitive parameters, followed by the influx and diffusion rates of cytotoxic T cells, implying that targeting them is a possible treatment strategy for breast cancer.
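For readers unfamiliar with the term, a Robin boundary condition ties the value of a quantity at the boundary to its normal derivative there. The generic textbook form for a diffusing density is shown below. The symbols are illustrative assumptions and are not taken from the paper, whose exact boundary terms are not given in the abstract.

```latex
% c: density of a cell type, D: its diffusion coefficient,
% \partial c / \partial n: derivative along the outward normal of the tumor boundary \partial\Omega,
% \alpha and g: coefficients setting the exchange rate and the external supply (hypothetical names).
D \,\frac{\partial c}{\partial n} + \alpha\, c = g \quad \text{on } \partial\Omega
```

Setting \alpha = 0 and g = 0 recovers the no-flux (homogeneous Neumann) condition, which corresponds to the case with no influx of immune cells.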
Authors: Wright, Rosalind J., Hanson, Heidi A.
TITLE: A Tipping Point in Cancer Epidemiology: Embracing a Life Course Exposomic Framework , Trends in Cancer , 4 , 8 : 280-282 , 2022
PUBLICATION DATE: 04-16-2022
ABSTRACT: The pathogenesis of multifactorial malignant diseases, with variable onset, severity, and natural history, reflects development-specific exposures and individual responses to these exposures influenced by underlying genetic predisposition. Embedded in life course theory, exposomics provide a framework to elucidate how environmental factors alter cancer risk, disease course, and response to treatment across the lifespan.
Authors: Yoon, Hong-Jun, Stanley, Christopher, Christian, J. Blair, Klasky, Hilda B., Blanchard, Andrew E., Durbin, Eric B., Wu, Xiao-Cheng, Stroup, Antoinette, Doherty, Jennifer, Schwartz, Stephen M., Wiggins, Charles, Damesyn, Mark, Coyle, Linda, Tourassi, Georgia D.
TITLE: Optimal Vocabulary Selection Approaches for Privacy-preserving Deep NLP Model Training for Information Extraction and Cancer Epidemiology , Cancer Biomarkers , 2 , 33 : 185-198 , 2022
PUBLICATION DATE: 02-14-2022
ABSTRACT: Background: With the use of artificial intelligence and machine learning techniques for biomedical informatics, security and privacy concerns over the data and subject identities have become an important issue and essential research topic. Without intentional safeguards, machine learning models may find patterns and features to improve task performance that are associated with private personal information. Objective: The privacy vulnerability of deep learning models for information extraction from medical textual content needs to be quantified, since the models are exposed to private health information and personally identifiable information. The objective of the study is to quantify the privacy vulnerability of deep learning models for natural language processing (NLP) and explore a proper way of securing patients’ information to mitigate confidentiality breaches. Methods: The target model is the multitask convolutional neural network for information extraction from cancer pathology reports, where the data for training the model are from multiple state population-based cancer registries. This study proposes the following schemes to collect vocabularies from the cancer pathology reports: (1) words appearing in multiple registries, and (2) words that have higher mutual information. The authors performed membership inference attacks on the models in high-performance computing environments. Results: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
PROJECT: MOSSAIC
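The first vocabulary scheme in the abstract above, keeping only words that appear in multiple registries, can be sketched in a few lines. This is an illustration under stated assumptions: whitespace tokenization, the `min_registries` threshold, the function name, and the toy reports are all hypothetical, and the mutual-information scheme is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def shared_vocabulary(
    reports_by_registry: Dict[str, Iterable[str]], min_registries: int = 2
) -> Set[str]:
    """Keep words that occur in reports from at least `min_registries` registries."""
    registry_count: Counter = Counter()
    for reports in reports_by_registry.values():
        words_in_registry: Set[str] = set()
        for text in reports:
            # Naive lowercase whitespace tokenization (an assumption of this sketch).
            words_in_registry.update(text.lower().split())
        # Count each word at most once per registry.
        registry_count.update(words_in_registry)
    return {word for word, n in registry_count.items() if n >= min_registries}


# Toy reports from three hypothetical registries.
toy_reports = {
    "registry_a": ["carcinoma of breast", "patient smith"],
    "registry_b": ["breast carcinoma grade 2"],
    "registry_c": ["grade 3 lung"],
}
vocabulary = shared_vocabulary(toy_reports)
```

In this toy data, the words seen in only one registry, such as the surname, are dropped, which presumably is the privacy intuition behind the scheme.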