Publications
Authors: Gutta V, Ganakammal SR, Jones S, Beyers M, Chandrasekaran S
TITLE: UNNT: A novel Utility for comparing Neural Net and Tree-based models , PLOS Computational Biology , 20(4) : e1011504 , 2024
PUBLICATION DATE: 04-29-2024
ABSTRACT: The use of deep learning (DL) is steadily gaining traction in scientific challenges such as cancer research. Advances in enhanced data generation, machine learning algorithms, and compute infrastructure have led to an acceleration in the use of deep learning in various domains of cancer research such as drug response problems. In our study, we explored tree-based models to improve the accuracy of a single drug response model and demonstrate that tree-based models such as XGBoost (eXtreme Gradient Boosting) have advantages over deep learning models, such as a convolutional neural network (CNN), for single drug response problems. However, comparing models is not a trivial task. To make training and comparing CNNs and XGBoost more accessible to users, we developed an open-source library called UNNT (A novel Utility for comparing Neural Net and Tree-based models). The case studies in this manuscript focus on cancer drug response datasets; however, the application can be used on datasets from other domains, such as chemistry.
Authors: Long, Jiaxin, Ganakammal, Satishkumar R., Jones, Sara E., Kothandaraman, Harish, Dhawan, Deepika, Ogas, Joe, Knapp, Deborah W., Beyers, Matthew, Lanman, Nadia A.
TITLE: cTULIP: application of a human-based RNA-seq primary tumor classification tool for cross-species primary tumor classification in canine , Frontiers in Oncology , 13 , 2023
PUBLICATION DATE: 07-19-2023
ABSTRACT: Introduction: The domestic dog, Canis familiaris, is quickly gaining traction as an advantageous model for use in the study of cancer, one of the leading causes of death worldwide. Naturally occurring canine cancers share clinical, histological, and molecular characteristics with the corresponding human diseases. Methods: In this study, we take a deep-learning approach to test how similar the gene expression profiles of canine glioma and bladder cancer (BLCA) tumors are to those of the corresponding human tumors. We likewise develop a tool for identifying misclassified or outlier samples in large canine oncological datasets, analogous to that which was developed for human datasets. Results: We tested a number of machine learning algorithms and found that a convolutional neural network outperformed logistic regression and random forest approaches. We use a recently developed RNA-seq-based convolutional neural network, TULIP, to test the robustness of a human-data-trained primary tumor classification tool on cross-species primary tumor prediction. Our study ultimately highlights the molecular similarities between canine and human BLCA and glioma tumors, showing that protein-coding one-to-one homologs shared between humans and canines are sufficient to distinguish between BLCA and gliomas. Discussion: The results of this study indicate that using protein-coding one-to-one homologs as the features in the input layer of TULIP yields good primary tumor prediction in both humans and canines. Furthermore, our analysis shows that our selected features also contain the majority of features with known clinical relevance in BLCA and gliomas. Our success in using a human-data-trained model for cross-species primary tumor prediction also sheds light on the conservation of oncological pathways in humans and canines, further underscoring the importance of the canine model system in the study of human disease.
Authors: Blanchard, Andrew, Bhowmik, Debsindhu, Fox, Zachary, Gounley, John, Glaser, Jens, Akpa, Belinda S., Irle, Stephan
TITLE: Adaptive Language Model Training for Molecular Design , Journal of Cheminformatics , 1 , 15 : 59 , 2023
PUBLICATION DATE: 06-08-2023
ABSTRACT: The vast size of chemical space necessitates computational approaches to automate and accelerate the design of molecular sequences to guide experimental efforts for drug discovery. Genetic algorithms provide a useful framework to incrementally generate molecules by applying mutations to known chemical structures. Recently, the community has applied masked language models to automate the mutation process by leveraging large compound libraries to learn commonly occurring chemical sequences (that is, using tokenization) and predict rearrangements (that is, using mask prediction). Here, the authors consider how they can adapt language models to improve molecule generation for different optimization tasks. The authors use two different generation strategies for comparison, fixed and adaptive: the fixed strategy uses a pre-trained model to generate mutations; the adaptive strategy trains the language model on each new generation of molecules selected for target properties during optimization. The results show that the adaptive strategy allows the language model to more closely fit the distribution of molecules in the population. Therefore, for enhanced fitness optimization, the authors suggest the use of the fixed strategy during an initial phase followed by the use of the adaptive strategy. The authors demonstrate the impact of adaptive training by searching for molecules that optimize both heuristic metrics, drug-likeness and synthesizability, as well as predicted protein binding affinity from a surrogate model. The results show that the adaptive strategy provides a significant improvement in fitness optimization compared to the fixed pre-trained model, empowering the application of language models to molecular design tasks.
Authors: Partin, Alexander, Brettin, Thomas, Zhu, Yitan, Dolezal, James M., Kochanny, Sara, Pearson, Alexander T., Shukla, Maulik, Evrard, Yvonne A., Doroshow, James H., Stevens, Rick L.
TITLE: Data Augmentation and Multimodal Learning for Predicting Drug Response in Patient-derived Xenografts from Gene Expressions and Histology Images , Frontiers in Medicine , 10 , 2023
PUBLICATION DATE: 03-07-2023
ABSTRACT: Patient-derived xenografts (PDXs) are an appealing platform for preclinical drug studies. A primary challenge in modeling drug response prediction (DRP) with PDXs and neural networks (NNs) is the limited number of drug response samples. The authors investigate a multimodal neural network (MM-Net) and data augmentation for DRP in PDXs. The MM-Net learns to predict response using drug descriptors, gene expressions (GE), and histology whole-slide images (WSIs). The authors explore whether combining WSIs with GE improves predictions as compared with models that use GE alone. The authors propose two data augmentation methods, which allow them to train multimodal and unimodal NNs on a single larger dataset without changing architectures: combining single-drug and drug-pair treatments by homogenizing drug representations, and augmenting drug-pairs, which doubles the sample size of all drug-pair samples. The authors compared unimodal NNs that use GE to assess the contribution of data augmentation. The NN that uses the original and the augmented drug-pair treatments as well as single-drug treatments outperforms NNs that ignore either the augmented drug-pairs or the single-drug treatments. In assessing the multimodal learning based on the Matthews correlation coefficient (MCC) metric, MM-Net outperforms all the baselines. The results show that data augmentation and integration of histology images with GE can improve prediction performance of drug response in PDXs.
PROJECT: IMPROVE
Authors: Partin, Alexander, Brettin, Thomas S., Zhu, Yitan, Narykov, Oleksandr, Clyde, Austin, Overbeek, Jamie, Stevens, Rick L.
TITLE: Deep learning methods for drug response prediction in cancer: Predominant and emerging trends , Frontiers in Medicine , 10 , 2023
PUBLICATION DATE: 02-15-2023
ABSTRACT: Cancer claims millions of lives yearly worldwide. While many therapies have become available in recent years, in general, cancer remains unsolved. Exploiting computational predictive models to study and treat cancer holds great promise in improving drug development and personalized design of treatment plans, ultimately suppressing tumors, alleviating suffering, and prolonging lives of patients. A wave of recent papers demonstrates promising results in predicting cancer response to drug treatments while utilizing deep learning methods. These papers investigate diverse data representations, neural network architectures, learning methodologies, and evaluation schemes. However, deciphering promising predominant and emerging trends is difficult due to the variety of explored methods and the lack of a standardized framework for comparing drug response prediction models. To obtain a comprehensive landscape of deep learning methods, the authors conducted an extensive search and analysis of deep learning models that predict the response to single drug treatments. The authors curated 61 deep learning-based models and generated summary plots. The analysis revealed observable patterns and prevalence of methods. This review allows the community to better understand the current state of the field and identify major challenges and promising solution paths.
PROJECT: IMPROVE
Authors: Stanton, Liam G., Oppelstrup, Tomas, Carpenter, Timothy S., Ingólfsson, Helgi I., Surh, Michael P., Lightstone, Felice C., Glosli, James N.
TITLE: Dynamic Density Functional Theory of Multicomponent Cellular Membranes , Physical Review Research , 1 , 5 : 013080 , 2023
PUBLICATION DATE: 02-06-2023
ABSTRACT: The authors present a continuum model trained on molecular dynamics (MD) simulations for cellular membranes composed of an arbitrary number of lipid types. The authors constructed the model within the formalism of dynamic density functional theory. The community can extend the model to include features such as the presence of proteins and membrane deformations. This framework represents a paradigm shift by enabling simulations that can access length scales on the order of microns and time scales on the order of seconds, all while maintaining near fidelity to the underlying MD models. These length and time scales are significant for accessing biological processes associated with signaling pathways within cells. The authors considered, as an application, membrane interactions with RAS, a protein implicated in roughly 30% of human cancers. The authors presented and verified simulation results with MD simulations, and discussed implications of this new capability.
PROJECT: MOSSAIC, ADMIRRAL
Authors: Blanchard, Andrew, Zhang, Pei, Bhowmik, Debsindhu, Mehta, Kshitij, Gounley, John, Reeve, Samuel Temple, Irle, Stephan, Pasini, Massimiliano Lupo
TITLE: Computational Workflow for Accelerated Molecular Design Using Quantum Chemical Simulations and Deep Learning Models , Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation , 1690 : 3-19 , 2023
PUBLICATION DATE: 01-18-2023
ABSTRACT: The community needs efficient methods for searching the chemical space of molecular compounds to automate and accelerate the design of new functional molecules such as pharmaceuticals. Given the high cost in both resources and time for experimental efforts, computational approaches play a key role in guiding the selection of promising molecules for further investigation. Here, the authors construct a workflow to accelerate design by combining approximate quantum chemical methods [that is, density-functional tight-binding (DFTB)], a graph convolutional neural network (GCNN) surrogate model for chemical property prediction, and a masked language model (MLM) for molecule generation. The authors use property data from the DFTB calculations to train the surrogate model, and they use the surrogate model to score candidates generated by the MLM. The surrogate reduces computation time by orders of magnitude compared to the DFTB calculations, enabling an increased search of chemical space. Furthermore, the MLM generates a diverse set of chemical modifications based on pre-training from a large compound library. The authors use the workflow to search for near-infrared photoactive molecules by minimizing the predicted HOMO-LUMO gap as the target property. The results show that the workflow can generate optimized molecules outside of the original training set, which suggests that iterations of the workflow could be useful for searching vast chemical spaces in a wide range of design problems.
Authors: Jones, Sara, Beyers, Matthew, Shukla, Maulik, Xia, Fangfang, Brettin, Thomas, Stevens, Rick, Weil, M. Ryan, Ganakammal, Satishkumar Ranganathan
TITLE: TULIP: An RNA-Seq-based Primary Tumor Type Prediction Tool Using Convolutional Neural Networks , Cancer Informatics , 21 : 1-10 , 2022
PUBLICATION DATE: 12-05-2022
ABSTRACT: Background: With cancer as one of the leading causes of death worldwide, accurate primary tumor type prediction is critical in identifying genetic factors that can inhibit or slow tumor progression. The community has tried to categorize primary tumor types with gene expression data using machine learning, and more recently with deep learning, in the last several years. Methods: In this paper, the authors developed four one-dimensional (1D) Convolutional Neural Network (CNN) models to classify RNA-Seq count data as one of 17 highly represented primary tumor types or 32 primary tumor types regardless of imbalanced representation. Additionally, the authors adapted the models to take as input either all Ensembl genes (60,483) or protein coding genes only (19,758). Unlike previous work, the authors avoided selection bias by not filtering genes based on expression values. The authors downloaded RNA-Seq count data, expressed as FPKM-UQ, for 9,025 and 10,940 samples from The Cancer Genome Atlas (TCGA) via the Genomic Data Commons (GDC), corresponding to 17 and 32 primary tumor types respectively, for training and validating the models. Results: All four 1D-CNN models had an overall accuracy of 94.7% to 97.6% on the test dataset. Further evaluation indicates that the models with protein coding genes only as features performed with better accuracy compared to the models with all Ensembl genes for both 17 and 32 primary tumor types. For all models, the accuracy by primary tumor type was above 80% for most primary tumor types. Conclusions: The authors packaged all four models as a Python-based deep learning classification tool called TULIP (TUmor cLassIfication Predictor) for performing quality control on primary tumor samples and characterizing cancer samples of unknown tumor type. Further optimization of the models is needed to improve the accuracy for certain primary tumor types.
Authors: Ma, Xiaoyu, Sardy, Sylvain, Hengartner, Nick, Bobenko, Nikolai, Lin, Yen Ting
TITLE: A Phase Transition for Finding Needles in Nonlinear Haystacks with LASSO Artificial Neural Networks , Statistics and Computing , 6 , 32 : 99 , 2022
PUBLICATION DATE: 10-22-2022
ABSTRACT: To fit sparse linear associations, a LASSO sparsity inducing penalty with a single hyperparameter allows the community to recover the important features (needles) with high probability in certain regimes even if the sample size is smaller than the dimension of the input vector (haystack). More recently, learners known as artificial neural networks (ANN) have shown great successes in many machine learning tasks, in particular fitting nonlinear associations. A small learning rate, a stochastic gradient descent algorithm, and a large training set help the community to cope with the explosion in the number of parameters present in deep neural networks. Yet the community has developed and studied few ANN learners to find needles in nonlinear haystacks. Driven by a single hyperparameter, the authors’ ANN learner, for sparse linear associations, exhibits a phase transition in the probability of retrieving the needles, which the authors do not observe with other ANN learners. To select a penalty parameter, the authors generalize the universal threshold of Donoho and Johnstone (Biometrika 81(3):425–455, 1994), which is a better rule than the conservative (too many false detections) and expensive cross-validation. In the spirit of simulated annealing, the authors propose a warm-start sparsity inducing algorithm to solve the high-dimensional, non-convex, and non-differentiable optimization problem. The authors perform simulated and real data Monte Carlo experiments to quantify the effectiveness of their approach.
Authors: Stahlberg, Eric A., Abdel-Rahman, Mohamed, Aguilar, Boris, Asadpoure, Alireza, Beckman, Robert, Borkon, Lynn L., Bryan, Jeffrey, Cebulla, Colleen, Chang, Young Hwan, Chatterjee, Ansu, Deng, Jun, Dolatshahi, Sepideh, Gevaert, Olivier, Greenspan, Emily J., Hao, Wenrui, Hernandez-Boussard, Tina, Jackson, Pamela R., Kuijjer, Marieke, Lee, Adrian, Macklin, Paul, Madhavan, Subha, McCoy, Matthew D., Mirzaei, Navid Mohammad, Razzaghi, Talayeh, Rocha, Heber, Shahriyari, Leili, Shmulevich, Ilya, Stover, Daniel G., Sun, Yi, Syeda-Mahmood, Tanveer, Wang, Jinhua, Wang, Qi, Zervantonakis, Ioannis
TITLE: Exploring Approaches for Predictive Cancer Patient Digital Twins: Opportunities for Collaboration and Innovation , Frontiers in Digital Health , 2022
PUBLICATION DATE: 10-06-2022
ABSTRACT: Cancer patient digital twins will soon reach their potential to predict cancer prevention, diagnosis, and treatment in individual patients. The community will realize this based on advances in high performance computing, computational modeling, and an expanding repertoire of observational data across multiple scales and modalities. In 2020, the US National Cancer Institute and the US Department of Energy, through a trans-disciplinary research community at the intersection of advanced computing and cancer research, initiated team science collaborative projects to explore the development and implementation of predictive Cancer Patient Digital Twins (CPDT). The community launched several diverse pilot projects to provide key insights into important features of this emerging landscape and to determine the requirements for the development and adoption of cancer patient digital twins. Projects included: Exploring approaches to using a large cohort of digital twins to perform deep phenotyping and plan treatments at the individual level, Prototyping self-learning digital twin platforms, Using adaptive digital twin approaches to monitor treatment response and resistance, Developing methods to integrate and fuse data and observations across multiple scales, and Personalizing treatment based on cancer type. Collectively, these efforts have yielded increased insights into the opportunities and challenges facing cancer patient digital twin approaches and helped define a path forward. Given the rapidly growing interest in patient digital twins, this manuscript provides a valuable early progress report of several CPDT pilot projects commenced in common, their overall aims, early progress, lessons learned, and future directions that will increasingly involve the broader research community.