IMPROVE Benchmark Dataset
(IMPROVE)

Dataset Description

The IMPROVE Benchmark Dataset comprises four kinds of data: 1) cell line response data, 2) cell line multi-omics data, 3) drug feature data, and 4) data partitions.

1. Cell line response data were extracted from five sources:

  • Cancer Cell Line Encyclopedia (CCLE)
  • Cancer Therapeutics Response Portal version 2 (CTRPv2)
  • Genomics of Drug Sensitivity in Cancer version 1 (GDSC1)
  • Genomics of Drug Sensitivity in Cancer version 2 (GDSC2)
  • Genentech Cell Line Screening Initiative (GCSI)

A unified dose-response fitting pipeline was applied to the multi-dose viability data from all five sources to calculate dose-independent response metrics, such as the area under the dose-response curve (AUC) and the half-maximal inhibitory concentration (IC50).
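As a rough illustration of how dose-independent metrics can be derived from a fitted curve, the sketch below assumes a three-parameter Hill model with viability normalized to 1 at zero dose. The model form, the parameter names (`ec50`, `hill_slope`, `e_inf`), and the log10 dose range are assumptions made for this example and are not taken from the IMPROVE pipeline itself.

```python
import math

def hill_viability(conc, ec50, hill_slope, e_inf):
    """Viability at concentration `conc` under an assumed 3-parameter Hill model."""
    return e_inf + (1.0 - e_inf) / (1.0 + (conc / ec50) ** hill_slope)

def auc_log_dose(ec50, hill_slope, e_inf, log10_lo, log10_hi, steps=1000):
    """Area under the viability curve over a log10 dose range, scaled to [0, 1].

    Uses the trapezoidal rule on an evenly spaced log10-concentration grid.
    """
    width = (log10_hi - log10_lo) / steps
    total = 0.0
    prev = hill_viability(10.0 ** log10_lo, ec50, hill_slope, e_inf)
    for i in range(1, steps + 1):
        cur = hill_viability(10.0 ** (log10_lo + i * width), ec50, hill_slope, e_inf)
        total += 0.5 * (prev + cur) * width
        prev = cur
    return total / (log10_hi - log10_lo)

def ic50(ec50, hill_slope, e_inf):
    """Concentration at which viability equals 0.5, or None if never reached."""
    if e_inf >= 0.5:
        return None  # the curve plateaus at or above 50% viability
    return ec50 * (0.5 / (0.5 - e_inf)) ** (1.0 / hill_slope)

# Hypothetical fitted parameters, for illustration only.
auc = auc_log_dose(ec50=1.0, hill_slope=1.0, e_inf=0.0, log10_lo=-3.0, log10_hi=3.0)
half_max = ic50(ec50=1.0, hill_slope=2.0, e_inf=0.2)
```

Because both metrics depend only on the fitted curve and a fixed dose range, they can be compared across studies that tested different dose series, which is the motivation for a unified fit.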

2. The multi-omics data of cell lines were extracted from the Dependency Map (DepMap) portal of CCLE. The types of data included are gene expression, mutation, DNA methylation, copy number variation, protein expression, and miRNA expression. Data preprocessing, such as discretizing copy number variation and mapping between different gene identifier systems, was performed.  
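One way to picture the copy number discretization step is the sketch below, which bins a continuous log2 copy-number ratio into five ordinal levels. The cut-off values and the five-level coding are hypothetical choices for this example; the thresholds actually used for the benchmark data are not stated here.

```python
import bisect

# Hypothetical cut-offs on the log2 copy-number ratio (an assumption, not
# the values used to build the benchmark data).
ASSUMED_THRESHOLDS = (-1.0, -0.3, 0.3, 1.0)

def discretize_copy_number(log2_ratio, thresholds=ASSUMED_THRESHOLDS):
    """Map a continuous log2 ratio to one of -2, -1, 0, 1, 2.

    -2: deep deletion, -1: loss, 0: neutral, 1: gain, 2: amplification.
    """
    # bisect_right returns how many thresholds are <= log2_ratio (0..4);
    # subtracting 2 centres the levels on the neutral bin.
    return bisect.bisect_right(thresholds, log2_ratio) - 2

levels = [discretize_copy_number(v) for v in (-1.5, -0.5, 0.0, 0.5, 1.2)]
```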

3. Drug information was retrieved from PubChem. Based on the drug SMILES (Simplified Molecular Input Line Entry System) strings, we calculated molecular fingerprints and descriptors for each drug using the Mordred and RDKit Python packages.

4. Data partition files were generated using the IMPROVE benchmark data preparation pipeline. They indicate, for each modeling analysis run, which samples should be included in the training, validation, and testing sets, for building and evaluating the drug response prediction (DRP) models.  
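The sketch below shows how partition files of this kind might be applied, assuming each file lists one integer row index per line that points into the response table. That file layout, the function names, and the toy rows are assumptions made for illustration.

```python
import io

def read_index_file(handle):
    """Parse one integer row index per non-empty line."""
    return [int(line) for line in handle if line.strip()]

def apply_partition(rows, train_idx, val_idx, test_idx):
    """Split `rows` into train/val/test lists, rejecting overlapping indices."""
    train, val, test = set(train_idx), set(val_idx), set(test_idx)
    if train & val or train & test or val & test:
        raise ValueError("partition index sets overlap")
    return {
        "train": [rows[i] for i in train_idx],
        "val": [rows[i] for i in val_idx],
        "test": [rows[i] for i in test_idx],
    }

# Toy response rows (cell line, drug, AUC); all values are made up.
rows = [
    ("cell_A", "drug_1", 0.81),
    ("cell_A", "drug_2", 0.64),
    ("cell_B", "drug_1", 0.92),
    ("cell_B", "drug_2", 0.55),
    ("cell_C", "drug_1", 0.73),
]

# In-memory stand-ins for the three partition files of one analysis run.
train_idx = read_index_file(io.StringIO("0\n1\n2\n"))
val_idx = read_index_file(io.StringIO("3\n"))
test_idx = read_index_file(io.StringIO("4\n"))

split = apply_partition(rows, train_idx, val_idx, test_idx)
```

Keeping the partitions as index files, rather than as copies of the data, lets every model in a comparison train and evaluate on exactly the same samples.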

For more details, refer to Benchmark Data for Cross-Study Analysis.

Content Type
Cell Line Response
Cell Line Multi-omics
Drug Feature
Data Partitions