GDA Help

Tutorial

A step-by-step tutorial covering a complete analysis in GDA is available here.
The tutorial shows how to use the Genomics and Drugs integrated Analysis (GDA) web server to perform all types of integrated analysis of genomic profiles and drug sensitivities in cancer cell lines. All queries used as examples in the tutorial have already been run in GDA, so their results are retrieved faster.

Database content

GDA contains drug response data for 50,816 compounds, together with mutation and gene expression profiles, across 115 cancer cell lines. Drug response data were derived from the NCI-60 drug screening; mutation calls were retrieved from the CCLE portal; exome sequencing data for the NCI-60 cell lines were obtained from CellMiner. Raw gene expression data of CCLE and NCI-60 cell lines were downloaded from Gene Expression Omnibus series GSE36133 and GSE32474, respectively.
Prior to analysis, the two series were merged into a compendium of 231 samples representing the 73 cell lines that had at least one sample in at least one series. In the compendium, more than 80% of the cell lines are represented by at least 4 replicates. Expression values were generated from intensity signals using the robust multi-array average (RMA) procedure and the Brainarray custom CDF based on Entrez genes for Affymetrix HG-U133 Plus2 arrays. Drug structures were retrieved from the PubChem FTP site in SMILES (Simplified Molecular-Input Line-Entry System) format and matched to NCI-60 drug names using the CID (PubChem Compound Identifier) name format.
The central objects of the database, used to link NCI-60 pharmacological information to CCLE and NCI-60 genomic data, are the cancer cell lines shared between the NCI-60 drug screening and the CCLE and NCI-60 genomic repositories: 50, 60, and 73 cell lines for CCLE variant calls, NCI-60 exome sequencing, and CCLE/NCI-60 gene expression data, respectively.

Analysis modules

GDA is composed of 4 main analysis modules:
1. from gene to drug: drugs active in cancer cell lines bearing specific gene mutations;
2. from drug to gene: gene mutations characterizing cancer cell lines that are responsive to a selected compound;
3. from signature to drug: drugs active in cancer cell lines bearing the activation of a specific gene signature;
4. from drug to signature: up- and down-regulated genes in cancer cell lines that are responsive to a specific compound.
Queries are performed through drop-down menus and either checkboxes or radio buttons, depending on the type of input. In the from gene to drug and from drug to gene modules, genes and compounds are selected via a drop-down menu that auto-completes based on the gene mutations and drugs present in the database. In the from signature to drug module, gene lists can be pasted into a dedicated input text box or uploaded as a text file, using HUGO symbols for the gene names.
Results from the analyses can be fed into additional GDA modules (such as the drug clustering, the Maximum Common Structure, and the differential gene expression analyses) or sent to external web services (such as Enrichr, L1000CDS2, and PubChem) for functional annotation and comparison.

Statistical analysis

Drug response data were transformed into relative sensitivities (RS) as described in Yu et al., Cancer Cell, 2012. Specifically, the GI50 value (the NCI-60 DTP term for the drug concentration required for 50% growth inhibition in vitro, i.e., the IC50) was log2-transformed and mean-centered for each compound, i.e., RS = log2(GI50) - average(log2(GI50)), where the average is taken across all cell lines. Based on the RS values, each combination of drug and cell line was classified as responsive (or non-responsive) if the RS was lower (or higher) than two standard deviations of the distribution of all RS in the given cell line.
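The transformation and classification described above can be sketched in a few lines of Python. This is only an illustration, not the GDA implementation: the function names are hypothetical, and the threshold is read here as "more than two standard deviations below (or above) the mean RS of the cell line", which is one possible interpretation of the rule. Under that reading some drug/cell line pairs fall between the two thresholds; the label "unclassified" used for them is an assumption of this sketch.

```python
import math
from statistics import mean, stdev


def relative_sensitivities(gi50_by_cell_line):
    """RS = log2(GI50) - average(log2(GI50)) for ONE compound.

    gi50_by_cell_line maps a cell line name to the GI50 of the compound
    in that cell line; the average is taken across all cell lines.
    """
    log_values = {cl: math.log2(v) for cl, v in gi50_by_cell_line.items()}
    centre = mean(log_values.values())
    return {cl: x - centre for cl, x in log_values.items()}


def classify(rs, rs_in_cell_line, n_sd=2.0):
    """Classify one drug/cell line pair from its RS value.

    rs_in_cell_line is the distribution of all RS values observed in the
    given cell line. ASSUMPTION: the cut-offs are mean -/+ n_sd * SD of
    that distribution; "unclassified" is a hypothetical label for values
    between the two cut-offs.
    """
    centre = mean(rs_in_cell_line)
    spread = stdev(rs_in_cell_line)
    if rs < centre - n_sd * spread:
        return "responsive"
    if rs > centre + n_sd * spread:
        return "non-responsive"
    return "unclassified"


if __name__ == "__main__":
    # Toy GI50 values (arbitrary units) for one compound in three cell lines.
    print(relative_sensitivities({"A": 1.0, "B": 4.0, "C": 16.0}))
    print(classify(-3.0, [-1.0, 0.0, 1.0]))
```

Because RS is mean-centered per compound, negative values indicate cell lines that are more sensitive than average to that compound.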
Depending on the type of analysis, cancer cell lines are first divided into cases (e.g., in the from gene to drug analysis, cell lines treated with the given compound and bearing the set of mutations selected by the user) and controls (e.g., all the other cell lines treated with the given compound), and then both categories are further divided into responsive and non-responsive. The resulting contingency table is used to calculate: i) the score, given by the fraction of responsive cell lines in cases multiplied by the fraction of non-responsive cell lines in controls; and ii) the p-value, using a one-tailed Fisher’s exact test for the enrichment of responses in cases as compared to non-responses in controls. The Fisher’s test has been implemented in R using the fisher.test function of the stats package.
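As an illustration of the two quantities derived from the contingency table, the sketch below computes the score and a one-tailed Fisher's exact p-value in plain Python. GDA itself uses R's fisher.test; the code here is an independent re-implementation of the hypergeometric upper tail (the "greater" alternative), and the function and argument names are hypothetical.

```python
from math import comb


def gda_score(resp_cases, nonresp_cases, resp_controls, nonresp_controls):
    """Fraction of responsive cases times fraction of non-responsive controls."""
    n_cases = resp_cases + nonresp_cases
    n_controls = resp_controls + nonresp_controls
    if n_cases == 0 or n_controls == 0:
        raise ValueError("both cases and controls must contain cell lines")
    return (resp_cases / n_cases) * (nonresp_controls / n_controls)


def fisher_one_tailed(resp_cases, nonresp_cases, resp_controls, nonresp_controls):
    """One-tailed Fisher's exact test for enrichment of responses in cases.

    Returns P(X >= resp_cases), where X follows the hypergeometric
    distribution defined by the fixed margins of the 2x2 table.
    """
    n_cases = resp_cases + nonresp_cases
    n_resp = resp_cases + resp_controls
    n_total = n_cases + resp_controls + nonresp_controls
    upper = min(n_cases, n_resp)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    tail = sum(
        comb(n_resp, k) * comb(n_total - n_resp, n_cases - k)
        for k in range(resp_cases, upper + 1)
    )
    return tail / comb(n_total, n_cases)


if __name__ == "__main__":
    # Toy table: 3 responsive / 1 non-responsive cases,
    #            1 responsive / 3 non-responsive controls.
    print(gda_score(3, 1, 1, 3))
    print(fisher_one_tailed(3, 1, 1, 3))
```

A high score with a small p-value indicates that responses are concentrated in the cases while the controls are mostly non-responsive.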
Gene expression levels have been calculated in R with the rma function of the affy package using quantile normalization and median polish summarization. Differential expression analysis has been implemented using the samr (Significance Analysis of Microarrays) R package and the t.test R function. Signature scores have been obtained by summarizing the standardized expression levels of signature genes into a combined score with zero mean (Adorno et al., Cell, 2009).
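One way to picture the signature score is the sketch below, which standardizes each signature gene across samples and averages the resulting z-scores per sample. This is an assumption made for illustration: the exact summarization used by GDA follows Adorno et al. and may differ in detail. The function name, the input layout, and the gene names in the example are hypothetical.

```python
from statistics import mean, stdev


def signature_scores(expression, signature_genes):
    """Combine standardized expression of signature genes into one score per sample.

    expression maps a gene symbol to a list of expression values, one per
    sample, with all lists in the same sample order.
    ASSUMPTION: the combined score is the mean of per-gene z-scores, which
    has zero mean across samples by construction.
    """
    z_rows = []
    for gene in signature_genes:
        values = expression.get(gene)
        if values is None:
            continue  # gene not measured: ignored in this sketch
        centre, spread = mean(values), stdev(values)
        if spread == 0:
            continue  # constant gene carries no information
        z_rows.append([(v - centre) / spread for v in values])
    if not z_rows:
        raise ValueError("no usable signature gene found in the expression data")
    n_samples = len(z_rows[0])
    return [mean(row[i] for row in z_rows) for i in range(n_samples)]


if __name__ == "__main__":
    # Toy expression matrix: two hypothetical genes measured in three samples.
    toy = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [10.0, 20.0, 30.0]}
    print(signature_scores(toy, ["GENE_A", "GENE_B", "GENE_MISSING"]))
```

Samples with a positive score can then be treated as having an active (up-regulated) signature and samples with a negative score as having an inactive one.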

Output

The from gene to drug analysis returns the list of all drugs that are active on the analyzed cell lines bearing the selected set of mutations and the list of drug families that are significantly enriched, given the set of active drugs. In the table of active drugs, compounds are identified in terms of compound ID (linking to PubChem), name, drug family, mechanism of action (MoA), score, and statistical significance. Results can be downloaded in tabular form for storage and external analyses, as well as visualized using different graphical representations. In this module, plots display: i) the score as a function of the p-value for each drug and each drug family (with red dots highlighting significantly active drugs); ii) the distribution of the drug score for each drug in each family; iii) the distribution of relative sensitivities (RS) in cell lines bearing a given mutation and responsive to a given drug as compared to the distribution of relative sensitivities in wild type and non-responsive cell lines; and iv) the expression levels of each gene used in the query in cell lines with mutations in the gene and responsive to a given drug as compared to the expression levels of the gene in wild type and non-responsive cell lines.
The from signature to drug analysis returns the same outputs as the from gene to drug analysis, listing all drugs that are active on cell lines with up-regulation of the input gene signature. In this module, plots display: i) the score as a function of the p-value for each drug and each drug family (with red dots highlighting significantly active drugs); ii) the distribution of the drug score for each drug in each family; iii) the distribution of relative sensitivities (RS) in cell lines with an active (e.g., upregulated) gene signature and responsive to a given drug as compared to the distribution of relative sensitivities in cell lines with an inactive (e.g., downregulated) gene signature and non-responsive to the drug; and iv) the expression levels of the gene signature in cell lines with an active gene signature and responsive to a given drug as compared to cell lines with an inactive gene signature and non-responsive to the drug. Directly from the result page, it is possible to access the drug clustering analysis (see the from gene to drug module for details) or, once a drug is selected, the from drug to gene analysis module (see the from drug to gene module for details).
All significant drugs identified by the from gene to drug and from signature to drug modules can be fed to the drug clustering analysis, which returns an interactive clustering tree of the significant drugs grouped by structural similarity. Once selected, each node of the tree returns a list of drugs that can be further used in the Maximum Common Structure analysis to retrieve a common scaffold shared by all compounds belonging to the group.
The from drug to gene module returns the chemical structure of the selected compound (if available, and linking to PubChem) with the number of mutations found in responsive and non-responsive cell lines, and an interactive volcano plot showing the score and p-value of gene mutations and SNPs present in cancer cell lines that are responsive (green bubbles) or non-responsive (red bubbles) to the selected compound. Results are also visualized by pie charts and a table (default for responsive cell lines). The pie charts report: i) the distribution of responsive (or non-responsive) cell lines bearing mutations; ii) the distribution of tissues for responsive (or non-responsive) cell lines bearing mutations; and iii) the distribution of mutation types in cell lines responsive (or non-responsive) to the selected drug. The table lists tissue, gene symbol, cell line, variant classification, mutation type, chromosome coordinates, SNP ID, score, and p-value of each mutation in cell lines responsive (or non-responsive) to the selected drug. Tables of results for responsive and non-responsive cell lines can be downloaded as Excel spreadsheets.
The from drug to signature analysis returns the lists of genes over-expressed in responsive (Group A) and non-responsive (Group B) cancer cell lines. These lists can be functionally annotated using Enrichr, compared to results of the LINCS project through the LINCS L1000 characteristic direction signatures search engine (L1000CDS2), or directly used to generate gene signatures for the from signature to drug module of GDA.