GEMS Executive Summary

(vers. 2.0 from 12-3-2004)




GEMS (Gene Expression Model Selector) is a system that constructs, in a supervised fashion, diagnostic and outcome prediction models from microarray gene expression data. Examples of such models are: (a) models that detect cancer, (b) models that determine the correct subtype of cancer, and (c) models that predict survival after treatment. Models that support such complex decision making are widely recognized as having the potential to revolutionize medicine in the years to come. In addition to building decision support models, GEMS can select a small number of genes that are as good as or better than the full gene set for diagnosis and/or outcome prediction. These biomarkers (genes) are also useful for discovery purposes (e.g., they suggest plausible causes and treatments of various types of cancer). Finally, GEMS estimates the models' performance (e.g., accuracy) in future applications (i.e., on patients who were not used to build the models but who come from the same patient population), and allows users to run the models for individual patients.


Building such models (a) requires specialized training in statistics, bioinformatics and/or pattern recognition, (b) takes several weeks to months in typical academic settings, and (c) may suffer from pitfalls introduced by human analysts, such as overfitting the data (i.e., building models that are very good on the training set but perform poorly on future independent patient cases). GEMS performs these tasks quickly and automatically, uses procedures designed to avoid overfitting, and does not require the user to have expertise in data analysis.
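The overfitting pitfall described above can be illustrated with a minimal sketch. It is not GEMS code: it uses a 1-nearest-neighbour classifier as a stand-in, synthetic random "expression" values, and labels that carry no real signal, all of which are assumptions made for illustration. Accuracy measured on the training patients looks perfect, while accuracy measured on held-out patients (leave-one-out cross-validation) does not.

```python
import random

def nearest_label(point, points, labels, skip=None):
    """Return the label of the stored patient closest to `point` (1-nearest-neighbour)."""
    best_index = min(
        (i for i in range(len(points)) if i != skip),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, points[i])),
    )
    return labels[best_index]

def training_accuracy(points, labels):
    """Accuracy when each patient is classified by a model that has already seen that patient."""
    hits = sum(nearest_label(p, points, labels) == y for p, y in zip(points, labels))
    return hits / len(points)

def leave_one_out_accuracy(points, labels):
    """Accuracy when each patient is removed from the model before being classified."""
    hits = sum(
        nearest_label(p, points, labels, skip=i) == labels[i]
        for i, p in enumerate(points)
    )
    return hits / len(points)

# Synthetic data (assumption): 40 patients, 200 genes, labels unrelated to the values.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(40)]
labels = [i % 2 for i in range(40)]

train_acc = training_accuracy(points, labels)
loo_acc = leave_one_out_accuracy(points, labels)
print(f"training-set accuracy:  {train_acc:.2f}")
print(f"leave-one-out accuracy: {loo_acc:.2f}")
```

Because the labels here are unrelated to the data, the held-out estimate is the honest one; the training-set figure only shows that the model has memorized its own patients.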


More precisely stated, the input and output of the system are:


Input:

1. A training dataset with rows corresponding to patients and columns corresponding to gene expression measurements; and a column with true outcome or diagnostic label for each patient

2. Optional: names for the genes and gene accession numbers

3. File names for storing the results

4. Various choices of methods and parameters for the analysis

5. (In application mode): the name of a previously prepared model file


Output:

1. A model that outputs the correct diagnosis or outcome given the gene expression values for a new patient

2. An estimate of the performance of the model output in #1 in future (independent) application of the model

3. (In application mode): the model’s diagnoses or predictions and overall performance

4. A reduced set of genes required for the diagnosis or outcome prediction

5. Links from the genes to literature and other resources
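The training dataset layout described above (one row per patient, one column per gene expression measurement, plus a column with the true label) can be sketched as follows. The file format is an assumption made for illustration: a comma-separated table with a header row of gene names and the label in the last column. The actual GEMS input format is not specified here, and the function name `load_training_table` and the gene names are hypothetical.

```python
import csv
import io

def load_training_table(text):
    """Parse a comma-separated table (assumed layout, not the GEMS format):
    header row of gene names, one row per patient, class label in the last column."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]  # skip blank lines
    header, body = rows[0], rows[1:]
    gene_names = header[:-1]
    expression = []
    labels = []
    for row in body:
        # Every patient row must carry one value per gene plus the label.
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}")
        expression.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return gene_names, expression, labels

# Made-up values for three patients and three hypothetical genes.
example = """GENE_A,GENE_B,GENE_C,label
2.1,0.4,5.3,tumor
1.9,0.5,5.1,tumor
0.3,2.2,4.8,normal
"""

gene_names, expression, labels = load_training_table(example)
print(gene_names)
print(len(expression), "patients,", len(gene_names), "genes")
print(labels)
```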




GEMS was validated using the most stringent gold-standard technique: independent (i.e., cross-dataset) validation. In this method the system is used to build a model from dataset 1, and the model is then applied to dataset 2. The two datasets come from different labs and hospitals and, in our experiments, were obtained using different microarray technologies.
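The independent validation procedure can be sketched as follows. This is an illustration only: a nearest-centroid classifier stands in for the GEMS learning algorithms (which are not described here), the expression values are made up, and the sketch assumes that both datasets have already been mapped to the same genes in the same column order.

```python
def fit_centroids(expression, labels):
    """Build the model: the mean expression profile (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(expression, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, row):
    """Assign a patient to the class whose centroid is closest."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def independent_validation(train_x, train_y, test_x, test_y):
    """Build the model on dataset 1 only, then measure accuracy on dataset 2."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, row) == y for row, y in zip(test_x, test_y))
    return hits / len(test_y)

# Made-up values: two genes, four patients per dataset.
dataset1_x = [[5.0, 1.0], [4.5, 1.2], [1.0, 4.8], [0.8, 5.1]]
dataset1_y = ["tumor", "tumor", "normal", "normal"]
dataset2_x = [[4.8, 0.9], [1.1, 5.0], [4.0, 2.0], [2.5, 3.5]]
dataset2_y = ["tumor", "normal", "tumor", "tumor"]

accuracy = independent_validation(dataset1_x, dataset1_y, dataset2_x, dataset2_y)
print(f"accuracy on dataset 2: {accuracy:.2f}")
```

The key property is that no patient from dataset 2 influences the model, so the reported accuracy reflects how the model generalizes across labs and platforms rather than how well it fits its own training data.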


It was found that GEMS: (a) matches or exceeds the performance of human analysts, (b) automatically builds models in minutes, (c) estimates the models’ performance correctly, and (d) selects gene markers that generalize from one dataset to another.


Competitive Advantages


1. GEMS’s learning algorithms were chosen from ~20 algorithms after an extensive algorithmic evaluation using 11 publicly available datasets spanning 74 cancer types.

2. Thorough validation:

(a) GEMS was tested by re-analyzing the above datasets.

(b) GEMS was validated with 5 “fresh” datasets against human experts using cross-validation.

(c) GEMS was validated with two pairs of datasets using independent (cross-dataset) validation. The validation involved both model and gene marker generalizability. In total GEMS was validated with 16 datasets.

3. Fully automated, yet provides many optional features for the seasoned analyst.

4. Includes proprietary gene selection and causal discovery algorithms with well-defined properties, theoretical guarantees of correctness, and excellent empirical performance.

5. Client-server architecture


  • Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005 Mar 1;21(5):631-43.
  • Statnikov A, Tsamardinos I, Dosbayev Y, Aliferis CF. GEMS: a system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int J Med Inform 2005 Aug;74(7-8):491-503.

  • Copyright © 2002-2007, Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA