(vers. from 2-23-2005)
Discovery of causal knowledge is crucial for advancing research, developing new technology, and making sound policy, financial, and marketing decisions. Biologists need to know the factors that cause a disease in order to devise new therapeutic procedures. Public health policy makers need to know the factors that cause an increase in the number of medical errors in order to reduce them. Epidemiologists seek the factors causing disease in order to prevent it. Launching a new advertising campaign requires knowing the factors that affect consumer behavior regarding the product. Increasing the number of visitors to a web site requires knowledge of what attracts them to the site.
Classically trained statisticians often quote the maxim "association is not causation" to indicate that causal discovery is impossible without experiments. For example, simply observing a high occurrence of yellow stains on the fingers of lung cancer patients relative to normal subjects does not imply a causal relation between cancer and staining (in reality, heavy smoking causes both, so they frequently co-occur). Similarly, observing that two items tend to be purchased together with high frequency does not necessarily imply that increasing the sales of the first item will be followed by an increase in the sales of the second.
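The yellow-finger example can be made numerically concrete. The sketch below uses hypothetical probabilities (chosen for illustration, not taken from any study) in which smoking causes both staining and cancer, and staining and cancer are conditionally independent given smoking; the observed association between staining and cancer is then strong, while intervening on staining leaves cancer risk unchanged.

```python
# Hypothetical confounded model: Smoking -> Staining, Smoking -> Cancer.
# Staining and Cancer are conditionally independent given Smoking.
P_SMOKE = 0.3
P_STAIN_GIVEN = {True: 0.9, False: 0.05}   # P(stain | smoke)
P_CANCER_GIVEN = {True: 0.2, False: 0.01}  # P(cancer | smoke)

def joint(smoke, stain, cancer):
    """P(smoke, stain, cancer) under the confounded model."""
    p = P_SMOKE if smoke else 1.0 - P_SMOKE
    p *= P_STAIN_GIVEN[smoke] if stain else 1.0 - P_STAIN_GIVEN[smoke]
    p *= P_CANCER_GIVEN[smoke] if cancer else 1.0 - P_CANCER_GIVEN[smoke]
    return p

def p_cancer_given_stain(stain):
    """Observational P(cancer | stain), marginalizing over smoking."""
    num = sum(joint(s, stain, True) for s in (True, False))
    den = sum(joint(s, stain, c) for s in (True, False) for c in (True, False))
    return num / den

def p_cancer_do_stain(stain):
    """Interventional P(cancer | do(stain)): forcing the stain on or off
    leaves the smoking -> cancer mechanism untouched, so the result
    does not depend on `stain` at all."""
    return sum((P_SMOKE if s else 1.0 - P_SMOKE) * P_CANCER_GIVEN[s]
               for s in (True, False))
```

Under these numbers, stained patients are roughly ten times as likely to have cancer as unstained ones, yet scrubbing (or painting) the fingers changes nothing: the interventional probability is the same either way.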
Unfortunately, discovering causal relations strictly by randomized experimentation is inefficient and often impractical, unethical, or simply impossible. Recent advances in computational causal discovery theory and algorithm research and development mathematically prove and experimentally demonstrate, respectively, the feasibility of causal discovery from observational data alone under broad conditions. In fact, the 2003 Nobel Prize in Economics was awarded to C. W. J. Granger for his test for detecting causality in observational econometric time series. The acceptance and application of causal discovery methods are steadily gaining ground. The following are just a few of the important references in this emerging and exciting branch of science and technology:
I. Causation, Prediction, and Search by Peter Spirtes, Clark Glymour, Richard Scheines (Second Edition)
II. Causality: Models, Reasoning, and Inference by Judea Pearl
III. Learning Bayesian Networks by Richard E. Neapolitan
IV. Computation, Causation, and Discovery by Clark Glymour (Editor), Gregory F. Cooper (Editor)
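Granger's idea, mentioned above, can be sketched in a few lines: a series x "Granger-causes" y if adding x's past values to an autoregression of y significantly reduces the residual error. The code below is a minimal illustration on synthetic data (one lag, no intercept, hand-rolled least squares), not the full econometric procedure.

```python
import random

def granger_sketch(n=500, seed=7):
    """Simulate x -> y with one lag, then compare a restricted
    autoregression y_t ~ y_{t-1} against the full model
    y_t ~ y_{t-1} + x_{t-1}.  Returns (ssr_restricted, ssr_full, F)."""
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n):
        x.append(0.8 * x[-1] + rng.gauss(0, 1))
        # y_t depends on its own past and on x_{t-1} (true coefficient 0.4)
        y.append(0.5 * y[-1] + 0.4 * x[-2] + rng.gauss(0, 1))
    yl, xl, yt = y[:-1], x[:-1], y[1:]
    # Restricted OLS (no intercept): y_t = b * y_{t-1}
    b = sum(a * c for a, c in zip(yl, yt)) / sum(a * a for a in yl)
    ssr_r = sum((c - b * a) ** 2 for a, c in zip(yl, yt))
    # Full OLS: y_t = b1 * y_{t-1} + b2 * x_{t-1}, via 2x2 normal equations
    A = sum(a * a for a in yl)
    B = sum(a * d for a, d in zip(yl, xl))
    C = sum(d * d for d in xl)
    D = sum(a * c for a, c in zip(yl, yt))
    E = sum(d * c for d, c in zip(xl, yt))
    det = A * C - B * B
    b1 = (C * D - B * E) / det
    b2 = (A * E - B * D) / det
    ssr_f = sum((c - b1 * a - b2 * d) ** 2
                for a, c, d in zip(yl, yt, xl))
    # F-statistic for the one extra regressor
    F = (ssr_r - ssr_f) / (ssr_f / (len(yt) - 2))
    return ssr_r, ssr_f, F
```

With the seeded data above, the full model fits markedly better than the restricted one, yielding a very large F-statistic, so the test (correctly) flags x as a Granger cause of y.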
Causal Explorer (CE) is a library of causal discovery algorithms authored by the researchers at the Discovery Systems Laboratory of the Department of Biomedical Informatics.
In addition to the causal discovery methods, CE contains related variable (feature) selection algorithms. These reduce the dimensionality of the data by selecting the smallest, most predictive subset of variables. Thus, they can be used to construct smaller and sometimes more accurate predictive or classification models that are less costly to operate and easier to interpret and understand. The variable selection algorithms in CE are based on theories of causal discovery, and the selected variables have a specific causal interpretation (e.g., they are the direct causes or direct effects of the variable of interest, or alternatively its Markov Blanket).
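The Markov Blanket property that underlies this style of variable selection can be verified on a toy chain X1 -> X2 -> T (all probabilities below are hypothetical illustration values): conditioning on the blanket {X2} alone yields exactly the same predictive distribution for T as conditioning on every variable.

```python
# Toy chain X1 -> X2 -> T; the Markov Blanket of T is {X2}.
P_X1 = 0.4
P_X2 = {True: 0.8, False: 0.3}   # P(X2=1 | X1)
P_T  = {True: 0.7, False: 0.1}   # P(T=1  | X2)

def joint(x1, x2, t):
    """P(x1, x2, t) factored along the chain."""
    p = P_X1 if x1 else 1.0 - P_X1
    p *= P_X2[x1] if x2 else 1.0 - P_X2[x1]
    p *= P_T[x2] if t else 1.0 - P_T[x2]
    return p

def p_t_given_all(x1, x2):
    """P(T=1 | X1, X2): prediction from every available variable."""
    den = joint(x1, x2, True) + joint(x1, x2, False)
    return joint(x1, x2, True) / den

def p_t_given_blanket(x2):
    """P(T=1 | X2): prediction from the Markov Blanket only."""
    num = sum(joint(x1, x2, True) for x1 in (True, False))
    den = sum(joint(x1, x2, t) for x1 in (True, False) for t in (True, False))
    return num / den
```

Dropping X1 loses nothing here, which is the sense in which Markov Blanket selection discards variables without sacrificing predictive power.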
The CE code emphasizes efficiency, scalability, and quality of discovery. The implementations of previously published algorithms included in CE are more efficient than the original implementations. CE also includes algorithms never before implemented as computer programs.
A unique advantage of CE is the inclusion of highly scalable, high-quality proprietary algorithms developed by the Discovery Systems Laboratory researchers (patent pending). Example papers describing DSL's novel causal and variable selection algorithms are:
1. Using local causal structure to select variables for classification across several biomedical domains with very high dimensionality:
Aliferis CF, Tsamardinos I, Statnikov A. HITON: a novel Markov blanket algorithm for optimal variable selection. AMIA 2003 Annual Symposium Proceedings 2003;21-5. [Article]
2. Recovering local causal structure even when the available sample is low and the number of variables large:
Tsamardinos I, Aliferis CF, Statnikov A. Time and sample efficient discovery of Markov blankets and direct causal relations. Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining (KDD) 2003;673-8. [Article]
3. Learning a complete network of causal relations among all variables efficiently and with high accuracy:
The benefits of using the included algorithms have also been examined on a theoretical basis by the DSL researchers. They have shown that other state-of-the-art predictive technologies such as Support Vector Machines, while successful for classification and prediction, are not suitable for causal discovery (see the following [Article]). They have also explored the formal link between causality and variable selection that led to the design of the novel, proprietary, and optimal variable selection algorithms based on Markov Blanket induction, which also have a local causal interpretation (see the following [Article]).
Causal Explorer is a library of local and global causal discovery algorithms. Several of those algorithms can also be used for variable selection for classification.
Local causal discovery algorithms determine from observational data which predictors (variables/observed quantities) causally affect or are affected by a target variable of interest (under certain conditions). The causal relations inferred are direct, i.e., if variable A is found to directly causally affect or be affected by variable B, no other predictor measured in the dataset causally intervenes between A and B.
Global causal discovery algorithms determine from observational data all direct causal associations among the variables/observed quantities and their orientation.
Causal Explorer can be used to:
(a) Discover the direct causal or probabilistic relations around a target variable of interest (e.g., disease is directly caused by and directly causes a set of variables/observed quantities).
(b) Discover the set of all direct causal or probabilistic relations among the variables (Bayesian Network Learning).
(c) Discover the Markov Blanket of a target variable of interest, i.e., the smallest subset of variables that contains all necessary information to predict the target variable. The Markov Blanket is the smallest subset required to build optimal prediction models (under certain broad conditions) and corresponds to the direct causes, direct effects, and direct causes of direct effects of the target variable.
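Given a known causal graph, the structural notions in (a)-(c) reduce to simple graph lookups. The sketch below uses a small hypothetical disease network (the variable names are invented for illustration) and reads the Markov Blanket of a node, i.e. its parents, its children, and the children's other parents, directly off the parent lists.

```python
def markov_blanket(parents, target):
    """Markov Blanket of `target` in a DAG given as {node: set of parents}:
    its parents, its children, and the children's other parents ('spouses')."""
    children = {v for v, ps in parents.items() if target in ps}
    spouses = set()
    for c in children:
        spouses |= parents[c]
    spouses -= {target}
    return set(parents[target]) | children | spouses

# Hypothetical network:
#   Smoking -> Cancer <- Genetics,  Cancer -> Fatigue <- Pollution
dag = {
    "Smoking":   set(),
    "Genetics":  set(),
    "Pollution": set(),
    "Cancer":    {"Smoking", "Genetics"},
    "Fatigue":   {"Cancer", "Pollution"},
}
# Blanket of Cancer: parents {Smoking, Genetics}, child {Fatigue},
# and the child's other parent (spouse) {Pollution}.
```

Causal Explorer's algorithms work in the opposite direction: they recover such blankets and edges from observational data rather than from an already known graph.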
Such algorithms have been frequently employed in analysis of data in psychology, medicine, biology, weather forecasting, animal breeding, agriculture, financial modeling, information retrieval, natural language processing, and other fields. They can be used to automatically construct Decision Support Systems from data (e.g., for medical diagnosis), or to generate plausible causal hypotheses (e.g., which gene regulates which).
The algorithms in Causal Explorer include the state-of-the-art in the field and have been compared against each other in some of the largest computational experimental studies in the literature (for examples see papers 1, 2, and 3 above; more publications are available at the web site of the Discovery Systems Laboratory at http://www.dsl-lab.org/). The results of our studies confirm the applicability of the methods, suggest which algorithm to use in different situations, and characterize the performance to expect from each algorithm. All algorithms have well-characterized properties in terms of the conditions under which they are guaranteed to return correct results.
1. Causal Explorer contains our proprietary algorithms HITON, MMMB, MMPC, and MMHC.
a. HITON has been shown to be a very effective variable selection algorithm, tested on a variety of biomedical datasets with superior results against several other state-of-the-art algorithms in the field. HITON selects significantly smaller variable subsets than the other algorithms we compared it with, without sacrificing predictive power; in addition, the selected variables have a causal interpretation (they belong to the Markov Blanket of the variable to be predicted).
b. MMMB and MMPC are local causal discovery algorithms shown to outperform the previous state-of-the-art algorithms of similar type.
c. MMHC is a Bayesian Network learning algorithm shown to outperform the previous state-of-the-art algorithms in a very extensive empirical evaluation study.
2. In contrast to state-of-the-art methods used extensively in large-scale data mining (e.g., association rules, decision trees, regression, various feature selection procedures), all CE algorithms provide theoretical guarantees of correctness while scaling up to tens of thousands, or even hundreds of thousands, of variables.
3. Causal Explorer provides the most extensive palette of algorithms of this type, including the best algorithms in existence in addition to our proprietary ones.
4. Causal Explorer provides some of the most efficient implementations of state-of-the-art algorithms.
5. The algorithms in Causal Explorer have been tested in extensive studies to provide suggestions guided by empirical results as to the appropriateness of the algorithms in different situations.