Online Supplement to "Using SVM Weight-Based Methods to Identify Causally Relevant and Non-Causally Relevant Variables"

Alexander Statnikov1 (alexander.statnikov@vanderbilt.edu), Douglas Hardin1,2 (doug.hardin@vanderbilt.edu), Constantin Aliferis1,3 (constantin.aliferis@vanderbilt.edu)

1 Department of Biomedical Informatics, 2 Department of Mathematics, 3 Department of Cancer Biology, Vanderbilt University, Nashville, TN 37232, USA.

Abstract: We conducted simulation experiments to study SVM weight-based ranking and variable selection methods using two network structures that are often encountered in biological systems and are likely to occur in many other settings as well. We attempted to recover both causally and non-causally relevant variables using SVM weight-based methods under a variety of experimental settings (data-generating network, noise level, sample size, and SVM penalty parameter). Our experiments show that SVMs produce excellent classifiers but often assign higher weights to irrelevant variables than to relevant ones. Applying the recursive variable selection technique SVM-RFE does not remedy this problem. More importantly, we found that when it comes to identifying causally relevant variables, SVM weight-based methods can fail by assigning higher weights to, or selecting (in the context of SVM-RFE), variables that are relevant but non-causally so. Furthermore, even irrelevant variables can have higher weights or can be selected more frequently than the causally relevant ones. We show that this problem stems not from the specific variable selection techniques studied but from the maximum-margin inductive bias, as typically employed by SVM-based methods, which is locally causally inconsistent. New SVM methods may be needed to address this issue, and this is an exciting and challenging area of research.


Parameter settings used in experiments

1. Variable ranking by SVM weights: executed for all parameter settings.
2. Classification performance of top-ranked variables by SVM weights: executed only for networks with ≤ 100 relevant and irrelevant variables and sample sizes = {100, 200, 500}.
3. Variable selection using SVM-RFE: executed only for networks with ≤ 100 relevant and irrelevant variables and sample sizes = {100, 200, 500}.
4. Classification performance of variables selected by SVM-RFE: executed only for networks with ≤ 100 relevant and irrelevant variables and sample sizes = {100, 200, 500}.
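The two procedures named in the list above, ranking variables by the magnitude of linear SVM weights and recursively eliminating the lowest-weight variable (SVM-RFE), can be illustrated with a minimal sketch. This is not the code used in the experiments. It assumes a simple subgradient-descent trainer for the soft-margin linear SVM objective in place of a standard solver, removes one variable per RFE iteration, and uses a hypothetical toy data set (one variable correlated with the target, two pure-noise variables). All function names, the learning rate, and the number of epochs are illustrative choices.

```python
import random

def train_linear_svm(X, y, C=1.0, epochs=50, lr=0.01, seed=0):
    """Approximately minimize 0.5*||w||^2 + C * sum(hinge loss) by
    stochastic subgradient descent. Labels y must be -1 or +1.
    (Illustrative trainer, not the solver used in the paper.)"""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            violated = y[i] * score < 1  # sample inside the margin
            for j in range(d):
                grad = w[j] / n  # regularizer share for this sample
                if violated:
                    grad -= C * y[i] * X[i][j]
                w[j] -= lr * grad
            if violated:
                b += lr * C * y[i]
    return w, b

def rank_by_weight(w):
    """Return variable indices ordered by decreasing |weight|."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)

def svm_rfe(X, y, n_keep, C=1.0):
    """Retrain and drop the variable with the smallest |weight|
    until n_keep variables remain (one variable per iteration)."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_sub, y, C=C)
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return active

def make_toy_data(n=60, seed=1):
    """Hypothetical data: variable 0 tracks the target, 1 and 2 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        t = rng.choice([-1, 1])
        X.append([t + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)])
        y.append(t)
    return X, y

if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = train_linear_svm(X, y)
    print("ranking by |w|:", rank_by_weight(w))
    print("kept by SVM-RFE:", svm_rfe(X, y, n_keep=1))
```

On a toy problem this simple the informative variable would be expected to rank first. The experiments summarized above concern network structures where weight-based ranking and SVM-RFE can instead favor non-causally relevant or irrelevant variables, which this sketch does not reproduce.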