SHAP feature importances tested
I am currently reading Advances in Financial Machine Learning by Marcos Lopez de Prado, and the author emphasises examining trained models before putting any faith in them - something I wholeheartedly agree with. Since interpreting models is important, Marcos put several methods of examining feature importances to the test, in the hope of determining the strengths and weaknesses of each method.
I am interested in examining tree-based models, which I briefly talked about in my previous posts 1, 2, and I have become an advocate for using the SHAP library for computing feature importances. Marcos examined permutation feature importance, mean impurity decrease and single-feature importance (where a classifier is trained on one feature at a time), and determined that the first two do quite well: they rank features that are really important higher than non-important ones.
Unfortunately, SHAP is missing from his analysis, so I decided to replicate his test on synthesised data for the library.
Creating a synthetic dataset
Following the methodology of Marcos, I created a dataset with 10 informative features I_*, 10 redundant features R_* (linear combinations of the informative ones) and 20 non-informative features N_*. I then trained a LightGBM classifier and computed the SHAP values. The LightGBM library computes SHAP values without installing extra dependencies.
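A minimal sketch of this setup might look as follows, assuming scikit-learn's make_classification for the synthetic data and LightGBM's pred_contrib=True option for the SHAP values (the exact hyperparameters and variable names are illustrative):

```python
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import make_classification

# 10 informative, 10 redundant (linear combinations of the informative)
# and 20 noise features, 40 columns in total.
X, y = make_classification(
    n_samples=10_000,
    n_features=40,
    n_informative=10,
    n_redundant=10,
    shuffle=False,  # keeps columns ordered: informative, redundant, noise
    random_state=0,
)
columns = (
    [f"I_{i}" for i in range(10)]
    + [f"R_{i}" for i in range(10)]
    + [f"N_{i}" for i in range(20)]
)
X = pd.DataFrame(X, columns=columns)

clf = lgb.LGBMClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# pred_contrib=True makes LightGBM return per-row SHAP values:
# one column per feature plus a final bias column, which we drop.
contribs = clf.predict(X, pred_contrib=True)
shap_values = pd.DataFrame(contribs[:, :-1], columns=columns)

# Per-feature score: sum of absolute SHAP contributions over all rows.
importances = shap_values.abs().sum(axis=0).rename("importance")
```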
In the above, I sum the SHAP contributions over all rows for each feature. Notice I had to take the absolute value, as contributions can be negative. Finally, let's plot the SHAP feature importances using Altair:
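A rough sketch of the chart, assuming the importances series from the snippet above, with one bar per feature and a red rule for the mean score:

```python
import altair as alt

# Turn the per-feature scores into a tidy frame for Altair.
plot_df = importances.reset_index()
plot_df.columns = ["feature", "importance"]

bars = alt.Chart(plot_df).mark_bar().encode(
    x=alt.X("feature:N", sort="-y"),
    y=alt.Y("importance:Q"),
)
# Red rule marking the mean score across all features.
mean_rule = alt.Chart(plot_df).mark_rule(color="red").encode(
    y="mean(importance):Q"
)
(bars + mean_rule).properties(width=600, height=300)
```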
In the above bar chart we see that all informative and redundant features score higher than the non-informative ones. This is a manifestation of the consistency of SHAP values: more important features should score higher. The red line is the mean score. We see that certain informative and redundant features, specifically I_2, I_9, R_6, R_9 and R_5, are below the average. I don’t think that is a particularly bad sign, although Marcos typically remarks on such an observation as a negative aspect of the method.
Conclusion
So far, nothing wrong with SHAP values has been detected. For a complete treatment, I am also including mean impurity decrease (MID) and permutation importances (the latter based on the ROC AUC score). All three methods are in rough agreement, so perhaps this test isn’t very informative.
And the code:
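A sketch of how the two comparison importances can be computed, reusing X and y from the earlier snippet: MID is taken here as LightGBM's gain-based importances, and the permutation importances come from scikit-learn's permutation_importance scored with ROC AUC.

```python
import pandas as pd
import lightgbm as lgb
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hold out a test set so the permutation importances are measured out of sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
clf = lgb.LGBMClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Mean impurity decrease (MID): total split gain attributed to each feature.
mid = pd.Series(
    clf.booster_.feature_importance(importance_type="gain"),
    index=X.columns,
    name="mid",
)

# Permutation importance: average drop in ROC AUC when a feature is shuffled.
perm = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
permutation_scores = pd.Series(
    perm.importances_mean, index=X.columns, name="permutation"
)
```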