SHAP feature importances tested

I am currently reading Advances in Financial Machine Learning by Marcos Lopez de Prado, and the author emphasises examining trained models before putting any faith in them - something I wholeheartedly agree with. Since interpreting models is important, Marcos puts several methods of computing feature importances to the test, in the hope of determining the weaknesses and strengths of each method.

I am interested in examining tree-based models - I briefly talked about this in my previous posts 1, 2 - and I have become an advocate for using the SHAP library to compute feature importances. Marcos examined permutation feature importance, mean impurity decrease and single-feature importance (where a classifier is trained on a single feature at a time; a quick sketch of this appears at the end of the post), and determined that the first two do quite well: they rank features that are really important higher than non-important ones.

Unfortunately, SHAP is missing from his analysis, so I decided to replicate his test on synthesised data for the library.

Creating a synthetic dataset

import pandas as pd
from sklearn.datasets import make_classification

n_samples = 10000
n_features = 40
n_informative = 10
n_redundant = 10

X_train, y_train = make_classification(n_samples=n_samples,
                                       n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       shuffle=False)

col_names = [f'I_{i}' for i in range(n_informative)]
col_names += [f'R_{i}' for i in range(n_redundant)]
col_names += [f'N_{i}' for i in range(n_features - n_informative - n_redundant)]

df_train = pd.DataFrame(X_train, columns=col_names)

Following Marcos' methodology, I created a dataset with 10 informative features I_*, 10 redundant features R_* (linear combinations of the informative ones) and 20 non-informative features N_*. I then trained a LightGBM classifier and computed the SHAP values. The LightGBM library computes SHAP values without installing extra dependencies:

import numpy as np
from lightgbm import LGBMClassifier

classifier = LGBMClassifier()
classifier.fit(df_train, y_train)

# pred_contrib=True returns one SHAP value per feature plus a bias term
# in the last column, which we drop here.
shap_values = classifier.predict(df_train, pred_contrib=True)[:, :-1]

# Aggregate absolute contributions per feature and normalise so they sum to 1.
shap_feature = np.abs(shap_values).sum(axis=0)
shap_feature = shap_feature / shap_feature.sum()

feature_importances = pd.DataFrame(
    {'SHAP importance': shap_feature, 'feature name': col_names})

In the above, I sum the absolute SHAP contributions of each feature across all rows - the absolute value is needed because contributions can be negative - and normalise them so they sum to one. Finally, let’s plot the SHAP feature importances using Altair:

import altair as alt

base = alt.Chart(feature_importances)

bar = base.mark_bar().encode(
    x='SHAP importance:Q',
    y=alt.Y("feature name:O",
            sort=alt.EncodingSortField(
                     field='SHAP importance',
                     order='descending')))

rule = base.mark_rule(color='red').encode(
    x='mean(SHAP importance):Q')

(bar + rule).properties(width=630)

In the above bar chart we see that all informative and redundant features score higher than the non-informative ones. This is a manifestation of the consistency of SHAP values: more important features should score higher. The red line is the mean score. Certain informative and redundant features, specifically I_2, I_9, R_6, R_9 and R_5, fall below that average. I don’t think that’s a particularly bad sign, although Marcos typically remarks on such an observation as a negative aspect of the method.
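If you want to reproduce that check programmatically, a couple of lines of pandas will do (the exact set of features below the mean will vary from run to run):

# Features whose normalised SHAP importance falls below the mean score
# (the red rule in the chart above); the exact set depends on the random data.
mean_score = feature_importances['SHAP importance'].mean()
below_mean = feature_importances[feature_importances['SHAP importance'] < mean_score]
print(below_mean.sort_values('SHAP importance', ascending=False))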

Conclusion

So far, nothing wrong with SHAP values has been detected. For a complete treatment I am also including mean impurity decrease (MID) and permutation importances (based on the ROC AUC score). All three methods are in rough agreement, so perhaps this test isn’t very informative.

And the code:

from sklearn.metrics import roc_auc_score

# Permutations

# Baseline ROC AUC of the fitted classifier on the training set.
base = roc_auc_score(y_train, classifier.predict_proba(X_train)[:, 1])
importances = []

for i in range(X_train.shape[1]):
    A = X_train.copy()
    # Shuffle a single column and record how much the score drops.
    A[:, i] = np.random.permutation(A[:, i])
    proba = classifier.predict_proba(A)[:, 1]
    importances.append(base - roc_auc_score(y_train, proba))

importances = np.array(importances)
importances = importances / importances.sum()
perm_importances = pd.DataFrame(
    {'permutation importance': importances, 'feature name': col_names})

# MID

# feature_importances_ defaults to split counts; a gain-based importance is a
# closer analogue of mean impurity decrease, so we use that and normalise it.
gain = classifier.booster_.feature_importance(importance_type='gain')
mid = gain / gain.sum()
mid_importances = pd.DataFrame(
    {'MID importance': mid, 'feature name': col_names})
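To eyeball that rough agreement, the three importance tables can be merged and plotted with the same Altair pattern as before (my own quick sketch, not part of Marcos' analysis):

# Merge the three importance tables and facet the bar chart by method.
merged = (feature_importances
          .merge(perm_importances, on='feature name')
          .merge(mid_importances, on='feature name'))
long_df = merged.melt(id_vars='feature name',
                      var_name='method', value_name='importance')

alt.Chart(long_df).mark_bar().encode(
    x='importance:Q',
    y='feature name:O',
    row='method:N'
).properties(width=630)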
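Finally, single-feature importance - the third method mentioned at the start - is easy to sketch as well: fit a fresh classifier on one column at a time and record its standalone AUC. De Prado scores each feature out-of-sample; the in-sample version below is only an illustration:

# Single-feature importance: how well does each column predict the target on
# its own? In-sample AUC for brevity; a proper test would cross-validate.
sfi = []
for name in col_names:
    single = LGBMClassifier()
    single.fit(df_train[[name]], y_train)
    sfi.append(roc_auc_score(y_train, single.predict_proba(df_train[[name]])[:, 1]))

sfi = np.array(sfi)
sfi_importances = pd.DataFrame(
    {'SFI importance': sfi / sfi.sum(), 'feature name': col_names})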