Notes on Jeremy Howard's videos
This weekend I got round to watching lessons 4 and 5 of the “Introduction to Machine Learning for Coders” course by fast.ai, led by Jeremy Howard, and was impressed to the point of breaking my 478-day-long writer’s block.
After having worked on the problem of fitting and interpreting Random Forests for almost 2 years, it’s really interesting to see another experienced data scientist reveal his “bag of tricks”.
In this blog post I will write up what I learned from the videos.
“Trick” 1: Hierarchical clustering of features by Spearman rank-order correlation
Using the code below, one can produce the following graph to examine which features are the most correlated:
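A minimal sketch of such code (assuming a pandas DataFrame `df` containing only the feature columns) could look like this:

```python
import matplotlib.pyplot as plt
import scipy.stats
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import squareform

# Spearman rank-order correlation matrix between all pairs of features in df
corr = scipy.stats.spearmanr(df).correlation

# Turn correlations into distances and cluster: highly correlated features
# get merged first
dist = squareform(1 - corr, checks=False)
linkage = hc.linkage(dist, method="average")

plt.figure(figsize=(12, 8))
hc.dendrogram(linkage, labels=df.columns.tolist(), orientation="left")
plt.show()
```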
There are two new things for me:
- Instead of looking at the Pearson correlation coefficient, we look at Spearman’s rank correlation, i.e. the Pearson correlation between the rank values of the two variables. This helps with comparing variables that are on different scales.
- Instead of drawing a typical heatmap (example below), hierarchical clustering is used:
In hierarchical (or agglomerative) clustering, we look at every pair of objects and find the two that are closest. We then take that closest pair, remove them, and replace them with their midpoint. Then we repeat that again and again.
The resulting plot looks more informative and easier to read than a simple heatmap: variables that are merged together first (at the smallest distance) are the most correlated (e.g. SaleElapsed is part of a highly correlated pair, whilst TireSize is not strongly correlated with anything).
“Trick” 2: PDPbox
When sanity checking a trained model it’s common to ask what the effect of
varying a particular feature is. For example, time since registration is
typically a significant feature in detecting fraudulent accounts: newly
registered accounts are more risky. Thus you would hope that holding everything
else constant whilst reducing the value of
timeSinceRegistered should yield
higher prediction probabilities of an account being fraudulent. Turns out
there’s a library that does just that. Using
PDPbox one can make so-called
Individual Conditional Expectation (ICE) plots:
here each line represents a row in the dataset and the yellow line is the mean prediction.
We see that as YearMade increases, the predictions tend to increase. Since there are too many rows drawn on the plot, we can cluster the rows and see that there are distinct patterns of how the target increases:
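A minimal sketch of producing such plots with PDPbox (assuming a fitted model `m` and a feature DataFrame `X`; note that the PDPbox API has changed between releases, and this follows the 0.2-era interface):

```python
from pdpbox import pdp

# Compute how predictions respond to YearMade, everything else held constant
pdp_year = pdp.pdp_isolate(
    model=m, dataset=X, model_features=list(X.columns), feature="YearMade"
)

# ICE plot: one line per sampled row plus the mean (partial dependence) line
pdp.pdp_plot(pdp_year, "YearMade", plot_lines=True, frac_to_plot=0.1)

# Cluster the individual lines to reveal distinct patterns of response
pdp.pdp_plot(pdp_year, "YearMade", plot_lines=True, cluster=True, n_cluster_centers=5)
```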
What’s new for me:
- I have seen a colleague of mine compute ICE values in his Jupyter notebook, but I didn’t know that there is a library that does just that!
- The idea of clustering rows according to how they vary when an input is varied is new to me. Perhaps one could further drill down into each cluster and see why that’s the case.
“Trick” 3: treeinterpreter
Looks like Jeremy is a fan of Saabas’ treeinterpreter library, which allows one to compute the feature contributions behind every prediction, and he goes into an explanation of how the method works in his lessons.
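A minimal sketch of how the library is used (assuming a fitted sklearn random forest `m` and a feature matrix `X`):

```python
from treeinterpreter import treeinterpreter as ti

# Decompose each prediction into a bias term (the training-set mean at the
# root of each tree) plus one contribution per feature along the decision path
prediction, bias, contributions = ti.predict(m, X)

# For a regression forest the decomposition is exact:
#   prediction[i] == bias[i] + contributions[i].sum() for every row i
```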
However, I have always been sceptical of the method, although my scepticism is only intuitive. Regardless, treeinterpreter wasn’t production-ready in 2016: it was simply too slow for computing contributions in a live system. This resulted in me inventing a different method that is more like the permutation method but applied to a single row. The method, also known as leaveoneout, turned out to work quite well in practice: it is fast and has been incredibly useful in pointing out leaky features (since those contribute disproportionately).
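Purely as an illustration (and not necessarily the exact implementation), a single-row permutation approach along these lines could look like the sketch below, assuming a fitted binary classifier `model`, a single `row` as a pandas Series, and the training features `X_train`:

```python
import numpy as np
import pandas as pd

def single_row_contributions(model, row, X_train, n_samples=100, seed=0):
    """Estimate each feature's contribution to one prediction by replacing
    that feature's value with values drawn from the training data and
    measuring how much the prediction moves (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    base = model.predict_proba(row.to_frame().T)[0, 1]
    contributions = {}
    for col in X_train.columns:
        perturbed = pd.concat([row.to_frame().T] * n_samples, ignore_index=True)
        perturbed[col] = rng.choice(X_train[col].to_numpy(), size=n_samples)
        contributions[col] = base - model.predict_proba(perturbed)[:, 1].mean()
    return contributions
```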
However, Scott Lundberg has since shown that treeinterpreter’s attribution method is not consistent. This means that when a model is changed such that a feature has a higher impact on the model’s output, (inconsistent) methods can actually lower the importance of that feature.
I suspect that leaveoneout is also inconsistent and thus should be abandoned in favour of the SHAP values proposed by Scott. In addition, the SHAP library is production-ready: it is very fast. SHAP feature importances are backed by the Shapley value solution from game theory and thus have a bunch of nice properties as a result.
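For comparison, a minimal sketch of computing SHAP values for a tree ensemble (again assuming a fitted model `m` and a feature DataFrame `X`):

```python
import shap

# TreeExplainer implements the fast, exact Tree SHAP algorithm for tree models
explainer = shap.TreeExplainer(m)
shap_values = explainer.shap_values(X)  # one contribution per feature per row

# Global summary: which features matter most and in which direction
shap.summary_plot(shap_values, X)
```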
What’s new for me:
- Nothing :). I wrote to Jeremy asking him to consider teaching SHAP values instead of promoting treeinterpreter, and he replied.
“Trick” 4: Detecting data leakage and feature drifts.
Oftentimes predictions on the validation set look radically different from the predictions on the training set. For example, I once dealt with a situation where the model scored everyone in the validation set as fraudulent despite the proportion of fraudsters in the training set being only 1-2%. Such things happen when the training and validation sets are radically different, which is especially common when the validation set is more recent than the training set. One
reason for a discrepancy could be feature drifts: recent data is simply
different to training data. Another reason for such disparity is data
leakage. The most common source of data leakage in my practice is using data from the future when extracting features. For example, the inclusion of a lifetime_duration_of_a_customer feature when predicting the probability of churn. For newly registered customers the value of such a feature is unknown, and thus the model won’t generalise to the most recent customers. So how do we detect that
the validation set is different from the training set and, most importantly,
figure out why it is different?
One idea that I came up with was to examine the prediction paths of decision trees. Perhaps the validation set ends up in leaf1 10% of the time whilst the training set hits leaf1 only 5% of the time. The resulting software package, which I named NodeProp, points out feature drifts quite well but has many
false positives: I get the impression that my algorithm still needs some
tweaking for it to be more robust. Unfortunately I no longer have access to the
source code and thus can’t tweak it.
How does Jeremy deal with this issue?
He creates a new target variable, is_validation_set, setting it to False for rows from the training set and True for rows from the validation set. Then he trains a random forest and examines which features are the most important. Ideally one shouldn’t be able to predict what is in the validation set and what is in the training set. However, if that is possible, it is crucial to examine which features allow such a prediction. I find this technique to be ingeniously simple!
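A minimal sketch of this check, assuming feature DataFrames `X_train` and `X_valid` for the two sets:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Label each row by which set it came from and try to predict that label
X = pd.concat([X_train, X_valid], ignore_index=True)
is_validation_set = np.r_[np.zeros(len(X_train)), np.ones(len(X_valid))]

m = RandomForestClassifier(n_estimators=100, n_jobs=-1)
m.fit(X, is_validation_set)

# Features that make the two sets easy to tell apart are the prime suspects
# for drift or leakage
suspects = pd.Series(m.feature_importances_, index=X.columns).sort_values(ascending=False)
print(suspects.head(10))
```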
“Trick” 5: Make sure your validation set generalises to production.
Imagine the blissful moment when, after all the data extraction and preprocessing, you finally have all the data prepared and ready for modelling - and you start playing around with various models, features and hyperparameters. But wait, you still have doubts about whether any of this work will generalise to production. After all, what works offline often doesn’t work once deployed, as you have probably learned from experience. How do you gain more confidence in deploying your own models? I have developed a habit of manually inspecting a few predictions before deploying anything to live; however, such a process is far from rewarding.

Jeremy advocates building 5 completely different models and making sure that their performance aligns monotonically: the top model on validation should be the top model in production, the second best on validation should be the second best in production, and so on. This way one can be sure that the performance on the validation set generalises well to production. One must trust their validation set to be representative before optimising the performance on it (duh!).
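A minimal sketch of this monotonicity check, assuming a handful of hypothetical fitted candidate models (`rf`, `gbm`, `logreg`), a validation set `X_valid`/`y_valid`, and a held-out set `X_test`/`y_test` standing in for production:

```python
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Hypothetical candidate models of different types, already fitted
models = {"random_forest": rf, "gradient_boosting": gbm, "logistic_regression": logreg}

valid_scores, test_scores = [], []
for name, m in models.items():
    valid_scores.append(roc_auc_score(y_valid, m.predict_proba(X_valid)[:, 1]))
    test_scores.append(roc_auc_score(y_test, m.predict_proba(X_test)[:, 1]))

# If the validation set can be trusted, the ranking of models on validation
# should match their ranking on the production-like set
rank_agreement, _ = spearmanr(valid_scores, test_scores)
print(rank_agreement)  # close to 1.0 means the rankings align
```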
I have thoroughly enjoyed watching just 2 lessons and I can’t wait to watch more material by Jeremy Howard. Another meta-lesson is how much of practising data science consists of learning various techniques rather than just learning how to code against the API provided by sklearn. This is a reminder to myself to keep watching educational videos and keep learning from other practitioners. And I hope to blog about what I learn more often :).
P.S. special thanks to Hiromi Suenaga for her excellent notes.