## What the hell is t-SNE?

The other day I presented a t-SNE plot to a software engineer. “But what is it”, I was asked. Good question, I thought to myself.

# Introduction

t-distributed Stochastic Neighbor Embedding (t-SNE) is a technique of dimensionality reduction. Data that has more than 3 dimensions is extremely common but is impossible to visualise. t-SNE is often used to reduce the data to 2 or 3 dimensions to produce beautiful plots like this: Above t-SNE was used to visualise MNIST dataset which consists of single digits like this: As we can see t-SNE correctly detected structure in the data and clustered points correspondingly. But how?

# Watch me build something like t-SNE from scratch

I am not intending to re-build the entire algorithm here. Firstly, it’s been done by before by another blogger who loves this shit. Secondly, I found sklearn’s implementation self-explanatory for a motivated reader. I’d like to explain t-SNE for someone who is not that motivated. So let’s build something like t-SNE but not quite.

First, for each pair of points, $$i, j$$, in our dataset, convert the Euclidian distance between them into a conditional Gaussian probability. This means for each sample we concentrate most of the mass on its nearest neighbours.

$p_{j|i} = \frac{\exp \left ( - || x_i - x_j || ^2 \right ) }{\sum_{k \neq i} \exp \left ( - || x_i - x_k || ^2 \right )} \hspace{2em} (1)$

For example, let’s start with generating 7 random points in 2 dimensions. This is our data. In practice the data is typically multidimensional, but it’s better to illustrate the intuition behind t-SNE using this simple example.

CODE Next, we compute the conditional probabilities:

CODE

j p_{j|0} p_{j|1} p_{j|2} p_{j|3} p_{j|4} p_{j|5} p_{j|6}
0 0 0.8373 0.1219 0 0 0.5947 0
1 0.774 0 0.034 0 0 0.402 0
2 0.0059 0.0018 0 0.9952 0.0009 0.0032 0.0012
3 0 0 0.8022 0 0 0 0.0005
4 0 0 0.0072 0.0002 0 0 0.9983
5 0.2201 0.161 0.0248 0 0 0 0
6 0 0 0.0099 0.0046 0.9991 0 0

Above, each column adds up to 1, meaning for each point we have a probability distribution. As expected, points that are closest receive significantly more weight than points that are further away, e.g. point 0 is close to 1 and the $$p_{1|0}$$ value is 77.4%.

Now, to find a t-SNE-like representation of our points, we are going to try to find new 7 points, but instead of using a Gaussian density, we will use Student’s t-distribution with 1 degree of freedom (also known as Cauchy). So the challenge is to find new 7 points, $$y_{\cdot}$$, such that their transformed Euclidean distances, $$q_{j|i}$$,

$q_{j|i} = \frac{ \left ( 1 + || y_i - y_j || ^2 \right ) ^ {-1} }{\sum_{k \neq l} \left ( 1 + || y_k - y_l || ^2 \right ) ^ {-1} }$

are very close to $$p_{j|i}$$. Let’s just do it by brute-forcing1. I’ll start with exact same data-points but iteratively adjust their positions until the matrix of q’s is close to the matrix of p’s. A process that I visualised in the video below:

CODE

Let’s compare the new points to the original points:

CODE We see that points that were close together got even closer and points that were far apart got spread out. This is the underlying idea behind t-SNE: to warp the space in this manner. This happened because the tail of the t-distribution is much heavier than the tail of the Gaussian distribution: This means that we have to push distant points further away and pull close points closer for our new probabilities to match the original probabilities. The motivation behind applying such transformation is to address the so-called “crowding problem”. If we were to distribute points uniformly at random on a hypersphere with many dimensions, most points would end up close to the edges. This is because the proportion of volume close to the edges rapidly approaches one as we increase the number of dimensions:

$\frac{r^{d} - (r-1)^{d}}{r^d} \xrightarrow[d \to \infty]{} 1,$

where $$r^d$$ is the volume of a hypersphere and $$r^d - (r-1)^d$$ is the volume of the crust of the hypersphere.

The above limit means if we projected linearly we’d see points crowded around the edges of a circle. With warping we pull some of those closer to the center and push many points far way from the origin.

# Conclusion

If you have made this far, hopefully you have a better understanding of how t-SNE works. Well, at least I do. Now I am going to go over details that I glossed over, but briefly.

t-SNE actually deals with joint probabilities, defined as

$p_{ij} = \frac{p_{i|j} + p_{j|i}}{2N},$

where $$N$$ is the size of the dataset. Division by $$2N$$ simply makes sure probabilities add up to 1. This makes the algorithm more robust to outliers. Note that joint probabilities are symmetric. When looking for $$q_{ij}$$ Kullback-Leibler divergence is used, as it’s the information-theoretic way of defining how similar two distributions are one to another. In equation (1) I used Gaussian distribution with variance 2. However, in real t-SNE variance is estimated for each point using perplexity parameter, defined as 2 raised to the power of Shannon’s entropy. The perplexity is constant for each point: this way the variance is determined by the density of the region around each point. Finally, when $$q$$’s are searched for, a gradient descent with momentum is used.

To complete the post, I am including real t-SNE applied to the 7 datapoints:

CODE # Appendix

Animation code can be found here:

CODE

1. scipy.optimize.minimize uses Broyden–Fletcher–Goldfarb–Shanno algorithm, which tries to approximate gradient and the second derivative: so it’s not quite brute force.