What the hell is t-SNE?
The other day I presented a t-SNE plot to a software engineer. “But what is it?”, I was asked. Good question, I thought to myself.
Introduction
t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality-reduction technique. Data with more than 3 dimensions is extremely common but impossible to visualise directly. t-SNE is often used to reduce the data to 2 or 3 dimensions to produce beautiful plots like this:
Above, t-SNE was used to visualise the MNIST dataset, which consists of images of handwritten digits like this:
As we can see, t-SNE correctly detected the structure in the data and clustered the points accordingly. But how?
Watch me build something like t-SNE from scratch
I am not intending to rebuild the entire algorithm here. Firstly, it’s been done before by another blogger who loves this shit. Secondly, I found sklearn’s implementation self-explanatory for a motivated reader. I’d like to explain t-SNE to someone who is not that motivated. So let’s build something like t-SNE, but not quite.
First, for each pair of points \( i, j \) in our dataset, convert the Euclidean distance between them into a conditional Gaussian probability. This means that for each sample we concentrate most of the mass on its nearest neighbours.
\[p_{j \mid i} = \frac{\exp \left ( - \| x_i - x_j \| ^2 \right ) }{\sum_{k \neq i} \exp \left ( - \| x_i - x_k \| ^2 \right )} \hspace{2em} (1)\]For example, let’s start by generating 7 random points in 2 dimensions. This is our data. In practice the data is typically high-dimensional, but it’s easier to illustrate the intuition behind t-SNE using this simple example.
CODE
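The original code is not shown here, but a minimal sketch of this setup might look like the following (the seed is arbitrary, so the exact coordinates won’t match the table below):

```python
import numpy as np

# 7 random points in 2 dimensions; the seed is arbitrary, so the
# coordinates (and the probabilities derived from them) are
# illustrative only.
rng = np.random.default_rng(42)
X = rng.standard_normal((7, 2))
print(X.shape)  # (7, 2)
```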
Next, we compute the conditional probabilities:
CODE
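The author’s code is elided again; a direct translation of equation (1) might look like this (NumPy, with \(p_{j \mid i}\) stored as column \(i\) of a matrix):

```python
import numpy as np

def conditional_probabilities(X):
    # Squared Euclidean distances between all pairs of points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists)       # Gaussian kernel, as in equation (1)
    np.fill_diagonal(affinities, 0.0)    # a point is not its own neighbour
    return affinities / affinities.sum(axis=0)  # normalise each column

rng = np.random.default_rng(42)
X = rng.standard_normal((7, 2))
P = conditional_probabilities(X)
print(P.sum(axis=0))  # every column sums to 1
```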
| j | \(p_{j \mid 0}\) | \(p_{j \mid 1}\) | \(p_{j \mid 2}\) | \(p_{j \mid 3}\) | \(p_{j \mid 4}\) | \(p_{j \mid 5}\) | \(p_{j \mid 6}\) |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.8373 | 0.1219 | 0 | 0 | 0.5947 | 0 |
| 1 | 0.774 | 0 | 0.034 | 0 | 0 | 0.402 | 0 |
| 2 | 0.0059 | 0.0018 | 0 | 0.9952 | 0.0009 | 0.0032 | 0.0012 |
| 3 | 0 | 0 | 0.8022 | 0 | 0 | 0 | 0.0005 |
| 4 | 0 | 0 | 0.0072 | 0.0002 | 0 | 0 | 0.9983 |
| 5 | 0.2201 | 0.161 | 0.0248 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0.0099 | 0.0046 | 0.9991 | 0 | 0 |
Above, each column adds up to 1, meaning for each point we have a probability distribution. As expected, the closest points receive significantly more weight than points that are further away; e.g. point 0 is close to point 1, and the \( p_{1 \mid 0} \) value is 77.4%.
Now, to find a t-SNE-like representation of our points, we are going to look for 7 new points, but instead of using a Gaussian density, we will use Student’s t-distribution with 1 degree of freedom (also known as the Cauchy distribution). So the challenge is to find 7 new points, \(y_{\cdot}\), such that their transformed Euclidean distances, \(q_{ji}\),
\[q_{ji} = \frac{ \left ( 1 + \| y_i - y_j \| ^2 \right ) ^ {-1} }{\sum_{k \neq l} \left ( 1 + \| y_k - y_l \| ^2 \right ) ^ {-1} }\]are very close to \(p_{j \mid i}\). Let’s just do it by brute force^{1}. I’ll start with the exact same data points but iteratively adjust their positions until the matrix of \(q\)’s is close to the matrix of \(p\)’s, a process I visualised in the video below:
CODE
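The optimisation code is elided too. Under the assumption spelled out in the footnote (scipy.optimize.minimize with its default BFGS method), a bare-bones version of this brute-force search might look like the sketch below. Note one assumption on my part: I normalise the \(p\) matrix to total mass 1 so that it is directly comparable to the \(q\) matrix; the author’s exact objective may differ.

```python
import numpy as np
from scipy.optimize import minimize

def q_matrix(Y):
    # (1 + ||y_i - y_j||^2)^-1, normalised over all pairs, as in the q formula.
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def p_matrix(X):
    # Conditional Gaussian probabilities from equation (1), one column per point.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    aff = np.exp(-sq)
    np.fill_diagonal(aff, 0.0)
    return aff / aff.sum(axis=0)

rng = np.random.default_rng(42)
X = rng.standard_normal((7, 2))
P = p_matrix(X)
P = P / P.sum()  # assumption: rescale so p and q have the same total mass

# Start from the original points and nudge them until q is close to p.
loss = lambda y: np.sum((P - q_matrix(y.reshape(-1, 2))) ** 2)
Y = minimize(loss, X.ravel()).x.reshape(-1, 2)
```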
Let’s compare the new points to the original points:
CODE
We see that points that were close together got even closer, and points that were far apart got spread out. This is the underlying idea behind t-SNE: to warp the space in this manner. This happened because the tail of the t-distribution is much heavier than the tail of the Gaussian distribution:
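A quick numeric check of that tail behaviour, using the textbook density formulas for the standard normal and for Student’s t with 1 degree of freedom:

```python
import math

def gaussian_density(x):
    # Standard normal density.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cauchy_density(x):
    # Student's t with 1 degree of freedom (Cauchy).
    return 1 / (math.pi * (1 + x * x))

for d in (0, 2, 4):
    print(d, gaussian_density(d), cauchy_density(d))
```

At distance 4 the Cauchy density (~0.019) is over a hundred times larger than the Gaussian one (~0.00013): that heavy tail is what lets distant pairs keep non-negligible mass.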
This means that we have to push distant points further away and pull close points closer for our new probabilities to match the original ones. The motivation for applying such a transformation is to address the so-called “crowding problem”. If we were to distribute points uniformly at random inside a high-dimensional ball, most points would end up close to the surface. This is because the proportion of volume close to the surface rapidly approaches one as we increase the number of dimensions:
\[\frac{r^{d} - (r-1)^{d}}{r^{d}} \xrightarrow[d \to \infty]{} 1,\]where \(r^d\) is proportional to the volume of a \(d\)-dimensional ball of radius \(r\), and \(r^d - (r-1)^d\) to the volume of its crust of thickness 1.
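The limit is easy to check numerically; here with an arbitrary radius of 10 and crust thickness 1:

```python
# Fraction of a d-dimensional ball's volume that lies within
# distance 1 of its surface, for radius r = 10.
r = 10.0
for d in (2, 10, 50, 200):
    crust_fraction = (r**d - (r - 1) ** d) / r**d
    print(d, round(crust_fraction, 4))
# The fraction climbs from 0.19 at d=2 towards 1 as d grows.
```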
The above limit means that if we projected linearly, we’d see points crowded around the edges of a circle. With warping, we pull some of those points closer to the center and push many others far away from the origin.
Conclusion
If you have made it this far, hopefully you have a better understanding of how t-SNE works. Well, at least I do. Now I am going to briefly go over the details that I glossed over.
t-SNE actually deals with joint probabilities, defined as
\[p_{ij} = \frac{p_{j \mid i} + p_{i \mid j}}{2N},\]where \(N\) is the size of the dataset. Division by \(2N\) simply makes sure the probabilities add up to 1. This makes the algorithm more robust to outliers. Note that the joint probabilities are symmetric. When searching for the \(q_{ij}\), the Kullback–Leibler divergence is used, as it’s the information-theoretic way of measuring how similar two distributions are. In equation (1) I used a Gaussian distribution with variance ½. However, in real t-SNE the variance is estimated for each point using the perplexity parameter, defined as 2 raised to the power of Shannon’s entropy. The perplexity is the same for every point: this way each point’s variance is determined by the density of the region around it. Finally, when searching for the \(q\)’s, gradient descent with momentum is used.
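The symmetrisation step is simple enough to sketch, taking a matrix of conditional probabilities with one distribution per column, as computed earlier:

```python
import numpy as np

def joint_probabilities(P_cond):
    # p_ij = (p_{j|i} + p_{i|j}) / (2N): symmetric by construction, and the
    # division by 2N makes the whole matrix sum to 1, since each of the
    # N columns of P_cond sums to 1.
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * N)

# Any column-stochastic matrix with a zero diagonal works as a demo:
rng = np.random.default_rng(42)
A = rng.random((7, 7))
np.fill_diagonal(A, 0.0)
P_cond = A / A.sum(axis=0)

P_joint = joint_probabilities(P_cond)
print(np.isclose(P_joint.sum(), 1.0))  # True
```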
To complete the post, I am including real t-SNE applied to the 7 data points:
CODE
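Since the snippet itself is elided, here is what a call to sklearn’s implementation could look like (the perplexity value is my choice: the default of 30 would be rejected because it must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.standard_normal((7, 2))

# perplexity must be smaller than the number of samples (7 here),
# hence the low value; random_state makes the run repeatable.
Y = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(Y.shape)  # (7, 2)
```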
Appendix
Animation code can be found here:
CODE

scipy.optimize.minimize uses the Broyden–Fletcher–Goldfarb–Shanno algorithm, which approximates the gradient and the second derivative, so it’s not quite brute force. ↩