Visual aids in higher dimensions
Training a neural network is an iterative process: decide on a model architecture, tune hyperparameters, run through large training data sets, and hope that the model ultimately converges to a low loss. The only feedback available throughout this process is a scalar loss value. If the loss does not converge, it could be for a variety of reasons. Does the model have too few parameters? Are the network's parameters initialized poorly? Is the training data encoded correctly? Any of these things could be wrong. Traditional debugging techniques are of limited use here. In a neural network, what you build, the program, lives in the weights rather than in the fixed logic that shuffles inputs through layers of those weights. Humans are not great at interpreting vectors of floating-point numbers, so we turn to visualization techniques to make sense of this high-dimensional data.
Why is Visualization hard in Neural Networks?
We live in a 3D world, so reasoning about up to 3 dimensions comes naturally. There is a famous quote attributed to Geoffrey Hinton: “To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say ‘fourteen’ to yourself very loudly. Everyone does it.” It underscores how limited we are at imagining anything beyond 3D. There are a few clever tricks that can help us out, and all of them ultimately involve casting high-dimensional data into lower dimensions. Let’s take a look at a few examples.
Principal Component Analysis
PCA is one such trick: it takes data with a large number of dimensions and compresses or transforms it into fewer dimensions. PCA is often used to visualize high-dimensional data sets and see how similar or different two points in the data set are. For example, consider real estate market data. A property can be described by its location, value, number of bedrooms, bathrooms, square footage, land area, year built, type of roof, type of foundation, style of construction and so on. As you can see, this data has many dimensions. There may be interesting insights hiding in it, for example a correlation between geography and type of foundation, or between year built and style of construction. However, with that many dimensions, the data is hard to visualize. Your intuition can take you only so far, and you might miss the unseen connections. This is where PCA comes in.
Picture your data sprawling across an N-dimensional space. Could we find directions where it stretches the most—where its variance peaks? PCA does just that, pinpointing principal components: the axes capturing the data’s biggest swings. Imagine a plane spanned by the top two, holding most of the action—like sales and tax rates in a real estate dataset. For visualization, add a third component, letting us see a 3D slice on a 2D screen. With these three vectors, we grasp the directions that pack the most information. To simplify, we project all N dimensions onto them, compressing the data into a form we can explore. That’s PCA’s core: finding these key axes and mapping data onto them. Here, we’ve focused on 3D for visualization, but PCA scales to any \(L \le N\), often losing some detail to reveal the bigger picture.
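To make this concrete, here is a minimal sketch of projecting a synthetic high-dimensional data set down to three principal components with scikit-learn. The data and the correlated "features" are made up for illustration; only the use of sklearn.decomposition.PCA itself reflects the technique described above.
# a sketch: project synthetic high-dimensional data onto its top 3 principal components
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10          # 10 made-up property features
data = rng.normal(size=(n_samples, n_features))
# inject some correlated structure so the principal components are meaningful
data[:, 1] = data[:, 0] * 0.8 + rng.normal(scale=0.1, size=n_samples)
pca = PCA(n_components=3)                # keep the 3 directions with the most variance
projected = pca.fit_transform(data)      # shape (200, 3), ready for a 3D scatter plot
print(pca.explained_variance_ratio_)     # how much of the variance each component captures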
Visualizing Embedding Tables
Check out this visualization of the MiniLM-L6 embedding table from Sentence Transformers—a window into a transformer-based neural network’s mind. User input gets tokenized into numeric IDs, each a word or sub-word in the model’s vocabulary. These tokens look up their high-dimensional vectors in the embedding table, a layer that maps them to a 384-dimensional space of floating-point numbers in MiniLM-L6. Below, PCA compresses this into an interactive 3D view—zoom, pan, and rotate to explore how words cluster by semantic ties. Inspired by DeepLearning.ai’s RAG and embeddings course, it’s a peek at how networks ‘see’ language.
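If you want to build a view like this yourself, the sketch below pulls the token embedding matrix out of the MiniLM-L6 checkpoint with the Hugging Face transformers library and reduces it to three dimensions with PCA. This is one possible route, not necessarily how the visualization above was produced; the checkpoint name and the sample tokens are assumptions for illustration.
# a sketch: pull the MiniLM-L6 token embedding table and project it to 3D with PCA
import numpy as np
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
model_name = "sentence-transformers/all-MiniLM-L6-v2"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# the embedding table: one 384-dimensional row per vocabulary token
embeddings = model.get_input_embeddings().weight.detach().numpy()
print(embeddings.shape)                                  # roughly (30522, 384) for a BERT-style vocabulary
# project every token vector onto the top 3 principal components
points_3d = PCA(n_components=3).fit_transform(embeddings)
# look up a few tokens to locate them in the 3D view
tokens = ["king", "queen", "pizza", "pasta"]
ids = tokenizer.convert_tokens_to_ids(tokens)
print(dict(zip(tokens, points_3d[ids].round(2))))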
In reality, PCA is a linear transformation that preserves global structure but does not explicitly aim to keep similar points close together. Hence the visualization might not always give the cues you were hoping to find. Another algorithm, t-SNE (t-Distributed Stochastic Neighbor Embedding), often gives better results for this purpose. It preserves local structure by mapping similar points close together while allowing dissimilar points to drift farther apart, optimizing pairwise similarities between points in the high- and low-dimensional spaces. Visualizing the embedding table with t-SNE can therefore give excellent results. Below is an example in 2D.
Another view, this time in 3D, where t-SNE is used to reduce the embeddings down to 3 dimensions.
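The snippet below is a minimal sketch of running scikit-learn's t-SNE on a set of high-dimensional vectors. The vectors here are a made-up stand-in for an embedding table; swap in real embedding rows to reproduce views like the ones above.
# a sketch: reduce high-dimensional vectors to 2D with t-SNE for plotting
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
rng = np.random.default_rng(0)
# stand-in for an embedding table: 300 vectors of 384 dimensions in 3 loose clusters
centers = rng.normal(size=(3, 384))
vectors = np.vstack([c + 0.5 * rng.normal(size=(100, 384)) for c in centers])
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(vectors)
plt.scatter(points_2d[:, 0], points_2d[:, 1], s=5)
plt.title("t-SNE view of the stand-in embedding vectors")
plt.show()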
The Embedding Projector from the TensorFlow folks is another excellent visualization tool you might want to check out.
Visualizing the weights of the embedding table gives you some intuition for how the training process learns word meanings. The network has many dimensions at its disposal (384 in the example we observed) to capture relative scores along any of the possible ways of looking at words. For example, words used in the context of war might get their own dimension, food-related words theirs, and so on. The beauty is that these are artifacts of training over a data set, not the result of a conscious design choice. Seeing this representation build up during training is a sign of progress.
The Math behind PCA
The math behind PCA is quite interesting and worth understanding, so I will spend a bit of time on it here. It is sufficient to know that the basic process transforms \(n\)-dimensional data into \(l\)-dimensional data by multiplying each data vector (of \(n\) dimensions) by an \(l \times n\) matrix. The math is not strictly required, though, so feel free to skip ahead. See deeplearningbook.org, section 2.12, for a more thorough treatment of PCA.
Our data set is a collection of \(m\) points \(\{x^{(1)}, \ldots, x^{(m)}\}\), each in \(R^{n}\), and we would like to compress these points into \(R^{l}\) where \(l \lt n\) using PCA.
We essentially encode the \(m\) points each of vector size \(n\) as \(m\) points of vector size \(l\).
For each point \(x^{(i)} \in R^{n}\) we will find a corresponding code vector \(c^{(i)} \in R^{l}\).
Say \(f(x) = c\) represents the encoding function
Let \(g(f(x)) \approx x\) represent the decoding function that attempts to remap c to x.
In PCA the decoding function \(g(c)\) is chosen to be a matrix multiplication of \(D \in R^{n \times l}\) with \(c\).
Additionally in PCA the columns of D are constrained to be orthogonal to each other.
Many different choices of \(D\) could meet this requirement, because we could scale \(D_{:,i}\) down and \(c_{i}\) up by the same factor and get the same output. To remove this ambiguity, we further constrain the columns of \(D\) to have unit norm.
The optimal code point \(c^{*}\) for each input point \(x\) is the \(c\) that minimizes the distance between \(x\) and \(g(c)\), the decoded form of \(x\). Measuring this distance with the L2 norm, we look for the \(c\) that minimizes \(\|x - g(c)\|_{2}\).
Hence \(c^{*} = \underset{c}{\operatorname{argmin}} \|x - g(c)\|_{2}\)
The same \(c^{*}\) is optimal even if we use the squared L2 norm instead.
Hence \(c^{*} = \underset{c}{\operatorname{argmin}}\;\; \|x - g(c)\|_{2}^{2}\)
Since \((x - g(c))\) is a vector, its transpose multiplied by itself gives its squared L2 norm:
\((x - g(c))^{T}(x - g(c))\)
\(= x^{T}x - x^{T}g(c) - g(c)^{T}x + g(c)^{T}g(c)\)
\(= x^{T}x - 2x^{T}g(c) + g(c)^{T}g(c)\), since \(g(c)^{T}x\) is a scalar and hence equal to \(x^{T}g(c)\).
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; x^{T}x - 2x^{T}g(c) + g(c)^{T}g(c)\)
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; -2x^{T}g(c) + g(c)^{T}g(c)\), since \(x^{T}x\) is independent of \(c\)
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; -2x^{T}Dc + (Dc)^{T}Dc\)
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; -2x^{T}Dc + c^{T}D^{T}Dc\)
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; -2x^{T}Dc + c^{T}I_{l}\,c\), since \(D^{T}D = I_{l}\)
\(c^{*} = \underset{c}{\operatorname{argmin}}\;\; -2x^{T}Dc + c^{T}c\)
Using vector calculus, we set the gradient with respect to \(c\) to zero to find the minimum:
\(\nabla_{c}(-2x^{T}Dc + c^{T}c) = 0\)
\(-2D^{T}x + 2c = 0\)
\(c = D^{T}x\)
What this tells us is that the encoder and decoder functions are matrix transforms built from the same matrix \(D\).
\(f(x) = D^{T}x = c\)
\(g(c) = Dc = x^{*} \;\) where \(x^{*}\) is an approximation of x.
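As a quick numerical sanity check, here is a small made-up sketch that builds an orthonormal \(D\), encodes a point with \(D^{T}x\), decodes it with \(Dc\), and confirms that the derived code beats a perturbed one.
# a sketch: verify f(x) = D^T x and g(c) = D c on a random orthonormal D
import numpy as np
rng = np.random.default_rng(0)
n, l = 5, 2
# QR decomposition of a random matrix gives columns that are orthonormal
D, _ = np.linalg.qr(rng.normal(size=(n, l)))
assert np.allclose(D.T @ D, np.eye(l))       # columns have unit norm and are orthogonal
x = rng.normal(size=n)
c = D.T @ x                                  # encode: c* = D^T x
x_hat = D @ c                                # decode: approximation of x
err_opt = np.sum((x - x_hat) ** 2)
err_perturbed = np.sum((x - D @ (c + 0.1)) ** 2)
print(err_opt <= err_perturbed)              # the derived c* gives the smaller reconstruction error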
Intuition behind PCA
The remaining math is to figure out how to select the matrix \(D\). It is a bit more involved, so I will leave you with an intuition and skip to the solution. If you think about what the operation \(D^{T}x\) really means, you will see that it transforms the vector \(x\) into a new coordinate space using a change of basis (COB) matrix defined by \(D^{T}\). The goal of PCA is to select \(D\) so that its directions are the principal components, the directions along which most of the data's spread lies. This can be done by analyzing the spread of the data along the \(n\) dimensions of the data set using the covariance matrix.
The covariance matrix is written as \(\Sigma\). Let \(X\) be an \(m \times n\) matrix where each row is a data point, each column represents a feature, and the data has been normalized to have zero mean.
So \(\Sigma = \frac{1}{m-1}X^{T}X\)
Or \(\Sigma_{ij} = \frac{1}{m-1} \sum_{k=1}^{m}(X_{k,i} - \bar{X_{i}})(X_{k,j} - \bar{X_{j}})\)
Each element of this matrix represents the covariance between feature \(i\) and \(j\).
Think back to the properties of eigenvectors and eigenvalues of a matrix. An eigenvector is a vector whose direction is unchanged when the matrix is applied as a transformation, and its eigenvalue describes what happens to its length: a negative value reverses its direction, a value greater than one scales it up. The covariance matrix captures the interactions between the various features, or dimensions, in the data set. Using it as a COB matrix has the effect of stretching or shrinking the data along its variance directions: a feature with high variance gets amplified, and a feature with low variance gets compressed. If features are correlated, the transformation tilts the data along their shared variance direction; otherwise the data stays aligned to the axes.
From this, it makes sense that the eigenvectors of \(\Sigma\) corresponding to the \(l\) largest eigenvalues are the principal components of the data set. These \(l\) eigenvectors, each of size \(n\), stacked as columns form the matrix \(D\) whose transpose is the encoder.
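As a toy illustration (numbers made up for this example), take two perfectly correlated zero-mean features with samples \((-1,-2)\), \((0,0)\), \((1,2)\). Then \(\Sigma = \frac{1}{2}\begin{bmatrix} 2 & 4 \\ 4 & 8 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\), with eigenvalues \(5\) and \(0\). The eigenvector for the eigenvalue \(5\), \(\frac{1}{\sqrt{5}}(1, 2)^{T}\), points exactly along the line \(y = 2x\) on which all three samples lie, so a single principal component captures all of the variance; the second eigenvalue is \(0\) because there is no spread in the orthogonal direction.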
View PCA Code
# lets create a data set of 2D points that are highly and positively correlated
import numpy as np
import matplotlib.pyplot as plt
import random
n = 50
x = np.random.rand(n)
y = np.array([i + random.random()*.3 for i in x])
plt.clf()
plt.scatter(x,y, s=3, color='blue')
# lets find the covariance matrix of this data set
cov_matrix = np.cov(x,y)
print("covariance matrix\n", cov_matrix)
# lets transform the data set by treating the covariance matrix as a linear transformation
data = np.stack((x,y), axis=0)
transformed_data = np.dot(cov_matrix, data)
plt.scatter(transformed_data[0,:], transformed_data[1,:], s=2, color='pink')
transformed_data
# lets find the eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(eigenvalues)
print(eigenvectors)
# the eigenvectors are the columns of the eigenvectors matrix, so don't assume that row 0 is the first eigenvector
# lets plot the eigenvectors on top of the data
plt.plot(eigenvectors[0,0], eigenvectors[1,0], marker='o', color='red')
plt.quiver(0, 0, eigenvectors[0,0], eigenvectors[1,0], angles='xy', scale_units='xy', scale=1, color='red', width=0.002)
plt.plot(eigenvectors[0,1], eigenvectors[1,1], marker='p', color='purple')
plt.quiver(0, 0, eigenvectors[0,1], eigenvectors[1,1], angles='xy', scale_units='xy', scale=1, color='purple', width=0.002)
# lets view the original points encoded using the eigenvectors (the encoder is D^T, per the derivation above)
eigen_transformed_data = np.dot(eigenvectors.T, data)
plt.scatter(eigen_transformed_data[0,:], eigen_transformed_data[1,:], s=3, color='green')
plt.grid()
plt.show()
data.shape
Using a Neural Network to Find the Transform Matrix
We learned that the basic idea in PCA is to find an encoding function that compresses the data from \(n\) dimensions to \(l\) dimensions by multiplying by an \(l \times n\) matrix (the transpose of \(D\)). We also saw that this matrix is selected such that decoding with its transpose reconstructs the data with minimal loss. How about we try to discover the weights of this matrix using neural network training?
Looking at what PCA does, it effectively uses the covariance matrix to come up with the eigenvectors, which are in turn used to transform the data set. The setup is \(c = f(x) = Wx\), where \(W\) is the matrix whose rows are the eigenvectors (the \(D^{T}\) from earlier). Conversely, to get back to an approximation of the original data, we use the transpose of \(W\): \(x^{*} = W^{T}c\).
Here we try to come up with the \(W\) matrix using machine learning. We train the parameters of \(W\) on the data set to minimize the reconstruction error, the difference between the original data and the data reconstructed from the codes, and we make that adjustment with gradient descent.
View Neural Network Code
import numpy as np
import random
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
# lets create a simple autoencoder that compresses 2 inputs down to 1 dimension and back
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Linear(2, 1)
        self.decoder = nn.Linear(1, 2)
        # initialize the decoder as the transpose of the encoder, mirroring the PCA setup
        # (note: this only initializes them alike; the two weights are still trained independently)
        self.decoder.weight.data = self.encoder.weight.data.t()

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
# lets create the autoencoder
autoencoder = Autoencoder()
# lets create the optimizer
optimizer = optim.Adam(autoencoder.parameters(), lr=0.01)
# lets create the loss function
loss_fn = nn.MSELoss()
# lets create the data set
n = 50
x = np.random.rand(n)
y = np.array([i + random.random()*.3 for i in x])
data = np.stack((x,y), axis=1)
data = torch.tensor(data, dtype=torch.float32)
# lets train the autoencoder
for i in range(10000):
    optimizer.zero_grad()
    reconstruction = autoencoder(data)
    loss = loss_fn(reconstruction, data)
    loss.backward()
    optimizer.step()
    if i % 100 == 0:
        print(loss.item())
# lets plot the original data set
plt.clf()
plt.scatter(data[:,0].detach().numpy(), data[:,1].detach().numpy(), s=3, color='blue')
# lets plot the reconstructed data set
plt.scatter(reconstruction[:,0].detach().numpy(), reconstruction[:,1].detach().numpy(), s=3, color='red')
plt.grid()
plt.show()
#lets print the W matrix
print(autoencoder.encoder.weight)
#lets build the covariance matrix over the data set
cov_matrix = np.cov(data.detach().numpy().T)
# lets build the eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print(eigenvalues)
print(eigenvectors)
# Wow: the eigenvector with the largest eigenvalue closely matches the learned encoder weights!
While a trivial example, this still shows that the weight matrix of the network trained to [-0.6811, -0.7322], while the eigenvector of the covariance matrix with the largest eigenvalue was computed as [-0.68113166, -0.73216095]! The actual values you see when you train might differ, but your results should still line up with this understanding.
Conclusion
We discussed the need for dimensionality reduction to help with visualization, saw how the embedding table of a transformer network can be visualized, and went into some of the math behind PCA. We did not cover other aspects of neural networks where visualization is key. For example, visualization of the loss surface is a very interesting topic. The techniques used there do not rely on directly reducing dimensions; instead they observe how the loss changes as we slide slowly from one point in the N-dimensional parameter space to another. I found the excellent blog post Neural Network Loss Visualization, which explains the paper Visualizing the Loss Landscape of Neural Nets. There is also a lot more to explore in the parallels between the covariance matrix used in PCA and the Hessian matrix used to visualize the curvature of the loss landscape. Hint: both involve eigenvectors! Maybe another time. Thank you for reading.