(Code available at harningle/useful-scripts)

When I was doing a Kaggle competition a couple of days ago, I realised I’d never seriously studied principal component analysis (PCA), and I’ve already forgotten a lot of linear algebra… So here it is. We will start with brute force math, and end with the geometric intuition.

Brute force

Suppose we have a dataset $X$ of $N$ observations (rows) and $K$ features (columns). The intuition of PCA is to reduce dimensionality by focusing on certain “important” dimensions, or to find a low dimensional representation of the original data that captures as much of the variation in the original data as possible. E.g., we are analysing food consumption, and $x_1$ is beef, $x_2$ is chicken, and $x_3$ is fish. It’s absolutely fine to have three variables, but if you ask me to produce one indicator, then probably a simple average of $x_1$ to $x_3$, i.e. $meat = \frac{1}{3} beef + \frac{1}{3} chicken + \frac{1}{3} fish$, captures the essence. PCA tries to answer two questions: (1) is the simple average $\frac{1}{3}$ the best, or what are the best weights if we want to use one variable to represent the original data, and (2) after $meat$, what’s remaining in the original dataset, and how do we capture it?
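Before the math, a quick numerical sketch of this meat example with made-up data (the numbers and the latent “meat appetite” variable here are purely hypothetical): we ask scikit-learn’s PCA for the best one-dimensional representation of three correlated food features and inspect the weights it learns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
appetite = rng.normal(5, 2, size=(100, 1))          # hypothetical latent "meat appetite"
X = appetite + rng.normal(0, 0.3, size=(100, 3))    # beef, chicken, fish all track it

pca = PCA(n_components=1)
z = pca.fit_transform(X)      # the single new variable ("meat")
print(pca.components_)        # the learned weights: a unit vector
```

Since all three features move together here, the learned weights come out roughly equal, close to $(\frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}}, \frac{1}{\sqrt{3}})$ rather than $\frac{1}{3}$ each, because PCA normalises the weight vector to unit length, as we will see below.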

First component

Formally, let $X = (\vec{x_1}, \cdots, \vec{x_K})$, where each feature $\vec{x_j}$ is an $N\times 1$ column vector. We want to use $L < K$ variables to represent the original data.1 To make life easier, in this lower dimensional representation, each new variable $\vec{z_l}$ takes the form of a linear combination of the original features, i.e.

\[\vec{z_l} = \sum\limits_{j= 1, \cdots, K} w_{j, l} \vec{x_j}\]

, where the weight is a unit/length-$1$ vector:

\[\sum\limits_j w_{j, l}^2 = 1\]

.2 In matrix notation, this is equivalent to

\[\underset{N\times L}{Z} = \underset{N\times K}{X}\underset{K\times L}{W}\]

. $Z = (\vec{z_1}, \cdots, \vec{z_L})$ is the set of new features/dimensions, and $W = (\vec{w_1}, \cdots, \vec{w_L})$ is the weight matrix: each unit column vector $\vec{w_l} = (w_{1, l}, \cdots, w_{K, l})^T$ is the weights that transform the original $\vec{x}$’s to $\vec{z_l}$, which is also called loading.

We further require that $\vec{z_1}$ capture as much variance as possible; then, after $\vec{z_1}$, $\vec{z_2}$ should capture as much of the remaining variance as possible, and so on. In other words, $\vec{z_1}$ should be the most “important” dimension, $\vec{z_2}$ the second most “important” dimension, etc. To make the variance easier to compute, we de-mean all $\vec{x}$’s, so the variance of a vector is simply its sum of squares (up to the constant factor $\frac{1}{N}$). Therefore, $\vec{w_1}$ solves

\[\begin{aligned} \max\text{Var}(\vec{z_1}) &= \max\vec{z_1}^T\vec{z_1} \\ &= \max (X\vec{w_1})^TX\vec{w_1} \\ &= \max \vec{w_1}^TX^TX\vec{w_1} \end{aligned}\]

. Note that $\vec{w_1}$ is a unit vector. The constraint is thus

\[w_{1, 1}^2 + \cdots + w_{K, 1}^2 = \vec{w_1}^T\vec{w_1} = 1\]

. To save notation, we omit the subscript $1$ of $\vec{w_1}$ from here onwards. The Lagrangian of this constrained optimisation is

\[\mathcal{L} = \vec{w}^TX^TX\vec{w} + \lambda (1 - \vec{w}^T\vec{w})\]

. Setting all derivatives to zero, we get

\[\begin{aligned} &\frac{\partial\mathcal{L}}{\partial \vec{w}} = 2{\color{red}X^TX\vec{w}} - 2{\color{blue}\lambda \vec{w}} = 0 \\ &\frac{\partial\mathcal{L}}{\partial \lambda} = \vec{w}^T\vec{w} - 1 = 0 \end{aligned}\]

, or

\[\begin{align} {\color{red}X^TX\vec{w}} &= {\color{blue}\lambda \vec{w}} \label{eq:pca_foc} \\ \vec{w}^T\vec{w} &= 1 \nonumber \end{align}\]

. ${\color{red}X^TX\vec{w}} = {\color{blue}\lambda \vec{w}}$ is precisely the definition of eigenvalues and eigenvectors. Therefore, $\lambda$ is an eigenvalue of $X^TX$, and $\vec{w}$ is the corresponding eigenvector. Plugging this back into $\max \vec{w}^T\color{red}X^TX\vec{w}$, we get $\max \vec{w}^T\color{blue}\lambda \vec{w}$, and it’s obvious that we want $\lambda$ to be as large as possible. So the maximum is achieved when $\lambda$ is the largest eigenvalue, and $\vec{w}$ is the corresponding eigenvector (normalised to unit length). Eigenvalue decomposition or SVD can find them quite easily.
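As a minimal sketch with random made-up data, the first principal direction is the top eigenvector of $X^TX$ after de-meaning, which we can cross-check against SVD:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                      # de-mean each feature

eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigh returns ascending eigenvalues
w1 = eigvecs[:, -1]                         # eigenvector of the largest eigenvalue

# Cross-check: the first right singular vector of X is the same direction,
# possibly with the sign flipped.
_, _, Vt = np.linalg.svd(X)
assert np.allclose(np.abs(w1), np.abs(Vt[0]))
```

`eigh` is preferred over `eig` here because $X^TX$ is symmetric; in practice SVD of $X$ itself is numerically nicer than forming $X^TX$ explicitly.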

What’s left after the first component

Suppose we have already found $\vec{z_1}$ and $\vec{w_1}$. Now, to solve for the second most “important” dimension $\vec{z_2}$, we first need to understand what we already know about $X$ given $\vec{z_1}$ and $\vec{w_1}$. For any given data point $\vec{x_i}$, i.e. the $i$th row in $X$, its first principal component is $z_{i, 1} = \vec{x_i}\vec{w_1}$.3 That is, we project the original data point $\vec{x_i}$ onto the direction $\vec{w_1}$. In the meat example above, suppose the data point is $\vec{x_i} = (6, 3, 3)$, i.e. $6$ beef, $3$ chicken, and $3$ fish, and the weights are $\vec{w_1} = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})^T$; then its first component is $meat_i = z_{i, 1} = 6\times \frac{1}{3} + 3\times \frac{1}{3} + 3\times \frac{1}{3} = 4$.4 Given this $4\ meat$, what do we know about the original food consumption? Since the weight is $\frac{1}{3}$, $4\ meat$ basically means $\frac{4}{3}\ beef$ and $\frac{4}{3}\ chicken$ and $\frac{4}{3}\ fish$. That is, if the weights are $\vec{w_1}$, our best knowledge about the original data point $\vec{x_i}$ is $z_{i, 1}\vec{w_1}^T = \vec{x_i}\vec{w_1}\vec{w_1}^T$.5

Therefore, after the first component, the unexplained part of $\vec{x_i}$ is $\tilde{\vec{x_i}} = \vec{x_i} - \vec{x_i}\vec{w_1}\vec{w_1}^T = \vec{x_i}(I - \vec{w_1}\vec{w_1}^T)$. This applies to any given row in $X$, so the remaining part of $X$ is $\tilde{X} = X(I - \vec{w_1}\vec{w_1}^T)$.
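This deflation step is easy to check numerically. A small sketch with random data: after subtracting the part explained by $\vec{w_1}$, the residual matrix has nothing left along $\vec{w_1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X)
w1 = Vt[0]                                    # first principal direction

# Residual after removing the first component: X_tilde = X(I - w1 w1^T)
X_tilde = X @ (np.eye(3) - np.outer(w1, w1))

print(np.abs(X_tilde @ w1).max())             # numerically zero: w1 is fully removed
```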

Second component

By recursion, the second most “important” dimension of $X$ should be the first most “important” dimension of $\tilde X$. So, replacing $X$ with $\tilde X$ in ($\ref{eq:pca_foc}$), we get

\[\begin{aligned} (I - \vec{w_1}\vec{w_1}^T)^TX^TX(I - \vec{w_1}\vec{w_1}^T)\vec{w_2} &= \lambda_2 \vec{w_2} \\ (X^TX - \vec{w_1}\vec{w_1}^TX^TX - X^TX\vec{w_1}\vec{w_1}^T + \vec{w_1}\vec{w_1}^TX^TX\vec{w_1}\vec{w_1}^T)\vec{w_2} &= \lambda_2 \vec{w_2} \end{aligned}\]

. Plug $X^TX\vec{w_1} = \lambda_1 \vec{w_1}$ (as well as $\vec{w_1}^TX^TX = \lambda_1 \vec{w_1}^T$) into the above long equation, we reach a quite simple one:

\[\begin{aligned} (X^TX - \vec{w_1}\lambda_1 \vec{w_1}^T - \lambda_1 \vec{w_1}\vec{w_1}^T + \vec{w_1}\vec{w_1}^T\lambda_1 \vec{w_1}\vec{w_1}^T)\vec{w_2} &= \lambda_2 \vec{w_2} \\ (X^TX - \lambda_1\vec{w_1}\vec{w_1}^T - \lambda_1 \vec{w_1}\vec{w_1}^T + \lambda_1\vec{w_1}\underset{\vec{w_1}\text{ is unit vector}}{\underbrace{\vec{w_1}^T\vec{w_1}}}\vec{w_1}^T)\vec{w_2} &= \lambda_2 \vec{w_2} \\ (X^TX - \lambda_1\vec{w_1}\vec{w_1}^T - \lambda_1 \vec{w_1}\vec{w_1}^T + \lambda_1\vec{w_1}\vec{w_1}^T)\vec{w_2} &= \lambda_2 \vec{w_2} \\ \underset{\text{LHS}}{\underbrace{(X^TX - \lambda_1\vec{w_1}\vec{w_1}^T)\vec{w_2}}} &= \underset{\text{RHS}}{\underbrace{\lambda_2 \vec{w_2}}} \\ \end{aligned}\]

. It’s now clear that $\vec{w_2}$ and $\lambda_2$ are an eigenvector and the corresponding eigenvalue of $X^TX - \lambda_1\vec{w_1}\vec{w_1}^T$. Let’s guess $\vec{w_2}$. It must not be in the same direction as $\vec{w_1}$; otherwise, we would be repeatedly extracting information along the same direction. Let’s confirm this by contradiction: what if $\vec{w_2} = \gamma\vec{w_1}$? In that hypothetical case,

\[\begin{aligned} \text{LHS} &= (X^TX - \lambda_1\vec{w_1}\vec{w_1}^T)\vec{w_2} \\ &= (X^TX - \lambda_1\vec{w_1}\vec{w_1}^T)\gamma\vec{w_1} \\ &= \gamma X^TX\vec{w_1} - \lambda_1\gamma\vec{w_1}\vec{w_1}^T\vec{w_1} \\ &= \gamma\lambda_1\vec{w_1} - \lambda_1\gamma\vec{w_1} \\ &= 0 \end{aligned}\]

So if $\vec{w_2}$ is parallel to $\vec{w_1}$, the LHS is exactly zero, which forces $\lambda_2 = 0$: a direction parallel to $\vec{w_1}$ captures none of the remaining variance, so it can’t be the answer. Now what if $\vec{w_2}$ is orthogonal to $\vec{w_1}$? Let $\vec{w_2} = \vec{v}$ be a vector orthogonal to $\vec{w_1}$, and plug it into the above equation:

\[\begin{aligned} \text{LHS} &= (X^TX - \lambda_1\vec{w_1}\vec{w_1}^T)\vec{w_2} \\ &= X^TX\vec{v} - \lambda_1\vec{w_1}\underset{\text{orthogonal}}{\underbrace{\vec{w_1}^T\vec{v}}} \\ &= X^TX\vec{v} - 0 \\ &= X^TX\vec{v} \end{aligned}\]

, while

\[\text{RHS} = \lambda_2 \vec{w_2} = \lambda_2 \vec{v}\]

. So if LHS is equal to RHS, then by definition $\vec{v}$ and $\lambda_2$ are an eigenvector and eigenvalue of $X^TX$. And to get the maximum, we want the eigenvalue to be as large as possible. The largest eigenvalue of $X^TX$ is already taken by the first component, and the second component can’t be parallel to the first, so the best we can do here is the second largest eigenvalue. By recursion, the $i$th component comes from the $i$th largest eigenvalue and its eigenvector.
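We can verify this deflation argument numerically with random made-up data: the top eigenvector of $X^TX - \lambda_1\vec{w_1}\vec{w_1}^T$ coincides with the second eigenvector of $X^TX$, and its eigenvalue is $\lambda_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X = X - X.mean(axis=0)
S = X.T @ X

vals, vecs = np.linalg.eigh(S)               # eigenvalues in ascending order
lam1, w1 = vals[-1], vecs[:, -1]             # largest eigenpair
lam2, w2 = vals[-2], vecs[:, -2]             # second largest eigenpair

# Deflate: subtract the first component's contribution
deflated = S - lam1 * np.outer(w1, w1)
dvals, dvecs = np.linalg.eigh(deflated)

assert np.isclose(dvals[-1], lam2)                    # top remaining eigenvalue is lambda_2
assert np.allclose(np.abs(dvecs[:, -1]), np.abs(w2))  # same direction, up to sign
```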

The geometry

A (way) better way to see this is through geometry. Suppose the original dataset has three features, and looks like the figure below.

(Figure: (A) raw data; (B) top, front, and end views)

We can easily tell that $x_3 = fish$ isn’t very informative: almost all data points are vertically bounded within $(-2.5, 2.5)$, and they seem quite randomly distributed. Secondly, $x_1$ and $x_2$ are highly correlated: when $x_1 = beef = 5$, $x_2 = chicken$ is also roughly $5$. If I’m asked to find a lower dimensional representation of this dataset, I would use a 45-degree line like below.

(Figure: the best projection)

The first component is essentially the red line above. We begin with three dimensions: the $x$ axis $x_1$ ($beef$), the $y$ axis $x_2$ ($chicken$), and the vertical $z$ axis $x_3$ ($fish$). The goal is to find some axis or direction or line that represents the original dataset as well as possible. Intuitively, the best direction/axis is the line that all data points are close to. Let the best line have direction $\vec{w}$, which is a unit vector. Let’s first de-mean every variable. Then it should be obvious that the best line goes through the origin.6 The residual from projecting a given data point $\vec{x_i}$ onto this line (whose length is the distance to the line) is $\color{maroon}\vec{x_i}^T - \vec{x_i}\vec{w}\vec{w}$.7 This is precisely the “unexplained” part in the brute force above.

So far, for each data point, we know how far it is from a line. Now let’s minimise the sum of distances squared to determine the best line. To save notation, let $z_i = \vec{x_i}\vec{w}$.8

\[\begin{aligned} \min \sum\limits_i (\vec{x_i}^T - \vec{x_i}\vec{w}\vec{w})^2 &= \min \sum\limits_i (\vec{x_i}^T - z_i\vec{w})^2 \\ &= \min \sum\limits_i (\vec{x_i}^T - z_i\vec{w})^T(\vec{x_i}^T - z_i\vec{w}) \\ &= \min \sum\limits_i \vec{x_i}\vec{x_i}^T - \vec{x_i}z_i\vec{w} - \vec{w}^Tz_i\vec{x_i}^T + \vec{w}^Tz_iz_i\vec{w} \\ &= \min \sum\limits_i \underset{\text{constant}}{\underbrace{\vec{x_i}\vec{x_i}^T}} - z_i\underset{z_i}{\underbrace{\vec{x_i}\vec{w}}} - z_i\underset{z_i^T}{\underbrace{\vec{w}^T\vec{x_i}^T}} + z_i^2\underset{\text{unit length}}{\underbrace{\vec{w}^T\vec{w}}} \\ &= \min \sum\limits_i - z_i^2 - z_i\underset{z_i\text{ is a scalar, so transpose is itself}}{\underbrace{z_i^T}} + z_i^2 \\ &= \min \sum\limits_i -z_i^2 - z_i^2 + z_i^2 \\ &= \max \sum\limits_i z_i^2 \\ &= \max \vec{z}^T\vec{z} \end{aligned}\]

. This is exactly the same as maximising the variance in the brute force section. So the rest of the optimisation stays the same.
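The step from minimising residuals to maximising $\sum_i z_i^2$ is just Pythagoras per row: the total sum of squares splits into the part captured along $\vec{w}$ plus the squared residuals. A small sketch with random data, for any unit direction $\vec{w}$:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(150, 3))
X = X - X.mean(axis=0)

w = rng.normal(size=3)
w /= np.linalg.norm(w)            # any unit direction, not necessarily optimal

z = X @ w                         # scores along w
resid = X - np.outer(z, w)        # per-row residuals, orthogonal to w

# Total sum of squares = captured part + residual part,
# so minimising the residuals maximises the captured variance.
total = (X ** 2).sum()
assert np.isclose(total, (z ** 2).sum() + (resid ** 2).sum())
```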

Best projection gives max variance, and vice versa

Now we’ve reached an important link between variance maximisation and projection. Finding weights $\vec{w}$ that maximise the variance of $X\vec{w}$ is the same as finding a line/direction $\vec{w}$ that minimises the distances from the data points to the line.

The second component lies in the subspace of projection residuals

From the geometrical perspective, it’s also straightforward that all components should be orthogonal to each other. Recall the distance from a data point to the best line:

\[\underset{\text{original data point}}{\underbrace{\vec{x_i}}} = \underset{\text{projection on }\vec{w_1}\text{/first component}}{\underbrace{\color{navy}\vec{x_{i, /\mkern-5mu/}}}} + \underset{\text{distance to }\vec{w_1}\text{/residual}}{\underbrace{\color{maroon}\vec{x_{i, \perp}}}}\]

. The first component direction already extracts everything along $\vec{w_1}$, whether we view it as the result of projection or of variance maximisation, so whatever is left lies along the residual direction, which by definition is perpendicular to $\vec{w_1}$. Therefore, the second component lives in the residual subspace, and is thus orthogonal to the first component direction.
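By recursion, the same holds for every pair of components. A quick numerical check with random data: the loading matrix $W$ (whose columns are the $\vec{w_l}$’s, taken here from the SVD of $X$) satisfies $W^TW = I$, i.e. all principal directions are mutually orthogonal unit vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 5))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X)
W = Vt.T                                  # columns are w1, w2, ..., wK

assert np.allclose(W.T @ W, np.eye(5))    # orthonormal loadings
```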

  1. Of course we want $L$ to be strictly less than $K$. Otherwise, why not just use all $K$ original features and keep 100% of the info. lol. 

  2. At first glance, it’s natural to assume $\sum w_j = 1$, i.e. a weighted average where all weights add up to one. However, as we will see below, this is not ideal from a geometrical perspective. $Z = XW$ applies a linear transformation $W$ to $X$, and here the transformation is a projection: we are essentially projecting the original data into another linear space. The most straightforward way to characterise a vector space is by its basis, and the most convenient basis is an orthonormal one, so it’s natural to assume each $\vec{w}$ has length one. It also makes the math easier. It doesn’t matter if we have a weight vector like $1, 3, 5, 10$ or $2, 6, 10, 20$; they only differ by a constant, and whichever $\vec{w}$ we have, we can always rescale/normalise it to unit length. So why not assume it’s a unit vector in the first place? And the length is simple to compute using $\vec{w}^T\vec{w}$, while $\sum w_j = 1$ is awkward to express in matrix/vector form. 

  3. We write $\vec{x_i}$ as a $1\times K$ row vector. 

  4. The $\frac{1}{3}$ weights here are hypothetical. $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ isn’t a unit vector. In real PCA, it must have unit length. But here for illustration only, we use $\frac{1}{3}$. 

  5. Formally, assume the weights are $\vec{w_1}$. Our “best guess” of the original data point $\vec{x_i}$ must take the form $\alpha \vec{w_1}^T$. In other words, all we know is the direction $\vec{w_1}$, so the best we can do is to pick some point along $\vec{w_1}$ to approximate the original $\vec{x_i}$. The reconstruction error, or MSE of our guess, is $(\vec{x_i}^T - \alpha \vec{w_1})^2$. Setting its first derivative to zero, we have $-2\vec{w_1}^T(\vec{x_i}^T - \alpha \vec{w_1}) = 0$. So $\alpha = \frac{\vec{w_1}^T}{\vec{w_1}^T\vec{w_1}}\vec{x_i}^T$. The weights are constrained to be a unit vector, so the denominator is one, and thus $\alpha = \vec{w_1}^T\vec{x_i}^T = \vec{x_i}\vec{w_1}$. Using this $\alpha$, our best guess of the original $\vec{x_i}$ is $\hat{\vec{x_i}} = \vec{x_i}\vec{w_1}\vec{w_1}^T$. 

  6. This can be proved by brute force in the 2D case: assume two features $x$ and $y$ have zero mean, and let the best line be $y = kx + b$. Compute the distance from a data point to this line, and minimise the sum of squared distances. The objective function is $\sum \frac{(kx_i - y_i + b)^2}{k^2 + 1}$. Set $\frac{\partial}{\partial b} = \frac{1}{k^2 + 1}\sum 2(kx_i - y_i + b)$ to zero. Because both $x$ and $y$ have zero mean, the sums of all $x$’s and $y$’s are zero, so $\frac{\partial}{\partial b} = \frac{1}{k^2 + 1}\sum 2(k\cdot 0 - 0 + b) = 0$, or $b = 0$. 

  7. $\vec{x_i}$ can be decomposed to a vector parallel to $\vec{w}$ and a vector perpendicular to $\vec{w}$. The distance from the data point to the line is the length of the orthogonal component. The parallel component has a length of the dot product $\vec{x_i}\vec{w}$, and has the direction $\vec{w}$, so it gives $\vec{x_i}\vec{w}\vec{w}$. Then the perpendicular component is the original $\vec{x_i}$ minus this parallel component, which is $\vec{x_i}^T - \vec{x_i}\vec{w}\vec{w}$. 

  8. $z_i$ is a scalar: if the raw data point is $6$ beef, $3$ chicken, and $3$ fish, and the weights are $\vec{w} = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})^T$, then $z_i = 6\times \frac{1}{3} + 3\times \frac{1}{3} + 3\times \frac{1}{3} = 4$.