SVD, PCA
In this post, I wrote about the singular value decomposition and the principal component analysis.
SVD, PCA
Principal component analysis (PCA) is a statistical technique I had always heard about from countless sources, including many undergraduate applied statistics textbooks, Coursera courses on data science, and etc. Nevertheless, it was one of those topics which I had never had a firm grasp on, while pretending to understand it. But after deciding to take a linear algebra-related course this semester, I decided to change my way and go over the topic in more detail, which led me to write this post. The sources that I referenced include: my own lecture notes on Youtube videos by professors Steve Brunton and Gil Strang, as well as some personal memos taken on textbooks under CC BY 4.0 license, provided by LibreTexts.
Singular Value Decomposition
Introduction
Let $A$ be an $m \times n$ matrix of rank $k$. Suppose there exists a $B_{1} = {\vec{v}_{1}, \vec{v}_{2}, \ldots \vec{v}_{n}}$, a list of eigenvectors of positive semi-definite matrix $A^\top A$,
\[A^{\top}A \cdot v_{i} = \lambda_{i} v_{i} \tag{1}\]and $B_{2} = {\vec{u}_{1}, \vec{u}_{2}, \ldots \vec{u}_{n}}$, a list of eigenvectors of positive semi-definite matrix $AA^{\top}$,
\[AA^{\top} \cdot u_{i} = \lambda_{i} u_{i} \tag{2}\]such that
\[\left\{\begin{array}{ll} A {v}_i = \sigma_i {u}_i & (1 \leq i \leq k) \\ A^\top {u}_i = \sigma_i {v}_i & (1 \leq i \leq k) \end{array}\right. \tag{3}\]for each $ \sigma_{i} = \sqrt{\lambda_{i}}\;\;$ in ${\sigma_1 … \sigma_k}$, where \(\sigma_1 \dots \sigma_k \in \mathbb{R}, \;\; (\sigma_1 > \sigma_2 > \dots > \sigma_k > 0)\)
Then define matrices $V$, $U$, and $\Sigma$ in the following manner, where $V$ is a matrix whose columns are eigenvectors of $A^{\top}A$, and $U$ is a matrix whose columns are eigenvectors of $AA^{\top}$.
\[V = [\vec{v}_1 | \vec{v}_2 | ... \vec{v}_n] \\ U = [\vec{u}_1 | \vec{u}_2 | ... \vec{u}_m] \tag{4}\] \[\] \[\Sigma=\left[\begin{array}{cccc|cccc} \sigma_{1} & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_{2} & & 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{k} & 0 & 0 & & 0 \\ \hline 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ \end{array}\right] \tag{5}\]From $(3)$, substitute each $A v_{i}$ with $\sigma_{i} u_{i}$.
\[\begin{aligned} AV &=\left[A \vec{v}_{1}\left|A \vec{v}_{2}\right| \ldots A \vec{v}_{n}\right] \\ &=\left[\sigma_{1} \vec{u}_{1}|\ldots| \sigma_{k} \vec{u}_{k} \mid \vec{0} \ldots \vec{0} \right] \\ \end{aligned} \tag{6}\]Using the definition of the matrix $\Sigma$ above, we can express $AV$ as $U\Sigma$.
\[\begin{aligned} AV&=\left[\vec{u}_{1}|\ldots| \vec{u}_{m}\right] \Sigma \\ &=U \Sigma \\ \end{aligned} \tag{7}\]And we immediately arrive at the more famous form of the S.V.D. formula, where we use the fact that $V$ is an orthonormal basis of $R_{n}$ (and thus $V^{-1} = V^\top$), to get the final form.
\[\therefore A V=U \Sigma \;\; \longrightarrow \;\; A=U \Sigma V^{-1}=U \Sigma V^{\top} \tag{8}\]- $V$ : consists of eigenvectors of $A^{\top}A$
- $U$ : consists of eigenvectors of $AA^{\top}$.
- $\Sigma$ : diagonal matrix, whose elements are $\sigma_{i} = \sqrt{\lambda_{i}}$, with $\lambda_{i}$ as eigenvalue of $AA^{\top}$
Summary
SVD:
- Serves as a general-purpose tool for dimensionality reduction
- Helps break down high dimensional data (such as high resolution video/image) into key features.
- Used in numerical / linear algebra for data processing
- Generalized version of a fast Fourier transform (FFT) that is better tailored to the specific data
- Used to solve matrix systems ($Ax = b$) for non-square matrix $A$
- Basis for PCA (statistical high dimension → dominant patterns/correlations)
Advantages of SVD:
- Simple, interpretable linear algebra technique that can be used on any data matrix to get understandable features.
- Used to distill high dimensional data into key features.
Applications of SVD:
- Google’s Pagerank algorithm
- Recommendation systems
- Facial recognition systems
- Correlation systems
Principal Component Analysis
Introduction
Let’s quickly take a look at how to find the principal components of a matrix. Suppose we have a data matrix $X = \mathbf{[ x_{1}, x_{2}, x_{3}, x_{4}]}$,
with $\mathbf{x_{i}}$ as columns corresponding to distinct categories of the data matrix:
\[\begin{align} \mathbf{{x}_{i}} &= \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{in} \end{bmatrix} \end{align}\]Now define $\bar{x}_{i}$ as the average of the values in column $\mathbf{x_{i}}$
\[\bar{x}_{i} = \frac{1}{n} \sum _{k=1}^{n} x_{ik}\]For each column, let $\mathbf{\bar{x}_{i}}$ be an $n$ dimensional vector filled with $\bar{x}_{i}$,
\[\begin{align} \mathbf{\bar{x}_{i}} &= \begin{bmatrix} \bar{x}_{i} \\ \bar{x}_{i} \\ \vdots \\ \bar{x}_{i} \end{bmatrix} \in R^n \end{align}\]and let $\mathbf{t}_{i}$ be defined as an $n$ dimensional vector with the following formula.
\[\begin{align} \mathbf{t}_{i} = \frac{\mathbf{x}_{i} - \mathbf{\bar{x}}_{i}}{s_{x_{i}}} \end{align} \;\; (i = 1,2,3,4)\]For matrix $T$ with $\mathbf{t_{i}}$ as columns, we can find the covariance matrix $C$ with the standard formula for covariance matrices.
\[T = \mathbf{[t_{1}, t_{2}, t_{3}, t_{4}]}\] \[\begin{align} C = \frac{1}{m-1} \cdot {T}^\top T \end{align}\]Then since $C$ is a symmetric matrix, there exists a diagonal matrix $D$ (whose diagonals are eigenvalues of $C$) and an eigenvector matrix $P$, such that $C$ can be eigendecomposed into $P$ and $D$.
\[\begin{align} \exists \;\; P, D \;\; \text{s.t.} \;\; C = PDP^{-1} = PD{P}^\top \end{align}\]Amongst the eigenvalues of $C$, let $u_{1}$ be the eigenvector corresponding to the largest eigenvalue of $C$. Then for matrix $T$ defined above, we arrive at the first principal component, which is a vector denoted $y_1$.
\[y_{1} = T \cdot \vec{u}_{1}\]The second, third, and fourth principal components can be found in a similar way — that is, as the dot product between $T$ and the eigenvectors corresponding to the second, third, and fourth largest eigenvalues.
Summary
PCA:
- Bedrock dimensionality reduction method for statistics
- Commonly used in DS/ML applications to uncover low dimensional patterns in data following unknown distribution to build models
- Based on a 1901 paper by Karl Pearson, so PCA technique is old and well established.
Sources:
-
The Singular Value Decomposition. Rice University, 2 Apr. 2021, https://math.libretexts.org/@go/page/21868.
-
“Singular Value Decomposition (SVD): Mathematical Overview.” YouTube, uploaded by Steve Brunton, 19 Jan. 2020, www.youtube.com/watch?v=nbBvuuNVfco.
-
“6. Singular Value Decomposition (SVD).” YouTube, uploaded by MIT OpenCourseWare, 16 May 2019, www.youtube.com/watch?v=rYz83XPxiZo.
-
“Principal Component Analysis (PCA).” YouTube, uploaded by Steve Brunton, 28 Jan. 2020, www.youtube.com/watch?v=fkf4IBRSeEc.