
Principal Component Pursuit

This project is hosted on GitHub.


Brief summary #

While looking for ways to work with and extract information from cryo-ET tomograms (noisy 3D images) in the summer of 2024, I read a book about ways of making the most of messy data using linear algebra and optimization. In the process, I learned about Robust Principal Component Analysis (RPCA), which can be performed with Principal Component Pursuit (PCP).

Here I apply Principal Component Pursuit to a video I shot in my lab. With nothing more than linear algebra and a simple convex optimization problem, this method separates the video into its background and foreground.

  • The top video is the original, which I shot of myself in my lab.
  • The middle video is the “background” (not moving) component of the video.
  • The bottom video is the “foreground” (moving) component of the video.

This project is inspired by a common application of PCP: identifying video segments in surveillance camera feeds where something of interest is happening. The video above simulates this in just a few seconds.

I have yet to successfully apply this method to tomograms, but my advisor and I thought it was fascinating, so we feel like we succeeded anyway! I implemented the method in Julia, a language I am coming to love. See my implementation code on GitHub.

If you would like to learn more details, read the longer summary below.

Longer summary #

What is this? #

This is an implementation and demonstration of Principal Component Pursuit, a way to solve the Robust Principal Component Analysis problem, which is here applied to a short video I shot.

What am I seeing? #

I recorded the top video in my lab. I wanted a video with a mostly static background and something moving in the foreground.

Using the method, which I will describe next, I separate the video into a static component and a moving component. The static component is the middle video. The moving component is the bottom video. In other words, the bottom video added to the middle video yields the top video.

How does it work? #

Each frame of a grayscale video can be thought of as a matrix of grayscale values. For each frame, I take that matrix and flatten it into a vector. Thus, each frame of the video can be represented as a long vector with as many elements as there are pixels in a frame. I will call this vector a “frame vector”.

By representing the frames as vectors, the entire video can itself be represented as a matrix. This is done by stacking all of the frame vectors side by side into a huge matrix, with as many columns as there are frames in the video. I will call this matrix a “video matrix”.
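As a concrete sketch of this construction (in Python/NumPy for illustration; my actual implementation is in Julia, and these toy frames are made up):

```python
import numpy as np

# Hypothetical example: a 4-frame "video" of 3x2-pixel grayscale frames.
frames = [np.random.rand(3, 2) for _ in range(4)]

# Flatten each frame (a matrix) into a frame vector, then stack the
# frame vectors side by side: one column per frame.
Y = np.column_stack([f.ravel() for f in frames])

print(Y.shape)  # (6, 4): one row per pixel, one column per frame
```
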

In a video with a mostly static background and something moving in the foreground, the video matrix is almost low rank, since the frame vectors are mostly the same (since most of the pixels don’t change). But it isn’t, because of the movement in the foreground. Nevertheless, we can find a video matrix that is nearly equal to the original video matrix but is in fact low rank. In simpler terms, we can extract the background of the video.
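A tiny NumPy illustration of this near-low-rank structure (toy numbers, not real video data):

```python
import numpy as np

# A "static" 5-frame video: every frame vector is identical, so the
# video matrix has rank 1.
background = np.arange(6, dtype=float)
Y_static = np.column_stack([background] * 5)
print(np.linalg.matrix_rank(Y_static))  # 1

# Perturb a single pixel in one frame (a bit of moving foreground):
# the matrix is no longer rank 1, but it is still *nearly* low rank.
Y_moving = Y_static.copy()
Y_moving[2, 3] += 10.0
print(np.linalg.matrix_rank(Y_moving))  # 2
```
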

Let \(\bf{Y}\) be the original video matrix. We want a video matrix \(\bf{L}\) that is nearly equal to \(\bf{Y}\), differing only by a sparse (meaning most of the elements are 0) matrix \(\bf{S}\). In other words, we want to find \(\bf{L}\) and \(\bf{S}\) such that \(\bf{Y} = \bf{L} + \bf{S}\), while minimizing \(\text{rank}(\bf{L})\) and \(\Vert\bf{S}\Vert_0\) (where \(\Vert\cdot\Vert_0\) counts the number of non-zero elements of its input; despite the notation, it is not a proper mathematical norm).

One could set this up as an optimization problem

$$ \begin{aligned} \text{minimize}\quad&\text{rank}(\bf{L}) + \lambda \Vert\bf{S}\Vert_0 \\ \text{subject to}\quad&\bf{Y} = \bf{L} + \bf{S} \end{aligned} $$

for some tuning parameter \(\lambda > 0\). But neither \(\text{rank}(\cdot)\) nor \(\Vert\cdot\Vert_0\) is convex, which makes this problem very difficult (NP-hard in general) to solve.

Rather, we set up the problem using convex surrogate norms:

$$ \begin{aligned} \text{minimize}\quad&\Vert\bf{L}\Vert_* + \lambda \Vert\bf{S}\Vert_1 \\ \text{subject to}\quad&\bf{Y} = \bf{L} + \bf{S}, \end{aligned} $$

where \(\Vert \cdot \Vert_*\) is the nuclear norm, meaning, the sum of the singular values of the input matrix, and \(\Vert \cdot \Vert_1\) is the entrywise 1-norm, meaning, the sum of the absolute values of all entries (the \(\ell_1\) norm of the matrix treated as one long vector, not the induced matrix 1-norm). (See “Note: Why these norms?” below.)

This new problem is convex! It is easy to solve with off-the-shelf convex optimizers such as CVXPY (for Python) or Convex.jl (for Julia), though I opted to implement the solver myself.

Solving this convex problem, we find video matrices \(\bf{L}\) and \(\bf{S}\) that, for all intents and purposes, separate the original video matrix \(\bf{Y}\) into “background” and “foreground” components respectively. Problem solved!

Note: Why these norms? #

First, we want to minimize the rank of \(\bf{L}\). The rank of a matrix is the number of its nonzero singular values, so minimizing the nuclear norm, which sums the singular values, encourages most of them to be exactly zero. Perhaps this sheds some light on why the nuclear norm makes sense here.

Second, we want to minimize the number of nonzero elements of \(\bf{S}\) (what I called \(\Vert \cdot \Vert_0\) above). The entrywise 1-norm, the sum of the absolute values of the entries, is the standard convex surrogate for this count, and minimizing it tends to drive most entries of \(\bf{S}\) exactly to zero. Hopefully this clarifies why the 1-norm is a reasonable choice.
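The sparsity-inducing behavior of the 1-norm is easy to see through its proximal operator, soft-thresholding, which appears inside most PCP solvers. A small NumPy demonstration (toy numbers of my own choosing):

```python
import numpy as np

x = np.array([3.0, -0.2, 0.05, -4.0, 0.1])
tau = 0.5

# Proximal operator of tau * ||.||_1: soft-thresholding. Entries with
# magnitude below tau become *exactly* zero, which is the mechanism by
# which the 1-norm penalty produces sparse solutions.
soft = np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
print(np.count_nonzero(soft))  # 2: only the two large entries survive

# Compare: the prox of the squared 2-norm merely scales every entry
# toward zero, so nothing becomes exactly zero and nothing is sparse.
ridge = x / (1.0 + tau)
print(np.count_nonzero(ridge))  # 5
```
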

For more formal justification for these choices of norm, consult Wright and Ma’s textbook High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications.