# Fun with GMMs

*Generative Models* have been all the rage in AI lately, be it image generators like Stable Diffusion or text generators like ChatGPT. These are examples of fairly sophisticated generative systems. But whittled down to basics, they are a means to:

- (a) concisely represent patterns in data, in a way that …
- (b) they can
*generate*later what they have “seen”.

A bit like an artist who witnesses a scenery and later recreates it on canvas using her memory; her memory acting as a generative model here.

In this post, I will try to illustrate this mechanism using a specific generative model: the **Gaussian Mixture Model (GMM)**. We will use it to capture patterns *in images*. Pixels will be our data, and patterns are how they are “lumped” together. Of course, this lumping is what humans perceive as the image itself. Effectively then, much like our artist, we will use a generative model to “see” an image and then have it reproduce it later. **Think of this as a rudimentary, mostly visual, tutorial on GMMs**, where we focus on their representational capability. Or an article where I mostly ramble but touch upon GMMs, use of probabilities, all the while creating fancy art like the ones below!

My reason for picking GMMs is that they are an intuitive gateway to the world of generative modeling. They are also well-studied: the first paper that talks about GMMs was published in 1894! GMMs also make some of the math convenient; we’ll see examples of this soon. Library-wise scikit has a nice implementation of GMMs, which I have used for this post. scipy has convenient functions to work with the Gaussian distribution - which is also used here.

I know I said this is going to be “mostly” visual. I won’t be abnegating *all* math here : we will discuss *some* necessary math, that is just about enough to understand how to model images. Here’s the layout of the rest of this post.

## Crash Course in G and MM

Some basics first, so that we can comfortably wield GMMs as tools. If you are familiar with the Gaussian distribution (also known as the *Normal* distribution), GMMs, conditional and marginal probabilities, **feel free to skip ahead to the next section**.

### What are GMMs?

Think of GMMs as a way to approximate functions of special kind: *probability density functions (pdf)*. *pdf*s are a way to express probabilities over variables that take continuous values. For ex., you might want to use a *pdf* to describe the percentage \(x\) of humidity in the air tomorrow, where \(x\) can be any real number between 0 and 100, i.e., \(x \in [0, 100]\). The *pdf* value at \(x\), denoted by \(p(x)\) (often called the “density”), in some loose sense represents how likely the value is, but is **not** a probability value in itself. However, if you sum these values for a contiguous range or an “interval” of \(x\)s, you end up with a valid probability value. In the case of continuous values, such sums are *integrals*; so, the probability of humidity being within 40%-60% is \(\int_{40}^{60} p(x)dx\). Some well known *pdf*s are Gaussian (the star of our show, also known as the *Normal* distribution), Poisson and Beta.

Let’s talk about the Gaussian for a bit. For a variable \(x\), the density \(p(x)\) can be calculated using this formula:

\[p(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\]We won’t be using this formula. But note that it depends on the following two parameters:

- \(\mu\): the
*mean*or average value of the distribution. - \(\sigma\): the
*standard deviation*, e.g., how spread out is the distribution. Often we talk in terms of the square of this quantity, the*variance*\(\sigma^2\).

Some common shorthands:

- A Gaussian distribution with parameters \(\mu\) and \(\sigma\) is \(\mathcal{N}(\mu, \sigma^2)\).
- The
**standard Normal**is one with \(\mu=0\) and \(\sigma=1\), and is denoted by \(\mathcal{N}(0, 1)\). - Instead of saying \(p(x)\) is given by \(\mathcal{N}(\mu, \sigma^2)\), we may concisely write \(\mathcal{N}(x;\mu, \sigma^2)\) for the density of \(x\).
- To say \(x\) was
*sampled*from such a distribution, we write \(x \sim \mathcal{N}(\mu, \sigma^2)\). This is also equivalent to saying that the density of such \(x\)s are given by \(\mathcal{N}(x;\mu, \sigma^2)\) (the expression in the previous point). What is used depends on the context: do we want to highlight sampling or density?

Now, consider a practical scenario: you have some data, that doesn’t look anything like one of the commonly known *pdfs*. How would you mathematically express its density? In a time-honoured tradition, we will use something that we know to *approximate* what we don’t know. We’ll use a bunch or a **mixture** of superposed Gaussians (hence the name). To specify such a mixture, we need:

**Number**of components, \(K\).**Parameters**or*shape*of the components. We will denote component \(i\) by \(\mathcal{N}(\mu_i, \sigma_i^2)\), where we need to figure out the values for \(\mu_i\) and \(\sigma_i\).**Weights**\(w_i\) of the components, or how much a component contributes to the overall density. We require two properties here (we’ll see why soon): (a) \(0 \leq w_i \leq 1\), and (b) \(\sum_i^K w_i = 1\).

The resultant density \(p(x)\) of a point due to this mixture is given by the sum of individual Gaussian densities multiplied by their weights:

\[\begin{aligned} p(x) = \sum_{i=1}^{K} w_i \mathcal{N}(x;\mu_i, \sigma_i^2) \end{aligned}\]The physical interpretation is that \(x\) may have been generated by the \(i^{th}\) component with probability \(w_i\) (which explains the properties needed of \(w_i\) - they’re effectively probabilities), and then for this component, its density is given by \(\mathcal{N}(x;\mu_i, \sigma_i^2)\). The net *pdf* value for a specific value for \(x\) due to component \(i\) then is \(w_i \mathcal{N}(x;\mu_i, \sigma_i^2)\). Since \(x\) may be generated by any component, the overall \(p(x)\) is given by the sum above.

This turns out to be a powerful device, since with the right number of components, shapes and weights, one may approximate arbitrary distributions. The plot below shows different `2`

-component GMMs, constructed out of the same two components (red dashed lines), but with different weights (mentioned in the legend). You can see that this alone produces various different *pdf*s (solid lines).

Let’s go back to the original question. In the figure below, we consider some data points on the x-axis (shown with small vertical lines, known as a rugplot). The corresponding histogram is also shown. We try to approximate this *target* distribution with a GMM with `3`

components, shown with the red dashed lines. The black solid line shows the final *pdf* obtained using appropriate weights (not shown), which you’d note, is quite close to the histogram. This is a GMM in action.

Here we won’t discuss how we find the number of components, or their parameters, or their weights - the first one is a hyperparameter (so you either determine it based on some task specific metric or use a heuristic such as AIC), and the latter are typically determined using a popular algorithm known as Expectation Maximization (EM), for which we use scikit’s implementation. Here’s an interesting bit of trivia: the EM algorithm was proposed in 1977, which was much after the first use of GMMs (which, as mentioned above, was in 1894)!

Interestingly, it is also possible to use an *infinite* number of components, but that’s again something we won’t cover here.

### Sampling from a GMM

As a quick stopover, let’s ask ourselves the “inverse” question: how do we generate data from a GMM \(\sum_{i=1}^{K} \mathcal{N}(x;\mu_i, \sigma_i^2)\)? Think about the physical interpretation we discussed earlier; that’s going to be helpful now. To generate one data instance \(x\), we follow this two-step process:

- Use \(w_i\) as probabilities to pick a particular component \(i\).
- And then generate \(x \sim \mathcal{N}(\mu_i, \sigma_i^2)\).

Repeat the above steps till you have the required number of samples.

This ends up producing instances from the components in proportion to their weights. The following video shows this for a `2`

-component GMM. For sampling a point, first a component is chosen - this choice is temporarily highlighted in red. And then an instance is sampled - shown in blue. The black line shows the *pdf* of the GMM, but if you sampled enough instances and sketched its empirical distribution (using, say, a kdeplot), you’d get something very close to this *pdf*.

### Multivariate Gaussians

Of course, GMMs can be extended to an arbitrary number of \(d\) dimensions, i.e., when \(x \in \mathbb{R}^d\). We augment the notation to denote such a *pdf* as \(p(x_1, x_2, .., x_d)\), where \(x_i\) denotes the \(i^{th}\) dimension of a data instance \(x\). The GMM now is composed of *multivariate* Gaussians \(\mathcal{N}(\mu, \Sigma)\). The symbol for the mean is still \(\mu\), but now \(\mu \in \mathbb{R}^d\). Instead of the variance, we have a *covariance matrix*, where \(\Sigma_{i,j}\) - the entry at the \((i,j)\) index - denotes the covariance between values of \(x_i\) and \(x_j\) across the dataset. Since covariance doesn’t depend on the order of variables, \(\Sigma_{i,j}=\Sigma_{j, i}\) and thus \(\Sigma \in \mathbb{R}^{d \times d}\) is a *symmetric* matrix.

And yes, unfortunately the symbol for *summation* \(\sum\), which is `\sum`

in LaTeX, and the *covariance matrix* \(\Sigma\) - `\Sigma`

- look very similar. I was tempted to use a different symbol for the latter, but then I realized it might just be better to bite the bullet and get used to the common notation.

The GMM operates just the same, i.e., for \(x \in \mathbb{R}^d\), we have:

\[\begin{aligned} p(x) = \sum_{i=1}^{K} w_i \mathcal{N}(x;\mu_i, \Sigma_i) \end{aligned}\]### Contour Plots

We want to be able to visualize these high-dimensional GMMs.
In general, faithful visualization of high-dimensional structures is hard, but, fortunately, there is a well known technique applicable to two-dimensions: *contour plots*. In the figure below, (a) shows some data points in 2D. I have used a `2`

-component two-dimensional GMM to model this data. Instead of showing the *pdf* on a \(z\)-axis in a 3D plot, we calculate the values \(z=p(x_1, x_2)\) values at a whole lot of points in the input space, and then *connect* the points with identical \(z\) values. We might also color the lines formed by these connections, to indicate how high or low \(z\) is. These lines are knowns as *contours* and (b) shows such a *contour* plot.

Note the legend in (b) - it has negative numbers. This is because (b) actually shows contours for \(log\; p(x_1, x_2)\) - which is a common practice (partly because it leads to better visualization, and partly because the quantity \(log\; p\) itself shows up in the math quite a bit).

### Some Properties

So far the only thing we have used the GMM for is to sample data. This is useful, but sometimes we want to use the model to answer specific questions about the data, e.g., what is the mean value of the data? We’ll briefly look at this aspect now.

**Expectation**

The **expectation** or the **expected value** of variable is its average value weighted by the probabilites of taking those values:

The subscript \(p\) in \(E_p[x]\) denotes the distribution wrt which we assume \(x\) varies; we’ll drop this since its often clear from context. While the integral ranges from \({-\infty}\) to \({+\infty}\), it just indicates that we should account for all possible values \(x\) can take.

For a Normal distribution, \(E[x]\) is just its mean \(\mu\). We can use this fact to conveniently compute the expectation for a GMM:

\[\begin{aligned} E[x] &= \int_{-\infty}^{+\infty} x \Big( \sum_{i=1}^{K} w_i \mathcal{N}(x;\mu_i, \Sigma_i) \Big) dx \\ &= \int_{-\infty}^{+\infty} \Big( \sum_{i=1}^{K} w_i x \mathcal{N}(x;\mu_i, \Sigma_i) \Big) dx \\ &=\sum_{i=1}^{K} \Big( \int_{-\infty}^{+\infty} w_i x \mathcal{N}(x;\mu_i, \Sigma_i) dx \Big) \\ &=\sum_{i=1}^{K} \Big( w_i \int_{-\infty}^{+\infty} x \mathcal{N}(x;\mu_i, \Sigma_i) dx \Big) \\ &=\sum_{i=1}^{K} w_i \mu_i \\ \end{aligned}\]**Conditional Distributions**

It’s often useful to talk about the behavior of certain variables while holding other variables constant. For example, in a dataset comprised of `Age`

and `Income`

of people, you might want to know how `Income`

varies for `25`

year olds. You’re technically looking for a *conditional* distribution: the distribution of certain variables like `Income`

, given a *condition* such as `Age=25`

over the other variables.

The density obtained after conditioning on a certain subset of dimensions, e.g., \(x_{d-2}=a,x_{d-1}=b,x_{d}=c\), is written as (note the `|`

character):

What do such conditonal distrbutions look for the Gaussian? Consider a simple case of two variables \(x_1, x_2\), and conditioning on \(x_1\):

Look at the various conditional distributions \(p(x_2 \vert x_1)\), for different \(x_1\). In the figure, this is equivalent to “slicing” through the Gaussian at specific values of \(x_1\), as shown by the three solid lines. Do these distributions/solid lines suspiciously look like Gaussians? They are! The conditional distributions of a Gaussian are also Gaussian. We wouldn’t go into proving this interesting property, but its important to know it is true.

**NOTE**:
the above animation only serves to provide a good mental model and doesn’t present the whole picture. You can delve deeper into nuances this visualization misses here, but if this is your first encounter with the notion, you can skip it for now.

But what does this say about GMMs? Because of the above property, the conditional distribution for a GMM is also a GMM. This new GMM has different weights and components. You can find a derivation here. This insight will be helpful when we deal with color images.

OK, our short stop is over … on to actual problem-solving!

## Why Images?

If GMMs are about modeling data, how do images fit in? Let’s take the example of black-and-white images: you can think of every single black pixel as a data point existing at its location. Look at the image below - the pixels that make up the shape of the dog are the data we will use to fit our GMM.

For color images things get a bit complicated - but not that much, and we’ll find a way to deal with it later.

## Black and White Images

Let’s start with black-and-white images since they are easier to model. Our training data for the GMM is essentally the collection of all coordinates where we have black pixels, as shown below:

The white regions in the above image DO NOT contribute to the dataset. The learned GMM only knows where data is present. In this case the components are 2-dimensional Gaussians since we are only modeling coordinates in a 2D plane.

### Sampling

We now generate our first images <drum roll>! The recipe is simple:

- We create our training data as shown above, and fit a 2-dimensional GMM. We set the number of components \(K=500\). No particular reason, I just picked an arbitrary high number.
- Create a contour plot for the GMM - which gives us our first fancy image. Art!

Going to back to our previous example, we have:

We can also generate the image by *sampling* from the GMM. We have seen how to sample earlier - its pretty much the same now, except we have more components and each point we sample has `2`

dimensions, and thus can be placed on a plane.

We’ll sample multiple points \((x_1, x_2)\), and at these locations, we place a blue pixel. We show what these samples look like for different sample sizes in (a) and (b), for ~20k and ~40k points respectively.

OK, lets dial up the complexity a bit. We’ll use an image of a butterfly this time, that has more details. Plots (a) and (b) are the same as before, but we increase the sample size in (c) and (d) to ~50k and ~100k respectively to handle the increased image details.

Next, we will make the image even harder to learn - which effectively means the image has a lot of detail, which makes it computationally challenging for the model to learn. We’ll use an image of the famous print Great Wave off Kanagawa. Here, you can see the outputs have lost quite a bit of detail.

These look cool - but there is another way to use our GMMs to generate these images. Let’s look at it next.

### Grid-plotting

Instead of sampling from the GMM, we can go over a grid of coordinates \((x, y)\) in the input space, and color the pixel at the location based on the value for \(p(x, y)\). The below image shows the process. The blank circles show locations yet to be colored.

The above image simplifies one detail: \(p(x, y)\) values are *pdf* values, so do not have an upper-bound (but our grayscale ranges within `[0,1]`

). We collect all these values for our grid of points, scale them to lie in the range `[0,1]`

before we plot our image.

Let’s start with the image of the dog. We see its almost perfectly recovered! We probably expect this by now given its a simple sketch.

However, notice the width of the lines. Do they seem thicker? That’s because pixels in the close neighborhood of the actual sketch lines also end up getting a black-ish color - afterall, we are using a probabilistic model. As details in an image grow, you will see more of this effect. Below, we have the butterfly image - you can actually see smudging near the borders, making it look like a charcoal sketch.

The waves image is tricky because of the high amount of detail - and its not surprising that we obtain a heavily smudged image, to the point it looks like a blob!:

But fear not - we have another trick up our collective sleeves!

### Custom Mapping for Grid Plots

How do we do better? A simple hack is to change the mapping of probabilities to color values, so that the numerous points with low probabilities are visibly *not* dark. We can replace the implicit linear mapping we’ve been using with a bespoke non-linear mapping that achieves this outcome - in the figure below, (a) shows this with a piecewise linear function, where the dashed orange line is the default linear mapping (shown for reference). The x-axis shows the scaled *pdf* values of pixels, and they y-axis shows the score they are mapped to on the gray colormap, which ranges between `[0, 255]`

. Our version allows for very dark colors for only relatively high scaled *pdf* values.

Below we compare the linear mapping in (a) to our custom mapping in (b). In (b) we do end up recovering some details; you can actually see Mt. Fuji now! We can also explore playing with the number of components, point sizes, alpha values etc.

There are probably numerous ways to tweak the steps in this section to make the images more aesthetic (and it can get quite addictive!), but lets stop for now and instead direct our attention to modeling color images.

## Color Images

Finally, the pièce de résistance!

The challenge with color images is we can’t just fit a model to the locations of certain pixels: ALL locations are important, because they all have colors. Additionally, we also need to model the actual colors - in the previous case, this is not something we explicitly modeled.

Let’s proceed by breaking down the problem. We can think of every location in the image as having 3 colors: red (R), green (G), blue (B). These are also known as *channels*. In the image below, the RGB channels for a color image are shown.

We need to model these RGB values at each location. For this we use a *5-dimensional* GMM, where the variables are the \(x, y\) coordinates followed by the \(r, g, b\) values present at that coordinate. The following image shows how the training data is created - a notable difference from the black-and-white case is that *all* locations are used.

These GMMs are going to be larger because of both (a) a higher number of dimensions, and (b) a lot more points to model. I am also going to use a larger number of components since there is more information to be learned.

### Sampling

So we have our 5-dimensional GMM now (trained with \(K=1000\) components this time). We can generate our first set of images by sampling as before: sample a point \((x, y, r, g, b)\), and color the pixel at location \((x,y)\) with the colors \((r,g,b)\). A minor practical detail here is that you would need to clip the \(r,g,b\) sampled values to lie in the right range - because the GMM, not knowing these are colors, might generate values beyond the valid range of `[0, 255]`

.

But it works. Behold!

### Grid-plotting

Remember we were able to generate images earlier by plotting points on a grid? How do we do that here? Well, the answer is a little complicated but doable.

Here’s the challenge: at a particular coordinate \((x_1, y_1)\), you can ask your GMM for a conditional *distribution*: \(p(r,g,b \vert x=x_1, y=y_1)\). Yes, you get a full-blown different GMM over \(r,g,b\) values at every coordinate (because, like we discussed, the conditional of a GMM is also a GMM). How do we color a location with a distribution? - well, here I just use its mean value. Effectively, given a specific coordinate \((x_1, y_1)\), I calculate the mean values for the \((r,g,b)\) tuple:

The below schematic shows this is what that looks like:

This is computationally more expensive than the black-and-white case. When you use the strategy on actual images, you end up with images with “smudgy brush strokes” - something you may have noticed with other AI-generated art. But the results are not bad at all!

## Notes

A list of miscellaneous points that I kept out of the main article to avoid cluttering.

- I fixed a lot of the GMM parameters to some sweet-spot between the training being not memory intensive and the outputs being aesthetically pleasing. Of course, lots of tweaking possible here.
- Can we use discriminative classifiers? Yes - as an example, for the black-and-white case, given a location \((x, y)\) you could predict a label in \(\{0, 1\}\) for white and black colored pixels respectively. Here’s an example where I used LightGBM as the classifier:

- Images were created using: matplotlib, seaborn, Inkscape, ezgif, gifsicle, Shutter.
- Thanks to pixabay for making their images available royalty-free. Specifically, I used these images:
- I extracted black-and-white images out of the original color images of the Dog and Butterfly.
- The images in the color section are from here: Kingfisher, Springbird, Boat in ocean.

- As mentioned earlier, the image of “The Great Wave off Kanagawa” is from wikipedia.
- For various properties of the Gaussian, Bishop’s book “Pattern Recognition and Machine Learning” is a commonly cited reference (PDF). Properties of Gaussians can be found in Section 2.3 and GMMs are discussed in Section 9.2.

This was a fun article to write - so much fun that I kept playing around with various parameter settings and broader creative ideas (sadly, mostly unfinished) to the point that this has been languishing as a draft for more than a year now! GMMs themselves are not used as much anymore, but I think of them as a good place to start if you want to understand how to use distributions for practical problems. GMMs aside, it is good to be familiar with the properties of Gaussians since they keep cropping up in various places such as *Probabilistic Circuits*, *Gaussian Processes*, *Bayesian Optimization* and *Variational Autoencoders*.