Monday, 6 June 2016

Expectation Maximization for Gaussian Mixture Model in OpenCV

I recently wrote code for Gaussian Mixture Model (GMM) based clustering in C++. As always, I found it much convenient to use OpenCV for manipulating matrices. Although there already exist an implementation of Expectation Maximization-based GMM, I tried to understand it by writing my own implementation.

The basic idea of GMM is to first randomly assign each sample to a cluster. This provides initial mixture model for clustering. This is then optimized using Expectation - or the probability/score of assigning each sample to each component in GMM - and Maximization - or updating the characteristics of each mixture component with the given probability/score . An attractive attribute of GMM is its ability to cluster data that does not have clear boundaries for clusters. This is achieved by having a probability/score for each sample from each cluster component.

This implementation is accompanied by a very simple plotting code that I wrote in OpenCV C++. This helps visualize the clusters and their apparent boundaries. As always this code can be accessed from my GitHub repo: OpenCVGMM

Now let's have a look at how GMM works for different clustering cases. For simplicity, I am generating the data by randomly sampling from two Gaussians with different variances and means. The separating boundary is made ambiguous on purpose by placing the mean of the two sources close to each other and/or increasing the variance.

If there is a clear separation between the two classes, then the EM optimization converges quite well.

GMM EM Clustering with good separation
Now lets see what happens when we make the boundary a little ambiguous by increasing the variances of both sources.
GMM EM Clustering with an ambiguous boundary
Further increasing the variance of  each source totally blurs the boundary. However we see that we can still recover the two clusters although with much more difficulty.
GMM EM Clustering with increased uncertainty on boundary
And finally, we see that the GMM EM algorithm totally fails when there are no clusters to form ( in this case having the same source mean).
GMM EM Clustering of two sources with same means and variances

If you would like to get more insight into how GMM EM works, please have a look at the source code or follow the link to this presentation that I used for understanding GMM myself.