Plot 3D Gaussian distribution? The harder part is plotting the 3D ellipsoid, which can be done by computing the axes (radii) of the ellipsoid from the eigenvalues of the covariance matrix. At the same time, we get the corresponding eigenvectors, which tell us how to rotate the ellipsoid. The function ellipsoid(.) only generates a canonical (axis-aligned) ellipsoid, so we need to rotate the canonical ellipsoid using the eigenvectors. That is it. Here is some code adapted from Rajiv Singh’s version.
% plot 3D ellipsoid
% developed from the original demo by Rajiv Singh
% http://www.mathworks.com/matlabcentral/newsreader/view_thread/42966
% 5 Dec, 2002 13:44:34
% Example data (Cov=covariance,mu=mean) is included.
Cov = [1 0.5 0.3; 0.5 2 0; 0.3 0 3];
mu = [1 2 3]';
[U,L] = eig(Cov);
% L: eigenvalue diagonal matrix
% U: eigenvector matrix, each column is an eigenvector
% For N standard deviations spread of data, the radii of the ellipsoid will
% be given by N*SQRT(eigenvalues).
N = 1; % choose your own N
radii = N*sqrt(diag(L));
% generate data for "unrotated" ellipsoid
[xc,yc,zc] = ellipsoid(0,0,0,radii(1),radii(2),radii(3));
% rotate data with orientation matrix U and center mu
a = kron(U(:,1),xc);
b = kron(U(:,2),yc);
c = kron(U(:,3),zc);
data = a+b+c;
n = size(data,2);
x = data(1:n,:)+mu(1);
y = data(n+1:2*n,:)+mu(2);
z = data(2*n+1:end,:)+mu(3);
% now plot the rotated ellipsoid
% sc = surf(x,y,z); shading interp; colormap copper
h = surfl(x, y, z); colormap copper
title('actual ellipsoid represented by mu and Cov')
axis equal
alpha(0.7)
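A quick way to convince yourself that the rotation is right: every point on the N-sigma ellipsoid should sit at Mahalanobis distance N from the mean. The sketch below is my own check, not part of Rajiv Singh's demo. It reuses the same example Cov and mu, and it rotates with a plain matrix product instead of kron, which should give the same points. The names P, D, d2 and maxErr are made up for this illustration.

```matlab
% Sanity check (sketch): points on the N-sigma ellipsoid should satisfy
% (p - mu)' * inv(Cov) * (p - mu) = N^2.
Cov = [1 0.5 0.3; 0.5 2 0; 0.3 0 3];
mu  = [1 2 3]';
N   = 1;
[U, L] = eig(Cov);                 % columns of U are orthonormal eigenvectors
radii  = N * sqrt(diag(L));
[xc, yc, zc] = ellipsoid(0, 0, 0, radii(1), radii(2), radii(3));
% Stack the canonical surface points as a 3-by-M matrix, rotate, then shift.
P = U * [xc(:)'; yc(:)'; zc(:)'] + repmat(mu, 1, numel(xc));
D = P - repmat(mu, 1, size(P, 2)); % centered points
d2 = sum(D .* (Cov \ D), 1);       % squared Mahalanobis distance per point
maxErr = max(abs(d2 - N^2));       % should be at round-off level
```

If maxErr is not tiny, the eigenvectors and radii are probably paired in the wrong order.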
From my previous post, we know that the update equation for the covariance matrix might not be numerically stable because the matrix can fail to be positive definite. An easy way to stabilize the algorithm is to add a relatively small positive number, a.k.a. a loading factor, to the diagonal entries of the covariance matrix. But does the loading factor affect the likelihood or the convergence of the EM algorithm?
Apparently, adding the loading factor to the covariance matrix does affect the log-likelihood value. I ran some experiments on the issue; the results are shown in the learning curve (log-likelihood curve) of ITSBN with the EM algorithm below. The factor is applied to the matrix only when the determinant of the covariance matrix falls below a fixed threshold. Five different factor values are used in this experiment. The results show that the learning curves are still monotonically increasing* and level off near the end. Furthermore, the level-off value is strongly associated with the value of the factor: the bigger the factor, the smaller the level-off value. This suggests that we should pick the smallest factor possible in order to stay as close as possible to the ideal learning curve. Note that the loading factor is not added to the covariance matrix until the second iteration.
* Though I don’t think this is always the case: because the factor is not consistently added to the matrix, adding it at some iteration might pull the log-likelihood down to a lower value. However, the experiments show that the log-likelihood is still monotonically increasing when the factor is big.
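The conditional loading step described above can be sketched in a few lines. The threshold detTol and the factor loading below are hypothetical values picked for illustration; they are not the settings used in the experiment.

```matlab
% Conditional diagonal loading (sketch with made-up numbers).
Sigma   = [1 2; 2 4];     % rank-deficient example covariance, det = 0
detTol  = 1e-10;          % hypothetical determinant threshold
loading = 1e-3;           % hypothetical loading factor
if det(Sigma) < detTol
    % add the loading factor to the diagonal entries only
    Sigma = Sigma + loading * eye(size(Sigma, 1));
end
[~, p] = chol(Sigma);     % p == 0 means Sigma is positive definite
```

The determinant test mirrors the rule in the text. For larger matrices a Cholesky or condition-number test may be a more robust trigger, but that is a design choice outside this post.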
There are many plausible reasons why a covariance matrix can collapse during EM. One common reason is that at least one Gaussian component has no cluster members in its vicinity. This situation occurs when the data clusters are very narrow relative to the distance between clusters; in other words, when the intra-cluster distance is much smaller than the inter-cluster distance. Let’s assume we have 3 data clusters A, B and C, with A and B almost merged into each other and very far away from C. We want to cluster the data into 3 components using the EM algorithm. Suppose the 3 centroids are initialized in the middle of the space among the three clusters, and it happens that one centroid has no “nearest” members. This also means that 2 components are quite sufficient to model the whole data set, rather than 3. Let’s assume the deserted centroid is labeled by the ID ‘2’. In that case, the posterior marginal distribution of each data sample will have a large value for either label 1 or label 3, but no sample gives a large value for label 2. In fact, to be more precise, the posterior marginal for label 2 will be virtually zero for all the data samples. Unfortunately, the update equation for a covariance matrix weights each atom (i.e., $(x_n - \mu_k)(x_n - \mu_k)^\top$) of the updated covariance matrix by its corresponding class posterior marginal $p(z_n = k \mid x_n)$, and hence gives a zero matrix for the covariance matrix of class label 2. So, as you have seen, using EM to cluster really-far-separated data is not always an easy case.
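Here is a minimal sketch of that effect on toy 2-D data. Everything in it is an assumption made for illustration: the data points, the three centroids, the shared isotropic variance sigma2 and the equal mixing weights. Component 2 sits in the empty middle, so its posterior underflows to exactly zero for every sample, and the responsibility-weighted scatter comes out as a zero matrix.

```matlab
% Toy data: two tight clusters far apart (illustrative values only).
X  = [0 0; 0.1 0; 0 0.1; 100 100; 100.1 100; 100 100.1];   % 6-by-2
mu = [0 0; 50 50; 100 100];   % component 2 has no nearby members
sigma2 = 0.01;                % shared isotropic variance (assumption)
K = size(mu, 1);
n = size(X, 1);
logp = zeros(n, K);
for k = 1:K
    D = X - repmat(mu(k, :), n, 1);
    % log density up to a constant that is the same for every component
    logp(:, k) = -sum(D.^2, 2) / (2 * sigma2);
end
% E-step: posterior of each component given each sample (equal weights).
logp  = logp - repmat(max(logp, [], 2), 1, K);   % stabilise before exp
gamma = exp(logp);
gamma = gamma ./ repmat(sum(gamma, 2), 1, K);
% M-step pieces for component 2: weighted scatter and effective count.
D2 = X - repmat(mu(2, :), n, 1);
S2 = D2' * (D2 .* repmat(gamma(:, 2), 1, 2));    % sum_n gamma_n2 * atom_n
N2 = sum(gamma(:, 2));
```

Both S2 and N2 come out as zero here, so the normalized update S2 / N2 is 0/0. That is the point where a loading factor, or re-initializing the deserted component, becomes necessary.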
A reviewer suggested that my colleague and I apply our accepted work to a real-world application. “Bro, we’ve got less than 4 days to apply our work to a real-world problem…what should we do?” We spent 10 minutes discussing several possible problems such as automatic video segmentation, CD cover searching, human gesture recognition and some other funny-crazy ideas. Finally, given our curiosity and the time constraint, we ended up with static hand posture recognition. Fortunately, the data set is not too difficult to find on the internet. A million thanks to Triesch and Von Der Malsburg for the wonderful hand posture database; that saved our lives.
Originally we found that the divergence between 2 Gaussian mixture models (GMMs) can be computed efficiently using the Cauchy-Schwarz divergence ($D_{CS}$), as it gives a closed-form expression for any pair of GMMs. Of course, we can’t get this awesome property with the Kullback-Leibler divergence ($D_{KL}$)…why? Read our paper ^_^ Yay! In short, the $D_{KL}$ formulation does not allow the Gaussian integral trick, hence a closed-form expression is not possible.
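To give a feel for the Gaussian integral trick, here is a small sketch for 1-D mixtures only. It relies on the standard identity that the integral of N(x; a, s) N(x; b, t) over x equals N(a; b, s + t), and on the usual definition of the Cauchy-Schwarz divergence, -log( ∫pq / sqrt(∫p² ∫q²) ). This is not the full expression from the paper. The mixture parameters and the helper names npdf and overlap are made up for illustration.

```matlab
% Two hypothetical 1-D mixtures: weights w, means m, variances v.
w1 = [0.5 0.5];  m1 = [0 3];  v1 = [1 0.5];
w2 = [0.3 0.7];  m2 = [1 4];  v2 = [0.8 1.2];
% Gaussian density N(x; m, v) with variance v (element-wise).
npdf = @(x, m, v) exp(-(x - m).^2 ./ (2 * v)) ./ sqrt(2 * pi * v);
% Integral of (mixture a) * (mixture b): a double sum over component pairs,
% each pair contributing wa_i * wb_j * N(ma_i; mb_j, va_i + vb_j).
overlap = @(wa, ma, va, wb, mb, vb) sum(sum((wa(:) * wb(:)') .* ...
    npdf(repmat(ma(:), 1, numel(mb)), repmat(mb(:)', numel(ma), 1), ...
         repmat(va(:), 1, numel(vb)) + repmat(vb(:)', numel(va), 1))));
Ipq = overlap(w1, m1, v1, w2, m2, v2);
Ipp = overlap(w1, m1, v1, w1, m1, v1);
Iqq = overlap(w2, m2, v2, w2, m2, v2);
Dcs   = -log(Ipq / sqrt(Ipp * Iqq));   % Cauchy-Schwarz divergence of p and q
Dself = -log(Ipp / sqrt(Ipp * Ipp));   % divergence of p from itself
```

Every term is a plain Gaussian evaluation, so no numerical integration or sampling is needed. The log of a sum of Gaussians inside the KL integral is what blocks the same trick there.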
In this work, we use a minimum divergence classifier to recognize the hand postures. Please see our paper for more details. We finished our experiment on the second day, so we had some time left to make a fancy plot summarizing our work, which we would like to share with you below. The classification accuracies using $D_{CS}$ and $D_{KL}$ are 95% and 92% respectively, and the former method also gives a much better computational run-time, about 10 times faster. The figures below also suggest that our proposed method outperforms $D_{KL}$ when it comes to clustering, as the proposed method gives more discriminative power.
 K. Kampa, E. Hasanbelliu and J. C. Principe, “Closed-form Cauchy-Schwarz pdf Divergence for Mixture of Gaussians,” Proc. of the International Joint Conference on Neural Networks (IJCNN 2011). [pdf] [BibTex]
We make our code available to anyone under a Creative Commons license [.zip]
We also collected some interesting links to hand posture/gesture databases here:
The following papers and documents can be helpful:
A Bimodal Face and Body Gesture Database for Automatic Analysis of Human Nonverbal Affective Behavior
Hatice Gunes and Massimo Piccardi, Computer Vision Research Group, University of Technology, Sydney (UTS)
A Color Hand Gesture Database for Evaluating and Improving Algorithms on Hand Gesture and Posture Recognition
Farhad Dadgostar, Andre L. C. Barczak, Abdolhossein Sarrafzadeh
Hand Detection and Gesture Recognition using ASL Gestures
Supervisor: Andre L. C. Barczak; Student: Dakuan Cui
A while ago, I was amazed by the image segmentation results from Gaussian Mixture Models (GMMs), because a GMM gives pretty good results on normal/natural images. There are some results in my previous post. Of course, GMM is not the best tool for this job, but hey, look at its speed and how easy it is to implement; it’s pretty good in that sense. However, one problem with GMM is that we need to pick the number of components. In general, the more components we assume, the better the log-likelihood of the GMM. In that case, we would simply send the number of components to infinity, right? Well…nothing good comes out of that, because the segments would not be meaningful; in fact, we would overfit the data, which is bad.
Therefore, the Bayesian Information Criterion (BIC) is introduced as a cost function composed of 2 terms: 1) the negative log-likelihood and 2) the model complexity. Please see my old post. You will see that BIC prefers a model that gives a good fit while its complexity remains small. In other words, the model whose BIC is smallest is the winner. Simple as that. Below are some results from sweeping the number of components from 2 to 10. Unfortunately, the results are not what I (and maybe other readers) desire or expect. As a human, my attention just focuses on the skier, snow, sky/cloud and perhaps, in the worst case, the shadows, so the suitable number of components should be 3-4. Instead, BIC picks the 9-component model as the winner, which is far from what I expected. So, can I say that straightforward BIC might not be a good criterion for image segmentation, in particular with respect to human perception? Well…give GMM-BIC a break: I think it is too early to blame BIC, because I haven’t used other, more sophisticated features like texture, shape or color histograms, which might improve the results of GMM-BIC. The question is: what are the suitable features and number of components that make the segmentation results of GMM-BIC similar to human perception? MATLAB code is made available here.
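For readers who want to see the two BIC terms side by side, here is a small sketch using one common form, BIC = -2 log L + p log(n), where p counts the free parameters of a full-covariance GMM. This is not the linked code. The feature dimension, the pixel count and especially the log-likelihood values are made-up numbers for illustration, not results from the skier image.

```matlab
% BIC sweep (sketch): all numbers below are hypothetical.
d    = 3;                                 % feature dimension, e.g. RGB (assumption)
n    = 10000;                             % number of pixels (made up)
Ks   = 2:5;                               % candidate numbers of components
logL = [-52000 -50500 -50100 -50080];     % made-up log-likelihood per K
% Free parameters: (K-1) mixing weights + K*d means + K*d*(d+1)/2 covariances.
p    = (Ks - 1) + Ks * d + Ks * d * (d + 1) / 2;
bic  = -2 * logL + p * log(n);            % smaller is better
[~, idx] = min(bic);
bestK = Ks(idx);                          % winner under BIC
```

With these numbers the log-likelihood keeps improving up to K = 5, but the gain from 4 to 5 is too small to pay for the extra parameters, so BIC picks K = 4. With many pixels, the log-likelihood term can easily dominate the penalty, which may be one reason BIC prefers more components than a human would.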