## Gaussian Mean Shift Clustering

I have heard a lot about Gaussian mean shift clustering, as it is very popular, easy to implement, and one of the best methods for object tracking in video. Recently I have been working on structure learning, and one of my recent works claims (in fact, boasts, wahahahaha…just kidding) that structure learning is a more general case of classic clustering algorithms like k-means, LBG vector quantization, mean shift and the EM algorithm. So I have to find some algorithms to compare with mine. The algorithm I compare against has to come up with the number of clusters automatically once the covariance matrix or kernel width is known, so I thought of LBG-VQ and Gaussian mean shift (GMS). When deriving GMS, I found that it is very similar to the mean update equation of the EM algorithm, except that in EM we try to estimate the mean $\mu_j$ of each component, so we take the derivative w.r.t. $\mu_j$. Unlike EM, in GMS we take the derivative w.r.t. each sample point $x$ in the space. However, in the Gaussian distribution we have the exponent term $-\frac{1}{2}(x-\mu_j)^\top \Sigma^{-1} (x-\mu_j)$, whose derivatives w.r.t. $x$ and $\mu_j$ have opposite signs; therefore, when equating the derivative to zero, we get a similar update equation.

Equating the derivative to zero gives the fixed-point update

$$x^{(t+1)} = \frac{\sum_{i=1}^{N} x_i\, \mathcal{N}\!\big(x^{(t)} \mid x_i, \Sigma\big)}{\sum_{i=1}^{N} \mathcal{N}\!\big(x^{(t)} \mid x_i, \Sigma\big)},$$

which is very similar to the mean update in the EM algorithm for a mixture of Gaussians.
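To make the update concrete, here is a small NumPy sketch of GMS with an isotropic kernel $\Sigma = \sigma^2 I$ (this is not the MATLAB code from the post; the function names and the merging tolerance are my own choices):

```python
import numpy as np

def gaussian_mean_shift(X, sigma, n_iter=100, tol=1e-6):
    """Shift every point of X toward a mode of the Gaussian KDE.

    X     : (n, d) data array
    sigma : scalar kernel width (isotropic Gaussian)
    Returns the (n, d) array of converged points.
    """
    Y = X.astype(float).copy()
    for _ in range(n_iter):
        # pairwise squared distances between current points and the data
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-0.5 * d2 / sigma**2)              # Gaussian weights
        Y_new = W @ X / W.sum(axis=1, keepdims=True)  # weighted-mean update
        if np.abs(Y_new - Y).max() < tol:
            Y = Y_new
            break
        Y = Y_new
    return Y

def cluster_labels(Y, merge_tol=1e-3):
    """Merge converged points that landed on the same mode."""
    modes, labels = [], []
    for y in Y:
        for k, m in enumerate(modes):
            if np.linalg.norm(y - m) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(y)
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)
```

On two well-separated blobs, every point converges to its blob’s mode, so the number of clusters pops out automatically once the kernel width is fixed:

```python
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
               rng.normal(10.0, 0.3, (30, 2))])
Y = gaussian_mean_shift(X, sigma=1.0)
labels, modes = cluster_labels(Y, merge_tol=0.5)
print(len(modes))  # two modes for this toy data
```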

Here is a short report [pdf] on GMS, which shows the preliminary results and how to derive GMS. MATLAB code for GMS is available here.

Hi, I was wondering if you could reupload the matlab codes for Gaussian Mixture Mean-Shift? Neither the pdf nor the zip file appear to exist anymore. Excellent post, btw. Thanks!

Hi, Terence. Sorry for the inconvenience. Recently my original website http://www.whatdafact.com was hacked, which forced me to erase the whole shebang of my data stored on the web host. I will re-upload all the data very soon and will definitely let you know. Thanks again!

Hi Terence, the files are reuploaded already. Thanks!

Do you think this code is likely to work with a Gaussian Mixture of, say, a set of returned nearest neighbours?

Normally GMS works pretty well on Gaussian mixture models, especially with a substantial number of data samples, so I think the code should work fine on such data. Also, the performance of GMS depends heavily on the kernel size used in the algorithm. The kernel size determines the scale of the information we would like to extract from the data, so, for me, there is no right or wrong answer for the kernel size. Instead, a more important question would be “At what scale would we like to see the information in the data?” Picking the optimal scale is still an open problem; I usually find it by empirical experiment. What kind of data are you using? Is it supposedly generated from a Gaussian mixture? Good luck, Terence! ^_^

Well, what I’m investigating is non-uniqueness in the articulatory-acoustic inversion mapping, through the number of modes shown (meaning a single frame of sound can have a one-to-many mapping to how your mouth produces the sound, rather than one-to-one, which is a problem for, say, some speech recognition systems (maybe lip-reading?)).

So, I have an aligned set of articulatory data and acoustic data (z-score normalised to fall between -0.9 and +0.9), and for a given consonant (say, ‘f’) in the acoustic data, I’m taking an example frame from the acoustic data, using a KD-tree to partition the data and search for 1500 acoustically similar frames (I think it uses Euclidean distance), and then taking the indices of those frames and plotting the corresponding articulatory frames. When clustered, approx 5% of the data should show multiple modes.

I’m hoping to use mean shift to cluster these 1500 frames. So, I’ve used gmdistribution.fit to fit a Gaussian mixture to the data X (containing x & y coordinates, so 1500×2), trying 2/3/4 components. The points probably need to fall within… I dunno, 0.02 for the bandwidth, I guess.
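For what it’s worth, here is a minimal NumPy sketch of that mode-counting step. The two “articulator configurations”, the sample counts, and the 0.02 bandwidth are all made-up stand-ins for the KD-tree-returned frames, and a plain mean shift loop is used rather than gmdistribution.fit:

```python
import numpy as np

def mean_shift_modes(X, sigma, n_iter=100, tol=1e-7):
    """Gaussian mean shift: return the distinct modes the points converge to."""
    Y = X.astype(float).copy()
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-0.5 * d2 / sigma**2)
        Y_new = W @ X / W.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y).max() < tol:
            Y = Y_new
            break
        Y = Y_new
    # merge converged points that lie within one bandwidth of each other
    modes = []
    for y in Y:
        if not any(np.linalg.norm(y - m) < sigma for m in modes):
            modes.append(y)
    return np.array(modes)

# Stand-in for the 1500 articulatory frames returned by the KD-tree lookup:
# drawn here from two articulator configurations, mimicking a one-to-many case.
rng = np.random.default_rng(1)
config_a = rng.normal([0.3, 0.3], 0.01, size=(750, 2))
config_b = rng.normal([-0.3, -0.3], 0.01, size=(750, 2))
frames = np.vstack([config_a, config_b])

modes = mean_shift_modes(frames, sigma=0.02)
print(len(modes))  # more than one mode suggests a one-to-many mapping
```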

Terence, your work sounds very interesting, and I think your hypothesis is right. I can say ‘A’ and ‘K’ with the same lip shape; hehehe, now I know how to give lip readers a difficult time.

Regarding the kernel size selection, “Silverman’s rule of thumb” is a good option when I have no idea what the ideal kernel size would be. It’s a simple way to come up with a reasonable kernel size for the data.
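For the one-dimensional case the linked notes cover, Silverman’s rule has a closed form; a quick sketch (the helper name is mine):

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman’s rule of thumb for a 1-D sample:
    h = 0.9 * min(sample std, IQR / 1.34) * n**(-1/5).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    iqr = np.subtract(*np.percentile(x, [75, 25]))  # interquartile range
    scale = min(x.std(ddof=1), iqr / 1.34)          # robust spread estimate
    return 0.9 * scale * n ** (-0.2)

# e.g. for 1000 standard-normal samples, h comes out at roughly 0.2
```

The min of the standard deviation and the scaled IQR makes the rule robust to outliers and heavy tails, but it is only a starting point — it is derived assuming the underlying density is roughly Gaussian.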

http://www.buch-kromann.dk/tine/nonpar/Nonparametric_Density_Estimation_onedim1.pdf

Hi, brilliant, thanks again!