## Gaussian Mean Shift Clustering

I have heard a lot of about Gaussian mean shift clustering as it is very popular, easy to implement, and one of the best for object tracking in video. Recently I have been working on structure learning, and one of my recent work is to say (in fact, boast, wahahahaha…just kidding) that structure learning can be a more general case of those classic clustering algorithms like k-mean, LBG-vector quantization, mean shift and EM algorithm. So I have to find some algorithms to compare with mine. The algorithm that I want to compare with has to come up with the number of clusters automatically once the covariance matrix or kernel width is known, so I think about LBG-VQ and Gaussian mean shift (GMS). When deriving GMS, I found that it’s very similar to the mean update equation of EM algorithm except that in EM we try to estimate the , so we take derivative w.r.t. . Unlike EM, in GMS, we take derivative w.r.t. each sample point in the space. However, in Gaussian distribution we have the term whose derivative and have opposite sign, therefore when equating the derivative to zero, we will get similar update equation.

the update equation as

which very similar to mean update in EM algorithm for mixture of Gaussian.

Here is a short report [pdf] on GMS which shows the preliminary results and how to derive GMS. MATLAB codes for GMS are available here.

## EM vs. mean-field variational approximation

When I derive mean-field (MF) approximation inference for many problems, I observed that MF gives very similar update equations to that of EM. The similarities not only appear on the results, but on the derivation process as well. This afternoon my adviser was out of the office ^_^, so I have some free time to figure out what’s going on between EM and MF. I decided to use a simple problem like Gaussian Mixture Model (GMM) as a model for explanation. Here is the writeup [link]

## Contrastive Divergence for Parameter Estimation in DSCRFs

Recently, I have been working on Deformable-Structure Conditional Random Fields (DSCRFs) for image classification which is about CRFs that can change the graph structure to fit the data (image) and make inference (of pixel label) simultaneously. One problem of this approach is when we estimate the parameters (I guess everyone in the field knows what I’m talking about ^_^), so I’m looking for some optimization algorithms to deal with this. Of course, first thing I’m thinking of is variational approximation. I have tried some already such as mean-field, structure variational, but there is one popular method, “Contrastive Divergence” (CD), that I have heard and wanted to try. I have read some papers on it, and here are what I really recommend to read.

- “Note on Contrastive Divergence” by Oliver Woodford [pdf]: For me, this paper is the best; precise, intuitive and make you hungry to know more!
- “Training Products of Experts by Minimizing Contrastive Divergence” by Geoffrey E. Hinton [pdf]: I guess this is the original paper of CD. This is the first paper I read on this topic, the paper did a good job to make me understand the math underlying CD, however, I did not have an intuitive idea of what CD really is after that first reading. Surprisingly, after I read [1], then come back to [2], I found that I can put pieces together and get a better intuition of this topic. So, I really recommend reading [1] before [2].

Video lecture related to this topic

Using Fast Weights to Improve Persistent Contrastive Divergence

Tijmen Tieleman

## Mean-Field Variational Approximation for Graphical Models

I remember the hard time I had when I tried to understand the mean-field variational approximation for Dynamic Tree Bayesian Networks (DTBNs). I tried to find a good paper to read. There were so many of them out there, so many that I did not know which one to begin with. Unfortunately I began with a bad one ^_^. I felt like I was walking in a deep forest, dark, lost and hopeless…hahaha. Now that I get through of it already, I would not want anyone to get lost like I did, and I would like to save some precious time for you. There are a whole bunch of great authors on this topic and I’m not having much time to write my own version of mean-field tutorial anyway. In stead I think it’s a good idea to compile a reading list on this topic so that you can enjoy walking on this beautiful forest as deep as you want. Please remember that this is just a starting point.

**Check list **

- familiar with graphical models –> You may refer to an old post of mine “A good introduction to graphical models“
- understand the concept of probabilistic inference. At least you can perform an inference on a simple graphical model
- Understand the core idea of more efficient inference methods like Forward-Backward (FB) algorithm, Belief Propagation (BP), Factor Graph (FG) –>
**READ [7] then [2]**and ask yourself the following:- know how they can be derived
- know why the methods do/don’t work on a circumstance
- know when/why the methods give the exact/approximate solution on a circumstance
- Understand how FB is derived and be able to use FB appropriately since FB will be used a lot in the examples later
- Realize that FB is a special case of BP, and BP is a special case of FG. At this point you might only adopt FG to use since it’s easy
- Know that the result of FG and FB is just the joint prob of the variable of interest and the evidence given a set of parameters

- Understand parameter estimation using maximum likelihood (ML) and Expectation-Maximization (EM) algorithm. –>
**READ [3] then [1]**from this point on having a good background on matrix calculus is very helpful.- Why do we need EM when we have ML?
- There are so many ways to write the EM, BUT….what is the benefit of writing EM update equations in terms of expectation of hidden variable w.r.t the prob. distribution of the hidden variables themselves?
- From the update equations obtained from ML and EM, find the similarity between the terms!!!
- Now that you see the similarity, you will insight exactly why we need EM algorithm and the beauty of the framework!!!
- From there you will see EM in disguise of ML, and here you will get the insight of how can you go beyond EM algorithm.
- Realize that M step of EM is very similar to that of ML, but the problem usually comes at the E step!!!
- E-step involves mainly on the inference technique. The intractable computation begin to surface at this step. That’s why I previously emphasized that we must have a good foundation of performing inference.
- See the connection between the expectation terms and the joint of the variable of interest with the evidence…They are almost the same thing!
- Practice deriving the update rules for parameter estimation in various situation/model/structure until you feel familiar with it.

- Mean-field approximation for factorial HMM (fHMM). Work on the mean-field approximation on a simple model. –>
**READ [1] and see the detail in [4]**- For an fHMM, try to derive the EM update equations to estimate the parameters. See which part (E-step or M-step) of the EM algorithm that is intractable?
- Why E-step is intractable?
- Is E-step of fHMM the same as that from HMM? Why?
- Get insight why and what behavior of the mean-field apprx makes the intractable tractable?
- We have a complete control over Q but we cannot enumerate P, how can we know that how Q and P are closed together when we cannot enumerate P? Explain it!
- Why do we use Dkl(Q|P) in stead of Dkl(P|Q)? Hint: It’s about the computational complexity.

- Advanced example: Mean-Field Dynamic Trees (MFDT). Extend your intuition of the mean field approximation. –>
**READ [5] and see the detail in [6]**.- Derive your own MFDT and compare to the original MFDT in the paper
- You might find that the MFDT derived in the original may not be correct!!! There are something wrong in it…please find it.
- Figure out how exactly you can use/code the update equations.
- Think forward on how could you make the approximation more accurate! –> At this point you may want to step ahead to “structured variational approximation”

**Bibliography**

[1] “An introduction to hidden Markov models and Bayesian networks” by Zoubin Ghahramani

[2] “Factor graphs and the sum-product algorithm” by F. R. Kschischang, B. J. Frey, H. A. Loeliger

[3] “A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models” by J. Bilmes

[4] “Factorial Hidden Markov Models” by Zoubin Ghahramani, Michael I. Jordan Machine Learning, Vol. 29, No. 2. (1 November 1997), pp. 245-273.

[5] “MFDTs: mean field dynamic trees” Adams, N.J. and Storkey, A.J. and Ghahramani, Z. and Williams, C.K.I. (2000)

[6] “Dynamic Trees: A Hierarchical Probabilistic Approach to Image Modelling” PhD Thesis by NJ Adams

[7] “Probabilistic reasoning in intelligent systems: Networks of plausible inference” (1988) J Pearl – Morgan Kauffmann, Los Altos, CA

## Statistical Physics Algorithms in Machine Learning

Recently I let myself exposed to so many types of problems in machine learning that need pretty high-level techniques to deal with. I was so amazed that most of the methods are originated from Statistical Physics!!! The first one that I encounter is mean field algorithm which is like an invitation for me to other methods in statistical Physics such as Bethe free energy approximation, Kikuchi approximation, saddle point approximation, calculus of variations, etc. I guess there are a lot more and they are very useful for machine learning especially for graphical models framework in which the number of configurations can be intractable.

There are some materials that I found interesting

**Constructing Free Energy Approximations and Generalized Belief Propagation Algorithms**

by Jonathan S. Yedidia, William T. Freeman, and Yair Weiss

**CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation**

by A. L. Yuille

**Statistical Physics Algorithms That Converge**

by A. L. Yuille and J. J. Kosowsky

**Statistical Physics, Mixtures of Distributions, and the EM Algorithm**

by Alan L. Yuille, Paul Stolorz, Joachim Utans

**Statistical Physics of Clustering Algorithms**

by Thore Graepel, Eckehard Sch, Prof Dr, Prof Dr, Klaus Obermayer

## Dependency-Structure Learning in Graphical Models

Given a graphical models we, human, can understand the dependency among the random variables pretty well. Also we can compute the distribution of the hidden variables given the observed variables and the structure of the graphical models by existing inference frameworks. However, there are lot of applications in which we don’t know the structure or the topology of the dependency among the random variables. The question is “How can we give the best guess to the structure of the dependency?” So far there have been a lot of works which I’m interested in.

These are some interesting publications on this topic

- Dynamic Trees: A Hierarchical Probabilistic Approach to Image Modelling by Nicholas J. Adams (Ph.D. thesis 2001)
- Dynamic Trees by by Christopher K. I. Williams, Nicholas J. Adams (NIPS 1999)
- MFDTs: Mean Field Dynamic Trees by Nicholas J Adams, Amos J Storkey, Zoubin Ghahramani, Christopher K I Williams (2000)
- Dynamic Trees: A structured Variational Approach Giving Efficient Propagation Rules by Amos Storkey (2000)
- Image modeling with position-encoding dynamic trees by by Amos J Storkey, Christopher K I Williams (2003)
- Tractable bayesian learning of tree belief networks by Marina Meila, Tommi Jaakkola (2000)
- Learning with Mixtures of Trees by Marina Meila, Michael Jordan (JMLR 2000)
- Learning with Mixtures of Trees by Marina Meila (Ph.D. thesis 1999)
- Estimating dependency structure as a hidden variable by Marina Meila, Michael Jordan (1998)
- Being Bayesian About Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks by Nir Friedman and Daphne Koller (2003)

## Structure-Learning By Dynamic Tree Bayesian Networks (DTBNs)

IN DTs, to search for a good structure is a very important issue. Changing the structure randomly will make the program so slow, we wneed to find some ways to come up with some good structures.

1) Map structure to structure space. This way when we search for structure, it is as if we are walking on the surface where each point on the surface is one unique structure.

2) After that we can perform any search algorithm on it, for example steepest decent, SA or even ITL. I think I would like to try to use QMI-ED to update the structure.

3) You can imagine the following. When you are standing on he surface, the height of the surface represents the log likelihood. YOu can only see the surface near to you, not too far away. Then from the information about the neighborhoods of the point you are standing at, you will make the next move based on the information for the neighborhood.

4) If we are at the point x on the surface, and somehow we can manage to get the neighborhood of x, ne(x). From x and ne(x), we can use an appropriate kernel to make a smooth pdf out of them. From the smooth surface, we can do whatever we want, for example we can do SA or we can use steepest decent, whatsoever method.

5) SA + kernel = kernel annealing is also an interesting option.

— some literatures

chemistry – molecular stability –> how to find a good structure of molecule

A new measure of edit distance between labeled trees (pdf)

read combinatorics, ask Adrian….structure space.