## Mean-Field Variational Approximation for Graphical Models

I remember the hard time I had when I tried to understand the mean-field variational approximation for Dynamic Tree Bayesian Networks (DTBNs). I tried to find a good paper to read. There were so many of them out there, so many that I did not know which one to begin with. Unfortunately I began with a bad one ^_^. I felt like I was walking in a deep forest, dark, lost and hopeless…hahaha. Now that I get through of it already, I would not want anyone to get lost like I did, and I would like to save some precious time for you. There are a whole bunch of great authors on this topic and I’m not having much time to write my own version of mean-field tutorial anyway. In stead I think it’s a good idea to compile a reading list on this topic so that you can enjoy walking on this beautiful forest as deep as you want. Please remember that this is just a starting point.

**Check list **

- familiar with graphical models –> You may refer to an old post of mine “A good introduction to graphical models“
- understand the concept of probabilistic inference. At least you can perform an inference on a simple graphical model
- Understand the core idea of more efficient inference methods like Forward-Backward (FB) algorithm, Belief Propagation (BP), Factor Graph (FG) –>
**READ [7] then [2]**and ask yourself the following:- know how they can be derived
- know why the methods do/don’t work on a circumstance
- know when/why the methods give the exact/approximate solution on a circumstance
- Understand how FB is derived and be able to use FB appropriately since FB will be used a lot in the examples later
- Realize that FB is a special case of BP, and BP is a special case of FG. At this point you might only adopt FG to use since it’s easy
- Know that the result of FG and FB is just the joint prob of the variable of interest and the evidence given a set of parameters

- Understand parameter estimation using maximum likelihood (ML) and Expectation-Maximization (EM) algorithm. –>
**READ [3] then [1]**from this point on having a good background on matrix calculus is very helpful.- Why do we need EM when we have ML?
- There are so many ways to write the EM, BUT….what is the benefit of writing EM update equations in terms of expectation of hidden variable w.r.t the prob. distribution of the hidden variables themselves?
- From the update equations obtained from ML and EM, find the similarity between the terms!!!
- Now that you see the similarity, you will insight exactly why we need EM algorithm and the beauty of the framework!!!
- From there you will see EM in disguise of ML, and here you will get the insight of how can you go beyond EM algorithm.
- Realize that M step of EM is very similar to that of ML, but the problem usually comes at the E step!!!
- E-step involves mainly on the inference technique. The intractable computation begin to surface at this step. That’s why I previously emphasized that we must have a good foundation of performing inference.
- See the connection between the expectation terms and the joint of the variable of interest with the evidence…They are almost the same thing!
- Practice deriving the update rules for parameter estimation in various situation/model/structure until you feel familiar with it.

- Mean-field approximation for factorial HMM (fHMM). Work on the mean-field approximation on a simple model. –>
**READ [1] and see the detail in [4]**- For an fHMM, try to derive the EM update equations to estimate the parameters. See which part (E-step or M-step) of the EM algorithm that is intractable?
- Why E-step is intractable?
- Is E-step of fHMM the same as that from HMM? Why?
- Get insight why and what behavior of the mean-field apprx makes the intractable tractable?
- We have a complete control over Q but we cannot enumerate P, how can we know that how Q and P are closed together when we cannot enumerate P? Explain it!
- Why do we use Dkl(Q|P) in stead of Dkl(P|Q)? Hint: It’s about the computational complexity.

- Advanced example: Mean-Field Dynamic Trees (MFDT). Extend your intuition of the mean field approximation. –>
**READ [5] and see the detail in [6]**.- Derive your own MFDT and compare to the original MFDT in the paper
- You might find that the MFDT derived in the original may not be correct!!! There are something wrong in it…please find it.
- Figure out how exactly you can use/code the update equations.
- Think forward on how could you make the approximation more accurate! –> At this point you may want to step ahead to “structured variational approximation”

**Bibliography**

[1] “An introduction to hidden Markov models and Bayesian networks” by Zoubin Ghahramani

[2] “Factor graphs and the sum-product algorithm” by F. R. Kschischang, B. J. Frey, H. A. Loeliger

[3] “A Gentle Tutorial on the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models” by J. Bilmes

[4] “Factorial Hidden Markov Models” by Zoubin Ghahramani, Michael I. Jordan Machine Learning, Vol. 29, No. 2. (1 November 1997), pp. 245-273.

[5] “MFDTs: mean field dynamic trees” Adams, N.J. and Storkey, A.J. and Ghahramani, Z. and Williams, C.K.I. (2000)

[6] “Dynamic Trees: A Hierarchical Probabilistic Approach to Image Modelling” PhD Thesis by NJ Adams

[7] “Probabilistic reasoning in intelligent systems: Networks of plausible inference” (1988) J Pearl – Morgan Kauffmann, Los Altos, CA

## Dependency-Structure Learning in Graphical Models

Given a graphical models we, human, can understand the dependency among the random variables pretty well. Also we can compute the distribution of the hidden variables given the observed variables and the structure of the graphical models by existing inference frameworks. However, there are lot of applications in which we don’t know the structure or the topology of the dependency among the random variables. The question is “How can we give the best guess to the structure of the dependency?” So far there have been a lot of works which I’m interested in.

These are some interesting publications on this topic

- Dynamic Trees: A Hierarchical Probabilistic Approach to Image Modelling by Nicholas J. Adams (Ph.D. thesis 2001)
- Dynamic Trees by by Christopher K. I. Williams, Nicholas J. Adams (NIPS 1999)
- MFDTs: Mean Field Dynamic Trees by Nicholas J Adams, Amos J Storkey, Zoubin Ghahramani, Christopher K I Williams (2000)
- Dynamic Trees: A structured Variational Approach Giving Efficient Propagation Rules by Amos Storkey (2000)
- Image modeling with position-encoding dynamic trees by by Amos J Storkey, Christopher K I Williams (2003)
- Tractable bayesian learning of tree belief networks by Marina Meila, Tommi Jaakkola (2000)
- Learning with Mixtures of Trees by Marina Meila, Michael Jordan (JMLR 2000)
- Learning with Mixtures of Trees by Marina Meila (Ph.D. thesis 1999)
- Estimating dependency structure as a hidden variable by Marina Meila, Michael Jordan (1998)
- Being Bayesian About Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks by Nir Friedman and Daphne Koller (2003)

## Structure-Learning By Dynamic Tree Bayesian Networks (DTBNs)

IN DTs, to search for a good structure is a very important issue. Changing the structure randomly will make the program so slow, we wneed to find some ways to come up with some good structures.

1) Map structure to structure space. This way when we search for structure, it is as if we are walking on the surface where each point on the surface is one unique structure.

2) After that we can perform any search algorithm on it, for example steepest decent, SA or even ITL. I think I would like to try to use QMI-ED to update the structure.

3) You can imagine the following. When you are standing on he surface, the height of the surface represents the log likelihood. YOu can only see the surface near to you, not too far away. Then from the information about the neighborhoods of the point you are standing at, you will make the next move based on the information for the neighborhood.

4) If we are at the point x on the surface, and somehow we can manage to get the neighborhood of x, ne(x). From x and ne(x), we can use an appropriate kernel to make a smooth pdf out of them. From the smooth surface, we can do whatever we want, for example we can do SA or we can use steepest decent, whatsoever method.

5) SA + kernel = kernel annealing is also an interesting option.

— some literatures

chemistry – molecular stability –> how to find a good structure of molecule

A new measure of edit distance between labeled trees (pdf)

read combinatorics, ask Adrian….structure space.

## Finding a good structure for the Dynamic Trees Bayesian Networks

Today I have a chance to talk with Sudir and Shalom @ CNEL after our ITL class

The discussion was about finding a good structure for a DTs using ITL unsupervised learning. Shalom and I discussed about Sinisa’s work, using wavelets as features in DTs, how to make a different scale images that still have some correlation between the same pixels in the different scales.

- How to incorporate ITL unsupervised clustering into the Dynamic Trees Bayesian Networks. First I read the paper from Jenssen, the paper is very easy to understand and gives very clear idea of how we can use the ITL for making DAG unsupervisedly. However, the work is pretty much heuristic and computationally expensive since we have to tune the sigma and we have to find the direction of force for each point in the space which takes very long time. As Sudir’s work is more convenient since it gives a closed-form update rules for each point, therefore, I will use his segmentation method to build the structure. For this work, the only parameter we will have to play with is sigma; smaller sigma gives more number of clusters. On the other hand, bigger sigma gives smaller number of clusters. Consequently we will use the small sigma at the leaf nodes and big sigma at the root node.
- We may use the wavelets coefficients as features for finding DTs structure. There are 2 classic papers to read: 1) “Wavelet-based statistical signal processing using hidden Markov models” (1998) by Crouse, M.S. Nowak, R.D. Baraniuk, R.G. and 2) “
**Approximating discrete probability distributions with dependence trees”**(1968) Chow, C.; Liu, C. - We can make observations in each scale by using wavelet decompose the image into several scale, then we might be able to reconnect the links.
- We can also make observations in each scale by using Gaussian blur, resample (Sinisa’s work) applied to the image, then find some relationship between the upper-scale pixels and the lower-scale pixels. However, using Gaussian blur may not be a powerful method because it does not show much about the relationship across the scale, whereas, the wavelets and resampling might do a good job on that.
- If we really want to make a multi-scale image by bluring, we will have to design a blur function such that it will boost or remain the relationship between the across-scale pixels. How can we design such a blur function? Of course it might be an adaptive one!!!