## Bayesian Inference

Recently I have been looking for a good short resource on the fundamentals of Bayesian inference. There are tons of good relevant materials online; unfortunately, some are too long and some are too theoretical. Come on, I want something intuitive, that makes sense, and that is readable within 30 minutes! After a couple of hours wandering the internet and reading some good relevant books, I decided to pick the three books that, in my opinion, contributed most to my understanding within half an hour.

- Bayesian Data Analysis by Andrew Gelman et al. — If you already understand the Bayesian framework and just want to review Bayesian inference, the first pages of Chapter 2 are sufficient. Good examples can be found in Section 2.6.
- Bayesian Core: A Practical Approach to Computational Bayesian Statistics by Jean-Michel Marin and Christian P. Robert — The book emphasizes the important concepts and has a lot of the examples necessary to understand the basic ideas.
- Bayesian Field Theory by Jörg C. Lemm — This book provides a great perspective on the Bayesian framework in relation to learning, information theory, and physics. There are a lot of figures and graphical models that make the book easy to follow.

## Variational Bayesian Gaussian Mixture Model (VBGMM)

The EM algorithm for mixtures of Gaussians (EM-GMM) has long been a popular model in statistics and machine learning. In EM-GMM the model parameters are estimated by the maximum likelihood (ML) method; recently, however, there has been a need to put prior probabilities on the model parameters. The GMM then becomes a hierarchical Bayesian model whose layers, from root to leaf, are the parameters, the mixture proportions, and the observations respectively. Exact inference in this hierarchical model requires either challenging integration techniques or stochastic sampling techniques (e.g. MCMC); the latter takes a lot of computational time to sample from the distribution.
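As a point of reference before going Bayesian, the plain ML-based EM updates for a one-dimensional two-component GMM can be sketched in a few lines (a minimal illustration with synthetic data, not the hierarchical model discussed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated 1-D Gaussian clusters
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

# Initial mixture proportions, means, and variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, var_k)
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: ML re-estimates of the parameters from the responsibilities
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(np.sort(mu).round(1))  # means recovered near -3 and 3
```

The Bayesian treatment replaces these ML point estimates with posterior distributions over `pi`, `mu`, and `var`, which is exactly where the variational machinery below comes in.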

Fortunately, there are approximation techniques, for example mean-field variational approximation, that allow fast inference and give a good approximate solution. Variational approximation is very well explained in Chapter 10 of the classic machine learning textbook [1] by Bishop, which includes a very good example on variational Bayesian Gaussian mixture models. In fact, Bishop did a great job explaining and deriving VBGMM; however, for a beginner, the algebra of the derivation can be challenging. Since the derivation contains a lot of interesting techniques that can be applied to other variational approximations, and the text skips some details, I decided to “fill in” the missing parts and make a derivation tutorial out of it, which is available here [pdf]. I also made the detailed derivations of some examples preceding the VBGMM section in the text [1] available as well [pdf]. VBGMM originally appeared in an excellent paper [2]. Again, for an introduction and for more detail on the interpretation of the model, please refer to the original paper or Bishop’s textbook.

A MATLAB implementation of VBGMM can be found from Prof. Kevin Murphy’s group at UBC [link] or [link]. The code requires the Netlab toolbox (a collection of very good MATLAB code for machine learning).
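For readers working in Python rather than MATLAB, scikit-learn ships a variational Bayesian GMM as `sklearn.mixture.BayesianGaussianMixture`. A minimal sketch, assuming scikit-learn is installed, showing the hallmark behavior of VBGMM — unused components are pruned automatically by the prior on the mixing proportions:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D clusters
X = np.vstack([rng.normal(-4, 1, (150, 2)), rng.normal(4, 1, (150, 2))])

# Deliberately over-specify the number of components; a small concentration
# prior on the mixture weights drives the weights of unused components to ~0
vbgmm = BayesianGaussianMixture(
    n_components=6,
    weight_concentration_prior=0.01,
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(np.sort(vbgmm.weights_), 2))  # mass concentrates on ~2 components
```

This component-pruning effect is one of the practical payoffs of the Bayesian treatment over plain EM, which would happily fit all six components.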

[1] “Pattern Recognition and Machine Learning” (2006) by Christopher Bishop [link]

[2] “A Variational Bayesian Framework for Graphical Models” (NIPS 1999) by Hagai Attias [link]

## Derivation of Inference and Parameter Estimation Algorithm for Latent Dirichlet Allocation (LDA)

As you may know, Latent Dirichlet Allocation (LDA) [1] has become a backbone of text/image annotation these days. As of today (June 16, 2010), 2045 papers have cited the LDA paper, which might be a good indicator of how important it is to understand this paper thoroughly. In my personal opinion, the paper contains a lot of interesting material, for example modeling using graphical models, and inference and parameter learning using mean-field variational approximation, which can be very useful when reading other papers that extend the original LDA paper.

The original paper explains the concepts and how to set up the framework clearly, so I would suggest reading that part from the original paper. However, a newcomer to this topic might have some difficulty deriving the formulas in the paper. My tutorial paper therefore focuses solely on how to mathematically derive the algorithm in the paper; hence, the best way to use it is alongside the original paper. I hope you find it useful and enjoyable.

You can download the pdf file here.
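As a complement to the derivation, here is a minimal sketch of fitting LDA with batch variational Bayes as implemented in scikit-learn (the toy corpus and parameter choices are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny toy corpus: two documents about genetics, two about finance
docs = [
    "gene dna genome sequence",
    "dna sequence protein gene",
    "stock market trading price",
    "market price trading economy",
]
X = CountVectorizer().fit_transform(docs)  # document-term count matrix

# Batch variational Bayes with 2 topics
lda = LatentDirichletAllocation(
    n_components=2, learning_method="batch", max_iter=50, random_state=0
).fit(X)

theta = lda.transform(X)  # per-document topic proportions (rows sum to 1)
print(theta.round(2))
```

The `theta` matrix here corresponds to the variational posterior over per-document topic proportions, i.e. the quantity the mean-field updates in the paper iterate over.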

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

## Minimum Description Length (MDL)

The basic idea of MDL is to measure an entity by the minimum length of the code or symbols needed to represent it. There are many applications that MDL is applicable to, for example model selection, data/signal compression, structure learning, curve fitting, etc. Good references for a beginner are:

- MDL in Wiki [link]
- MDL on Scholarpedia [link]
- MDL on the web [link]
- You can find good tutorials there. I would recommend the tutorial “P. Grünwald, A tutorial introduction to the minimum description length principle. In: *Advances in Minimum Description Length: Theory and Applications* (edited by P. Grünwald, I.J. Myung, M. Pitt), MIT Press, 2005 (80 pages).”

- MDL tutorial by Prof. Rissanen “An Introduction to the MDL Principle” [pdf]

MDL has a strong connection with Kolmogorov complexity. In terms of model selection, there are some other topics that you may find interesting to connect with MDL, for example BIC, AIC, and NML.
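To make the “minimum code length” idea concrete, here is a toy two-part MDL sketch for polynomial curve fitting: the total description length is the bits needed to encode the model parameters plus the bits needed to encode the data given the model. The encoding scheme below (half a log of the sample size per parameter, Gaussian residual code) is a crude BIC-style approximation, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.5 * x**2 - x + rng.normal(0, 0.1, x.size)  # true model is quadratic

def description_length(degree):
    # Part 1: cost of the model — (degree + 1) parameters,
    # each coded with (1/2) log2 n bits
    coeffs = np.polyfit(x, y, degree)
    n = x.size
    model_bits = 0.5 * (degree + 1) * np.log2(n)
    # Part 2: cost of the data given the model — Gaussian code length
    # for the residuals, in bits
    resid = y - np.polyval(coeffs, x)
    sigma2 = max(resid.var(), 1e-12)
    data_bits = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)
    return model_bits + data_bits

lengths = {d: description_length(d) for d in range(1, 8)}
print(min(lengths, key=lengths.get))  # the quadratic wins
```

A degree-1 fit pays heavily in residual bits, while degrees above 2 pay more in parameter bits than they save in residual bits, so the two-part code length bottoms out at the true model order.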

## Structure Learning and Model Selection Criteria

There are many possible ways to learn the causal structure in a particular dataset. However, I think SOM (the self-organizing map) could also be an alternative…I will investigate this!

Also, in the autoregressive community, Granger causality plays quite an important role in time-series causal structure learning. There might be some connections between Granger causality and the standard structure learning algorithms in the Bayesian network community.

One big problem in structure learning is controlling the complexity of the model. Most of the time we would like the simplest model that works well for the dataset, following Occam’s razor, so we would like a function that penalizes complexity. This problem is also referred to as the “model selection” problem. There are a lot of model selection criteria that I would like to investigate for use with Bayesian network structure learning. Here are some possible criteria:

AIC, BIC, CIC, MDL, NML, factorized NML
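For the two most common of these, AIC and BIC are simple penalized scores of the fitted log-likelihood; the standard formulas can be sketched directly (the log-likelihood values below are made-up numbers for illustration):

```python
import numpy as np

def aic(log_lik, k):
    # Akaike information criterion: 2k - 2 ln L
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # Bayesian information criterion: k ln n - 2 ln L
    return k * np.log(n) - 2 * log_lik

# Two hypothetical candidate structures fit to the same n = 100 observations:
# a richer model buys a slightly better likelihood with 3 extra parameters
n = 100
print(aic(-120.0, 3), bic(-120.0, 3, n))  # simpler model
print(aic(-118.5, 6), bic(-118.5, 6, n))  # richer model
```

Because BIC's per-parameter penalty grows with ln n, it favors the simpler structure here even though the richer one has the better likelihood, which is exactly the Occam's-razor behavior described above.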