Posts Tagged ‘topic models’

Notes on Variational EM algorithm

July 18, 2010

This note is aimed at beginners in the variational EM algorithm. It covers the following topics:

  1. Motivation of variational EM
  2. Detailed derivation
  3. Geometric meaning of using the KL divergence in variational EM
  4. Traditional EM vs. variational EM

Please download the note here. Your comments and suggestions are much appreciated.

Derivation of Inference and Parameter Estimation Algorithm for Latent Dirichlet Allocation (LDA)

June 15, 2010 11 comments

As you may know, Latent Dirichlet Allocation (LDA) [1] is something of a backbone of text/image annotation these days. As of today (June 16, 2010), 2045 papers have cited the LDA paper, which is a good indicator of how important it is to understand this paper thoroughly. In my personal opinion, the paper contains a lot of interesting material, for example modeling with graphical models, and inference and parameter learning with variational approximations such as the mean-field approximation, all of which can be very useful when reading other papers that extend the original LDA paper.

The original paper explains the concept and how to set up the framework clearly, so I would suggest that the reader read that part in the original paper. However, for a newcomer to this topic, it can be difficult to understand how the formulas in the paper are derived. Therefore my tutorial paper focuses solely on how to derive the algorithm in the paper mathematically. Hence, the best way to use this tutorial paper is to read it alongside the original paper. I hope my paper is useful and enjoyable.

You can download the pdf file here.

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.

Python code for Retrieving URLs and Tags from Del.icio.us

Goal: To retrieve URLs and tags

Here we show the format of the output files. There are 3 output files: (1) urls.txt, (2) users.txt and (3) tags.txt.

  • In the text file urls.txt, each line is a URL, for example
http://sneerwell.blogspot.com/2009/10/top-9-best-free-web-hosting-for.html
http://sethgodin.typepad.com/seths_blog/2008/12/lesson-learned.html
http://weblogs.variety.com/season_pass/2008/10/mad-men-qa.html
  • In users.txt, each line is the Delicious username of a user who tagged the retrieved URLs. For example,
filipesouza
Vera Legisa
jeroenbijnens
anishmisty
213db
newmediabias
tiagomarques
Marcel van der Laan
genetjin
  • In tags.txt, each line contains the tags given to the corresponding URL by several users. Tags given by the same user are separated by a space ( ), and individual users are separated by a semicolon (;). For example, in the tags below, the 3rd line says that 3 users tagged the 3rd URL: the first user’s tags are (madmen tv interview television), the second user’s tags are (MadMen Variety), and the third user’s tag is (madmen). One line represents the tags for one URL.
web hosting ;erika ;best blogging hosting reference webhosting wordpress list ;
blog internet marketing technology creativity ;Career Innovation Business ;
madmen tv interview television ;MadMen Variety ;madmen ;
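
To make the tags.txt format concrete, here is a minimal sketch of how one line could be split back into per-user tag lists. The function name parse_tags_line is hypothetical and is not part of the downloadable code; it only assumes the space/semicolon convention described above.

```python
def parse_tags_line(line):
    """Split one line of tags.txt into a list of per-user tag lists."""
    users = []
    for chunk in line.strip().split(";"):
        tags = chunk.split()  # tags of one user are separated by spaces
        if tags:              # skip the empty chunk after the trailing semicolon
            users.append(tags)
    return users


line = "madmen tv interview television ;MadMen Variety ;madmen ;"
per_user_tags = parse_tags_line(line)
print(len(per_user_tags))  # 3 users tagged this URL
print(per_user_tags[1])    # ['MadMen', 'Variety']
```

Keeping the tags grouped per user (rather than flattening them) preserves the information about who used which tag, which may matter for some topic models.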

How do we retrieve the information from Del.icio.us?

This code is based on DeliciousAPI by Michael G. Noll; for more detail, please refer to Michael’s website. Details about installing DeliciousAPI are given in my previous post. We retrieve the information as follows:

  1. Retrieve 100 URLs from the Del.icio.us hotlist
  2. Retrieve the usernames of all users who tagged the URLs in the list
  3. For each username, retrieve all the URLs tagged by that user and add them to the URL list
  4. Repeat steps 2 and 3 until the desired number of URLs in the list is reached
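
The four steps above can be sketched as a simple loop. This is only an illustration under assumptions, not the code in the download: the three fetch_* arguments are hypothetical callables standing in for whatever calls the real script makes to Del.icio.us, and the toy dictionaries replace the website so the sketch runs offline.

```python
def crawl(fetch_hotlist, fetch_users_for_url, fetch_urls_for_user, number_urls_desired):
    """Grow a URL list by alternating URL -> users and user -> URLs lookups."""
    urls = list(dict.fromkeys(fetch_hotlist()))  # step 1, duplicates removed
    users = []
    seen_urls, seen_users = set(urls), set()
    url_pos = user_pos = 0  # first URL / user not yet expanded

    while len(urls) < number_urls_desired:
        progressed = False
        # Step 2: collect the users who tagged the URLs not yet visited.
        while url_pos < len(urls):
            for user in fetch_users_for_url(urls[url_pos]):
                if user not in seen_users:
                    seen_users.add(user)
                    users.append(user)
            url_pos += 1
            progressed = True
        # Step 3: add every URL tagged by the users not yet visited.
        while user_pos < len(users) and len(urls) < number_urls_desired:
            for url in fetch_urls_for_user(users[user_pos]):
                if url not in seen_urls:
                    seen_urls.add(url)
                    urls.append(url)
            user_pos += 1
            progressed = True
        if not progressed:  # nothing left to expand, stop early
            break
    return urls, users


# Toy data standing in for the website (purely illustrative).
toy_url_users = {"u1": ["alice"], "u2": ["bob"], "u3": ["alice", "bob"], "u4": ["bob"]}
toy_user_urls = {"alice": ["u1", "u3"], "bob": ["u2", "u3", "u4"]}

urls, users = crawl(
    lambda: ["u1"],
    lambda url: toy_url_users.get(url, []),
    lambda user: toy_user_urls.get(user, []),
    number_urls_desired=3,
)
print(urls)   # ['u1', 'u3', 'u2', 'u4']
print(users)  # ['alice', 'bob']
```

Note that the result can overshoot the target (4 URLs for a target of 3 here), because all URLs of a user are added at once; this is why the desired number is only approximate.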

Download the file here.

How to use the code

There are 2 Python (.py) files to use:

  1. retrieveURLsandUsers.py: Iteratively retrieves URLs and the corresponding users until the desired number of URLs is reached. This program gives you a list of URLs (5000, say) in the file urls.txt.
  2. retrieveTags.py: Reads each URL from the list in urls.txt and outputs the corresponding tags for each URL.

The next sections show how to run the programs in different situations.

Run for the first time

1. Open retrieveURLsandUsers.py and set the approximate number of URLs you want to retrieve. For example, if you want to retrieve 5000 URLs, set

number_URLs_desired = 5000

2. Open the file url_retrieve_log.txt, which is a configuration file, and enter

1
0
0
0
0
0
0

3. Now you can run retrieveURLsandUsers.py.

Your IP is blocked by Del.icio.us!!!

There is some chance that your IP will be blocked by the web server. You will have to accept it and wait a couple of hours until you can retrieve information from the website again.

Do I have to start from the beginning next time? No, we can start from where we were blocked. When the website blocks our IP, all variables are stored in text files, and all status variables are stored in url_retrieve_log.txt. For example

1       processID (1 or 2) that we will start with
334     url index that failed
177     url_list_start
350     url_list_end
23      user index that failed
12      user_list_start
350     user_list_end
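
As an illustration of how such a status file could be read back, here is a minimal sketch. The field names are my own labels derived from the descriptions in the example above, and parse_url_log is a hypothetical helper, not a function from the downloadable scripts. It takes the first number on each line, so it accepts both the bare seven-number file and the annotated form shown above.

```python
# Hypothetical field names, in the order the values appear in the file.
LOG_FIELDS = [
    "process_id",         # 1 or 2: which process to start with
    "url_index_failed",   # URL index that failed
    "url_list_start",
    "url_list_end",
    "user_index_failed",  # user index that failed
    "user_list_start",
    "user_list_end",
]


def parse_url_log(text):
    """Turn the contents of url_retrieve_log.txt into a dict of integers."""
    values = [int(line.split()[0]) for line in text.splitlines() if line.strip()]
    if len(values) != len(LOG_FIELDS):
        raise ValueError("expected %d values, got %d" % (len(LOG_FIELDS), len(values)))
    return dict(zip(LOG_FIELDS, values))


fresh_log = "1\n0\n0\n0\n0\n0\n0\n"  # the first-run configuration
blocked_log = "1  processID\n334  url index that failed\n177\n350\n23\n12\n350\n"

print(parse_url_log(fresh_log)["process_id"])          # 1
print(parse_url_log(blocked_log)["url_index_failed"])  # 334
```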

Therefore, as long as you don’t mess with the file, you can always continue retrieving the information from the website.

In other words, when the website releases your IP, just run retrieveURLsandUsers.py again, and it will do everything for you.

Once I have 5000 URLs, what next?

Next, you will want to retrieve the corresponding tags for each URL you retrieved.

1. Open the log file tag_retrieve_log.txt. If you are running this program for the first time, you will see the following in the file.

whatever
0

2. Just leave it like that.

3. Run the file retrieveTags.py and see the result.

4. The code will run smoothly for about 120 URLs, and then the message “[#] Oh boy…Your IP was blocked again by Del.icio.us” will appear. There is nothing to do but wait a couple of hours and then run the same program again.

5. The log file records where your IP was blocked so that the program can restart from the right place. For example, if you failed to retrieve the 401st URL, the log will show

http://www.iwit.nl/
401

401 is the index of the URL that you failed to retrieve, and the first line is the corresponding URL. Leave the file alone; don’t mess with it.
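
The resume behaviour described above can be sketched as follows. This is a simplified illustration, not the logic of retrieveTags.py itself: the helper names and fake_fetch are hypothetical, a block is simulated by raising IOError, and the logged index is assumed to be a 0-based position in the URL list (the original post does not say which convention the script uses).

```python
def read_tag_log(text):
    """Return (url, index) from the two-line tag_retrieve_log.txt format."""
    lines = text.splitlines()
    return lines[0].strip(), int(lines[1].strip())


def retrieve_tags(urls, log_text, fetch_tags):
    """Fetch tags starting at the logged index; return (results, new_log)."""
    _, start = read_tag_log(log_text)
    results = []
    for index in range(start, len(urls)):
        try:
            results.append(fetch_tags(urls[index]))
        except IOError:
            # Blocked: record the failing URL and its index for the next run.
            return results, "%s\n%d\n" % (urls[index], index)
    return results, None  # finished, nothing left to resume


# Simulated website: the third URL is blocked once, then works on retry.
blocked = {"http://example.com/2"}

def fake_fetch(url):
    if url in blocked:
        blocked.discard(url)  # pretend the block lifts before the next run
        raise IOError("IP blocked")
    return "tags for " + url


urls = ["http://example.com/%d" % i for i in range(4)]
first, log = retrieve_tags(urls, "whatever\n0\n", fake_fetch)
second, log_after = retrieve_tags(urls, log, fake_fetch)
print(len(first), repr(log))        # 2 'http://example.com/2\n2\n'
print(len(second), log_after)       # 2 None
```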

Retrieve URLs and Tags from Del.icio.us with Python API on Ubuntu

April 30, 2010

For DeliciousAPI, please refer to Michael G. Noll’s web page. My previous post on DeliciousAPI may be useful too.

  1. Install Python on your machine. Most of the time, Ubuntu comes with Python already installed.
  2. Install “Easy Install” for Python by opening a terminal and typing
    sudo apt-get install python-setuptools python-dev build-essential

    For more details, please refer to SaltyCrane Blog.

  3. Install DeliciousAPI by running
sudo easy_install DeliciousAPI

Now you can use DeliciousAPI in Python!!! Just go to Michael’s web page and copy the demo code to a text file, say test.py. Then go to the folder containing the file and run it from the terminal by typing “python test.py”. For more detail about how to install the API on Windows 7, please refer to my previous post.

Retrieve URLs and Tags from Del.icio.us (Delicious.com)

April 30, 2010 2 comments

Del.icio.us is a bookmarking website. For people who want to work on topic modeling, the website can be a good data set to try. In this post, I would like to show how to retrieve information (e.g. URLs, titles, tags, users, comments, timestamps) from Del.icio.us. There are many ways to do the job, but for me the Python API is a good and easy way to do it.

Actually, the whole API is available for free and is described very well on the Del.icio.us Python API web page by Michael G. Noll. However, those of us who are absolute beginners may not understand how to install and use the API, so I would like to elaborate on Michael’s guide in more detail. There are 4 big steps. Here we go!

  1. Install Python: First of all, you need to have Python installed on your machine
  2. Install Easy Install: Easy Install is a Python package that can save us a lot of time when installing other Python packages.
  3. Install Michael’s Del.icio.us Python API on your machine
  4. Run the API, and have fun!

1. Install Python on your machine (Windows 7 64-bit)

  1. Download Python from the Python download page. Pick the installer that matches your machine and OS. I have Windows 7 64-bit, so I download “Python 2.6.5 Windows X86-64 installer (Windows AMD64 / Intel 64 / X86-64 binary [1] -- does not include source)”
  2. Run the installer; it should not take long to download and install. I found a good video tutorial on Python installation and testing.
  3. Now you can play with Python IDLE to check that it is installed properly.
  4. Add the Python folder to the path
    1. Run the command line as an administrator; please refer to this blog.
    2. Add the path by typing “set path=C:\Program Files\yourPythonFolder;%path%”. Note that yourPythonFolder MUST contain the file python.exe
    3. You can check that the path was added properly by typing “path”

2. Install Easy Install

For simplicity, we next install a Python package called “Easy Install”, which saves a lot of time when installing additional Python packages. Easy Install monitors the installation and can automatically download and install any additional packages needed. This way we don’t have to manually check which packages to download and install.

  1. Go to the Easy Install web page and download the proper installation file. Note that if you use Windows 7 64-bit, the only option you can use is the “Source” file (setuptools-0.6c11.tar.gz). One good thing about the source file is that it always works regardless of the machine or OS you are using…so I will go this way.
  2. Download the source and extract it to a folder, say C:\Users\bot\Downloads\PythonFiles\setuptools-0.6c11
  3. Run the command line as an administrator, and go to the folder C:\Users\bot\Downloads\PythonFiles\setuptools-0.6c11
  4. Now install the Easy Install package by running the command “python setup.py install”. You will see a lot of output in the command line.
  5. You will find that Easy Install is stored in the folder “C:\Program Files\Python 2.6.5 64-bit\Scripts”

3. Install Michael’s Del.icio.us Python API

We will use Easy Install to do the job

  1. Stay in the command line and go to the folder “C:\Program Files\Python 2.6.5 64-bit\Scripts” by typing “cd C:\Program Files\Python 2.6.5 64-bit\Scripts”
  2. Run the command “easy_install DeliciousAPI”. You will see Python downloading the packages necessary for DeliciousAPI.

4. Run DeliciousAPI

Now that we have installed the API, let’s run it.

  1. Go to Michael’s page, copy the demo code, save it as “test.py” and put it in any folder you want.
  2. You can run test.py from the command line by typing “python test.py”. You should see the URLs, tags, and everything else pop up in your command line window!

I would like to thank my friend Rohit Manokaran for helping me with this DeliciousAPI.

Topic Models

March 23, 2010

Last week I met a long-lost friend at ICASSP 2010, held in Dallas, TX. More precisely, this very nice friend of mine, Duangmanee (Pew), was my senior when we were at the same high school. At the conference, we had a good time (almost an hour) discussing our lives, updates on our other friends (heeheehee…a nice way to say “gossip”) and our research, and I’m very lucky that Pew is an expert on the topic models that I’m interested in. Since I’m a beginner on this topic, I think I will have to learn some more fundamental work on it before I can understand Pew’s paper.

This post is my effort to list good papers, notes and tutorials on topic models in the hope that it might be useful for other beginners like me. Please feel free to make suggestions so that this post is as useful as possible to learners.

Video lecture

Topic Models

David Blei
2 videos


Independent Factor Topic Models

Duangmanee (Pew) Putthividhya

Useful links (I’m working on the list)

LDA on Wiki

Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John (ed.). “Latent Dirichlet allocation”. Journal of Machine Learning Research 3: pp. 993–1022. doi:10.1162/jmlr.2003.3.4-5.993

Blei, David M.; Lafferty, John D. (2006). “Correlated topic models”. Advances in Neural Information Processing Systems

D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134. ACM Press, 2003. [PDF]

K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003. [PDF]

Blei, David M.; Jordan, Michael I.; Griffiths, Thomas L.; Tenenbaum, Joshua B. (2004). “Hierarchical Topic Models and the Nested Chinese Restaurant Process”. Advances in Neural Information Processing Systems 16. [pdf]

Hanna M. Wallach (2008), “Structured Topic Models for Language” [PhD thesis]

Tomoharu Iwata, Takeshi Yamada, Naonori Ueda, “Modeling Social Annotation Data with Content Relevance using a Topic Model,” Advances in Neural Information Processing Systems (NIPS2009), 835-843, 2009 [pdf]