Wednesday, June 27, 2012

Chinese Restaurants and Indian Buffets; New Eateries in Statistical Machine Learning

One of the major new trends in Statistical Machine learning is a rejuvenated interest in unsupervised techniques as compared to supervised techniques. And the reason is obvious;  web applications are generating data at an enormous rate and it is becoming more and more expensive to manually label the data as is required by Supervised techniques. Usage Logs, natural language texts, and images make to the top of the list as the most abundant pieces of data being generated. And hence,  comes the interesting question: How to find trends in the data being generated? How to devise statistical models that can help organize the data? Take the example of the company I work in, Change.org, people are writing petition texts, making comments, sharing stuff on Facebook and Twitter at an amazing rate. More than 10,000 petitions are started on Change.org every month by our millions of users. And hence it is becoming more and more challenging for us to organize the content, learn trends, and present content to users that they care about. One important question in this regard is: How can we automatically categorize petitions in various categories where we learn the number of categories from the data itself? and is it possible to learn categories that form a hierarchy, like Global Warming as a sub-category of Environment and Environment as a subcategory of Planet Earth etc. And is it possible to learn all these hierarchies, number of categories from the data itself i-e with making as few assumptions as possible. The answer is yes!
A major recent trend in the unsupervised community is the use of non-parametric bayesian statistics. These techniques have become popular only recently because in most cases the process of updating the prior distributions with the likelihood of the data ( posterior computation) in non-parametric bayesian techniques involve intractable integrals which can only be approximated using sampling techniques (or approximate variational methods) and sampling techniques are becoming more and more popular as computers are becoming powerful. In a specific graphical model dealing with learning hierarchies, the probability distributions involve thousands of hidden variables and the current powerhouse workstations can employ sequential sampling techniques to generate reasonable answers with-in acceptable duration of time.
If you search for literature on non-parametric bayesian statistics, you would find a lot of references to Chinese restaurant processes and Indian Buffet processes. Welcome to the modern face of statistics! These are extremely powerful metaphors for describing various non-parametric techniques and can be effectively used to solve various challenging problems. One being the learning of hierarchies and number of categories, effectively computing posteriors over infinitely deep and infinitely branching trees of category hierarchies; the nested chinese restaurant process. Another one just deals with learning the number of categories from data and is referred to as Chinese Restaurant Franchise.  Statistical Machine learning was never so yummy!!!
We here at Change.org are implementing these techniques and advancing the way web apps organize their content. We are learning user interests on various issues at different levels of abstractions and all of it is leading to the way we present the web content to users so that they are connected to the most interesting and engaging content all the time.
A glimpse at the category hierarchies being learned from data