Wednesday, June 27, 2012

Chinese Restaurants and Indian Buffets; New Eateries in Statistical Machine Learning

One of the major new trends in Statistical Machine learning is a rejuvenated interest in unsupervised techniques as compared to supervised techniques. And the reason is obvious;  web applications are generating data at an enormous rate and it is becoming more and more expensive to manually label the data as is required by Supervised techniques. Usage Logs, natural language texts, and images make to the top of the list as the most abundant pieces of data being generated. And hence,  comes the interesting question: How to find trends in the data being generated? How to devise statistical models that can help organize the data? Take the example of the company I work in, Change.org, people are writing petition texts, making comments, sharing stuff on Facebook and Twitter at an amazing rate. More than 10,000 petitions are started on Change.org every month by our millions of users. And hence it is becoming more and more challenging for us to organize the content, learn trends, and present content to users that they care about. One important question in this regard is: How can we automatically categorize petitions in various categories where we learn the number of categories from the data itself? and is it possible to learn categories that form a hierarchy, like Global Warming as a sub-category of Environment and Environment as a subcategory of Planet Earth etc. And is it possible to learn all these hierarchies, number of categories from the data itself i-e with making as few assumptions as possible. The answer is yes!
A major recent trend in the unsupervised community is the use of non-parametric bayesian statistics. These techniques have become popular only recently because in most cases the process of updating the prior distributions with the likelihood of the data ( posterior computation) in non-parametric bayesian techniques involve intractable integrals which can only be approximated using sampling techniques (or approximate variational methods) and sampling techniques are becoming more and more popular as computers are becoming powerful. In a specific graphical model dealing with learning hierarchies, the probability distributions involve thousands of hidden variables and the current powerhouse workstations can employ sequential sampling techniques to generate reasonable answers with-in acceptable duration of time.
If you search for literature on non-parametric bayesian statistics, you would find a lot of references to Chinese restaurant processes and Indian Buffet processes. Welcome to the modern face of statistics! These are extremely powerful metaphors for describing various non-parametric techniques and can be effectively used to solve various challenging problems. One being the learning of hierarchies and number of categories, effectively computing posteriors over infinitely deep and infinitely branching trees of category hierarchies; the nested chinese restaurant process. Another one just deals with learning the number of categories from data and is referred to as Chinese Restaurant Franchise.  Statistical Machine learning was never so yummy!!!
We here at Change.org are implementing these techniques and advancing the way web apps organize their content. We are learning user interests on various issues at different levels of abstractions and all of it is leading to the way we present the web content to users so that they are connected to the most interesting and engaging content all the time.
A glimpse at the category hierarchies being learned from data

Thursday, April 5, 2012

Matrix Completion and Recommender Systems

Ever since Netflix introduced the world to the Netflix Prize, Recommendation Systems have been a hot research topic in Machine Learning Community. What BellKor and Pragmatic Chaos achieved in their winning algorithm for the Netflix prize was an impressive feat. But ever since then, a lot of people are following some of the techniques they presented, especially the "SVD-based" recommendation System. The procedures they presented are straight-forward to implement, and there are scalable implementations available in Apache Mahout which really only implements a highly simplified correlation-based Recommendation System. And Mahout seems to be a popular choice among a lot of Web applications.
These techniques however have a lot of limitations, first of all, what Bell-Kor and Pragmatic Chaos presented in their SVD based recommendation system was actually a hidden factors model in which they solved a series of linear regression problems to determine the hidden factors (the alternating least-squares method). Anyone remotely familiar with regression based techniques, knows that the output variable needs to be a numerical variable. Although this method works fine for places where we only have numerical numbers available in data (Netflix ratings) but a lot of times we only have access to categorical variables (e.g which item the user browsed over, purchased etc.) . A lot of time, we have mixed variables; both categorical and numerical. In these situations, the techniques currently prevalent in industry simply cannot be applied.
What I developed here at Change.org is a true mixed variable recommendation system. We have tons of categorical and numerical features. Recommendation System is simply a marketable name for Statistical Matrix Completion Procedures. You impose a statistical observation model and then learn the parameters of the models using Convex Optimization. The result is a neat Matrix Completion Algorithm which is tremendously useful because of the flexibility it provides in the data it can handle. More on this later as it develops !!