Thursday, April 5, 2012

Matrix Completion and Recommender Systems

Ever since Netflix introduced the world to the Netflix Prize, recommendation systems have been a hot research topic in the machine learning community. What BellKor's Pragmatic Chaos achieved with their winning algorithm for the Netflix Prize was an impressive feat. Since then, a lot of people have been following some of the techniques they presented, especially the "SVD-based" recommendation system. The procedures they presented are straightforward to implement, and scalable implementations are available in Apache Mahout, although Mahout really only implements a highly simplified correlation-based recommendation system. Mahout nevertheless seems to be a popular choice among a lot of web applications.
These techniques, however, have a lot of limitations. First of all, what BellKor's Pragmatic Chaos presented as an SVD-based recommendation system was actually a hidden factors model, in which they solved a series of linear regression problems to determine the hidden factors (the alternating least-squares method). Anyone remotely familiar with regression-based techniques knows that the output variable needs to be numerical. This method works fine where the data are purely numerical (Netflix ratings), but a lot of the time we only have access to categorical variables (e.g. which items the user browsed, purchased, etc.). Often we have mixed variables, both categorical and numerical. In these situations, the techniques currently prevalent in industry simply cannot be applied.
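To make the alternating least-squares idea concrete, here is a minimal sketch in Python with NumPy. It is not the Netflix Prize code, just the textbook form of the method: fix the item factors and solve a small ridge regression for each user, then fix the user factors and do the same for each item. The function names (`als`, `als_objective`), the rank, the regularization weight and the toy ratings are all my own assumptions for illustration.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, iters=50, seed=0):
    """Fit R ~= U @ V.T on the observed entries (mask == True) by ALS.

    Hypothetical helper for illustration; rank, reg and iters are arbitrary.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V: each user's factors solve a small ridge regression
        # over the items that user actually rated.
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Fix U: each item's factors solve the mirror-image regression.
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

def als_objective(R, mask, U, V, reg=0.1):
    """Squared error on observed entries plus the ridge penalty."""
    resid = (R - U @ V.T)[mask]
    return np.sum(resid ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2))

# Toy ratings matrix (made-up numbers); NaN marks a missing rating.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, 5.0, 1.0, np.nan],
              [1.0, np.nan, 5.0, 4.0],
              [np.nan, 1.0, 4.0, 5.0]])
mask = ~np.isnan(R)
U, V = als(R, mask)
print(np.round(U @ V.T, 2))  # completed matrix, including the missing cells
```

Notice that every step is a least-squares fit to a numerical target. That is exactly why the method has nothing sensible to say when the observed entry is a category rather than a number.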
What I developed here at Change.org is a true mixed-variable recommendation system, since we have tons of both categorical and numerical features. "Recommendation system" is simply a marketable name for statistical matrix completion procedures: you impose a statistical observation model on the entries of the matrix and then learn the parameters of the model using convex optimization. The result is a neat matrix completion algorithm, which is tremendously useful because of the flexibility it provides in the kinds of data it can handle. More on this later as it develops!
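In the meantime, here is a rough sketch of what "impose an observation model, then solve a convex problem" can look like. To be clear, this is not the Change.org system, only one standard convex formulation that I am assuming for illustration: numerical columns get a Gaussian (squared-error) loss, binary columns get a Bernoulli (logistic) loss, and a nuclear-norm penalty encourages a low-rank completed matrix. It is solved by proximal gradient descent with singular value soft-thresholding. All names (`svt`, `complete_mixed`, `mixed_objective`), the penalty `lam` and the toy data are hypothetical.

```python
import numpy as np

def svt(X, tau):
    """Soft-threshold the singular values: the prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def complete_mixed(Y, mask, is_binary, lam=0.5, step=1.0, iters=500):
    """Proximal gradient on: observed-entry loss + lam * ||Z||_*.

    Numerical columns use 0.5 * (Z - Y)^2; binary columns (Y in {0, 1})
    use the logistic loss with Z as the logit. The smooth part has a
    1-Lipschitz gradient, so step = 1.0 is a safe step size.
    """
    Y0 = np.where(mask, Y, 0.0)            # placeholder for missing cells
    bin_cols = np.broadcast_to(is_binary, Y.shape)
    Z = np.zeros(Y.shape)
    for _ in range(iters):
        sigm = 1.0 / (1.0 + np.exp(-Z))
        grad = np.where(bin_cols, sigm - Y0, Z - Y0)
        grad = np.where(mask, grad, 0.0)   # only observed cells contribute
        Z = svt(Z - step * grad, step * lam)
    return Z

def mixed_objective(Z, Y, mask, is_binary, lam=0.5):
    """Value of the convex objective that complete_mixed minimizes."""
    Y0 = np.where(mask, Y, 0.0)
    bin_cols = np.broadcast_to(is_binary, Y.shape)
    loss = np.where(bin_cols, np.logaddexp(0.0, Z) - Y0 * Z, 0.5 * (Z - Y0) ** 2)
    return np.sum(loss[mask]) + lam * np.linalg.norm(Z, "nuc")

# Made-up data: two rating columns, then "clicked" and "purchased" flags.
Y = np.array([[5.0, 1.0, 1.0, np.nan],
              [4.0, np.nan, 1.0, 0.0],
              [1.0, 5.0, np.nan, 1.0],
              [np.nan, 4.0, 0.0, 1.0]])
mask = ~np.isnan(Y)
is_binary = np.array([False, False, True, True])
Z = complete_mixed(Y, mask, is_binary)
# Read numerical columns directly; pass binary columns through a sigmoid.
completed = np.where(is_binary, 1.0 / (1.0 + np.exp(-Z)), Z)
print(np.round(completed, 2))
```

The point of the sketch is that only the per-entry loss changes with the variable type; the optimization machinery stays the same. A more careful version would center the numerical columns first, since the nuclear-norm penalty shrinks everything toward zero.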