Wednesday, June 27, 2012

Chinese Restaurants and Indian Buffets; New Eateries in Statistical Machine Learning

One of the major new trends in Statistical Machine learning is a rejuvenated interest in unsupervised techniques as compared to supervised techniques. And the reason is obvious;  web applications are generating data at an enormous rate and it is becoming more and more expensive to manually label the data as is required by Supervised techniques. Usage Logs, natural language texts, and images make to the top of the list as the most abundant pieces of data being generated. And hence,  comes the interesting question: How to find trends in the data being generated? How to devise statistical models that can help organize the data? Take the example of the company I work in,, people are writing petition texts, making comments, sharing stuff on Facebook and Twitter at an amazing rate. More than 10,000 petitions are started on every month by our millions of users. And hence it is becoming more and more challenging for us to organize the content, learn trends, and present content to users that they care about. One important question in this regard is: How can we automatically categorize petitions in various categories where we learn the number of categories from the data itself? and is it possible to learn categories that form a hierarchy, like Global Warming as a sub-category of Environment and Environment as a subcategory of Planet Earth etc. And is it possible to learn all these hierarchies, number of categories from the data itself i-e with making as few assumptions as possible. The answer is yes!
A major recent trend in the unsupervised community is the use of non-parametric bayesian statistics. These techniques have become popular only recently because in most cases the process of updating the prior distributions with the likelihood of the data ( posterior computation) in non-parametric bayesian techniques involve intractable integrals which can only be approximated using sampling techniques (or approximate variational methods) and sampling techniques are becoming more and more popular as computers are becoming powerful. In a specific graphical model dealing with learning hierarchies, the probability distributions involve thousands of hidden variables and the current powerhouse workstations can employ sequential sampling techniques to generate reasonable answers with-in acceptable duration of time.
If you search for literature on non-parametric bayesian statistics, you would find a lot of references to Chinese restaurant processes and Indian Buffet processes. Welcome to the modern face of statistics! These are extremely powerful metaphors for describing various non-parametric techniques and can be effectively used to solve various challenging problems. One being the learning of hierarchies and number of categories, effectively computing posteriors over infinitely deep and infinitely branching trees of category hierarchies; the nested chinese restaurant process. Another one just deals with learning the number of categories from data and is referred to as Chinese Restaurant Franchise.  Statistical Machine learning was never so yummy!!!
We here at are implementing these techniques and advancing the way web apps organize their content. We are learning user interests on various issues at different levels of abstractions and all of it is leading to the way we present the web content to users so that they are connected to the most interesting and engaging content all the time.
A glimpse at the category hierarchies being learned from data

Thursday, April 5, 2012

Matrix Completion and Recommender Systems

Ever since Netflix introduced the world to the Netflix Prize, Recommendation Systems have been a hot research topic in Machine Learning Community. What BellKor and Pragmatic Chaos achieved in their winning algorithm for the Netflix prize was an impressive feat. But ever since then, a lot of people are following some of the techniques they presented, especially the "SVD-based" recommendation System. The procedures they presented are straight-forward to implement, and there are scalable implementations available in Apache Mahout which really only implements a highly simplified correlation-based Recommendation System. And Mahout seems to be a popular choice among a lot of Web applications.
These techniques however have a lot of limitations, first of all, what Bell-Kor and Pragmatic Chaos presented in their SVD based recommendation system was actually a hidden factors model in which they solved a series of linear regression problems to determine the hidden factors (the alternating least-squares method). Anyone remotely familiar with regression based techniques, knows that the output variable needs to be a numerical variable. Although this method works fine for places where we only have numerical numbers available in data (Netflix ratings) but a lot of times we only have access to categorical variables (e.g which item the user browsed over, purchased etc.) . A lot of time, we have mixed variables; both categorical and numerical. In these situations, the techniques currently prevalent in industry simply cannot be applied.
What I developed here at is a true mixed variable recommendation system. We have tons of categorical and numerical features. Recommendation System is simply a marketable name for Statistical Matrix Completion Procedures. You impose a statistical observation model and then learn the parameters of the models using Convex Optimization. The result is a neat Matrix Completion Algorithm which is tremendously useful because of the flexibility it provides in the data it can handle. More on this later as it develops !! 

Monday, December 12, 2011

Overfitting and Regularization 

So after a long time writing something, things have been good, got a job as a Machine Learning Expert in San Francisco where I am exploring new things. Today I would like to talk a little bit about some regularization and overfitting.  Overfitting as is generally defined, happens when your model tends to be more sure of the data then it should be. In other words, your model will not generalize well. Overfitting is a serious issue which occurs in Maximum-likelihood estimation of model parameters. For example, in Logistic regression, you are trying to find the normal vector to the separating hyperplane between your two classes and as is normally done, you write the likelihood function for your training data and then maximize it and solve for the unknown parameter vector, and you get an Update Equation for the normal parameter vector. And then you make iterations using that equation and say "wahlaaa!" when you see it converging. However, this convergence is generally a deception. A thing or two to detect overfitting,
1) Observe your parameter vector, if its components vary hugely between one extreme to another, better be warned of overfitting
2) If your training set was small, overfitting will come into play.

In the Non-Bayesian setting, the easiest way to avoid overfitting is to use a Regularizer which penalizes your objective function if the parameter vector's components become too large, simply put, regularizer tries to loosen up your model and in that process you may observe your training error to increase.

Why does overfitting occur? A simple explanation is that generally the features that we collect in our data, they are not uncorrelated from each other. If you arrange your data points as rows of a matrix and then try to plot its Singular values through svd, you ll find out that singular values fall very quickly to zero, the eigen value spectrum of a matrix with size over 5 million by 2000 column matrix (top 400) is as follows.

Assuming a linear regression model, if you try to find your regression coefficients with such a feature matrix which is very low rank by naively inverting this matrix, your R/Matlab is gonna complain about singularity. If you write an iterative algorithm for finding the parameter, the result would be overfitting or large variations in the components of parameter vector. Hence the easiest solution would be to use a regularizer as I previously mentioned. With the advent of multicore-processing softwares like Graphlab, distributed computing tools like Mahout or even R's irlba package, you can compute SVD of your data matrix and project it in lower dimensions which capture most of the variance of the data. Would that be a solution, you wish :),  but thats a story for another day !!!

Thursday, November 3, 2011

Career in Data Science

A very nice video about a career in data science by an Amazon Engineer

Tuesday, August 23, 2011

Data Science Workflow

Today I am going to describe the work flow that I follow to solve a "Big Data" problem which has a considerable research flavor as well

  • Rapid Prototyping in MATLAB: If you like to try new method for example Latent Dirichlet Allocation on a very small subset of your data, then MATLAB perhaps is the quickest and fastest way to do it for a certain domain of problems. If you could model your data as a Matrix ( e.g Documents by Words a convenient representation for many Information retrieval systems), then MATLAB has many built in tools plus some third party written software code as well which you could quickly try on a very small portion of your dataset and quickly visualize the results
  • Scripting Language (Python/Ruby): The next step is to convert your algorithm in your favorite Scripting Language, My favorite ones are Python and Ruby. Given the scale of the problem, this may be your production level code and you can use some visualization tools here like Python Imaging Library or a new excellent visualization tool by Stanford's Visualization group, Protovis to impressively show your results
  • Hadoop: This step is optional but it is becoming more and more necessary these days given the amount of data that is required to be processed and is my personal favorite :) Satisfied from your results from step 1 and 2, you can write a very nice MapReduce program in Java for Hadoop and let it go crazy on Amazon EC2 for truly immense datasets.

So there you have it, a nice workflow which you can follow to convert a research oriented problem into a great commodity for your company

Monday, August 8, 2011

Github: A place for social coders (aka nerds :) )

My Github:

Hi Guys, a new day and a very small post, this is my Github in which currently all of my repositories are public and for every one to use there are many interesting projects I have completed and posted there like Google's PageRank in MapReduce, an Inverted Index and Retrieval for the whole Wikipedia collection (500 GB phew!) Recommendation Engines for Netflix and Social blogging (delicious anyone?) and networking sites, All the MapReduce programs are ready to go crazy on Amazon's Computing Cloud (EC2 or EMR, the later being easy for MapReduce) I hope you guys will like it !