Monday, December 12, 2011


Overfitting and Regularization 

So, it has been a long time since I wrote anything. Things have been good: I got a job as a Machine Learning Expert in San Francisco, where I am exploring new things. Today I would like to talk a little bit about regularization and overfitting.

Overfitting, as it is generally defined, happens when your model is more sure of the training data than it should be; in other words, the model will not generalize well. Overfitting is a serious issue in maximum-likelihood estimation of model parameters. For example, in logistic regression you are trying to find the normal vector of the separating hyperplane between your two classes. As is normally done, you write down the likelihood of your training data, maximize it, solve for the unknown parameter vector, and get an update equation for that normal vector. Then you iterate with that equation and say "voilà!" when you see it converging. However, that convergence is often a deception. A thing or two to help detect overfitting (there is a small sketch after the list below):
1) Observe your parameter vector: if its components swing wildly from one extreme to another, be warned of overfitting.
2) If your training set is small, overfitting will come into play.
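
Here is a minimal sketch of what I mean, using NumPy; the data, learning rate, and iteration count are all made up. It runs plain maximum-likelihood logistic regression via gradient ascent on a tiny, linearly separable training set.

```python
# A minimal sketch, assuming NumPy; the toy data, learning rate, and iteration count are made up.
import numpy as np

def sigmoid(z):
    # clip to avoid overflow warnings once the weights get large
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

# Hypothetical toy data: six 2-D points, linearly separable through the origin.
X = np.array([[0.1, 1.0], [0.3, 0.5], [0.4, 2.0],
              [2.1, 0.8], [2.5, 1.5], [3.0, 0.2]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])
for step in range(1, 50001):
    p = sigmoid(X @ w)
    grad = X.T @ (y - p)   # gradient of the log-likelihood
    w = w + 0.1 * grad     # the "update equation" for the parameter vector
    if step % 10000 == 0:
        print(step, w)     # the components of w keep growing in magnitude
```

On a separable training set like this, the likelihood can always be improved by scaling w up, so the iterations look like they are converging while the components of w drift off to ever larger values, which is exactly the symptom in point 1.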

In the non-Bayesian setting, the easiest way to avoid overfitting is to use a regularizer, which penalizes your objective function when the components of the parameter vector get too large. Simply put, the regularizer loosens up your model, and in the process you may see your training error increase; that is the trade-off for better generalization.
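
To make that concrete, here is the same toy gradient-ascent loop as in the sketch above, with an L2 (ridge) penalty subtracted from the log-likelihood; lam is a made-up value, and X, y, and sigmoid are reused from that sketch.

```python
# Same toy loop as above (reusing X, y and sigmoid), now with an L2 penalty;
# lam is a made-up regularization strength.
lam = 1.0
w_reg = np.zeros(X.shape[1])
for step in range(50000):
    p = sigmoid(X @ w_reg)
    grad = X.T @ (y - p) - lam * w_reg   # the penalty pulls large components back toward zero
    w_reg = w_reg + 0.1 * grad
print(w_reg)   # stays bounded, unlike the unregularized weights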

Why does overfitting occur? A simple explanation is that the features we collect in our data are generally not uncorrelated with each other. If you arrange your data points as the rows of a matrix and plot its singular values via an SVD, you will find that they fall off to zero very quickly. The singular value spectrum (top 400 values) of a matrix with over 5 million rows and 2000 columns looks as follows.
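
The real matrix is obviously not something I can paste here, but you can reproduce the same fast decay with synthetic data (all sizes and the noise level below are made up): build features as linear combinations of a handful of underlying directions plus a little noise, and look at the singular values.

```python
# A rough, synthetic reproduction of the effect; sizes and noise level are made up.
import numpy as np

np.random.seed(0)
n, d, k = 5000, 200, 10
basis = np.random.randn(k, d)                                    # k underlying directions
data = np.random.randn(n, k) @ basis + 0.01 * np.random.randn(n, d)

s = np.linalg.svd(data, compute_uv=False)
print(np.round(s[:15] / s[0], 4))   # the first ~10 values dominate, the rest are near zero
```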

Assuming a linear regression model, if you try to find your regression coefficients with such a very-low-rank feature matrix by naively solving the normal equations (inverting XᵀX), your R/Matlab is going to complain about singularity. If instead you write an iterative algorithm to find the parameters, the result is overfitting, i.e., large variations in the components of the parameter vector. Hence the easiest solution is to use a regularizer, as I mentioned above. With the advent of multicore frameworks like GraphLab, distributed computing tools like Mahout, or even R's irlba package, you can also compute the SVD of your data matrix and project it onto the lower dimensions that capture most of the variance; a rough sketch of both fixes follows. Would that be the whole solution? You wish :) but that's a story for another day!
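
Here is that sketch, on synthetic low-rank data (the sizes, lambda, and the number of components kept are all made-up values): ridge-regularizing the normal equations, and projecting onto the top singular directions before regressing.

```python
# A rough sketch of both fixes on synthetic low-rank data; sizes, lambda,
# and the number of components kept are made-up values.
import numpy as np

np.random.seed(0)
n, d, k = 5000, 200, 10
X = np.random.randn(n, k) @ np.random.randn(k, d)   # feature matrix of rank ~10
y = X @ np.random.randn(d) + 0.1 * np.random.randn(n)

# Naive normal equations: X.T @ X has rank ~10, so inverting it triggers the singularity complaint.
# Fix 1, ridge: add lambda * I before solving.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Fix 2, truncated SVD: project onto the top-k singular directions, then regress there.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                           # n x k scores capturing almost all the variance
w_z = np.linalg.solve(Z.T @ Z, Z.T @ y)    # small, invertible k x k system
print(np.linalg.norm(w_ridge), np.linalg.norm(w_z))
```

Both give bounded, stable coefficients; how to choose lambda or how many components to keep is, of course, part of that other story.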