Monday, December 12, 2011


Overfitting and Regularization 

So, after a long time, I am writing something again. Things have been good: I got a job as a machine learning expert in San Francisco, where I am exploring new things. Today I would like to talk a little bit about regularization and overfitting.

Overfitting, as generally defined, happens when your model becomes more certain about the training data than it should be; in other words, the model will not generalize well. Overfitting is a serious issue in maximum-likelihood estimation of model parameters. For example, in logistic regression you are trying to find the normal vector of the separating hyperplane between your two classes. As is normally done, you write the likelihood function of your training data, maximize it with respect to the unknown parameter vector, and arrive at an update equation for that vector. You then iterate with that equation and say "voilà!" when you see it converging. However, this convergence is often a deception. A thing or two to help detect overfitting:
1) Observe your parameter vector: if its components swing wildly from one extreme to another, be warned of overfitting.
2) If your training set is small, overfitting is likely to come into play.

In the non-Bayesian setting, the easiest way to avoid overfitting is to use a regularizer, which penalizes your objective function when the components of the parameter vector become too large. Simply put, the regularizer tries to loosen up your model, and in that process you may observe your training error increase.
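To make that concrete, here is a minimal sketch of L2-regularized logistic regression in plain NumPy (my own illustrative code, not from any particular library): the gradient update is the usual maximum-likelihood one, plus a penalty term that shrinks large components of the parameter vector. Setting lam = 0 recovers the unregularized update described above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lam=1.0, lr=0.5, n_iters=2000):
        """Gradient ascent on the L2-regularized log-likelihood.
        X: (n, d) feature matrix; y: (n,) labels in {0, 1};
        lam: regularization strength (lam = 0 is plain MLE)."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            p = sigmoid(X @ w)                    # predicted probabilities
            grad = X.T @ (y - p) / n - lam * w    # penalty shrinks big weights
            w += lr * grad
        return w

Try it with lam = 0 on a small dataset with correlated features and watch the components of w blow up from iteration to iteration; even a small lam keeps them bounded, which is exactly the symptom from point 1 above.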

Why does overfitting occur? A simple explanation is that the features we collect in our data are generally correlated with each other. If you arrange your data points as the rows of a matrix and plot its singular values (via the SVD), you'll find that they fall off to zero very quickly. [Figure: the top 400 singular values of a matrix with over 5 million rows and 2,000 columns, decaying rapidly toward zero.]
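You can check this on your own feature matrix in a few lines. The sketch below uses a synthetic matrix (200 observed features generated from only 20 underlying factors, my stand-in for real correlated data):

    import numpy as np

    rng = np.random.default_rng(0)
    # 200 observed features driven by only 20 latent factors -> correlated columns
    X = rng.standard_normal((10000, 20)) @ rng.standard_normal((20, 200))
    X += 0.01 * rng.standard_normal(X.shape)   # a little measurement noise

    s = np.linalg.svd(X, compute_uv=False)     # singular values, descending
    print(s[:25] / s[0])                       # first ~20 dominate, rest near zero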

Assuming a linear regression model, if you try to find your regression coefficients with such a feature matrix, which is effectively very low rank, by naively inverting it, R or MATLAB is going to complain about singularity. If you instead write an iterative algorithm to find the parameters, the result will be overfitting, i.e., large variations in the components of the parameter vector. Hence the easiest solution is to use a regularizer, as I mentioned above. Alternatively, with the advent of multicore frameworks like GraphLab, distributed computing tools like Mahout, or even R's irlba package, you can compute the SVD of your data matrix and project it onto a lower-dimensional subspace that captures most of the variance in the data. Would that be a full solution? You wish :), but that's a story for another day!
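For completeness, here are both remedies sketched in NumPy, using the standard textbook formulas rather than any of the tools named above: ridge regression, where the lam * I term makes the normal equations invertible even when X is low rank, and a truncated-SVD projection onto the top-k directions of variance.

    import numpy as np

    def ridge_fit(X, y, lam=1.0):
        """Ridge regression: w = (X^T X + lam I)^{-1} X^T y.
        The lam * I term fixes the singularity that plain least
        squares hits when the feature matrix is (nearly) low rank."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def svd_project(X, k):
        """Project the rows of X onto the top-k right singular
        vectors, keeping the directions with most of the variance."""
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        return X @ Vt[:k].T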

Thursday, November 3, 2011

Career in Data Science

A very nice video about a career in data science, by an Amazon engineer:
http://youtu.be/0tuEEnL61HM

Tuesday, August 23, 2011

Data Science Workflow

Today I am going to describe the workflow that I follow to solve a "Big Data" problem that has a considerable research flavor as well.

  • Rapid Prototyping in MATLAB: If you'd like to try a new method, for example Latent Dirichlet Allocation, on a very small subset of your data, then MATLAB is perhaps the quickest way to do it for a certain domain of problems. If you can model your data as a matrix (e.g., documents by words, a convenient representation for many information retrieval systems), then MATLAB has many built-in tools, plus some third-party code, that you can run on a small portion of your dataset to quickly visualize the results.
  • Scripting Language (Python/Ruby): The next step is to convert your algorithm into your favorite scripting language; my favorites are Python and Ruby. Depending on the scale of the problem, this may be your production-level code, and you can use visualization tools here such as the Python Imaging Library, or Protovis, an excellent new visualization tool by Stanford's Visualization Group, to show your results impressively.
  • Hadoop: This step is optional, but it is becoming more and more necessary these days given the amount of data that needs to be processed, and it is my personal favorite :) Once you are satisfied with your results from steps 1 and 2, you can write a nice MapReduce program in Java for Hadoop and let it go crazy on Amazon EC2 for truly immense datasets (a toy streaming example follows this list).
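To make the Hadoop step concrete, here is a minimal word-count mapper/reducer pair for Hadoop Streaming, written in Python rather than Java so it stays short (word count is my stand-in example, not a project from this post). Streaming runs any pair of executables that read stdin and write tab-separated key/value lines to stdout; Hadoop sorts by key between the two stages.

    #!/usr/bin/env python
    # mapper.py: emit (word, 1) for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by key, so all counts
    # for a given word are contiguous
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

You can test the pair locally with cat input.txt | python mapper.py | sort | python reducer.py before submitting the same two scripts to EC2/EMR with the hadoop-streaming jar.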


So there you have it: a nice workflow you can follow to turn a research-oriented problem into a valuable product for your company.

Monday, August 8, 2011

Github: A place for social coders (aka nerds :) )

My Github:
https://github.com/ehtsham

Hi guys, a new day and a very small post. This is my GitHub, where all of my repositories are currently public and free for everyone to use. There are many interesting projects I have completed and posted there: Google's PageRank in MapReduce, an inverted index and retrieval system for the whole Wikipedia collection (500 GB, phew!), and recommendation engines for Netflix and for social blogging (Delicious, anyone?) and networking sites. All the MapReduce programs are ready to go crazy on Amazon's computing cloud (EC2 or EMR, the latter being easier for MapReduce). I hope you guys will like it!

Friday, June 24, 2011

Data Science: a new name for Machine Learning and Statistics in Industry

I have become fascinated with data: it is powerful, and it hides within itself great pieces of knowledge. How do you leverage data to harvest that knowledge? Be a data scientist. How? Become passionate about it. I am sharing a video of a person I greatly admire, Hilary Mason, a famous data scientist, speaking at Strata 2011.
Strata 2011, Hilary Mason

Listen to her talk and imagine the endless ways you can use data!

Friday, June 3, 2011

Machine Learning in Industry: My First Post

Machine Learning in Industry: My First Post: "This is my first blog: So, whats up!, I am a budding Machine Learning Engineer passionate about Machine Learning research and development, a..."

My First Post

This is my first blog post. So, what's up! I am a budding Machine Learning Engineer, passionate about machine learning research and development and about scalable implementations of challenging machine learning techniques. After completing my graduate studies, I, like many others entering the void called practical life, was searching for my identity. You may think of this 'identity' as the professional field one associates with, rather than some weird philosophical concept, since my intention is to keep this blog primarily about my experiences in industry. In this blog, I plan to share my ideas about how I feel machine learning coursework should be designed to meet the demands of industry. I will also share my views on the newly emerging paradigm of the "knowledge-based economy". I feel this paradigm has a great impact on students who want to pursue studies in Electrical Engineering, Computer Science, and related disciplines, in terms of the career paths they can pursue.
So if you are a student and want to know more about the things I have mentioned above, stay tuned to this blog. The world is a global village; as cliché as it may sound, it is really true, especially for people like me who want to learn newly emerging technologies. So I will also share course websites, video lectures, and so on, which anyone with a sound background in the basics of computer science should be able to follow to learn new skills from the comfort of their home.
As a start, I would like to share something fundamental today: a great course on Data Structures offered at the University of California, Berkeley (Data Structures Course at Berkeley). Video lectures are available on YouTube, with assignments and detailed material on the course webpage; a great resource for anyone interested in Data Structures.