Tuesday, August 23, 2011

Data Science Workflow

Today I am going to describe the workflow I follow to solve a "Big Data" problem that has a considerable research flavor as well.

  • Rapid Prototyping in MATLAB: If you'd like to try a new method, for example Latent Dirichlet Allocation, on a very small subset of your data, then MATLAB is perhaps the quickest way to do it for a certain domain of problems. If you can model your data as a matrix (e.g. documents by words, a convenient representation for many information retrieval systems), then MATLAB has many built-in tools, plus some third-party code, that you can quickly try on a small portion of your dataset and quickly visualize the results.
  • Scripting Language (Python/Ruby): The next step is to port your algorithm to your favorite scripting language; my favorites are Python and Ruby. Depending on the scale of the problem, this may be your production-level code, and you can use visualization tools here like the Python Imaging Library or Protovis, an excellent new visualization tool from Stanford's Visualization Group, to impressively show your results.
  • Hadoop: This step is optional, but given the amount of data that needs to be processed these days it is becoming more and more necessary, and it is my personal favorite :) Once you are satisfied with your results from steps 1 and 2, you can write a very nice MapReduce program in Java for Hadoop and let it go crazy on Amazon EC2 for truly immense datasets.
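The prototyping step above can be sketched outside MATLAB too. Here is a minimal Python/NumPy version of the documents-by-words matrix it describes (the toy documents and variable names are my own illustration, not from the post):

```python
# Minimal sketch of the "documents by words" count matrix from the
# prototyping step. Toy documents are made up for illustration.
import numpy as np

docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat chased the dog",
]

# Build the vocabulary from all documents.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

# Documents-by-words matrix: rows are documents, columns are words.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for d, doc in enumerate(docs):
    for word in doc.split():
        X[d, index[word]] += 1

print(X.shape)  # (3, 10): 3 documents, 10 distinct words
```

A matrix like `X` is exactly the representation you would feed to a topic model such as LDA once the prototype looks promising.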
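For the Hadoop step, the production code would be a Java MapReduce job as described above, but the map/reduce logic itself can be prototyped locally in a scripting language first. A minimal Python word-count sketch (a stand-in example, not the author's actual program):

```python
# Local simulation of a MapReduce job: word count, the classic example.
# In production this logic would live in a Java Mapper/Reducer on Hadoop.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Sum the partial counts for a single word."""
    return word, sum(counts)

def run_job(lines):
    """Run map, then simulate the shuffle-and-sort phase, then reduce."""
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (c for _, c in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

result = run_job(["the cat sat", "the dog sat"])
print(result)  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```

The same mapper/reducer pair, translated to Java, is what Hadoop would distribute across an EC2 cluster for a truly immense dataset.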


So there you have it: a nice workflow you can follow to turn a research-oriented problem into a great asset for your company.

Monday, August 8, 2011

GitHub: A place for social coders (aka nerds :) )

My GitHub:
https://github.com/ehtsham

Hi guys, a new day and a very small post: this is my GitHub, where all of my repositories are currently public and free for everyone to use. There are many interesting projects I have completed and posted there, like Google's PageRank in MapReduce, an inverted index and retrieval system for the whole Wikipedia collection (500 GB, phew!), and recommendation engines for Netflix and for social blogging (Delicious, anyone?) and networking sites. All the MapReduce programs are ready to go crazy on Amazon's computing cloud (EC2 or EMR, the latter being easier for MapReduce). I hope you guys will like it!