Tuesday, May 24, 2011

Upcoming in Hadoop


Yahoo Research Labs is working on a Hadoop Federation release.
The major change in this release is support for multiple name nodes instead of a single one, each of which would in turn have multiple backup name nodes. The reason is that HDFS currently stores all metadata in a single name node, and the current cluster implementation would break with more than 30K tasks running concurrently. Keeping all metadata in one name node is the major limitation of HDFS, and this release should overcome it.
                This release is also expected to address the availability issues in the current HDFS cluster implementation. The current version is scalable, but when we add many nodes to a cluster, the data nodes take a very long time to register with the name node, i.e. the cluster can remain unavailable for an indefinitely long time during an upgrade.
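
To make the multiple name node idea more concrete, here is a minimal sketch of how a client could decide which name node holds the metadata for a given path. It assumes the namespace is split by path prefix through a client-side mount table. The announcement does not say how the split is done, so the table, the host names and the function resolve_namenode are all hypothetical.

```python
# Hypothetical mount table: path prefix -> name node address.
# The host names are made up for illustration only.
MOUNT_TABLE = {
    "/user": "namenode1.example.com:8020",
    "/data": "namenode2.example.com:8020",
    "/tmp": "namenode3.example.com:8020",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the name node responsible for `path` (longest matching prefix)."""
    best = None
    for prefix in mount_table:
        # Match the prefix itself or anything below it, but not a sibling
        # such as "/userdata" for the prefix "/user".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError("no name node mounted for " + path)
    return mount_table[best]
```

With a table like this, each name node keeps only the metadata under its own prefixes, so no single node has to hold the metadata for the whole cluster.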

Informatica to Bring in a Hadoop IDE
                This initiative from Informatica is aimed less at Hadoop M/R developers than at Informatica's existing customers. The new IDE would be an enhancement to the existing Informatica product, letting users create Hadoop jobs as mapplets within Informatica. Under the hood, these Hadoop jobs would be triggered by means of Pig scripts. It will be a proprietary product, not open source.

Amazon Elastic Map Reduce
                It is an alternative to maintaining your own cluster for running M/R jobs. We can rent a set of nodes from the Elastic Map Reduce service and run our jobs on them. The advantages of using this service are:
·         It is already a highly optimized and fine-tuned Hadoop installation
·         We can grow or shrink the number of nodes even while our M/R job is running
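·         Nodes are paid for only while in use, which can be much cheaper than keeping a dedicated cluster running (the nodes themselves are Amazon EC2 instances)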
They also provide some degree of data security by running the underlying nodes as virtual machines.
(This one is not just upcoming; it already exists.)
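
As a rough illustration of resizing a running cluster, the sketch below builds a resize request and passes it to an Elastic Map Reduce client. It assumes a client object that exposes a modify_instance_groups call, as the boto3 "emr" client does. The helper names build_resize_request and resize_cluster are hypothetical, and so are any cluster or group ids you pass in. Whether a particular group is allowed to shrink is decided by the service, not by this code.

```python
def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build the arguments for a resize call (hypothetical helper)."""
    if new_count < 1:
        raise ValueError("an instance group needs at least one node")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


def resize_cluster(emr_client, cluster_id, instance_group_id, new_count):
    """Ask the service to grow or shrink one instance group.

    `emr_client` is assumed to behave like boto3.client("emr").
    """
    request = build_resize_request(cluster_id, instance_group_id, new_count)
    return emr_client.modify_instance_groups(**request)
```

Keeping the request building separate from the call makes it possible to check the request without touching a real cluster.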

HSearch
                A search engine/utility built on top of Hadoop. It relies on Hadoop tools such as HBase, and does not use any part of Lucene or Katta, which are the proven tools in the search domain. The beta version is ready and will go through further performance optimization before being made available as open source.
