Tuesday, May 22, 2012

Hive HBase integration / Hive HBase handler: Common issues and resolutions

When we first try out Hive HBase integration, it commonly throws lots of unexpected errors even though Hive and HBase are each running fine on their own. Some common issues are:

1) Jars not available
The following jars should be available on the Hive auxpath:
  1. /usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar
  2. /usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar
  3. /usr/lib/hive/lib/zookeeper-3.3.1.jar
These jar names vary with the HBase, Hive and ZooKeeper versions running on your cluster.

2) ZooKeeper quorum not available to the Hive client
You should specify the ZooKeeper server hostnames so that the active HBase master can be located for the HBase connection:

hbase.zookeeper.quorum=zk1,zk2,zk3

where zk1,zk2,zk3 should be the actual hostnames of the ZooKeeper servers.

These values can be set either per Hive session or permanently in your Hive configuration files.

1) Setting them in the Hive client session
$ hive -auxpath /usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar:/usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar:/usr/lib/hive/lib/zookeeper-3.3.1.jar -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3

2) Setting them in hive-site.xml
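The entries would look something like this, using the same jar locations and quorum as above (hive.aux.jars.path takes a comma-separated list of file:// URIs; adjust the paths to your versions):

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar,file:///usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar,file:///usr/lib/hive/lib/zookeeper-3.3.1.jar</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1,zk2,zk3</value>
</property>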


Hive HBase integration still not working
Error thrown: Master not running

Check whether the HBase master is really down. If it is fine, there could be other possibilities, a few common ones being
  • Firstly, check whether hbase.zookeeper.quorum is correctly set; it shouldn't be localhost.
  • If multiple HBase clusters share the same ZooKeeper quorum, the znode parent value will be different for each. If that is the case, 'zookeeper.znode.parent' also has to be set in the Hive configuration, to the correct value from hbase-site.xml, as in the example below.
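
For example (the znode path /hbase-cluster2 here is only a placeholder; copy the real value from your cluster's hbase-site.xml):

$ hive -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3 -hiveconf zookeeper.znode.parent=/hbase-cluster2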

Are compression codecs required on client nodes?

Sometimes, even with compression codecs available across all nodes in the cluster, we see some of our jobs throwing class not found errors for compression codecs.

Compression/decompression is normally done by the task trackers, but in certain cases the compression codecs are also required on the client nodes.

Some scenarios are

Total Order Partitioner
Before triggering the MapReduce job, the job needs an understanding of the range of keys in the input; only then can it decide which range of keys should go to which reducer. We need this value before the map tasks start, so the client first takes a random sample across the input data (the seek pattern could be something like: read the first 20 MB, skip the next 200 MB, read the next 20 MB, and so on), as sketched below.
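
A minimal sketch with the old "mapred" API that ships with CDH3 (the class name TotalOrderSort and the paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderSort {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TotalOrderSort.class);
    job.setInputFormat(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(4);
    FileInputFormat.setInputPaths(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/sorted"));

    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job, new Path("/data/_partitions"));

    // This step runs on the CLIENT, before any map task is launched:
    // it reads a sample of the input records and writes the key-range
    // boundaries to the partition file. If /data/input is compressed,
    // the codec class must be on the client's classpath right here.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);

    JobClient.runJob(job);
  }
}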

Hive and Pig
To better optimize jobs, distribute data uniformly across reducers, determine the number of reducers, and so on, Hive and Pig also do a quick seek over samples of the input data.

In both these cases, since a sample of the input data is actually read on the client side before the MR tasks run, the compression codec needs to be available on the client node as well if the data is compressed.

-libjars not working in custom MapReduce code: How to debug

Application developers bump into this issue a lot: they ship their custom jars with the MapReduce job, but when the code refers to classes in those jars it throws a ClassNotFoundException.

For -libjars to work, your main class should satisfy the following two conditions.
1) The main class should implement the Tool interface

// wrong usage - Tool interface not implemented
public class WordCount extends Configured {

// right usage
public class WordCount extends Configured implements Tool {
2) The main class should get the existing configuration using the getConf() method rather than creating a new Configuration instance.

// wrong usage - creating a new instance of Configuration
public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();

// right usage
public int run(String[] args) throws Exception {
    Configuration conf = getConf();
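
Both conditions come together when the class is launched through ToolRunner: its GenericOptionsParser is what actually consumes -libjars (along with -D, -files and so on) and folds them into the configuration that getConf() returns. A minimal sketch (the job setup in run() is elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains the jars passed via -libjars
    Configuration conf = getConf();
    // ... set up and submit the job using conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options before handing args to run()
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }
}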

How to recover deleted files from HDFS / Enable trash in HDFS

If you enable trash in HDFS, a file will still be available in trash for some period after an rmr is issued, so you can recover accidentally deleted files. To enable HDFS trash, set fs.trash.interval to a value greater than 0.

fs.trash.interval specifies how long a deleted file remains available in trash. A second property, fs.trash.checkpoint.interval, specifies the checkpoint interval: the NameNode checks the trash directory at every checkpoint and deletes all files older than fs.trash.interval. Say you have fs.trash.interval set to 60 minutes and fs.trash.checkpoint.interval set to 30 minutes; then every 30 minutes a check is performed that deletes all files more than 60 minutes old.

fs.trash.checkpoint.interval should be equal to or less than fs.trash.interval

The value of fs.trash.interval is specified in minutes.
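
For example, the 60/30-minute setup above would look like this in core-site.xml:

<property>
  <name>fs.trash.interval</name>
  <value>60</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>30</value>
</property>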

fs.trash.interval should be set on the client node as well as on the NameNode. On the NameNode it is needed for checkpointing purposes; based on the value on the client node it is decided whether an rmr issued from that client removes a file completely from HDFS or moves it to trash.

The trash dir by default is /user/X/.Trash (where X is the user name).
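
Deleted files keep their original path under the Current subdirectory of the trash, so recovery is just a move back (the file path below is only a placeholder):

$ hadoop fs -mv /user/X/.Trash/Current/user/X/important/part-00000 /user/X/important/part-00000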