Monday, June 29, 2015

Hadoop Archives (har) - Creating and Reading HAR

A quick post that explains the following with samples
  • Create a HAR file
  • List the Contents of a HAR file
  • Read the contents of a file that is within a HAR

Listed below is the input directory structure in HDFS that I'll be using to create a HAR

hadoop fs -ls /bejoyks/test/har/source_files/*
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file1.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file2.tsv
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file3.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file4.tsv

CLI Command to create a HAR

hadoop archive -archiveName <archiveName.har> -p <ParentDirHDFS> -r <ReplicationFactor> <childDir01> <childDir02> <DestinationDirectoryHDFS>

Command Used
hadoop archive -archiveName tsv_daily.har -p /bejoyks/test/har/source_files -r 3 srcDir01 srcDir02 /bejoyks/test/har/destination

CLI Command to list the contents of a HAR

hadoop fs -ls har://<AbsolutePathOfHarFile>

Command Used and Output
Command 01 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
drwxr-xr-x   - hadoop supergroup          0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir02

Command 02 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
Found 2 items
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file1.tsv
-rw-r--r--   3 hadoop supergroup         22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv

READING a File within a HAR
hadoop fs -text har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv
file2    row1
file2    row2

** Common mistakes while reading a HAR file

Always use the har:// URI while reading a HAR file
Since we are used to listing directories/files in HDFS without a URI, we might try the same pattern here. But HAR files don't behave as expected unless the path is prefixed with the har:// URI. If listed without the URI you'll see the HAR's internal metadata files instead, something like below.

hadoop fs -ls /bejoyks/test/har/destination/tsv_daily.har
Found 3 items
-rw-r--r--   5 hadoop supergroup        277 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_index
-rw-r--r--   5 hadoop supergroup         23 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_masterindex
-rw-r--r--   3 hadoop supergroup         88 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/part-0

Tuesday, August 21, 2012

Performance tuning of hive queries

Hive performance optimization is a larger topic on its own and is very specific to the queries you are using. In fact, each query in a query file needs separate performance tuning to get the most robust results.

I'll try to list a few approaches generally used for performance optimization

Limit the data flow down the queries
When you run a hive query, the volume of data that flows down to each level is the factor that decides performance. So if you are executing a script that contains a sequence of Hive QL statements, make sure that data filtration happens in the first few stages rather than carrying unwanted data to the bottom. This will give you significant performance gains, as the queries down the lane will have much less data to crunch.

This is a common bottleneck when existing SQL jobs are ported to hive: we just execute the same sequence of SQL steps in hive as well, which hurts performance. Understand the requirement behind the existing SQL script and design your hive job with the data flow in mind.
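As a sketch of the idea (the table and column names below are made up for illustration), prefer pushing filters into the earliest stage instead of filtering after a join:

```sql
-- Less efficient: every orders row flows through the join,
-- and the date filter is applied only afterwards
SELECT c.name, o.amount
FROM customers c
JOIN orders o ON (c.id = o.customer_id)
WHERE o.order_date >= '2012-01-01';

-- Better: filter in a subquery first, so the join and every
-- stage after it crunch far less data
SELECT c.name, o.amount
FROM customers c
JOIN (
  SELECT customer_id, amount
  FROM orders
  WHERE order_date >= '2012-01-01'
) o ON (c.id = o.customer_id);
```

Newer Hive versions may push such predicates down automatically, but writing the filter early keeps the intent explicit and works on older releases too.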

Use hive merge files
Hive queries are parsed into map-only or map-reduce jobs, and a hive script will contain lots of such queries. Assume one of your queries is parsed into a mapreduce job and the output files from that job are very small, say 10 MB. In that case the subsequent query that consumes this data may generate a larger number of map tasks and would be inefficient; if more jobs run over the same data set, all of them suffer. In such scenarios, if you enable merge files in hive, the first query runs a merge job at the end, thereby merging the small files into larger ones. This is controlled using the following parameters

hive.merge.mapfiles=true (true by default in hive)

For more control over merge files you can tweak these properties as well
hive.merge.size.per.task (the max final size of a file after the merge task)
hive.merge.smallfiles.avgsize (the merge job is triggered only if the average output file size is less than the specified value)

The default values for the above properties are 256000000 (256 MB) for hive.merge.size.per.task and 16000000 (16 MB) for hive.merge.smallfiles.avgsize.

When you enable merge, an extra map-only job is triggered; whether this job gets you an optimization or an overhead depends entirely on your use case and queries.
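A session-level setup could look like the following sketch (the size values shown are the usual defaults, given here only for illustration):

```sql
SET hive.merge.mapfiles=true;               -- merge outputs of map-only jobs (on by default)
SET hive.merge.mapredfiles=true;            -- merge outputs of map-reduce jobs (off by default)
SET hive.merge.size.per.task=256000000;     -- target max size of a merged file, in bytes
SET hive.merge.smallfiles.avgsize=16000000; -- merge only if avg output file is below this size
```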

Join Optimizations
Joins are very expensive. Avoid them if possible. If a join is required, try to use join optimizations such as map joins, bucketed map joins etc.
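For instance, a map join can be requested with a hint when one side of the join is small enough to be held in memory on every mapper (the table names here are hypothetical):

```sql
-- small_dim is broadcast to each mapper, so the join
-- completes in the map phase with no shuffle/reduce step
SELECT /*+ MAPJOIN(small_dim) */ f.id, d.label
FROM fact_table f
JOIN small_dim d ON (f.dim_id = d.id);
```

In later Hive versions, setting hive.auto.convert.join=true lets Hive choose the map join automatically based on table sizes.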

There is still more to cover on hive query performance optimization; take this post as the baby step. More will be added to this post soon. :)

How to migrate a hive table from one hive instance to another or between hive databases

Hive has had the EXPORT/IMPORT feature since hive 0.8. With this feature you can export the metadata as well as the data for a table to a location in hdfs using the EXPORT command; the table metadata is written out in json format alongside the data. Data exported this way can then be imported into another database or hive instance using the IMPORT command.

The syntax looks something like this:
EXPORT TABLE table_or_partition TO hdfs_path;
IMPORT [[EXTERNAL] TABLE table_or_partition] FROM hdfs_path [LOCATION [table_location]];

Some sample statements would look like:
EXPORT TABLE <table name> TO 'location in hdfs';

USE test_db;
IMPORT FROM 'location in hdfs';

Export/Import can be applied on a partition basis as well:
EXPORT TABLE <table name> PARTITION (loc="USA") TO 'location in hdfs';

The below import command imports to an external table instead of a managed one:
IMPORT EXTERNAL TABLE FROM 'location in hdfs' LOCATION '/location/of/external/table';

Tuesday, May 22, 2012

Hive Hbase integration/ Hive HbaseHandler : Common issues and resolution

It is common that trying out hive hbase integration leads to lots of unexpected errors, even though hive and hbase run without any issues individually. Some common issues are

1) jars not available
The following jars should be available on hive auxpath
  1. /usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar
  2. /usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar
  3. /usr/lib/hive/lib/zookeeper-3.3.1.jar
These jars vary with the hbase, hive and zookeeper versions running on your cluster

2) zookeeper quorum not available for hive client
You should specify the ZooKeeper server hostnames so that the active hbase master can be located for the hbase connection:

hbase.zookeeper.quorum=zk1,zk2,zk3

Where zk1,zk2,zk3 should be the actual hostnames of the ZooKeeper servers

These values can be set either on your hive session or permanently on your hive configuration files

1) Setting in hive client session
$ hive -auxpath /usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u2.jar:/usr/lib/hive/lib/hbase-0.90.4-cdh3u2.jar:/usr/lib/hive/lib/zookeeper-3.3.1.jar -hiveconf hbase.zookeeper.quorum=zk1,zk2,zk3

2) Setting in hive-site.xml
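To make the setting permanent, a hive-site.xml entry along these lines should work (the hostnames are placeholders for your actual ZooKeeper servers):

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1,zk2,zk3</value>
</property>
```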


Still hive hbase integration not working?
Error thrown: Master not running

Check whether the HBase Master is really down. If it is fine, there could be other possibilities, a few common ones being
  • Firstly, check whether hbase.zookeeper.quorum is correctly set; it shouldn't be localhost.
  • If multiple hbase clusters share the same zookeeper quorum, the znode parent value will be different for each. If that is the case, 'zookeeper.znode.parent' also has to be set in the hive configuration, to the correct value from hbase-site.xml.

Are compression codecs required on client nodes?

Sometimes, even with compression codecs available across all nodes in the cluster, we see some of our jobs throwing 'class not found' errors for compression codecs.

Even though compression/decompression is done by the task trackers, in certain cases the compression codecs are also required on the client nodes.

Some scenarios are
 Total Order Partitioner
          Before triggering the mapreduce job, the job needs an understanding of the range of keys; only then can it decide which range of keys should go to which reducer. We need this information before the map tasks start, so the client initially takes a random sample across the input data (the seek could be something like: read the first 20 MB, skip the next 200 MB, read the next 20 MB, and so on).

 Hive and Pig
For better optimization of jobs, uniform distribution of data across reducers, determining the number of reducers etc., hive and pig actually do a quick seek over input data samples.

In both cases, a sample of the input data is read on the client side before the MR tasks run, so if the data is compressed, the compression codec needs to be available on the client node as well.
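As a point of reference, the codecs a node knows about come from the io.compression.codecs property in core-site.xml; a client missing an entry there, or the corresponding jar (e.g. the hadoop-lzo jar for LzoCodec), is a common cause of the class-not-found error. A typical value looks something like:

```xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec</value>
</property>
```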

-libjars not working in custom mapreduce code: how to debug

Mostly application developers bump into this issue: they ship custom jars with their map reduce job, but when the classes in those jars are referenced by the code it throws a ClassNotFoundException.

For -libjars to work your main class should satisfy the following two conditions.
1) Main Class should implement the Tool interface

 //wrong usage - Tool Interface not implemented
public class WordCount extends Configured {

//right usage
public class WordCount extends Configured implements Tool {
2) Main Class should get the existing configuration using the getConf() method rather than creating a new Configuration instance.

//wrong usage - creating a new instance of Configuration
public int run(String[] args) throws Exception {
   Configuration conf = new Configuration();
//right usage 
 public int run(String[] args) throws Exception {
    Configuration conf = getConf();

How to recover deleted files from hdfs/ Enable trash in hdfs

If you enable trash in hdfs, when an rmr is issued the file will still be available in trash for some period, so you can recover accidentally deleted ones. To enable hdfs trash,
set fs.trash.interval to a value greater than 0

This specifies how long a deleted file remains available in trash. There is another property, fs.trash.checkpoint.interval, that specifies the checkpoint interval: the NameNode checks the trash dir at every checkpoint interval and deletes all files older than fs.trash.interval. Say you have fs.trash.interval as 60 mins and fs.trash.checkpoint.interval as 30 mins; then every 30 mins a check is performed, deleting all files more than 60 mins old.

fs.trash.checkpoint.interval should be equal to or less than fs.trash.interval

The value of fs.trash.interval  is specified in minutes.

fs.trash.interval should be enabled on the client node as well as the NameNode. On the NameNode it is needed for checkpointing purposes; based on the value on the client node, it is decided whether an rmr issued from the client removes a file completely from hdfs or moves it to trash.
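Putting the example values above into configuration, the core-site.xml entries would look something like this (both values are in minutes):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>60</value>
</property>
<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>30</value>
</property>
```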

The trash dir by default is /user/<username>/.Trash