Hive is a data warehousing tool built on top of Hadoop. The data behind Hive tables is stored as delimited files in HDFS. Since Hive is used for data warehousing, the data in a production system's tables can easily run to hundreds of gigabytes, so the natural question is how to store it efficiently; it definitely has to be compressed. That raises a few more questions: how do we store compressed data in Hive, and how do we process and retrieve that compressed data using Hive QL?
Let's look into these; it is fairly simple if you know Hive. Before you run your queries you need to enable a few parameters for dealing with compressed tables. They are the same compression enablers you use when you play around with plain MapReduce, along with a Hive parameter (a sample session follows the list):
· hive.exec.compress.output=true
· mapred.output.compress=true
· mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
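In practice you just issue these as SET statements at the top of your Hive session (or put them in your site configuration). A minimal sketch of such a session, assuming nothing beyond the three parameters listed above:

SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
-- Every query that writes table data from here on produces LZO-compressed files in HDFS.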
Here I have used LZO as my compression codec in HDFS, hence the LzopCodec. Beyond setting these parameters you don't need to do anything else; use Hive QL exactly as you would with uncompressed data. I have tried the same successfully with dynamic partitions, buckets, etc., and it works like any normal Hive operation.
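To illustrate the dynamic partition case, here is a hedged sketch; the table and column names (web_logs, raw_web_logs, log_date, etc.) are invented for the example, and the two dynamic-partition flags are the standard Hive ones, not anything compression-specific:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With the compression parameters already set in the session, the files
-- written under each partition directory come out LZO-compressed.
INSERT OVERWRITE TABLE web_logs PARTITION (log_date)
SELECT ip, url, log_date
FROM raw_web_logs;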
In my case the input data from conventional sources was plain text; this raw data was loaded into a staging table. From the staging table, the cleansed data was loaded into the actual Hive tables with some Hive QL. The staging table gets flushed every time the data is loaded into the target Hive table.
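A rough sketch of that flow; the paths, table names, and cleansing expressions here (stg_events, events, user_id, and so on) are placeholders for illustration, not the actual schema:

-- 1. Each batch overwrites the staging table, which also flushes the previous load.
LOAD DATA INPATH '/landing/events/batch_001' OVERWRITE INTO TABLE stg_events;

-- 2. Cleanse and load into the target table; with compression enabled in the
--    session, the files written by this insert are LZO-compressed.
INSERT INTO TABLE events
SELECT trim(user_id), lower(event_type), event_ts
FROM stg_events
WHERE user_id IS NOT NULL;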
This was a sweet and simple introduction to compression in Hive. I suggest anyone who is contemplating compression read Chapter 11 of the book 'Programming Hive'. Link to the book: http://shop.oreilly.com/product/0636920023555.do