Monday, October 24, 2011

How to efficiently store data in Hive / store and retrieve compressed data in Hive

Hive is a data warehousing tool built on top of Hadoop. The data corresponding to Hive tables is stored as delimited files in HDFS. Since Hive is used for data warehousing, the data in a production system's Hive tables can easily run to hundreds of gigabytes. Naturally, the question arises: how can we store this data efficiently? It should certainly be compressed. That raises a few more questions: how can we store compressed data in Hive, and how can we process and retrieve compressed data using HiveQL?
                Let's look into these questions; it is fairly simple if you know Hive. Before you work with compressed tables, you need to enable a few parameters. They are the same compression enablers you use when playing around with MapReduce, along with a Hive-specific parameter:

·         hive.exec.compress.output=true
·         mapred.output.compress=true
·         mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
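For example, these can be set per session at the Hive CLI before running any queries (the mapred.* property names here match the older Hadoop API; newer releases expose equivalent mapreduce.* names). They can also be made permanent in hive-site.xml:

```sql
-- Enable compressed query output for this Hive session
SET hive.exec.compress.output=true;
SET mapred.output.compress=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```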

Here I have used LZO as my compression codec in HDFS, hence the LzopCodec. Beyond setting these parameters you don't need to do anything else; use HiveQL exactly as you would with uncompressed data. I have successfully tried the same with dynamic partitions, buckets, etc., and it works like any normal Hive operation.
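As a rough sketch of the dynamic-partition case, the table and column names below are hypothetical, but once the compression properties are set the statements are the same ones you would run against uncompressed data:

```sql
-- Enable dynamic partitioning for this session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical target table partitioned by date
CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sale_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- With hive.exec.compress.output=true, the partition files
-- written by this insert come out LZO-compressed
INSERT OVERWRITE TABLE sales PARTITION (sale_date)
SELECT id, amount, sale_date FROM raw_sales;
```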
               In my case, the input data from conventional sources was plain text, and this raw data was loaded into a staging table. From the staging table, the cleansed data was loaded into the actual Hive tables with some HiveQL. The staging table gets flushed every time its data is loaded into the target Hive table.
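The staging flow above could be sketched along these lines; the table names, paths, and the cleansing condition are all hypothetical placeholders (and TRUNCATE requires Hive 0.11 or later; on older versions you would drop and recreate the staging table instead):

```sql
-- Load raw text files into the staging table
LOAD DATA INPATH '/data/incoming/sales.txt' INTO TABLE staging_sales;

-- Cleanse and move the data into the compressed target table
INSERT INTO TABLE sales PARTITION (sale_date)
SELECT id, amount, sale_date
FROM staging_sales
WHERE amount IS NOT NULL;   -- example cleansing rule

-- Flush the staging table for the next load
TRUNCATE TABLE staging_sales;
```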


  1. This was a sweet and simple introduction to compression in Hive. I suggest anyone who is contemplating compression read chapter 11 of the book 'Hive Programming'. Link to the book :

  2. Hi,

    Nice introduction. I am in the US; what is the best way to contact you?



  6. Great article. It helps to explain complicated things to people who are not experts, and explains why this is important.