Tuesday, August 21, 2012

Performance tuning of Hive queries


Hive performance optimization is a large topic in its own right and is very specific to the queries you are running. In fact, each query in a query file needs separate performance tuning to get the best results.

I'll try to list a few approaches that are generally used for performance optimization.

Limit the data flow down the queries
When you run a Hive query, the volume of data that flows down to each level is the factor that decides performance. So if you are executing a script that contains a sequence of Hive QL statements, make sure that data filtering happens in the first few stages rather than carrying unwanted data down to the bottom. This gives you significant performance gains, as the queries further down the line have much less data to crunch on.

This is a common bottleneck when existing SQL jobs are ported to Hive: we just try to execute the same sequence of SQL steps in Hive as well, which hurts performance. Understand the requirement behind the existing SQL script and design your Hive job with the data flow in mind.
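As a sketch of what this means in practice (the tables and columns here are made up for illustration), apply predicates in the first stage of a multi-step script instead of at the end:

```sql
-- Inefficient: the full orders table is carried through the join,
-- with the date filter applied only in the final step.
CREATE TABLE joined AS
SELECT o.customer_id, o.amount, o.order_date, c.region
FROM orders o JOIN customers c ON (o.customer_id = c.id);

SELECT region, SUM(amount) FROM joined
WHERE order_date >= '2012-01-01' GROUP BY region;

-- Better: filter first, so every later stage crunches less data.
CREATE TABLE orders_2012 AS
SELECT customer_id, amount FROM orders
WHERE order_date >= '2012-01-01';

SELECT c.region, SUM(o.amount)
FROM orders_2012 o JOIN customers c ON (o.customer_id = c.id)
GROUP BY c.region;
```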

Use hive merge files
Hive queries are parsed into map-only and map-reduce jobs, and a Hive script will contain lots of queries. Assume one of your queries is parsed into a map-reduce job and the output files from that job are very small, say 10 MB. In such a case, the subsequent query that consumes this data may generate a large number of map tasks and would be inefficient. If more jobs run on the same data set, then all of those jobs become inefficient. In such scenarios, if you enable merge files in Hive, the first query runs a merge job at the end, thereby merging the small files into larger ones. This is controlled using the following parameters:

hive.merge.mapredfiles=true
hive.merge.mapfiles=true (true by default in hive)

For more control over merge files you can tweak these properties as well:
hive.merge.size.per.task (the maximum size of a file after the merge task)
hive.merge.smallfiles.avgsize (the merge job is triggered only if the average output file size is less than the specified value)

The default values for the above properties are
hive.merge.size.per.task=256000000
hive.merge.smallfiles.avgsize=16000000

When you enable merge, an extra map-only job is triggered. Whether this job gets you an optimization or an overhead depends entirely on your use case and your queries.
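Put together, here is how this might look in a script (the final table and query are hypothetical); the merge settings usually go at the top:

```sql
-- Merge small output files from both map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Aim for files of about 256 MB after the merge, and only run the
-- extra merge job when the average output file is under 16 MB.
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;

-- A query whose output would otherwise be many small files now
-- triggers a map-only merge job at the end.
INSERT OVERWRITE TABLE daily_clicks
SELECT dt, COUNT(*) FROM clicks GROUP BY dt;
```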

Join Optimizations
Joins are very expensive. Avoid them if possible. If a join is required, try to use join optimizations such as map joins, bucketed map joins, etc.
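As an illustration (table names are made up), a map join can be requested explicitly with a hint, or Hive can be told to convert joins automatically when one side is small enough to fit in memory:

```sql
-- Explicit map join hint: the small customers table is loaded into
-- memory on each mapper, avoiding the reduce-side shuffle entirely.
SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
FROM orders o
JOIN customers c ON (o.customer_id = c.id);

-- Or let Hive decide: convert to a map join automatically when
-- one side of the join is below the small-table size threshold.
SET hive.auto.convert.join=true;
```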


There is still more left to cover on Hive query performance optimization; take this post as a baby step. More will be added to this post soon. :)

Comments:

  1. If joins are expensive, then can Hive be used without joins?
  2. Hi,

    Is it possible to trigger a Hive query on a remote HDFS using the MapReduce framework?

    My requirement is to trigger a Hive query on a remote Hive, and I don't want to use HiveServer2 for that. Can we do the same task using mapred?