Tuesday, May 24, 2011
Map Reduce - Some best practices
1. User larger HDFS blocks for better performance
If smaller HDFS blocks are used more time would be spend for seeking records on disk. This is a massive overhead when we deal with large files.
2. Always use Combiner if possible for local aggregation
Shuffle and Sort is a really expensive process hence try reducing the no of records involved for the same.
3. Use compression for intermediate output. LZO is advisable
This would be beneficial when we have a large no of mapers and reducers and when we do the sort and shuffle from mapper to reducer compressed data would be taking only least bandwidth
4. Use Distributed cache for smaller files only.
When the file size increases it can lead to memory constrains and there by lead to performance degradation. In midsized clusters use the same for distributing files of sizes less than 100MB.
5. Set No of Reducers to Zero
For jobs that exploit just the massive parallism of Hadoop, explicitly set the no of reducers as zero. As the Sort and shuffle is really expensive process avoiding the reduce Phase in turn avoids this step as well.
6. No of mappers
Choose the number of mappers a value much higher than the no of nodes available in your cluster. It would be better to assign 10 to 100 mappers per node. This can be more in may practical cases.
7. No of Reducers
Set the number of reducers slightly less than the no of available reduce slots in the cluster. This would ensure better utilization of the cluster and there by a performance advantage.
No of reduce slots in cluster = No of nodes in cluster * value of mapred.tasktracker.reducetasks.maximum
8. Mappers process optimal amount of data
Based on the use case ensure the mappers process optimal amount of data not too small or not too large. If smaller parts of data there are chances that some mappers world have to wait till others run to completion. If huge chunks of data then, under a task failure the reexecution of the same would be highly expensive. If file sizes are less than the HDFS block size then combining multiple files and feeding then to mappers would improve the performance to a greater extent.