Kick Start Hadoop: Map Reduce - Some best practices

Tuesday, May 24, 2011

1. Use larger HDFS blocks for better performance

With smaller HDFS blocks, more time is spent seeking on disk relative to the time spent reading data. This overhead becomes significant when dealing with large files.
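
To make the effect concrete, the block count for a file is simply its size divided by the block size, rounded up. The sketch below is plain arithmetic under assumed sizes (a 10 GB file, 64 MB versus 256 MB blocks); `BlockCount` is a hypothetical helper and not part of Hadoop.

```java
// Illustrative arithmetic only; BlockCount is a hypothetical helper, not a Hadoop class.
final class BlockCount {
    // Number of HDFS blocks needed to hold a file: ceil(fileBytes / blockBytes).
    static long blocksFor(long fileBytes, long blockBytes) {
        if (fileBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("fileBytes must be >= 0 and blockBytes > 0");
        }
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long file = 10L * 1024L * mb; // assumed 10 GB input file
        System.out.println("64 MB blocks:  " + blocksFor(file, 64 * mb));
        System.out.println("256 MB blocks: " + blocksFor(file, 256 * mb));
    }
}
```

Fewer, larger blocks mean fewer seeks per byte read, and with a typical file-based input format also fewer map tasks to schedule.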

2. Always use a Combiner where possible for local aggregation

Shuffle and sort is an expensive phase, so try to reduce the number of records that have to go through it.
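
The saving can be illustrated with a word count. The following is a plain-Java simulation of map-side aggregation, not Hadoop code; `CombinerSketch` and its methods are hypothetical names used only for this example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of local aggregation; CombinerSketch is hypothetical, not Hadoop API.
final class CombinerSketch {
    // Without a combiner, a word-count mapper emits one (word, 1) record per token.
    static int recordsWithoutCombiner(List<String> tokens) {
        return tokens.size();
    }

    // With a combiner, counts are summed per word on the map side, so only one
    // (word, partialCount) record per distinct word has to be shuffled.
    static Map<String, Integer> combine(List<String> tokens) {
        Map<String, Integer> partial = new HashMap<>();
        for (String token : tokens) {
            partial.merge(token, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapperInput = Arrays.asList("a", "b", "a", "c", "a", "b");
        System.out.println("records shuffled without combiner: "
                + recordsWithoutCombiner(mapperInput));
        System.out.println("records shuffled with combiner:    "
                + combine(mapperInput).size());
    }
}
```

This only works because summation is commutative and associative; an operation without those properties cannot safely be applied as a combiner.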

3. Use compression for intermediate output. LZO is advisable.

This is beneficial when a job has a large number of mappers and reducers: during the shuffle and sort from mappers to reducers, compressed data takes up far less network bandwidth.
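
As a rough illustration of how much repetitive map output can shrink, the sketch below compresses sample key/value text. LZO is not in the Java standard library, so this uses `java.util.zip.Deflater` as a stand-in; the codec, the sample data, and the class name `IntermediateCompressionSketch` are all assumptions for illustration. In Hadoop 1.x releases the real setting is normally controlled by the `mapred.compress.map.output` and `mapred.map.output.compression.codec` properties.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Stand-in demo: DEFLATE instead of LZO, synthetic data instead of real map output.
final class IntermediateCompressionSketch {
    // Compresses a byte array with DEFLATE at the fastest setting.
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Builds synthetic "word<TAB>1" lines resembling word-count map output.
    static byte[] sampleMapOutput(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("word").append(i % 50).append("\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] compressed = compress(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The actual ratio depends on the data and the codec; LZO typically trades some compression ratio for speed, which is the usual reason for preferring it on intermediate data.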

4. Use the Distributed Cache for smaller files only

As the file size increases, it can lead to memory constraints and thereby to performance degradation. In mid-sized clusters, use the distributed cache only for files smaller than 100 MB.

5. Set the number of reducers to zero

For jobs that only exploit the massive parallelism of Hadoop and need no reduce-side aggregation, explicitly set the number of reducers to zero. Shuffle and sort is an expensive process, and skipping the reduce phase avoids that step as well.

6. Number of mappers

Choose a number of mappers much higher than the number of nodes available in your cluster. Assigning 10 to 100 mappers per node is a reasonable range, and it can be higher in many practical cases.
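
One way to work with this guideline is to start from a target number of mappers per node and derive the split size that would produce it. The sketch below is arithmetic only, under assumed values (20 nodes, 10 mappers per node, 100 GB of input); `MapperSizing` is a hypothetical helper, not a Hadoop class.

```java
// Illustrative arithmetic only; MapperSizing is a hypothetical helper, not a Hadoop class.
final class MapperSizing {
    // Total mappers for the job, given a per-node target.
    static long totalMappers(int nodes, int mappersPerNode) {
        return (long) nodes * mappersPerNode;
    }

    // Split size (in bytes) that yields the target mapper count: ceil(input / mappers).
    static long splitBytesFor(long inputBytes, int nodes, int mappersPerNode) {
        long mappers = totalMappers(nodes, mappersPerNode);
        if (mappers <= 0) {
            throw new IllegalArgumentException("nodes and mappersPerNode must be positive");
        }
        return (inputBytes + mappers - 1) / mappers;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        long input = 100L * gb; // assumed 100 GB of input
        long split = splitBytesFor(input, 20, 10);
        System.out.println("mappers: " + totalMappers(20, 10));
        System.out.println("split size in MB: " + split / (1024L * 1024L));
    }
}
```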

7. Number of reducers

Set the number of reducers slightly lower than the number of available reduce slots in the cluster, so that all reducers can run in a single wave. This ensures better utilization of the cluster and thereby a performance advantage.

Number of reduce slots in cluster = number of nodes in cluster * value of mapred.tasktracker.reduce.tasks.maximum
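
The formula above can be sketched as follows. The cluster size, the slots per node, and the 95 percent factor used for "slightly less" are assumptions chosen for illustration; `ReducerCount` is a hypothetical helper, not a Hadoop class.

```java
// Illustrative arithmetic only; ReducerCount is a hypothetical helper, not a Hadoop class.
final class ReducerCount {
    // Reduce slots in the cluster = nodes * reduce slots per node.
    static int reduceSlots(int nodes, int reduceSlotsPerNode) {
        return nodes * reduceSlotsPerNode;
    }

    // A reducer count slightly below the slot count, expressed as an integer
    // percentage to avoid floating-point rounding. Never returns less than 1.
    static int suggestedReducers(int nodes, int reduceSlotsPerNode, int percentOfSlots) {
        int slots = reduceSlots(nodes, reduceSlotsPerNode);
        return Math.max(1, slots * percentOfSlots / 100);
    }

    public static void main(String[] args) {
        int nodes = 10;        // assumed cluster size
        int slotsPerNode = 2;  // assumed per-node reduce slot setting
        System.out.println("reduce slots:       " + reduceSlots(nodes, slotsPerNode));
        System.out.println("suggested reducers: " + suggestedReducers(nodes, slotsPerNode, 95));
    }
}
```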

8. Make mappers process an optimal amount of data

Depending on the use case, ensure that each mapper processes an optimal amount of data, neither too small nor too large. If the chunks are too small, some mappers may have to wait until others run to completion. If the chunks are too large, re-executing a mapper after a task failure becomes highly expensive. If file sizes are smaller than the HDFS block size, combining multiple files and feeding them to a single mapper improves performance considerably.
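
The idea of combining small files can be sketched as a greedy grouping of file sizes into splits that stay within a size limit. Hadoop ships `CombineFileInputFormat` for this purpose; the code below does not use it and only illustrates the grouping idea. `SmallFilePacker`, the file sizes, and the 64 MB limit are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Greedy grouping sketch; SmallFilePacker is hypothetical and not how Hadoop is implemented.
final class SmallFilePacker {
    // Packs file sizes, in order, into groups whose total stays within maxSplitSize.
    // A single file larger than maxSplitSize ends up in a group of its own.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed file sizes in MB, packed into splits of at most 64 MB.
        List<Long> sizesMb = Arrays.asList(10L, 20L, 30L, 40L, 50L);
        List<List<Long>> splits = pack(sizesMb, 64L);
        System.out.println("mappers without combining: " + sizesMb.size());
        System.out.println("mappers with combining:    " + splits.size());
        System.out.println("groups: " + splits);
    }
}
```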
