In the word count example (word-count-hadoop-example) we can see two sets of parallel processes, a map and a reduce. In the reduce process, what basically happens is an aggregation of values, or rather an operation on values that share the same key. Now consider a scenario where we just have a parallel process to be performed and no aggregation is required. Here we go in for the map operation alone and no reduce, i.e. we'd explicitly set the number of reducers to zero.
Why set reducers to Zero?
We know that between the map and reduce phases there is a key phase, the sort and shuffle. Sort and shuffle is responsible for sorting the keys in ascending order and then grouping values that share the same key. This phase is really expensive, and if the reduce phase is not required we should avoid it, since avoiding the reduce phase eliminates the sort and shuffle phase as well. In order to eliminate the reduce phase we have to explicitly set mapred.reduce.tasks=0, because by default there will be a value set for your cluster by your admin in conf/mapred-site.xml.
An example Business scenario
This is just a mock scenario to illustrate one possibility of using Hadoop for a purely parallel processing task. It needn't be an eligible use case for Hadoop.
Problem Statement
We have a database dump (extract) that contains the following details for each employee: Employee ID, Employee Name, Location ID, Grade ID, Monthly Pay, Days Worked in a Month and Number of Holidays in that Month. We need to calculate the monthly salary based on the number of days worked by each employee. (Just imagine that the file size is huge.)
Proposed Solution
Since we need to calculate the salary for such a large data set, and the records are independent chunks (i.e. we can process the file line by line; the data on one line is an independent chunk), we can go for Hadoop.
(Some basic calculations are used here.)
Input File - Employee.txt
A two-line sample of the input file (db extract) is given below.
10007~James Wilson~L105~G06~110000~22~8
10100~Roger Williams~L103~G09~145000~20~8
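To make the calculation concrete: for the second sample record, days worked plus holidays is 20 + 8 = 28, so the calculated salary is (145000 / 30) * 28 ≈ 135333.33, which the rounding step in the mapper below bumps up to 135333.34. For the first record 22 + 8 = 30, so the full monthly pay of 110000 is emitted as is.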
Mapper Class - SalaryCalculatorMapper.java
package com.bejoy.hadoop.samples.salcal;
import java.io.IOException;
import java.math.BigDecimal;
import java.util.Scanner;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class SalaryCalculatorMapper extends Mapper<LongWritable, Text, Text, Text>
{
    //Variables for Map Reduce
    private final static Text dummyVal = new Text("");
    private Text word = new Text();

    //Variables for business calculations
    private String employeeId, employeeName, locationId, gradeId;
    private Double monthlyPay, daysWorked, monthlyHolidays, currentMonthPay;

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        /* extracting each line, sending it to the processRecord method and
           retrieving the result back for sending to the output collector */
        String output = processRecord(value.toString());
        //setting the resulting String into the Text object
        word.set(output);
        //sending the key-value pair to the output collector
        context.write(word, dummyVal);
    }

    public String processRecord(String record)
    {
        StringBuilder outputRecord = new StringBuilder("");
        try {
            Scanner scanner = new Scanner(record);
            //setting the delimiter used in the input file
            scanner.useDelimiter("~");
            if (scanner.hasNext())
            {
                employeeId = scanner.next();
                employeeName = scanner.next();
                locationId = scanner.next();
                gradeId = scanner.next();
                monthlyPay = Double.parseDouble(scanner.next());
                daysWorked = Double.parseDouble(scanner.next());
                monthlyHolidays = Double.parseDouble(scanner.next());

                //initializing the calculated salary as 0
                currentMonthPay = 0.0;

                //salary computations
                if ((daysWorked + monthlyHolidays) != 30)
                    currentMonthPay = (monthlyPay / 30) * (daysWorked + monthlyHolidays);
                else
                    currentMonthPay = monthlyPay;

                //rounding using BigDecimal
                BigDecimal bd = new BigDecimal(currentMonthPay);
                bd = bd.setScale(2, BigDecimal.ROUND_UP); //2 - decimal places
                currentMonthPay = bd.doubleValue();

                //building output in the format Employee Id~Employee Name~Monthly Pay~Days Worked~Calculated Salary
                outputRecord.append(employeeId);
                outputRecord.append("~");
                outputRecord.append(employeeName);
                outputRecord.append("~");
                outputRecord.append(monthlyPay.toString());
                outputRecord.append("~");
                outputRecord.append(daysWorked.toString());
                outputRecord.append("~");
                outputRecord.append(currentMonthPay.toString());
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }
        return outputRecord.toString();
    }
}
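If you want to sanity-check the parsing and the salary math without running a full job, a minimal local test of processRecord could look like the sketch below; the class name and the sample line are illustrative only, and this class is not part of the job itself.

package com.bejoy.hadoop.samples.salcal;

public class SalaryCalculatorMapperTest
{
    public static void main(String[] args)
    {
        SalaryCalculatorMapper mapper = new SalaryCalculatorMapper();
        //feeding one sample record from Employee.txt straight into processRecord
        String output = mapper.processRecord("10100~Roger Williams~L103~G09~145000~20~8");
        //expected format: Employee Id~Employee Name~Monthly Pay~Days Worked~Calculated Salary
        System.out.println(output);
    }
}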
Reducer Class – SalaryCalculatorReducer.java
package com.bejoy.hadoop.samples.salcal;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
public class SalaryCalculatorReducer extends Reducer<Text, Text, Text, Text>
{
    //Reduce method just outputs the key from the mapper, as the value from the mapper is an empty string
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
    {
        context.write(key, new Text(""));
    }
}
Driver Class – SalaryCalculator.java
package com.bejoy.hadoop.samples.salcal;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class SalaryCalculator extends Configured implements Tool
{
    public int run(String[] args) throws Exception
    {
        //getting the configuration object and setting the job name
        Configuration conf = getConf();
        Job job = new Job(conf, "Salary Calculator Demo");

        //setting the class names
        job.setJarByClass(SalaryCalculator.class);
        job.setMapperClass(SalaryCalculatorMapper.class);
        job.setReducerClass(SalaryCalculatorReducer.class);

        //setting the output data type classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        //to accept the hdfs input and output dirs at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SalaryCalculator(), args);
        System.exit(res);
    }
}
How to run the MapReduce job?
Follow these steps to run the job:
1. Pack these three classes into a jar (salarycalc.jar) and copy it to some location in LFS (/home/bejoys/samples/)
2. Copy the input file into some location in LFS (/home/bejoys/samples/employee.txt)
3. Copy the input into HDFS
hadoop fs -copyFromLocal /home/bejoys/samples/employee.txt /userdata/bejoys/samples/salcalc/input/
4. Run the jar
hadoop jar /home/bejoys/samples/salarycalc.jar com.bejoy.hadoop.samples.salcal.SalaryCalculator /userdata/bejoys/samples/salcalc/input/ /userdata/bejoys/samples/salcalc/output/
5. Retrieve the output onto LFS
hadoop fs -getmerge /userdata/bejoys/samples/salcalc/output /home/bejoys/samples/salcal_output.txt
In the commands above, paths like /home/bejoys/samples/... are locations in LFS (the local file system), while paths like /userdata/bejoys/samples/... are locations in HDFS.
Do we need a reducer here?
No, we don't. In fact you don't even need the SalaryCalculatorReducer class, but merely removing this class and not setting a reducer class on the Job object won't do the job (just commenting out the line job.setReducerClass(SalaryCalculatorReducer.class) won't serve the purpose). Even if you don't set a reducer class, a default identity reducer is run by the MapReduce framework, with output key and value types matching those of your mapper.
How can we avoid the reduce operation?
Just set the number of reduce tasks to zero. You can do it in two ways (a sketch of the resulting map-only driver follows this list):
1. Alter your driver class to add this line of code:
job.setNumReduceTasks(0);
2. Altering the code and then repacking it as a jar is a little time consuming, so we can do the same on the command line at run time; since the driver uses ToolRunner, generic options such as -D are parsed before run() is called. Alter your hadoop jar command to include the number of reducers as well:
hadoop jar /home/bejoys/samples/salarycalc.jar com.bejoy.hadoop.samples.salcal.SalaryCalculator -D mapred.reduce.tasks=0 /userdata/bejoys/samples/salcalc/input/ /userdata/bejoys/samples/salcalc/output/
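Putting both pieces together, a map-only version of the driver's run() method could look like the sketch below; only the reducer-related lines change, everything else stays as in SalaryCalculator above.

public int run(String[] args) throws Exception
{
    Configuration conf = getConf();
    Job job = new Job(conf, "Salary Calculator Demo");

    job.setJarByClass(SalaryCalculator.class);
    job.setMapperClass(SalaryCalculatorMapper.class);
    //no setReducerClass call; the reduce phase is switched off entirely
    job.setNumReduceTasks(0);

    //with zero reducers these describe the mapper's output types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
}

With zero reduce tasks the mapper output is written straight to HDFS, so the output files show up as part-m-00000, part-m-00001 and so on instead of the usual part-r-* files.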
Do I have to specify an empty text as the mapper/reducer value?
No, you don't need to. You can use NullWritable instead. Refer to the post on Hadoop for processing dependent data splits.
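As a rough sketch of that variant (the class name SalaryCalculatorNullValueMapper is made up for illustration), the mapper could emit NullWritable instead of an empty Text, with the driver changed to job.setMapperClass(SalaryCalculatorNullValueMapper.class) and job.setOutputValueClass(NullWritable.class):

package com.bejoy.hadoop.samples.salcal;

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class SalaryCalculatorNullValueMapper extends Mapper<LongWritable, Text, Text, NullWritable>
{
    private Text word = new Text();
    //reusing the original mapper only for its public processRecord(...) helper
    private SalaryCalculatorMapper delegate = new SalaryCalculatorMapper();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        word.set(delegate.processRecord(value.toString()));
        //NullWritable carries no payload, so nothing extra is serialized for the value
        context.write(word, NullWritable.get());
    }
}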