Friday, April 29, 2011

Word Count - Hadoop Map Reduce Example

                Word count is the typical example with which Hadoop MapReduce developers get their hands dirty. This sample MapReduce job counts the number of occurrences of each word in the provided input files.

What are the minimum requirements?
1.       Input text files – any text file
2.       Cloudera test VM
3.       The mapper, reducer and driver classes to process the input files

How it works
                The word count operation takes place in two stages, a mapper phase and a reducer phase. In the mapper phase the text is first tokenized into words, then we form a key value pair with these words, the key being the word itself and the value ‘1’. For example consider the sentence
“tring tring the phone rings”
In the map phase the sentence would be split into words, forming the initial key value pairs as
<tring,1> , <tring,1> , <the,1> , <phone,1> , <rings,1>
In the reduce phase the keys are grouped together and the values for identical keys are added. Here there is only one repeated key, ‘tring’, so its values would be added and the output key value pairs would be
<tring,2> , <the,1> , <phone,1> , <rings,1>
This gives the number of occurrences of each word in the input. Thus reduce forms an aggregation phase for the keys.
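Stripped of all the Hadoop machinery, the two phases can be sketched in plain Java. This is only an illustration of the idea, not Hadoop code, and the class and variable names here are made up for the sketch:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    public static void main(String[] args) {
        String sentence = "tring tring the phone rings";

        // map phase: emit a <word, 1> pair for every token in the line
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : sentence.split("\\s+")) {
            pairs.add(new SimpleEntry<>(token, 1));
        }

        // reduce phase: group the pairs by key and add up the values
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(counts); // {tring=2, the=1, phone=1, rings=1}
    }
}
```

In the real job the grouping and summing are done by the framework and the reducer; this sketch just shows why ‘tring’ ends up with the count 2.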

The point to be noted here is that the mapper phase runs over the entire data set, splitting the words and forming the initial key value pairs, before any reduce() call starts. Say we have a total of 10 lines in our input files combined; first the 10 lines are tokenized and key value pairs are formed in parallel, and only after the map phase completes does the aggregation in the reducer begin. (The framework may start copying map output to the reducers while maps are still running, but no reduce() call is made until every mapper has finished.)

[Figure: data flow of the word count job through the map, sort/shuffle and reduce phases]

Now, coming to the practical side of the implementation, we need our input files and the MapReduce program jar to run the job. In a typical MapReduce program two methods do the key work, namely map and reduce; the main method triggers them. For convenience and readability it is better to keep the map, reduce and main methods in three different class files. Let's look at the three files we require to perform the word count job.

Word Count Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
      //hadoop supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      //map method that performs the tokenizer job and frames the initial key value pairs
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            //taking one line at a time and tokenizing the same
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            //iterating through all the words available in that line and forming the key value pairs
            while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  //sending to output collector which in turn passes the same to reducer
                  output.collect(word, one);
            }
      }
}

Diving into the details of this source code, we can see the usage of a few deprecated classes and interfaces; this is because the code was written to be compliant with Hadoop version 0.18. From Hadoop version 0.20 some of these classes are deprecated but still supported.

Let's now focus on the class definition part
implements Mapper<LongWritable, Text, Text, IntWritable>
What does this Mapper<LongWritable, Text, Text, IntWritable> stand for?
The data types provided here are Hadoop specific data types, designed for operational efficiency and suited to massively parallel, lightning fast read/write operations. All these data types are based on Java data types themselves; for example LongWritable is the equivalent of a Java long, IntWritable of an int and Text of a String.
When we use it as Mapper<LongWritable, Text, Text, IntWritable>, it refers to the data types of the input and output key value pairs of the mapper, or rather the map method, i.e. Mapper<Input Key Type, Input Value Type, Output Key Type, Output Value Type>. In our example the input to the mapper is a single line, so this Text (one input line) forms the input value. The input key is a long value assigned by default, the byte offset of the line within the input file. Our output from the mapper is of the format <Word, 1>, hence the data type of our output key value pair is <Text (String), IntWritable (int)>.

The next key component out here is the map method
map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
We'd now look at each of the input parameters in detail. The first and second parameters refer to the data types of the input key and value to the mapper. The third parameter is the output collector, which does the job of collecting the output data from the mapper (or reducer); with the output collector we specify the data types of the output key and value of the mapper. The fourth parameter, the reporter, is used to report the task status internally in the Hadoop environment to avoid time outs.

The functionality of the map method is as follows
1.       Create an IntWritable variable ‘one’ with the value 1
2.       Convert the input line from Text type to a String
3.       Use a tokenizer to split the line into words
4.       Iterate through each word and form key value pairs as follows
a.       Assign each word from the tokenizer (of String type) to the Text variable ‘word’
b.      Form a key value pair for each word as <word, one> and push it to the output collector
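Steps 3 and 4, minus the Hadoop types, behave like this plain-Java fragment (StringTokenizer splits on whitespace by default; the class and variable names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    public static void main(String[] args) {
        String line = "tring tring the phone rings";
        StringTokenizer tokenizer = new StringTokenizer(line);

        // every token becomes the key of a <word, 1> pair
        List<String> emitted = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            emitted.add("<" + tokenizer.nextToken() + ",1>");
        }
        System.out.println(emitted);
        // [<tring,1>, <tring,1>, <the,1>, <phone,1>, <rings,1>]
    }
}
```

In the real mapper, output.collect(word, one) plays the role of adding to the emitted list.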

Word Count Reducer

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
      //reduce method accepts the key value pairs from the mappers, does the aggregation based on keys and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            /*iterates through all the values available for a key and adds them together, giving the
            final result as the key and the sum of its values*/
            while (values.hasNext()) {
                  sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
      }
}

Here, like the mapper, the reducer implements
Reducer<Text, IntWritable, Text, IntWritable>
The first two refer to the data types of the input key and value to the reducer and the last two refer to the data types of the output key and value. Our mapper emits output as <apple,1> , <grapes,1> , <apple,1> etc. This is the input for the reducer, so here the data types of key and value in Java would be String and int; the equivalents in Hadoop are Text and IntWritable. Also, we get the output as <word, no of occurrences>, so the data type of the output key value pair is <Text, IntWritable>.

Now the key component here, the reduce method.
The input to the reduce method from the mapper, after the sort and shuffle phase, is a key with the list of values associated with it. For example, here we have multiple values for a single key from our mapper, like <apple,1> , <apple,1> , <apple,1> , <apple,1> . These key value pairs would be fed into the reducer as <apple, {1,1,1,1}>.
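The grouping done by sort and shuffle can be pictured in plain Java as turning the list of map output pairs into a map from each key to the list of its values. The sketch below (illustrative names, no Hadoop involved) then sums each list the same way the reduce method does:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ShuffleSketch {
    public static void main(String[] args) {
        // mapper output: one <word, 1> pair per occurrence
        String[] mapOutputKeys = {"apple", "grapes", "apple", "apple", "apple"};

        // shuffle: collect all values that share a key into one list
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        System.out.println(grouped); // {apple=[1, 1, 1, 1], grapes=[1]}

        // reduce: iterate over each key's values and sum them
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}
```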
Now let us evaluate our reduce method
reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
Here all the input parameters hold the same roles as in the mapper; the only difference is with the input key and value. As mentioned earlier, the input to a reducer instance is a key and a list of values, hence ‘Text key, Iterator<IntWritable> values’. The next parameter denotes the output collector of the reducer, with the data types of the output key and value.

The functionality of the reduce method is as follows
1.       Initialize a variable ‘sum’ as 0
2.       Iterate through all the values with respect to a key and sum them up
3.       Push to the output collector the Key and the obtained sum as value

Driver Class
The last class file is the driver class. The driver class is responsible for triggering the MapReduce job in Hadoop; it is in this class that we provide the name of our job, the data types of the output key and value, and the mapper and reducer classes. The source code is as follows

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount extends Configured implements Tool{
      public int run(String[] args) throws Exception {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCount.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);

            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
            return 0;
      }

      public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
      }
}

Create all three Java files in your project. At this point you'd be having compilation errors; just get the latest release of Hadoop and add its jars to your class path. Once free of compilation errors, we have to package the classes into a jar. If you are using Eclipse, right click on the project and use the export utility. While packaging the jar it is better not to specify the main class, because in the future, when you have multiple MapReduce jobs and multiple drivers in the same project, we should leave the option of choosing the main class file at run time through the command line.

Follow the steps to execute the job
1.       Copy the jar to a location in LFS (/home/training/usecase/wordcount/wordcount.jar)
2.       Copy the input files from windows to LFS(/home/training/usecase/wordcount/input/)
3.       Create an input directory in HDFS
hadoop fs -mkdir /projects/wordcount/input/
4.       Copy the input files from LFS to HDFS
hadoop fs -copyFromLocal /home/training/usecase/wordcount/input/* /projects/wordcount/input/
5.       Execute the jar
hadoop jar /home/training/usecase/wordcount/wordcount.jar com.bejoy.samples.wordcount.WordCount /projects/wordcount/input/ /projects/wordcount/output/

We’d just look at the command in detail with each parameter
/home/training/usecase/wordcount/wordcount.jar -> full path of the jar file in LFS
com.bejoy.samples.wordcount.WordCount  -> full package name of the Driver Class
/projects/wordcount/input/  -> input files location in HDFS
/projects/wordcount/output/  -> a directory in HDFS where we need the output files

NOTE: In Hadoop the MapReduce process creates the output directory in HDFS and stores the output files there. If the output directory already exists in HDFS the job won't execute; in that case either change the output directory or delete the existing output directory in HDFS before running the jar again.
6.       Once the job shows a success status we can see the output file in the output directory(part-00000)
hadoop fs -ls /projects/wordcount/output/
7.       For any further investigation of output file we can retrieve the data from hdfs to LFS and from there to the desired location
hadoop fs -copyToLocal /projects/wordcount/output/ /home/training/usecase/wordcount/output/

Some better practices
                In our current example we are not specifying the number of reducers, either in the configuration parameters or at runtime. By default a MapReduce job runs with a single reducer, so only one reducer instance processes the result set from all the mappers; the greater the load on that single reducer, the slower the whole process. We are not exploiting parallelism here; to exploit it we have to set the number of reducers explicitly (on the command line at runtime, or with conf.setNumReduceTasks() in the driver). At runtime we can specify the number of reducers as
hadoop jar /home/training/usecase/wordcount/wordcount.jar com.bejoy.samples.wordcount.WordCount -D mapred.reduce.tasks=15 /projects/wordcount/input/ /projects/wordcount/output/

The key point to note here is that the number of output files equals the number of reducers used, as every reducer produces its own output file. All these output files will be available in the HDFS output directory we assigned in the run command. It would be a cumbersome job to combine all these files manually to obtain the result set; for that Hadoop provides the getmerge command

hadoop fs -getmerge /projects/wordcount/output/ /home/training/usecase/wordcount/output/WordCount.txt

This command combines the contents of all the files available directly within the /projects/wordcount/output/ HDFS directory and writes them to the /home/training/usecase/wordcount/output/WordCount.txt file in the LFS.

You can find the working copy of the word count implementation with hadoop 0.20 API at the following location word count example with hadoop 0.20


  1. Nice example with details, Please add the new api example if possible.

    1. For the latest api, a working example with complete source code and explanation can be found at

  2. Hi Ratan

    You can find the sample code for mapreduce API @

  3. Thanks a lot for this article. It is really a kick starter.

  4. Can u please explain about how the input file is specified for the mapper and who sends line by line to mapper function?

  5. Hi Arockiaraj

    In a mapreduce program, the JobTracker assigns input splits to each map task based on factors like data locality, slot availability etc. A map task actually processes a certain HDFS block. If you have a large file that comprises 10 blocks, and if your mapred split properties match the HDFS block size, then you'll have 10 map tasks processing 1 block each.

    Now, once the mapper has its own share of the input, based on the input format and certain other properties, it is the RecordReader that reads record by record and gives the records as input to each execution of the map() method. In the default TextInputFormat the record reader reads up to a newline character for each record.

  6. How is default number of reducers chosen by mapreduce framework ? Is it according to data load or any configured property ?

  7. hi ,
    what is the type of KEYIN ???what do we call it?? datatype,class,interface etc???

  8. public class Mapper what does KEYIN mean ? i have searched in source code but unable to find declaration of KEYIn

  9. Hi Hemanth

    By KEYIN , I'm assuming you are referring to input key in mapper.
    Here I'm using the default TextInputFormat and for that the default Key is LongWritable, which is an offset value from beginning of the file.
    KEYIN is a subclass of Writable.

  10. Hello,

    Please help me to understand what is BigData and purpose with example.


  11. Default reducers will be 1, but you can still change it based on your requirement.

  12. Hello,
    Thanks a lot for a clear overview. I have a question - What happened if I wish to output the result from a reducer to lets say two different files with some logic related to that. Something like - mapper is reading, reducer accepts those reads, generate two different lists and write those lists into two different outputs/files - one for each list.
    Thanks a lot

  13. Check out the visual explanation I made

  14. Nice article. I need to find out how one can extend this example to doing Word Count on an xml file.

  15. this is awesome; thanks for helping the community.

  16. very nice tutorial,, i found it very useful thank you,

  17. I tried the code , it works for text file for both inside and outside and inside HDFS . Is there any difference in term of speed and architecture . Please assist me ? Thanks.

  19. Thank a lot.It is really a kick starter.

  20. Hello Dude
    I am Fresher in Hadoop. What about Future Vacanies for Hadoop Technology? Reply Must

  22. This is really very nice tutorial to have the basic understanding of map reduce function. Thanks a lot.

  23. Very good document for reference for a newbie in hadoop world. Counts words using unix scripts are not fun any more :P

    Expecting more and more illustrative examples.

  27. Nice Explanation,Excellent details,solve some doubts,thanks.
    Keep it up. :)

  31. You didn't explain the driver class properly. I'm surprised no one else has said anything about it. Please add some more information about that.

  32. Please explain the run method used in Driver class, How is the flow ?

  33. This is what I am looking for. Thanks a lot.


  35. Great article! Map-Reduce has served a great purpose, though: many, many companies, research labs and individuals are successfully bringing Map-Reduce to bear on problems to which it is suited: brute-force processing with an optional aggregation. But more important in the longer term, to my mind, is the way that Map-Reduce provided the justification for re-evaluating the ways in which large-scale data processing platforms are built (and purchased!).

  37. hello
    Very nice explanation thanks
    I am new to map reduce programming. How to calculate total count of words. I want output as sum/total_count.

    here in this example total sum is 12 so for each key we divide it with respective sum of that word.


    for apple 4/12
    for mango 2/12

    so in output i want like this
    apple 0.33
    mango 0.16

    could any one tell me how should i achieve this i am really struggling a lot.


  41. hello plz assist me how to print a particular word in op file

  55. Is there a way of doing this without using imports? Have to use IO inputs.

  60. Thanks for the informative topic, I need to read 3 different files in a single mapper and read the related data in all 3 different files and combine them in one single file, ho can I do this.

  68. Hi,

    I am just learning about Hadoop.
    I found that all the examples of MapReduce is about counting words. Is that the only purpose of MapReduce?


  76. Nice article. However, I am stuck at the last step.

    On executing the jar, I get the following error message :
    Exception in thread "main" java.lang.ClassNotFoundException: com.bejoy.samples.wordcount.WordCount
    at java.lang.ClassLoader.loadClass(
    at java.lang.ClassLoader.loadClass(
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(
    at org.apache.hadoop.util.RunJar.main(

    Kindly help me out.
