Wednesday, May 25, 2011

Word Count Example with Hadoop 0.20


For a detailed understanding of the working and control flow of this example, refer to the earlier word count post: http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html

Mapper Class - WordCountMapper.java

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
      //Hadoop-supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
      {
            //taking one line at a time and tokenizing it
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);

            //iterating through all the words in the line and forming the key-value pairs
            while (tokenizer.hasMoreTokens())
            {
                  word.set(tokenizer.nextToken());
                  //writing each (word, 1) pair to the context, which passes it on to the reducer
                  context.write(word, one);
            }
      }
}
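As a quick aside, here is a small stand-alone sketch (plain JDK only; the class name MapperSketch and the sample line are made up for illustration) of what the map method would emit for a single input line. Duplicate words such as "the" are emitted separately and are only added up later by the reducer.

import java.util.StringTokenizer;

public class MapperSketch
{
      public static void main(String[] args)
      {
            String line = "the quick brown fox jumps over the lazy dog";
            StringTokenizer tokenizer = new StringTokenizer(line);
            //each token is printed as a (word, 1) pair, mirroring context.write(word, one)
            while (tokenizer.hasMoreTokens())
            {
                  System.out.println(tokenizer.nextToken() + "\t1");
            }
      }
}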

Reducer Class - WordCountReducer.java

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
      //Reduce method that sums up the counts (the 1s emitted by the mapper) for each word
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
      {
            int sum = 0;
            /*iterates through all the values available for a key, adds them together and emits
            the key along with the sum of its values*/
            for (IntWritable value : values)
            {
                  sum += value.get();
            }
            context.write(key, new IntWritable(sum));
      }
}
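Since the reduce logic is plain addition (commutative and associative), the same class can also be reused as a combiner to pre-aggregate the counts on the map side. This is not done in the original driver below; if you want to try it, a single extra line in the run() method is enough (a sketch, using the standard Job API):

//optional: reuse the reducer as a combiner (not part of the original listing)
job.setCombinerClass(WordCountReducer.class);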

Driver Class - WordCount.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;



public class WordCount extends Configured implements Tool
{
      public int run(String[] args) throws Exception
      {
            //getting the configuration object and setting the job name
            Configuration conf = getConf();
            Job job = new Job(conf, "Word Count hadoop-0.20");

            //setting the class names
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);

            //setting the output data type classes
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            //accepting the HDFS input and output directories at run time
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception
      {
            int res = ToolRunner.run(new Configuration(), new WordCount(), args);
            System.exit(res);
      }
}
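Assuming the three classes above are compiled against the Hadoop 0.20 jars and packaged into a jar (the jar name and HDFS paths below are only placeholders), the job can be launched with the hadoop command. Note that the output directory must not already exist, or the job will fail at submission.

hadoop jar wordcount.jar WordCount /user/hadoop/wordcount/input /user/hadoop/wordcount/output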

Comments

  1. Thanks for the example. I would like to know what the role of each class is.
    Thanks and regards

  2. Hi Mehdi

    As the names suggest:
    - Mapper
    - Reducer
    - Driver

    The Driver is the trigger point. To know more about the Mapper and Reducer, please have a look at http://kickstarthadoop.blogspot.com/2011/04/word-count-hadoop-map-reduce-example.html

    For a detailed understanding I'd recommend "Hadoop: The Definitive Guide" by Tom White

      Reply: How do I configure Hadoop using Java so that I can run this word count and other tasks?
  3. As the names suggest:
    WordCountMapper.java - Mapper
    WordCountReducer.java - Reducer
    WordCount.java - Driver

  4. How do you extend this functionality to an XML input file instead of a text file?

  5. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.hadoop.conf.Configuration.<clinit>(Configuration.java:139)
    at WordCount.main(WordCount.java:39)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    ... 2 more

  6. Hi Bijoy,

    Excellent post.
    Your blog (both the older and newer API versions) helped me understand the mapper and reducer jobs, especially the shuffling and sorting part, which I had not been able to grasp before. Thanks for the post. Keep it up!