Friday, February 10, 2012

Enable Multiple threads in a mapper aka MultithreadedMapper


As the name suggests, a MultithreadedMapper is a map task that spawns multiple threads. A map task can be thought of as a process running within its own JVM boundary; a MultithreadedMapper spawns multiple threads within that single map task. Don't confuse this with running multiple tasks within the same JVM (that is achieved with JVM reuse). When a task has multiple threads, the task still uses the input split as defined by the input format, and the record reader reads the input just like a normal map task. The multithreading happens after this stage: once the records have been read, they are handed off to multiple threads. In other words, the input IO is not multithreaded; the multiple threads come into the picture only after that.
MultithreadedMapper is a good fit if your map operation is highly CPU intensive, so that multiple threads getting multiple CPU cycles can help speed up the task. If the job is IO intensive, running multiple map tasks is much better than multithreading a single one, because with multiple tasks the IO reads happen in parallel.
Let us see how we can use MultithreadedMapper. The old and new MapReduce APIs do this in different ways.
Old API
Enable the multithreaded map runner as
-D mapred.map.runner.class=org.apache.hadoop.mapred.lib.MultithreadedMapRunner
Or
jobConf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);
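
For illustration, here is a minimal old-API driver sketch. The UpperCaseMapper class, job name and input/output paths are made up for the example; only the setMapRunnerClass call and the thread-count property are specific to multithreading, and 10 threads is just an assumed value (it is also the runner's default, as far as I recall).

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedOldApiDriver {

  // An ordinary old-API mapper. Nothing in it is thread specific, but because
  // MultithreadedMapRunner calls map() from several threads at once, it must
  // not keep mutable shared state.
  public static class UpperCaseMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      output.collect(new Text(value.toString().toUpperCase()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf jobConf = new JobConf(MultithreadedOldApiDriver.class);
    jobConf.setJobName("multithreaded-old-api");

    jobConf.setMapperClass(UpperCaseMapper.class);
    jobConf.setMapRunnerClass(MultithreadedMapRunner.class);      // swap in the threaded runner
    jobConf.setInt("mapred.map.multithreadedrunner.threads", 10); // threads per map task

    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient.runJob(jobConf);
  }
}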

New API
Your mapper class should subclass (extend) org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of org.apache.hadoop.mapreduce.Mapper. MultithreadedMapper has a different implementation of the run() method, which is where the worker threads are spawned.

You can set the number of threads within a MultithreadedMapper by
MultithreadedMapper.setNumberOfThreads(job, n);
Or
mapred.map.multithreadedrunner.threads = n
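
For illustration, here is a minimal new-API driver sketch. Instead of subclassing, it wires things up the other common way: MultithreadedMapper is set as the job's mapper class and the actual map logic (the made-up UpperCaseMapper below) is plugged in via MultithreadedMapper.setMapperClass. Either way, MultithreadedMapper's run() drives your map() from a pool of threads, so the map code must be thread safe.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultithreadedNewApiDriver {

  // The real map logic, written as a plain new-API Mapper. MultithreadedMapper
  // runs several instances of it in parallel inside one map task.
  public static class UpperCaseMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(value.toString().toUpperCase()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "multithreaded-new-api");
    job.setJarByClass(MultithreadedNewApiDriver.class);

    // MultithreadedMapper is the task-level mapper; it runs UpperCaseMapper
    // instances on a pool of threads inside a single map task.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, UpperCaseMapper.class);
    MultithreadedMapper.setNumberOfThreads(job, 10);   // same effect as the property above

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}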
 


Note: Don't assume that a multithreaded mapper is better than normal MapReduce just because it spawns fewer JVMs and fewer processes. If a mapper is loaded with lots of threads, the chances of that JVM crashing are higher, and the cost of re-executing such a Hadoop task would be terribly high.
Don't use MultithreadedMapper to control the number of JVMs spawned; if that is your goal, tweak the mapred.job.reuse.jvm.num.tasks parameter, whose default value is 1, meaning no JVM reuse across tasks.
The threads sit at the bottom level, i.e. within a map task; the higher levels of the Hadoop framework, such as the job, know nothing about them.
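
To make the JVM-reuse point concrete, here is a small hedged sketch of setting that parameter (the class name is made up; setNumTasksToExecutePerJvm is, as far as I recall, the old-API JobConf helper that wraps the same property):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseConfigSketch {
  public static void main(String[] args) {
    JobConf jobConf = new JobConf();

    // -1 means a task JVM may be reused for any number of tasks of the same
    // job on that node; the default of 1 means no reuse at all.
    jobConf.setNumTasksToExecutePerJvm(-1);

    // Equivalent raw property, which is what you would also set on the
    // Configuration backing a new-API Job.
    jobConf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    System.out.println(jobConf.get("mapred.job.reuse.jvm.num.tasks"));
  }
}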

Comments:

  1. Found this useful.
    A few questions:
    If I set mapred.job.reuse.jvm.num.tasks = -1 and my NLineInputFormat has a value of 5 with 20 lines in total, how is it executed internally?

  2. Hey,
    If you are using NLineInputFormat with that spec it is simple: 5 lines per map task instance, so 20 lines means 4 map tasks. When JVM reuse is -1, all the map tasks on the same node/TaskTracker will be using the same JVM instance. To be noted, you can never guarantee that all the mappers will be on the same node; it depends on factors like the slots available, the scheduler used, etc. (a small driver sketch follows this comment).

    Regards
    Bejoy

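To make the exchange above concrete, here is a small old-API driver sketch (class name and paths are made up, and the lines-per-split property name is the old-API one as far as I recall): NLineInputFormat with 5 lines per split over a 20-line input yields 4 map tasks, while the JVM-reuse property only decides whether those tasks may share a JVM on a node, not how many tasks are created.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineJvmReuseSketch {
  public static void main(String[] args) throws Exception {
    JobConf jobConf = new JobConf(NLineJvmReuseSketch.class);
    jobConf.setJobName("nline-jvm-reuse");

    // 5 lines per input split => a 20-line file becomes 4 map tasks.
    jobConf.setInputFormat(NLineInputFormat.class);
    jobConf.setInt("mapred.line.input.format.linespermap", 5);

    // -1 = no limit: tasks scheduled on the same TaskTracker may share a JVM.
    jobConf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    // Identity map-only job, just to make the split behaviour observable.
    jobConf.setMapperClass(IdentityMapper.class);
    jobConf.setNumReduceTasks(0);
    jobConf.setOutputKeyClass(LongWritable.class);
    jobConf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
    JobClient.runJob(jobConf);
  }
}
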
  3. In our case, maps are memory and CPU bound. We are also planning to use a multithreaded mapper to gain efficiency in several respects:
    1. memory (common data structures) will be shared across multiple threads
    2. each thread will be scheduled on a different core and hence should be equivalent to running multiple maps on the same physical box
    3. should be able to specify a bigger split size and hence combiner efficiency should improve

    I have the following queries:
    1. What are the disadvantages of a multithreaded mapper beyond those mentioned in the notes section?
    2. What could be other options to increase efficiency if maps are CPU and memory bound, primarily CPU bound?

    Manish Jain
    Guavus

  4. Hi Manish
    From your requirements, it looks like you don't need MultithreadedMapper. Use normal MapReduce; you can meet your requirement of sharing data across mappers using the distributed cache.

    Some pointers inline

    1. memory (common data structures) will be shared across multiple threads
    >> use the distributed cache (a sketch follows after this comment)
    2. each thread will be scheduled on a different core and hence should be equivalent to running multiple maps on the same physical box
    >> multiple maps is a cleaner approach here than going for the multithreaded one
    3. should be able to specify a bigger split size and hence combiner efficiency should improve
    >> it is a trade-off between parallelism and data shuffle; more maps, more parallelism

    Regards
    Bejoy

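For the distributed cache pointer above, here is a hedged sketch (file name, lookup-file format and class names are assumptions): the driver registers a shared file with DistributedCache.addCacheFile, and each map task loads its own local copy once in setup(), so the common data structure is built per task rather than shared between threads.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper that builds an in-memory lookup table from a distributed cache file.
public class LookupJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // The driver side would have registered the file with something like:
    //   DistributedCache.addCacheFile(new URI("/shared/lookup.txt"), job.getConfiguration());
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);   // assumed format: key <TAB> value
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String joined = lookup.get(value.toString().trim());
    if (joined != null) {
      context.write(value, new Text(joined));
    }
  }
}
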
  5. Hi Bejoy
    We have a Java application which spawns multiple threads, and each thread invokes a MapReduce task. Since multiple threads are invoking MapReduce tasks, the application is failing.
    Could you please let me know how to execute MapReduce tasks in a multithreaded way.

    Thanks
    Arun

  6. Hi Arun

    This post is about a map task in a MapReduce job spawning multiple threads. From what I gather, your case is different: you are launching MapReduce jobs in each spawned thread of your Java application. If the number of threads is too high and the cluster capacity is not that great, then it can definitely clog. You may have to get the failed task/job logs and see what the root cause actually is; the root cause can be many things in your case.
