As the name suggests, a multithreaded mapper is a map task that spawns multiple threads. A map task can be considered a process that runs within its own JVM boundary; a multithreaded mapper spawns multiple threads inside that single map task. Don't confuse this with running multiple tasks within the same JVM (that is achieved with JVM reuse). When a task has multiple threads, the task still consumes the input split as defined by the InputFormat, and the RecordReader reads the input just like in a normal map task. The multithreading happens after this stage: once records have been read, they are handed off to multiple threads for processing. (i.e. the input IO is not multithreaded; the multiple threads come into the picture only after that.)
MultithreadedMapper is a good fit if your operation is highly CPU intensive, where multiple threads getting CPU cycles can help speed up the task. If the operation is IO intensive, running multiple tasks is much better than multithreading, since with multiple tasks multiple IO reads happen in parallel.
Let us see how we can use MultithreadedMapper. The way to do this differs between the old MapReduce API and the new API.
Old API
Enable the multithreaded map runner with
-D mapred.map.runner.class=org.apache.hadoop.mapred.lib.MultithreadedMapRunner
Or
jobConf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);
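Putting this together, a minimal old-API driver sketch could look like the following (MyDriver and MyCpuHeavyMapper are hypothetical classes; the mapper must be thread-safe):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

// Sketch of an old-API job configuration, not a complete driver
JobConf jobConf = new JobConf(MyDriver.class);
jobConf.setMapperClass(MyCpuHeavyMapper.class);
// Swap the default MapRunner for the multithreaded runner
jobConf.setMapRunnerClass(MultithreadedMapRunner.class);
// Threads per map task, read from mapred.map.multithreadedrunner.threads
jobConf.setInt("mapred.map.multithreadedrunner.threads", 4);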
New API
Your mapper class should subclass (extend) org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper instead of org.apache.hadoop.mapreduce.Mapper. MultithreadedMapper has a different implementation of the run() method, which runs the map logic in multiple threads.
You can set the number of threads within a mapper in MultithreadedMapper with
MultithreadedMapper.setNumberOfThreads(job, n); or
mapred.map.multithreadedrunner.threads = n
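A minimal new-API sketch tying these pieces together (MyCpuHeavyMapper is a hypothetical, thread-safe mapper extending org.apache.hadoop.mapreduce.Mapper; rather than extending MultithreadedMapper yourself, you can equally set it as the job's mapper class and delegate to your own mapper, as shown here):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

// Sketch of a new-API job configuration, not a complete driver
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "multithreaded mapper example"); // new Job(conf, ...) on older releases
// MultithreadedMapper runs the real mapper's map() in n parallel threads
job.setMapperClass(MultithreadedMapper.class);
MultithreadedMapper.setMapperClass(job, MyCpuHeavyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 8);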
Note: Don't assume that a multithreaded mapper is better than normal MapReduce just because it spawns fewer JVMs and fewer processes. If a mapper is loaded with lots of threads, the chances of that JVM crashing are higher, and the cost of re-executing such a Hadoop task would be terribly high.
Don't use MultithreadedMapper to control the number of JVMs spawned. If that is your goal, tweak the mapred.job.reuse.jvm.num.tasks parameter instead; its default value is 1, which means no JVM reuse across tasks.
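For example, a one-line sketch of enabling unlimited JVM reuse for a job's tasks (using the old-API jobConf from above):

// -1 means any number of tasks of this job may reuse a JVM on a node
jobConf.setInt("mapred.job.reuse.jvm.num.tasks", -1);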
The threads live at the bottom level, i.e. within a map task; the higher levels of the Hadoop framework, such as the job, have no knowledge of them.
Found this useful.
A few questions:
If I am setting mapred.job.reuse.jvm.num.tasks = -1 and my NLineInputFormat value is 5 and the total number of lines is 20, how is it executed internally?
Hey,
If you are using NLineInputFormat with that spec it is simple: 5 lines per map task instance, so 20 lines means 4 map tasks. When JVM reuse is -1, all the map tasks on the same node/TaskTracker will use the same JVM instance. To be noted, you can never guarantee that all the mappers will land on the same node; that depends on factors like the slots available, the scheduler used, etc.
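A minimal sketch of that setup against the new API (job is the usual new-API Job object):

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// One map task per 5 input lines; 20 input lines -> 4 map tasks
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 5);
// Unlimited JVM reuse for this job's tasks on a given node
job.getConfiguration().setInt("mapred.job.reuse.jvm.num.tasks", -1);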
Regards
Bejoy
In our case, maps are memory and CPU bound. We are also planning to use the multithreaded mapper to achieve efficiencies in several aspects:
1. Memory (common data structures) will be shared across multiple threads.
2. Each thread will be scheduled on a different core and hence should be equivalent to running multiple maps on the same physical box.
3. We should be able to specify a bigger split size, and hence combiner efficiency should improve.
I have the following queries:
1. What are the disadvantages of the multithreaded mapper beyond those mentioned in the notes section?
2. What other options are there to increase efficiency if maps are CPU and memory bound, primarily CPU bound?
Manish Jain
Guavus
Hi Manish
From your requirement, it looks like you don't need MultithreadedMapper. Use normal MapReduce; you can achieve your requirement of sharing data across mappers using the distributed cache.
Some pointers inline
1. Memory (common data structures) will be shared across multiple threads
>> use the distributed cache (a small sketch follows after these pointers)
2. Each thread will be scheduled on a different core and hence should be equivalent to running multiple maps on the same physical box
>> multiple maps is a cleaner approach here than going for the multithreaded one
3. Should be able to specify a bigger split size and hence combiner efficiency should improve
>> it is a trade-off between parallelism and data shuffle; more maps, more parallelism
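A minimal sketch of the distributed cache usage mentioned above (the HDFS path /shared/lookup.dat is a hypothetical file holding the common data structure):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// In the driver: ship the shared file to every task node's local disk
DistributedCache.addCacheFile(new URI("/shared/lookup.dat"), job.getConfiguration());

// In the mapper's setup(): read the local copy once per task
Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
// ... load cached[0] into an in-memory structure and use it from map()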
Regards
Bejoy
Hi Bejoy
We have a Java application which spawns multiple threads, and each thread invokes a MapReduce job. Since multiple threads are invoking MapReduce jobs, the application is failing.
Could you please let me know how to execute MapReduce jobs in a multithreaded way?
Thanks
Arun
Hi Arun
This post is about a map task within a MapReduce job spawning multiple threads. From what I understand, your case is different: you are launching a MapReduce job from each spawned thread in your Java application. If the number of threads is too high and the cluster capacity is not that great, then it can certainly clog. You will have to get the failed task/job logs and see what the actual root cause is; there could be many root causes in your case.