Hello. I have seen very rare cases of map and/or reduce tasks stopping, or continuing at a very slow pace (0.1% in an hour). A piece of advice here: instead of killing the whole job, kill only the slow/stuck task; it will restart and this time (hopefully) not get stuck.
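For example, from the MapReduce CLI (the job and attempt IDs below are placeholders, not real ones):

```shell
# List running jobs to find the job ID.
mapred job -list

# Kill just the stuck task attempt; the AM will schedule a fresh attempt
# (a killed attempt does not count against the task's retry limit).
mapred job -kill-task attempt_1456300000000_0001_m_000042_0

# Or mark the attempt as failed, which DOES count against the retry limit:
mapred job -fail-task attempt_1456300000000_0001_m_000042_0
```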
Regards,
Lloyd

On 24 February 2016 at 08:56, Varun saxena <[email protected]> wrote:
> Hi Silnov,
>
> Can you check your AM logs and compare them with the MAPREDUCE-6513
> scenario? I suspect it is the same issue. MAPREDUCE-6513 is marked to go
> into 2.7.3.
>
> Regards,
> Varun Saxena.
>
> *From:* Silnov [mailto:[email protected]]
> *Sent:* 24 February 2016 14:52
> *To:* user
> *Subject:* MapReduce job doesn't make any progress for a very very long
> time after one Node become unusable.
>
> Hello everyone! I am new to Hadoop, and I have a question I would like
> your help with.
>
> I have some nodes running Hadoop 2.6.0. The cluster's configuration
> largely remains at the defaults. Every day I run some jobs on the cluster,
> including jobs that process a lot of data.
> Sometimes a job stays at the same progress value for a very, very long
> time, so I have to kill it manually and re-submit it to the cluster. This
> worked before (the re-submitted job would run to the end), but today
> something went wrong.
> After I re-submitted the same job 3 times, each run deadlocked: the
> progress value stopped changing for a long time, and each run stalled at a
> different value (e.g. 33.01%, 45.8%, 73.21%).
>
> I checked the Hadoop web UI and found 98 map tasks suspended while the
> running reduce tasks had consumed all the available memory. I stopped
> YARN, added the configuration below to yarn-site.xml, and then restarted
> YARN:
>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
>
> (the intent being that YARN would preempt the reduce tasks' resources to
> run the suspended map tasks)
> After restarting YARN, I submitted the job with the property
> mapreduce.job.reduce.slowstart.completedmaps=1.
> But the same thing happened again: my job stayed at the same progress
> value for a very, very long time.
>
> I checked the Hadoop web UI again and found that the suspended map tasks
> had been re-created, each with the previous note: "TaskAttempt killed
> because it ran on unusable node node02:21349".
> Then I checked the ResourceManager's log and found some useful messages:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
>
> I think this may happen because the network across my cluster is not
> good, so the RM doesn't receive the NM's heartbeat in time.
>
> But I wonder why the YARN framework can't preempt the running reduce
> tasks' resources to run the suspended map tasks? (This is what leaves the
> job stuck at the same progress value for a very long time.)
>
> Can anyone help?
> Thank you very much!
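If missed heartbeats over a flaky network really are the cause of the LOST transitions, one RM-side knob worth checking is the NodeManager liveness expiry interval in yarn-site.xml (a sketch only; 600000 ms is the usual default, and raising it merely delays the LOST declaration rather than fixing the network):

```
<!-- yarn-site.xml: how long the RM waits without a NodeManager heartbeat
     before declaring the node LOST (milliseconds). -->
<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>
```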
