Hi Silnov,

Can you check your AM logs and compare them with the MAPREDUCE-6513 scenario? I suspect it's the same issue. MAPREDUCE-6513 is marked to go into 2.7.3.
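If log aggregation is enabled on your cluster, a quick way to pull the application's container logs (including the MRAppMaster log) for comparison is the yarn logs command; <application_id> below is a placeholder for the id shown in the RM web UI:

    # Fetch the aggregated container logs for the killed application,
    # including the AM log, so they can be compared with MAPREDUCE-6513.
    yarn logs -applicationId <application_id> > application_logs.txt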
Regards, Varun Saxena.

From: Silnov [mailto:[email protected]]
Sent: 24 February 2016 14:52
To: user
Subject: MapReduce job doesn't make any progress for a very long time after one node becomes unusable.

Hello everyone!

I am new to Hadoop and have a question I'd appreciate your help with. I have some nodes running Hadoop 2.6.0, and the cluster configuration is largely at its defaults. Every day I run jobs on the cluster, some of which process a lot of data. Sometimes a job stays at the same progress value for a very long time, so I have to kill it manually and re-submit it. That used to work (the re-submitted job would run to completion), but something went wrong today: after re-submitting the same job three times, each run stalled, every time at a different progress value (e.g. 33.01%, 45.8%, 73.21%).

I checked the Hadoop web UI and found 98 map tasks pending while the running reduce tasks had consumed all of the available memory. I stopped YARN, added the configuration below to yarn-site.xml, and restarted YARN, hoping it would preempt the reduce tasks' resources to run the pending map tasks:

    <property>
      <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
      <value>0.1</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
      <value>1.0</value>
    </property>

After restarting YARN, I submitted the job with mapreduce.job.reduce.slowstart.completedmaps=1, but the same thing happened again: the job stayed at the same progress value for a very long time. I checked the web UI again and found that the pending map tasks had new attempts created with the note "TaskAttempt killed because it ran on unusable node node02:21349". I then checked the ResourceManager log and found these messages:

    ****** Deactivating Node node02:21349 as it is now LOST.
    ****** node02:21349 Node Transitioned from RUNNING to LOST.

I think this happened because the network across the cluster is poor, so the RM did not receive the NM's heartbeat in time. But I wonder why the YARN framework cannot preempt the running reduce tasks' resources to run the pending map tasks? (This is what leaves the job stuck at the same progress value for so long.)

Can anyone help? Thank you very much!
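As an aside, the reduce rampup/preemption limits and the slowstart threshold above are per-job settings read by the MapReduce AM, so they could also be passed at job submission time instead of being edited into yarn-site.xml. A rough sketch, assuming the job driver goes through ToolRunner/GenericOptionsParser so that -D options are applied (my-job.jar, MyDriver, /input and /output are placeholder names):

    # Placeholder jar, driver class and paths; -D only takes effect if the
    # driver parses generic options (ToolRunner / GenericOptionsParser).
    hadoop jar my-job.jar MyDriver \
      -D yarn.app.mapreduce.am.job.reduce.rampup.limit=0.1 \
      -D yarn.app.mapreduce.am.job.reduce.preemption.limit=1.0 \
      -D mapreduce.job.reduce.slowstart.completedmaps=1.0 \
      /input /output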
