[
https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305
]
zhihai xu commented on YARN-4728:
---------------------------------
Thanks for reporting this issue [~Silnov]!
It looks like this issue is caused by long retry timeouts at two levels (the
IPC client connection and the NodeManager client). It is similar to YARN-3944,
YARN-4414, YARN-3238, and YARN-3554. You may work around it by lowering the
following configuration values:
"ipc.client.connect.max.retries.on.timeouts" (default: 45),
"ipc.client.connect.timeout" (default: 20000 ms), and
"yarn.client.nodemanager-connect.max-wait-ms" (default: 900,000 ms).
> MapReduce job doesn't make any progress for a very long time after one
> node becomes unusable.
> -------------------------------------------------------------------------------------------------
>
> Key: YARN-4728
> URL: https://issues.apache.org/jira/browse/YARN-4728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler, nodemanager, resourcemanager
> Affects Versions: 2.6.0
> Environment: hadoop 2.6.0
> yarn
> Reporter: Silnov
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I have some nodes running Hadoop 2.6.0.
> The cluster's configuration largely remains at the defaults.
> I run jobs on the cluster (especially jobs that process a lot of data)
> every day.
> Sometimes a job stays at the same progress value for a very long time, so I
> have to kill it manually and re-submit it to the cluster. This worked well
> before (the re-submitted job would run to completion), but something went
> wrong today.
> After I re-submitted the same job three times, it deadlocked each time (the
> progress stopped changing for a long time, at a different value each run,
> e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks suspended while the
> running reduce tasks had consumed all the available memory. I stopped YARN,
> added the configuration below to yarn-site.xml, and then restarted YARN.
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
> (the intent being for YARN to preempt the reduce tasks' resources to run the
> suspended map tasks)
> After restarting YARN, I submitted the job with the property
> mapreduce.job.reduce.slowstart.completedmaps=1.
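> For illustration, a sketch of the same override in mapred-site.xml form
> (assuming one wants it cluster-wide rather than per job submission):
>   <property>
>     <name>mapreduce.job.reduce.slowstart.completedmaps</name>
>     <value>1.0</value>
>   </property>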
> But the same result occurred again: the job stayed at the same progress
> value for a very long time.
> I checked the Hadoop web UI again and found that the suspended map tasks
> were re-created with the note "TaskAttempt killed because it ran on
> unusable node node02:21349".
> Then I checked the ResourceManager's log and found some useful messages:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may have happened because the network across my cluster is not
> good, which caused the RM to miss the NM's heartbeats.
> But I wonder why the YARN framework can't preempt the running reduce tasks'
> resources to run the suspended map tasks. (This leaves the job at the same
> progress value for a very long time :( )
> Can anyone help?
> Thank you very much!