[ https://issues.apache.org/jira/browse/YARN-4728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15160305#comment-15160305 ]

zhihai xu commented on YARN-4728:
---------------------------------

Thanks for reporting this issue [~Silnov]!
It looks like this issue is caused by long timeouts at two levels, and it is
similar to YARN-3944, YARN-4414, YARN-3238 and YARN-3554. You may work around
it by changing the configuration values:
"ipc.client.connect.max.retries.on.timeouts" (default is 45),
"ipc.client.connect.timeout" (default is 20000 ms) and
"yarn.client.nodemanager-connect.max-wait-ms" (default is 900,000 ms).

> MapReduce job doesn't make any progress for a very, very long time after one
> node becomes unusable.
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4728
>                 URL: https://issues.apache.org/jira/browse/YARN-4728
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler, nodemanager, resourcemanager
>    Affects Versions: 2.6.0
>         Environment: hadoop 2.6.0
> yarn
>            Reporter: Silnov
>            Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I have some nodes running hadoop 2.6.0.
> The cluster's configuration remains largely at the defaults.
> I run jobs on the cluster every day, including some that process a lot of data.
> Sometimes a job stays at the same progress for a very, very long time, so I
> have to kill it manually and re-submit it to the cluster. That used to work
> (the re-submitted job would run to completion), but something went wrong today.
> After I re-submitted the same job 3 times, it deadlocked every time: the
> progress stopped changing for a long time, and each run stalled at a different
> progress value (e.g. 33.01%, 45.8%, 73.21%).
> I checked the Hadoop web UI and found 98 map tasks pending while the running
> reduce tasks had consumed all the available memory. I stopped YARN, added the
> configuration below to yarn-site.xml, and restarted YARN.
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.rampup.limit</name>
>   <value>0.1</value>
> </property>
> <property>
>   <name>yarn.app.mapreduce.am.job.reduce.preemption.limit</name>
>   <value>1.0</value>
> </property>
> (wanting YARN to preempt the reduce tasks' resources so the pending map tasks
> can run)
> After restarting YARN, I submitted the job with the property
> mapreduce.job.reduce.slowstart.completedmaps=1,
> but the same thing happened again: the job stayed at the same progress value
> for a very, very long time.
> I checked the Hadoop web UI again and found that the pending map tasks had been
> re-created with the note: "TaskAttempt killed because it ran on
> unusable node node02:21349".
> Then I checked the ResourceManager's log and found some useful messages:
> ******Deactivating Node node02:21349 as it is now LOST.
> ******node02:21349 Node Transitioned from RUNNING to LOST.
> I think this may happen because the network across my cluster is poor, so the
> RM does not receive the NM's heartbeat in time.
> But I wonder why the YARN framework cannot preempt the running reduce tasks'
> resources to run the pending map tasks? (This is what leaves the job at the
> same progress value for a very, very long time. :( )
> Can anyone help?
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
