[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940016#comment-14940016 ]
Eric Payne commented on YARN-4217:
----------------------------------

One way to fix this would be to blacklist the bad nodes. However, we need to be careful that the cure isn't worse than the disease. For example, Hadoop 0.20 had black/grey listing of nodes, but it was often disabled because it caused more problems than it solved. We don't want one misconfigured pipeline, spawning AMs/tasks that always fail, to convince the RM that all nodes are bad and bring the cluster to a halt. It is difficult to discern whether a failure was the node's fault or the job's fault (or sometimes neither was at fault).

I think the best initial approach is application-specific blacklisting, where the RM tracks bad nodes per application rather than across applications. That way an AM that isn't working on a node can be retried on another node, but a misconfigured or specialized AM won't break the node for other AMs/tasks that run just fine there. The drawback, of course, is that if the node really is totally bad, each application has to learn that separately.

> Failed AM attempt retries on same failed host
> ---------------------------------------------
>
>                 Key: YARN-4217
>                 URL: https://issues.apache.org/jira/browse/YARN-4217
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: applications
>    Affects Versions: 2.7.1
>            Reporter: Eric Payne
>
> This happens when the cluster is maxed out. One node is going bad, so
> everything that happens on it fails, so the bad node is never busy. Since the
> cluster is maxed out, when the RM looks for a node with available resources,
> it will always find the almost-bad one, because nothing can run on it, so it
> has available resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
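The per-application tracking discussed in the comment could be sketched roughly as below. This is a minimal illustration, not actual YARN code: the class name, the failure threshold, and the safety-valve ratio (disabling the blacklist when "too much" of the cluster looks bad, so a broken job can't starve itself) are all assumptions introduced for the example.

```java
import java.util.*;

// Hypothetical sketch of per-application node blacklisting. Because the
// failure counts are scoped to one application, a misconfigured job that
// fails everywhere cannot poison scheduling decisions for other apps.
class PerAppBlacklist {
    // Failures observed by THIS application only, keyed by node name.
    private final Map<String, Integer> failuresByNode = new HashMap<>();
    private final int failureThreshold;   // failures before a node is avoided (assumed knob)
    private final int clusterSize;
    private final double maxBlacklistRatio; // safety valve (assumed knob)

    PerAppBlacklist(int failureThreshold, int clusterSize, double maxBlacklistRatio) {
        this.failureThreshold = failureThreshold;
        this.clusterSize = clusterSize;
        this.maxBlacklistRatio = maxBlacklistRatio;
    }

    // Called when an AM attempt or task for this application fails on a node.
    void recordFailure(String node) {
        failuresByNode.merge(node, 1, Integer::sum);
    }

    // Nodes this application should avoid when requesting containers.
    Set<String> blacklistedNodes() {
        Set<String> bad = new TreeSet<>();
        for (Map.Entry<String, Integer> e : failuresByNode.entrySet()) {
            if (e.getValue() >= failureThreshold) {
                bad.add(e.getKey());
            }
        }
        // If "everything" looks bad, the job itself is probably the problem;
        // disable blacklisting rather than leave the application with no
        // nodes to run on ("the cure worse than the disease").
        if (bad.size() > clusterSize * maxBlacklistRatio) {
            return Collections.emptySet();
        }
        return bad;
    }
}
```

The trade-off from the comment shows up directly: each `PerAppBlacklist` instance learns about a genuinely dead node independently, at the cost of one extra failed attempt per application, but no single application's failures ever affect another's view of the cluster.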