[ https://issues.apache.org/jira/browse/YARN-4217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940016#comment-14940016 ]
Eric Payne commented on YARN-4217:
----------------------------------

One way to fix this would be to blacklist the bad nodes. However, we need to be careful that the cure isn't worse than the disease. For example, Hadoop 0.20 had black/grey listing of nodes, but it was often disabled because it caused more problems than it solved. We don't want one misconfigured pipeline, spawning AMs/tasks that always fail, to convince the RM that all nodes are bad and bring the cluster to a halt. It is difficult to discern whether a failure was the node's fault or the job's fault (or sometimes neither was at fault).

I think the best initial approach is application-specific blacklisting, where the RM tracks bad nodes per application rather than across applications. That way an AM that isn't working on a node can be retried on another node, but a misconfigured or specialized AM won't break the node for other AMs/tasks that run just fine there. The drawback, of course, is that if the node really is totally bad, each application has to learn that separately.

> Failed AM attempt retries on same failed host
> ---------------------------------------------
>
>                 Key: YARN-4217
>                 URL: https://issues.apache.org/jira/browse/YARN-4217
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: applications
>    Affects Versions: 2.7.1
>            Reporter: Eric Payne
>
> This happens when the cluster is maxed out. One node is going bad, so
> everything that happens on it fails, so the bad node is never busy. Since the
> cluster is maxed out, when the RM looks for a node with available resources,
> it will always find the almost-bad one, because nothing can run on it, so it
> has available resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
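The per-application tracking discussed in the comment could be sketched roughly as below. This is a minimal illustration, not actual YARN code: the class name, the failure threshold, and the safety-valve ratio (disabling the blacklist when "too much" of the cluster looks bad, so a broken job can't starve itself) are all assumptions introduced for the example.

```java
import java.util.*;

// Hypothetical sketch of per-application node blacklisting. Because the
// failure counts are scoped to one application, a misconfigured job that
// fails everywhere cannot poison scheduling decisions for other apps.
class PerAppBlacklist {
    // Failures observed by THIS application only, keyed by node name.
    private final Map<String, Integer> failuresByNode = new HashMap<>();
    private final int failureThreshold;   // failures before a node is avoided (assumed knob)
    private final int clusterSize;
    private final double maxBlacklistRatio; // safety valve (assumed knob)

    PerAppBlacklist(int failureThreshold, int clusterSize, double maxBlacklistRatio) {
        this.failureThreshold = failureThreshold;
        this.clusterSize = clusterSize;
        this.maxBlacklistRatio = maxBlacklistRatio;
    }

    // Called when an AM attempt or task for this application fails on a node.
    void recordFailure(String node) {
        failuresByNode.merge(node, 1, Integer::sum);
    }

    // Nodes this application should avoid when requesting containers.
    Set<String> blacklistedNodes() {
        Set<String> bad = new TreeSet<>();
        for (Map.Entry<String, Integer> e : failuresByNode.entrySet()) {
            if (e.getValue() >= failureThreshold) {
                bad.add(e.getKey());
            }
        }
        // If "everything" looks bad, the job itself is probably the problem;
        // disable blacklisting rather than leave the application with no
        // nodes to run on ("the cure worse than the disease").
        if (bad.size() > clusterSize * maxBlacklistRatio) {
            return Collections.emptySet();
        }
        return bad;
    }
}
```

The trade-off from the comment shows up directly: each `PerAppBlacklist` instance learns about a genuinely dead node independently, at the cost of one extra failed attempt per application, but no single application's failures ever affect another's view of the cluster.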