[ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966288#comment-14966288 ]
Sunil G commented on YARN-4284:
-------------------------------

Thank you [~sjlee0] for the comments. Yes, I understood your point and got the idea from the patch as well. I had assumed that we were looking at a general blacklisting for all apps based on a failure of one app attempt on a node. Thank you for clarifying.

This change seems mostly fine to me. But as you said, the solution is slightly aggressive in marking a node as blacklisted per app. I am also worried about cases like preemption by the RM ({{ContainerExitStatus.PREEMPTED}} or {{KILLED_BY_RESOURCEMANAGER}}). Due to queue over-usage, the RM may select the AM container for preemption (this is very unlikely to happen with YARN-1496, but it's possible). If the application marks this node as blacklisted due to preemption or similar cases, I don't think that is correct. What do you think?

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is
> limited to {{DISKS_FAILED}}. There are a whole host of other issues that may
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in
> blacklisting the nodes on *any failure* (although it might lead to
> blacklisting the node more aggressively than necessary). I would propose
> locating the next app attempt to a different node on any failure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
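
To make the preemption concern in the comment above concrete, here is a minimal sketch of a "blacklist on any failure, except RM-initiated exits" check. It is not taken from YARN-4284.001.patch. The class {{AmBlacklistSketch}}, the enum {{ExitReason}} and the method {{shouldBlacklistNode}} are all hypothetical names invented for illustration, and the enum is only a local stand-in that mirrors a few names from Hadoop's {{ContainerExitStatus}}.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the YARN-4284 patch.
final class AmBlacklistSketch {

    // Local stand-in for the exit statuses discussed in the thread.
    // The real constants live in Hadoop's ContainerExitStatus; this enum
    // only mirrors some of the names and is an assumption for illustration.
    enum ExitReason {
        SUCCESS,
        DISKS_FAILED,
        PREEMPTED,
        KILLED_BY_RESOURCEMANAGER,
        OTHER_FAILURE
    }

    // Exits initiated by the RM itself, which say nothing about node health.
    private static final Set<ExitReason> RM_INITIATED =
        EnumSet.of(ExitReason.PREEMPTED, ExitReason.KILLED_BY_RESOURCEMANAGER);

    // Returns true when the node that ran the failed AM container should be
    // blacklisted for the next attempt of the same application.
    static boolean shouldBlacklistNode(ExitReason reason) {
        if (reason == ExitReason.SUCCESS) {
            return false; // nothing failed, nothing to blacklist
        }
        // Any failure counts, except those the RM caused on purpose.
        return !RM_INITIATED.contains(reason);
    }
}
```

With this rule, {{DISKS_FAILED}} and other node-side failures would still move the next attempt elsewhere, while a preempted AM container would leave the node eligible. Whether other exit statuses belong in the exclusion set is exactly the open question in this thread.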