[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280745#comment-15280745 ] Jian He commented on YARN-4635: --- I guess this is a bit late for 2.8; let's move it to 2.9?

> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
> We need a global blacklist, in addition to each app's blacklist, to track AM container failures that have a global effect. That means we need to differentiate the non-succeeded ContainerExitStatus values caused by the NM from those more related to the app. For more details, please refer to the document in YARN-4576.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132071#comment-15132071 ] Sunil G commented on YARN-4635: --- [~vvasudev], I also think the same way. [~djp], could you also share the plan?
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132072#comment-15132072 ] Sunil G commented on YARN-4635: --- Meanwhile, I will provide an early version of the patch based on this ticket in YARN-4637 and start the discussion there. Thank you.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132322#comment-15132322 ] Sunil G commented on YARN-4635: --- Hi [~jianhe]
bq. I said this because I feel the API may look simpler and we don't need a separate nested AMBlackListingRequest class
For this feature, having a dedicated {{AMBlackListingRequest}} class will be clearer to the end user. A few minor advantages: if the enable/disable flag is not present in AMBlackListingRequest, then, as you mentioned, we would use the threshold alone, and it would have to convey four cases:
- The user does not configure this information from the AM end (a default value is needed, perhaps a negative one).
- The user wants to disable blacklisting for this specific application (the threshold must be 0).
- The user wants blacklisting and configures a threshold value.
- The user wants this feature but does not know a good threshold and wants to use the global one, so they would have to give some value greater than 1.0f to get that behavior.
With a flag, we can simply turn this feature on or off per app, and if the user does not want it, they need not set the blacklist object in the context at all (null will be set). These are not very strong reasons, and as you said we can achieve the current behavior either way. No problem choosing either option. :)
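The four cases above, contrasted with the flag-based API, can be sketched as follows. This is a hypothetical illustration of the threshold-only encoding, not code from the patch; the class name and sentinel values are assumptions:

```java
// Hypothetical sketch (not the actual YARN API) of the four cases a
// threshold-only AM blacklisting setting would have to encode.
public final class AmBlacklistingSetting {
    static final float NOT_SET = -1.0f; // case 1: user gave no value

    /** Resolve the effective threshold from a raw per-app value. */
    static float resolve(float appValue, float globalThreshold) {
        // Case 1 (not set) and case 4 (value > 1.0f) fall back to the
        // global threshold; case 2 is 0 (disabled); case 3 is (0, 1.0f].
        if (appValue == NOT_SET || appValue > 1.0f) {
            return globalThreshold;
        }
        return appValue;
    }
}
```

With an explicit flag, only the enabled case needs a meaningful threshold; the other three collapse into "flag off" or "request absent".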
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131994#comment-15131994 ] Sunil G commented on YARN-4635: --- Hi [~vvasudev], YARN-4637 will handle the purging mechanism. We have a few possible options/policies for purging (time based, NM event based, etc.). Once this patch is in shape, YARN-4637 can make progress based on it.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132004#comment-15132004 ] Varun Vasudev commented on YARN-4635: --- [~sunilg] - does that mean this won't get committed until YARN-4637 is ready?
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131938#comment-15131938 ] Varun Vasudev commented on YARN-4635: --- I agree with [~jianhe]:
1) KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM are container specific, and there is no reason to blacklist the node; the AM will be killed for exceeding its pmem and vmem irrespective of other containers on the node.
2) For DISKS_FAILED, the RM should mark the disk as bad and the node should be skipped.
In addition, I don't see any mechanism for purging the blacklist in this patch. If that's the case, we should work on this in a feature branch and not commit to trunk/branch-2 until we have the purge mechanism sorted out.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130272#comment-15130272 ] Junping Du commented on YARN-4635: --- Thanks [~jianhe] for the review and comments. First, I would like to state an assumption: the blacklist mechanism for AM launching is not for tracking nodes that do not work at all (unhealthy), but for tracking nodes suspected of failing the AM container based on previous failures. We already have the unhealthy-report mechanism for serious NM issues, so this one should have a higher bar based on history (in some sense, an AM container is more important than other containers). My responses below are based on this assumption.
bq. why should the container exit statuses below blacklist the node?
The container failure could be due to resource congestion (like KILLED_EXCEEDED_PMEM) or an unknown reason (ABORTED, INVALID) that makes this NM more suspect than normal nodes.
bq. For DISKS_FAILED which is considered as global blacklist node in this jira, I think in this case, the node will report as unhealthy and RM should remove the node already.
Some DISKS_FAILED cases can happen because the failed container wrote a disk full, but the node could still have other directories available to use; it could still launch normal containers while not being suitable to risk an AM container.
bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do you think it's ok to just use the threshold number only? 0 means disabled, and numbers larger than 0 means enabled?
If so, the job submitter would have to understand how many nodes the current cluster has, and the job parameters would need updating if the job is submitted to a different cluster (with a different number of nodes). IMO, that sounds like more complexity for users.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131632#comment-15131632 ] Jian He commented on YARN-4635: ---
bq. Some DISKS_FAILED cases can happen because the failed container wrote a disk full, but the node could still have other directories available to use; it could still launch normal containers while not being suitable to risk an AM container.
In the current code, the DISKS_FAILED status is set when this condition is true:
{code}
if (!dirsHandler.areDisksHealthy()) {
  ret = ContainerExitStatus.DISKS_FAILED;
  throw new IOException("Most of the disks failed. "
      + dirsHandler.getDisksHealthReport(false));
}
{code}
The same check, {{dirsHandler.areDisksHealthy}}, is used by the disk health monitor:
{code}
boolean isHealthy() {
  boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
      : nodeHealthScriptRunner.isHealthy();
  return scriptHealthStatus && dirsHandler.areDisksHealthy();
}
{code}
Essentially, if this condition is false, the node will be reported as unhealthy in the first place, which makes the RM remove the node, and the global blacklist becomes not useful in practice because the node has already been removed. Maybe I missed something; a unit test could prove this.
bq. If so, it means the job submitter have to understand how many nodes the current cluster have
Sorry, I don't understand why the job submitter needs to know the number of nodes. What I meant is that right now a boolean flag (false) is used to indicate that this feature is disabled; alternatively, a 0 threshold can achieve the same result (with a logic change on the RM side). I said this because I feel the API may look simpler and we would not need a separate nested AMBlackListingRequest class; having the threshold set in the submissionContext would be enough. But I don't have a strong opinion on this; the current way is OK too.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130324#comment-15130324 ] Sunil G commented on YARN-4635: --- Thanks [~jianhe] for the comments and thanks [~djp] for the clarifications.
bq. Do you think it's ok to just use the threshold number only? 0 means disabled, and numbers larger than 0 means enabled
Adding one more minor advantage over using only a threshold: if the app specifies the {{AMBlackListingRequest}} flag as false, then global blacklisting will not apply to this app. Such control is easier with a flag, I think. How do you feel?
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129036#comment-15129036 ] Hadoop QA commented on YARN-4635: --- -1 overall.
- +1: @author, test4tests (the patch appears to include 3 new or modified test files), mvninstall, compile and javac (JDK v1.8.0_66 and v1.7.0_91), mvnsite, mvneclipse, whitespace, findbugs, and javadoc (both JDKs) for trunk and the patch; unit tests for hadoop-yarn-api and hadoop-yarn-common passed with both JDKs.
- -1: checkstyle (hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 803 unchanged - 4 fixed = 807 total (was 807)).
- -1: unit, hadoop-yarn-server-resourcemanager failed with JDK v1.8.0_66 (76m 51s); a further -1 unit result is truncated in this message.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128389#comment-15128389 ] Sunil G commented on YARN-4635: --- Hi [~djp], thanks for sharing the patch quickly. Overall it looks fine to me. A few points:
1. The per-app blacklist manager need not consider removing a node from its blacklist, but for the global blacklist manager I think we need a {{removeNode}} interface in {{BlacklistManager}}. If we can launch an AM container on a node at some later point after the first failure, we can remove that node immediately from the global blacklist. Maybe {{RMAppAttemptImpl#checkStatusForNodeBlacklisting}} can check for success too (or are we planning to handle this in the ticket for the time-based clearing mechanism?). Thoughts?
2. I think {{SimpleBlacklistManager#refreshNodeHostCount}} can pre-compute the failure threshold along with updating {{numberOfNodeManagerHosts}}, so whoever invokes {{getBlacklistUpdates}} need not compute it every time. This is a minor suggestion on the existing code.
3. {code}
+// No thread safe problem as getBlacklistUpdates() in
+// SimpleBlacklistManager do clone operation to blacklistNodes
+List amBlacklistAdditions = new ArrayList();
{code}
There is a chance of duplicates between the global and per-app blacklists, correct? So could we use a Set here? One possibility: one AM container fails due to ABORTED and the node is added to the per-app blacklist; the second attempt fails due to DISKS_FAILED and the node is added to the global one. Now this is a duplicate scenario. Thoughts?
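The duplicate scenario in point 3 can be sketched with a Set-based merge. This is a hypothetical helper, not code from the patch; the class name, method name, and host names are illustrative:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: merge per-app and global blacklist additions
// through a LinkedHashSet so a host that failed for both an app-local
// reason (e.g. ABORTED) and a global one (e.g. DISKS_FAILED) is
// reported only once.
public final class BlacklistMerge {
    static List<String> mergeAdditions(List<String> perApp, List<String> global) {
        // LinkedHashSet drops duplicates while keeping insertion order.
        Set<String> merged = new LinkedHashSet<>(perApp);
        merged.addAll(global);
        return new ArrayList<>(merged);
    }
}
```

For example, merging {{["node1", "node2"]}} (per-app) with {{["node2", "node7"]}} (global) yields {{["node1", "node2", "node7"]}}.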
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128533#comment-15128533 ] Junping Du commented on YARN-4635: ---
bq. If we can launch an AM container at some later point of time after the first failure, we can remove that node immediately from global blacklisting.
In most cases the AM container won't get a chance to launch on this node again, because the blacklist mechanism already prevents it from being allocated there. The corner case is two AM containers launched at the same time, one failing and the other succeeding. IMO, the successful one should not purge the node from the blacklist as a normal node, because a failure marked as globally affecting, like DISKS_FAILED, could still hit coming AM containers; in other words, the node is still riskier for AM launches, and that is not changed by another AM container finishing. We can discuss purging nodes from the global list (time based, event based such as NM reconnect, etc.) in the dedicated JIRA YARN-4637 that I filed earlier.
bq. I think SimpleBlacklistManager#refreshNodeHostCount can pre-compute failure threshold also along with updating numberOfNodeManagerHosts. So whoever is invoking getBlacklistUpdates need not have to compute always.
Sounds good. Updated in the v2 patch.
bq. There are chances of duplicates from global and per-app level blacklists, correct?. So could we use a Set here.
Nice catch! The same app with different attempts won't cause this duplicate, but a possible duplicate scenario is: one app's AM fails on a node for a reason like ABORTED while, at the same time, another app's AM fails on the same node for DISKS_FAILED; the same node could then be duplicated across the two lists. Fixed in the v2 patch.
There is another issue: the threshold control in BlacklistManager is applied to the two lists (global and per-app) separately, so together the two lists could unexpectedly blacklist all nodes. We need a thread-safe merge operation for the two BlacklistManagers to address this. I marked a TODO in the patch and will file a separate JIRA to fix it.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128599#comment-15128599 ] Sunil G commented on YARN-4635: --- Thanks [~djp].
bq. We can discuss more about purge node from global list, like: time based, event (NM reconnect) based, etc. in a dedicated JIRA YARN-4637
+1. Yes, we can cover the time-based and event-based cases in that JIRA. And as you mentioned, the corner case arises only if an AM is launched on a node that is later blacklisted due to another app's failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128628#comment-15128628 ] Sunil G commented on YARN-4635: ---
bq. it is possible that two lists together could unexpectedly blacklist all nodes.
Hi [~djp], is this the case where node1 to node6 are blacklisted by the app and node7 to node10 by the global manager (considering we have node1 to node10 and a disableThreshold of 0.8)? Could we also check {{disableThreshold}} on the combined Set we now create, and if it crosses the limit, clear the app-based / global blacklists from this list? Could that solve the scenario above?
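The check proposed above can be sketched as follows. This is a hypothetical illustration of applying {{disableThreshold}} to the combined set, not code from the patch; the class and method names are assumptions:

```java
import java.util.Set;

// Hypothetical sketch: if the merged per-app + global blacklist covers
// more than disableThreshold of the cluster's nodes, the caller should
// clear (or skip) blacklisting so the AM can still be scheduled somewhere.
public final class ThresholdCheck {
    static boolean exceedsThreshold(Set<String> mergedBlacklist,
                                    int clusterNodeCount,
                                    double disableThreshold) {
        return mergedBlacklist.size() > clusterNodeCount * disableThreshold;
    }
}
```

For the scenario above (10 blacklisted nodes out of 10, threshold 0.8) the check fires; at exactly 8 blacklisted nodes out of 10 it does not.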
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128640#comment-15128640 ] Sunil G commented on YARN-4635: --- Yes. We need a thread-safe way of merging here; otherwise it may cause some corner cases.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128664#comment-15128664 ] Junping Du commented on YARN-4635: --- Thanks for the comments, Sunil.
bq. Is this the case where node1 to node6 is blacklisted by app and node7 to node10 is blacklisted by global manager.
Yes, that is correct.
bq. Could we also check disableThreshold on the total Set which we created now. And if we crosses the limit, clear app based / global based blacklists from this list. Could this solve the above mentioned scenario?
It could be slightly more complicated than that. Several things to consider:
- The thresholds can differ between global and app (we already give apps that flexibility in YARN-4389), so we have to choose one bar (the upper, the lower, or always the app's).
- When together they exceed the chosen bar, should we flip both lists or only one of them? The flip mechanism itself is worth discussing further; another mechanism, like LRU, could be better.
- If one list gets flipped, how do we merge it with the unflipped one? The removal items could overlap with the other list's additions even though they belong to different affected scopes, etc.
I would suggest a further discussion in a separate JIRA.
[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129911#comment-15129911 ] Jian He commented on YARN-4635: --- I have some questions on the existing code. Why should the container exit statuses below blacklist the node?
- KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM: I feel these are specific to the container only.
- ABORTED: it's used when the AM releases the container, when the app finishes, when the container expires, etc.
- INVALID: this is the default exit code.
As for DISKS_FAILED, which this jira treats as a reason for global blacklisting: in that case the node will report itself as unhealthy, and the RM should remove the node already. In YARN-4389, AMBlackListingRequest contains a boolean flag and a threshold number. Do you think it's ok to use just the threshold number, where 0 means disabled and any number larger than 0 means enabled? cc [~sunilg]
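The suggestion in the last paragraph can be sketched as follows (a hypothetical illustration, not the actual AMBlackListingRequest API): the boolean flag is folded into the threshold, so a single number carries both pieces of information.

```java
// Hypothetical sketch: a threshold-only AM blacklisting config, where
// 0 (or less) means disabled and any positive value is the threshold.
public class AmBlacklistConfig {
    private final double threshold;

    public AmBlacklistConfig(double threshold) {
        this.threshold = threshold;
    }

    // No separate boolean flag needed: enabled iff threshold > 0.
    public boolean isEnabled() {
        return threshold > 0;
    }

    public double getThreshold() {
        if (!isEnabled()) {
            throw new IllegalStateException("AM blacklisting is disabled");
        }
        return threshold;
    }
}
```

The design benefit is that the two fields can never disagree (e.g. "enabled" with a zero threshold), which removes one class of misconfiguration.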
[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
[ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127562#comment-15127562 ] Hadoop QA commented on YARN-4635: --- (x) *-1 overall*
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| 0 | mvndep | 0m 31s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 42s | trunk passed |
| +1 | compile | 1m 44s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 2m 8s | trunk passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 43s | trunk passed |
| +1 | mvnsite | 1m 33s | trunk passed |
| +1 | mvneclipse | 0m 39s | trunk passed |
| +1 | findbugs | 3m 43s | trunk passed |
| +1 | javadoc | 1m 26s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 34s | trunk passed with JDK v1.7.0_91 |
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 18s | the patch passed |
| +1 | compile | 1m 41s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 41s | the patch passed |
| +1 | compile | 2m 4s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 2m 4s | the patch passed |
| -1 | checkstyle | 0m 41s | hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 804 unchanged - 4 fixed = 808 total (was 808) |
| +1 | mvnsite | 1m 28s | the patch passed |
| +1 | mvneclipse | 0m 33s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| +1 | findbugs | 4m 15s | the patch passed |
| +1 | javadoc | 1m 17s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 32s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 19s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 53s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 62m 6s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 0m 23s | hadoop-yarn-api in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 7s | hadoop-yarn-common in the patch passed with JDK v1.7.0_91. |
| -1 |