[jira] [Commented] (YARN-3571) AM does not re-blacklist NMs after ignoring-blacklist event happens?
[ https://issues.apache.org/jira/browse/YARN-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850161#comment-15850161 ]

Manikandan R commented on YARN-3571:

I am interested in working on this. Can I work on it?

> AM does not re-blacklist NMs after ignoring-blacklist event happens?
>
> Key: YARN-3571
> URL: https://issues.apache.org/jira/browse/YARN-3571
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, resourcemanager
> Affects Versions: 2.5.1
> Reporter: Hao Zhu
>
> A detailed analysis is in item "3 Will AM re-blacklist NMs after
> ignoring-blacklist event happens?" of the link below:
> http://www.openkb.info/2015/05/when-will-application-master-blacklist.html
>
> The current behavior is: if a Node Manager has ever been blacklisted
> before, it will not be blacklisted again after ignore-blacklist happens;
> otherwise, it will be blacklisted.
> However, I think the right behavior should be: the AM can re-blacklist NMs
> even after ignoring-blacklist has happened once.
>
> The code logic is in the function containerFailedOnHost(String hostName) of
> RMContainerRequestor.java:
> {code}
> protected void containerFailedOnHost(String hostName) {
>   if (!nodeBlacklistingEnabled) {
>     return;
>   }
>   if (blacklistedNodes.contains(hostName)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Host " + hostName + " is already blacklisted.");
>     }
>     return; //already blacklisted
> {code}
> The reason for this behavior is given in item 2 of the link above: when
> ignoring-blacklist happens, the AM only asks the RM to clear
> "blacklistAdditions"; it does not clear the "blacklistedNodes" variable.
>
> This behavior may cause the whole job/application to fail if the previously
> blacklisted NM was released after the ignoring-blacklist event happened.
> Imagine a serial murderer is released from prison just because the prison is
> 33% full and, horribly, he/she will never be put in prison again. Only new
> murderers will be put in prison.
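The behavior described above can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions, not the Hadoop implementation: the class BlacklistModel, its constructor parameters, and the rmBlacklist set (a stand-in for whatever the RM currently enforces) are hypothetical names introduced for illustration. Only the early return on blacklistedNodes.contains(hostName) and the fact that blacklistedNodes is never cleared are taken from the report.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the AM-side bookkeeping described in the report.
final class BlacklistModel {
    private final Set<String> blacklistedNodes = new HashSet<>(); // AM-side record, never cleared
    private final Set<String> rmBlacklist = new HashSet<>();      // assumed stand-in for what the RM enforces
    private final Map<String, Integer> failures = new HashMap<>();
    private final int maxFailuresPerNode; // assumption: failures needed before blacklisting
    private final int disablePercent;     // assumption: plays the role of blacklistDisablePercent
    private final int knownNodes;
    private boolean ignoreBlacklisting = false;

    BlacklistModel(int maxFailuresPerNode, int disablePercent, int knownNodes) {
        this.maxFailuresPerNode = maxFailuresPerNode;
        this.disablePercent = disablePercent;
        this.knownNodes = knownNodes;
    }

    // Returns true only when this call newly blacklists the host.
    boolean containerFailedOnHost(String host) {
        if (blacklistedNodes.contains(host)) {
            return false; // the early return quoted in the report
        }
        int count = failures.merge(host, 1, Integer::sum);
        if (count < maxFailuresPerNode) {
            return false;
        }
        blacklistedNodes.add(host);
        rmBlacklist.add(host);
        computeIgnoreBlacklisting();
        return true;
    }

    // When too large a share of nodes is blacklisted, stop enforcing the blacklist.
    private void computeIgnoreBlacklisting() {
        int percent = blacklistedNodes.size() * 100 / knownNodes;
        if (percent >= disablePercent && !ignoreBlacklisting) {
            ignoreBlacklisting = true;
            rmBlacklist.clear(); // the RM forgets the hosts, but blacklistedNodes is left untouched
        }
    }

    boolean isEnforced(String host) { return rmBlacklist.contains(host); }
    boolean isRecorded(String host) { return blacklistedNodes.contains(host); }

    public static void main(String[] args) {
        // 4 known nodes, threshold 1%, 3 failures to blacklist: values chosen to mirror Test 1.
        BlacklistModel am = new BlacklistModel(3, 1, 4);
        for (int i = 0; i < 5; i++) {
            am.containerFailedOnHost("h4.poc.com");
        }
        System.out.println("recorded=" + am.isRecorded("h4.poc.com")
            + " enforced=" + am.isEnforced("h4.poc.com"));
    }
}
```

In this model the host stays in blacklistedNodes after the ignore event, so every later failure on it takes the early return and the host can never be handed back to the RM as blacklisted. The model deliberately omits the path where ignoring is switched off again.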
> Example to prove:
> Test 1:
> One node (h4) has an issue; the other 3 nodes are healthy.
> The job failed with the AM logs below:
> {code}
> [root@h1 container_1430425729977_0006_01_01]# egrep -i 'failures on node|blacklist|FATAL' syslog
> 2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
> 2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 1
> 2015-05-02 18:39:07,249 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_02_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:07,297 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node h4.poc.com
> 2015-05-02 18:39:07,950 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_08_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:07,954 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 2 failures on node h4.poc.com
> 2015-05-02 18:39:08,148 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_07_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 3 failures on node h4.poc.com
> 2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Blacklisted host h4.poc.com
> 2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=1 blacklistRemovals=0
> 2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Ignore blacklisting set to true. Known: 4, Blacklisted: 1, 25%
> 2015-05-02 18:39:09,563 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=0 blacklistRemovals=1
> 2015-05-02 18:39:32,912 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_02_1 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:35,076 FATAL [IPC Server handler 1 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_09_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:35,133 FATAL [IPC Server
[jira] [Commented] (YARN-3571) AM does not re-blacklist NMs after ignoring-blacklist event happens?
[ https://issues.apache.org/jira/browse/YARN-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823114#comment-15823114 ]

Naganarasimha G R commented on YARN-3571:

Seems to be MapReduce related, and seems to exist in the current version too. Can we move this to the MAPREDUCE project?