[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-6153: - Target Version/s: 2.8.1, 3.0.0-alpha3 (was: 2.8.0, 3.0.0-alpha3) > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: (was: YARN-6153.006-1.patch) > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: (was: YARN-6153-branch-2.8.patch) > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006-1.patch, YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153-branch-2.8.patch Thanks for your comment... I'm uploading a new patch for the branch-2.8. the system clock in the RMAppImpl will be used for checking validity interval. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006-1.patch, YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.006-1.patch I'm uploading the additional patch for the hadoop-trunk. (YARN-6153.006-1.patch) above 1. problem has been fixed in the same way as the branch-2.8. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006-1.patch, YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153-branch-2.8.patch I'm uploading the patch for branch-2.8. 1. in the testRMAppAttemptFailuresValidityInterval, using the systemClock has been replaced with Thread.sleep. by following, the time to check the validity interval is no longer the systemClock in RMAppImpl. {code} - private int getNumFailedAppAttempts() { + public int getNumFailedAppAttempts() { int completedAttempts = 0; -long endTime = this.systemClock.getTime(); // Do not count AM preemption, hardware failures or NM resync // as attempt failure. for (RMAppAttempt attempt : attempts.values()) { if (attempt.shouldCountTowardsMaxAttemptRetry()) { -if (this.attemptFailuresValidityInterval <= 0 -|| (attempt.getFinishTime() > endTime -- this.attemptFailuresValidityInterval)) { - completedAttempts++; -} +completedAttempts++; } } {code} 2. in the testAMRestartNotLostContainerAfterAttemptFailuresValidityInterval, the timeout value has been increased to 40 seconds. currently, YARN-4807 is not yet included in the branch-2.8. I think that’s why the timeout happens. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam >Assignee: kyungwan nam > Fix For: 2.8.0, 3.0.0-alpha3 > > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006.patch, YARN-6153-branch-2.8.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.006.patch Thanks for your review. Ok. I'm uploading a new patch 006, which it has been fixed according to your comment. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch, > YARN-6153.006.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.005.patch my source tree was not up-to-date. that's why compile is failed. I'm uploading a new patch which is based on up-to-date source. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch, YARN-6153.005.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.004.patch thanks for review. yes, It looks better. I’m uploading a new patch. I have fixed it as your comment. Consequently, It looks like that _maybeLastAttempt_ is no longer used. so, the code associated with _maybeLastAttempt_ have been removed. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch, YARN-6153.004.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.003.patch thanks for review. I'm uploading a new patch that is fixed as your comment. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch, YARN-6153.002.patch, > YARN-6153.003.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.002.patch thanks for your review. I'm attaching a new patch. - as your suggestion, the checking logic is moved to shouldCountTowardsMaxAttemptRetry. - to verify this issue a test case is added. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch, YARN-6153.002.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-6153) keepContainer does not work when AM retry window is set
[ https://issues.apache.org/jira/browse/YARN-6153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kyungwan nam updated YARN-6153: --- Attachment: YARN-6153.001.patch if maybeLastAttempt in RMAppAttemptImpl is true, keepContainers is always ignored. but, after AM reset window time, it is no longer the last attempt. I'm attaching a patch. if the last attempt is aged as longer than AM reset window time, the keepContainers will be kept. > keepContainer does not work when AM retry window is set > --- > > Key: YARN-6153 > URL: https://issues.apache.org/jira/browse/YARN-6153 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.1 >Reporter: kyungwan nam > Attachments: YARN-6153.001.patch > > > yarn.resourcemanager.am.max-attempts has been configured to 2 in my cluster. > I submitted a YARN application (slider app) that keepContainers=true, > attemptFailuresValidityInterval=30. > it did work properly when AM was failed firstly. > all containers launched by previous AM were resynced with new AM (attempt2) > without killing containers. > after 10 minutes, I thought AM failure count was reset by > attemptFailuresValidityInterval (5 minutes). > but, all containers were killed when AM was failed secondly. (new AM attempt3 > was launched properly) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org