[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404357#comment-16404357 ]

Yuqi Wang commented on YARN-2047:

[~bikassaha] and [~hex108], could you please check YARN-8012? It can help to resolve this issue.

> RM should honor NM heartbeat expiry after RM restart
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Priority: Major
>
> After the RM restarts, it forgets about existing NM's (and their potentially
> decommissioned status too). After restart, the RM cannot maintain the
> contract to the AM's that a lost NM's containers will be marked finished
> within the expiry time.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998572#comment-14998572 ]

Jun Gong commented on YARN-2047:

I am not sure why the AM cannot be trusted. Still, its information about running containers could be treated as a reference only, and it would be used solely for the specific purpose described above.
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996513#comment-14996513 ]

Jun Gong commented on YARN-2047:

Sorry for the late reply. The issue aims to make sure that a lost NM's containers are marked expired by the RM even across RM restart. What I proposed aims to solve the resulting problem in a different way. Any thoughts?

{quote}
If this is a required action then it would also imply that saving such nodes would be a critical state change operation. So, e.g. decommission command from the admin should not complete until the store has been updated. Is that the case?
{quote}

Yes, it is. However, the store operation is usually very fast, so it might be acceptable.
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996539#comment-14996539 ]

Jun Gong commented on YARN-2047:

Another thought: the RM rebuilds containers' information from the AMs. When an AM re-registers with the RM, it reports its running containers. The RM records them in a HashSet *amRunningContainers*, looks each one up by calling *getRMContainer(containerId)*, and deletes it from *amRunningContainers* if the RMContainer exists. When an NM re-registers with the RM, the RM deletes all the containers that the NM reports from *amRunningContainers*. After some time (the NM expiry time), the RM iterates over *amRunningContainers* and tells the corresponding AMs that those containers have finished.

The result seems to be the same as what this issue aims for. However, it requires adding to or modifying the AM's register RPC.
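
The bookkeeping proposed above could look roughly like the following. This is only a minimal sketch, not code from YARN or from any patch on this issue: the class and method names are hypothetical, container IDs are plain strings, and the `rmKnowsContainer` predicate stands in for "getRMContainer(containerId) returns a non-null RMContainer". It also folds the "record, then delete if the RMContainer exists" steps into a single "add only if unknown" step.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of the amRunningContainers bookkeeping; not YARN code. */
public final class AmReportedContainers {
    // Containers an AM claims are running but the RM has not yet seen from any NM.
    private final Set<String> amRunningContainers = new HashSet<>();

    /** AM re-registers: keep only the reported containers the RM does not already know. */
    public void onAmRegister(Collection<String> reported, Predicate<String> rmKnowsContainer) {
        for (String id : reported) {
            if (!rmKnowsContainer.test(id)) {
                amRunningContainers.add(id);
            }
        }
    }

    /** NM re-registers: every container it reports is accounted for. */
    public void onNmRegister(Collection<String> reported) {
        amRunningContainers.removeAll(reported);
    }

    /** Called once the NM expiry time has passed: the remainder is reported as finished. */
    public Set<String> drainExpired() {
        Set<String> expired = new HashSet<>(amRunningContainers);
        amRunningContainers.clear();
        return expired;
    }
}
```

Whatever `drainExpired()` returns corresponds to containers whose NM never came back, which is the set the RM would report to the AMs as finished.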
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997181#comment-14997181 ]

Bikas Saha commented on YARN-2047:

I think the general idea is that the AM cannot be trusted about allocated resources or running containers.
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991069#comment-14991069 ]

Bikas Saha commented on YARN-2047:

From the description it seems like the original scope was making sure that a lost NM's containers are marked expired by the RM even across RM restart. For that, won't it be enough to save dead/decommissioned NM info in the state store? Upon restart, the RM repopulates the decommissioned/dead status from the state store. It can take appropriate action at that time, e.g. cancelling an AM's containers for those NMs when the AM re-registers, or asking those NMs to restart and re-register if they heartbeat again.

If this is a required action then it would also imply that saving such nodes would be a critical state change operation. So, e.g., a decommission command from the admin should not complete until the store has been updated. Is that the case?
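
As a rough illustration of the state-store idea in this comment, the sketch below persists a node's dead/decommissioned status before the change takes effect and reloads it when a new RM instance starts. Everything here is an assumption made for illustration: the class, the enums, and the use of a plain `Map` as a stand-in for the durable RM state store are hypothetical and do not correspond to existing YARN types.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in YARN. */
public final class LostNodeTracker {
    public enum NodeStatus { LOST, DECOMMISSIONED }
    public enum Action { ACCEPT, RESTART_AND_REREGISTER }

    // Stand-in for the durable RM state store (assumed to survive RM restarts).
    private final Map<String, NodeStatus> store;
    // In-memory view rebuilt from the store when this RM instance starts.
    private final Map<String, NodeStatus> inMemory = new HashMap<>();

    public LostNodeTracker(Map<String, NodeStatus> store) {
        this.store = store;
        inMemory.putAll(store); // repopulate dead/decommissioned status after restart
    }

    /** Critical state change: write to the store first, then update memory. */
    public void markNode(String nodeId, NodeStatus status) {
        store.put(nodeId, status);
        inMemory.put(nodeId, status);
    }

    /** A node recorded as lost or decommissioned is asked to restart and re-register. */
    public Action onNodeHeartbeat(String nodeId) {
        return inMemory.containsKey(nodeId) ? Action.RESTART_AND_REREGISTER : Action.ACCEPT;
    }
}
```

Because `markNode` returns only after the store write, a caller such as an admin decommission command would not complete until the status is durable, which is the "critical state change" property the comment asks about.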
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989812#comment-14989812 ]

Jun Gong commented on YARN-2047:

For case 1, the RM could save dead NMs in the StateStore. When such an NM registers with containers, the RM could tell the NM to kill those containers.
[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart
[ https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989769#comment-14989769 ]

Jun Gong commented on YARN-2047:

I think we could list the cases that cause the problem in this issue:

1. When the RM restarts, an NM stops and cannot restart (e.g. the server is down forever). To deal with this case, the RM might need to save information about NMs and their containers, which might not be acceptable, as discussed in YARN-3161.
2. An NM stops; after some time, RM1 regards it as dead and completes the containers on it; RM1 stops and RM2 becomes the active RM. Then the NM restarts. Those containers become live again when the NM registers them with RM2.

Case 2 is more common than case 1, and we need to solve it. How about solving the problem on the NM side? My proposal: add a timestamp to the NMStateStore and update it regularly. When the NM restarts, it compares the current time with the last updated timestamp, so it can tell whether the RM has already regarded it as dead, and it kills its containers if so.

If the proposal for case 2 is OK, I could attach a patch.
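
The NM-side check proposed for case 2 comes down to one comparison at NM startup. The following is a minimal sketch under stated assumptions, not the patch offered above: the class and method names are hypothetical, the last persisted timestamp is assumed to have been read from the NM state store by the caller, and the RM's NM expiry interval is passed in rather than read from configuration.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of the NM-side liveness check; not NodeManager code. */
public final class NmLivenessCheck {
    private NmLivenessCheck() {}

    /**
     * Returns true when the gap between the last persisted timestamp and now
     * exceeds the RM's NM expiry interval, i.e. the RM has presumably already
     * declared this NM dead and completed its containers.
     */
    public static boolean presumedDeadByRm(Instant lastPersisted, Instant now, Duration nmExpiry) {
        return Duration.between(lastPersisted, now).compareTo(nmExpiry) > 0;
    }
}
```

On restart, an NM would kill its recovered containers when this returns true. Note that the persisted timestamp can lag the last successful heartbeat by up to one update period, so the check may miss a downtime that only just exceeded the expiry interval; it also assumes the NM's clock did not jump while it was down.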