[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2018-03-18 Thread Yuqi Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404357#comment-16404357
 ] 

Yuqi Wang commented on YARN-2047:
-

[~bikassaha] and [~hex108], could you please take a look at YARN-8012? It may 
help resolve this issue.

> RM should honor NM heartbeat expiry after RM restart
> 
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Priority: Major
>
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-10 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998572#comment-14998572
 ] 

Jun Gong commented on YARN-2047:


I am not sure why the AM cannot be trusted, but its information about running 
containers could be treated as a hint only, and it would be used only for the 
specific purpose described in my earlier comment.



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996513#comment-14996513
 ] 

Jun Gong commented on YARN-2047:


Sorry for the late reply. 

This issue aims to make sure that a lost NM's containers are marked expired by 
the RM even across an RM restart. My proposal tries to solve the resulting 
problem in a different way. Any thoughts?

{quote}
If this is a required action then it would also imply that saving such nodes 
would be a critical state change operation. So, e.g. decommission command from 
the admin should not complete until the store has been updated. Is that the 
case?
{quote}
Yes, it is. However, the store operation is usually very fast, so that might 
be acceptable.



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996539#comment-14996539
 ] 

Jun Gong commented on YARN-2047:


Another thought: the RM rebuilds container information from the AMs.

When an AM re-registers with the RM, it reports its running containers to the 
RM. The RM records them in a HashSet *amRunningContainers*, looks each one up 
by calling *getRMContainer(containerId)*, and removes it from 
*amRunningContainers* if the RMContainer exists. When an NM re-registers with 
the RM, the RM removes all containers that the NM reports from 
*amRunningContainers*. After the NM expiry time has passed, the RM iterates 
over *amRunningContainers* and tells the corresponding AMs that those 
containers have finished.

The result seems to be the same as what this issue aims for. However, it 
requires adding to or modifying the AM's register RPC.
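
To make the bookkeeping above concrete, here is a minimal sketch in plain Java. It is an illustration only, not a patch: the class and method names are hypothetical, container IDs are plain strings, and the *knownRMContainers* set stands in for the *getRMContainer(containerId)* lookup.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names exist in the YARN code base.
class AmReportedContainerTracker {
    // Containers reported by AMs that no NM has confirmed yet.
    private final Set<String> amRunningContainers = new HashSet<>();
    // Stand-in for "getRMContainer(containerId) != null".
    private final Set<String> knownRMContainers = new HashSet<>();

    // An NM re-registers: its containers are confirmed alive.
    void onNmRegister(Collection<String> reportedByNm) {
        knownRMContainers.addAll(reportedByNm);
        amRunningContainers.removeAll(reportedByNm);
    }

    // An AM re-registers: remember only containers the RM cannot find.
    void onAmRegister(Collection<String> reportedByAm) {
        for (String containerId : reportedByAm) {
            if (!knownRMContainers.contains(containerId)) {
                amRunningContainers.add(containerId);
            }
        }
    }

    // Called once the NM expiry time has passed after the RM restart.
    // Returns the containers to report to their AMs as finished.
    List<String> expireUnconfirmed() {
        List<String> finished = new ArrayList<>(amRunningContainers);
        Collections.sort(finished); // deterministic order for callers
        amRunningContainers.clear();
        return finished;
    }
}
```

The sketch handles either registration order: a container is dropped from the pending set whether its NM registers before or after the AM reports it.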



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-09 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997181#comment-14997181
 ] 

Bikas Saha commented on YARN-2047:
--

I think the general idea is that the AM cannot be trusted about allocated 
resources or running containers.



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14991069#comment-14991069
 ] 

Bikas Saha commented on YARN-2047:
--

From the description it seems like the original scope was making sure that a 
lost NM's containers are marked expired by the RM even across RM restart. For 
that, won't it be enough to save dead/decommissioned NM info in the state 
store? Upon restart, the RM repopulates the decommissioned/dead status from the 
state store. It can take appropriate action at that time, e.g. cancelling an 
AM's containers for those NMs when the AM re-registers, or asking those NMs to 
restart and re-register if they heartbeat again.


If this is a required action then it would also imply that saving such nodes 
would be a critical state change operation. So, e.g. decommission command from 
the admin should not complete until the store has been updated. Is that the 
case?
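
A rough sketch of what that could look like, assuming a hypothetical store interface (this is not the RMStateStore API). The status is written to the store before the in-memory state changes, so the decommission call only returns once the record is durable, and a restarted RM reloads the same records.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the store interface and all names are assumptions.
class NodeStatusRecovery {
    enum NodeStatus { LOST, DECOMMISSIONED }

    // Stand-in for a durable state store.
    interface NodeStateStore {
        void store(String nodeId, NodeStatus status);
        Map<String, NodeStatus> loadAll();
    }

    // In-memory implementation, for illustration only.
    static class InMemoryStore implements NodeStateStore {
        private final Map<String, NodeStatus> data = new HashMap<>();
        public void store(String nodeId, NodeStatus status) { data.put(nodeId, status); }
        public Map<String, NodeStatus> loadAll() { return new HashMap<>(data); }
    }

    private final NodeStateStore store;
    private final Map<String, NodeStatus> inactiveNodes = new HashMap<>();

    // On (re)start, repopulate dead/decommissioned status from the store.
    NodeStatusRecovery(NodeStateStore store) {
        this.store = store;
        inactiveNodes.putAll(store.loadAll());
    }

    // Critical state change: persist first, then update memory and return.
    void decommission(String nodeId) {
        store.store(nodeId, NodeStatus.DECOMMISSIONED);
        inactiveNodes.put(nodeId, NodeStatus.DECOMMISSIONED);
    }

    void markLost(String nodeId) {
        store.store(nodeId, NodeStatus.LOST);
        inactiveNodes.put(nodeId, NodeStatus.LOST);
    }

    // True if containers on this node should be cancelled when an AM
    // re-registers, or the node asked to restart if it heartbeats again.
    boolean isInactive(String nodeId) {
        return inactiveNodes.containsKey(nodeId);
    }
}
```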



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989812#comment-14989812
 ] 

Jun Gong commented on YARN-2047:


For case 1, the RM could save dead NMs in the StateStore. When such an NM 
registers with containers, the RM could tell the NM to kill those containers.



[jira] [Commented] (YARN-2047) RM should honor NM heartbeat expiry after RM restart

2015-11-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989769#comment-14989769
 ] 

Jun Gong commented on YARN-2047:


I think we could list the cases that cause the problem in this issue:

1. While the RM restarts, an NM stops and cannot restart (e.g. the server is 
down forever).
To deal with this case, the RM might need to save information about NMs and 
their containers, which might not be acceptable, as discussed in YARN-3161. 

2. An NM stops; after some time, RM1 regards it as dead and completes the 
containers on it; RM1 stops and RM2 becomes the active RM. Then the NM 
restarts. Those containers become live again when the NM registers them with 
RM2.
This case is more common than the one above, and we need to solve it. How about 
solving the problem on the NM side? My proposal: add a timestamp to the 
NMStateStore and update it regularly. When the NM restarts, it compares the 
current time with the last updated timestamp to determine whether the RM has 
already regarded it as dead, and kills its containers if so.

If the proposal for case 2 is OK, I could attach a patch.
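
For illustration, the NM-side check in case 2 could be sketched as below. This is not the patch itself: the class name is hypothetical, the timestamp is kept in a field rather than in the NMStateStore, and the sketch assumes the NM knows the RM's NM expiry interval and that its local clock is reasonably accurate.

```java
// Hypothetical sketch; names are assumptions, not NMStateStore APIs.
class NmLivenessStamp {
    // Stand-in for the timestamp persisted in the NMStateStore.
    private long lastUpdatedMs;
    // NM expiry interval used by the RM, assumed to be known to the NM.
    private final long expiryIntervalMs;

    NmLivenessStamp(long expiryIntervalMs, long nowMs) {
        this.expiryIntervalMs = expiryIntervalMs;
        this.lastUpdatedMs = nowMs;
    }

    // Called regularly while the NM is running.
    void touch(long nowMs) {
        lastUpdatedMs = nowMs;
    }

    // Called on NM restart: if more than the expiry interval has passed
    // since the last update, the RM has presumably marked this NM dead,
    // so the NM should kill its recovered containers.
    boolean presumedDeadOnRestart(long nowMs) {
        return nowMs - lastUpdatedMs > expiryIntervalMs;
    }
}
```

The update period bounds the error: the NM may have been alive for up to one update period after the stored timestamp, so the check can only err toward killing containers the RM still considered live, never toward reviving expired ones.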
