Jun Gong commented on YARN-2047:

I think we can list the cases that cause the problem described in this issue:

1. When the RM restarts, an NM stops and cannot restart (e.g. the server is 
down). To deal with this case, the RM might need to save information about NMs 
and their containers, which might not be acceptable, as discussed in YARN-3161. 

2. An NM stops; after some time, RM1 regards it as dead and completes the 
containers on it; RM1 then stops and RM2 becomes the active RM. Then the NM 
restarts. Those containers become live again when the NM registers them with RM2.
This case is more common than the one above, and we need to solve it. How about 
solving the problem on the NM side? My proposal: add a timestamp to 
NMStateStore and update it regularly. When the NM restarts, it compares the 
current time with the last updated timestamp to determine whether the RM has 
already regarded it as dead, and kills its containers if so. 
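
A rough sketch of the NM-side check, only to illustrate the idea and not the 
actual patch. LivenessStore and RestartCheck are hypothetical names, not 
existing Hadoop classes; the in-memory field stands in for a record persisted 
in the NM state store. I assume the expiry value passed in matches the RM's NM 
expiry interval (yarn.nm.liveness-monitor.expiry-interval-ms), and that a 
missing record means a first start with nothing to recover.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a timestamp record kept in the NM state store. */
class LivenessStore {
    // -1 means no timestamp has ever been written.
    private final AtomicLong lastUpdatedMillis = new AtomicLong(-1L);

    /** Called periodically while the NM is running. */
    void storeTimestamp(long nowMillis) {
        lastUpdatedMillis.set(nowMillis);
    }

    /** Called once during NM recovery after a restart. */
    long loadTimestamp() {
        return lastUpdatedMillis.get();
    }
}

/** Hypothetical restart-time check; not part of any existing NM code. */
class RestartCheck {
    /**
     * Returns true if the gap since the last stored timestamp exceeds the
     * expiry interval, i.e. the RM has probably already marked this NM as
     * lost and completed its containers.
     */
    static boolean presumedDeadByRM(LivenessStore store,
                                    long nowMillis,
                                    long expiryMillis) {
        long last = store.loadTimestamp();
        if (last < 0) {
            // Assumption: no record means a first start, nothing to recover.
            return false;
        }
        return nowMillis - last > expiryMillis;
    }
}
```

On recovery the NM would call presumedDeadByRM(store, System.currentTimeMillis(), 
expiryMillis) and kill the recovered containers when it returns true. The gap is 
measured from the last update rather than from the actual stop, so a shorter 
update period makes the check more precise.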

If the proposal in case 2 is OK, I could attach a patch.

> RM should honor NM heartbeat expiry after RM restart
> ----------------------------------------------------
>                 Key: YARN-2047
>                 URL: https://issues.apache.org/jira/browse/YARN-2047
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
> After the RM restarts, it forgets about existing NM's (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AM's that a lost NM's containers will be marked finished 
> within the expiry time.

