Jun Gong commented on YARN-2047:

Another thought: the RM could rebuild container information from the AMs.

When an AM re-registers with the RM, it reports its running containers to the 
RM. The RM records them in a HashSet *amRunningContainers*, looks each one up 
by calling *getRMContainer(containerId)*, and removes it from 
*amRunningContainers* if the RMContainer exists. When an NM re-registers with 
the RM, the RM removes all containers that the NM reports from 
*amRunningContainers*. After some time (the NM expiry time), the RM iterates 
over *amRunningContainers* and tells the corresponding AMs that those 
containers have finished.
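
A minimal sketch of the bookkeeping described above, to make the three steps 
concrete. It is not actual ResourceManager code: the class name, the method 
names, the use of plain strings for container IDs, and the predicate standing 
in for *getRMContainer(containerId)* are all assumptions made for illustration. 
Only the *amRunningContainers* set comes from the proposal itself.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of the proposal; names and types are illustrative only. */
class AmReportedContainerTracker {
    // Containers that AMs report as running but that the RM has not yet confirmed.
    private final Set<String> amRunningContainers = new HashSet<>();

    /**
     * Called when an AM re-registers. rmContainerExists stands in for a
     * getRMContainer(containerId) != null check (an assumption of this sketch).
     */
    void onAmReRegister(Collection<String> reported, Predicate<String> rmContainerExists) {
        amRunningContainers.addAll(reported);
        // Drop every container the RM already knows about.
        amRunningContainers.removeIf(rmContainerExists);
    }

    /** Called when an NM re-registers and reports the containers it is running. */
    void onNmReRegister(Collection<String> reportedByNm) {
        amRunningContainers.removeAll(reportedByNm);
    }

    /**
     * Called once the NM expiry time has passed. Returns the containers that no
     * NM claimed, so the caller can tell the owning AMs they have finished.
     */
    Set<String> drainOnNmExpiry() {
        Set<String> finished = new HashSet<>(amRunningContainers);
        amRunningContainers.clear();
        return finished;
    }
}
```

For example, if an AM reports c1, c2 and c3, the RM already has an RMContainer 
for c1, and an NM later re-registers with c2, then only c3 is left to be 
reported as finished once the expiry time passes.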

The result seems to be the same as what this issue aims for. However, it 
requires adding to or modifying the AM's register RPC.

> RM should honor NM heartbeat expiry after RM restart
> ----------------------------------------------------
>                 Key: YARN-2047
>                 URL: https://issues.apache.org/jira/browse/YARN-2047
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
> After the RM restarts, it forgets about existing NMs (and their potentially 
> decommissioned status too). After restart, the RM cannot maintain the 
> contract to the AMs that a lost NM's containers will be marked finished 
> within the expiry time.

This message was sent by Atlassian JIRA
