[
https://issues.apache.org/jira/browse/YARN-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996539#comment-14996539
]
Jun Gong commented on YARN-2047:
--------------------------------
Another thought: the RM could rebuild container information from the AMs.
When an AM re-registers with the RM, it reports its running containers. The RM
records them in a HashSet *amRunningContainers*, looks each one up by calling
*getRMContainer(containerId)*, and removes it from *amRunningContainers* if the
RMContainer exists. When an NM re-registers with the RM, the RM removes every
container that the NM reports from *amRunningContainers*. After the NM expiry
time has passed, the RM iterates over *amRunningContainers* and tells each
corresponding AM that those containers have finished.
The result seems to be the same as what this issue aims for. However, it
requires adding to or modifying the AM's register RPC.
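
To make the bookkeeping concrete, here is a rough, standalone sketch of the
idea. It is not RM code: the class AmReportedContainerTracker, its method
names, and the use of plain strings for container and AM ids are all
hypothetical. It also keeps a map from container id to AM id instead of a bare
HashSet, on the assumption that the RM needs to know which AM to notify, and
it models "getRMContainer(containerId) returns non-null" with a simple set of
known ids.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the proposal; none of these names are real YARN classes.
final class AmReportedContainerTracker {
    // Containers the RM already knows about.
    // Stands in for "getRMContainer(containerId) != null" (an assumption).
    private final Set<String> knownRMContainers = new HashSet<>();

    // Containers reported by AMs but not yet confirmed by any NM.
    // Maps containerId -> id of the AM that reported it.
    private final Map<String, String> amRunningContainers = new HashMap<>();

    // AM re-registers and reports its running containers.
    // Only containers the RM does not already know are tracked.
    void onAmReRegister(String amId, Collection<String> runningContainerIds) {
        for (String containerId : runningContainerIds) {
            if (!knownRMContainers.contains(containerId)) {
                amRunningContainers.put(containerId, amId);
            }
        }
    }

    // NM re-registers; every container it reports is confirmed alive
    // and no longer needs tracking.
    void onNmReRegister(Collection<String> reportedContainerIds) {
        knownRMContainers.addAll(reportedContainerIds);
        amRunningContainers.keySet().removeAll(reportedContainerIds);
    }

    // Called once the NM expiry time has passed after RM restart.
    // Whatever is still tracked was never confirmed by an NM, so it is
    // grouped per AM to be reported as finished.
    Map<String, List<String>> onNmExpiryElapsed() {
        Map<String, List<String>> finishedByAm = new TreeMap<>();
        for (Map.Entry<String, String> e : amRunningContainers.entrySet()) {
            finishedByAm
                .computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                .add(e.getKey());
        }
        amRunningContainers.clear();
        return finishedByAm;
    }
}
```

For example, if an AM reports containers c1, c2 and c3, and the only NM that
re-registers reports c1 and c2, then after the expiry time the AM would be
told that c3 has finished. A real change would also have to handle an NM that
re-registers before the AM does, which this sketch covers only through the
known-container check.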
> RM should honor NM heartbeat expiry after RM restart
> ----------------------------------------------------
>
> Key: YARN-2047
> URL: https://issues.apache.org/jira/browse/YARN-2047
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
>
> After the RM restarts, it forgets about existing NMs (and their potentially
> decommissioned status too). After restart, the RM cannot maintain the
> contract to the AMs that a lost NM's containers will be marked finished
> within the expiry time.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)