[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029539#comment-14029539 ]
Tsuyoshi OZAWA commented on YARN-2052:
--------------------------------------
[~jianhe], I think it's OK after the fencing operation, but one problem is
that {{recover()}} is invoked before the fencing. My idea for dealing with
the problem is as follows:
1. The active RM stores the current epoch value.
2. After failover, the new active RM recovers the epoch and treats its value
as {{epoch + 1}}.
3. The new active RM issues {{fence()}} on ZKRMStateStore and increments the
epoch.
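
To make the ordering concrete, here is a rough sketch of the three steps. It is only an illustration, not the actual patch: {{StateStore}}, {{ResourceManager}}, {{loadEpoch()}}, {{storeEpoch()}} and {{fenceAndStoreEpoch()}} are hypothetical names, and the in-memory store merely stands in for ZKRMStateStore. The assumption is that {{recover()}} only reads the epoch, because it runs before fencing, and that the write happens after {{fence()}}.

```java
// Hypothetical sketch; none of these classes are real YARN APIs.
class EpochSketch {

    /** In-memory stand-in for a shared state store such as ZKRMStateStore. */
    static class StateStore {
        private long epoch = 0;      // last persisted epoch
        private String owner = null; // RM currently allowed to write

        synchronized long loadEpoch() { return epoch; }

        /** After this call only rmId may write; the previous owner is fenced out. */
        synchronized void fence(String rmId) { owner = rmId; }

        synchronized void storeEpoch(String rmId, long value) {
            if (!rmId.equals(owner)) {
                throw new IllegalStateException(rmId + " is fenced out");
            }
            epoch = value;
        }
    }

    static class ResourceManager {
        final String id;
        final StateStore store;
        long epoch;

        ResourceManager(String id, StateStore store) {
            this.id = id;
            this.store = store;
        }

        /** Step 2: runs before fencing, so it only reads and bumps the value in memory. */
        void recover() { epoch = store.loadEpoch() + 1; }

        /** Step 3: fence the store, then persist the incremented epoch. */
        void fenceAndStoreEpoch() {
            store.fence(id);
            store.storeEpoch(id, epoch);
        }
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();

        // Step 1: the active RM has stored its current epoch.
        ResourceManager rm1 = new ResourceManager("rm1", store);
        rm1.recover();
        rm1.fenceAndStoreEpoch();

        // Failover: the new active RM recovers, then fences and persists.
        ResourceManager rm2 = new ResourceManager("rm2", store);
        rm2.recover();
        rm2.fenceAndStoreEpoch();

        System.out.println("rm1 epoch=" + rm1.epoch + ", rm2 epoch=" + rm2.epoch);
    }
}
```

In this sketch a late {{storeEpoch()}} call from the old RM fails once the new RM has fenced the store, so the persisted epoch can only move forward.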
> ContainerId creation after work preserving restart is broken
> ------------------------------------------------------------
>
> Key: YARN-2052
> URL: https://issues.apache.org/jira/browse/YARN-2052
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Tsuyoshi OZAWA
>
> Container ids are made unique by using the app identifier and appending a
> monotonically increasing sequence number to it. Since container creation is a
> high-churn activity, the RM does not store the sequence number per app. So
> after a restart it does not know what the new sequence number should be for
> new allocations.