Tsuyoshi OZAWA commented on YARN-2052:

[~jianhe], I think it's OK once the fencing operation has completed, but one 
problem is that {{recover()}} is invoked before the fencing. My idea for 
dealing with the problem is as follows:

1. The active RM stores the current epoch value.
2. After failover, the new active RM recovers the epoch and treats the epoch 
value as {{epoch + 1}}.
3. The new active RM issues {{fence()}} on ZKRMStateStore and increments the epoch.
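
To make the ordering concrete, here is a rough sketch of the three steps in plain Java. It is only an illustration, not the actual patch: {{FakeStateStore}}, {{ResourceManager}}, {{becomeActive()}}, {{loadEpoch()}}, {{storeEpoch()}} and the container id string format are all made-up names, the store is an in-memory stand-in for ZKRMStateStore, and I am assuming that "increment epoch" in step 3 means persisting the incremented value once fencing has succeeded.

```java
// Illustrative sketch only; none of these classes exist in YARN.
class EpochSketch {

    /** Hypothetical in-memory stand-in for ZKRMStateStore. */
    static class FakeStateStore {
        private long storedEpoch = 0;
        private int fenceOwner = -1; // id of the RM currently allowed to write

        long loadEpoch() {
            return storedEpoch;
        }

        /** After this call only the given RM may write to the store. */
        void fence(int rmId) {
            fenceOwner = rmId;
        }

        void storeEpoch(int rmId, long epoch) {
            if (rmId != fenceOwner) {
                throw new IllegalStateException("RM " + rmId + " is fenced out");
            }
            storedEpoch = epoch;
        }
    }

    /** Hypothetical RM that only models the epoch handling. */
    static class ResourceManager {
        final int id;
        final FakeStateStore store;
        long epoch;
        private long sequence = 0; // simplified: one counter, not one per app

        ResourceManager(int id, FakeStateStore store) {
            this.id = id;
            this.store = store;
        }

        void becomeActive() {
            // Step 2: recovery runs before fencing, so it only reads the
            // stored epoch and treats it as epoch + 1 in memory.
            epoch = store.loadEpoch() + 1;
            // Step 3: fence the store, then persist the incremented epoch.
            store.fence(id);
            store.storeEpoch(id, epoch);
        }

        /** Made-up id format: the epoch keeps ids unique across restarts. */
        String newContainerId(String appId) {
            sequence++;
            return appId + "_e" + epoch + "_" + sequence;
        }
    }

    public static void main(String[] args) {
        FakeStateStore store = new FakeStateStore();
        ResourceManager rm1 = new ResourceManager(1, store);
        rm1.becomeActive();
        System.out.println(rm1.newContainerId("app_1"));

        // Failover: rm2 takes over and restarts its sequence from zero.
        ResourceManager rm2 = new ResourceManager(2, store);
        rm2.becomeActive();
        System.out.println(rm2.newContainerId("app_1"));
    }
}
```

In this sketch the second RM restarts its sequence number from zero, but its ids still differ from the first RM's because the epoch differs. A stale RM that tries to write the epoch after being fenced out gets an exception instead of overwriting the newer value.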

> ContainerId creation after work preserving restart is broken
> ------------------------------------------------------------
>                 Key: YARN-2052
>                 URL: https://issues.apache.org/jira/browse/YARN-2052
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Tsuyoshi OZAWA
> Container ids are made unique by using the app identifier and appending a 
> monotonically increasing sequence number to it. Since container creation is a 
> high-churn activity, the RM does not store the sequence number per app. So 
> after a restart it does not know what the next sequence number should be for 
> new allocations.

This message was sent by Atlassian JIRA