[
https://issues.apache.org/jira/browse/YARN-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056153#comment-16056153
]
Arun Suresh commented on YARN-6127:
-----------------------------------
Thanks for the patch [~botong]
A couple of comments:
* It looks like when an interceptor needs to persist state, it has to
explicitly call
{{nmContext.getNMStateStore().storeAMRMProxyAppContextEntry()}}, and after
recovery it must explicitly call {{getRecoveredDataMap()}} to access that
state. I feel it might be better to expose an {{InterceptorState}}
API/class that is available to the interceptor via the context. This state
object can then expose {{get(key)}} and {{put(key, value)}}, which would, under
the hood, work with the state store to persist the state and to retrieve all
existing keys and values on recovery.
* We should be incrementing the major version of the version info. Also, I
think we would need to do something similar to YARN-5547 to handle the
{{AMRMPROXY_KEY_PREFIX}} to ensure that rollback does not bomb.
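To make the first suggestion concrete, here is a rough sketch of what such an {{InterceptorState}} could look like. Everything in it is hypothetical and not taken from the patch: {{KeyValueStore}} is a simplified stand-in for the slice of the NM state store the proxy would use, {{InMemoryStore}} only exists so the sketch runs on its own, and the method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the part of the NM state store that the
// AMRMProxy would use. This is NOT the real NMStateStoreService API.
interface KeyValueStore {
  void store(String key, byte[] value);

  Map<String, byte[]> loadAll();
}

// In-memory implementation, only here so the sketch is self-contained.
class InMemoryStore implements KeyValueStore {
  private final Map<String, byte[]> data = new HashMap<>();

  @Override
  public void store(String key, byte[] value) {
    data.put(key, value.clone());
  }

  @Override
  public Map<String, byte[]> loadAll() {
    return new HashMap<>(data);
  }
}

// Hypothetical state object handed to an interceptor via its context.
// The interceptor only sees get/put; persistence and recovery are hidden.
class InterceptorState {
  private final KeyValueStore store;
  private final Map<String, byte[]> cache;

  InterceptorState(KeyValueStore store) {
    this.store = store;
    // On construction (including after an NM restart), load whatever
    // was persisted earlier so get() can serve recovered values.
    this.cache = new HashMap<>(store.loadAll());
  }

  // Returns a copy of the stored value, or null if the key is unknown.
  byte[] get(String key) {
    byte[] value = cache.get(key);
    return value == null ? null : value.clone();
  }

  // Persists the value first, then updates the local view.
  void put(String key, byte[] value) {
    store.store(key, value);
    cache.put(key, value.clone());
  }
}
```

With something along these lines, an interceptor would call {{put(key, value)}} whenever its state changes and {{get(key)}} after recovery, without touching the state store or the recovered data map directly.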
> Add support for work preserving NM restart when AMRMProxy is enabled
> --------------------------------------------------------------------
>
> Key: YARN-6127
> URL: https://issues.apache.org/jira/browse/YARN-6127
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: amrmproxy, nodemanager
> Reporter: Subru Krishnan
> Assignee: Botong Huang
> Attachments: YARN-6127.v1.patch, YARN-6127.v2.patch
>
>
> YARN-1336 added the ability to restart NM without losing any running
> containers. In a Federated YARN environment, there's additional state in the
> {{AMRMProxy}} to allow for spanning across multiple sub-clusters, so we need
> to enhance {{AMRMProxy}} to support work-preserving restart.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)