[ 
https://issues.apache.org/jira/browse/YARN-6704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089203#comment-16089203
 ] 

Botong Huang commented on YARN-6704:
------------------------------------

Thanks [~subru] for the review! Please see below. 

1. The reason I separated _createAndRegisterNewUAM_ into two methods is that 
{{FederationInterceptor}} needs the UAM token (in 
{{UnmanagedAMIdentifier}}) in addition to the register response (line 914 in 
FI patch v2), so that FI can persist it in NMSS for restart recovery. 

With FI restart, the semantics of _launchUAM_ have changed, as I noted in the 
javadoc: launch a new UAM or re-attach to an existing UAM. When a non-null 
_UnmanagedAMIdentifier_ is supplied, the UAM with the given token and 
attemptId should already be running. In this case, the call mostly just 
recovers the variables and the RM proxy, without actually talking to the RM. 
The provided identifier is used within 
{{UnmanagedApplicationMaster}}._launchUAM_ (the else part and below).
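
As an illustration of the launch-or-re-attach semantics described above, here is a minimal, self-contained sketch. It is not the code in the patch: {{UamIdentifier}}, {{submitNewUam}} and the {{rmSubmissions}} counter are hypothetical stand-ins, and the RM call is simulated.

```java
import java.util.Objects;

public class UamLaunchSketch {

    /** Hypothetical stand-in for UnmanagedAMIdentifier: attempt id plus UAM token. */
    static final class UamIdentifier {
        final String attemptId;
        final String token;

        UamIdentifier(String attemptId, String token) {
            this.attemptId = Objects.requireNonNull(attemptId);
            this.token = Objects.requireNonNull(token);
        }
    }

    /** Counts simulated RM submissions so the two paths can be told apart. */
    static int rmSubmissions = 0;

    /** Hypothetical: pretends to submit a new UAM to the RM and returns its identifier. */
    static UamIdentifier submitNewUam(String appId) {
        rmSubmissions++;
        return new UamIdentifier(appId + "_attempt_1", "token-for-" + appId);
    }

    /**
     * Launch a new UAM, or re-attach to an existing one.
     * A null identifier means "launch new"; a non-null identifier means the UAM
     * is assumed to be running already, so only local state is restored.
     */
    static UamIdentifier launchUam(String appId, UamIdentifier existing) {
        if (existing == null) {
            // Fresh launch: this is the only path that talks to the (simulated) RM.
            return submitNewUam(appId);
        }
        // Re-attach: reuse the persisted attempt id and token without contacting the RM.
        return existing;
    }

    public static void main(String[] args) {
        UamIdentifier fresh = launchUam("app_1", null);
        // After a restart, the identifier would be read back from the state store.
        UamIdentifier persisted = new UamIdentifier(fresh.attemptId, fresh.token);
        UamIdentifier recovered = launchUam("app_1", persisted);
        System.out.println("RM submissions: " + rmSubmissions);
        System.out.println("Recovered attempt: " + recovered.attemptId);
    }
}
```

The point of returning the identifier from the fresh-launch path is that the caller can persist it, which is what makes the re-attach path possible after a restart.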

2. If the location (sc id) of running containers is recoverable from the RM, 
then yes, we don't need to store them here. However, the RM currently accepts a 
register call only after the RM itself fails over; otherwise it throws an 
application already registered exception. I assume we cannot modify this 
behavior? Also, the register response seems to contain only the running 
containers of the _previous_ attempt, not the current attempt, correct? 

Note that these containers are stored in NMSS (node local). The # of apps per 
node is limited, meaning apps * containers won't be too big. 

3. About the two iterations over the recoveredData structure: it is a bit odd. 
I did it this way because UAM recovery needs _amRegistrationResponse_ and 
_amRegistrationRequest_, so I need to make sure both are recovered before 
recovering the UAMs. However, I realize I can simply look them up in the 
HashMap, without making a full pass over the map. Will update in the next patch. 
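
The direct-lookup idea in point 3 can be sketched as follows. This is only an illustration under stated assumptions, not the patch itself: the key names ({{registrationRequest}}, {{registrationResponse}}, the {{uam/}} prefix) and the {{byte[]}} values are hypothetical, and skipping UAM recovery when either registration entry is missing is a choice made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoveryLookupSketch {

    // Hypothetical key names; the real state-store keys may differ.
    static final String REQUEST_KEY = "registrationRequest";
    static final String RESPONSE_KEY = "registrationResponse";
    static final String UAM_PREFIX = "uam/";

    /**
     * Returns the ids of the UAM entries that would be recovered, sorted.
     * The registration request and response are fetched by key first, so a
     * single pass over the map is enough for the UAM entries.
     */
    static List<String> recoverUamIds(Map<String, byte[]> recoveredData) {
        List<String> recovered = new ArrayList<>();

        // Direct lookups replace the first full iteration over the map.
        byte[] request = recoveredData.get(REQUEST_KEY);
        byte[] response = recoveredData.get(RESPONSE_KEY);
        if (request == null || response == null) {
            // Assumption for this sketch: without both entries, no UAM is recovered.
            return recovered;
        }

        // Single pass: only the UAM entries are handled here.
        for (Map.Entry<String, byte[]> entry : recoveredData.entrySet()) {
            if (entry.getKey().startsWith(UAM_PREFIX)) {
                recovered.add(entry.getKey().substring(UAM_PREFIX.length()));
            }
        }
        // HashMap iteration order is unspecified, so sort for a stable result.
        Collections.sort(recovered);
        return recovered;
    }

    public static void main(String[] args) {
        Map<String, byte[]> data = new HashMap<>();
        data.put(REQUEST_KEY, new byte[] {1});
        data.put(RESPONSE_KEY, new byte[] {2});
        data.put(UAM_PREFIX + "SC-1", new byte[] {3});
        data.put(UAM_PREFIX + "SC-2", new byte[] {4});
        System.out.println(recoverUamIds(data));
    }
}
```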

> Add Federation Interceptor restart when work preserving NM is enabled
> ---------------------------------------------------------------------
>
>                 Key: YARN-6704
>                 URL: https://issues.apache.org/jira/browse/YARN-6704
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>         Attachments: YARN-6704-YARN-2915.v1.patch, 
> YARN-6704-YARN-2915.v2.patch
>
>
> YARN-1336 added the ability to restart NM without losing any running 
> containers. {{AMRMProxy}} restart is added in YARN-6127. In a Federated YARN 
> environment, there's additional state in the {{FederationInterceptor}} to 
> allow for spanning across multiple sub-clusters, so we need to enhance 
> {{FederationInterceptor}} to support work-preserving restart.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

