[jira] [Commented] (YARN-5531) UnmanagedAM pool manager for federating application across clusters

Botong Huang (JIRA) Tue, 16 May 2017 11:58:21 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012936#comment-16012936
 ]


Botong Huang commented on YARN-5531:
------------------------------------

Thanks [~kasha] for the detailed comments! I have addressed most of them in the 
v11 patch; explanations for the rest follow: 

* 1 & 3.3.3. We put it here because the Federation Interceptor 
(YARN-3666 and YARN-6511) in the NM will be using UAM. Putting it in Yarn Client 
would create cyclic dependencies for the NM project. 

* 2.1-2 This is generalized from the Federation use case, where for one 
application we enforce the same applicationId in all sub-clusters (RMs in 
different sub-clusters use different epochs, so that their app IDs won't 
overlap). uamID (really the sub-cluster ID) is used to identify the UAMs. In 
the v11 patch, I made the input attemptId optional. If it is not supplied, the 
UAM will ask the RM for an appID first. In general, the attempt ID can be used 
as the uamID. 

* 2.5.1 Parallel kill is necessary for performance reasons. In federation, the 
service stop of the UAM pool is in the code path of the Federation Interceptor 
shutdown, potentially blocking the application finish event in the NM where the 
AM is running. Furthermore, when we try to kill the UAMs, the RM in some 
sub-clusters might be failing over, which can take several minutes. A 
sequential kill would wait out each of those delays one after another. 

* 2.5.5 For the reason above, I prefer not to retry here. One option is 
to propagate the exception out of this stop call, so that the user can handle 
it and retry if needed. In the Federation Interceptor's case, we can simply 
catch it, log a warning and move on. What do you think?

* 2.8.2 & 3.1 & 3.6.2 As discussed with [~subru] earlier, the UAM pool and UAM 
are more of a library for the actual UAM. The interface the UAM pool exposes to 
the user is similar to {{ApplicationMasterProtocol}} (registerAM, allocate and 
finishAM); the user is supposed to act like an AM and heartbeat to us. So for 
{{finishApplicationMaster}}, we abide by the protocol: if the UAM is still 
registered after the finishAM call, the user should retry. 

* 3.3.1 & 3.3.4 The launch UAM code was indeed a bit messy; I've cleaned it up 
in v11. I also merged the two monitor methods, which might look a bit complex; 
I can revert that if needed. 

* 3.5.1 AsyncCallback works nicely here. I think a dispatcher can work as 
well, but I'd prefer to do that in another JIRA if needed. 

* 3.7.2-3 This is a corner case for Federation. In the Federation Interceptor, 
we handle the UAMs asynchronously. A UAM is created the first time the AM tries 
to ask for resources from a certain sub-cluster. The register, allocate and 
finish calls for the UAM are all triggered by heartbeats from the AM. This means 
that all three calls are triggered asynchronously. For instance, while the 
register call for the UAM is still pending (say because the UAM's RM is failing 
over and the register call is blocked for five minutes), we need to allow the 
allocate calls to come in without exception and buffer them. Once the register 
finally succeeds, we should be able to move on from there. 
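
As an illustration of the buffering described in 3.7.2-3, a minimal sketch 
might look like the following. It is not taken from the patch: the class 
{{BufferingUamClient}}, its methods, and the generic request/response types 
are all hypothetical, and the real allocate RPC is replaced by a plain 
function. It only shows the idea that allocate calls arriving before the 
register call completes are queued instead of failing, then replayed in order. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/**
 * Hypothetical sketch, not the YARN-5531 code: buffers allocate requests
 * while an asynchronous register call is still pending, then replays them.
 * R and S stand in for the real request and response types.
 */
final class BufferingUamClient<R, S> {
  /** Stand-in for the real allocate RPC to the sub-cluster RM (assumption). */
  private final Function<R, S> allocateCall;
  private final List<Pending<R, S>> buffered = new ArrayList<>();
  private boolean registered = false;

  private static final class Pending<R, S> {
    final R request;
    final CompletableFuture<S> result = new CompletableFuture<>();
    Pending(R request) { this.request = request; }
  }

  BufferingUamClient(Function<R, S> allocateCall) {
    this.allocateCall = allocateCall;
  }

  /**
   * Accepts an allocate request at any time. Before registration the request
   * is queued and the returned future stays incomplete; no exception is
   * raised just because registration is still pending.
   */
  synchronized CompletableFuture<S> allocate(R request) {
    if (registered) {
      return CompletableFuture.completedFuture(allocateCall.apply(request));
    }
    Pending<R, S> p = new Pending<>(request);
    buffered.add(p);
    return p.result;
  }

  /**
   * Called once the (possibly very late) register call succeeds. Replays the
   * queued requests in arrival order. For brevity the lock is held during the
   * calls; a real implementation would likely avoid that.
   */
  synchronized void onRegistered() {
    registered = true;
    for (Pending<R, S> p : buffered) {
      p.result.complete(allocateCall.apply(p.request));
    }
    buffered.clear();
  }
}
```

A caller would create the client, forward each AM heartbeat to {{allocate}}, 
and invoke {{onRegistered}} from the register callback; futures handed out 
earlier complete at that point. 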

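The parallel kill from 2.5.1, combined with the "do not retry, report the 
failure" option from 2.5.5, can be sketched roughly as below. Again this is 
only an assumption-laden illustration, not the patch: {{ParallelUamKiller}} 
and {{killAll}} are hypothetical names, and each kill is modelled as a plain 
{{Callable}}. The point is that one sub-cluster whose RM is failing over 
delays the whole stop by at most the timeout, instead of its delay being 
added to every other kill. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not the YARN-5531 code. */
final class ParallelUamKiller {
  /**
   * Runs every kill task concurrently and returns the IDs whose kill threw
   * or did not finish within the timeout. Nothing is retried here; the
   * caller decides whether to retry or just log a warning.
   */
  static List<String> killAll(Map<String, Callable<Void>> killTasks, long timeoutMs)
      throws InterruptedException {
    List<String> failed = new ArrayList<>();
    if (killTasks.isEmpty()) {
      return failed; // a fixed thread pool cannot be created with zero threads
    }
    List<String> ids = new ArrayList<>(killTasks.keySet());
    List<Callable<Void>> tasks = new ArrayList<>();
    for (String id : ids) {
      tasks.add(killTasks.get(id));
    }
    ExecutorService pool = Executors.newFixedThreadPool(ids.size());
    try {
      // invokeAll returns futures in task order and cancels any task that
      // has not completed when the timeout expires.
      List<Future<Void>> results =
          pool.invokeAll(tasks, timeoutMs, TimeUnit.MILLISECONDS);
      for (int i = 0; i < ids.size(); i++) {
        try {
          results.get(i).get();
        } catch (ExecutionException | CancellationException e) {
          failed.add(ids.get(i));
        }
      }
    } finally {
      pool.shutdownNow();
    }
    return failed;
  }
}
```

A caller in the position of the Federation Interceptor could log the returned 
IDs as a warning and move on, while another user of the pool could retry only 
those IDs. 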




> UnmanagedAM pool manager for federating application across clusters
> -------------------------------------------------------------------
>
>                 Key: YARN-5531
>                 URL: https://issues.apache.org/jira/browse/YARN-5531
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Subru Krishnan
>            Assignee: Botong Huang
>         Attachments: YARN-5531-YARN-2915.v10.patch, 
> YARN-5531-YARN-2915.v11.patch, YARN-5531-YARN-2915.v1.patch, 
> YARN-5531-YARN-2915.v2.patch, YARN-5531-YARN-2915.v3.patch, 
> YARN-5531-YARN-2915.v4.patch, YARN-5531-YARN-2915.v5.patch, 
> YARN-5531-YARN-2915.v6.patch, YARN-5531-YARN-2915.v7.patch, 
> YARN-5531-YARN-2915.v8.patch, YARN-5531-YARN-2915.v9.patch
>
>
> One of the main tenets of YARN Federation is to *transparently* scale 
> applications across multiple clusters. This is achieved by running UAMs on 
> behalf of the application on other clusters. This JIRA tracks the addition of 
> an UnmanagedAM pool manager for federating applications across clusters, which 
> will be used by the FederationInterceptor (YARN-3666), which is part of the 
> AMRMProxy pipeline introduced in YARN-2884.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

