[
https://issues.apache.org/jira/browse/YARN-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012936#comment-16012936
]
Botong Huang commented on YARN-5531:
------------------------------------
Thanks [~kasha] for the detailed comments! I have addressed most of them in the
v11 patch; explanations for the rest are below:
* 1 & 3.3.3. We put it here because the Federation Interceptor
(YARN-3666 and YARN-6511) in the NM will be using UAM. Putting it in YARN Client
would result in cyclic dependencies for the NM project.
* 2.1-2 This is generalized from the Federation use case, where for one
application we enforce the same applicationId in all sub-clusters (RMs in
different sub-clusters use different epochs, so the app IDs they generate won't
overlap). uamID (really the sub-cluster ID) is used to identify the UAMs. In the
v11 patch, I made the input attemptId optional. If it is not supplied, the UAM
will ask the RM for an appID first. In general, the attempt ID can be used as
the uamID.
* 2.5.1 Parallel kill is necessary for performance reasons. In federation, the
service stop of the UAM pool is in the code path of the Federation Interceptor
shutdown, potentially blocking the application finish event in the NM where the
AM is running. Furthermore, when we try to kill the UAMs, the RM in some
sub-clusters might be failing over, which takes several minutes to come back. A
sequential kill would add up those delays.
* 2.5.5 For the above reason, I prefer not to retry here. One option is
to throw the exception past this stop call so that the user can handle it
and retry if needed. In the Federation Interceptor's case, we can simply catch
it, log a warning and move on. What do you think?
* 2.8.2 & 3.1 & 3.6.2 As discussed with [~subru] earlier, this UAM pool and UAM
are more of a library for the actual UAM. The interface the UAM pool exposes to
the user is similar to {{ApplicationMasterProtocol}} (registerAM, allocate and
finishAM); the user is supposed to act like an AM and heartbeat to us. So for
{{finishApplicationMaster}}, we abide by the protocol: if the UAM is still
registered after the finishAM call, the user should retry.
* 3.3.1 & 3.3.4 The launch UAM code was indeed a bit messy; I've cleaned it up
in v11. I merged the two monitor methods, which might look a bit complex; I can
revert if needed.
* 3.5.1 AsyncCallback works nicely here. I think a dispatcher could work as
well, but I'd prefer to do that in another JIRA if needed.
* 3.7.2-3 This is a corner use case for Federation. In the Federation
Interceptor, we handle the UAMs asynchronously. A UAM is created the first time
the AM asks for resources from a given sub-cluster. The register, allocate and
finish calls for the UAM are all triggered by heartbeats from the AM. This means
that all three calls are triggered asynchronously. For instance, while the
register call for the UAM is still pending (say because the UAM RM is failing
over and the register call is blocked for five minutes), we need to allow the
allocate calls to come in without exception and buffer them. Once the register
eventually succeeds, we should be able to move on from there.
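
To illustrate 2.5.1 and 2.5.5, here is a minimal sketch of a parallel stop that
does not retry and instead hands the failures back to the caller. It is not the
code in the patch: ParallelUamStop, UamKiller and stopAll are hypothetical names
used only for illustration, and the real kill call to the RM is replaced by a
caller-supplied callback.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch, not the class from the v11 patch. */
public class ParallelUamStop {

  /** Hypothetical stand-in for the call that kills one UAM in its sub-cluster. */
  public interface UamKiller {
    void kill(String uamId) throws Exception;
  }

  /**
   * Kills all UAMs in parallel and returns the failures keyed by uamId.
   * No retry is attempted; the caller decides whether to retry or just log.
   */
  public static Map<String, Exception> stopAll(List<String> uamIds,
      UamKiller killer) throws InterruptedException {
    Map<String, Exception> failures = new LinkedHashMap<>();
    if (uamIds.isEmpty()) {
      return failures; // a fixed thread pool cannot be created with 0 threads
    }
    ExecutorService pool = Executors.newFixedThreadPool(uamIds.size());
    try {
      List<Callable<Void>> tasks = new ArrayList<>();
      for (String id : uamIds) {
        tasks.add(() -> {
          killer.kill(id);
          return null;
        });
      }
      // invokeAll waits for every task, so the wait is bounded by the slowest
      // kill rather than by the sum of all kills.
      List<Future<Void>> results = pool.invokeAll(tasks);
      for (int i = 0; i < results.size(); i++) {
        try {
          results.get(i).get();
        } catch (ExecutionException e) {
          Throwable cause = e.getCause();
          failures.put(uamIds.get(i),
              cause instanceof Exception ? (Exception) cause : e);
        }
      }
    } finally {
      pool.shutdown();
    }
    return failures;
  }
}
```

A caller such as the Federation Interceptor could log each entry of the returned
map as a warning and move on, while another user of the pool could retry only
the uamIds that failed.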
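
To illustrate 3.7.2-3, here is a minimal sketch of buffering allocate calls
while the register call is still pending. It is only an illustration under
simplifying assumptions: BufferedUamClient and its methods are hypothetical
names, asks are plain strings, and "sending" an ask just appends it to a list
instead of calling the RM.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the class from the v11 patch. */
public class BufferedUamClient {
  private final List<String> pendingAsks = new ArrayList<>();
  private final List<String> sentAsks = new ArrayList<>();
  private boolean registered = false;

  /**
   * Called on an AM heartbeat. Does not throw when register is still pending;
   * the ask is buffered instead.
   */
  public synchronized void allocate(String ask) {
    if (registered) {
      sentAsks.add(ask);
    } else {
      pendingAsks.add(ask);
    }
  }

  /** Called once the (possibly long-delayed) register call succeeds. */
  public synchronized void onRegisterSuccess() {
    registered = true;
    // Flush the buffered asks in arrival order, then continue normally.
    sentAsks.addAll(pendingAsks);
    pendingAsks.clear();
  }

  public synchronized List<String> getPendingAsks() {
    return new ArrayList<>(pendingAsks);
  }

  public synchronized List<String> getSentAsks() {
    return new ArrayList<>(sentAsks);
  }
}
```

The point of the design is that the allocate path never depends on register
having finished, so a slow RM failover delays the asks but does not surface as
an error to the AM.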
> UnmanagedAM pool manager for federating application across clusters
> -------------------------------------------------------------------
>
> Key: YARN-5531
> URL: https://issues.apache.org/jira/browse/YARN-5531
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, resourcemanager
> Reporter: Subru Krishnan
> Assignee: Botong Huang
> Attachments: YARN-5531-YARN-2915.v10.patch,
> YARN-5531-YARN-2915.v11.patch, YARN-5531-YARN-2915.v1.patch,
> YARN-5531-YARN-2915.v2.patch, YARN-5531-YARN-2915.v3.patch,
> YARN-5531-YARN-2915.v4.patch, YARN-5531-YARN-2915.v5.patch,
> YARN-5531-YARN-2915.v6.patch, YARN-5531-YARN-2915.v7.patch,
> YARN-5531-YARN-2915.v8.patch, YARN-5531-YARN-2915.v9.patch
>
>
> One of the main tenets of YARN Federation is to *transparently* scale
> applications across multiple clusters. This is achieved by running UAMs on
> behalf of the application on other clusters. This JIRA tracks the addition of
> an UnmanagedAM pool manager for federating applications across clusters, which
> will be used by the FederationInterceptor (YARN-3666), which is part of the
> AMRMProxy pipeline introduced in YARN-2884.