[
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915112#comment-13915112
]
Xuan Gong commented on YARN-1410:
---------------------------------
bq. When can appId be null in the submission-context?
It does not happen right now. Take DistributedShell as an example: before it
submits the application, it obtains an applicationId, which it uses to set up
directories for local resources, shell_script, etc. That is why it needs the
applicationId as a globally unique ID. Users' own applications may not need
one; they can simply call yarnClient#submitApplication() to submit their
applications. That is why we added a null check for applicationId in
ClientRMService#submitApplication().
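A minimal sketch of such a null check, assuming the RM allocates a fresh ID
when the submission context carries none. The comment does not say what the
check does on null, so that behaviour is an assumption. SketchAppId,
SketchSubmissionContext and SketchClientRMService are hypothetical stand-ins,
not the real YARN classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for ApplicationId: cluster timestamp plus counter.
final class SketchAppId {
    final long clusterTimestamp;
    final int id;

    SketchAppId(long clusterTimestamp, int id) {
        this.clusterTimestamp = clusterTimestamp;
        this.id = id;
    }
}

// Hypothetical stand-in for ApplicationSubmissionContext.
final class SketchSubmissionContext {
    SketchAppId appId; // null when the client never requested an ID
}

// Hypothetical stand-in for ClientRMService.
final class SketchClientRMService {
    private final long clusterTimestamp;
    private final AtomicInteger applicationCounter = new AtomicInteger(0);

    SketchClientRMService(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    SketchAppId getNewApplicationId() {
        return new SketchAppId(clusterTimestamp, applicationCounter.incrementAndGet());
    }

    // Assumed behaviour: fill in a missing ID, otherwise keep the client's ID.
    SketchAppId submitApplication(SketchSubmissionContext context) {
        if (context.appId == null) {
            context.appId = getNewApplicationId();
        }
        return context.appId;
    }
}
```

With this shape, a client that skips the ID request still ends up with an ID,
and a client that already holds one keeps it unchanged.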
If we really think this check is unnecessary, we should at least document this
in yarnClient#submitApplication(), saying, "Before you use this API to submit
the application, make sure you have an applicationId". We should also stop
exposing APIs such as ApplicationSubmissionContext#newInstance() or
BuilderUtils#newApplicationSubmissionContext() publicly, so that users cannot
create an ApplicationSubmissionContext object themselves. The only way to get
an ApplicationSubmissionContext should be to call getNewApplication(), which
also provides the applicationId.
bq. Documentation
Sure. I will add those.
bq. Does DistributedShell need any changes to reflect the potential change in
appId after fail-over? If so, let's fix that too here. Please file a MR ticket
to fix MapReduce too if needed. Fixes are needed in either case if anyone
caches the appId from GetNewApplicationResponse.
I do not think we need to make any changes. DistributedShell and MapReduce
both have an applicationId before they submit the application. When failover
happens, the old applicationId is re-used. So the applicationId returned from
yarnClient#submitApplication() or from SubmitApplicationResponse is the same as
the applicationId we used to submit the application.
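The failover case can be sketched as follows, assuming the newly active RM
accepts an ID issued by the previous one and echoes it back. FailoverRm is a
hypothetical stand-in, not a YARN class; IDs are modelled as strings in the
application_<clusterTimestamp>_<counter> form.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one RM instance with its own cluster timestamp.
final class FailoverRm {
    private final long clusterTimestamp;
    private final AtomicInteger applicationCounter = new AtomicInteger(0);

    FailoverRm(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    // Issues an ID built from this RM's timestamp and its counter.
    String getNewApplicationId() {
        return "application_" + clusterTimestamp + "_"
                + String.format("%04d", applicationCounter.incrementAndGet());
    }

    // Assumed behaviour: re-use the ID the client already holds, even if it
    // was issued by the previously active RM.
    String submitApplication(String applicationId) {
        return applicationId;
    }
}
```

A client that obtained its ID from an RM started at timestamp 1000 and then
submits to an RM started at timestamp 2000 gets the same ID back, so nothing
cached on the client side goes stale.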
bq. Orthogonal to this ticket, we need to make sure clients don't pass in
invalid application-IDs as part of the submission-context. It can be validated
by simply looking at our counter and may be also caching recently used appIDs
(atleast within a single RM). I remember we had a JIRA for this somewhere. We
also need throttles so that malicious client don't exhaust appIDs.
ApplicationId has two pieces of information:
ResourceManager.getClusterTimeStamp() (the time the RM became active) and
applicationCounter.incrementAndGet(). Since we allow users to re-use an old
applicationId or create their own, the situation you described may happen. In
the HA case, if failover happens several times, the clusterTimeStamp for the
same RM will differ each time, because every time the RM becomes active it
gets a new clusterTimeStamp. So we could check the clusterTimeStamp and the app
counter at the same time. For a given applicationId, if
applicationId#getClusterTimestamp == ResourceManager.getClusterTimeStamp() and
applicationId#getId > applicationCounter.get(), then we can consider this
applicationId malicious.
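The proposed check could look roughly like this. AppIdCheck and its field
names are hypothetical; only the comparison itself comes from the text above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper holding the active RM's timestamp and app counter.
final class AppIdCheck {
    private final long clusterTimestamp;
    private final AtomicInteger applicationCounter;

    AppIdCheck(long clusterTimestamp, int idsIssuedSoFar) {
        this.clusterTimestamp = clusterTimestamp;
        this.applicationCounter = new AtomicInteger(idsIssuedSoFar);
    }

    // An ID that claims this RM's timestamp but a counter value the RM has
    // not handed out yet cannot have been issued by this RM.
    boolean looksMalicious(long idClusterTimestamp, int idCounter) {
        return idClusterTimestamp == clusterTimestamp
                && idCounter > applicationCounter.get();
    }
}
```

Note that an ID carrying a different clusterTimeStamp passes this check, which
is what lets IDs issued before a failover be re-used.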
> Handle RM fails over after getApplicationID() and before submitApplication().
> -----------------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch,
> YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch,
> YARN-1410.9.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.