[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915112#comment-13915112
 ] 

Xuan Gong commented on YARN-1410:
---------------------------------

bq. When can appId be null in the submission-context?

It will not happen right now. Take DistributedShell as an example: before we 
submit the application, we get an applicationId, which is used to set up some 
directories for local resources, shell_script, etc. That is why we need the 
applicationId as a globally unique ID. It may not be necessary for users’ own 
applications; they can simply call yarnClient#submitApplication() to 
submit their applications. That is why we add a null check for applicationId in 
ClientRMService#submitApplication().
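
To illustrate the null check, here is a minimal, self-contained sketch. It is 
not the actual ClientRMService code: NullCheckSketch and Context are 
hypothetical stand-ins, the ID string format is simplified, and it assumes 
that a missing applicationId is replaced by a freshly generated one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the RM side; the real classes are
// ClientRMService and ApplicationSubmissionContext.
final class NullCheckSketch {
    // Stand-in for ApplicationSubmissionContext.
    static final class Context {
        String applicationId; // null if the client never asked for an ID
    }

    private final long clusterTimestamp;
    private final AtomicInteger counter = new AtomicInteger(0);

    NullCheckSketch(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    // Hands out the next ID (simplified string form).
    String getNewApplication() {
        return "application_" + clusterTimestamp + "_" + counter.incrementAndGet();
    }

    // Assumption: a null ID is filled in with a newly generated one.
    String submitApplication(Context ctx) {
        if (ctx.applicationId == null) {
            ctx.applicationId = getNewApplication();
        }
        return ctx.applicationId;
    }
}
```

A client like DistributedShell would set the ID on the context first; a 
client that skips that step would still get a valid ID back from the submit 
call.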

If we really think this check is unnecessary, we should at least document this 
in yarnClient#submitApplication(), saying, “Before you use this API to submit 
the application, make sure you have an applicationId”. Also, we should not 
expose APIs such as ApplicationSubmissionContext#newInstance() or 
BuilderUtils#newApplicationSubmissionContext() publicly for users to create an 
ApplicationSubmissionContext object. Users should only get an 
ApplicationSubmissionContext by calling getNewApplication(), which also 
provides the applicationId.

bq. Documentation

Sure. I will add those.

bq. Does DistributedShell need any changes to reflect the potential change in 
appId after fail-over? If so, let's fix that too here. Please file a MR ticket 
to fix MapReduce too if needed. Fixes are needed in either case if anyone 
caches the appId from GetNewApplicationResponse.

I do not think we need to make any changes. DistributedShell and MapReduce 
both have an applicationId before they submit the application. When failover 
happens, the old applicationId will be re-used. So the applicationId returned 
from yarnClient#submitApplication() or from SubmitApplicationResponse is the 
same as the applicationId we used to submit the application.
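
A small sketch of that behavior, under stated assumptions: FailoverSketch, Rm 
and AppId are hypothetical stand-ins (not the real ResourceManager or 
ApplicationId classes), and the new RM is modeled as simply accepting the ID 
the client already holds.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class FailoverSketch {
    // Stand-in for ApplicationId: cluster timestamp plus counter value.
    static final class AppId {
        final long clusterTimestamp;
        final int id;

        AppId(long clusterTimestamp, int id) {
            this.clusterTimestamp = clusterTimestamp;
            this.id = id;
        }
    }

    // Stand-in for one active RM incarnation.
    static final class Rm {
        private final long clusterTimestamp; // time this RM became active
        private final AtomicInteger counter = new AtomicInteger(0);

        Rm(long clusterTimestamp) {
            this.clusterTimestamp = clusterTimestamp;
        }

        AppId getNewApplication() {
            return new AppId(clusterTimestamp, counter.incrementAndGet());
        }

        // The ID supplied by the client is re-used as is, even if it was
        // issued by an earlier RM incarnation.
        AppId submitApplication(AppId fromClient) {
            return fromClient;
        }
    }
}
```

If the client gets its ID from an Rm created with timestamp 1000 and then 
submits to an Rm created with timestamp 2000, the ID it gets back still 
carries timestamp 1000, so a cached applicationId stays valid on the client.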

bq. Orthogonal to this ticket, we need to make sure clients don't pass in 
invalid application-IDs as part of the submission-context. It can be validated 
by simply looking at our counter and may be also caching recently used appIDs 
(atleast within a single RM). I remember we had a JIRA for this somewhere. We 
also need throttles so that malicious client don't exhaust appIDs.

ApplicationId has two pieces of information: 
ResourceManager.getClusterTimeStamp() (the time the RM became active) and 
applicationCounter.incrementAndGet(). Since we allow users to re-use an old 
applicationId or create their own, the situation you described may happen. In 
the HA case, if failover happens several times, the clusterTimeStamp for the 
same RM will be different each time, because every time the RM becomes active 
we get a new clusterTimeStamp. So we could check the clusterTimeStamp and the 
app counter at the same time. For a given applicationId, if 
applicationId#getClusterTimestamp == ResourceManager.getClusterTimeStamp() and 
applicationId#getId > applicationCounter.get(), then we can consider this 
applicationId a malicious applicationId.
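
The proposed check can be sketched as follows. This is only an illustration 
of the condition above, not code from the patch: AppIdValidator, issueNextId 
and looksMalicious are hypothetical names, and the timestamp and counter are 
passed in as plain values instead of a real ApplicationId.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical RM-side helper holding the two pieces of state the
// check needs: the current cluster timestamp and the app counter.
final class AppIdValidator {
    private final long clusterTimestamp; // time this RM became active
    private final AtomicInteger applicationCounter = new AtomicInteger(0);

    AppIdValidator(long clusterTimestamp) {
        this.clusterTimestamp = clusterTimestamp;
    }

    // Mimics handing out a new ID for this RM incarnation.
    int issueNextId() {
        return applicationCounter.incrementAndGet();
    }

    // An ID that claims the current cluster timestamp but a counter value
    // we have not issued yet cannot have come from this RM.
    boolean looksMalicious(long idClusterTimestamp, int id) {
        return idClusterTimestamp == clusterTimestamp
            && id > applicationCounter.get();
    }
}
```

Note that an ID carrying an older clusterTimeStamp is not flagged by this 
condition, which is consistent with re-using an applicationId obtained before 
a failover.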



> Handle RM fails over after getApplicationID() and before submitApplication().
> -----------------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, 
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, 
> YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch, 
> YARN-1410.9.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
