[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870114#comment-13870114
 ] 

Xuan Gong commented on YARN-1410:
---------------------------------

Thanks, [~bikassaha] for the review.

bq. Xuan, can you please verify Karthik's comment above and fix/file jira?

replied. See my previous comment.

bq. Why is this at most once? This could be retried any number of times until 
the RM gives us a new application id, right?

This AtMostOnce annotation is kind of confusing. we add annotation @AtMostOnce 
for "non-idempotent" operations which need to do the retry if failover happens. 
So, I think that getNewApplication should be retried if failover happens and it 
is "non-idempotent" operation. That is why I add AtMostOnce annotation.

bq. In the common case, this extra operation is going to be pure overhead. Can 
we get this information via an exception in the submitApplication() method?
bq. Why are we removing this code. We should return a new exception for the 
client.
bq. Can it happen that submitApplication() will fail on the client but the RM 
has actually saved the application (with or without failover)? What should we 
do in that case?

Reply these three comments together. They are related.  Here are two scenarios:
1. If failover happens before YarnClient got the SubmitApplicationResponse, but 
state of RMApp has been saved in zookeeper. When we retry the 
ClientRMService#submitApplication with the old AppId, what will happen next is 
that we will get a "Application with id already exists"  exception since we 
have already saved this application in zookeeper.
2. If we submit an application with the old ApplicationId. We expect that we 
will get a "Application with id already exists" exception.

So, the main problem is how we can differentiate the two cases. If we get a 
"Application with id already exists" exception, how can we that it is because 
of the failover or that we submit an old applicationId ? Also, at 
ClientRMService#submitApplication, when the checker 
(rmContext.getRMApps().get(applicationId) != null) is true, how can we tell 
which exception we should throw ? Throw a "Application with id already exists" 
exception because we submit the application with old id ? or throw some other 
exceptions because of failover ?

So, that is why I move the applicationId duplicate checker from 
ClientRMService#submitApplication into YarnClient#submitApplication. Before, we 
want to submit an application, make sure the application id is not duplicate. 
After that, if we still have the checker 
(rmContext.getRMApps().get(applicationId) != null) as true in 
ClientRMService#submitApplication, we can assume that is because of failover.

bq. Can we create a common yarn client instead of doing it in every 
createApplication()/submitApplication() helper method call?

Right. we can do that.

bq. What are the changes in TestYarnClient for? I dont see any new test added.

Those changes are related with corner case as I described.

bq. There is no testcase for the new functionality of YarnClient where it 
creates an appId if its not supplied by the user? If you want, we could do it a 
separate jira related to this jira.

Yes, I will add it.

> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, 
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to