[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870114#comment-13870114 ]
Xuan Gong commented on YARN-1410: --------------------------------- Thanks, [~bikassaha] for the review. bq. Xuan, can you please verify Karthik's comment above and fix/file jira? replied. See my previous comment. bq. Why is this at most once? This could be retried any number of times until the RM gives us a new application id, right? This AtMostOnce annotation is kind of confusing. we add annotation @AtMostOnce for "non-idempotent" operations which need to do the retry if failover happens. So, I think that getNewApplication should be retried if failover happens and it is "non-idempotent" operation. That is why I add AtMostOnce annotation. bq. In the common case, this extra operation is going to be pure overhead. Can we get this information via an exception in the submitApplication() method? bq. Why are we removing this code. We should return a new exception for the client. bq. Can it happen that submitApplication() will fail on the client but the RM has actually saved the application (with or without failover)? What should we do in that case? Reply these three comments together. They are related. Here are two scenarios: 1. If failover happens before YarnClient got the SubmitApplicationResponse, but state of RMApp has been saved in zookeeper. When we retry the ClientRMService#submitApplication with the old AppId, what will happen next is that we will get a "Application with id already exists" exception since we have already saved this application in zookeeper. 2. If we submit an application with the old ApplicationId. We expect that we will get a "Application with id already exists" exception. So, the main problem is how we can differentiate the two cases. If we get a "Application with id already exists" exception, how can we that it is because of the failover or that we submit an old applicationId ? Also, at ClientRMService#submitApplication, when the checker (rmContext.getRMApps().get(applicationId) != null) is true, how can we tell which exception we should throw ? Throw a "Application with id already exists" exception because we submit the application with old id ? or throw some other exceptions because of failover ? So, that is why I move the applicationId duplicate checker from ClientRMService#submitApplication into YarnClient#submitApplication. Before, we want to submit an application, make sure the application id is not duplicate. After that, if we still have the checker (rmContext.getRMApps().get(applicationId) != null) as true in ClientRMService#submitApplication, we can assume that is because of failover. bq. Can we create a common yarn client instead of doing it in every createApplication()/submitApplication() helper method call? Right. we can do that. bq. What are the changes in TestYarnClient for? I dont see any new test added. Those changes are related with corner case as I described. bq. There is no testcase for the new functionality of YarnClient where it creates an appId if its not supplied by the user? If you want, we could do it a separate jira related to this jira. Yes, I will add it. > Handle client failover during 2 step client API's like app submission > --------------------------------------------------------------------- > > Key: YARN-1410 > URL: https://issues.apache.org/jira/browse/YARN-1410 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Bikas Saha > Assignee: Xuan Gong > Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, > YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > App submission involves > 1) creating appId > 2) using that appId to submit an ApplicationSubmissionContext to the user. > The client may have obtained an appId from an RM, the RM may have failed > over, and the client may submit the app to the new RM. > Since the new RM has a different notion of cluster timestamp (used to create > app id) the new RM may reject the app submission resulting in unexpected > failure on the client side. > The same may happen for other 2 step client API operations. -- This message was sent by Atlassian JIRA (v6.1.5#6160)