[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867591#comment-13867591 ]
Xuan Gong commented on YARN-1410:
---------------------------------

Posting a patch that outlines the approach. Thanks to the AtMostOnce and Idempotent annotations, only small code changes are needed.

* As in my previous comment, we will make the RM accept the appId in the context. When a failover happens, we will reuse the old applicationId (assigned by the previously active RM) to submit the application to the currently active RM.
* Use the AtMostOnce and Idempotent annotations
** As we discussed in YARN-1521, submitApplication and getNewApplication cannot be idempotent. To make those calls retriable, we can add the AtMostOnce annotation. getApplicationReport can be marked as idempotent.
** I would like to add the related annotations for these three APIs here, because I think that is part of the work for this ticket.
* This is how application submission works:
YarnClient#SubmitApplication +call+ ClientRMService#SubmitApplication +call+ RMAppManager#SubmitApplication +create+ RMApp and submit the START event +ReturnBack+ ClientRMService#SubmitApplication +ReturnBack+ YarnClient#SubmitApplication +CheckingAppStatus+ getApplicationReport +END+
** A failover may happen during any step or between any two steps.
** If a failover happens:
*** between the time that YarnClient#SubmitApplication starts and the time that ClientRMService#SubmitApplication is called. The YarnClient will find the next active RM and continue from where it left off.
*** between the time that ClientRMService#SubmitApplication starts and the time that RMAppManager#SubmitApplication is called. We will restart ClientRMService#SubmitApplication (re-run it from the first line). At this point the application has not been saved in ZooKeeper yet, so it is safe to restart ClientRMService#SubmitApplication.
*** between the time that RMAppManager#SubmitApplication starts and the time that the RMApp has been created and the START event has been submitted. We will do the same thing as in the previous case.
*** after the RMApp has been created and the START event has been sent out. If a failover happens here, there are several different cases:
**** after YarnClient got the SubmitApplicationResponse, but the state of the RMApp has not been saved in ZooKeeper yet. If a failover happens, we will get an ApplicationNotFoundException when we call getApplicationReport. What I am doing here is catching this exception and calling YarnClient#SubmitApplication again.
**** after YarnClient got the SubmitApplicationResponse, and the state of the RMApp has been saved in ZooKeeper. If a failover happens, we do not need to do anything.
**** before YarnClient got the SubmitApplicationResponse, and the state of the RMApp has not been saved in ZooKeeper yet. If a failover happens, we will restart ClientRMService#SubmitApplication from the very beginning.
**** before YarnClient got the SubmitApplicationResponse, but the state of the RMApp has been saved in ZooKeeper. This is the trickiest case. If a failover happens here, we will re-run ClientRMService#SubmitApplication from the very beginning. It will try to resubmit the application with the old applicationId. But since we have already saved this application in ZooKeeper, we will get an "Application with id already exists" exception, which is *not* what we want.

For the last corner case, [~bikassaha], [~kkambatl], any suggestions?

> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
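
The client-side handling described in the comment above (catch ApplicationNotFoundException after a failover and call submit again with the same applicationId) could be sketched roughly as follows. This is only an illustration, not the attached patch: SubmitRetrySketch, ResourceManager, FakeRm, AppNotFoundException and submitWithRetry are hypothetical stand-ins rather than YARN classes, the exception is made unchecked for brevity, and the "already exists" corner case raised at the end of the comment is deliberately left unhandled.

```java
import java.util.HashMap;
import java.util.Map;

public class SubmitRetrySketch {

    /** Simplified, unchecked stand-in for ApplicationNotFoundException (hypothetical). */
    static class AppNotFoundException extends RuntimeException {
        AppNotFoundException(String appId) {
            super("Application " + appId + " not found");
        }
    }

    /** Hypothetical, minimal view of the active RM as seen by the client. */
    interface ResourceManager {
        void submitApplication(String appId);
        String getApplicationReport(String appId);
    }

    /** In-memory RM that can drop one submission to mimic a failover before the state was stored. */
    static class FakeRm implements ResourceManager {
        private final Map<String, String> stateStore = new HashMap<>();
        int submitCount = 0;
        boolean loseNextSubmission = false;

        @Override
        public void submitApplication(String appId) {
            submitCount++;
            if (loseNextSubmission) {
                // Simulated failover: the client got a response, but nothing was persisted.
                loseNextSubmission = false;
                return;
            }
            stateStore.put(appId, "ACCEPTED");
        }

        @Override
        public String getApplicationReport(String appId) {
            String state = stateStore.get(appId);
            if (state == null) {
                throw new AppNotFoundException(appId);
            }
            return state;
        }
    }

    /** Submits with a fixed appId and resubmits when the RM does not know the application. */
    static String submitWithRetry(ResourceManager rm, String appId, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        AppNotFoundException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            rm.submitApplication(appId);
            try {
                return rm.getApplicationReport(appId);
            } catch (AppNotFoundException e) {
                // The new active RM has no record of appId: resubmit with the same id.
                last = e;
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        FakeRm rm = new FakeRm();
        rm.loseNextSubmission = true;
        String state = submitWithRetry(rm, "application_1_0001", 3);
        System.out.println(state + " after " + rm.submitCount + " submissions");
    }
}
```

In this toy setup the first submission is lost, the report lookup fails, and the second submission with the same id succeeds. A real client would also need to bound the retries against the RM failover policy, which is not modelled here.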