[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13867591#comment-13867591
 ] 

Xuan Gong commented on YARN-1410:
---------------------------------

Posting a patch that outlines the approach. Thanks to the AtMostOnce annotation 
and idempotent annotation, we only need to do a little code changes.
* As my previous comment, we will make RM accept the appId in the context. When 
the failover happens, we will re-use the old applicationId (assigned by 
previous active RM) to submit the applications for current active RM
* Use AtMostOnce and idempotent annotation
** As we discussed in YARN-1521, submitApplication and getNewApplication can 
not be idempotent. To make those functions retry, we can add AtMostOnce 
annotation. And getApplicationReport can be marked as idempotent. 
**  I would like to add related annotations for these three apis here, because 
I think that this is part of work for this ticket
* This is how application submission works. YarnClient#SubmitApplication +call+ 
ClientRMService#SubmitApplication +call+ RMAppManager#SubmitApplication 
+create+ RMApp and submit the START Event +ReturnBack+ 
ClientRMService#SubmitApplication +ReturnBack+ YarnClient#SubmitApplication 
+CheckingAppStatus+ getApplicationReport +END+
** The failover may happen in any steps or between any steps.
** If failover happens :
*** between the time that YarnClient#SubmitApplication starts and the time that 
ClientRMService#SubmitApplication is called. The YarnClient will find the next 
active RM, and continue to do where it left.
*** between the time that ClientRMService#SubmitApplication starts and the time 
that RMAppManager#SubmitApplication is called. We will restart 
ClientRMService#SubmitApplication(re-run it from the first line). At this time, 
the application has not been saved in zookeeper yet, so we are fine to restart 
ClientRMService#SubmitApplication.
*** between the time that RMAppManager#SubmitApplication starts and the time 
that RMApp has been created and START EVENT has been submitted. We will do the 
same thing as previous case.
*** after the time that RMApp has been created and START Event has been send 
out. If the failover happens, there are several different cases:
**** after YarnClient got the SubmitApplicationResponse, but state of RMApp has 
not been saved in zookeeper yet. If the failover happens, when we try to 
getApplicationReport, we will get ApplicationNotFoundException. What I am doing 
here is to catch this exception, and call YarnClient#SubmitApplication again.
**** after YarnClient got the SubmitApplicationResponse, and the state of RMApp 
has been saved in zookeeper. If the failover happens, we do not need to do 
anything. 
**** before YarnClient got the SubmitApplicationResponse, and state of RMApp 
has not been saved in zookeeper yet. If the failover happens, we will restart 
ClientRMService#SubmitApplication at very beginning
**** before YarnClient got the SubmitApplicationResponse, but state of RMApp 
has been saved in zookeeper. This is the most tricky case. If the failover 
happens here, we will re-run ClientRMService#SubmitApplication at very 
beginning. It will try to re-submit the application with the old applicationId. 
But since we have already saved this application in zookeeper, we will get a 
"Application with id already exists" exception which is *not* we want.


For the last corner case, [~bikassaha], [~kkambatl] Any suggestions ?


> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to