[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911156#comment-13911156 ]
Bikas Saha commented on YARN-1410:
----------------------------------

Sounds good. Let's track 2) in a separate new jira. Xuan, can you please open one?

For 1), I believe the change would be limited to allowing the new RM to accept an unknown application id in submitApplication(), under the assumption that the previous RM had generated the app id and then died either 1) before the client even attempted to submit, or 2) before saving the app in the store, so that the client is now retrying against the new RM. We can remove the idempotent etc. annotations and keep the change limited to the initial proposal:
1) create a new API that accepts an app-submission-context in which the user does not supply the app id
2) allow the RM to accept an app-submission-context that has an unknown app id.

Based on the comment -
https://issues.apache.org/jira/browse/YARN-1410?focusedCommentId=13864516&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13864516
the solution will be incomplete, since the old RM could have saved the state and the new RM would then find a conflicting app-submission request with an existing app id. That's why we branched off into that discussion. For now, we handle this in the following manner:
1) if the state of the existing app is NEW, then just accept the submitApplication() (effectively emulating the RetryCache)
2) if the state of the app != NEW, then fail the submitApp.
OR we could choose to solve this in the new jira being created.
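As a rough illustration of the interim handling described in the comment above, here is a minimal Java sketch. It is not the YARN code or the attached patch: the class FailoverSubmitSketch, its enums, the recover() helper, and the use of plain strings as app ids are all hypothetical stand-ins. It also assumes that an accepted unknown app id is simply recorded as NEW, which the comment does not specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; these are NOT the real YARN classes or APIs.
class FailoverSubmitSketch {
    enum AppState { NEW, SUBMITTED, RUNNING }
    enum Result { ACCEPTED, REJECTED }

    // Apps known to the new RM (for example, recovered from the state store),
    // keyed by an app id string.
    private final Map<String, AppState> apps = new HashMap<>();

    // Simulates state the old RM saved before it died.
    void recover(String appId, AppState state) {
        apps.put(appId, state);
    }

    Result submitApplication(String appId) {
        AppState existing = apps.get(appId);
        if (existing == null) {
            // Unknown app id: assume the previous RM generated it and died
            // before the app was saved. Accept and record it (assumption: as NEW).
            apps.put(appId, AppState.NEW);
            return Result.ACCEPTED;
        }
        if (existing == AppState.NEW) {
            // Known app that is still NEW: treat the call as a retry and accept,
            // roughly emulating what a RetryCache would do.
            return Result.ACCEPTED;
        }
        // Known app that has moved past NEW: conflicting submission, so fail.
        return Result.REJECTED;
    }

    public static void main(String[] args) {
        FailoverSubmitSketch rm = new FailoverSubmitSketch();
        rm.recover("application_100_0001", AppState.NEW);
        rm.recover("application_100_0002", AppState.RUNNING);

        System.out.println(rm.submitApplication("application_100_0003")); // unknown id
        System.out.println(rm.submitApplication("application_100_0001")); // saved as NEW
        System.out.println(rm.submitApplication("application_100_0002")); // already RUNNING
    }
}
```

In this sketch the first two submissions are accepted and the third is rejected, matching cases 1) and 2) of the interim handling. Leaving the state untouched on a NEW retry keeps repeated retries of the same app id harmless.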
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch,
> YARN-1410.5.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)