[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909964#comment-13909964 ]
Xuan Gong commented on YARN-1410:
---------------------------------
OK, your proposal might also work well. But after going carefully through the
RetryCache source code, I found that we might not need such a complicated
structure. RetryCache detects repeated operations by comparing their ClientId
and CallId. Under your idea, we store the ClientId and CallId in
ApplicationSubmissionContextData, which is stored in RMStateStore. Then we
need to read the ClientId and CallId back to rebuild the RetryCache. These
steps may not be necessary.
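
To make the comparison concrete, here is a minimal sketch of detecting a
repeated call from the (ClientId, CallId) pair. It is not Hadoop's RetryCache:
the class SimpleRetryDetector, its CallKey type, and the use of a UUID for the
client ID are all assumptions made for illustration.

```java
import java.util.Objects;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical, simplified retry detector; it is not Hadoop's RetryCache. */
class SimpleRetryDetector {

    /** Identifies one RPC invocation by the pair (clientId, callId). */
    static final class CallKey {
        final UUID clientId; // assumption: a UUID stands in for the client ID
        final int callId;

        CallKey(UUID clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    // In-memory only: this set is lost on restart, which is why the earlier
    // idea had to persist the IDs and rebuild the cache afterwards.
    private final Set<CallKey> seen = ConcurrentHashMap.newKeySet();

    /** Returns true if this (clientId, callId) pair was already recorded. */
    boolean isRetry(UUID clientId, int callId) {
        // Set.add returns false when the key is already present.
        return !seen.add(new CallKey(clientId, callId));
    }
}
```

Because the set lives only in memory, a restarted process sees every call as
new unless the pairs are saved and reloaded.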
Because we use a globally unique ID to check for duplicates, why not just
create a UUID for the ApplicationSubmissionContext and use this ID to do the
duplicate check directly?
Considering this point, here is my new proposal:
Add a new field named ApplicationUUID to ApplicationSubmissionContext. In
YarnClientImpl, before we submit the application, we generate a random UUID
for it. This way, even if the user sets a fake applicationUUID, we will
overwrite it. During application submission, we check the applicationId and
applicationUUID together. If both are duplicates, that means the failover or
RM restart happened after RMStateStore saved the ApplicationState.
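
A rough sketch of the check described above, under stated assumptions: the
class SubmissionChecker, its Result enum, and the in-memory map standing in
for state recovered from RMStateStore are hypothetical names, not existing
YARN code. How to treat a matching applicationId with a different UUID is not
covered by the proposal; the sketch rejects it as one possible choice.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical RM-side duplicate check; not actual YARN code. */
class SubmissionChecker {

    enum Result { ACCEPTED, DUPLICATE_RETRY, REJECTED }

    // Stand-in for saved application state: applicationId -> applicationUUID.
    private final Map<String, UUID> savedApps = new ConcurrentHashMap<>();

    Result submit(String applicationId, UUID applicationUuid) {
        // putIfAbsent returns null when the applicationId was not yet known.
        UUID existing = savedApps.putIfAbsent(applicationId, applicationUuid);
        if (existing == null) {
            return Result.ACCEPTED;
        }
        if (existing.equals(applicationUuid)) {
            // Same id and same UUID: the client is retrying a submission
            // that was already saved before the failover or restart.
            return Result.DUPLICATE_RETRY;
        }
        // Same id, different UUID: assumption, treated as a conflict here.
        return Result.REJECTED;
    }
}
```

On the client side, the proposal amounts to calling UUID.randomUUID() once per
application and sending the same value on every retry of that submission.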
There are several advantages to this new proposal:
First, we will not have any compatibility issues, since we just add one
optional field to ApplicationSubmissionContextProto. Second, we do not add any
extra logic, so the overall performance of application submission will not
change.
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch,
> YARN-1410.5.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.