[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910845#comment-13910845 ]
Xuan Gong commented on YARN-1410:
---------------------------------
I really doubt that the RetryCache would work for us. Look at how the HDFS
code uses RetryCache. Take FSNamesystem.delete() as an example:
{code}
boolean delete(String src, boolean recursive)
    throws AccessControlException, SafeModeException,
    UnresolvedLinkException, IOException {
  CacheEntry cacheEntry = RetryCache.waitForCompletion(retryCache);
  if (cacheEntry != null && cacheEntry.isSuccess()) {
    return true; // Return previous response
  }
  boolean ret = false;
  try {
    ret = deleteInt(src, recursive, cacheEntry != null);
  } catch (AccessControlException e) {
    logAuditEvent(false, "delete", src);
    throw e;
  } finally {
    RetryCache.setState(cacheEntry, ret);
  }
  return ret;
}
{code}
Before it starts the operation, it checks whether an earlier attempt of the
same operation already succeeded. Before it sends the response, it records
whether the operation succeeded. This works perfectly for these HDFS
operations, because once we have received the response, we can say that the
operation is finished.
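The check-then-record pattern above can be sketched in isolation. This is a
minimal illustration only: SimpleRetryCache, CallKey, and execute() are
hypothetical names, not the Hadoop RetryCache API, and the sketch does not
wait for an attempt that is still in flight, which the real
waitForCompletion() appears to do.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified retry cache; NOT the Hadoop RetryCache API. */
class SimpleRetryCache {

    /** Identifies one logical call: the client plus its per-call id. */
    static final class CallKey {
        final String clientId;
        final int callId;

        CallKey(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    // Result of the most recent attempt for each call.
    private final Map<CallKey, Boolean> results = new ConcurrentHashMap<>();

    /** Runs op unless an earlier attempt with the same key already succeeded. */
    boolean execute(CallKey key, BooleanSupplier op) {
        if (Boolean.TRUE.equals(results.get(key))) {
            return true; // return the previous response
        }
        boolean ok = false;
        try {
            ok = op.getAsBoolean();
        } finally {
            // Record the outcome before the response goes back to the caller.
            results.put(key, ok);
        }
        return ok;
    }
}
```

The sketch is only correct when op has fully finished by the time it returns,
which is the property the HDFS operations have.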
But this does not work for YARN operations. Take application submission as an
example: can we say that the submission is finished when we receive the
response from ClientRMService? No, we cannot draw that conclusion. Then where
would we set the state of the cacheEntry in the RetryCache? In
YarnClientImpl#submitApplication? Then we need to find a way to expose the
RetryCache to client code. Or maybe we can add extra logic in ClientRMService
to check whether the app has been submitted before returning the response?
That would add another hop and hurt performance, just like my old
check-before-submission proposal.
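The gap between "response returned" and "operation finished" can be pictured
with a toy model. Everything here is hypothetical (ToySubmissionService and
its methods are not YARN APIs); it only illustrates that the state at response
time is not yet the final state, so there is no obvious point at which to
mark a cache entry as successful.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of asynchronous submission; all names are hypothetical. */
class ToySubmissionService {

    enum AppState { NEW_SAVING, ACCEPTED }

    private final Map<String, AppState> apps = new ConcurrentHashMap<>();

    /** Returns as soon as the request is recorded; the app is not accepted yet. */
    void submit(String appId) {
        // A repeated submit of a known appId leaves the existing state alone.
        apps.putIfAbsent(appId, AppState.NEW_SAVING);
    }

    /** Stands in for the later step that completes the submission. */
    void finishSaving(String appId) {
        apps.replace(appId, AppState.NEW_SAVING, AppState.ACCEPTED);
    }

    /** Current state of the app, or null if the appId is unknown. */
    AppState stateOf(String appId) {
        return apps.get(appId);
    }
}
```

In this model a caller that sees submit() return still has to ask stateOf()
later to learn whether the submission actually completed.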
I think that the overall logic of RetryCache does not work, or at least is
not that useful, for YARN operations, except that it can provide a globally
unique ID for detecting repeated operations. But just to provide such an ID, I
really do not think that we need such a “complicated” structure.
Also, regarding “proposing a custom solution”: I think the proposal that saves
enough information, such as ClientId and ServiceId, in
ApplicationSubmissionContext and then reads it back to rebuild the RetryCache
is a custom solution for ApplicationSubmission, too. I do not think that this
approach can work for other non-idempotent APIs, such as
renewDelegationToken().
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch,
> YARN-1410.5.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.