[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910845#comment-13910845
 ] 

Xuan Gong commented on YARN-1410:
---------------------------------

I really doubt that the RetryCache would work for us. Look at how HDFS uses 
RetryCache. Take FSNamesystem.delete() as an example:
{code}
  boolean delete(String src, boolean recursive)
      throws AccessControlException, SafeModeException,
      UnresolvedLinkException, IOException {
    CacheEntry cacheEntry = RetryCache.waitForCompletion(retryCache);
    if (cacheEntry != null && cacheEntry.isSuccess()) {
      return true; // Return previous response
    }
    boolean ret = false;
    try {
      ret = deleteInt(src, recursive, cacheEntry != null);
    } catch (AccessControlException e) {
      logAuditEvent(false, "delete", src);
      throw e;
    } finally {
      RetryCache.setState(cacheEntry, ret);
    }
    return ret;
  }
{code}

Before it starts the operation, it checks whether an earlier attempt of the 
same operation has already succeeded. Before it sends the response, it records 
whether the operation succeeded. This works perfectly for these HDFS 
operations, because once we have received the response, we can say that the 
operation is finished.
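
To make the check-then-record pattern concrete, here is a minimal sketch of 
it. This is not Hadoop's RetryCache: the class name, the (clientId, callId) 
string key and the executions counter are all assumptions made up for 
illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical, simplified stand-in for a retry cache; NOT Hadoop's RetryCache. */
final class RetryCacheSketch {
    // Key: clientId + ":" + callId (assumed identity of one logical call).
    // Value: whether the most recent attempt of that call succeeded.
    private final Map<String, Boolean> completed = new HashMap<>();
    private int executions = 0;

    /** Runs op at most once per successful (clientId, callId) pair. */
    synchronized boolean invoke(String clientId, int callId, BooleanSupplier op) {
        String key = clientId + ":" + callId;
        // Check before starting: did an earlier attempt already succeed?
        if (Boolean.TRUE.equals(completed.get(key))) {
            return true; // Return previous response without re-executing
        }
        boolean ret = false;
        try {
            executions++;
            ret = op.getAsBoolean();
        } finally {
            // Record the outcome before the response goes back to the caller.
            completed.put(key, ret);
        }
        return ret;
    }

    /** Number of times an operation body was actually executed. */
    synchronized int executions() {
        return executions;
    }
}
```

Calling invoke("client-1", 7, ...) twice with an operation that succeeds runs 
the operation body only once; the second call is answered from the recorded 
outcome. The pattern relies on "response sent" meaning "operation finished".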

But this does not work for YARN operations. Take ApplicationSubmission as an 
example: can we say the application submission is finished when we receive 
the response from ClientRMService? No, we cannot draw that conclusion. Then 
how would we set the state of the cacheEntry in RetryCache? Set it in 
YarnClientImpl#submitApplication? Then we need to find a way to expose the 
RetryCache to client code. Or maybe we could add extra logic in 
ClientRMService to check whether the app has been submitted before returning 
the response? That would add another hop and hurt performance, just like my 
old check-before-submission proposal.
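
The problem can be illustrated with a toy model in which the submit call 
returns before the application is persisted. Everything here is hypothetical 
(class, methods, and the simplification that failover keeps only persisted 
apps); it is a sketch of the argument, not of the actual RM code.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical toy RM whose submit call returns before the app is persisted. */
final class AsyncSubmitSketch {
    private final Queue<String> pending = new ArrayDeque<>(); // accepted, not yet saved
    private final Set<String> saved = new HashSet<>();        // survives a failover

    /** Returns as soon as the request is queued; nothing is persisted yet. */
    boolean submit(String appId) {
        pending.add(appId);
        return true;
    }

    /** Background step that persists queued apps some time after submit returns. */
    void persistPending() {
        while (!pending.isEmpty()) {
            saved.add(pending.remove());
        }
    }

    /** Failover: only persisted state carries over (assumed simplification). */
    void failover() {
        pending.clear();
    }

    /** Whether the RM still knows the app after whatever has happened so far. */
    boolean isKnown(String appId) {
        return saved.contains(appId);
    }
}
```

If failover() happens between submit() and persistPending(), the client has 
already seen a successful response, yet isKnown() returns false. A cache entry 
marked successful at response time would therefore record the wrong thing.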

I think the overall logic of RetryCache does not work, or at least is not 
that useful, for YARN operations, except that it can provide a globally 
unique ID for detecting repeated operations. But just to provide such an ID, I 
really do not think we need such a “complicated” structure.

Also, regarding “proposing a custom solution”: I think the proposal that saves 
enough information, such as ClientId and ServiceId, in 
ApplicationSubmissionContext and then reads it back to rebuild the RetryCache 
is a custom solution for ApplicationSubmission, too. I do not think it would 
work for other non-idempotent APIs, such as renewDelegationToken().



> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, 
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, 
> YARN-1410.5.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
