[
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13907694#comment-13907694
]
Bikas Saha commented on YARN-1410:
----------------------------------
I am repeatedly asking for this because it's a problem that we will continue to
face in other non-idempotent operations on different RM and NM protocols. We
need to establish a consistent behavior that can be reused for all operations
instead of operation-specific workarounds that are brittle.
I spoke to [~sureshms] offline and he showed me the RetryCache helper class that
is present in hadoop common, and also the usage of that class in the
non-idempotent FSNameSystem.delete() RPC. Can you please take a look at that
code? We do not have to bother about client-id/call-id; RetryCache takes care of
all that for us. The main thing we have to do is use the RetryCache methods
properly and save the right information in the store such that the RetryCache
can be re-populated after restart if needed. I am adding Suresh as a watcher to
this jira. He has volunteered to help review and help understand the code.
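To make the idea concrete, here is a minimal sketch of a retry cache that
records the result of a non-idempotent call and can be rebuilt from saved state
after a restart. It is NOT the Hadoop RetryCache API: the names
RetryCacheSketch, CallKey, executeOnce, snapshot and restore are all
hypothetical, and the in-memory map stands in for whatever the real store
would persist.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical stand-in for a retry cache; not the Hadoop RetryCache class.
public class RetryCacheSketch {

    /** Identifies one logical RPC. A retry of the same call reuses the same key. */
    static final class CallKey {
        final String clientId;
        final int callId;

        CallKey(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    // Results of calls that have already been executed.
    private final Map<CallKey, String> completed = new HashMap<>();

    /** Runs op at most once per key; a retry gets the recorded result instead. */
    synchronized String executeOnce(CallKey key, Supplier<String> op) {
        if (completed.containsKey(key)) {
            return completed.get(key);
        }
        String result = op.get();
        completed.put(key, result);
        return result;
    }

    /** Copy of the entries that would have to be saved in the store. */
    synchronized Map<CallKey, String> snapshot() {
        return new HashMap<>(completed);
    }

    /** Rebuilds a cache from saved entries, e.g. on the RM that takes over. */
    static RetryCacheSketch restore(Map<CallKey, String> saved) {
        RetryCacheSketch cache = new RetryCacheSketch();
        cache.completed.putAll(saved);
        return cache;
    }

    public static void main(String[] args) {
        int[] executions = {0};
        Supplier<String> submit = () -> {
            executions[0]++;
            return "accepted";
        };
        CallKey key = new CallKey("client-A", 7);

        RetryCacheSketch beforeFailover = new RetryCacheSketch();
        beforeFailover.executeOnce(key, submit);

        // Simulated restart: the new instance is populated from saved state.
        RetryCacheSketch afterFailover = restore(beforeFailover.snapshot());
        String retried = afterFailover.executeOnce(key, submit);

        // prints "accepted, executions=1"
        System.out.println(retried + ", executions=" + executions[0]);
    }
}
```

The retried call returns the recorded result without running the operation a
second time, which is the behavior we want from a non-idempotent RPC across a
failover.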
Suresh also mentioned that the AtMostOnce etc. annotations are supposed to be
made on the RPC methods. The RetryCache kicks in only based on annotations on
the protocol methods.
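As a rough illustration of annotation-driven behavior, the sketch below marks
one protocol method and checks for the marker via reflection. The annotation
here is a local stand-in declared inside the example, not Hadoop's own, and
ClientProtocolSketch and needsRetryCache are hypothetical names. How the
Hadoop RPC layer actually consults the annotations is not shown here and
should be checked against the code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

// Hypothetical example; the annotation below is a local stand-in.
public class AnnotationSketch {

    /** Local marker for non-idempotent methods that must run at most once. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface AtMostOnce {}

    /** Hypothetical protocol: only the submit call is marked. */
    interface ClientProtocolSketch {
        @AtMostOnce
        String submitApplication(String appId);

        String getApplicationReport(String appId);
    }

    /** Returns true if the named protocol method carries the marker. */
    static boolean needsRetryCache(Class<?> protocol, String methodName) {
        for (Method m : protocol.getMethods()) {
            if (m.getName().equals(methodName)) {
                return m.isAnnotationPresent(AtMostOnce.class);
            }
        }
        throw new IllegalArgumentException("no such method: " + methodName);
    }

    public static void main(String[] args) {
        // prints "true"
        System.out.println(
            needsRetryCache(ClientProtocolSketch.class, "submitApplication"));
        // prints "false"
        System.out.println(
            needsRetryCache(ClientProtocolSketch.class, "getApplicationReport"));
    }
}
```

The point of the sketch is only that the decision is made per protocol method,
so an unannotated method would never reach the retry cache.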
It would be good if we take some time and do this cleanly in a reusable
manner once so that work on the remaining APIs can be made easier. If we use
specific workarounds then I am concerned that these may come back to bite us
later on.
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.