[
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910898#comment-13910898
]
Bikas Saha commented on YARN-1410:
----------------------------------
There is considerable confusion here. I haven't seen the latest code, but here is
my understanding of app submission in YARN.
1) The client calls submitApp(). This submits the app context and returns
success or failure after initial static checks.
2) If success is returned, the client calls getAppReport() and waits for the app
to be accepted. If the app gets accepted, the client reports to the user that
the app has been successfully submitted. Otherwise, app submission fails.
Now there can be retries in step 1) or step 2). Step 2) is idempotent, so we
don't need to worry about that. Step 1) is non-idempotent. With the retry cache
approach, upon retry (directly to the same RM or to a failed-over RM), a
correctly working RetryCache will return the same response as was originally
sent by the RM. So if the RM returned success, the RetryCache will return
success. If the RM returned immediate failure (based on static checks), then the
RetryCache will return failure. It's not clear to me why this would cause issues
or why it won't work in YARN.
The RetryCache is used for per-RPC retries. It is not related to the 2-step
process that we use in YARN, where each step is a different RPC request. Final
success for the user is based on the completion of both steps. The RetryCache
can be used to return the same RPC response for step 1) as many times as the
client retries that same RPC request. That's exactly what we want. The crucial
piece is storing what's needed to re-populate the RetryCache upon failover.
Here, we are piggybacking on AppSubmissionContext storage, just like HDFS
piggybacks on the edit log entry.
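To make the idea concrete, here is a minimal sketch of the behaviour described
above. It is not the Hadoop RetryCache API. CallKey, ToyResourceManager, the
string responses and the in-memory "state store" map are all hypothetical names
invented for illustration. The sketch only assumes that a retried RPC carries
the same client id and call id as the original, and that the call identity is
saved alongside the stored submission.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class RetryCacheSketch {

    /** Identifies one logical RPC: same client id + same call id means "a retry". */
    static final class CallKey {
        final String clientId;
        final int callId;

        CallKey(String clientId, int callId) {
            this.clientId = clientId;
            this.callId = callId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CallKey)) {
                return false;
            }
            CallKey other = (CallKey) o;
            return callId == other.callId && clientId.equals(other.clientId);
        }

        @Override
        public int hashCode() {
            return Objects.hash(clientId, callId);
        }
    }

    /** Hypothetical stand-in for the RM side of submitApp(). */
    static final class ToyResourceManager {
        // In-memory retry cache; it is lost on failover unless re-populated.
        private final Map<CallKey, String> retryCache = new HashMap<>();
        // Stand-in for durable storage that survives failover (appId -> call).
        private final Map<String, CallKey> stateStore;
        // Counts how often the non-idempotent work actually ran.
        int executions = 0;

        ToyResourceManager(Map<String, CallKey> stateStore) {
            this.stateStore = stateStore;
            // On startup or failover, rebuild the cache from stored submissions.
            for (Map.Entry<String, CallKey> e : stateStore.entrySet()) {
                retryCache.put(e.getValue(), "SUCCESS:" + e.getKey());
            }
        }

        String submitApp(CallKey key, String appId) {
            String cached = retryCache.get(key);
            if (cached != null) {
                return cached; // a retry: replay the original response
            }
            executions++;
            String response;
            if (appId == null || appId.isEmpty()) {
                response = "FAILURE:invalid app id"; // static check failed
            } else {
                stateStore.put(appId, key); // piggyback the call id on app storage
                response = "SUCCESS:" + appId;
            }
            retryCache.put(key, response);
            return response;
        }
    }

    public static void main(String[] args) {
        Map<String, CallKey> store = new HashMap<>();
        ToyResourceManager rm1 = new ToyResourceManager(store);
        CallKey call = new CallKey("client-A", 7);

        String first = rm1.submitApp(call, "app_1");
        String retry = rm1.submitApp(call, "app_1"); // retry to the same RM

        // Simulated failover: a new RM starts from the same durable store.
        ToyResourceManager rm2 = new ToyResourceManager(store);
        String afterFailover = rm2.submitApp(call, "app_1");

        System.out.println(first + " | " + retry + " | " + afterFailover);
        System.out.println("executions: " + rm1.executions + ", " + rm2.executions);
    }
}
```

In this sketch only successful submissions are persisted. A submission that
failed the static checks would simply be re-checked by the new RM, which gives
the same answer as long as those checks are deterministic.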
I hope this makes things clear. [~sureshms] Does this make sense?
Side Note:
RetryCache also has an option to store a payload along with the response. This
is useful when the response has a large internal object that is hard/expensive
to re-create and can be fetched from the RetryCache directly.
> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
> Key: YARN-1410
> URL: https://issues.apache.org/jira/browse/YARN-1410
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Xuan Gong
> Attachments: YARN-1410-outline.patch, YARN-1410.1.patch,
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch,
> YARN-1410.5.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create
> app id) the new RM may reject the app submission resulting in unexpected
> failure on the client side.
> The same may happen for other 2 step client API operations.