[ 
https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910898#comment-13910898
 ] 

Bikas Saha commented on YARN-1410:
----------------------------------

There is considerable confusion here. I havent seen the latest code but here is 
my understanding of app submission in yarn.
1) client call submitApp(). this submits the app context and returns success or 
failure after initial static checks.
2) if success is returned then client call getAppReport() and waits for the app 
to be accepted. If the app gets accepted, then client reports success to use 
that app has been successfully submitted. Else app submission fails.

Now there can be retries in step 1) or step 2). Step 2 is idempotent. We dont 
need to worry about that. Step 1) is non-idempotent. With the retry cache 
approach, upon retry (directly to the same RM or to a failed over RM), a 
correctly working RetryCache will return the same response as was originally 
sent by the RM. So if the RM returned success, RetryCache will return success. 
If the RM returned immediate failure (based on static checks) then the 
RetryCache will return failure. Its not clear to me why this would cause issues 
or why it wont work in YARN.

The RetryCache is used for per RPC retries. It is not related to the 2-step 
process that we use in YARN where each step is a different RPC request. Final 
success for the user is based on the completion of both steps. RetryCache can 
be used to return the same RPC response for Step 1 as many times as the client 
retries that same RPC request. Thats exactly what we want. The crucial piece is 
storing whats needed to re-populate the RetryCache upon failover. Here, we are 
piggy-backing on AppSubmissionContext storage just like HDFS piggybacks on the 
edit log entry.

I hope this make things clear. [~sureshms] Does this make sense?

Side Note: 
RetryCache also has an option to store a payload along with the response. This 
is useful when the response has a large internal object that is hard/expensive 
to re-create and can be fetched from the RetryCache directly.


> Handle client failover during 2 step client API's like app submission
> ---------------------------------------------------------------------
>
>                 Key: YARN-1410
>                 URL: https://issues.apache.org/jira/browse/YARN-1410
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>         Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, 
> YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, 
> YARN-1410.5.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> App submission involves
> 1) creating appId
> 2) using that appId to submit an ApplicationSubmissionContext to the user.
> The client may have obtained an appId from an RM, the RM may have failed 
> over, and the client may submit the app to the new RM.
> Since the new RM has a different notion of cluster timestamp (used to create 
> app id) the new RM may reject the app submission resulting in unexpected 
> failure on the client side.
> The same may happen for other 2 step client API operations.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to