[ 
https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130825#comment-15130825
 ] 

Naganarasimha G R commented on YARN-4665:
-----------------------------------------

Thanks for the clarification, [~vvasudev] & [~jlowe].
bq.  but I'd expect the submission logic to be a POST followed by GET polling 
until the state is ACCEPTED or later. If the GET results in a no-such-app error 
then the client retries the POST and continues polling.
IIUC, a REST API user *needs to handle this explicitly* in the way described 
above to ensure the application is successfully submitted. If so, we should 
capture it in the documentation, as the 2.7.2 docs say nothing about it. 
Correct me if I am missing something. 
[~vvasudev],
bq.  Internally the functionality uses the same code flow as the RPC path - all 
calls flow through ClientRMService#submitApplication. 
IIUC the concern here is that, because app submission is asynchronous, the 
submit call might return successfully while the state-store operation fails, 
so the submitted app is lost on RM failover. With {{YarnClient}}, the client 
takes care of re-requesting until the app state is appropriate, but with REST 
the caller/user needs to call GET apps after the POST submission of an app. 
??subsequent re-submits?? are handled on the server side, but the client needs 
to retry until it no longer gets a no-such-app error, right?
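
For illustration, the client-side flow discussed above (POST, then GET polling, 
with a re-POST on a no-such-app error) could be sketched roughly as follows. 
This is only a sketch under assumptions: {{post_app}} and {{get_state}} are 
hypothetical callables standing in for the REST POST and GET calls, not part 
of any YARN API, and treating NEW, NEW_SAVING and SUBMITTED as the states 
before ACCEPTED is likewise an assumption.

```python
import time
from typing import Callable, Optional

# Assumption: these are the states that precede ACCEPTED.
PENDING_STATES = {"NEW", "NEW_SAVING", "SUBMITTED"}


def submit_until_accepted(
    post_app: Callable[[], None],
    get_state: Callable[[], Optional[str]],
    max_polls: int = 50,
    poll_interval: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """POST once, then poll; re-POST whenever the RM does not know the app.

    post_app  -- hypothetical callable that performs the REST POST
    get_state -- hypothetical callable that performs the REST GET and
                 returns the app state, or None on a no-such-app error
    """
    post_app()
    for _ in range(max_polls):
        state = get_state()
        if state is None:
            # The app was lost (e.g. RM failover before it was persisted),
            # so submit it again and keep polling.
            post_app()
        elif state not in PENDING_STATES:
            # ACCEPTED or any later state: the submission is safe.
            return state
        sleep(poll_interval)
    raise TimeoutError("application was not accepted within max_polls polls")
```

The transport is injected so that the retry logic stays independent of any 
particular HTTP client; a real caller would wrap its own POST and GET requests 
in the two callables.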

> Asynch submit can lose application submissions
> ----------------------------------------------
>
>                 Key: YARN-4665
>                 URL: https://issues.apache.org/jira/browse/YARN-4665
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.0-beta
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>
> The change introduced in YARN-514 opens up a hole into which applications can 
> fall and be lost.  Prior to YARN-514, the {{submitApplication()}} call did 
> not complete until the application state was persisted to the state store.  
> After YARN-514, the {{submitApplication()}} call is asynchronous, with the 
> application state being saved later.
> If the state store is slow or unresponsive, an application's state may not 
> be persisted for quite a while.  During that time, if the RM 
> fails (over), all applications that have not yet been persisted to the state 
> store will be lost.  If the active RM loses ZK connectivity, a significant 
> number of job submissions can pile up before the ZK connection times out, 
> resulting in a large pile of client failures when it finally does.
> This issue is inherent in the design of YARN-514.  I see three solutions:
> 1. Add a WAL to the state store. HBase does it, so we know how to do it. It 
> seems like a heavy solution to the original problem, however.  It's certainly 
> not a trivial change.
> 2. Revert YARN-514 and update the RPC layer to allow a connection to be 
> parked if it's doing something that may take a while. This is a generally 
> useful feature but could be a deep rabbit hole.
> 3. Revert YARN-514 and add back-pressure to the job submission. For example, 
> we set a maximum number of threads that can simultaneously be assigned to 
> handle job submissions.  When that threshold is reached, new job submissions 
> get a try-again-later response. This is also a generally useful feature and 
> should be a fairly constrained set of changes.
> I think the third option is the most approachable.  It's the smallest change, 
> and it adds useful behavior beyond solving the original issue.
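
As a rough illustration of the back-pressure idea in option 3, the sketch 
below caps the number of submissions handled at once and rejects the rest with 
a try-again-later error. It is a minimal sketch, not the RM implementation: 
{{SubmissionGate}}, {{TryAgainLater}} and the {{persist}} callable are 
hypothetical names introduced only for this example.

```python
import threading


class TryAgainLater(Exception):
    """Tells the caller to retry the submission later (hypothetical)."""


class SubmissionGate:
    """Admits at most max_in_flight concurrent submissions (hypothetical)."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, persist):
        # Non-blocking acquire: reject instead of queueing when saturated.
        if not self._slots.acquire(blocking=False):
            raise TryAgainLater("too many submissions in flight")
        try:
            # Synchronous save: return only after the state is persisted.
            return persist()
        finally:
            self._slots.release()
```

Rejecting rather than blocking keeps handler threads from piling up behind a 
slow state store; the client sees the rejection immediately and can retry.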



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
