[
https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128569#comment-15128569
]
Rohith Sharma K S commented on YARN-4665:
-----------------------------------------
The scenario is valid, but it is already handled in YarnClient. Is that not
sufficient?
In method YarnClientImpl#submitApplication():
{code}
try {
  //
  rmClient.submitApplication(request);
  //
} catch (ApplicationNotFoundException ex) {
  // FailOver or RM restart happens before RMStateStore saves
  // ApplicationState
  LOG.info("Re-submit application " + applicationId + " with the " +
      "same ApplicationSubmissionContext");
  rmClient.submitApplication(request);
}
{code}
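To make the re-submit path concrete, here is a minimal self-contained sketch of the same idea. It does not use the Hadoop classes: ToyRm, AppNotFoundException, and submitWithResubmit are hypothetical stand-ins invented for illustration, and the "failover" is simulated by simply clearing the in-memory set of known applications. It assumes the client learns about the loss when it asks for the application's report after submitting.

```java
import java.util.HashSet;
import java.util.Set;

public class ResubmitSketch {
    /** Hypothetical local stand-in for YARN's ApplicationNotFoundException. */
    static class AppNotFoundException extends Exception {
        AppNotFoundException(String appId) {
            super("Unknown application " + appId);
        }
    }

    /** Toy RM: keeps submissions in memory only and forgets them on failover. */
    static class ToyRm {
        private final Set<String> known = new HashSet<>();
        int submissions = 0;

        void submit(String appId) {
            submissions++;
            known.add(appId);
        }

        /** Simulates a failover that happens before the state was persisted. */
        void failOverBeforePersist() {
            known.clear();
        }

        String report(String appId) throws AppNotFoundException {
            if (!known.contains(appId)) {
                throw new AppNotFoundException(appId);
            }
            return "ACCEPTED";
        }
    }

    /** Submit, then ask for a report; if the RM no longer knows the app, submit again. */
    static String submitWithResubmit(ToyRm rm, String appId, boolean failOver) {
        rm.submit(appId);
        if (failOver) {
            rm.failOverBeforePersist();
        }
        try {
            return rm.report(appId);
        } catch (AppNotFoundException ex) {
            // The submission was lost: re-submit with the same context.
            rm.submit(appId);
            try {
                return rm.report(appId);
            } catch (AppNotFoundException again) {
                throw new IllegalStateException(again);
            }
        }
    }

    public static void main(String[] args) {
        ToyRm rm = new ToyRm();
        String state = submitWithResubmit(rm, "application_1", true);
        System.out.println(state + " after " + rm.submissions + " submissions");
        // prints "ACCEPTED after 2 submissions"
    }
}
```

Note that this only helps while the client is still inside the submit call and able to notice the missing application.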
> Asynch submit can lose application submissions
> ----------------------------------------------
>
> Key: YARN-4665
> URL: https://issues.apache.org/jira/browse/YARN-4665
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.1.0-beta
> Reporter: Daniel Templeton
> Assignee: Daniel Templeton
> Priority: Critical
>
> The change introduced in YARN-514 opens up a hole into which applications can
> fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did
> not complete until the application state was persisted to the state store.
> After YARN-514, the {{submitApplication()}} call is asynchronous, with the
> application state being saved later.
> If the state store is slow or unresponsive, an application's state may not
> be persisted for quite a while. During that time, if the RM fails (over),
> all applications that have not yet been persisted to the state store will
> be lost without the client being aware.
> This issue is inherent in the design of YARN-514. I see three solutions:
> 1. Add a WAL to the state store. HBase does it, so we know how to do it. It
> seems like a heavy solution to the original problem, however. It's certainly
> not a trivial change.
> 2. Revert YARN-514 and update the RPC layer to allow a connection to be
> parked if it's doing something that may take a while. This is a generally
> useful feature but could be a deep rabbit hole.
> 3. Revert YARN-514 and add back-pressure to the job submission. For example,
> we set a maximum number of threads that can simultaneously be assigned to
> handle job submissions. When that threshold is reached, new job submissions
> get a try-again-later response. This is also a generally useful feature and
> should be a fairly constrained set of changes. The downside is that it
> impacts the API.
> I think the third option is the most approachable. It's the smallest change,
> and it adds useful behavior beyond solving the original issue. And I don't
> think the API impact is significant.
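The back-pressure idea in option 3 could be sketched roughly as follows. This is not a proposed patch and does not reflect any existing RM code: SubmissionGate, Result, and the persist callback are hypothetical names, and a java.util.concurrent.Semaphore stands in for "a maximum number of threads that can simultaneously be assigned to handle job submissions". The submission is synchronous again (it returns only after the persist step finishes), and callers that arrive when every slot is busy get a try-again-later result instead of blocking.

```java
import java.util.concurrent.Semaphore;

public class SubmissionGate {
    /** Hypothetical response codes; TRY_AGAIN_LATER is the back-pressure signal. */
    enum Result { ACCEPTED, TRY_AGAIN_LATER }

    private final Semaphore permits;

    SubmissionGate(int maxConcurrentSubmissions) {
        this.permits = new Semaphore(maxConcurrentSubmissions);
    }

    /**
     * Runs the persist step only if a submission slot is free.
     * Returns ACCEPTED after persist completes, or TRY_AGAIN_LATER
     * immediately when the limit has been reached.
     */
    Result submit(Runnable persist) {
        if (!permits.tryAcquire()) {
            return Result.TRY_AGAIN_LATER;
        }
        try {
            persist.run(); // synchronous: state is saved before we return
            return Result.ACCEPTED;
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SubmissionGate gate = new SubmissionGate(1);
        Result[] inner = new Result[1];
        // The inner submission arrives while the only slot is still held.
        Result outer = gate.submit(() -> inner[0] = gate.submit(() -> { }));
        System.out.println(outer + " " + inner[0]);
        // prints "ACCEPTED TRY_AGAIN_LATER"
    }
}
```

The API impact mentioned above shows up here as the extra TRY_AGAIN_LATER outcome that clients would need to handle, for example by retrying with a delay.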