[ https://issues.apache.org/jira/browse/YARN-4665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Naganarasimha G R reassigned YARN-4665:
---------------------------------------

    Assignee: Naganarasimha G R  (was: Daniel Templeton)

> Asynch submit can lose application submissions
> ----------------------------------------------
>
>                 Key: YARN-4665
>                 URL: https://issues.apache.org/jira/browse/YARN-4665
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.1.0-beta
>            Reporter: Daniel Templeton
>            Assignee: Naganarasimha G R
>
> The change introduced in YARN-514 opens up a hole into which applications can
> fall and be lost. Prior to YARN-514, the {{submitApplication()}} call did
> not complete until the application state was persisted to the state store.
> After YARN-514, the {{submitApplication()}} call is asynchronous, with the
> application state being saved later.
> If the state store is slow or unresponsive, an application's state may not
> be persisted for quite a while. During that time, if the RM fails (over),
> all applications that have not yet been persisted to the state store will
> be lost. If the active RM loses ZK connectivity, a significant number of
> job submissions can pile up before the ZK connection times out, resulting
> in a large pile of client failures when it finally does.
> This issue is inherent in the design of YARN-514. I see three solutions:
> 1. Add a WAL to the state store. HBase does it, so we know how to do it. It
> seems like a heavy solution to the original problem, however. It's certainly
> not a trivial change.
> 2. Revert YARN-514 and update the RPC layer to allow a connection to be
> parked if it's doing something that may take a while. This is a generally
> useful feature but could be a deep rabbit hole.
> 3. Revert YARN-514 and add back-pressure to the job submission. For example,
> we set a maximum number of threads that can simultaneously be assigned to
> handle job submissions. When that threshold is reached, new job submissions
> get a try-again-later response. This is also a generally useful feature and
> should be a fairly constrained set of changes.
> I think the third option is the most approachable. It's the smallest change,
> and it adds useful behavior beyond solving the original issue.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
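
For illustration only, the back-pressure idea in option 3 could look roughly like the sketch below. None of this is code from YARN or from a patch on this issue: the class SubmissionGate, its submit() method, and the Result enum are hypothetical names, and a counting semaphore stands in for the "maximum number of threads" limit. The sketch assumes the submission is persisted synchronously while a slot is held, as it was before YARN-514.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical admission gate for job submissions; not part of YARN. */
final class SubmissionGate {

    /** Outcome reported back to the submitting client. */
    public enum Result { ACCEPTED, TRY_AGAIN_LATER }

    // One permit per submission that may be in flight at the same time.
    private final Semaphore permits;

    SubmissionGate(int maxConcurrentSubmissions) {
        this.permits = new Semaphore(maxConcurrentSubmissions);
    }

    /**
     * Runs persistState on the calling thread if a slot is free and returns
     * ACCEPTED once it completes. If every slot is taken, returns
     * TRY_AGAIN_LATER immediately instead of queueing the request.
     */
    Result submit(Runnable persistState) {
        if (!permits.tryAcquire()) {
            return Result.TRY_AGAIN_LATER; // threshold reached: push back
        }
        try {
            persistState.run(); // assumed to block until the state is stored
            return Result.ACCEPTED;
        } finally {
            permits.release(); // free the slot even if persisting throws
        }
    }
}
```

With this shape, a slow state store ties up at most maxConcurrentSubmissions handler threads, and a client that receives ACCEPTED knows its application was persisted before the call returned. Clients that receive TRY_AGAIN_LATER would be expected to retry after a delay.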