Karthik Kambatla commented on YARN-5677:

Committed this to trunk. 

The patch does not compile on branch-2. There appear to be type issues with 
{{any()}} in the tests. [~templedf], can you post a branch-2 patch as well? 

> RM should transition to standby when connection is lost for an extended period
> ------------------------------------------------------------------------------
>                 Key: YARN-5677
>                 URL: https://issues.apache.org/jira/browse/YARN-5677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-5677.001.patch, YARN-5677.002.patch, 
> YARN-5677.003.patch, YARN-5677.004.patch, YARN-5677.005.patch
>
> In trunk, there is no maximum number of retries that I see.  It appears the 
> connection will be retried forever, with the active never figuring out it's 
> no longer active.  In my testing, the active-active state lasted almost 2 
> hours with no sign of stopping before I killed it.  The solution appears to 
> be to cap the number of retries or amount of time spent retrying.
>
> This issue is significant because of the asynchronous nature of job 
> submission.  If the active doesn't know it's not active, it will buffer up 
> job submissions until it finally realizes it has become the standby. Then it 
> will fail all the job submissions in bulk. In high-volume workflows, that 
> behavior can create huge mass job failures.
>
> This issue is also important because the node managers will not fail over to 
> the new active until the old active realizes it's the standby.  Workloads 
> submitted after the old active loses contact with ZK will therefore fail to 
> be executed regardless of which RM the clients contact.
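
For illustration only, the cap proposed in the description (a limit on the number of retries or on the time spent retrying) might be sketched as below. This is not the committed patch: the class BoundedRetry, its run method, and all parameter names are hypothetical, and no Hadoop or ZooKeeper API is used.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of Hadoop and not taken from the YARN-5677 patch. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Calls attempt until it returns true, maxAttempts calls have been made,
     * or maxElapsedMillis has passed, whichever comes first.
     *
     * @return true if an attempt succeeded, false if the caller should give up
     */
    static boolean run(BooleanSupplier attempt, int maxAttempts,
                       long maxElapsedMillis, long sleepMillis) {
        long deadline = System.nanoTime() + maxElapsedMillis * 1_000_000L;
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            // Stop early once the time budget is exhausted.
            if (System.nanoTime() - deadline >= 0) {
                break;
            }
            // Pause between attempts, but not after the last one.
            if (i + 1 < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Under this sketch, a false return is the point at which the caller would stop treating itself as active, for example by transitioning to standby, instead of retrying the connection indefinitely.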
