[ 
https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15528575#comment-15528575
 ] 

Jian He commented on YARN-5677:
-------------------------------

[~templedf], this curator-based leader election code is new in 2.9, and it's 
not used by default unless you enable it by setting 
yarn.resourcemanager.ha.curator-leader-elector.enabled to true.
If you don't set it, the RM uses the hadoop-common ActiveStandbyElector, and we 
did see similar issues with ActiveStandbyElector in the past. 
Which one are you testing with? If you are testing with the hadoop-common one, 
it would be good if you could also test the curator-based implementation.
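
For reference, a minimal yarn-site.xml snippet for switching the RM over to the 
curator-based elector (the property name is the one mentioned above; the rest 
of your existing HA configuration stays as-is):

    <property>
      <name>yarn.resourcemanager.ha.curator-leader-elector.enabled</name>
      <value>true</value>
    </property>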

> RM can be in active-active state for an extended period
> -------------------------------------------------------
>
>                 Key: YARN-5677
>                 URL: https://issues.apache.org/jira/browse/YARN-5677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.3, 3.0.0-alpha1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>
> Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses 
> contact with the ZK node(s).
> In branch-2.7, the RM will retry the connection 1000 times by default.  
> Attempting to contact a node which cannot be reached is slow, which means the 
> active can take over an hour to realize it is no longer active.  I clocked it 
> at about an hour and a half in my tests.  The solution appears to be to add 
> some time awareness into the retry loop.
> In branch-2.8/trunk, there is no maximum number of retries as far as I can see.  It 
> appears the connection will be retried forever, with the active never 
> figuring out it's no longer active.  In my testing, the active-active state 
> lasted almost 2 hours with no sign of stopping before I killed it.  The 
> solution appears to be to cap the number of retries or amount of time spent 
> retrying.
> This issue is significant because of the asynchronous nature of job 
> submission.  If the active doesn't know it's not active, it will buffer up 
> job submissions until it finally realizes it has become the standby. Then it 
> will fail all the job submissions in bulk. In high-volume workflows, that 
> behavior can cause mass job failures.
> This issue is also important because the node managers will not fail over to 
> the new active until the old active realizes it's the standby.  Workloads 
> submitted after the old active loses contact with ZK will therefore fail to 
> be executed regardless of which RM the clients contact.
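
To make the suggested fix concrete, here is a rough sketch (hypothetical code, 
not the actual elector implementation; connectToZk() and transitionToStandby() 
are placeholder helpers, and the limits are illustrative) of a retry loop 
bounded by both a retry count and a deadline, so the active steps down instead 
of retrying forever:

    // Hypothetical sketch only -- names and limits are illustrative.
    private void reconnectWithBoundedRetries(int maxRetries, long maxRetryTimeMs,
                                             long retryIntervalMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxRetryTimeMs;
        int attempts = 0;
        while (attempts < maxRetries && System.currentTimeMillis() < deadline) {
            try {
                connectToZk();                   // placeholder: one ZK (re)connect attempt
                return;                          // connected again, keep acting as active
            } catch (IOException e) {            // java.io.IOException
                attempts++;
                Thread.sleep(retryIntervalMs);   // back off before the next attempt
            }
        }
        transitionToStandby();                   // retries/time exhausted: stop acting as active
    }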



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
