Tarun Parimi created YARN-10789:
-----------------------------------

             Summary: RM HA startup can fail due to race conditions in 
ZKConfigurationStore
                 Key: YARN-10789
                 URL: https://issues.apache.org/jira/browse/YARN-10789
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Tarun Parimi
            Assignee: Tarun Parimi


We are observing below error randomly during hadoop install and RM initial 
startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
configured. This cause one of the RM's to not startup.

{code:java}
2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /confstore/CONF_STORE
{code}

We are trying to create the znode /confstore/CONF_STORE when we initialize the 
ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
initialized when CapacityScheduler does a serviceInit. This serviceInit is done 
by both Active and Standby RM. So we can run into a race condition when both 
Active and Standby try to create the same znode when both RM are started at 
same time.

ZKRMStateStore on the other hand avoids such race conditions, by creating the 
znodes only after serviceStart. serviceStart only happens for the active RM 
which won the leader election, unlike serviceInit which happens irrespective of 
leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to