wuchang created YARN-6590:
-----------------------------

             Summary: ResourceManager Master/Slave transition make all 
applications killed
                 Key: YARN-6590
                 URL: https://issues.apache.org/jira/browse/YARN-6590
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.7.3
         Environment: Linux
            Reporter: wuchang
            Priority: Critical


My YARN cluster is configured for HA. It seems that, because of a ZooKeeper 
connection timeout, the active ResourceManager became standby and the standby 
one became active, i.e. a ResourceManager active/standby transition occurred. 
Both RM processes themselves are still running normally. Below is the 
ResourceManager log:
{noformat}
2017-05-12 12:47:40,150 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Sending NMToken for nodeId : 10.120.117.100:37900 for container : 
container_1494505293131_4378_01_000007
2017-05-12 12:47:40,150 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1494505293131_4378_01_000007 Container Transitioned from ALLOCATED to 
ACQUIRED
2017-05-12 12:47:40,150 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Sending NMToken for nodeId : 10.120.117.108:46066 for container : 
container_1494505293131_4378_01_000008
2017-05-12 12:47:40,150 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
container_1494505293131_4378_01_000008 Container Transitioned from ALLOCATED to 
ACQUIRED
2017-05-12 12:47:40,166 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server 10.120.117.104/10.120.117.104:2181. Will not attempt to 
authenticate using SASL (unknown error)
2017-05-12 12:47:40,168 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to 10.120.117.104/10.120.117.104:2181, initiating session
2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
expired. Entering neutral mode and rejoining...
2017-05-12 12:47:40,170 INFO org.apache.zookeeper.ClientCnxn: Unable to 
reconnect to ZooKeeper service, session 0x685bcd9343dfc3f8 has expired, closing 
socket connection
2017-05-12 12:47:40,170 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
{noformat}
In my opinion, this active/standby transition *should not* kill my running 
applications, but in fact, when the transition happened, all running 
YARN-based MR and Spark jobs were killed. Below is part of my YARN 
configuration:

{code}
        <property>
                <name>yarn.resourcemanager.zk-address</name>
                <value>zkServer1:2181,zkServer2:2181,zkServer3:2181,zkServer4:2181</value>
        </property>
        <property>
                <name>yarn.resourcemanager.zk-timeout-ms</name>
                <value>30000</value>
        </property>
        <property>
                <name>yarn.resourcemanager.store.class</name>
                <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
        </property>
        <property>
                <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
                <value>true</value>
        </property>
{code}

So, is any configuration missing? I notice that I did not set 
{noformat}yarn.resourcemanager.recovery.enabled{noformat} to true, and its 
default value is false. But according to the official documentation, this 
setting is used for ResourceManager restart, not for ResourceManager 
active/standby transition.
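
For reference, a minimal sketch of how that property would look in 
yarn-site.xml if it were set to true. Whether it has any effect on an 
active/standby transition, as opposed to a restart, is the open question here:

{code}
        <!-- Assumption: NOT part of my current configuration; shown only to
             illustrate the property discussed above. -->
        <property>
                <name>yarn.resourcemanager.recovery.enabled</name>
                <value>true</value>
        </property>
{code}
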
Any suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
