[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Attachment: YARN-4348-branch-2.7.004.patch

Adding missing {{continue}} statement after calling {{syncInternal}} in the following block:

{code}
if (shouldRetryWithNewConnection(ke.code()) && retry < numRetries) {
  LOG.info("Retrying operation on ZK with new Connection. " +
      "Retry no. " + retry);
  Thread.sleep(zkRetryInterval);
  createConnection();
  syncInternal(ke.getPath());
  continue;
}
{code}

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348-branch-2.7.003.patch, YARN-4348-branch-2.7.004.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
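The effect of the missing {{continue}} can be illustrated with a minimal, self-contained sketch. It is not the ZKRMStateStore code: the class {{RetryLoopSketch}}, the exception {{RetriableException}}, and the {{withContinue}} flag are hypothetical names introduced here, and the sketch assumes that the code following the retry block gives up and rethrows.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for a retry loop; not the actual ZKRMStateStore code. */
final class RetryLoopSketch {

    /** Hypothetical marker for errors that are worth retrying on a new connection. */
    static final class RetriableException extends Exception {
        RetriableException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action until it succeeds or the retry budget is used up.
     * With withContinue == false the loop falls through to the "give up"
     * path even though a reconnect was just performed.
     */
    static <T> T runWithRetries(Callable<T> action, int numRetries, boolean withContinue)
            throws Exception {
        int retry = 0;
        while (true) {
            try {
                return action.call();
            } catch (RetriableException e) {
                if (++retry < numRetries) {
                    // A real implementation would sleep, reconnect and sync here.
                    if (withContinue) {
                        continue; // go back to the top of the loop and re-run the action
                    }
                }
                throw e; // assumed "give up" path after the retry block
            }
        }
    }

    /** Runs an action that fails twice and then succeeds; reports the outcome. */
    static String demo(boolean withContinue) {
        int[] attempts = {0};
        try {
            return runWithRetries(() -> {
                if (attempts[0]++ < 2) {
                    throw new RetriableException("simulated connection loss");
                }
                return "ok";
            }, 5, withContinue);
        } catch (Exception e) {
            return "gave up after " + attempts[0] + " attempt(s)";
        }
    }

    public static void main(String[] args) {
        System.out.println("with continue:    " + demo(true));
        System.out.println("without continue: " + demo(false));
    }
}
```

In this sketch the retry budget is only useful when the loop actually jumps back to the action; without the jump, the first retriable error ends the operation.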
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Attachment: YARN-4348-branch-2.7.003.patch

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348-branch-2.7.003.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Priority: Blocker (was: Major)

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Priority: Blocker
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Target Version/s: 2.6.3, 2.7.3 (was: 2.7.3)

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Attachment: YARN-4348-branch-2.7.002.patch

The test failure I mentioned is caused by using zkResyncWaitTime as the timeout for the sync operation: the default value of zkResyncWaitTime is smaller than zkSessionTimeout. The timeout should be larger than zkSessionTimeout, so this patch simply changes it to zkSessionTimeout * 3.

In addition, we should handle a failed sync operation at startup so that the RM does not continue running in an illegal state, that is, with an inconsistent view of ZK.

Attaching a patch that fixes the test failure and the error handling at startup (startInternal). [~jianhe], could you take a look?

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: YARN-4348-branch-2.7.002.patch, YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
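The two points in this comment (a sync wait longer than the session timeout, and refusing to start when the sync is not confirmed) can be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use the ZooKeeper client, the asynchronous reply is simulated with a background thread, and {{SyncTimeoutSketch}}, {{syncAndWait}} and {{startOrFail}} are hypothetical names, not methods of ZKRMStateStore.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of waiting for an asynchronous sync reply with a timeout. */
final class SyncTimeoutSketch {

    /** Sync wait derived from the session timeout, following "zkSessionTimeout * 3". */
    static long syncTimeoutMs(int zkSessionTimeoutMs) {
        return zkSessionTimeoutMs * 3L;
    }

    /**
     * Simulates a sync whose reply arrives after replyDelayMs and waits for it
     * for at most timeoutMs. Returns true only if the reply arrived in time.
     */
    static boolean syncAndWait(long replyDelayMs, long timeoutMs) {
        CountDownLatch replied = new CountDownLatch(1);
        Thread server = new Thread(() -> {
            try {
                Thread.sleep(replyDelayMs);
                replied.countDown(); // stands in for the sync callback firing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.setDaemon(true); // a late reply must not keep the JVM alive
        server.start();
        try {
            return replied.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Startup-style check: refuse to continue if the sync was not confirmed. */
    static void startOrFail(long replyDelayMs, int zkSessionTimeoutMs) {
        if (!syncAndWait(replyDelayMs, syncTimeoutMs(zkSessionTimeoutMs))) {
            throw new IllegalStateException("sync with ZK not confirmed; refusing to start");
        }
    }

    public static void main(String[] args) {
        System.out.println("fast reply confirmed: " + syncAndWait(10, 1000));
        System.out.println("slow reply confirmed: " + syncAndWait(1000, 50));
    }
}
```

Note that a {{false}} result here only means the reply was not seen in time; as in the situation described in this issue, the simulated sync may still complete afterwards.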
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jian He updated YARN-4348:
    Attachment: YARN-4348.001.patch

Uploading the same patch to kick Jenkins.

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Affects Version/s: 2.6.2

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Attachment: YARN-4348.001.patch

The issue looks similar to YARN-3753. One workaround is to change the timeout for the sync to zkResyncWaitTime, as [~jianhe] did in YARN-3753. Attaching a patch for this.

Increasing the timeout makes this case less likely, but it can still happen, e.g. when the ZK server's reply packet for the sync is dropped after the operation itself has succeeded.

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2, 2.6.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: YARN-4348.001.patch, log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Description:
The current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

We should use

  was:
The current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

{quote}
2015-11-11 11:54:05,728 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1241)) - Failed to sync with ZK new connection. -<--- sync failed
2015-11-11 11:54:05,728 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:runWithRetries(1244)) - Maxed out ZK retries. Giving up!
2015-11-11 11:54:05,728 ERROR recovery.RMStateStore (RMStateStore.java:transition(292)) - Error updating appAttempt: appattempt_1447242474882_0002_01
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /rmstore/ZKRMStateRoot/RMAppRoot/application_1447242474882_0002/appattempt_1447242474882_0002_01
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1082)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1079)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1164)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1197)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:1079)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:716)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:286)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:269)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:1006)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1075)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:1070)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:182)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
	at java.lang.Thread.run(Thread.java:745)
2015-11-11 11:54:05,729 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailed(1027)) - State store operation failed
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /rmstore/ZKRMStateRoot/RMAppRoot/application_1447242474882_0002/appattempt_1447242474882_0002_01
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
	at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1073)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1082)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$9.run(ZKRMStateStore.java:1079)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1164)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1197)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:1079)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:716)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:286)
	at
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Description:
The current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

We should use zkResyncWaitTime as the timeout value.

  was:
The current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

We should use

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
>
> The current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Attachment: log.txt

Attaching a log file.

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: log.txt
>
> The current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.
[jira] [Updated] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsuyoshi Ozawa updated YARN-4348:
    Description:
Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

We should use zkResyncWaitTime as the timeout value.

  was:
The current internal ZK configuration of ZKRMStateStore can cause a following situation:
1. syncInternal timeouts,
2. but sync succeeded later on.

We should use zkResyncWaitTime as the timeout value.

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
>
> Key: YARN-4348
> URL: https://issues.apache.org/jira/browse/YARN-4348
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.2
> Reporter: Tsuyoshi Ozawa
> Assignee: Tsuyoshi Ozawa
> Attachments: log.txt
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause a following situation:
> 1. syncInternal timeouts,
> 2. but sync succeeded later on.
> We should use zkResyncWaitTime as the timeout value.