[jira] [Commented] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332142#comment-14332142 ] Bartosz Ługowski commented on YARN-1621:
I extracted the {{yarn container}} method from ApplicationCLI into a new file (ContainerCLI) and changed {{yarn container -list Application Attempt ID}} to {{yarn container -list Application ID|Application Attempt ID}}. Container state filtering is now done on the server side (not in the CLI, as suggested by [~Naganarasimha]) and works for both applications and application attempts. I didn't include changes in {{yarn.cmd}} because it breaks the patch (git can't apply it).
Add CLI to list rows of task attempt ID, container ID, host of container, state of container
Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Fix For: 2.7.0 Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch
As more applications are moved to YARN, we need a generic CLI to list rows of task attempt ID, container ID, host of container, and state of container. Today, if a YARN application running in a container hangs, there is no way to find out more info because a user does not know where each attempt is running. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers -applicationId <appId> [-containerState <state of container>]

where containerState is an optional filter to list containers in the given state only.
The container state can be running/succeeded/killed/failed/all.
A user can specify more than one container state at once, e.g. KILLED,FAILED.

<task attempt ID> <container ID> <host of container> <state of container>
{code}
The CLI should work with both running and completed applications. If a container runs many task attempts, all attempts should be shown. That will likely be the case for Tez container-reuse applications.
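A listing close to what the proposal describes is already roughly possible through the existing YarnClient API; the sketch below filters on the client side, whereas the patch pushes the state filter to the server. The ConverterUtils parsing and the RUNNING-only filter are illustrative assumptions, not part of the patch:
{code:title=illustrative YarnClient sketch}
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ListContainersSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    // Illustrative filter standing in for "-containerState RUNNING"
    EnumSet<ContainerState> filter = EnumSet.of(ContainerState.RUNNING);
    for (ApplicationAttemptReport attempt :
        client.getApplicationAttempts(ConverterUtils.toApplicationId(args[0]))) {
      for (ContainerReport c : client.getContainers(attempt.getApplicationAttemptId())) {
        if (filter.contains(c.getContainerState())) {
          // one row per container: ID, host, state
          System.out.println(c.getContainerId() + "\t" + c.getAssignedNode()
              + "\t" + c.getContainerState());
        }
      }
    }
    client.stop();
  }
}
{code}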
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332137#comment-14332137 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
* hadoop-yarn-project/CHANGES.txt
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332132#comment-14332132 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332136#comment-14332136 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
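To see why the compounding matters, a back-of-the-envelope sketch: one YARN-level retry can hide dozens of IPC-level connect attempts. The 45 IPC retries match the description; the 20-second per-attempt connect timeout and the YARN policy retry count are assumed illustrative values, not read from any configuration:
{code:title=illustrative retry compounding}
public class RetryCompounding {
  public static void main(String[] args) {
    // ipc.client.connect.max.retries.on.timeouts default, per the description
    int ipcConnectRetries = 45;
    int connectTimeoutSec = 20;   // assumed per-attempt connect timeout
    int yarnPolicyRetries = 30;   // assumed retries in the NM proxy's RetryPolicy
    long perYarnRetrySec = (long) ipcConnectRetries * connectTimeoutSec;
    System.out.println("One YARN-level retry hides ~" + perYarnRetrySec
        + "s of IPC-level retrying");
    System.out.println("Worst case before the call finally fails: ~"
        + (perYarnRetrySec * yarnPolicyRetries) / 60 + " minutes");
  }
}
{code}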
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332138#comment-14332138 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332077#comment-14332077 ] zhihai xu commented on YARN-3242:
I found that oldZkClient is not useful any more; the added activeZkClient can replace it. I uploaded a new patch, YARN-3242.001.patch, which removes oldZkClient.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
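The shape of the missing check, as a minimal sketch (assumed structure for illustration, not the actual YARN-3242 patch): remember which ZooKeeper client is currently active and have the watcher drop events that belong to an older, already-replaced session:
{code:title=illustrative session-aware watcher}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class SessionAwareWatcher implements Watcher {
  // the session this watcher was created for; assigned right after the client is built
  private volatile ZooKeeper owner;
  // the store's current session; updated whenever a new client replaces an old one
  private volatile ZooKeeper activeZkClient;

  void setOwner(ZooKeeper zk) { this.owner = zk; }

  void setActive(ZooKeeper zk) { this.activeZkClient = zk; }

  @Override
  public void process(WatchedEvent event) {
    if (owner == null || owner != activeZkClient) {
      // Stale event: ZooKeeper's EventThread keeps draining waitingEvents even
      // after close(), so events from a replaced session are dropped here.
      return;
    }
    // ... handle SyncConnected / Disconnected / Expired for the active session ...
  }
}
{code}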
[jira] [Commented] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332105#comment-14332105 ] Hadoop QA commented on YARN-3242:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700085/YARN-3242.001.patch against trunk revision fe7a302.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6691//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6691//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6691//console
This message is automatically generated.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332133#comment-14332133 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332131#comment-14332131 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332172#comment-14332172 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332174#comment-14332174 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332173#comment-14332173 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
* hadoop-yarn-project/CHANGES.txt
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621:
Attachment: YARN-1621.4.patch
Add CLI to list rows of task attempt ID, container ID, host of container, state of container
Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Fix For: 2.7.0 Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch
As more applications are moved to YARN, we need a generic CLI to list rows of task attempt ID, container ID, host of container, and state of container. Today, if a YARN application running in a container hangs, there is no way to find out more info because a user does not know where each attempt is running. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers -applicationId <appId> [-containerState <state of container>]

where containerState is an optional filter to list containers in the given state only.
The container state can be running/succeeded/killed/failed/all.
A user can specify more than one container state at once, e.g. KILLED,FAILED.

<task attempt ID> <container ID> <host of container> <state of container>
{code}
The CLI should work with both running and completed applications. If a container runs many task attempts, all attempts should be shown. That will likely be the case for Tez container-reuse applications.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332180#comment-14332180 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332178#comment-14332178 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332179#comment-14332179 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332189#comment-14332189 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-3237) AppLogAggregatorImpl fails to log error cause
[ https://issues.apache.org/jira/browse/YARN-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332197#comment-14332197 ] Hudson commented on YARN-3237:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3237. AppLogAggregatorImpl fails to log error cause. Contributed by (xgong: rev f56c65bb3eb9436b67de2df63098e26589e70e56)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* hadoop-yarn-project/CHANGES.txt
AppLogAggregatorImpl fails to log error cause
Key: YARN-3237 URL: https://issues.apache.org/jira/browse/YARN-3237 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Fix For: 2.7.0 Attachments: YARN-3237-v2.patch, YARN-3237.patch
AppLogAggregatorImpl fails to log the cause of the error if it is unable to create the LogWriter. Below is the log output:
[LogAggregationService #24011] ERROR logaggregation.AppLogAggregatorImpl: Cannot create writer for app app_id. Disabling log-aggregation for this app.
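The fix amounts to passing the caught exception to the logger; a minimal illustration (the helper and appId below are hypothetical stand-ins, not the actual AppLogAggregatorImpl code):
{code:title=illustrative logging-with-cause example}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class LogCauseExample {
  private static final Log LOG = LogFactory.getLog(LogCauseExample.class);

  void startAggregation(String appId) {
    try {
      createLogWriter();  // hypothetical stand-in for the real writer creation
    } catch (IOException e) {
      // Passing 'e' as the second argument prints the cause and stack trace;
      // omitting it produces the bare, unhelpful message quoted in the report.
      LOG.error("Cannot create writer for app " + appId
          + ". Disabling log-aggregation for this app.", e);
    }
  }

  private void createLogWriter() throws IOException {
    throw new IOException("simulated failure");
  }
}
{code}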
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332200#comment-14332200 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332198#comment-14332198 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90
[ https://issues.apache.org/jira/browse/YARN-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332188#comment-14332188 ] Hudson commented on YARN-2799:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
cleanup TestLogAggregationService based on the change in YARN-90
Key: YARN-2799 URL: https://issues.apache.org/jira/browse/YARN-2799 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.7.0 Attachments: YARN-2799.000.patch, YARN-2799.001.patch, YARN-2799.002.patch
cleanup TestLogAggregationService based on the change in YARN-90. The following code was added to setup in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
In this case, we should remove all this code from each test function to avoid duplicate code. The same goes for dispatcher.stop(), which is in tearDown: we can remove dispatcher.stop() from each test function as well, because it will always be called from tearDown for each test.
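The pattern being applied, as a minimal sketch (class and type names assumed for illustration; the real test registers more event types and uses the project's own helpers):
{code:title=illustrative setup/tearDown dedup}
import static org.mockito.Mockito.mock;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class SetupTeardownSketch {
  private DrainDispatcher dispatcher;
  private EventHandler appEventHandler;

  private DrainDispatcher createDispatcher() {
    DrainDispatcher d = new DrainDispatcher();
    d.init(new Configuration());
    d.start();
    return d;
  }

  @Before
  public void setup() {
    // shared wiring hoisted out of the individual tests
    dispatcher = createDispatcher();
    appEventHandler = mock(EventHandler.class);
    dispatcher.register(ApplicationEventType.class, appEventHandler);
  }

  @After
  public void tearDown() {
    // runs after every test, so no test needs its own dispatcher.stop()
    dispatcher.stop();
  }

  @Test
  public void testUsesSharedDispatcher() {
    // test body uses 'dispatcher' and 'appEventHandler' directly
  }
}
{code}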
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Summary: Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. (was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.)
Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332190#comment-14332190 ] Hudson commented on YARN-90:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
NodeManager should identify failed disks becoming good again
Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).
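The behavior YARN-90 adds, reduced to a sketch (assumed structure for illustration; the real change lives in the NM's directory-handling code and uses fuller health checks than plain permission probes):
{code:title=illustrative failed-disk recheck}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DiskRecheckSketch {
  private final List<String> goodDirs = new ArrayList<String>();
  private final List<String> failedDirs = new ArrayList<String>();

  // Called periodically; a directory that became healthy again is moved back
  // to the good list so the NM can reuse it without a restart.
  synchronized void recheckFailedDirs() {
    for (Iterator<String> it = failedDirs.iterator(); it.hasNext(); ) {
      File dir = new File(it.next());
      if (dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute()) {
        it.remove();
        goodDirs.add(dir.getPath());
      }
    }
  }
}
{code}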
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}
{code}
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Attachment: YARN-3242.002.patch
Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
[jira] [Commented] (YARN-3194) RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332196#comment-14332196 ] Hudson commented on YARN-3194:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* hadoop-yarn-project/CHANGES.txt
RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.7.0 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch
On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as a new node or as a reconnecting node, and triggers the corresponding event based on the node-added or node-reconnected state.
# Node added event: again, 2 scenarios can occur
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
# Node reconnected event:
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART enabled:
#### Some applications are running on this node – *the problem is here*
#### Zero applications are running on this node – NOT A PROBLEM
Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending container requests are not served by the RM.
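What "handle the NMContainerStatuses" means in practice, sketched under assumptions (the actual fix threads the statuses through RMNodeReconnectEvent into RMNodeImpl; the handler and hook below are illustrative, not the real RM code):
{code:title=illustrative reconnect handling}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.server.api.protocolrecords.NMContainerStatus;

class ReconnectHandlingSketch {
  // On a reconnect registration, surface the containers the NM reports as
  // finished so the scheduler can release their resources.
  void handleReconnect(List<NMContainerStatus> statuses) {
    for (NMContainerStatus s : statuses) {
      if (s.getContainerState() == ContainerState.COMPLETE) {
        notifyContainerCompleted(s.getContainerId());
      }
    }
  }

  void notifyContainerCompleted(ContainerId id) {
    // hypothetical hook: in the real RM this flows through node/app events
  }
}
{code}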
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332226#comment-14332226 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332228#comment-14332228 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3194) RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332224#comment-14332224 ] Hudson commented on YARN-3194:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.7.0 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch
On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as a new node or as a reconnecting node, and triggers the corresponding event based on the node-added or node-reconnected state.
# Node added event: again, 2 scenarios can occur
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
# Node reconnected event:
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART enabled:
#### Some applications are running on this node – *the problem is here*
#### Zero applications are running on this node – NOT A PROBLEM
Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending container requests are not served by the RM.
[jira] [Commented] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90
[ https://issues.apache.org/jira/browse/YARN-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332216#comment-14332216 ] Hudson commented on YARN-2799: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java cleanup TestLogAggregationService based on the change in YARN-90 Key: YARN-2799 URL: https://issues.apache.org/jira/browse/YARN-2799 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.7.0 Attachments: YARN-2799.000.patch, YARN-2799.001.patch, YARN-2799.002.patch cleanup TestLogAggregationService based on the change in YARN-90. The following code was added to setup in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
Given that, we should remove all this code from each test function to avoid duplication. The same goes for dispatcher.stop(), which is in tearDown: we can remove dispatcher.stop() from each test function as well, because tearDown will always call it for each test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
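As a concrete illustration of the cleanup, here is a sketch of the intended test shape. It is illustrative only: it uses a plain DrainDispatcher in place of the test's createDispatcher() helper, and assumes JUnit 4 plus Mockito as the test already does.
{code:title=sketch: rely on setup/tearDown instead of per-test duplication}
import static org.mockito.Mockito.mock;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEvent;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestLogAggregationServiceSketch {
  private DrainDispatcher dispatcher;
  private EventHandler<ApplicationEvent> appEventHandler;

  @Before
  @SuppressWarnings("unchecked")
  public void setup() {
    dispatcher = new DrainDispatcher();   // stand-in for createDispatcher()
    dispatcher.init(new Configuration());
    dispatcher.start();
    appEventHandler = mock(EventHandler.class);
    dispatcher.register(ApplicationEventType.class, appEventHandler);
  }

  @After
  public void tearDown() {
    dispatcher.stop();   // runs after every test, so tests need not call it
  }

  @Test
  public void someAggregationTest() {
    // body uses 'dispatcher' and 'appEventHandler' directly; no local
    // dispatcher setup and no trailing dispatcher.stop() required
  }
}
{code}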
[jira] [Commented] (YARN-3230) Clarify application states on the web UI
[ https://issues.apache.org/jira/browse/YARN-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332229#comment-14332229 ] Hudson commented on YARN-3230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-3230. Clarify application states on the web UI. (Jian He via wangda) (wangda: rev ce5bf927c3d9f212798de1bf8706e5e9def235a1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppBlock.java Clarify application states on the web UI Key: YARN-3230 URL: https://issues.apache.org/jira/browse/YARN-3230 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Fix For: 2.7.0 Attachments: YARN-3230.1.patch, YARN-3230.2.patch, YARN-3230.3.patch, YARN-3230.3.patch, application page.png Today, the application state is surfaced as a single word on the web UI, and not everyone understands the meaning of NEW_SAVING, SUBMITTED, or ACCEPTED. This JIRA is to clarify the meaning of these states, e.g. what the application is waiting for in each state. In addition, the difference between application state and FinalStatus is fairly confusing to users, especially when state=FINISHED but FinalStatus=FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
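The kind of clarification in question could be rendered by mapping each state to a one-line explanation next to the raw state name. A hypothetical sketch follows; the class and wording are illustrative, not the committed patch.
{code:title=sketch: human-readable hints for application states}
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public final class AppStateHintsSketch {
  private AppStateHintsSketch() {}

  public static String hint(YarnApplicationState state) {
    switch (state) {
      case NEW_SAVING:
        return "Application is saving its state to the RM state-store.";
      case SUBMITTED:
        return "Application is submitted, waiting to be accepted by the scheduler.";
      case ACCEPTED:
        return "Application is accepted, waiting for its ApplicationMaster container to be allocated.";
      case RUNNING:
        return "ApplicationMaster has registered and the application is running.";
      default:
        return "";
    }
  }
}
{code}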
[jira] [Commented] (YARN-3237) AppLogAggregatorImpl fails to log error cause
[ https://issues.apache.org/jira/browse/YARN-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332225#comment-14332225 ] Hudson commented on YARN-3237: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-3237. AppLogAggregatorImpl fails to log error cause. Contributed by (xgong: rev f56c65bb3eb9436b67de2df63098e26589e70e56) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt AppLogAggregatorImpl fails to log error cause - Key: YARN-3237 URL: https://issues.apache.org/jira/browse/YARN-3237 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Fix For: 2.7.0 Attachments: YARN-3237-v2.patch, YARN-3237.patch AppLogAggregatorImpl fails to log the error if it is unable to create LogWriter. Below is the log output: [LogAggregationService #24011] ERROR logaggregation.AppLogAggregatorImpl: Cannot create writer for app app_id. Disabling log-aggregation for this app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
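The fix pattern here is simply to pass the caught exception as the last argument to the logger so the cause and stack trace are preserved. A minimal sketch, with the method, fields, and createLogWriter() helper as stand-ins rather than the exact AppLogAggregatorImpl code:
{code:title=sketch: preserving the cause when logging}
// Illustrative pattern only; the names below are assumptions.
private void startWriter() {
  try {
    this.writer = createLogWriter();
  } catch (IOException e) {
    // Before the fix, the exception was swallowed: LOG.error(msg) carried no
    // cause. Passing 'e' as the last argument logs the stack trace as well.
    LOG.error("Cannot create writer for app " + this.applicationId
        + ". Disabling log-aggregation for this app.", e);
    return;
  }
}
{code}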
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332217#comment-14332217 ] Hudson commented on YARN-2797: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/CHANGES.txt TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332218#comment-14332218 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 made the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever; to reuse that disk after it becomes good again, the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
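The improvement amounts to re-testing every configured directory on the health-check timer, not only the ones currently marked good. A self-contained sketch of that idea; the class and helper names are hypothetical, not the actual LocalDirsHandlerService code:
{code:title=sketch: re-checking previously failed dirs}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: 'verifyDirUsability' and the field names are
// illustrative only.
class DirRecheckSketch {
  private final List<String> allConfiguredDirs;
  private volatile List<String> goodDirs = new ArrayList<String>();
  private volatile List<String> failedDirs = new ArrayList<String>();

  DirRecheckSketch(List<String> configuredDirs) {
    this.allConfiguredDirs = configuredDirs;
  }

  // Run on a timer: re-test every configured dir, including previously
  // failed ones, so disks that become good again are put back into use.
  void checkDirs() {
    List<String> good = new ArrayList<String>();
    List<String> failed = new ArrayList<String>();
    for (String dir : allConfiguredDirs) {
      if (verifyDirUsability(dir)) {
        good.add(dir);
      } else {
        failed.add(dir);
      }
    }
    goodDirs = good;
    failedDirs = failed;
  }

  private boolean verifyDirUsability(String dir) {
    File f = new File(dir);
    return (f.isDirectory() || f.mkdirs()) && f.canRead() && f.canWrite();
  }
}
{code}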
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332298#comment-14332298 ] zhihai xu commented on YARN-3242: - I uploaded a new patch, YARN-3242.002.patch, which adds a test case: a Disconnected event from an old client session won't affect the current client session. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
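The guard being discussed can be sketched as follows: tie each watcher to the ZooKeeper handle it was registered for, and drop events whose watcher is no longer the active one. This is an assumed shape for illustration, not the committed patch; the class and field names are hypothetical.
{code:title=sketch: ignoring watcher events from a stale session}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Each new ZooKeeper handle gets its own forwarding watcher.
class StoreWatcher implements Watcher {
  private final StateStoreSketch store;

  StoreWatcher(StateStoreSketch store) {
    this.store = store;
  }

  @Override
  public void process(WatchedEvent event) {
    store.processWatchEvent(this, event);
  }
}

class StateStoreSketch {
  // Updated whenever a new ZooKeeper handle is created; the old watcher
  // keeps pointing at this store but is no longer the active one.
  private volatile StoreWatcher activeWatcher;

  void newSession() {
    activeWatcher = new StoreWatcher(this);
    // new ZooKeeper(connectString, sessionTimeout, activeWatcher) goes here
  }

  synchronized void processWatchEvent(StoreWatcher source, WatchedEvent event) {
    if (source != activeWatcher) {
      // Event delivered by an old, already-closed session: ignore it so it
      // cannot null out zkClient for the live session.
      return;
    }
    // handle Disconnected / SyncConnected / Expired for the live session
  }
}
{code}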
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Attachment: YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. --- Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore when the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore when the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. --- Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332344#comment-14332344 ] Hadoop QA commented on YARN-3242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700107/YARN-3242.002.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6693//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6693//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6693//console This message is automatically generated. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ",
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332353#comment-14332353 ] zhihai xu commented on YARN-3242: - I checked the findbugs warning messages; all of these findbugs warnings are related to my change. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until
[jira] [Resolved] (YARN-1778) TestFSRMStateStore fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-1778. -- Resolution: Duplicate This problem will be fixed in YARN-2820. Closing this as a duplicate. TestFSRMStateStore fails on trunk - Key: YARN-1778 URL: https://issues.apache.org/jira/browse/YARN-1778 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: zhihai xu Attachments: YARN-1778.000.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332443#comment-14332443 ] zhihai xu commented on YARN-3242: - I uploaded a new patch, YARN-3242.003.patch, which adds more test cases that send watcher events to both the previous and the current client session. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Attachment: YARN-3242.003.patch Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3154) Should not upload partial logs for MR jobs or other 'short-running' applications
[ https://issues.apache.org/jira/browse/YARN-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332501#comment-14332501 ] Hadoop QA commented on YARN-3154: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700128/YARN-3154.2.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6695//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6695//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6695//console This message is automatically generated. Should not upload partial logs for MR jobs or other 'short-running' applications - Key: YARN-3154 URL: https://issues.apache.org/jira/browse/YARN-3154 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3154.1.patch, YARN-3154.2.patch Currently, if we are running an MR job and we do not set the log interval properly, its partial logs will be uploaded and then removed from the local filesystem, which is not right. We should only upload partial logs for LRS applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
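The distinction the patch needs is whether the application opted in to rolling log aggregation. One hedged way to express the condition (illustrative only; the string below is the NM rolling-interval property, and the method shape is an assumption, not the committed change):
{code:title=sketch: gate partial-log upload on the rolling interval}
import org.apache.hadoop.conf.Configuration;

public final class PartialLogUploadSketch {
  private PartialLogUploadSketch() {}

  // Assumption for illustration: an app counts as long-running (LRS) for
  // log purposes only when a positive rolling interval is configured.
  public static boolean shouldUploadLogsNow(Configuration conf,
      boolean appFinished) {
    long rollIntervalSecs = conf.getLong(
        "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds",
        -1);
    // Short-running jobs (MR etc.) upload once, after the app finishes, so
    // local logs are never deleted while the app might still need them.
    return appFinished || rollIntervalSecs > 0;
  }
}
{code}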
[jira] [Commented] (YARN-3239) WebAppProxy does not support a final tracking url which has query fragments and params
[ https://issues.apache.org/jira/browse/YARN-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332516#comment-14332516 ] Hadoop QA commented on YARN-3239: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700132/YARN-3239.1.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6696//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6696//console This message is automatically generated. WebAppProxy does not support a final tracking url which has query fragments and params --- Key: YARN-3239 URL: https://issues.apache.org/jira/browse/YARN-3239 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-3239.1.patch Examples of failures: Expected: {{http://uihost:8080/#/main/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424384418229_0005}} Actual: {{http://uihost:8080}} Tried with a minor change to remove the #. Saw a different issue: Expected: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424388018547_0001}} Actual: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez/}} yarn application -status appId returns the expected value correctly. However, invoking an http get on http://rm:8088/proxy/appId/ returns the wrong value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3154) Should not upload partial logs for MR jobs or other 'short-running' applications
[ https://issues.apache.org/jira/browse/YARN-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3154: Attachment: YARN-3154.2.patch Updated the code based on the latest trunk. Should not upload partial logs for MR jobs or other 'short-running' applications - Key: YARN-3154 URL: https://issues.apache.org/jira/browse/YARN-3154 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3154.1.patch, YARN-3154.2.patch Currently, if we are running an MR job and we do not set the log interval properly, its partial logs will be uploaded and then removed from the local filesystem, which is not right. We should only upload partial logs for LRS applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332483#comment-14332483 ] Hadoop QA commented on YARN-3242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700122/YARN-3242.003.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6694//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6694//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6694//console This message is automatically generated. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while
[jira] [Updated] (YARN-3239) WebAppProxy does not support a final tracking url which has query fragments and params
[ https://issues.apache.org/jira/browse/YARN-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3239: -- Attachment: YARN-3239.1.patch Uploaded a patch to fix the issue by appending user-provided path and query parameters to the registered tracking url. WebAppProxy does not support a final tracking url which has query fragments and params --- Key: YARN-3239 URL: https://issues.apache.org/jira/browse/YARN-3239 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-3239.1.patch Examples of failures: Expected: {{http://uihost:8080/#/main/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424384418229_0005}} Actual: {{http://uihost:8080}} Tried with a minor change to remove the #. Saw a different issue: Expected: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424388018547_0001}} Actual: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez/}} yarn application -status appId returns the expected value correctly. However, invoking an http get on http://rm:8088/proxy/appId/ returns the wrong value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
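For reference, the appending behavior can be sketched with plain java.net.URI handling: take the registered tracking URI and merge in the extra path and query from the proxied request. The method shape and names are illustrative, not the committed WebAppProxyServlet change.
{code:title=sketch: merging user path/query into the tracking URI}
import java.net.URI;
import java.net.URISyntaxException;

public final class TrackingUriSketch {
  private TrackingUriSketch() {}

  // Append the extra path and query from the proxied request to the app's
  // registered tracking URI, keeping the URI's own fragment intact.
  public static URI buildFinalUri(URI trackingUri, String userPath,
      String userQuery) throws URISyntaxException {
    String path = trackingUri.getPath() == null ? "" : trackingUri.getPath();
    if (userPath != null && !userPath.isEmpty()) {
      path = path + userPath;
    }
    String query = (userQuery != null) ? userQuery : trackingUri.getQuery();
    return new URI(trackingUri.getScheme(), trackingUri.getAuthority(),
        path, query, trackingUri.getFragment());
  }
}
{code}
With this shape, a GET on http://rm:8088/proxy/appId/some/path?x=y would forward /some/path and x=y to the tracking URL instead of dropping them, which is the behavior the examples above expect.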