[jira] [Commented] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332142#comment-14332142 ] Bartosz Ługowski commented on YARN-1621:
I extracted the {{yarn container}} method from ApplicationCLI into a new file (ContainerCLI) and changed {{yarn container -list Application Attempt ID}} to {{yarn container -list Application ID|Application Attempt ID}}. Container state filtering is now done on the server side (not in the CLI, as suggested by [~Naganarasimha]) and works for both applications and application attempts. I didn't include changes in {{yarn.cmd}} because it breaks the patch (git can't apply it).
Add CLI to list rows of task attempt ID, container ID, host of container, state of container
Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Fix For: 2.7.0 Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch
As more applications are moved to YARN, we need a generic CLI to list rows of task attempt ID, container ID, host of container, and state of container. Today, if a YARN application running in a container hangs, there is no way to find out more info because a user does not know where each attempt is running. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers -applicationId <appId> [-containerState <state of container>]

where containerState is an optional filter to list containers in the given state only.
The container state can be running/succeeded/killed/failed/all.
A user can specify more than one container state at once, e.g. KILLED,FAILED.

<task attempt ID> <container ID> <host of container> <state of container>
{code}
The CLI should work with both running and completed applications. If a container runs many task attempts, all attempts should be shown. That will likely be the case for Tez container-reuse applications.
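A listing close to what the proposal describes is already roughly possible through the existing YarnClient API; the sketch below filters on the client side, whereas the patch pushes the state filter to the server. The ConverterUtils parsing and the RUNNING-only filter are illustrative assumptions, not part of the patch:
{code:title=illustrative YarnClient sketch}
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptReport;
import org.apache.hadoop.yarn.api.records.ContainerReport;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class ListContainersSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    // Illustrative filter standing in for "-containerState RUNNING"
    EnumSet<ContainerState> filter = EnumSet.of(ContainerState.RUNNING);
    for (ApplicationAttemptReport attempt :
        client.getApplicationAttempts(ConverterUtils.toApplicationId(args[0]))) {
      for (ContainerReport c : client.getContainers(attempt.getApplicationAttemptId())) {
        if (filter.contains(c.getContainerState())) {
          // one row per container: ID, host, state
          System.out.println(c.getContainerId() + "\t" + c.getAssignedNode()
              + "\t" + c.getContainerState());
        }
      }
    }
    client.stop();
  }
}
{code}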
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332137#comment-14332137 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
* hadoop-yarn-project/CHANGES.txt
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332132#comment-14332132 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332136#comment-14332136 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
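To see why the compounding matters, a back-of-the-envelope sketch: one YARN-level retry can hide dozens of IPC-level connect attempts. The 45 IPC retries match the description; the 20-second per-attempt connect timeout and the YARN policy retry count are assumed illustrative values, not read from any configuration:
{code:title=illustrative retry compounding}
public class RetryCompounding {
  public static void main(String[] args) {
    // ipc.client.connect.max.retries.on.timeouts default, per the description
    int ipcConnectRetries = 45;
    int connectTimeoutSec = 20;   // assumed per-attempt connect timeout
    int yarnPolicyRetries = 30;   // assumed retries in the NM proxy's RetryPolicy
    long perYarnRetrySec = (long) ipcConnectRetries * connectTimeoutSec;
    System.out.println("One YARN-level retry hides ~" + perYarnRetrySec
        + "s of IPC-level retrying");
    System.out.println("Worst case before the call finally fails: ~"
        + (perYarnRetrySec * yarnPolicyRetries) / 60 + " minutes");
  }
}
{code}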
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332138#comment-14332138 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/112/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332077#comment-14332077 ] zhihai xu commented on YARN-3242:
I found that oldZkClient is not useful any more; the added activeZkClient can replace it. I uploaded a new patch, YARN-3242.001.patch, which removes oldZkClient.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
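The shape of the missing check, as a minimal sketch (assumed structure for illustration, not the actual YARN-3242 patch): remember which ZooKeeper client is currently active and have the watcher drop events that belong to an older, already-replaced session:
{code:title=illustrative session-aware watcher}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class SessionAwareWatcher implements Watcher {
  // the session this watcher was created for; assigned right after the client is built
  private volatile ZooKeeper owner;
  // the store's current session; updated whenever a new client replaces an old one
  private volatile ZooKeeper activeZkClient;

  void setOwner(ZooKeeper zk) { this.owner = zk; }

  void setActive(ZooKeeper zk) { this.activeZkClient = zk; }

  @Override
  public void process(WatchedEvent event) {
    if (owner == null || owner != activeZkClient) {
      // Stale event: ZooKeeper's EventThread keeps draining waitingEvents even
      // after close(), so events from a replaced session are dropped here.
      return;
    }
    // ... handle SyncConnected / Disconnected / Expired for the active session ...
  }
}
{code}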
[jira] [Commented] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332105#comment-14332105 ] Hadoop QA commented on YARN-3242:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700085/YARN-3242.001.patch against trunk revision fe7a302.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6691//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6691//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6691//console
This message is automatically generated.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332133#comment-14332133 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332131#comment-14332131 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Yarn-trunk #846 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/846/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332172#comment-14332172 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332174#comment-14332174 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332173#comment-14332173 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Hdfs-trunk #2044 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2044/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
* hadoop-yarn-project/CHANGES.txt
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Updated] (YARN-1621) Add CLI to list rows of task attempt ID, container ID, host of container, state of container
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bartosz Ługowski updated YARN-1621:
Attachment: YARN-1621.4.patch
Add CLI to list rows of task attempt ID, container ID, host of container, state of container
Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Assignee: Bartosz Ługowski Fix For: 2.7.0 Attachments: YARN-1621.1.patch, YARN-1621.2.patch, YARN-1621.3.patch, YARN-1621.4.patch
As more applications are moved to YARN, we need a generic CLI to list rows of task attempt ID, container ID, host of container, and state of container. Today, if a YARN application running in a container hangs, there is no way to find out more info because a user does not know where each attempt is running. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers -applicationId <appId> [-containerState <state of container>]

where containerState is an optional filter to list containers in the given state only.
The container state can be running/succeeded/killed/failed/all.
A user can specify more than one container state at once, e.g. KILLED,FAILED.

<task attempt ID> <container ID> <host of container> <state of container>
{code}
The CLI should work with both running and completed applications. If a container runs many task attempts, all attempts should be shown. That will likely be the case for Tez container-reuse applications.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332180#comment-14332180 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
* hadoop-yarn-project/CHANGES.txt
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332178#comment-14332178 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332179#comment-14332179 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #103 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/103/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332189#comment-14332189 ] Hudson commented on YARN-2797:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java
TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch
[jira] [Commented] (YARN-3237) AppLogAggregatorImpl fails to log error cause
[ https://issues.apache.org/jira/browse/YARN-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332197#comment-14332197 ] Hudson commented on YARN-3237:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3237. AppLogAggregatorImpl fails to log error cause. Contributed by (xgong: rev f56c65bb3eb9436b67de2df63098e26589e70e56)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* hadoop-yarn-project/CHANGES.txt
AppLogAggregatorImpl fails to log error cause
Key: YARN-3237 URL: https://issues.apache.org/jira/browse/YARN-3237 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Fix For: 2.7.0 Attachments: YARN-3237-v2.patch, YARN-3237.patch
AppLogAggregatorImpl fails to log the cause of the error if it is unable to create the LogWriter. Below is the log output:
[LogAggregationService #24011] ERROR logaggregation.AppLogAggregatorImpl: Cannot create writer for app app_id. Disabling log-aggregation for this app.
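The fix amounts to passing the caught exception to the logger; a minimal illustration (the helper and appId below are hypothetical stand-ins, not the actual AppLogAggregatorImpl code):
{code:title=illustrative logging-with-cause example}
import java.io.IOException;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class LogCauseExample {
  private static final Log LOG = LogFactory.getLog(LogCauseExample.class);

  void startAggregation(String appId) {
    try {
      createLogWriter();  // hypothetical stand-in for the real writer creation
    } catch (IOException e) {
      // Passing 'e' as the second argument prints the cause and stack trace;
      // omitting it produces the bare, unhelpful message quoted in the report.
      LOG.error("Cannot create writer for app " + appId
          + ". Disabling log-aggregation for this app.", e);
    }
  }

  private void createLogWriter() throws IOException {
    throw new IOException("simulated failure");
  }
}
{code}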
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332200#comment-14332200 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332198#comment-14332198 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
* hadoop-yarn-project/CHANGES.txt
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90
[ https://issues.apache.org/jira/browse/YARN-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332188#comment-14332188 ] Hudson commented on YARN-2799:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
cleanup TestLogAggregationService based on the change in YARN-90
Key: YARN-2799 URL: https://issues.apache.org/jira/browse/YARN-2799 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.7.0 Attachments: YARN-2799.000.patch, YARN-2799.001.patch, YARN-2799.002.patch
cleanup TestLogAggregationService based on the change in YARN-90. The following code was added to setup in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
In this case, we should remove all this code from each test function to avoid duplicate code. The same goes for dispatcher.stop(), which is in tearDown: we can remove dispatcher.stop() from each test function as well, because it will always be called from tearDown for each test.
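The pattern being applied, as a minimal sketch (class and type names assumed for illustration; the real test registers more event types and uses the project's own helpers):
{code:title=illustrative setup/tearDown dedup}
import static org.mockito.Mockito.mock;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class SetupTeardownSketch {
  private DrainDispatcher dispatcher;
  private EventHandler appEventHandler;

  private DrainDispatcher createDispatcher() {
    DrainDispatcher d = new DrainDispatcher();
    d.init(new Configuration());
    d.start();
    return d;
  }

  @Before
  public void setup() {
    // shared wiring hoisted out of the individual tests
    dispatcher = createDispatcher();
    appEventHandler = mock(EventHandler.class);
    dispatcher.register(ApplicationEventType.class, appEventHandler);
  }

  @After
  public void tearDown() {
    // runs after every test, so no test needs its own dispatcher.stop()
    dispatcher.stop();
  }

  @Test
  public void testUsesSharedDispatcher() {
    // test body uses 'dispatcher' and 'appEventHandler' directly
  }
}
{code}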
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Summary: Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. (was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.)
Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332190#comment-14332190 ] Hudson commented on YARN-90:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
NodeManager should identify failed disks becoming good again
Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch
MAPREDUCE-3121 makes the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever. To reuse that disk (after it becomes good), the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back).
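The behavior YARN-90 adds, reduced to a sketch (assumed structure for illustration; the real change lives in the NM's directory-handling code and uses fuller health checks than plain permission probes):
{code:title=illustrative failed-disk recheck}
import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class DiskRecheckSketch {
  private final List<String> goodDirs = new ArrayList<String>();
  private final List<String> failedDirs = new ArrayList<String>();

  // Called periodically; a directory that became healthy again is moved back
  // to the good list so the NM can reuse it without a restart.
  synchronized void recheckFailedDirs() {
    for (Iterator<String> it = failedDirs.iterator(); it.hasNext(); ) {
      File dir = new File(it.next());
      if (dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute()) {
        it.remove();
        goodDirs.add(dir.getPath());
      }
    }
  }
}
{code}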
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}
{code}
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch
The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown.
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242:
Attachment: YARN-3242.002.patch
Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch
Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. The watcher event from an old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This will cause a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore but we can have multiple ZK client sessions. Currently ZKRMStateStore#processWatchEvent doesn't check whether the watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive the SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send a SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until RM shutdown. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
[jira] [Commented] (YARN-3194) RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332196#comment-14332196 ] Hudson commented on YARN-3194:
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/])
YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
* hadoop-yarn-project/CHANGES.txt
RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.7.0 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch
On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as a new node or as a reconnecting node, and triggers the corresponding event based on the node-added or node-reconnected state.
# Node added event: again, 2 scenarios can occur
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
# Node reconnected event:
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART enabled:
#### Some applications are running on this node – *the problem is here*
#### Zero applications are running on this node – NOT A PROBLEM
Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending container requests are not served by the RM.
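What "handle the NMContainerStatuses" means in practice, sketched under assumptions (the actual fix threads the statuses through RMNodeReconnectEvent into RMNodeImpl; the handler and hook below are illustrative, not the real RM code):
{code:title=illustrative reconnect handling}
import java.util.List;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerState;
import org.apache.hadoop.yarn.server.api.protocolrecords.NMContainerStatus;

class ReconnectHandlingSketch {
  // On a reconnect registration, surface the containers the NM reports as
  // finished so the scheduler can release their resources.
  void handleReconnect(List<NMContainerStatus> statuses) {
    for (NMContainerStatus s : statuses) {
      if (s.getContainerState() == ContainerState.COMPLETE) {
        notifyContainerCompleted(s.getContainerId());
      }
    }
  }

  void notifyContainerCompleted(ContainerId id) {
    // hypothetical hook: in the real RM this flows through node/app events
  }
}
{code}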
[jira] [Commented] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels
[ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332226#comment-14332226 ] Hudson commented on YARN-3238:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3238. Connection timeouts to nodemanagers are retried at multiple (xgong: rev 92d67ace3248930c0c0335070cc71a480c566a36)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java
Connection timeouts to nodemanagers are retried at multiple levels
Key: YARN-3238 URL: https://issues.apache.org/jira/browse/YARN-3238 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3238.001.patch
The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created. This causes a two-level retry mechanism where the IPC layer has already retried quite a few times (45 by default) for each YARN RetryPolicy error that is retried. The end result is that NM clients can wait a very, very long time for the connection to finally fail.
[jira] [Commented] (YARN-3236) cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
[ https://issues.apache.org/jira/browse/YARN-3236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332228#comment-14332228 ] Hudson commented on YARN-3236:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3236. Cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. (xgong: rev e3d290244c8a39edc37146d992cf34e6963b6851)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY.
Key: YARN-3236 URL: https://issues.apache.org/jira/browse/YARN-3236 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: zhihai xu Assignee: zhihai xu Priority: Trivial Labels: cleanup, maintenance Fix For: 2.7.0 Attachments: YARN-3236.000.patch
cleanup RMAuthenticationFilter#AUTH_HANDLER_PROPERTY. RMAuthenticationFilter#AUTH_HANDLER_PROPERTY was added in YARN-2247, but the code that used AUTH_HANDLER_PROPERTY was removed in YARN-2656. We should remove it to avoid confusion, since it was only introduced for a very short time and no one uses it now.
[jira] [Commented] (YARN-3194) RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
[ https://issues.apache.org/jira/browse/YARN-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332224#comment-14332224 ] Hudson commented on YARN-3194:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/])
YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node
Key: YARN-3194 URL: https://issues.apache.org/jira/browse/YARN-3194 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: NM restart is enabled Reporter: Rohith Assignee: Rohith Priority: Blocker Fix For: 2.7.0 Attachments: 0001-YARN-3194.patch, 0001-yarn-3194-v1.patch
On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as a new node or as a reconnecting node, and triggers the corresponding event based on the node-added or node-reconnected state.
# Node added event: again, 2 scenarios can occur
## A new node is registering with a different ip:port – NOT A PROBLEM
## An old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
# Node reconnected event:
## An existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
### NM RESTART NOT enabled – NOT A PROBLEM
### NM RESTART enabled:
#### Some applications are running on this node – *the problem is here*
#### Zero applications are running on this node – NOT A PROBLEM
Since the NMContainerStatuses are not handled, the RM never gets to know about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. This results in applications waiting indefinitely because their pending container requests are not served by the RM.
[jira] [Commented] (YARN-2799) cleanup TestLogAggregationService based on the change in YARN-90
[ https://issues.apache.org/jira/browse/YARN-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332216#comment-14332216 ] Hudson commented on YARN-2799: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java cleanup TestLogAggregationService based on the change in YARN-90 Key: YARN-2799 URL: https://issues.apache.org/jira/browse/YARN-2799 Project: Hadoop YARN Issue Type: Improvement Components: test Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Fix For: 2.7.0 Attachments: YARN-2799.000.patch, YARN-2799.001.patch, YARN-2799.002.patch cleanup TestLogAggregationService based on the change in YARN-90. The following code was added to setup in YARN-90:
{code}
dispatcher = createDispatcher();
appEventHandler = mock(EventHandler.class);
dispatcher.register(ApplicationEventType.class, appEventHandler);
{code}
Given that, we should remove all this code from each test function to avoid duplication. The same goes for dispatcher.stop(), which is in tearDown: we can remove dispatcher.stop() from each test function as well, because tearDown will always call it for each test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
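As a concrete illustration of the cleanup, here is a sketch of the intended test shape. It is illustrative only: it uses a plain DrainDispatcher in place of the test's createDispatcher() helper, and assumes JUnit 4 plus Mockito as the test already does.
{code:title=sketch: rely on setup/tearDown instead of per-test duplication}
import static org.mockito.Mockito.mock;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.DrainDispatcher;
import org.apache.hadoop.yarn.event.EventHandler;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEvent;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestLogAggregationServiceSketch {
  private DrainDispatcher dispatcher;
  private EventHandler<ApplicationEvent> appEventHandler;

  @Before
  @SuppressWarnings("unchecked")
  public void setup() {
    dispatcher = new DrainDispatcher();   // stand-in for createDispatcher()
    dispatcher.init(new Configuration());
    dispatcher.start();
    appEventHandler = mock(EventHandler.class);
    dispatcher.register(ApplicationEventType.class, appEventHandler);
  }

  @After
  public void tearDown() {
    dispatcher.stop();   // runs after every test, so tests need not call it
  }

  @Test
  public void someAggregationTest() {
    // body uses 'dispatcher' and 'appEventHandler' directly; no local
    // dispatcher setup and no trailing dispatcher.stop() required
  }
}
{code}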
[jira] [Commented] (YARN-3230) Clarify application states on the web UI
[ https://issues.apache.org/jira/browse/YARN-3230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332229#comment-14332229 ] Hudson commented on YARN-3230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-3230. Clarify application states on the web UI. (Jian He via wangda) (wangda: rev ce5bf927c3d9f212798de1bf8706e5e9def235a1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppBlock.java Clarify application states on the web UI Key: YARN-3230 URL: https://issues.apache.org/jira/browse/YARN-3230 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Fix For: 2.7.0 Attachments: YARN-3230.1.patch, YARN-3230.2.patch, YARN-3230.3.patch, YARN-3230.3.patch, application page.png Today, the application state is surfaced as a single word on the web UI, and not everyone understands the meaning of NEW_SAVING, SUBMITTED, or ACCEPTED. This JIRA is to clarify the meaning of these states, e.g. what the application is waiting for in each state. In addition, the difference between application state and FinalStatus is fairly confusing to users, especially when state=FINISHED but FinalStatus=FAILED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
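The kind of clarification in question could be rendered by mapping each state to a one-line explanation next to the raw state name. A hypothetical sketch follows; the class and wording are illustrative, not the committed patch.
{code:title=sketch: human-readable hints for application states}
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public final class AppStateHintsSketch {
  private AppStateHintsSketch() {}

  public static String hint(YarnApplicationState state) {
    switch (state) {
      case NEW_SAVING:
        return "Application is saving its state to the RM state-store.";
      case SUBMITTED:
        return "Application is submitted, waiting to be accepted by the scheduler.";
      case ACCEPTED:
        return "Application is accepted, waiting for its ApplicationMaster container to be allocated.";
      case RUNNING:
        return "ApplicationMaster has registered and the application is running.";
      default:
        return "";
    }
  }
}
{code}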
[jira] [Commented] (YARN-3237) AppLogAggregatorImpl fails to log error cause
[ https://issues.apache.org/jira/browse/YARN-3237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332225#comment-14332225 ] Hudson commented on YARN-3237: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-3237. AppLogAggregatorImpl fails to log error cause. Contributed by (xgong: rev f56c65bb3eb9436b67de2df63098e26589e70e56) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java * hadoop-yarn-project/CHANGES.txt AppLogAggregatorImpl fails to log error cause - Key: YARN-3237 URL: https://issues.apache.org/jira/browse/YARN-3237 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Rushabh S Shah Assignee: Rushabh S Shah Fix For: 2.7.0 Attachments: YARN-3237-v2.patch, YARN-3237.patch AppLogAggregatorImpl fails to log the error if it is unable to create LogWriter. Below is the log output: [LogAggregationService #24011] ERROR logaggregation.AppLogAggregatorImpl: Cannot create writer for app app_id. Disabling log-aggregation for this app. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
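The fix pattern here is simply to pass the caught exception as the last argument to the logger so the cause and stack trace are preserved. A minimal sketch, with the method, fields, and createLogWriter() helper as stand-ins rather than the exact AppLogAggregatorImpl code:
{code:title=sketch: preserving the cause when logging}
// Illustrative pattern only; the names below are assumptions.
private void startWriter() {
  try {
    this.writer = createLogWriter();
  } catch (IOException e) {
    // Before the fix, the exception was swallowed: LOG.error(msg) carried no
    // cause. Passing 'e' as the last argument logs the stack trace as well.
    LOG.error("Cannot create writer for app " + this.applicationId
        + ". Disabling log-aggregation for this app.", e);
    return;
  }
}
{code}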
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332217#comment-14332217 ] Hudson commented on YARN-2797: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2797. TestWorkPreservingRMRestart should use (xgong: rev fe7a302473251b7310105a936edf220e401c613f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/ParameterizedSchedulerTestBase.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/CHANGES.txt TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Fix For: 2.7.0 Attachments: yarn-2797-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-90) NodeManager should identify failed disks becoming good again
[ https://issues.apache.org/jira/browse/YARN-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332218#comment-14332218 ] Hudson commented on YARN-90: FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/]) YARN-2799. Cleanup TestLogAggregationService based on the change in YARN-90. Contributed by Zhihai Xu (junping_du: rev c33ae271c24f0770c9735ccd2086cafda4f4e0b2) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java NodeManager should identify failed disks becoming good again Key: YARN-90 URL: https://issues.apache.org/jira/browse/YARN-90 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Ravi Gummadi Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: YARN-90.1.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, YARN-90.patch, apache-yarn-90.0.patch, apache-yarn-90.1.patch, apache-yarn-90.10.patch, apache-yarn-90.2.patch, apache-yarn-90.3.patch, apache-yarn-90.4.patch, apache-yarn-90.5.patch, apache-yarn-90.6.patch, apache-yarn-90.7.patch, apache-yarn-90.8.patch, apache-yarn-90.9.patch MAPREDUCE-3121 made the NodeManager identify disk failures. But once a disk goes down, it is marked as failed forever; to reuse that disk after it becomes good again, the NodeManager needs a restart. This JIRA is to improve the NodeManager to reuse good disks (which could have been bad some time back). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
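The improvement amounts to re-testing every configured directory on the health-check timer, not only the ones currently marked good. A self-contained sketch of that idea; the class and helper names are hypothetical, not the actual LocalDirsHandlerService code:
{code:title=sketch: re-checking previously failed dirs}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: 'verifyDirUsability' and the field names are
// illustrative only.
class DirRecheckSketch {
  private final List<String> allConfiguredDirs;
  private volatile List<String> goodDirs = new ArrayList<String>();
  private volatile List<String> failedDirs = new ArrayList<String>();

  DirRecheckSketch(List<String> configuredDirs) {
    this.allConfiguredDirs = configuredDirs;
  }

  // Run on a timer: re-test every configured dir, including previously
  // failed ones, so disks that become good again are put back into use.
  void checkDirs() {
    List<String> good = new ArrayList<String>();
    List<String> failed = new ArrayList<String>();
    for (String dir : allConfiguredDirs) {
      if (verifyDirUsability(dir)) {
        good.add(dir);
      } else {
        failed.add(dir);
      }
    }
    goodDirs = good;
    failedDirs = failed;
  }

  private boolean verifyDirUsability(String dir) {
    File f = new File(dir);
    return (f.isDirectory() || f.mkdirs()) && f.canRead() && f.canWrite();
  }
}
{code}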
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332298#comment-14332298 ] zhihai xu commented on YARN-3242: - I uploaded a new patch, YARN-3242.002.patch, which adds a test case: a Disconnected event from an old client session won't affect the current client session. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
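The guard being discussed can be sketched as follows: tie each watcher to the ZooKeeper handle it was registered for, and drop events whose watcher is no longer the active one. This is an assumed shape for illustration, not the committed patch; the class and field names are hypothetical.
{code:title=sketch: ignoring watcher events from a stale session}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Each new ZooKeeper handle gets its own forwarding watcher.
class StoreWatcher implements Watcher {
  private final StateStoreSketch store;

  StoreWatcher(StateStoreSketch store) {
    this.store = store;
  }

  @Override
  public void process(WatchedEvent event) {
    store.processWatchEvent(this, event);
  }
}

class StateStoreSketch {
  // Updated whenever a new ZooKeeper handle is created; the old watcher
  // keeps pointing at this store but is no longer the active one.
  private volatile StoreWatcher activeWatcher;

  void newSession() {
    activeWatcher = new StoreWatcher(this);
    // new ZooKeeper(connectString, sessionTimeout, activeWatcher) goes here
  }

  synchronized void processWatchEvent(StoreWatcher source, WatchedEvent event) {
    if (source != activeWatcher) {
      // Event delivered by an old, already-closed session: ignore it so it
      // cannot null out zkClient for the live session.
      return;
    }
    // handle Disconnected / SyncConnected / Expired for the live session
  }
}
{code}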
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Attachment: YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. --- Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore when the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore when the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. --- Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332344#comment-14332344 ] Hadoop QA commented on YARN-3242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700107/YARN-3242.002.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6693//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6693//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6693//console This message is automatically generated. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ",
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332353#comment-14332353 ] zhihai xu commented on YARN-3242: - I checked the findbugs warning messages; all of these findbugs warnings are related to my change. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Description: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
was: Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until
[jira] [Resolved] (YARN-1778) TestFSRMStateStore fails on trunk
[ https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-1778. -- Resolution: Duplicate This problem will be fixed in YARN-2820. Closing this as a duplicate. TestFSRMStateStore fails on trunk - Key: YARN-1778 URL: https://issues.apache.org/jira/browse/YARN-1778 Project: Hadoop YARN Issue Type: Test Reporter: Xuan Gong Assignee: zhihai xu Attachments: YARN-1778.000.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332443#comment-14332443 ] zhihai xu commented on YARN-3242: - I uploaded a new patch, YARN-3242.003.patch, which adds more test cases that send watcher events to both the previous and the current client session. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3242: Attachment: YARN-3242.003.patch Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while calling watcher ", t);
        }
      }
    } else {

public void disconnect() {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Disconnecting client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  sendThread.close();
  eventThread.queueEventOfDeath();
}

public void close() throws IOException {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Closing client for session: 0x"
        + Long.toHexString(getSessionId()));
  }
  try {
    RequestHeader h = new RequestHeader();
    h.setType(ZooDefs.OpCode.closeSession);
    submitRequest(h, null, null, null);
  } catch (InterruptedException e) {
    // ignore, close the send/event threads
  } finally {
    disconnect();
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3154) Should not upload partial logs for MR jobs or other 'short-running' applications
[ https://issues.apache.org/jira/browse/YARN-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332501#comment-14332501 ] Hadoop QA commented on YARN-3154: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700128/YARN-3154.2.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6695//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6695//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6695//console This message is automatically generated. Should not upload partial logs for MR jobs or other 'short-running' applications - Key: YARN-3154 URL: https://issues.apache.org/jira/browse/YARN-3154 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3154.1.patch, YARN-3154.2.patch Currently, if we are running an MR job and we do not set the log interval properly, its partial logs will be uploaded and then removed from the local filesystem, which is not right. We should only upload partial logs for LRS applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
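The distinction the patch needs is whether the application opted in to rolling log aggregation. One hedged way to express the condition (illustrative only; the string below is the NM rolling-interval property, and the method shape is an assumption, not the committed change):
{code:title=sketch: gate partial-log upload on the rolling interval}
import org.apache.hadoop.conf.Configuration;

public final class PartialLogUploadSketch {
  private PartialLogUploadSketch() {}

  // Assumption for illustration: an app counts as long-running (LRS) for
  // log purposes only when a positive rolling interval is configured.
  public static boolean shouldUploadLogsNow(Configuration conf,
      boolean appFinished) {
    long rollIntervalSecs = conf.getLong(
        "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds",
        -1);
    // Short-running jobs (MR etc.) upload once, after the app finishes, so
    // local logs are never deleted while the app might still need them.
    return appFinished || rollIntervalSecs > 0;
  }
}
{code}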
[jira] [Commented] (YARN-3239) WebAppProxy does not support a final tracking url which has query fragments and params
[ https://issues.apache.org/jira/browse/YARN-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332516#comment-14332516 ] Hadoop QA commented on YARN-3239: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700132/YARN-3239.1.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6696//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6696//console This message is automatically generated. WebAppProxy does not support a final tracking url which has query fragments and params --- Key: YARN-3239 URL: https://issues.apache.org/jira/browse/YARN-3239 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-3239.1.patch Examples of failures: Expected: {{http://uihost:8080/#/main/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424384418229_0005}} Actual: {{http://uihost:8080}} Tried with a minor change to remove the #. Saw a different issue: Expected: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424388018547_0001}} Actual: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez/}} yarn application -status appId returns the expected value correctly. However, invoking an http get on http://rm:8088/proxy/appId/ returns the wrong value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3154) Should not upload partial logs for MR jobs or other 'short-running' applications
[ https://issues.apache.org/jira/browse/YARN-3154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3154: Attachment: YARN-3154.2.patch Updated the code based on the latest trunk. Should not upload partial logs for MR jobs or other 'short-running' applications - Key: YARN-3154 URL: https://issues.apache.org/jira/browse/YARN-3154 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3154.1.patch, YARN-3154.2.patch Currently, if we are running an MR job and we do not set the log interval properly, its partial logs will be uploaded and then removed from the local filesystem, which is not right. We should only upload partial logs for LRS applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3242) Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session.
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14332483#comment-14332483 ] Hadoop QA commented on YARN-3242: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12700122/YARN-3242.003.patch against trunk revision fe7a302. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6694//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6694//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6694//console This message is automatically generated. Old ZK client session watcher event causes ZKRMStateStore out of sync with current ZK client session due to ZooKeeper asynchronously closing client session. Key: YARN-3242 URL: https://issues.apache.org/jira/browse/YARN-3242 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session. A watcher event from the old ZK client session can still be sent to ZKRMStateStore after the old ZK client session is closed. This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session. We only have one ZKRMStateStore, but we can have multiple ZK client sessions. Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event is from the current session, so a watcher event from an old ZK client session that was just closed will still be processed. For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
{code}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{code}
Then ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again. Then we will see all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM is shut down. The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, the EventThread will still process all the events until the waitingEvents queue is empty:
{code}
while (true) {
  Object event = waitingEvents.take();
  if (event == eventOfDeath) {
    wasKilled = true;
  } else {
    processEvent(event);
  }
  if (wasKilled)
    synchronized (waitingEvents) {
      if (waitingEvents.isEmpty()) {
        isRunning = false;
        break;
      }
    }
}

private void processEvent(Object event) {
  try {
    if (event instanceof WatcherSetEventPair) {
      // each watcher will process the event
      WatcherSetEventPair pair = (WatcherSetEventPair) event;
      for (Watcher watcher : pair.watchers) {
        try {
          watcher.process(pair.event);
        } catch (Throwable t) {
          LOG.error("Error while
[jira] [Updated] (YARN-3239) WebAppProxy does not support a final tracking url which has query fragments and params
[ https://issues.apache.org/jira/browse/YARN-3239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3239: -- Attachment: YARN-3239.1.patch Uploaded a patch to fix the issue by appending user-provided path and query parameters to the registered tracking url. WebAppProxy does not support a final tracking url which has query fragments and params --- Key: YARN-3239 URL: https://issues.apache.org/jira/browse/YARN-3239 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Jian He Attachments: YARN-3239.1.patch Examples of failures: Expected: {{http://uihost:8080/#/main/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424384418229_0005}} Actual: {{http://uihost:8080}} Tried with a minor change to remove the #. Saw a different issue: Expected: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez?viewPath=%2F%23%2Ftez-app%2Fapplication_1424388018547_0001}} Actual: {{http://uihost:8080/views/TEZ/0.5.2.2.2.2.0-947/tez/}} yarn application -status appId returns the expected value correctly. However, invoking an http get on http://rm:8088/proxy/appId/ returns the wrong value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
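For reference, the appending behavior can be sketched with plain java.net.URI handling: take the registered tracking URI and merge in the extra path and query from the proxied request. The method shape and names are illustrative, not the committed WebAppProxyServlet change.
{code:title=sketch: merging user path/query into the tracking URI}
import java.net.URI;
import java.net.URISyntaxException;

public final class TrackingUriSketch {
  private TrackingUriSketch() {}

  // Append the extra path and query from the proxied request to the app's
  // registered tracking URI, keeping the URI's own fragment intact.
  public static URI buildFinalUri(URI trackingUri, String userPath,
      String userQuery) throws URISyntaxException {
    String path = trackingUri.getPath() == null ? "" : trackingUri.getPath();
    if (userPath != null && !userPath.isEmpty()) {
      path = path + userPath;
    }
    String query = (userQuery != null) ? userQuery : trackingUri.getQuery();
    return new URI(trackingUri.getScheme(), trackingUri.getAuthority(),
        path, query, trackingUri.getFragment());
  }
}
{code}
With this shape, a GET on http://rm:8088/proxy/appId/some/path?x=y would forward /some/path and x=y to the tracking URL instead of dropping them, which is the behavior the examples above expect.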