[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541730#comment-14541730
 ] 

Junping Du commented on YARN-3411:
--

Sure. I'll cancel the patch until we have a new version. Thanks!

 [Storage implementation] explore the native HBase write schema for storage
 --

 Key: YARN-3411
 URL: https://issues.apache.org/jira/browse/YARN-3411
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Sangjin Lee
Assignee: Vrushali C
Priority: Critical
 Attachments: ATSv2BackendHBaseSchemaproposal.pdf, 
 YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt, 
 YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.txt


 There is work that's in progress to implement the storage based on a Phoenix 
 schema (YARN-3134).
 In parallel, we would like to explore an implementation based on a native 
 HBase schema for the write path. Such a schema does not exclude using 
 Phoenix, especially for reads and offline queries.
 Once we have basic implementations of both options, we could evaluate them in 
 terms of performance, scalability, usability, etc. and make a call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)
Rohith created YARN-3640:


 Summary: NodeManager JVM continues to run after SHUTDOWN event is 
triggered.
 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith


We faced a strange issue in the cluster where the NodeManager did not exit when the 
SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread dump and 
examined it, but could not figure out why the NM JVM did not exit. 

Three NodeManagers hit this problem at the same time, and all three NodeManager 
thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541742#comment-14541742
 ] 

Rohith commented on YARN-3640:
--

It looks similar, but I did not understand how you took the *jni leveldb thread 
stack* mentioned in YARN-3585.

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS

2015-05-13 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-160:
---
Attachment: YARN-160.007.patch

Uploaded 007.patch which improves some logging.

 nodemanagers should obtain cpu/memory values from underlying OS
 ---

 Key: YARN-160
 URL: https://issues.apache.org/jira/browse/YARN-160
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Varun Vasudev
  Labels: BB2015-05-TBR
 Attachments: YARN-160.005.patch, YARN-160.006.patch, 
 YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, 
 apache-yarn-160.2.patch, apache-yarn-160.3.patch


 As mentioned in YARN-2
 *NM memory and CPU configs*
 Currently these values come from the NM's config; we should be able to obtain 
 them from the OS (i.e., in the case of Linux, from /proc/meminfo & /proc/cpuinfo). 
 As this is highly OS dependent, we should have an interface that obtains this 
 information. In addition, implementations of this interface should be able to 
 specify a mem/cpu offset (the amount of mem/cpu not to be made available as a 
 YARN resource); this would allow reserving mem/cpu for the OS and other services 
 outside of YARN containers.
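
For illustration only, a minimal sketch (not the actual NodeManager plugin; the class name and the 2048 MB offset are made-up examples) of reading the total memory from /proc/meminfo on Linux and subtracting a reserved offset:
{code}
// Hypothetical sketch: derive NM memory from the OS instead of a static config value.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class OsMemorySketch {
  /** Returns MemTotal from /proc/meminfo in megabytes, or -1 if it cannot be read. */
  static long physicalMemoryMb() {
    try {
      for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
        if (line.startsWith("MemTotal:")) {
          long kb = Long.parseLong(line.replaceAll("[^0-9]", ""));
          return kb / 1024;
        }
      }
    } catch (IOException e) {
      // fall through: not Linux, or /proc is unavailable
    }
    return -1;
  }

  public static void main(String[] args) {
    long reservedForOsMb = 2048;   // example offset only, not a real default
    long total = physicalMemoryMb();
    long forYarn = (total > 0) ? total - reservedForOsMb : -1;
    System.out.println("MemTotal=" + total + "MB, available to YARN=" + forYarn + "MB");
  }
}
{code}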



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541751#comment-14541751
 ] 

Peng Zhang commented on YARN-3640:
--

I used pstack to get it.

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated YARN-3638:
--
Component/s: yarn
 scheduler
 resourcemanager
 capacityscheduler

 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager, scheduler, yarn
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated YARN-3639:
--
Assignee: (was: Xianyin Xin)

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin

 If the node on which the active RM runs dies and the active namenode is running 
 on the same node, the new RM takes a long time to recover all apps. After 
 analysis, we found the root cause is renewing HDFS tokens during the recovery 
 process. The HDFS client created by the renewer first tries to connect to the 
 original namenode, which times out after 10~20s, and only then tries to connect 
 to the new namenode. The entire recovery costs about 15 * #apps seconds 
 according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3640:
-
Attachment: hadoop-rohith-nodemanager-test123.log

Attached the nodemanager log file. It contains the log from stopping the services.

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3634) TestMRTimelineEventHandling and TestApplication are broken

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541722#comment-14541722
 ] 

Junping Du commented on YARN-3634:
--

Thanks [~sjlee0] for reporting the issue and delivering the patch to fix it. The 
patch looks mostly good to me. Only one minor issue:
{code}
+    if (nmCollectorService == null) {
+      synchronized (this) {
+        Configuration conf = getConfig();
+        InetSocketAddress nmCollectorServiceAddress = conf.getSocketAddr(
+            YarnConfiguration.NM_BIND_HOST,
+            YarnConfiguration.NM_COLLECTOR_SERVICE_ADDRESS,
+            YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_ADDRESS,
+            YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_PORT);
+        LOG.info("nmCollectorServiceAddress: " + nmCollectorServiceAddress);
+        final YarnRPC rpc = YarnRPC.create(conf);
+
+        // TODO Security settings.
+        nmCollectorService = (CollectorNodemanagerProtocol) rpc.getProxy(
+            CollectorNodemanagerProtocol.class,
+            nmCollectorServiceAddress, conf);
+      }
+    }
{code}
The synchronized block seems unnecessary, as this is the only place that updates 
nmCollectorService, and it is called from serviceStart(), which is invoked by a 
single thread only. A race condition could happen with other reader threads, but 
given that the writer is always a single thread and nmCollectorService is already 
marked volatile in this patch, it should be safe to remove the synchronized block.
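
For illustration, a self-contained sketch of the single-writer/volatile pattern described above, with the Hadoop-specific types replaced by stand-ins (this is not the patch itself):
{code}
// Sketch only: a volatile field written by one starter thread and read by many threads.
public class SingleWriterLazyInit {
  // volatile guarantees readers see the fully constructed object once it is published
  private volatile Object proxy;

  // Called exactly once from the service-start thread; no synchronized block is needed
  // because there is only ever one writer.
  public void start() {
    if (proxy == null) {
      proxy = createProxy();
    }
  }

  // Reader threads either see null (not started yet) or the published proxy.
  public Object getProxy() {
    return proxy;
  }

  private Object createProxy() {
    return new Object();   // stands in for rpc.getProxy(...) in the real patch
  }

  public static void main(String[] args) {
    SingleWriterLazyInit service = new SingleWriterLazyInit();
    service.start();
    System.out.println(service.getProxy() != null);   // prints true
  }
}
{code}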


 TestMRTimelineEventHandling and TestApplication are broken
 --

 Key: YARN-3634
 URL: https://issues.apache.org/jira/browse/YARN-3634
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Attachments: YARN-3634-YARN-2928.001.patch, 
 YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch


 TestMRTimelineEventHandling is broken. Relevant error message:
 {noformat}
 2015-05-12 06:28:56,415 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:57,416 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:58,416 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:59,417 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:00,418 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:01,419 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:02,420 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:03,420 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:04,421 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry 
 policy is 

[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541743#comment-14541743
 ] 

Rohith commented on YARN-3640:
--

I missed this issue, thanks for pointing it out. 

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2921) MockRM#waitForState methods can be too slow and flaky

2015-05-13 Thread Tsuyoshi Ozawa (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi Ozawa updated YARN-2921:
-
Attachment: YARN-2921.008.patch

Resubmitting a patch again.

 MockRM#waitForState methods can be too slow and flaky
 -

 Key: YARN-2921
 URL: https://issues.apache.org/jira/browse/YARN-2921
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0, 2.7.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi Ozawa
 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
 YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
 YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
 YARN-2921.008.patch


 MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
 second). This leads to slow tests and sometimes failures if the 
 App/AppAttempt moves to another state. 
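
For illustration, a minimal sketch (not MockRM itself; the 50 ms interval and names are arbitrary examples) of waiting for a state by polling with a short sleep and an overall timeout instead of long fixed sleeps:
{code}
// Sketch only: poll frequently with a short sleep rather than sleeping 1-2 seconds per check.
import java.util.function.Supplier;

public class WaitForStateSketch {
  static boolean waitForState(Supplier<String> currentState, String expected,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      if (expected.equals(currentState.get())) {
        return true;          // state reached quickly, no long fixed sleep
      }
      Thread.sleep(50);       // short poll interval keeps tests fast
    }
    return false;             // caller can fail the test on timeout
  }

  public static void main(String[] args) throws InterruptedException {
    long start = System.nanoTime();
    System.out.println(waitForState(() -> "RUNNING", "RUNNING", 5000));
    System.out.println("waited ms: " + (System.nanoTime() - start) / 1_000_000);
  }
}
{code}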



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541727#comment-14541727
 ] 

Hadoop QA commented on YARN-2921:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 40s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   7m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 32s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 40s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 39s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests |  49m 46s | Tests passed in 
hadoop-yarn-server-resourcemanager. |
| | |  90m 20s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732511/YARN-2921.008.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 92c38e4 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-resourcemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7915/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7915/console |


This message was automatically generated.

 MockRM#waitForState methods can be too slow and flaky
 -

 Key: YARN-2921
 URL: https://issues.apache.org/jira/browse/YARN-2921
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0, 2.7.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi Ozawa
 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
 YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
 YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
 YARN-2921.008.patch


 MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
 second). This leads to slow tests and sometimes failures if the 
 App/AppAttempt moves to another state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)
Xianyin Xin created YARN-3639:
-

 Summary: It takes too long time for RM to recover all apps if the 
original active RM and namenode is deployed on the same node.
 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin
Assignee: Xianyin Xin


If the node on which the active RM runs dies and the active namenode is running 
on the same node, the new RM takes a long time to recover all apps. After 
analysis, we found the root cause is renewing HDFS tokens during the recovery 
process. The HDFS client created by the renewer first tries to connect to the 
original namenode, which times out after 10~20s, and only then tries to connect 
to the new namenode. The entire recovery costs about 15 * #apps seconds 
according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541590#comment-14541590
 ] 

Hadoop QA commented on YARN-2423:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12697840/YARN-2423.007.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / e82067b |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7914/console |


This message was automatically generated.

 TimelineClient should wrap all GET APIs to facilitate Java users
 

 Key: YARN-2423
 URL: https://issues.apache.org/jira/browse/YARN-2423
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Robert Kanter
  Labels: BB2015-05-TBR
 Attachments: YARN-2423.004.patch, YARN-2423.005.patch, 
 YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, 
 YARN-2423.patch


 TimelineClient provides the Java method to put timeline entities. It's also 
 good to wrap over all GET APIs (both entity and domain), and deserialize the 
 json response into Java POJO objects.
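
For illustration only, a rough sketch of what a Java GET wrapper could look like; the endpoint path, class names, and the trivial POJO are hypothetical simplifications, not the actual TimelineClient API:
{code}
// Hypothetical sketch of a GET wrapper that fetches an entity and hands back a POJO.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TimelineGetSketch {
  // Stand-in POJO; a real wrapper would map the JSON fields of a timeline entity.
  public static class EntityPojo {
    public String raw;   // raw JSON body kept here for simplicity
  }

  public static EntityPojo getEntity(String baseUrl, String entityType, String entityId)
      throws Exception {
    HttpRequest request = HttpRequest.newBuilder(
        URI.create(baseUrl + "/ws/v1/timeline/" + entityType + "/" + entityId))
        .GET().build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // A real implementation would use a JSON binder (e.g. Jackson) to build the POJO.
    EntityPojo pojo = new EntityPojo();
    pojo.raw = response.body();
    return pojo;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(getEntity("http://localhost:8188", "YARN_APPLICATION", "app_1").raw);
  }
}
{code}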



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2015-05-13 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541592#comment-14541592
 ] 

Tsuyoshi Ozawa commented on YARN-2336:
--

I see. Should we remove childQueue when childQueue is null, for consistency? 
CapacityScheduler doesn't return childQueues if the queue is null (empty).

 Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
 --

 Key: YARN-2336
 URL: https://issues.apache.org/jira/browse/YARN-2336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.1
Reporter: Kenji Kikushima
Assignee: Kenji Kikushima
  Labels: BB2015-05-RFC
 Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, 
 YARN-2336.005.patch, YARN-2336.patch


 When we have sub queues in Fair Scheduler, the REST API returns JSON with a 
 missing '[' bracket for childQueues.
 This issue was found by [~ajisakaa] in YARN-1050.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1039) Add parameter for YARN resource requests to indicate long lived

2015-05-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541612#comment-14541612
 ] 

Steve Loughran commented on YARN-1039:
--

+1 for a long-lived bit. Services can set the flag, and it is up to future 
versions of Hadoop to implement the logic to go with it. 

FWIW, I'd make the first use of the patch the YARN-1079 progress bar. 

Why? It's the least amount of server-side code change (no scheduling patches), 
it fixes a tangible problem for users (the progress bar is confusing), and it 
provides an immediate benefit to the apps, encouraging them to set the flag, 
maybe even via reflection if they want to stay compatible across Hadoop versions.

 Add parameter for YARN resource requests to indicate long lived
 -

 Key: YARN-1039
 URL: https://issues.apache.org/jira/browse/YARN-1039
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 3.0.0, 2.1.1-beta
Reporter: Steve Loughran
Assignee: Craig Welch
 Attachments: YARN-1039.1.patch, YARN-1039.2.patch, YARN-1039.3.patch


 A container request could support a new parameter long-lived. This could be 
 used by a scheduler that would know not to host the service on a transient 
 (cloud: spot priced) node.
 Schedulers could also decide whether or not to allocate multiple long-lived 
 containers on the same node



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3638) Yarn Scheduler show percentage of total cluster that a queue is using

2015-05-13 Thread Hari Sekhon (JIRA)
Hari Sekhon created YARN-3638:
-

 Summary: Yarn Scheduler show percentage of total cluster that a 
queue is using
 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor


Request to show % of total cluster resources each queue is consuming on the 
Yarn Resource Manager Scheduler page.

Currently the Yarn Resource Manager Scheduler page shows the % of total used 
for root queue and the % of each given queue's configured capacity that is used 
(often showing say 150% if the max capacity is greater than configured capacity 
to allow bursting where there are free resources). This is fine, but it would 
be good to additionally show the % of total cluster that each given queue is 
consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure

2015-05-13 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541705#comment-14541705
 ] 

Varun Vasudev commented on YARN-3591:
-

[~zxu], [~lavkesh] - instead of listing the directory contents every time, can we 
use the signalling mechanism that [~zxu] added in YARN-3491? When a local dir 
goes bad, the tracker's listener gets called and it removes all the localized 
resources from the data structure. That way we re-use the existing checks to make 
sure that a directory is good.

 Resource Localisation on a bad disk causes subsequent containers failure 
 -

 Key: YARN-3591
 URL: https://issues.apache.org/jira/browse/YARN-3591
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Lavkesh Lahngir
 Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, 
 YARN-3591.2.patch


 It happens when a resource is localised on a disk and, after localisation, that 
 disk has gone bad. The NM keeps paths for localised resources in memory. At the 
 time of a resource request, isResourcePresent(rsrc) is called, which calls 
 file.exists() on the localised path.
 In some cases when the disk has gone bad, inodes are still cached and 
 file.exists() returns true, but at the time of reading, the file will not open.
 Note: file.exists() actually calls stat64 natively, which returns true because 
 it was able to find inode information from the OS.
 A proposal is to call file.list() on the parent path of the resource, which 
 calls open() natively. If the disk is good, it should return an array of paths 
 with length at least 1.
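
For illustration, a minimal sketch of the proposed check (the class and method names are hypothetical, not the actual NM code): detect the resource by listing its parent directory, which forces a native open(), instead of trusting file.exists() alone:
{code}
// Sketch only: list() on the parent directory tends to fail on a bad disk even
// when cached inode data lets exists() return true.
import java.io.File;

public class ResourcePresenceCheck {
  static boolean isResourcePresent(File localizedPath) {
    File parent = localizedPath.getParentFile();
    String[] entries = (parent == null) ? null : parent.list();
    return entries != null && entries.length >= 1 && localizedPath.exists();
  }

  public static void main(String[] args) {
    System.out.println(isResourcePresent(new File("/tmp/some-localized-file")));
  }
}
{code}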



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith updated YARN-3640:
-
Attachment: nm_143.out
nm_141.out

Attaching the thread dumps of the 2 NMs. 

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: nm_141.out, nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated YARN-3638:
--
Summary: Yarn Resource Manager Scheduler page - show percentage of total 
cluster that each queue is using  (was: Yarn Scheduler show percentage of total 
cluster that a queue is using)

 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is consuming on the 
 Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated YARN-3638:
--
Description: 
Request to show % of total cluster resources each queue is currently consuming 
for jobs on the Yarn Resource Manager Scheduler page.

Currently the Yarn Resource Manager Scheduler page shows the % of total used 
for root queue and the % of each given queue's configured capacity that is used 
(often showing say 150% if the max capacity is greater than configured capacity 
to allow bursting where there are free resources). This is fine, but it would 
be good to additionally show the % of total cluster that each given queue is 
consuming and not just the % of that queue's configured capacity.

  was:
Request to show % of total cluster resources each queue is consuming on the 
Yarn Resource Manager Scheduler page.

Currently the Yarn Resource Manager Scheduler page shows the % of total used 
for root queue and the % of each given queue's configured capacity that is used 
(often showing say 150% if the max capacity is greater than configured capacity 
to allow bursting where there are free resources). This is fine, but it would 
be good to additionally show the % of total cluster that each given queue is 
consuming and not just the % of that queue's configured capacity.


 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541669#comment-14541669
 ] 

Junping Du commented on YARN-41:


bq. In the case of work-preserving NM restart (or under supervision as 
YARN-2331 calls it), we can make the NM not do an unregister?
I think the latest patch (-4.patch) already did this, but my concern is a little 
broader: do users (or management tools for a YARN cluster, like Ambari) expect 
the same behavior for a kill -9 of the NM daemon and a shutdown of the NM daemon? 
With the current patch (assuming NM work preserving is disabled), users will find 
that the RM no longer has this NM's info after the NM daemon is shut down, while 
kill -9 on the NM daemon keeps the old behavior (the RM still shows the NM in the 
running state and switches it to LOST after the timeout). Previously, the behavior 
of these two operations was the same. I don't think we need to care too much about 
consistent behavior between these two operations, but I would like to call it out 
loudly to make sure we don't miss anything important. 

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41.patch


 Instead of waiting for the NM expiry, RM should remove and handle the NM, 
 which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread nijel (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541697#comment-14541697
 ] 

nijel commented on YARN-3639:
-

Hi [~xinxianyin],
Thanks for reporting this issue.
Can you attach the logs for this issue? 

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin

 If the node on which the active RM runs dies and the active namenode is running 
 on the same node, the new RM takes a long time to recover all apps. After 
 analysis, we found the root cause is renewing HDFS tokens during the recovery 
 process. The HDFS client created by the renewer first tries to connect to the 
 original namenode, which times out after 10~20s, and only then tries to connect 
 to the new namenode. The entire recovery costs about 15 * #apps seconds 
 according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky

2015-05-13 Thread Tsuyoshi Ozawa (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541734#comment-14541734
 ] 

Tsuyoshi Ozawa commented on YARN-2921:
--

[~leftnoteasy] all tests are green. Could you check v8 patch and my comments?
https://issues.apache.org/jira/browse/YARN-2921?focusedCommentId=14539843&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14539843

 MockRM#waitForState methods can be too slow and flaky
 -

 Key: YARN-2921
 URL: https://issues.apache.org/jira/browse/YARN-2921
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0, 2.7.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi Ozawa
 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
 YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
 YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
 YARN-2921.008.patch


 MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
 second). This leads to slow tests and sometimes failures if the 
 App/AppAttempt moves to another state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541733#comment-14541733
 ] 

Peng Zhang commented on YARN-3640:
--

I've encountered the same problem and filed YARN-3585.

I think it's related to the leveldb thread. I also see it in your thread dump.

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohith resolved YARN-3640.
--
Resolution: Duplicate

Closing as duplicate.

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but could not figure out why the NM JVM did not exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3617) Fix unused variable to get CPU frequency on Windows systems

2015-05-13 Thread J.Andreina (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.Andreina updated YARN-3617:
-
Attachment: YARN-3617.1.patch

Attached an initial patch.
Please review.

 Fix unused variable to get CPU frequency on Windows systems
 ---

 Key: YARN-3617
 URL: https://issues.apache.org/jira/browse/YARN-3617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.0
 Environment: Windows 7 x64 SP1
Reporter: Georg Berendt
Assignee: J.Andreina
Priority: Minor
 Attachments: YARN-3617.1.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, 
 there is an unused variable for CPU frequency.
  /** {@inheritDoc} */
   @Override
   public long getCpuFrequency() {
 refreshIfNeeded();
 return -1;   
   }
 Please change '-1' to use 'cpuFrequencyKhz'.
 org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java
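
For illustration, a minimal self-contained sketch of the proposed change (the stand-in value and refresh logic are made up; the real plugin queries Windows for the frequency):
{code}
// Sketch only: return the refreshed field instead of the hard-coded -1.
public class CpuFrequencySketch {
  private long cpuFrequencyKhz = -1;    // populated by refreshIfNeeded() in the real plugin

  private void refreshIfNeeded() {
    cpuFrequencyKhz = 2_400_000L;       // stand-in for the Windows system query
  }

  public long getCpuFrequency() {
    refreshIfNeeded();
    return cpuFrequencyKhz;             // previously the method returned -1 unconditionally
  }

  public static void main(String[] args) {
    System.out.println(new CpuFrequencySketch().getCpuFrequency() + " kHz");
  }
}
{code}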



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541899#comment-14541899
 ] 

Hadoop QA commented on YARN-3170:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 51s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 56s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 11s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732566/YARN-3170-006.patch |
| Optional Tests | site |
| git revision | trunk / 065d8f2 |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7918/console |


This message was automatically generated.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
  Labels: BB2015-05-TBR
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc. are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, especially given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
  Component/s: rolling upgrade
   nodemanager
Affects Version/s: 2.6.0
  Summary: NodeManager: stopRecoveryStore() shouldn't be skipped 
when exceptions happen in stopping NM's sub-services.  (was: 
stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping 
NM's sub-services.)

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical

 If the NM's services do not get stopped properly, we cannot start the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that we stop all NM registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) first. If any of these 
 services is stopped with an exception, stopRecoveryStore() can get skipped, which 
 means the levelDB store is not closed. So the next time the NM starts, it will 
 fail with the exception above. 
 We should put stopRecoveryStore() in a finally block.
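
For illustration, a self-contained sketch of the try/finally pattern being proposed (stand-in method names, not the actual NodeManager patch):
{code}
// Sketch only: stop sub-services inside try so the recovery store is always
// closed in finally, even when a sub-service fails to stop.
public class ServiceStopSketch {
  static void stopSubServices() { throw new RuntimeException("a sub-service failed to stop"); }
  static void stopRecoveryStore() { System.out.println("recovery store closed, LOCK released"); }

  static void serviceStop() {
    try {
      stopSubServices();        // NodeStatusUpdater, log aggregation, localization, ...
    } finally {
      stopRecoveryStore();      // runs even if a sub-service threw, so the leveldb LOCK is freed
    }
  }

  public static void main(String[] args) {
    try {
      serviceStop();
    } catch (RuntimeException expected) {
      // the stop failure still propagates, but the store was closed first
    }
  }
}
{code}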



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3170) YARN architecture document needs updating

2015-05-13 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541846#comment-14541846
 ] 

Brahma Reddy Battula commented on YARN-3170:


Thanks [~ozawa] for the reminder. Yes, I missed this after Allen's comments. 
Kindly review the latest patch.

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
  Labels: BB2015-05-TBR
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc. are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, especially given it reads as though 0.23 was just released yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541881#comment-14541881
 ] 

Jason Lowe commented on YARN-41:


In light of NM restart, one of the problems with having the NM check for active 
applications and then take different actions is that the NM has a significantly 
delayed view of the cluster relative to the RM.  The RM could have decided to 
assign new containers (and thus new applications) to the node, but the NM 
hasn't seen the launch request from the AM yet.  This has already caused other 
issues, see the early discussions in YARN-3535 where containers were killed 
because the node reconnected with no active applications reported and was 
handled as a node removed/node added sequence.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41.patch


 Instead of waiting for the NM expiry, RM should remove and handle the NM, 
 which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541892#comment-14541892
 ] 

Hadoop QA commented on YARN-3489:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732569/YARN-3489-branch-2.7.02.patch
 |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7919/console |


This message was automatically generated.

 RMServerUtils.validateResourceRequests should only obtain queue info once
 -

 Key: YARN-3489
 URL: https://issues.apache.org/jira/browse/YARN-3489
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
  Labels: BB2015-05-RFC
 Attachments: YARN-3489-branch-2.7.02.patch, 
 YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, 
 YARN-3489.03.patch


 Since the label support was added we now get the queue info for each request 
 being validated in SchedulerUtils.validateResourceRequest.  If 
 validateResourceRequests needs to validate a lot of requests at a time (e.g.: 
 large cluster with lots of varied locality in the requests) then it will get 
 the queue info for each request.  Since we build the queue info this 
 generates a lot of unnecessary garbage, as the queue isn't changing between 
 requests.  We should grab the queue info once and pass it down rather than 
 building it again for each request.
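
For illustration, a minimal sketch of the "fetch once, pass down" idea with hypothetical types (not the actual scheduler code):
{code}
// Sketch only: obtain the queue info once and reuse it for every request,
// instead of rebuilding it per request.
import java.util.Arrays;
import java.util.List;

public class ValidateOnceSketch {
  record QueueInfo(String name) {}
  record ResourceRequest(int memoryMb) {}

  static QueueInfo getQueueInfo(String queue) {
    return new QueueInfo(queue);               // expensive to build in the real scheduler
  }

  static void validate(ResourceRequest r, QueueInfo q) {
    if (r.memoryMb() <= 0) {
      throw new IllegalArgumentException("bad request for queue " + q.name());
    }
  }

  static void validateResourceRequests(List<ResourceRequest> requests, String queue) {
    QueueInfo queueInfo = getQueueInfo(queue);  // obtained once, not per request
    for (ResourceRequest r : requests) {
      validate(r, queueInfo);
    }
  }

  public static void main(String[] args) {
    validateResourceRequests(
        Arrays.asList(new ResourceRequest(1024), new ResourceRequest(2048)), "default");
    System.out.println("all requests validated against a single QueueInfo");
  }
}
{code}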



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS

2015-05-13 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541861#comment-14541861
 ] 

Varun Vasudev commented on YARN-160:


The test failure is unrelated to the patch.

 nodemanagers should obtain cpu/memory values from underlying OS
 ---

 Key: YARN-160
 URL: https://issues.apache.org/jira/browse/YARN-160
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Varun Vasudev
  Labels: BB2015-05-TBR
 Attachments: YARN-160.005.patch, YARN-160.006.patch, 
 YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, 
 apache-yarn-160.2.patch, apache-yarn-160.3.patch


 As mentioned in YARN-2
 *NM memory and CPU configs*
 Currently these values come from the NM's config; we should be able to obtain 
 them from the OS (i.e., in the case of Linux, from /proc/meminfo & /proc/cpuinfo). 
 As this is highly OS dependent, we should have an interface that obtains this 
 information. In addition, implementations of this interface should be able to 
 specify a mem/cpu offset (the amount of mem/cpu not to be made available as a 
 YARN resource); this would allow reserving mem/cpu for the OS and other services 
 outside of YARN containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
Description: 
If the NM's services do not get stopped properly, we cannot start the NM with 
work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at 
c6403.ambari.apache.org/192.168.64.103
/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
if (isStopping.getAndSet(true)) {
  return;
}
super.serviceStop();
stopRecoveryStore();
DefaultMetricsSystem.shutdown();
  }
{code}
We can see that all of the NM's registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If 
any of these services throws an exception while stopping, stopRecoveryStore() is 
skipped, which means the leveldb store is never closed. The next time the NM 
starts, it fails with the exception above. 
We should put stopRecoveryStore() in a finally block.

  was:
If the NM's services do not get stopped properly, we cannot restart the NM with 
NM restart (work preserving) enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) 

[jira] [Updated] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String

2015-05-13 Thread Sunil G (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil G updated YARN-3579:
--
Attachment: 0004-YARN-3579.patch

Updating the patch after fixing a compilation problem.

 getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead 
 of label name as String
 

 Key: YARN-3579
 URL: https://issues.apache.org/jira/browse/YARN-3579
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Minor
 Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 
 0003-YARN-3579.patch, 0004-YARN-3579.patch


 CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String. It 
 does not pass information such as exclusivity back to the REST interface APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3641) stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)
Junping Du created YARN-3641:


 Summary: stopRecoveryStore() shouldn't be skipped when exceptions 
happen in stopping NM's sub-services.
 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical


If the NM's services do not get stopped properly, we cannot restart the NM with 
NM restart (work preserving) enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: 
org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
/var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
temporarily unavailable
at 
org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
Resource temporarily unavailable
at 
org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at 
org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
(LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down NodeManager at 
c6403.ambari.apache.org/192.168.64.103
/
{noformat}

The related code is as below in NodeManager.java:
{code}
  @Override
  protected void serviceStop() throws Exception {
if (isStopping.getAndSet(true)) {
  return;
}
super.serviceStop();
stopRecoveryStore();
DefaultMetricsSystem.shutdown();
  }
{code}
We can see that all of the NM's registered services (NodeStatusUpdater, 
LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If 
any of these services throws an exception while stopping, stopRecoveryStore() is 
skipped, which means the leveldb store is never closed. The next time the NM 
starts, it fails with the exception above. 
We should put stopRecoveryStore() in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated YARN-3641:
-
Attachment: YARN-3641.patch

Uploaded a quick patch to fix it. The issue is obvious and the solution is 
simple enough that no unit test is needed.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services do not get stopped properly, we cannot restart the NM with 
 NM restart (work preserving) enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If 
 any of these services throws an exception while stopping, stopRecoveryStore() is 
 skipped, which means the leveldb store is never closed. The next time the NM 
 starts, it fails with the exception above. 
 We should put stopRecoveryStore() in a finally block.
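
A minimal sketch of the intended shape of the fix (illustrative only; the attached YARN-3641.patch is authoritative):

{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    // Stop all registered sub-services; any of them may throw here.
    super.serviceStop();
  } finally {
    // Always release the leveldb recovery store so the next NM start
    // does not fail to acquire the LOCK file.
    stopRecoveryStore();
    DefaultMetricsSystem.shutdown();
  }
}
{code}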



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3627) Preemption not triggered in Fair scheduler when maxResources is set on parent queue

2015-05-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541782#comment-14541782
 ] 

Bibin A Chundatt commented on YARN-3627:


Hi [~kasha], thank you for looking into this. But YARN-3405 also doesn't 
change *shouldAttemptPreemption()*, so the primary threshold check will still 
happen. Sub-queue Q1.1 is not preempted since cluster utilization is below the 
threshold, and Q1.2 will starve for resources.

 Preemption not triggered in Fair scheduler when maxResources is set on parent 
 queue
 ---

 Key: YARN-3627
 URL: https://issues.apache.org/jira/browse/YARN-3627
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler, scheduler
 Environment: Suse 11 SP3, 2 NM 
Reporter: Bibin A Chundatt

 Consider the below scenario of fair configuration 
  
 Root (10Gb cluster resource)
 --Q1 (maxResources  4gb) 
 Q1.1 (maxResources 4gb) 
 Q1.2  (maxResources  4gb) 
 --Q2 (maxResources 6GB)
  
 No applications are running in Q2
  
 Submit one application to Q1.1 with 50 maps; 4GB gets allocated to Q1.1.
 Now submit an application to Q1.2; it will always be starving for memory.
  
 Preemption will never get triggered since 
 yarn.scheduler.fair.preemption.cluster-utilization-threshold =.8 and the 
 cluster utilization is below .8.
  
 *Fairscheduler.java*
 {code}
   private boolean shouldAttemptPreemption() {
 if (preemptionEnabled) {
   return (preemptionUtilizationThreshold < Math.max(
   (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(),
   (float) rootMetrics.getAllocatedVirtualCores() /
   clusterResource.getVirtualCores()));
 }
 return false;
   }
 {code}
 Are we supposed to configure maxResources  0mb and 0 cores in the running 
 cluster so that all queues can always take the full cluster resources if 
 available?
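
To make the arithmetic concrete, here is a quick worked example with the numbers from this scenario (illustrative only; it simply mirrors the check in shouldAttemptPreemption() above):

{code}
// Q1.1 holds 4096 MB of a 10240 MB cluster; vcore utilization is also low.
float memoryUtilization = 4096f / 10240f;                   // 0.4
boolean attempt = 0.8f < Math.max(memoryUtilization, 0.0f); // false -> preemption never runs
{code}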



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541772#comment-14541772
 ] 

Hudson commented on YARN-3613:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/])
YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in 
setup instead of individual methods. (nijel via kasha) (kasha: rev 
fe0df596271340788095cb43a1944e19ac4c2cf7)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java


 TestContainerManagerSecurity should init and start Yarn cluster in setup 
 instead of individual methods
 --

 Key: YARN-3613
 URL: https://issues.apache.org/jira/browse/YARN-3613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: nijel
Priority: Minor
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3613-1.patch, yarn-3613-2.patch


 In TestContainerManagerSecurity, individual tests init and start Yarn 
 cluster. This duplication can be avoided by moving that to setup. 
 Further, one could merge the two @Test methods to avoid bringing up another 
 mini-cluster. 
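
A sketch of the kind of refactoring described, assuming JUnit 4 and the MiniYARNCluster helper this test already uses (the constructor arguments shown are illustrative, not the committed patch):

{code}
private MiniYARNCluster yarnCluster;

@Before
public void setup() {
  // Bring the mini cluster up once in setup instead of inside each test.
  Configuration conf = new YarnConfiguration();
  yarnCluster = new MiniYARNCluster(
      TestContainerManagerSecurity.class.getName(), 1, 1, 1);
  yarnCluster.init(conf);
  yarnCluster.start();
}

@After
public void tearDown() {
  if (yarnCluster != null) {
    yarnCluster.stop();
    yarnCluster = null;
  }
}
{code}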



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541773#comment-14541773
 ] 

Hudson commented on YARN-3539:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/])
YARN-3539. Updated timeline server documentation and marked REST APIs evolving. 
Contributed by Steve Loughran. (zjshen: rev 
fcd0702c10ce574b887280476aba63d6682d5271)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java
* hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java


 Compatibility doc to state that ATS v1 is a stable REST API
 ---

 Key: YARN-3539
 URL: https://issues.apache.org/jira/browse/YARN-3539
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.7.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Fix For: 2.7.1

 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, 
 TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, 
 YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, 
 YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, 
 YARN-3539.11.patch, timeline_get_api_examples.txt


 The ATS v2 discussion and YARN-2423 have raised the question: how stable are 
 the ATSv1 APIs?
 The existing compatibility document actually states that the History Server 
 is [a stable REST 
 API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs],
  which effectively means that ATSv1 has already been declared as a stable API.
 Clarify this by patching the compatibility document appropriately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541785#comment-14541785
 ] 

Junping Du commented on YARN-3505:
--

Latest patch LGTM. 
[~jianhe], about your previous comments: 
LogAggregationReport#(get/set)NodeId has already been removed, and 
LogAggregationReport#(get/set)DiagnosticMessage is necessary (see 
AppLogAggregatorImpl.java). Any further comments from you?


 Node's Log Aggregation Report with SUCCEED should not cached in RMApps
 --

 Key: YARN-3505
 URL: https://issues.apache.org/jira/browse/YARN-3505
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Junping Du
Assignee: Xuan Gong
Priority: Critical
 Attachments: YARN-3505.1.patch, YARN-3505.2.patch, 
 YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, 
 YARN-3505.5.patch


 Per discussions in YARN-1402, we shouldn't cache every node's log aggregation 
 report in RMApps forever, especially for those that finished with SUCCEED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541794#comment-14541794
 ] 

Hudson commented on YARN-3613:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/926/])
YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in 
setup instead of individual methods. (nijel via kasha) (kasha: rev 
fe0df596271340788095cb43a1944e19ac4c2cf7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
* hadoop-yarn-project/CHANGES.txt


 TestContainerManagerSecurity should init and start Yarn cluster in setup 
 instead of individual methods
 --

 Key: YARN-3613
 URL: https://issues.apache.org/jira/browse/YARN-3613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: nijel
Priority: Minor
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3613-1.patch, yarn-3613-2.patch


 In TestContainerManagerSecurity, individual tests init and start Yarn 
 cluster. This duplication can be avoided by moving that to setup. 
 Further, one could merge the two @Test methods to avoid bringing up another 
 mini-cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541799#comment-14541799
 ] 

Hudson commented on YARN-3513:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/926/])
YARN-3513. Remove unused variables in ContainersMonitorImpl and add debug 
(devaraj: rev 8badd82ce256e4dc8c234961120d62a88358ab39)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java
* hadoop-yarn-project/CHANGES.txt


 Remove unused variables in ContainersMonitorImpl and add debug log for 
 overall resource usage by all containers 
 

 Key: YARN-3513
 URL: https://issues.apache.org/jira/browse/YARN-3513
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Trivial
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, 
 YARN-3513.20150506-1.patch, YARN-3513.20150507-1.patch, 
 YARN-3513.20150508-1.patch, YARN-3513.20150508-1.patch, 
 YARN-3513.20150511-1.patch


 Some local variables in MonitoringThread.run(), {{vmemStillInUsage}} and 
 {{pmemStillInUsage}}, are only updated and never read. 
 Instead, we should add a debug log for the overall resource usage of all 
 containers.
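
Roughly the kind of logging intended, as a sketch (the aggregate variable names here are illustrative, not necessarily those used in the attached patches):

{code}
// Inside MonitoringThread.run(), after iterating over all tracked containers:
if (LOG.isDebugEnabled()) {
  LOG.debug("Total resource usage of " + trackingContainers.size()
      + " containers: " + totalVmemBytes + " bytes of vmem, "
      + totalPmemBytes + " bytes of pmem, " + totalVCoresUsed
      + " vcores of CPU");
}
{code}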



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541795#comment-14541795
 ] 

Hudson commented on YARN-3539:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/926/])
YARN-3539. Updated timeline server documentation and marked REST APIs evolving. 
Contributed by Steve Loughran. (zjshen: rev 
fcd0702c10ce574b887280476aba63d6682d5271)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java
* hadoop-project/src/site/site.xml
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java
* hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java


 Compatibility doc to state that ATS v1 is a stable REST API
 ---

 Key: YARN-3539
 URL: https://issues.apache.org/jira/browse/YARN-3539
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.7.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Fix For: 2.7.1

 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, 
 TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, 
 YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, 
 YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, 
 YARN-3539.11.patch, timeline_get_api_examples.txt


 The ATS v2 discussion and YARN-2423 have raised the question: how stable are 
 the ATSv1 APIs?
 The existing compatibility document actually states that the History Server 
 is [a stable REST 
 API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs],
  which effectively means that ATSv1 has already been declared as a stable API.
 Clarify this by patching the compatibility document appropriately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541802#comment-14541802
 ] 

Hudson commented on YARN-3629:
--

SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/926/])
YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 
5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java


 NodeID is always printed as null in node manager initialization log.
 --

 Key: YARN-3629
 URL: https://issues.apache.org/jira/browse/YARN-3629
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Fix For: 2.8.0

 Attachments: YARN-3629-1.patch


 In the NodeManager log, the following line is printed during startup:
 2015-05-12 11:20:02,347 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized 
 nodemanager for *null* : physical-memory=4096 virtual-memory=8602 
 virtual-cores=8
 This line is printed from NodeStatusUpdaterImpl.serviceInit, but the nodeId 
 assignment happens only in NodeStatusUpdaterImpl.serviceStart:
 {code}
   protected void serviceStart() throws Exception {
 // NodeManager is the last service to start, so NodeId is available.
 this.nodeId = this.context.getNodeId();
 {code}
 Assigning the node id in serviceInit is not feasible since it is generated by 
 ContainerManagerImpl.serviceStart.
 The log can be moved to serviceStart to give the right information to the user. 
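
A rough sketch of what moving the log would look like (illustrative only; the field names for the memory/core values are assumptions, and the real serviceStart() does more work than shown here):

{code}
@Override
protected void serviceStart() throws Exception {
  // NodeManager is the last service to start, so NodeId is available here.
  this.nodeId = this.context.getNodeId();
  LOG.info("Initialized nodemanager for " + nodeId
      + " : physical-memory=" + memoryMb
      + " virtual-memory=" + virtualMemoryMb
      + " virtual-cores=" + virtualCores);
  // ... register with the RM, start the status updater thread, etc.
  super.serviceStart();
}
{code}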



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541848#comment-14541848
 ] 

Hadoop QA commented on YARN-160:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 57s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 5 new or modified test files. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  0s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   2m 40s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m 29s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   4m 33s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | tools/hadoop tests |  15m  2s | Tests passed in 
hadoop-gridmix. |
| {color:green}+1{color} | yarn tests |   0m 23s | Tests passed in 
hadoop-yarn-api. |
| {color:red}-1{color} | yarn tests |   1m 57s | Tests failed in 
hadoop-yarn-common. |
| {color:red}-1{color} | yarn tests |   0m 17s | Tests failed in 
hadoop-yarn-server-nodemanager. |
| | |  60m 30s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.nodelabels.TestFileSystemNodeLabelsStore |
| Failed build | hadoop-yarn-server-nodemanager |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732537/YARN-160.007.patch |
| Optional Tests | javac unit findbugs checkstyle javadoc |
| git revision | trunk / 92c38e4 |
| hadoop-gridmix test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-gridmix.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7916/console |


This message was automatically generated.

 nodemanagers should obtain cpu/memory values from underlying OS
 ---

 Key: YARN-160
 URL: https://issues.apache.org/jira/browse/YARN-160
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Varun Vasudev
  Labels: BB2015-05-TBR
 Attachments: YARN-160.005.patch, YARN-160.006.patch, 
 YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, 
 apache-yarn-160.2.patch, apache-yarn-160.3.patch


 As mentioned in YARN-2:
 *NM memory and CPU configs*
 Currently these values come from the NM's configuration. We should be able to 
 obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and 
 /proc/cpuinfo). As this is highly OS dependent, we should have an interface 
 that obtains this information. In addition, implementations of this interface 
 should be able to specify a mem/cpu offset (an amount of mem/cpu not to be made 
 available as a YARN resource), which would allow reserving mem/cpu for the OS 
 and other services outside of YARN containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541780#comment-14541780
 ] 

Hudson commented on YARN-3629:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/])
YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 
5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java


 NodeID is always printed as null in node manager initialization log.
 --

 Key: YARN-3629
 URL: https://issues.apache.org/jira/browse/YARN-3629
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Fix For: 2.8.0

 Attachments: YARN-3629-1.patch


 In the NodeManager log, the following line is printed during startup:
 2015-05-12 11:20:02,347 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized 
 nodemanager for *null* : physical-memory=4096 virtual-memory=8602 
 virtual-cores=8
 This line is printed from NodeStatusUpdaterImpl.serviceInit, but the nodeId 
 assignment happens only in NodeStatusUpdaterImpl.serviceStart:
 {code}
   protected void serviceStart() throws Exception {
 // NodeManager is the last service to start, so NodeId is available.
 this.nodeId = this.context.getNodeId();
 {code}
 Assigning the node id in serviceInit is not feasible since it is generated by 
 ContainerManagerImpl.serviceStart.
 The log can be moved to serviceStart to give the right information to the user. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541777#comment-14541777
 ] 

Hudson commented on YARN-3513:
--

FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/])
YARN-3513. Remove unused variables in ContainersMonitorImpl and add debug 
(devaraj: rev 8badd82ce256e4dc8c234961120d62a88358ab39)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java


 Remove unused variables in ContainersMonitorImpl and add debug log for 
 overall resource usage by all containers 
 

 Key: YARN-3513
 URL: https://issues.apache.org/jira/browse/YARN-3513
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Naganarasimha G R
Assignee: Naganarasimha G R
Priority: Trivial
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, 
 YARN-3513.20150506-1.patch, YARN-3513.20150507-1.patch, 
 YARN-3513.20150508-1.patch, YARN-3513.20150508-1.patch, 
 YARN-3513.20150511-1.patch


 Some local variables in MonitoringThread.run(), {{vmemStillInUsage}} and 
 {{pmemStillInUsage}}, are only updated and never read. 
 Instead, we should add a debug log for the overall resource usage of all 
 containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.

2015-05-13 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541790#comment-14541790
 ] 

Rohith commented on YARN-3640:
--

I am able to reproduce this always..!!!

 NodeManager JVM continues to run after SHUTDOWN event is triggered.
 ---

 Key: YARN-3640
 URL: https://issues.apache.org/jira/browse/YARN-3640
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.0
Reporter: Rohith
 Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, 
 nm_143.out


 We faced a strange issue in the cluster where the NodeManager did not exit when 
 the SHUTDOWN event was triggered from NodeStatusUpdaterImpl. We took a thread 
 dump and examined it, but did not get much of an idea why the NM JVM did not 
 exit. 
 Three NodeManagers hit this problem at the same time, and all three NodeManager 
 thread dumps look similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled

2015-05-13 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541812#comment-14541812
 ] 

Peng Zhang commented on YARN-3585:
--

As in YARN-3640, Rohith has encountered the same problem, and we all see a 
leveldb thread in the thread stack. 
I think it's probably related to NM recovery; decommission is not the key 
factor.

[~devaraj.k] Do you enable NM recovery in your env?

 Nodemanager cannot exit when decommission with NM recovery enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang

 With NM recovery enabled, after decommission, the nodemanager log shows the 
 shutdown, but the process does not end. 
 Non-daemon threads:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and the jni leveldb thread stack:
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3170) YARN architecture document needs updating

2015-05-13 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3170:
---
Attachment: YARN-3170-006.patch

 YARN architecture document needs updating
 -

 Key: YARN-3170
 URL: https://issues.apache.org/jira/browse/YARN-3170
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Allen Wittenauer
Assignee: Brahma Reddy Battula
  Labels: BB2015-05-TBR
 Attachments: YARN-3170-002.patch, YARN-3170-003.patch, 
 YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch


 The marketing paragraph at the top, NextGen MapReduce, etc. are all 
 marketing rather than actual descriptions. It also needs some general 
 updates, especially given that it reads as though 0.23 was just released 
 yesterday.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3617) Fix unused variable to get CPU frequency on Windows systems

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541854#comment-14541854
 ] 

Hadoop QA commented on YARN-3617:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m  8s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 41s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |  10m  1s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 52s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 24s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| | |  39m 35s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732542/YARN-3617.1.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7917/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7917/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7917/console |


This message was automatically generated.

 Fix unused variable to get CPU frequency on Windows systems
 ---

 Key: YARN-3617
 URL: https://issues.apache.org/jira/browse/YARN-3617
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.0
 Environment: Windows 7 x64 SP1
Reporter: Georg Berendt
Assignee: J.Andreina
Priority: Minor
 Attachments: YARN-3617.1.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, 
 there is an unused variable for CPU frequency.
  /** {@inheritDoc} */
   @Override
   public long getCpuFrequency() {
 refreshIfNeeded();
 return -1;   
   }
 Please change '-1' to use 'cpuFrequencyKhz'.
 org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java
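
The suggested change would presumably look like the following sketch (assuming the {{cpuFrequencyKhz}} field is kept up to date by {{refreshIfNeeded()}}):

{code}
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz;   // previously a hard-coded -1
}
{code}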



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once

2015-05-13 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena updated YARN-3489:
---
Attachment: YARN-3489-branch-2.7.02.patch

 RMServerUtils.validateResourceRequests should only obtain queue info once
 -

 Key: YARN-3489
 URL: https://issues.apache.org/jira/browse/YARN-3489
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jason Lowe
Assignee: Varun Saxena
  Labels: BB2015-05-RFC
 Attachments: YARN-3489-branch-2.7.02.patch, 
 YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, 
 YARN-3489.03.patch


 Since the label support was added we now get the queue info for each request 
 being validated in SchedulerUtils.validateResourceRequest.  If 
 validateResourceRequests needs to validate a lot of requests at a time (e.g.: 
 large cluster with lots of varied locality in the requests) then it will get 
 the queue info for each request.  Since we build the queue info this 
 generates a lot of unnecessary garbage, as the queue isn't changing between 
 requests.  We should grab the queue info once and pass it down rather than 
 building it again for each request.
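
The intended pattern, as a simplified sketch (the exact method signatures in the patch differ; this only illustrates hoisting the queue lookup out of the per-request loop):

{code}
// Look the queue up once per allocate call...
QueueInfo queueInfo = scheduler.getQueueInfo(queueName, false, false);
// ...then validate every request against the same QueueInfo instance.
for (ResourceRequest req : ask) {
  SchedulerUtils.validateResourceRequest(req, maximumResource, queueInfo,
      rmContext);
}
{code}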



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541885#comment-14541885
 ] 

Jason Lowe commented on YARN-3638:
--

Isn't this the Absolute Used Capacity metric that is shown for each leaf 
queue?

 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager, scheduler, yarn
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541950#comment-14541950
 ] 

Hari Sekhon commented on YARN-3638:
---

[~jlowe] Yes, I believe it is the % of absolute capacity that is shown, which is 
useful for seeing how much you're bursting over.

It would be nice if the RM would also show the % of the total cluster's capacity 
that the leaf queue is consuming. You could also extend this idea to show the % 
of total cluster capacity that each job is consuming too.
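
For reference, the relationship between the two figures can be sketched like this (a back-of-the-envelope illustration; the getter names follow the CapacityScheduler queue metrics but are not meant as the exact UI code):

{code}
// Fraction of the queue's own configured capacity currently used, e.g. 1.5 == 150%
float usedOfQueueCapacity = queue.getUsedCapacity();
// The queue's configured share of the whole cluster, e.g. 0.4 == 40%
float absoluteQueueCapacity = queue.getAbsoluteCapacity();
// Share of the total cluster the queue is actually consuming, e.g. 0.6 == 60%
float usedOfCluster = usedOfQueueCapacity * absoluteQueueCapacity;
{code}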

 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager, scheduler, yarn
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541980#comment-14541980
 ] 

Hudson commented on YARN-3629:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/])
YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 
5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* hadoop-yarn-project/CHANGES.txt


 NodeID is always printed as null in node manager initialization log.
 --

 Key: YARN-3629
 URL: https://issues.apache.org/jira/browse/YARN-3629
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Fix For: 2.8.0

 Attachments: YARN-3629-1.patch


 In the NodeManager log, the following line is printed during startup:
 2015-05-12 11:20:02,347 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized 
 nodemanager for *null* : physical-memory=4096 virtual-memory=8602 
 virtual-cores=8
 This line is printed from NodeStatusUpdaterImpl.serviceInit, but the nodeId 
 assignment happens only in NodeStatusUpdaterImpl.serviceStart:
 {code}
   protected void serviceStart() throws Exception {
 // NodeManager is the last service to start, so NodeId is available.
 this.nodeId = this.context.getNodeId();
 {code}
 Assigning the node id in serviceInit is not feasible since it is generated by 
 ContainerManagerImpl.serviceStart.
 The log can be moved to serviceStart to give the right information to the user. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541973#comment-14541973
 ] 

Hudson commented on YARN-3539:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/])
YARN-3539. Updated timeline server documentation and marked REST APIs evolving. 
Contributed by Steve Loughran. (zjshen: rev 
fcd0702c10ce574b887280476aba63d6682d5271)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java


 Compatibility doc to state that ATS v1 is a stable REST API
 ---

 Key: YARN-3539
 URL: https://issues.apache.org/jira/browse/YARN-3539
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.7.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Fix For: 2.7.1

 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, 
 TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, 
 YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, 
 YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, 
 YARN-3539.11.patch, timeline_get_api_examples.txt


 The ATS v2 discussion and YARN-2423 have raised the question: how stable are 
 the ATSv1 APIs?
 The existing compatibility document actually states that the History Server 
 is [a stable REST 
 API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs],
  which effectively means that ATSv1 has already been declared as a stable API.
 Clarify this by patching the compatibility document appropriately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541999#comment-14541999
 ] 

Hudson commented on YARN-3629:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/])
YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 
5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* hadoop-yarn-project/CHANGES.txt


 NodeID is always printed as null in node manager initialization log.
 --

 Key: YARN-3629
 URL: https://issues.apache.org/jira/browse/YARN-3629
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: nijel
Assignee: nijel
 Fix For: 2.8.0

 Attachments: YARN-3629-1.patch


 In the NodeManager log during startup, the following line is printed:
 2015-05-12 11:20:02,347 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized 
 nodemanager for *null* : physical-memory=4096 virtual-memory=8602 
 virtual-cores=8
 This line is printed from NodeStatusUpdaterImpl.serviceInit.
 But the nodeId assignment happens only in 
 NodeStatusUpdaterImpl.serviceStart:
 {code}
   protected void serviceStart() throws Exception {
     // NodeManager is the last service to start, so NodeId is available.
     this.nodeId = this.context.getNodeId();
 {code}
 Assigning the node id in serviceInit is not feasible since it is generated by 
 ContainerManagerImpl.serviceStart.
 The log can be moved to serviceStart to give the right information to the user.
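 A minimal sketch of that suggestion (building on the serviceStart() shown above; the log wording and the memory/core fields are illustrative, not the committed patch):
 {code}
   @Override
   protected void serviceStart() throws Exception {
     // NodeManager is the last service to start, so NodeId is available here.
     this.nodeId = this.context.getNodeId();
     // Logging here instead of in serviceInit() lets the message report the
     // real NodeId rather than null. (memoryMb/virtualMemoryMb/virtualCores
     // are illustrative field names.)
     LOG.info("Initialized nodemanager for " + nodeId + ": physical-memory="
         + memoryMb + " virtual-memory=" + virtualMemoryMb
         + " virtual-cores=" + virtualCores);
     super.serviceStart();
   }
 {code}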



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541993#comment-14541993
 ] 

Hudson commented on YARN-3539:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/])
YARN-3539. Updated timeline server documentation and marked REST APIs evolving. 
Contributed by Steve Loughran. (zjshen: rev 
fcd0702c10ce574b887280476aba63d6682d5271)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java
* hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java
* hadoop-project/src/site/site.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java


 Compatibility doc to state that ATS v1 is a stable REST API
 ---

 Key: YARN-3539
 URL: https://issues.apache.org/jira/browse/YARN-3539
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.7.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Fix For: 2.7.1

 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, 
 TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, 
 YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, 
 YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, 
 YARN-3539.11.patch, timeline_get_api_examples.txt


 The ATS v2 discussion and YARN-2423 have raised the question: how stable are 
 the ATSv1 APIs?
 The existing compatibility document actually states that the History Server 
 is [a stable REST 
 API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs],
  which effectively means that ATSv1 has already been declared as a stable API.
 Clarify this by patching the compatibility document appropriately



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541969#comment-14541969
 ] 

Jason Lowe commented on YARN-3638:
--

bq.  I believe it is % of absolute capacity that is shown

No, the Absolute Used Capacity field is the amount of total cluster capacity 
being used by this queue. From the 2.6 code in 
CSQueueUtils.updateQueueStatistics:
{code}
  absoluteUsedCapacity =
      Resources.divide(calculator, clusterResource,
          usedResources, clusterResource);
{code}
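(As a purely illustrative reading of that formula: a queue using 30 GB of a 120 GB 
cluster reports an Absolute Used Capacity of 25%, regardless of how that compares 
with the queue's own configured capacity.)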


 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager, scheduler, yarn
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541972#comment-14541972
 ] 

Hudson commented on YARN-3613:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/])
YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in 
setup instead of individual methods. (nijel via kasha) (kasha: rev 
fe0df596271340788095cb43a1944e19ac4c2cf7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
* hadoop-yarn-project/CHANGES.txt


 TestContainerManagerSecurity should init and start Yarn cluster in setup 
 instead of individual methods
 --

 Key: YARN-3613
 URL: https://issues.apache.org/jira/browse/YARN-3613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: nijel
Priority: Minor
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3613-1.patch, yarn-3613-2.patch


 In TestContainerManagerSecurity, individual tests init and start Yarn 
 cluster. This duplication can be avoided by moving that to setup. 
 Further, one could merge the two @Test methods to avoid bringing up another 
 mini-cluster. 
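 A rough sketch of what that refactor could look like (JUnit 4 style, assuming the test keeps a Configuration field named conf; this is illustrative, not the attached patch):
 {code}
   private MiniYARNCluster yarnCluster;

   @Before
   public void setUp() {
     // Bring the mini Yarn cluster up once per test here instead of in each test body.
     yarnCluster = new MiniYARNCluster(
         TestContainerManagerSecurity.class.getName(), 1, 1, 1);
     yarnCluster.init(conf);
     yarnCluster.start();
   }

   @After
   public void tearDown() {
     if (yarnCluster != null) {
       yarnCluster.stop();
     }
   }
 {code}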



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541978#comment-14541978
 ] 

Hadoop QA commented on YARN-3579:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 16s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 2 new or modified test files. |
| {color:green}+1{color} | javac |   7m 43s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 52s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 25s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 52s | The applied patch generated  1 
new checkstyle issues (total was 34, now 34). |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 15  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 24s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   1m 58s | Tests passed in 
hadoop-yarn-common. |
| | |  39m 40s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732573/0004-YARN-3579.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7920/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7920/console |


This message was automatically generated.

 getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead 
 of label name as String
 

 Key: YARN-3579
 URL: https://issues.apache.org/jira/browse/YARN-3579
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Minor
 Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 
 0003-YARN-3579.patch, 0004-YARN-3579.patch


 CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String. It 
 does not pass information such as exclusivity back to the REST interface APIs.
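 A hedged sketch of the direction (the exact method name and signature in the attached patch may differ):
 {code}
 // Current shape in CommonNodeLabelsManager: only the label names come back.
 public Map<String, Set<NodeId>> getLabelsToNodes();

 // Proposed direction: return NodeLabel objects so attributes such as
 // exclusivity survive the trip to the REST layer.
 public Map<NodeLabel, Set<NodeId>> getLabelsToNodes();
 {code}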



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542000#comment-14542000
 ] 

Jason Lowe commented on YARN-3641:
--

I think the patch approach is OK, but I'm not sure I agree with the problem 
analysis. We kill -9 the NM during rolling upgrades, which obviously will not 
cleanly shut down the state store, yet we don't have the IO error lock problem. 
The issue is that the old NM process must still be running, which is why 
leveldb refuses to open the still-in-use database.  In that sense this JIRA 
appears to be a duplicate of the same problems described in YARN-3585 and 
YARN-3640.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     super.serviceStop();
     stopRecoveryStore();
     DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. Any 
 of those services failing to stop with an exception can cause stopRecoveryStore() 
 to be skipped, which means the levelDB store is never closed. So the next time 
 the NM starts, it fails with the exception above.
 We should put stopRecoveryStore() in a finally block.
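 A minimal sketch of that fix (same method as above; illustrative and not necessarily identical to the attached patch):
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     try {
       super.serviceStop();
       DefaultMetricsSystem.shutdown();
     } finally {
       // Always close the leveldb recovery store, even if a sub-service failed
       // to stop, so the next NM start can reacquire the state-store LOCK file.
       stopRecoveryStore();
     }
   }
 {code}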



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542028#comment-14542028
 ] 

Junping Du commented on YARN-3641:
--

bq. We kill -9 the NM during rolling upgrades, which obviously will not cleanly 
shut down the state store, yet we don't have the IO error lock problem.
Yes, I also suspect that the old NM is still running. The bad news is that our 
original environment is gone; we may need some time to reproduce this and see 
whether it is the same problem as YARN-3585.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     super.serviceStop();
     stopRecoveryStore();
     DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. Any 
 of those services failing to stop with an exception can cause stopRecoveryStore() 
 to be skipped, which means the levelDB store is never closed. So the next time 
 the NM starts, it fails with the exception above.
 We should put stopRecoveryStore() in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-05-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542045#comment-14542045
 ] 

Junping Du commented on YARN-41:


Thanks [~jlowe] for sharing this perspective. I think YARN-3212 is facing the 
same situation as your last comment there. 
However, in this case we can make things simpler if we don't care whether 
applications are running on work-preserving-enabled nodes: we simply don't 
unregister the NM from the RM when work-preserving recovery is enabled. This is 
not only simpler to implement, but also simpler for users to understand. 
Otherwise, shutting down the NM daemon would behave seemingly at random, with 
the node sometimes disappearing from the RM and sometimes not, so the behavior 
would be controlled not by configuration but by whichever containers happen to 
be allocated at the time. Thoughts?

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41.patch


 Instead of waiting for the NM expiry, RM should remove and handle the NM, 
 which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3628) ContainerMetrics should support always-flush mode.

2015-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542036#comment-14542036
 ] 

Karthik Kambatla commented on YARN-3628:


bq. So the empty content is shown for the active container metrics until it is 
finished.
Where are we showing this? jmx or a specific metrics sink?  

I am not convinced we should support a period of 0 ms, let alone by default: 
each container will be constantly publishing its usage metrics. Assuming 
non-positive period implies flushOnExit seems like the right approach to me. 
Also, since this was released as part of 2.7, we should avoid incompatible 
changes. 

 ContainerMetrics should support always-flush mode.
 --

 Key: YARN-3628
 URL: https://issues.apache.org/jira/browse/YARN-3628
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-3628.000.patch


 ContainerMetrics should support always-flush mode.
 It will be good to set ContainerMetrics as always-flush mode if 
 yarn.nodemanager.container-metrics.period-ms is configured as 0.
 Currently both 0 and -1 mean flush on completion.
 Also, the current default value for 
 yarn.nodemanager.container-metrics.period-ms is -1 and the default value for 
 yarn.nodemanager.container-metrics.enable is true, so empty content is 
 shown for an active container's metrics until it is finished.
 The default value for yarn.nodemanager.container-metrics.period-ms should not 
 be -1: flushOnPeriod is always false if flushPeriodMs is -1, so the content 
 will only be shown once the container has finished.
 {code}
 if (finished || flushOnPeriod) {
   registry.snapshot(collector.addRecord(registry.info()), all);
 }
 {code}
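 A hedged sketch of the proposed behavior (0 means always flush, -1 keeps meaning flush on completion; names follow the snippet above and this is illustrative only):
 {code}
 // flushPeriodMs == 0 -> publish on every collection; flushPeriodMs < 0 -> only on finish.
 boolean alwaysFlush = (flushPeriodMs == 0);
 if (finished || flushOnPeriod || alwaysFlush) {
   registry.snapshot(collector.addRecord(registry.info()), all);
 }
 {code}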



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled

2015-05-13 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542078#comment-14542078
 ] 

Devaraj K commented on YARN-3585:
-

Thanks for the reply. I have NM recovery enabled in my environment.

 Nodemanager cannot exit when decommission with NM recovery enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang

 With NM recovery enabled, after decommission, nodemanager log show stop but 
 process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541992#comment-14541992
 ] 

Hudson commented on YARN-3613:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/])
YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in 
setup instead of individual methods. (nijel via kasha) (kasha: rev 
fe0df596271340788095cb43a1944e19ac4c2cf7)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java
* hadoop-yarn-project/CHANGES.txt


 TestContainerManagerSecurity should init and start Yarn cluster in setup 
 instead of individual methods
 --

 Key: YARN-3613
 URL: https://issues.apache.org/jira/browse/YARN-3613
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: nijel
Priority: Minor
  Labels: newbie
 Fix For: 2.8.0

 Attachments: YARN-3613-1.patch, yarn-3613-2.patch


 In TestContainerManagerSecurity, individual tests init and start Yarn 
 cluster. This duplication can be avoided by moving that to setup. 
 Further, one could merge the two @Test methods to avoid bringing up another 
 mini-cluster. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542052#comment-14542052
 ] 

Hadoop QA commented on YARN-3641:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 39s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear 
to include any new or modified tests.  Please justify why no new tests are 
needed for this patch. Also please list what manual steps were performed to 
verify this patch. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 35s | There were no new checkstyle 
issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  3s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   6m  0s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  42m  5s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732578/YARN-3641.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/whitespace.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7921/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7921/console |


This message was automatically generated.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
 

[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542051#comment-14542051
 ] 

Jason Lowe commented on YARN-41:


I think avoiding the unregister during shutdown _if_ the NM is under 
supervision (i.e.: we know it will be restarted momentarily) is fine.  I was 
only bringing up the point since you mentioned the latest patch already covered 
this, but that patch is checking for active applications to decide whether to 
unregister.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41.patch


 Instead of waiting for the NM expiry, RM should remove and handle the NM, 
 which is shutdown gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3628) ContainerMetrics should support always-flush mode.

2015-05-13 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-3628:

Issue Type: Improvement  (was: Bug)

 ContainerMetrics should support always-flush mode.
 --

 Key: YARN-3628
 URL: https://issues.apache.org/jira/browse/YARN-3628
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
 Attachments: YARN-3628.000.patch


 ContainerMetrics should support always-flush mode.
 It will be good to set ContainerMetrics as always-flush mode if 
 yarn.nodemanager.container-metrics.period-ms is configured as 0.
 Currently both 0 and -1 mean flush on completion.
 Also, the current default value for 
 yarn.nodemanager.container-metrics.period-ms is -1 and the default value for 
 yarn.nodemanager.container-metrics.enable is true, so empty content is 
 shown for an active container's metrics until it is finished.
 The default value for yarn.nodemanager.container-metrics.period-ms should not 
 be -1: flushOnPeriod is always false if flushPeriodMs is -1, so the content 
 will only be shown once the container has finished.
 {code}
 if (finished || flushOnPeriod) {
   registry.snapshot(collector.addRecord(registry.info()), all);
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using

2015-05-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542563#comment-14542563
 ] 

Wangda Tan commented on YARN-3638:
--

I think this is useful. Maybe one possible way is to add a switch in the RM 
scheduler UI to toggle between showing used-capacity and absolute-used-capacity 
in the queue bars.

 Yarn Resource Manager Scheduler page - show percentage of total cluster that 
 each queue is using
 

 Key: YARN-3638
 URL: https://issues.apache.org/jira/browse/YARN-3638
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler, resourcemanager, scheduler, yarn
Affects Versions: 2.6.0
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor

 Request to show % of total cluster resources each queue is currently 
 consuming for jobs on the Yarn Resource Manager Scheduler page.
 Currently the Yarn Resource Manager Scheduler page shows the % of total used 
 for root queue and the % of each given queue's configured capacity that is 
 used (often showing say 150% if the max capacity is greater than configured 
 capacity to allow bursting where there are free resources). This is fine, but 
 it would be good to additionally show the % of total cluster that each given 
 queue is consuming and not just the % of that queue's configured capacity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542553#comment-14542553
 ] 

Hadoop QA commented on YARN-3626:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 44s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 32s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   1m 46s | The applied patch generated  3 
new checkstyle issues (total was 211, now 214). |
| {color:green}+1{color} | whitespace |   0m  1s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   3m 23s | The patch does not introduce 
any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | mapreduce tests |   0m 46s | Tests passed in 
hadoop-mapreduce-client-common. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in 
hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |   5m 59s | Tests passed in 
hadoop-yarn-server-nodemanager. |
| | |  46m 57s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12732645/YARN-3626.6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / cdec12d |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt
 |
| hadoop-mapreduce-client-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt
 |
| hadoop-yarn-api test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-yarn-api.txt
 |
| hadoop-yarn-server-nodemanager test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/7924/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7924/console |


This message was automatically generated.

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java

2015-05-13 Thread Lee Hounshell (JIRA)
Lee Hounshell created YARN-3642:
---

 Summary: Hadoop2 yarn.resourcemanager.scheduler.address not loaded 
by RMProxy.java
 Key: YARN-3642
 URL: https://issues.apache.org/jira/browse/YARN-3642
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.7.0
 Environment: yarn-site.xml:
<configuration>

   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>

   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>

   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>qadoop-nn001.apsalar.com</value>
   </property>

   <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>qadoop-nn001.apsalar.com:8030</value>
   </property>

   <property>
      <name>yarn.resourcemanager.address</name>
      <value>qadoop-nn001.apsalar.com:8032</value>
   </property>

   <property>
      <name>yarn.resourcemanager.webap.address</name>
      <value>qadoop-nn001.apsalar.com:8088</value>
   </property>

   <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>qadoop-nn001.apsalar.com:8031</value>
   </property>

   <property>
      <name>yarn.resourcemanager.admin.address</name>
      <value>qadoop-nn001.apsalar.com:8033</value>
   </property>

   <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
   </property>

   <property>
      <description>Where to aggregate logs to.</description>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>/var/log/hadoop/apps</value>
   </property>

   <property>
      <name>yarn.web-proxy.address</name>
      <value>qadoop-nn001.apsalar.com:8088</value>
   </property>

</configuration>


core-site.xml:
<configuration>

   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://qadoop-nn001.apsalar.com</value>
   </property>

   <property>
      <name>hadoop.proxyuser.hdfs.hosts</name>
      <value>*</value>
   </property>

   <property>
      <name>hadoop.proxyuser.hdfs.groups</name>
      <value>*</value>
   </property>

</configuration>


hdfs-site.xml:
<configuration>

   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>

   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/hadoop/nn</value>
   </property>

   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/hadoop/dn/dfs</value>
   </property>

   <property>
      <name>dfs.http.address</name>
      <value>qadoop-nn001.apsalar.com:50070</value>
   </property>

   <property>
      <name>dfs.secondary.http.address</name>
      <value>qadoop-nn002.apsalar.com:50090</value>
   </property>

</configuration>


mapred-site.xml:
<configuration>

   <property>
      <name>mapred.job.tracker</name>
      <value>qadoop-nn001.apsalar.com:8032</value>
   </property>

   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>

   <property>
      <name>mapreduce.jobhistory.address</name>
      <value>qadoop-nn001.apsalar.com:10020</value>
      <description>the JobHistoryServer address.</description>
   </property>

   <property>
      <name>mapreduce.jobhistory.webapp.address</name>
      <value>qadoop-nn001.apsalar.com:19888</value>
      <description>the JobHistoryServer web address</description>
   </property>

</configuration>


hbase-site.xml:
<configuration>

<property>
   <name>hbase.master</name>
   <value>qadoop-nn001.apsalar.com:6</value>
</property>

<property>
   <name>hbase.rootdir</name>
   <value>hdfs://qadoop-nn001.apsalar.com:8020/hbase</value>
</property>

<property>
   <name>hbase.cluster.distributed</name>
   <value>true</value>
</property>

<property>
   <name>hbase.zookeeper.property.dataDir</name>
   <value>/opt/local/zookeeper</value>
</property>

<property>
   <name>hbase.zookeeper.property.clientPort</name>
   <value>2181</value>
</property>

<property>
   <name>hbase.zookeeper.quorum</name>
   <value>qadoop-nn001.apsalar.com</value>
</property>

<property>
   <name>zookeeper.session.timeout</name>
   <value>18</value>
</property>

</configuration>

Reporter: Lee Hounshell


There is an issue with Hadoop 2.7.0 in distributed operation where the datanode 
is unable to reach the YARN scheduler. In our yarn-site.xml, we have defined 
this address to be:

   <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>qadoop-nn001.apsalar.com:8030</value>
   </property>

But when running an Oozie job, the problem manifests in the job logs for the 
YARN container. We see logs similar to the following showing the connection 
problem:

[main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065
2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26
2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract 

[jira] [Resolved] (YARN-2221) WebUI: RM scheduler page's queue filter status will affect appllication page

2015-05-13 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong resolved YARN-2221.
-
Resolution: Duplicate

 WebUI: RM scheduler page's queue filter status will affect appllication page
 

 Key: YARN-2221
 URL: https://issues.apache.org/jira/browse/YARN-2221
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Peng Zhang
Priority: Minor

 The apps queue filter added by clicking a queue bar on the scheduler page will 
 affect the display of the applications page.
 No filter query is shown on the applications page, which makes this confusing.
 Also, we cannot reset the filter query on the applications page; we must come 
 back to the scheduler page and click the root queue to reset it. 
 Reproduce steps: 
 {code}
 1) Configure two queues under root (A & B)
 2) Run some apps using queue A and B respectively
 3) Click “A” queue in scheduler page
 4) Click “Applications”, only apps of queue A show
 5) Click “B” queue in scheduler page
 6) Click “Applications”, only apps of queue B show
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2221) WebUI: RM scheduler page's queue filter status will affect appllication page

2015-05-13 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542390#comment-14542390
 ] 

Xuan Gong commented on YARN-2221:
-

Actually, they are duplicates. Closing this ticket as a duplicate. We could fix 
them together at https://issues.apache.org/jira/browse/YARN-2238

 WebUI: RM scheduler page's queue filter status will affect appllication page
 

 Key: YARN-2221
 URL: https://issues.apache.org/jira/browse/YARN-2221
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager, webapp
Affects Versions: 2.4.0
Reporter: Peng Zhang
Priority: Minor

 The apps queue filter added by clicking a queue bar on the scheduler page will 
 affect the display of the applications page.
 No filter query is shown on the applications page, which makes this confusing.
 Also, we cannot reset the filter query on the applications page; we must come 
 back to the scheduler page and click the root queue to reset it. 
 Reproduce steps: 
 {code}
 1) Configure two queues under root (A & B)
 2) Run some apps using queue A and B respectively
 3) Click “A” queue in scheduler page
 4) Click “Applications”, only apps of queue A show
 5) Click “B” queue in scheduler page
 6) Click “Applications”, only apps of queue B show
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled

2015-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3585:
--
Priority: Critical  (was: Major)
Target Version/s: 2.7.1

Marking it as critical for 2.7.1 whichever way we go..

 Nodemanager cannot exit when decommission with NM recovery enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission, nodemanager log show stop but 
 process cannot end. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-3641:
--
Target Version/s: 2.7.1  (was: 2.8.0)

Marking it as critical for 2.7.1 whichever way we go..

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
     if (isStopping.getAndSet(true)) {
       return;
     }
     super.serviceStop();
     stopRecoveryStore();
     DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. Any 
 of those services failing to stop with an exception can cause stopRecoveryStore() 
 to be skipped, which means the levelDB store is never closed. So the next time 
 the NM starts, it fails with the exception above.
 We should put stopRecoveryStore() in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3634) TestMRTimelineEventHandling and TestApplication are broken

2015-05-13 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542423#comment-14542423
 ] 

Sangjin Lee commented on YARN-3634:
---

Thanks [~djp]!

 TestMRTimelineEventHandling and TestApplication are broken
 --

 Key: YARN-3634
 URL: https://issues.apache.org/jira/browse/YARN-3634
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Affects Versions: YARN-2928
Reporter: Sangjin Lee
Assignee: Sangjin Lee
 Fix For: YARN-2928

 Attachments: YARN-3634-YARN-2928.001.patch, 
 YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch, 
 YARN-3634-YARN-2928.004.patch


 TestMRTimelineEventHandling is broken. Relevant error message:
 {noformat}
 2015-05-12 06:28:56,415 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:57,416 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:58,416 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:28:59,417 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:00,418 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:01,419 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:02,420 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:03,420 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:04,421 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:05,422 INFO  [AsyncDispatcher event handler] ipc.Client 
 (Client.java:handleConnectionFailure(882)) - Retrying connect to server: 
 asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 9 time(s); retry 
 policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
 MILLISECONDS)
 2015-05-12 06:29:05,424 ERROR [AsyncDispatcher event handler] 
 collector.NodeTimelineCollectorManager 
 (NodeTimelineCollectorManager.java:postPut(121)) - Failed to communicate with 
 NM Collector Service for application_1431412130291_0001
 2015-05-12 06:29:05,425 WARN  [AsyncDispatcher event handler] 
 containermanager.AuxServices 
 (AuxServices.java:logWarningWhenAuxServiceThrowExceptions(261)) - The 
 auxService name is timeline_collector and it got an error at event: 
 CONTAINER_INIT
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 
 to asf904.gq1.ygridcore.net:0 failed on connection exception: 
 java.net.ConnectException: Connection refused; For more details see:  
 

[jira] [Updated] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-13 Thread Craig Welch (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Welch updated YARN-3626:
--
Attachment: YARN-3626.6.patch

Fix broken unit tests

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.
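 As an illustrative sketch of why the insertion point matters (hypothetical 
 names, not the actual YARN/MRApps classpath-jar code): entries earlier in the 
 classpath shadow later ones, so appending the localized user jars means system 
 resources always win.
 {code}
 import java.util.ArrayList;
 import java.util.List;

 // Illustrative only: not the actual classpath generation code.
 public class ClasspathOrderSketch {
   static List<String> buildClasspath(List<String> systemEntries,
       List<String> localizedUserJars, boolean userClasspathFirst) {
     List<String> cp = new ArrayList<>(systemEntries);
     if (userClasspathFirst) {
       cp.addAll(0, localizedUserJars);   // user jars take precedence
     } else {
       cp.addAll(localizedUserJars);      // user jars are only consulted last
     }
     return cp;
   }
 }
 {code}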



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled

2015-05-13 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542432#comment-14542432
 ] 

Jason Lowe commented on YARN-3585:
--

This is very likely a case where the leveldb state store was not closed 
properly on shutdown.   That was probably triggered by another exception that 
occurred during shutdown that short-circuited the shutdown of other services 
(like the state store).  See YARN-3641.

Could you check the NM logs for the case where it hung and see if another 
exception was logged during shutdown that may explain how the leveldb store 
failed to close?

 Nodemanager cannot exit when decommission with NM recovery enabled
 --

 Key: YARN-3585
 URL: https://issues.apache.org/jira/browse/YARN-3585
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Peng Zhang
Priority: Critical

 With NM recovery enabled, after decommission the NodeManager log shows it has 
 stopped, but the process does not exit. 
 non daemon thread:
 {noformat}
 DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on 
 condition [0x]
 leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable 
 [0x]
 VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable 
 Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 
 nid=0x29ed runnable 
 Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 
 nid=0x29ee runnable 
 Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 
 nid=0x29ef runnable 
 Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 
 nid=0x29f0 runnable 
 Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 
 nid=0x29f1 runnable 
 Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 
 nid=0x29f2 runnable 
 Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 
 nid=0x29f3 runnable 
 Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 
 nid=0x29f4 runnable 
 Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 
 runnable 
 Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 
 nid=0x29f5 runnable 
 Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 
 nid=0x29f6 runnable 
 VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting 
 on condition 
 {noformat}
 and jni leveldb thread stack
 {noformat}
 Thread 12 (Thread 0x7f33dd842700 (LWP 10903)):
 #0  0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from 
 /lib64/libpthread.so.0
 #1  0x7f33dfce2a3b in leveldb::(anonymous 
 namespace)::PosixEnv::BGThreadWrapper(void*) () from 
 /tmp/libleveldbjni-64-1-6922178968300745716.8
 #2  0x003d83407851 in start_thread () from /lib64/libpthread.so.0
 #3  0x003d830e811d in clone () from /lib64/libc.so.6
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.

2015-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542430#comment-14542430
 ] 

Vinod Kumar Vavilapalli commented on YARN-41:
-

It will be a much easier discussion if someone here can write down a truth 
table covering the various dimensions and when we do or don't want the NM to 
unregister.

 The RM should handle the graceful shutdown of the NM.
 -

 Key: YARN-41
 URL: https://issues.apache.org/jira/browse/YARN-41
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager, resourcemanager
Reporter: Ravi Teja Ch N V
Assignee: Devaraj K
  Labels: BB2015-05-TBR
 Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, 
 MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, 
 YARN-41-4.patch, YARN-41.patch


 Instead of waiting for the NM to expire, the RM should remove and handle an NM 
 that has been shut down gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3643) Provide a way to store only running applications in the state store

2015-05-13 Thread Varun Saxena (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Saxena reassigned YARN-3643:
--

Assignee: Varun Saxena

 Provide a way to store only running applications in the state store
 ---

 Key: YARN-3643
 URL: https://issues.apache.org/jira/browse/YARN-3643
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Karthik Kambatla
Assignee: Varun Saxena

 Today, we have a config that determines the number of applications that can 
 be stored in the state-store. Since there is no easy way to figure out the 
 maximum number of running applications at any point in time, users are forced 
 to use a conservative estimate. Our default ends up being even more 
 conservative.
 It would be nice to allow storing all running applications with a 
 conservative upper bound for it. This should allow for shorter recovery times 
 in most deployments. 
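 A minimal sketch of the existing knob the description refers to (the property 
 name below is an assumption based on yarn-default.xml; verify it against your 
 version):
 {code}
 import org.apache.hadoop.conf.Configuration;

 // Illustrative only: users today have to guess an upper bound that is always
 // large enough to cover the currently running applications.
 public class StateStoreLimitSketch {
   public static void main(String[] args) {
     Configuration conf = new Configuration();
     conf.setInt(
         "yarn.resourcemanager.state-store.max-completed-applications", 1000);
     System.out.println(conf.get(
         "yarn.resourcemanager.state-store.max-completed-applications"));
   }
 }
 {code}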



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long

2015-05-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-2921:
-
Summary: Fix MockRM/MockAM#waitForState sleep too long  (was: 
MockRM#waitForState methods can be too slow and flaky)

 Fix MockRM/MockAM#waitForState sleep too long
 -

 Key: YARN-2921
 URL: https://issues.apache.org/jira/browse/YARN-2921
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0, 2.7.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi Ozawa
 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
 YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
 YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
 YARN-2921.008.patch


 MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
 second). This leads to slow tests and sometimes failures if the 
 App/AppAttempt moves to another state. 
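 As an illustrative sketch of the general direction (a hypothetical helper, not 
 the actual MockRM code): poll frequently with an overall bound instead of 
 sleeping 1-2 seconds per iteration.
 {code}
 // Illustrative only; RMApp/RMAppState are real YARN types, the rest (and the
 // use of JUnit's Assert) is an assumption for the sketch.
 private void waitForAppState(RMApp app, RMAppState expected)
     throws InterruptedException {
   final long timeoutMs = 20000;   // overall bound on the wait
   final long pollMs = 100;        // short poll interval
   long start = System.currentTimeMillis();
   while (app.getState() != expected
       && System.currentTimeMillis() - start < timeoutMs) {
     Thread.sleep(pollMs);
   }
   Assert.assertEquals(expected, app.getState());
 }
 {code}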



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-13 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542609#comment-14542609
 ] 

Xuan Gong commented on YARN-3626:
-

Committed into trunk/branch-2/branch-2.7. Thanks, Craig.

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Fix For: 2.7.1

 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3579) CommonNodeLabelsManager should support NodeLabel instead of string label name when getting node-to-label/label-to-label mappings

2015-05-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3579:
-
Summary: CommonNodeLabelsManager should support NodeLabel instead of string 
label name when getting node-to-label/label-to-label mappings  (was: 
getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of 
label name as String)

 CommonNodeLabelsManager should support NodeLabel instead of string label name 
 when getting node-to-label/label-to-label mappings
 

 Key: YARN-3579
 URL: https://issues.apache.org/jira/browse/YARN-3579
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Sunil G
Assignee: Sunil G
Priority: Minor
 Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 
 0003-YARN-3579.patch, 0004-YARN-3579.patch


 CommonNodeLabelsManager#getLabelsToNodes returns the label name as a string. 
 It does not pass information such as exclusivity back to the REST interface 
 APIs.
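 A minimal sketch of the kind of structured object this implies (a hypothetical 
 class, not the actual NodeLabel/NodeLabelInfo API): a bare string can only 
 carry the name, while a structured object can also carry attributes such as 
 exclusivity.
 {code}
 // Illustrative only; field and method names are assumptions.
 public class NodeLabelSketch {
   private final String name;
   private final boolean exclusive;

   public NodeLabelSketch(String name, boolean exclusive) {
     this.name = name;
     this.exclusive = exclusive;
   }

   public String getName() { return name; }
   public boolean isExclusive() { return exclusive; }
 }
 {code}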



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3521) Support return structured NodeLabel objects in REST API

2015-05-13 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-3521:
-
Summary: Support return structured NodeLabel objects in REST API  (was: 
Support return structured NodeLabel objects in REST API when call 
getClusterNodeLabels)

 Support return structured NodeLabel objects in REST API
 ---

 Key: YARN-3521
 URL: https://issues.apache.org/jira/browse/YARN-3521
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, client, resourcemanager
Reporter: Wangda Tan
Assignee: Sunil G
 Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 
 0003-YARN-3521.patch, 0004-YARN-3521.patch, 0005-YARN-3521.patch, 
 0006-YARN-3521.patch, 0007-YARN-3521.patch


 In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we 
 should make the same change on the REST API side to keep them consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long

2015-05-13 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542678#comment-14542678
 ] 

Karthik Kambatla commented on YARN-2921:


Thanks Tsuyoshi for fixing this. Just curious - do we know how much improvement 
this leads to when running the RM tests? 

 Fix MockRM/MockAM#waitForState sleep too long
 -

 Key: YARN-2921
 URL: https://issues.apache.org/jira/browse/YARN-2921
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: test
Affects Versions: 2.6.0, 2.7.0
Reporter: Karthik Kambatla
Assignee: Tsuyoshi Ozawa
 Fix For: 2.8.0

 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, 
 YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, 
 YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, 
 YARN-2921.008.patch


 MockRM#waitForState methods currently sleep for too long (2 seconds and 1 
 second). This leads to slow tests and sometimes failures if the 
 App/AppAttempt moves to another state. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications

2015-05-13 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542747#comment-14542747
 ] 

Wangda Tan commented on YARN-3630:
--

+1 for the general idea, [~xinxianyin]. I think one very good point you 
mentioned in 
https://issues.apache.org/jira/browse/YARN-3630?focusedCommentId=14539662page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14539662
 is that showing the events waiting in the scheduler event handler queue in the 
web UI is more important for figuring out whether the scheduler is overloaded. 
That could be addressed in a separate JIRA.

 YARN should suggest a heartbeat interval for applications
 -

 Key: YARN-3630
 URL: https://issues.apache.org/jira/browse/YARN-3630
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager, scheduler
Affects Versions: 2.7.0
Reporter: Zoltán Zvara
Assignee: Xianyin Xin
Priority: Minor

 It seems currently applications - for example Spark - are not adaptive to RM 
 regarding heartbeat intervals. RM should be able to suggest a desired 
 heartbeat interval to applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.

2015-05-13 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543187#comment-14543187
 ] 

Xianyin Xin commented on YARN-3639:
---

Yes, you're right [~aw]. 

 It takes too long time for RM to recover all apps if the original active RM 
 and namenode is deployed on the same node.
 --

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin
 Attachments: YARN-3639-recovery_log_1_app.txt


 If the node on which the active RM runs dies and the active namenode is 
 running on the same node, the new RM will take a long time to recover all 
 apps. After analysis, we found the root cause is renewing HDFS tokens during 
 the recovery process. The HDFS client created by the renewer first tries to 
 connect to the original namenode, which times out after 10~20s, and then the 
 client tries to connect to the new namenode. The entire recovery costs 
 15*#apps seconds according to our test.
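 For example, under that estimate, recovering 1,000 applications would take 
 roughly 15,000 seconds, i.e. about 4 hours.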



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.

2015-05-13 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated YARN-3639:
--
Summary: It takes too long time for RM to recover all apps if the original 
active RM and NN go down at the same time.  (was: It takes too long time for RM 
to recover all apps if the original active RM and namenode is deployed on the 
same node.)

 It takes too long time for RM to recover all apps if the original active RM 
 and NN go down at the same time.
 

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin
 Attachments: YARN-3639-recovery_log_1_app.txt


 If the node on which the active RM runs dies and the active namenode is 
 running on the same node, the new RM will take a long time to recover all 
 apps. After analysis, we found the root cause is renewing HDFS tokens during 
 the recovery process. The HDFS client created by the renewer first tries to 
 connect to the original namenode, which times out after 10~20s, and then the 
 client tries to connect to the new namenode. The entire recovery costs 
 15*#apps seconds according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2015-05-13 Thread Akira AJISAKA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira AJISAKA reassigned YARN-2336:
---

Assignee: Akira AJISAKA  (was: Kenji Kikushima)

 Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
 --

 Key: YARN-2336
 URL: https://issues.apache.org/jira/browse/YARN-2336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.1
Reporter: Kenji Kikushima
Assignee: Akira AJISAKA
  Labels: BB2015-05-RFC
 Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, 
 YARN-2336.005.patch, YARN-2336.patch


 When we have sub-queues in the Fair Scheduler, the REST API returns JSON with 
 a missing '[' bracket for childQueues.
 This issue was found by [~ajisakaa] in YARN-1050.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree

2015-05-13 Thread Akira AJISAKA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543192#comment-14543192
 ] 

Akira AJISAKA commented on YARN-2336:
-

bq. Should we remove childQueue when childQueue is null for the consistency? 
Agree. I'll remove it from FairSchedulerLeafQueueInfo.

 Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
 --

 Key: YARN-2336
 URL: https://issues.apache.org/jira/browse/YARN-2336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.4.1
Reporter: Kenji Kikushima
Assignee: Akira AJISAKA
  Labels: BB2015-05-RFC
 Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, 
 YARN-2336.005.patch, YARN-2336.patch


 When we have sub-queues in the Fair Scheduler, the REST API returns JSON with 
 a missing '[' bracket for childQueues.
 This issue was found by [~ajisakaa] in YARN-1050.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.

2015-05-13 Thread Xianyin Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyin Xin updated YARN-3639:
--
Description: If the active RM and NN go down at the same time, the new RM will 
take a long time to recover all apps. After analysis, we found the root cause 
is renewing HDFS tokens during the recovery process. The HDFS client created by 
the renewer first tries to connect to the original NN, which times out after 
10~20s, and then the client tries to connect to the new NN. The entire recovery 
costs 15*#apps seconds according to our test.  
(was: If the node on which the active RM runs dies and if the active namenode 
is running on the same node, the new RM will take long time to recover all 
apps. After analysis, we found the root cause is renewing HDFS tokens in the 
recovering process. The HDFS client created by the renewer would firstly try to 
connect to the original namenode, the result of which is time-out after 10~20s, 
and then the client tries to connect to the new namenode. The entire recovery 
cost 15*#apps seconds according our test.)

 It takes too long time for RM to recover all apps if the original active RM 
 and NN go down at the same time.
 

 Key: YARN-3639
 URL: https://issues.apache.org/jira/browse/YARN-3639
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Xianyin Xin
 Attachments: YARN-3639-recovery_log_1_app.txt


 If the active RM and NN go down at the same time, the new RM will take a long 
 time to recover all apps. After analysis, we found the root cause is renewing 
 HDFS tokens during the recovery process. The HDFS client created by the 
 renewer first tries to connect to the original NN, which times out after 
 10~20s, and then the client tries to connect to the new NN. The entire 
 recovery costs 15*#apps seconds according to our test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543198#comment-14543198
 ] 

Rohith commented on YARN-3641:
--

Apologies for coming late to this JIRA. I think 
{{DefaultMetricsSystem.shutdown()}} should also be called in the finally block; 
otherwise a custom MetricsSinkAdapter implementation, like the one in 
HADOOP-11932, could hang the JVM.

 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If 
 any of those services throws an exception while stopping, stopRecoveryStore() 
 is skipped, which means the levelDB store is not closed. The next time the NM 
 starts, it then fails with the exception above. 
 We should put stopRecoveryStore() in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-13 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542598#comment-14542598
 ] 

Xuan Gong commented on YARN-3626:
-

+1 LGTM. Will commit

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes

2015-05-13 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-3632:
---
Component/s: capacityscheduler

 Ordering policy should be allowed to reorder an application when demand 
 changes
 ---

 Key: YARN-3632
 URL: https://issues.apache.org/jira/browse/YARN-3632
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: capacityscheduler
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-3632.0.patch


 At present, ordering policies have the option to have an application 
 re-ordered (for allocation and preemption) when it is allocated to or when a 
 container is recovered from the application. Some ordering policies may also 
 need to reorder when demand changes, if demand is part of the ordering 
 comparison; this needs to be made available (and used by the 
 fairorderingpolicy when sizebasedweight is true).
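 A hypothetical sketch of the kind of hook this implies (names are illustrative 
 assumptions, not the actual YARN OrderingPolicy interface):
 {code}
 // Illustrative only: the real interface and its generics differ.
 interface OrderingPolicySketch<S> {
   void containerAllocated(S entity);  // existing reorder trigger
   void containerReleased(S entity);   // existing reorder trigger
   void demandUpdated(S entity);       // proposed: reorder when demand changes
 }
 {code}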



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542694#comment-14542694
 ] 

Hudson commented on YARN-3521:
--

FAILURE: Integrated in Hadoop-trunk-Commit #7821 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7821/])
YARN-3521. Support return structured NodeLabel objects in REST API (Sunil G via 
wangda) (wangda: rev 7f19e7a2549a098236d06b29b7076bb037533f05)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeLabelInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/NodeIDsInfo.java
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsEntryList.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsEntry.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LabelsToNodesInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodeLabels.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeLabelsInfo.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java


 Support return structured NodeLabel objects in REST API
 ---

 Key: YARN-3521
 URL: https://issues.apache.org/jira/browse/YARN-3521
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, client, resourcemanager
Reporter: Wangda Tan
Assignee: Sunil G
 Fix For: 2.8.0

 Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 
 0003-YARN-3521.patch, 0004-YARN-3521.patch, 0005-YARN-3521.patch, 
 0006-YARN-3521.patch, 0007-YARN-3521.patch


 In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we 
 should make the same change on the REST API side to keep them consistent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

2015-05-13 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542742#comment-14542742
 ] 

Hudson commented on YARN-3641:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #7823 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/7823/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when 
exceptions happen in stopping NM's sub-services. Contributed by Junping Du 
(jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* hadoop-yarn-project/CHANGES.txt


 NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen 
 in stopping NM's sub-services.
 ---

 Key: YARN-3641
 URL: https://issues.apache.org/jira/browse/YARN-3641
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, rolling upgrade
Affects Versions: 2.6.0
Reporter: Junping Du
Assignee: Junping Du
Priority: Critical
 Fix For: 2.7.1

 Attachments: YARN-3641.patch


 If the NM's services are not stopped properly, we cannot restart the NM with 
 work-preserving NM restart enabled. The exception is as follows:
 {noformat}
 org.apache.hadoop.service.ServiceStateException: 
 org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock 
 /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource 
 temporarily unavailable
   at 
 org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
   at 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
 Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: 
 lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: 
 Resource temporarily unavailable
   at 
 org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
   at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
   at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
   at 
 org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
   at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
   ... 5 more
 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager 
 (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down NodeManager at 
 c6403.ambari.apache.org/192.168.64.103
 /
 {noformat}
 The related code is as below in NodeManager.java:
 {code}
   @Override
   protected void serviceStop() throws Exception {
 if (isStopping.getAndSet(true)) {
   return;
 }
 super.serviceStop();
 stopRecoveryStore();
 DefaultMetricsSystem.shutdown();
   }
 {code}
 We can see that all of the NM's registered services (NodeStatusUpdater, 
 LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If 
 any of those services throws an exception while stopping, stopRecoveryStore() 
 is skipped, which means the levelDB store is not closed. The next time the NM 
 starts, it then fails with the exception above. 
 We should put stopRecoveryStore() in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

