[jira] [Commented] (YARN-3411) [Storage implementation] explore the native HBase write schema for storage
[ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541730#comment-14541730 ] Junping Du commented on YARN-3411: -- Sure. Cancel the patch until we have new version. Thx! [Storage implementation] explore the native HBase write schema for storage -- Key: YARN-3411 URL: https://issues.apache.org/jira/browse/YARN-3411 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Vrushali C Priority: Critical Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.3.txt, YARN-3411.poc.4.txt, YARN-3411.poc.5.txt, YARN-3411.poc.6.txt, YARN-3411.poc.txt There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134). In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries. Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
Rohith created YARN-3640: Summary: NodeManager JVM continues to run after SHUTDOWN event is triggered. Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541742#comment-14541742 ] Rohith commented on YARN-3640: -- It looks similar, but I did not get how you took the *jni leveldb thread stack* mentioned in YARN-3585. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-160: --- Attachment: YARN-160.007.patch Uploaded 007.patch which improves some logging. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values come from the NM's config; we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as a YARN resource); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
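As an illustration of the approach described in YARN-160 (not taken from any of the attached patches; the class name and the offset handling below are assumptions), a minimal Linux-only sketch of probing /proc/meminfo and applying a reserved offset might look like this:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LinuxMemoryProbe {
  // Returns MemTotal in kB from /proc/meminfo, or -1 if it cannot be determined.
  static long readMemTotalKb() {
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("MemTotal:")) {
          // Line looks like: "MemTotal:       16384256 kB"
          return Long.parseLong(line.replaceAll("[^0-9]", ""));
        }
      }
    } catch (IOException e) {
      // fall through and report unknown
    }
    return -1;
  }

  // Hypothetical offset handling: reserve some memory for the OS and other daemons
  // before reporting the remainder as the YARN resource.
  static long yarnMemoryMb(long reservedForSystemMb) {
    long totalKb = readMemTotalKb();
    return totalKb < 0 ? -1 : Math.max(0, totalKb / 1024 - reservedForSystemMb);
  }
}
{code}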
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541751#comment-14541751 ] Peng Zhang commented on YARN-3640: -- I used pstack to get it. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated YARN-3638: -- Component/s: yarn scheduler resourcemanager capacityscheduler Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler, yarn Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3639: -- Assignee: (was: Xianyin Xin) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node. -- Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens in the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then tries to connect to the new namenode. The entire recovery costs 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3640: - Attachment: hadoop-rohith-nodemanager-test123.log Attached the nodemanager log file. It contains the logs from stopping the services. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3634) TestMRTimelineEventHandling and TestApplication are broken
[ https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541722#comment-14541722 ] Junping Du commented on YARN-3634: -- Thanks [~sjlee0] for reporting the issue and delivering the patch to fix it. The patch looks mostly good to me. Only one minor issue:
{code}
+if (nmCollectorService == null) {
+  synchronized (this) {
+    Configuration conf = getConfig();
+    InetSocketAddress nmCollectorServiceAddress = conf.getSocketAddr(
+        YarnConfiguration.NM_BIND_HOST,
+        YarnConfiguration.NM_COLLECTOR_SERVICE_ADDRESS,
+        YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_ADDRESS,
+        YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_PORT);
+    LOG.info("nmCollectorServiceAddress: " + nmCollectorServiceAddress);
+    final YarnRPC rpc = YarnRPC.create(conf);
+
+    // TODO Security settings.
+    nmCollectorService = (CollectorNodemanagerProtocol) rpc.getProxy(
+        CollectorNodemanagerProtocol.class,
+        nmCollectorServiceAddress, conf);
+  }
+}
{code}
The synchronized block seems unnecessary, as this is the only place that updates nmCollectorService and it is called from serviceStart(), which is called by a single thread only. A race condition could only happen with other reader threads, but given that the writer is always a single thread and nmCollectorService is already marked volatile in this patch, it should be safe to remove the synchronized block. TestMRTimelineEventHandling and TestApplication are broken -- Key: YARN-3634 URL: https://issues.apache.org/jira/browse/YARN-3634 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: YARN-3634-YARN-2928.001.patch, YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch TestMRTimelineEventHandling is broken. Relevant error message: {noformat} 2015-05-12 06:28:56,415 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:57,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:58,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:59,417 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:00,418 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:01,419 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0.
Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:02,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:03,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:04,421 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 8 time(s); retry policy is
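For the review comment on YARN-3634 above, a minimal sketch of what dropping the synchronized block could look like; the field declaration and body follow the quoted patch fragment, while the enclosing method name is assumed and the method is assumed to be invoked from serviceStart() on a single thread:
{code}
// With the field declared volatile and only one writer thread (serviceStart()),
// no synchronized block is needed: reader threads observe the fully constructed
// proxy once the volatile write happens.
private volatile CollectorNodemanagerProtocol nmCollectorService;

private void initNMCollectorService() {   // name assumed; called from serviceStart()
  if (nmCollectorService != null) {
    return;
  }
  Configuration conf = getConfig();
  InetSocketAddress nmCollectorServiceAddress = conf.getSocketAddr(
      YarnConfiguration.NM_BIND_HOST,
      YarnConfiguration.NM_COLLECTOR_SERVICE_ADDRESS,
      YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_ADDRESS,
      YarnConfiguration.DEFAULT_NM_COLLECTOR_SERVICE_PORT);
  LOG.info("nmCollectorServiceAddress: " + nmCollectorServiceAddress);
  final YarnRPC rpc = YarnRPC.create(conf);
  // TODO Security settings.
  nmCollectorService = (CollectorNodemanagerProtocol) rpc.getProxy(
      CollectorNodemanagerProtocol.class, nmCollectorServiceAddress, conf);
}
{code}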
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541743#comment-14541743 ] Rohith commented on YARN-3640: -- I missed this issue, thanks for pointing it out. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2921) MockRM#waitForState methods can be too slow and flaky
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-2921: - Attachment: YARN-2921.008.patch Resubmitting a patch again. MockRM#waitForState methods can be too slow and flaky - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0, 2.7.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, YARN-2921.008.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
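As background for YARN-2921, a sketch of the general approach (this is not the attached patch): poll the state in small increments with an overall deadline instead of sleeping for a fixed one or two seconds per check. The interval and timeout values below are assumptions:
{code}
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppState;
import org.junit.Assert;

public final class WaitUtil {
  private static final long POLL_INTERVAL_MS = 100;   // assumed values
  private static final long TIMEOUT_MS = 20000;

  // Returns as soon as the app reaches the expected state, instead of sleeping
  // a fixed long interval between checks.
  public static void waitForState(RMApp app, RMAppState expected)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + TIMEOUT_MS;
    while (app.getState() != expected && System.currentTimeMillis() < deadline) {
      Thread.sleep(POLL_INTERVAL_MS);
    }
    Assert.assertEquals("App did not reach expected state", expected, app.getState());
  }
}
{code}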
[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541727#comment-14541727 ] Hadoop QA commented on YARN-2921: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 40s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 32s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 40s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 39s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 49m 46s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 90m 20s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732511/YARN-2921.008.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 92c38e4 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7915/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7915/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7915/console | This message was automatically generated. MockRM#waitForState methods can be too slow and flaky - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0, 2.7.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, YARN-2921.008.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
Xianyin Xin created YARN-3639: - Summary: It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node. Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin Assignee: Xianyin Xin If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens in the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then tries to connect to the new namenode. The entire recovery costs 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2423) TimelineClient should wrap all GET APIs to facilitate Java users
[ https://issues.apache.org/jira/browse/YARN-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541590#comment-14541590 ] Hadoop QA commented on YARN-2423: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12697840/YARN-2423.007.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / e82067b | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7914/console | This message was automatically generated. TimelineClient should wrap all GET APIs to facilitate Java users Key: YARN-2423 URL: https://issues.apache.org/jira/browse/YARN-2423 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Robert Kanter Labels: BB2015-05-TBR Attachments: YARN-2423.004.patch, YARN-2423.005.patch, YARN-2423.006.patch, YARN-2423.007.patch, YARN-2423.patch, YARN-2423.patch, YARN-2423.patch TimelineClient provides the Java method to put timeline entities. It's also good to wrap over all GET APIs (both entity and domain), and deserialize the json response into Java POJO objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541592#comment-14541592 ] Tsuyoshi Ozawa commented on YARN-2336: -- I see. Should we remove childQueue when childQueue is null, for consistency? CapacityScheduler doesn't return childQueues if the queue is null (empty). Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree -- Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Labels: BB2015-05-RFC Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, YARN-2336.005.patch, YARN-2336.patch When we have sub queues in Fair Scheduler, the REST api returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] at YARN-1050. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1039) Add parameter for YARN resource requests to indicate long lived
[ https://issues.apache.org/jira/browse/YARN-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541612#comment-14541612 ] Steve Loughran commented on YARN-1039: -- +1 for a long-lived bit. Services can set the flag, and it is up for future versions of Hadoop to implement the logic to go with it. FWIW, I'd make the first use of the patch the YARN-1079 progress bar. Why? it's the least amount of server-side code changes (no scheduling patches), it fixes a tangible problem for users (progress bar is confusing), and it provides an immediate benefit to the apps —so encouraging them to set the flag, maybe even by reflection if they want to stay compatible across hadoop versions. Add parameter for YARN resource requests to indicate long lived - Key: YARN-1039 URL: https://issues.apache.org/jira/browse/YARN-1039 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 3.0.0, 2.1.1-beta Reporter: Steve Loughran Assignee: Craig Welch Attachments: YARN-1039.1.patch, YARN-1039.2.patch, YARN-1039.3.patch A container request could support a new parameter long-lived. This could be used by a scheduler that would know not to host the service on a transient (cloud: spot priced) node. Schedulers could also decide whether or not to allocate multiple long-lived containers on the same node -- This message was sent by Atlassian JIRA (v6.3.4#6332)
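To illustrate the reflection idea mentioned in the YARN-1039 comment above (purely hypothetical; no such setter exists yet, and the method name below is invented for the example), an application could probe for the proposed long-lived bit and set it only when the running Hadoop version supports it:
{code}
import java.lang.reflect.Method;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public final class LongLivedFlag {
  // "setLongLived" is an assumed, not-yet-existing API; on older Hadoop versions
  // the lookup fails and the request is left untouched.
  public static void markIfSupported(ResourceRequest req) {
    try {
      Method m = ResourceRequest.class.getMethod("setLongLived", boolean.class);
      m.invoke(req, true);
    } catch (ReflectiveOperationException e) {
      // Running against a Hadoop version without the flag; nothing to do.
    }
  }
}
{code}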
[jira] [Created] (YARN-3638) Yarn Scheduler show percentage of total cluster that a queue is using
Hari Sekhon created YARN-3638: - Summary: Yarn Scheduler show percentage of total cluster that a queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is consuming on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3591) Resource Localisation on a bad disk causes subsequent containers failure
[ https://issues.apache.org/jira/browse/YARN-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541705#comment-14541705 ] Varun Vasudev commented on YARN-3591: - [~zxu], [~lavkesh] - instead of listing the directory contents every time, can we use the signalling mechanism that [~zxu] added in YARN-3491? When a local dir goes bad, the tracker's listener gets called and it removes all the localized resources from the data structure. That way we are re-using the existing checks to make sure that a directory is good. Resource Localisation on a bad disk causes subsequent containers failure - Key: YARN-3591 URL: https://issues.apache.org/jira/browse/YARN-3591 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Attachments: 0001-YARN-3591.1.patch, 0001-YARN-3591.patch, YARN-3591.2.patch It happens when a resource is localised on a disk and, after localisation, that disk goes bad. The NM keeps paths for localised resources in memory. At the time of a resource request, isResourcePresent(rsrc) is called, which calls file.exists() on the localised path. In some cases when the disk has gone bad, inodes are still cached and file.exists() returns true, but at the time of reading the file will not open. Note: file.exists() actually calls stat64 natively, which returns true because it was able to find inode information from the OS. A proposal is to call file.list() on the parent path of the resource, which calls open() natively. If the disk is good it should return an array of paths with length at least 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
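For reference, a minimal sketch of the file.list()-based check proposed in the YARN-3591 description (not taken from any of the attached patches; the helper name is an assumption):
{code}
import java.io.File;

public final class LocalResourceCheck {
  // Returns true only if the parent directory can actually be opened and still
  // contains the localized file, guarding against stale inode caches on bad disks.
  public static boolean isResourceReadable(File localizedPath) {
    File parent = localizedPath.getParentFile();
    if (parent == null) {
      return localizedPath.exists();
    }
    String[] children = parent.list();  // forces a native open(); null if the dir cannot be read
    if (children == null) {
      return false;
    }
    for (String name : children) {
      if (name.equals(localizedPath.getName())) {
        return true;
      }
    }
    return false;
  }
}
{code}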
[jira] [Updated] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3640: - Attachment: nm_143.out nm_141.out Attaching the thread dumps of 2 NMs. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated YARN-3638: -- Summary: Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using (was: Yarn Scheduler show percentage of total cluster that a queue is using) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is consuming on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Sekhon updated YARN-3638: -- Description: Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. was: Request to show % of total cluster resources each queue is consuming on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541669#comment-14541669 ] Junping Du commented on YARN-41: bq. In the case of work-preserving NM restart (or under supervision as YARN-2331 calls it), we can make the NM not do an unregister? I think the latest patch (-4.patch) already does this, but my concern is a little broader: do users (or management tools for a YARN cluster, like Ambari) expect the same behavior for kill -9 on the NM daemon and a graceful shutdown of the NM daemon? With the current patch (assuming NM work preserving is disabled), the user will find that the RM no longer has this NM's info after shutting down the NM daemon, while kill -9 on the NM daemon keeps the old behavior (the RM still shows the NM as running and switches it to LOST after the timeout). Previously, the behavior of these two operations was the same. I don't think we care too much about keeping these two operations consistent, but I would like to call it out loudly to make sure we don't miss anything important. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, the RM should remove and handle the NM which is shut down gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541697#comment-14541697 ] nijel commented on YARN-3639: - hi [~xinxianyin] Thanks for reporting this issue. Can you attach the logs for this issue? It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node. -- Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens in the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then tries to connect to the new namenode. The entire recovery costs 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2921) MockRM#waitForState methods can be too slow and flaky
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541734#comment-14541734 ] Tsuyoshi Ozawa commented on YARN-2921: -- [~leftnoteasy] all tests are green. Could you check v8 patch and my comments? https://issues.apache.org/jira/browse/YARN-2921?focusedCommentId=14539843page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14539843 MockRM#waitForState methods can be too slow and flaky - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0, 2.7.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, YARN-2921.008.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541733#comment-14541733 ] Peng Zhang commented on YARN-3640: -- I've encountered the same problem and filed YARN-3585. I think it's related to the leveldb thread; I also see it in your thread dump. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith resolved YARN-3640. -- Resolution: Duplicate Closing as a duplicate. NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced a strange issue in the cluster where the NodeManager did not exit when the SHUTDOWN event was fired from NodeStatusUpdaterImpl. We took thread dumps and verified them, but did not get much idea of why the NM JVM did not exit. Three NodeManagers hit this problem at the same time, and all three NodeManager thread dumps look similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3617) Fix unused variable to get CPU frequency on Windows systems
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J.Andreina updated YARN-3617: - Attachment: YARN-3617.1.patch Attached an initial patch. Please review. Fix unused variable to get CPU frequency on Windows systems --- Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
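A minimal sketch of the fix described in YARN-3617, assuming the cpuFrequencyKhz field named in the report (the attached patch may differ):
{code}
// Return the refreshed frequency instead of the hard-coded -1.
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  return cpuFrequencyKhz;
}
{code}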
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541899#comment-14541899 ] Hadoop QA commented on YARN-3170: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 2m 51s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | site | 2m 56s | Site still builds. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | | | 6m 11s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732566/YARN-3170-006.patch | | Optional Tests | site | | git revision | trunk / 065d8f2 | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7918/console | This message was automatically generated. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3641: - Component/s: rolling upgrade nodemanager Affects Version/s: 2.6.0 Summary: NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. (was: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code is as below in NodeManager.java: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see we stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. Any of services get stopped with exception could cause stopRecoveryStore() get skipped which means levelDB store is not get closed. So next time NM start, it will get failed with exception above. We should put stopRecoveryStore(); in a final block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
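A minimal sketch of the finally-block approach described in YARN-3641 (not necessarily what the eventual patch will look like): keep the existing stop sequence, but guarantee the leveldb state store is closed even when one of the NM's sub-services throws during stop.
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();
    DefaultMetricsSystem.shutdown();
  } finally {
    // Runs even if a sub-service failed to stop, so the next NM start can reacquire the LOCK.
    stopRecoveryStore();
  }
}
{code}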
[jira] [Commented] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541846#comment-14541846 ] Brahma Reddy Battula commented on YARN-3170: Thanks [~ozawa] for the reminder. After Allen's comments, yes, I missed this. Kindly review the latest patch. YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541881#comment-14541881 ] Jason Lowe commented on YARN-41: In light of NM restart, one of the problems with having the NM check for active applications and then take different actions is that the NM has a significantly delayed view of the cluster relative to the RM. The RM could have decided to assign new containers (and thus new applications) to the node, but the NM hasn't seen the launch request from the AM yet. This has already caused other issues, see the early discussions in YARN-3535 where containers were killed because the node reconnected with no active applications reported and was handled as a node removed/node added sequence. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541892#comment-14541892 ] Hadoop QA commented on YARN-3489: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732569/YARN-3489-branch-2.7.02.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 065d8f2 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7919/console | This message was automatically generated. RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3489-branch-2.7.02.patch, YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, YARN-3489.03.patch Since the label support was added we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g.: large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
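To make the YARN-3489 idea concrete (schematic only, not one of the attached patches; the overload of SchedulerUtils.validateResourceRequest taking a pre-fetched QueueInfo is an assumption, not an existing API), hoisting the queue lookup out of the per-request loop might look like this:
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.YarnScheduler;

public final class RequestValidationSketch {
  // One queue lookup for the whole batch; each request is validated against the
  // same QueueInfo instead of rebuilding it per request.
  public static void validateResourceRequests(List<ResourceRequest> asks,
      Resource maxAllocation, String queueName, YarnScheduler scheduler)
      throws InvalidResourceRequestException {
    QueueInfo queueInfo = null;
    try {
      queueInfo = scheduler.getQueueInfo(queueName, false, false);
    } catch (IOException e) {
      // fall back to a null queueInfo, as the per-request lookup would have
    }
    for (ResourceRequest ask : asks) {
      // Assumed overload that accepts the pre-fetched QueueInfo.
      SchedulerUtils.validateResourceRequest(ask, maxAllocation, queueName, scheduler, queueInfo);
    }
  }
}
{code}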
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541861#comment-14541861 ] Varun Vasudev commented on YARN-160: The test failure is unrelated to the patch. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values come from the NM's config; we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as a YARN resource); this would allow reserving mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3641: - Description: If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code is as below in NodeManager.java: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see we stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. Any of services get stopped with exception could cause stopRecoveryStore() get skipped which means levelDB store is not get closed. So next time NM start, it will get failed with exception above. We should put stopRecoveryStore(); in a finally block. was: If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. 
The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45))
[jira] [Updated] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
[ https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-3579: -- Attachment: 0004-YARN-3579.patch Updating patch after a compilation problem. getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 0003-YARN-3579.patch, 0004-YARN-3579.patch CommonNodeLabelsManager#getLabelsToNodes returns the label name as a String. It does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3641) stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
Junping Du created YARN-3641: Summary: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Junping Du Priority: Critical If the NM's services are not stopped properly, we cannot restart the NM with work-preserving NM restart enabled. The exception is as follows: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code in NodeManager.java is as below: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see that all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of those services throws an exception while stopping, stopRecoveryStore() is skipped, which means the leveldb store is never closed. The next NM start then fails with the exception above. We should put stopRecoveryStore() in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3641: - Attachment: YARN-3641.patch Uploaded a quick patch to fix it. The issue is obvious and the solution is simple enough that no unit test is needed. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-3641.patch If the NM's services are not stopped properly, we cannot restart the NM with work-preserving NM restart enabled. The exception is as follows: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code in NodeManager.java is as below: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see that all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of those services throws an exception while stopping, stopRecoveryStore() is skipped, which means the leveldb store is never closed. The next NM start then fails with the exception above. We should put stopRecoveryStore() in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
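For reference, here is a minimal sketch of the proposed change (illustrative, not the attached YARN-3641.patch itself): wrap the existing serviceStop() body in try/finally so the recovery store is always closed, even if a sub-service fails to stop.
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();          // may throw if any registered sub-service fails to stop
    DefaultMetricsSystem.shutdown();
  } finally {
    // Always close the leveldb state store so the LOCK file is released and the
    // next NM start with work-preserving recovery enabled can reopen it.
    stopRecoveryStore();
  }
}
{code}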
[jira] [Commented] (YARN-3627) Preemption not triggered in Fair scheduler when maxResources is set on parent queue
[ https://issues.apache.org/jira/browse/YARN-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541782#comment-14541782 ] Bibin A Chundatt commented on YARN-3627: Hi [~kasha] Thank you for looking into this. But YARN-3405 also doesn't change *shouldAttemptPreemption()*, so the primary threshold check still happens. Subqueue Q1.1 is not preempted since the cluster is below the threshold, and Q1.2 will starve for resources. Preemption not triggered in Fair scheduler when maxResources is set on parent queue --- Key: YARN-3627 URL: https://issues.apache.org/jira/browse/YARN-3627 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, scheduler Environment: Suse 11 SP3, 2 NM Reporter: Bibin A Chundatt Consider the below scenario of fair scheduler configuration: Root (10GB cluster resource) --Q1 (maxResources 4GB) Q1.1 (maxResources 4GB) Q1.2 (maxResources 4GB) --Q2 (maxResources 6GB) No applications are running in Q2. Submit one application to Q1.1 with 50 maps; 4GB gets allocated to Q1.1. Now submit an application to Q1.2; it will be starving for memory indefinitely. Preemption will never get triggered since yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.8 and the cluster utilization stays below 0.8. *FairScheduler.java* {code} private boolean shouldAttemptPreemption() { if (preemptionEnabled) { return (preemptionUtilizationThreshold < Math.max( (float) rootMetrics.getAllocatedMB() / clusterResource.getMemory(), (float) rootMetrics.getAllocatedVirtualCores() / clusterResource.getVirtualCores())); } return false; } {code} Are we supposed to configure maxResources as 0mb and 0 cores on a running cluster so that all queues can always take the full cluster resources when available? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
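To make the starvation scenario concrete, here is a small self-contained check of the threshold arithmetic described above; the 10GB/4GB numbers come from the scenario, everything else is illustrative.
{code}
public class PreemptionThresholdExample {
  public static void main(String[] args) {
    float threshold = 0.8f;           // yarn.scheduler.fair.preemption.cluster-utilization-threshold
    long clusterMemoryMB = 10 * 1024; // 10GB cluster from the scenario
    long allocatedMB = 4 * 1024;      // Q1 capped at 4GB by maxResources

    float utilization = (float) allocatedMB / clusterMemoryMB; // 0.4
    // Mirrors shouldAttemptPreemption(): preemption is only attempted when cluster
    // utilization exceeds the threshold, so 0.4 < 0.8 means it never runs and Q1.2 starves.
    System.out.println("attempt preemption? " + (threshold < utilization)); // false
  }
}
{code}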
[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods
[ https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541772#comment-14541772 ] Hudson commented on YARN-3613: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/]) YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods. (nijel via kasha) (kasha: rev fe0df596271340788095cb43a1944e19ac4c2cf7) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods -- Key: YARN-3613 URL: https://issues.apache.org/jira/browse/YARN-3613 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: nijel Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3613-1.patch, yarn-3613-2.patch In TestContainerManagerSecurity, individual tests init and start Yarn cluster. This duplication can be avoided by moving that to setup. Further, one could merge the two @Test methods to avoid bringing up another mini-cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
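As an illustration of the change described above (a sketch, not the committed patch), the mini cluster can be brought up once in a JUnit setup method instead of in each test; the field names here are illustrative and the MiniYARNCluster constructor shown is the common (name, numNodeManagers, numLocalDirs, numLogDirs) form.
{code}
// Sketch only: start the mini YARN cluster once in setup instead of in every test.
private MiniYARNCluster yarnCluster;
private Configuration conf;

@Before
public void setUp() throws Exception {
  conf = new YarnConfiguration();
  yarnCluster =
      new MiniYARNCluster(TestContainerManagerSecurity.class.getName(), 1, 1, 1);
  yarnCluster.init(conf);
  yarnCluster.start();
}

@After
public void tearDown() {
  if (yarnCluster != null) {
    yarnCluster.stop();
    yarnCluster = null;
  }
}
{code}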
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541773#comment-14541773 ] Hudson commented on YARN-3539: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/]) YARN-3539. Updated timeline server documentation and marked REST APIs evolving. Contributed by Steve Loughran. (zjshen: rev fcd0702c10ce574b887280476aba63d6682d5271) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java * hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.7.1 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, YARN-3539.11.patch, timeline_get_api_examples.txt The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3505) Node's Log Aggregation Report with SUCCEED should not cached in RMApps
[ https://issues.apache.org/jira/browse/YARN-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541785#comment-14541785 ] Junping Du commented on YARN-3505: -- Latest patch LGTM. [~jianhe], about your previous comments, LogAggregationReport#(get/set)NodeId has already been removed, and LogAggregationReport#(get/set)DiagnosticMessage is necessary (see: AppLogAggregatorImpl.java). Any further comments from you? Node's Log Aggregation Report with SUCCEED should not cached in RMApps -- Key: YARN-3505 URL: https://issues.apache.org/jira/browse/YARN-3505 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Affects Versions: 2.8.0 Reporter: Junping Du Assignee: Xuan Gong Priority: Critical Attachments: YARN-3505.1.patch, YARN-3505.2.patch, YARN-3505.2.rebase.patch, YARN-3505.3.patch, YARN-3505.4.patch, YARN-3505.5.patch Per discussions in YARN-1402, we shouldn't cache every node's log aggregation report in RMApps forever, especially those finished with SUCCEED. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods
[ https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541794#comment-14541794 ] Hudson commented on YARN-3613: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/926/]) YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods. (nijel via kasha) (kasha: rev fe0df596271340788095cb43a1944e19ac4c2cf7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods -- Key: YARN-3613 URL: https://issues.apache.org/jira/browse/YARN-3613 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: nijel Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3613-1.patch, yarn-3613-2.patch In TestContainerManagerSecurity, individual tests init and start Yarn cluster. This duplication can be avoided by moving that to setup. Further, one could merge the two @Test methods to avoid bringing up another mini-cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541799#comment-14541799 ] Hudson commented on YARN-3513: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/926/]) YARN-3513. Remove unused variables in ContainersMonitorImpl and add debug (devaraj: rev 8badd82ce256e4dc8c234961120d62a88358ab39) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java * hadoop-yarn-project/CHANGES.txt Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch, YARN-3513.20150507-1.patch, YARN-3513.20150508-1.patch, YARN-3513.20150508-1.patch, YARN-3513.20150511-1.patch Some local variables in MonitoringThread.run() : {{vmemStillInUsage and pmemStillInUsage}} are not used and just updated. Instead we need to add debug log for overall resource usage by all containers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541795#comment-14541795 ] Hudson commented on YARN-3539: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/926/]) YARN-3539. Updated timeline server documentation and marked REST APIs evolving. Contributed by Steve Loughran. (zjshen: rev fcd0702c10ce574b887280476aba63d6682d5271) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java * hadoop-project/src/site/site.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java * hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.7.1 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, YARN-3539.11.patch, timeline_get_api_examples.txt The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.
[ https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541802#comment-14541802 ] Hudson commented on YARN-3629: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #926 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/926/]) YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NodeID is always printed as null in node manager initialization log. -- Key: YARN-3629 URL: https://issues.apache.org/jira/browse/YARN-3629 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Fix For: 2.8.0 Attachments: YARN-3629-1.patch In the NodeManager log during startup, the following line is printed: 2015-05-12 11:20:02,347 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager for *null* : physical-memory=4096 virtual-memory=8602 virtual-cores=8 This line is printed from NodeStatusUpdaterImpl.serviceInit, but the nodeId assignment happens only in NodeStatusUpdaterImpl.serviceStart: {code} protected void serviceStart() throws Exception { // NodeManager is the last service to start, so NodeId is available. this.nodeId = this.context.getNodeId(); {code} Assigning the node id in serviceInit is not feasible since it is generated by ContainerManagerImpl.serviceStart. The log statement can be moved to serviceStart to give the user the right information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
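A minimal sketch of the suggested fix (illustrative, not the attached YARN-3629-1.patch): move the summary log line into serviceStart, after the nodeId has been assigned. The resource variable names below are placeholders for whatever fields hold the values shown in the log line above.
{code}
@Override
protected void serviceStart() throws Exception {
  // NodeManager is the last service to start, so NodeId is available here.
  this.nodeId = this.context.getNodeId();
  // Log the summary here instead of in serviceInit, so nodeId is no longer null.
  // memoryMB, virtualMemoryMB and virtualCores are illustrative field names.
  LOG.info("Initialized nodemanager for " + nodeId + ":"
      + " physical-memory=" + memoryMB
      + " virtual-memory=" + virtualMemoryMB
      + " virtual-cores=" + virtualCores);
  // ... rest of serviceStart unchanged ...
}
{code}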
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541848#comment-14541848 ] Hadoop QA commented on YARN-160: \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 57s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 5 new or modified test files. | | {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 0s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 40s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 29s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 33s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | tools/hadoop tests | 15m 2s | Tests passed in hadoop-gridmix. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:red}-1{color} | yarn tests | 1m 57s | Tests failed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 0m 17s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 60m 30s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.nodelabels.TestFileSystemNodeLabelsStore | | Failed build | hadoop-yarn-server-nodemanager | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732537/YARN-160.007.patch | | Optional Tests | javac unit findbugs checkstyle javadoc | | git revision | trunk / 92c38e4 | | hadoop-gridmix test log | https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-gridmix.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7916/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7916/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7916/console | This message was automatically generated. 
nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Labels: BB2015-05-TBR Attachments: YARN-160.005.patch, YARN-160.006.patch, YARN-160.007.patch, apache-yarn-160.0.patch, apache-yarn-160.1.patch, apache-yarn-160.2.patch, apache-yarn-160.3.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
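As a rough illustration of the idea in the description (not any of the attached patches), an OS-specific provider interface with a configurable reserved offset might look like the following; every name here is hypothetical.
{code}
/** Hypothetical sketch: report resources detected from the OS, minus a reserved offset. */
interface NodeResourceDetector {
  /** Memory (MB) to expose to YARN, e.g. derived from /proc/meminfo on Linux. */
  long getAvailableMemoryMB();

  /** CPU cores to expose to YARN, e.g. derived from /proc/cpuinfo on Linux. */
  int getAvailableVCores();
}

/** Example implementation that subtracts an offset reserved for the OS and other daemons. */
class OffsetNodeResourceDetector implements NodeResourceDetector {
  private final long detectedMemoryMB;
  private final int detectedCores;
  private final long reservedMemoryMB; // kept back for the OS, DataNode, etc.
  private final int reservedCores;

  OffsetNodeResourceDetector(long detectedMemoryMB, int detectedCores,
      long reservedMemoryMB, int reservedCores) {
    this.detectedMemoryMB = detectedMemoryMB;
    this.detectedCores = detectedCores;
    this.reservedMemoryMB = reservedMemoryMB;
    this.reservedCores = reservedCores;
  }

  @Override
  public long getAvailableMemoryMB() {
    return Math.max(0, detectedMemoryMB - reservedMemoryMB);
  }

  @Override
  public int getAvailableVCores() {
    return Math.max(0, detectedCores - reservedCores);
  }
}
{code}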
[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.
[ https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541780#comment-14541780 ] Hudson commented on YARN-3629: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/]) YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NodeID is always printed as null in node manager initialization log. -- Key: YARN-3629 URL: https://issues.apache.org/jira/browse/YARN-3629 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Fix For: 2.8.0 Attachments: YARN-3629-1.patch In Node manager log during startup the following logs is printed 2015-05-12 11:20:02,347 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager for *null* : physical-memory=4096 virtual-memory=8602 virtual-cores=8 This line is printed from NodeStatusUpdaterImpl.serviceInit. But the nodeid assignment is happening only in NodeStatusUpdaterImpl.serviceStart {code} protected void serviceStart() throws Exception { // NodeManager is the last service to start, so NodeId is available. this.nodeId = this.context.getNodeId(); {code} Assigning the node id in serviceinit is not feasible since it is generated by ContainerManagerImpl.serviceStart. The log can be moved to service start to give right information to user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3513) Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers
[ https://issues.apache.org/jira/browse/YARN-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541777#comment-14541777 ] Hudson commented on YARN-3513: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/195/]) YARN-3513. Remove unused variables in ContainersMonitorImpl and add debug (devaraj: rev 8badd82ce256e4dc8c234961120d62a88358ab39) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Remove unused variables in ContainersMonitorImpl and add debug log for overall resource usage by all containers Key: YARN-3513 URL: https://issues.apache.org/jira/browse/YARN-3513 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Priority: Trivial Labels: newbie Fix For: 2.8.0 Attachments: YARN-3513.20150421-1.patch, YARN-3513.20150503-1.patch, YARN-3513.20150506-1.patch, YARN-3513.20150507-1.patch, YARN-3513.20150508-1.patch, YARN-3513.20150508-1.patch, YARN-3513.20150511-1.patch Some local variables in MonitoringThread.run() : {{vmemStillInUsage and pmemStillInUsage}} are not used and just updated. Instead we need to add debug log for overall resource usage by all containers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3640) NodeManager JVM continues to run after SHUTDOWN event is triggered.
[ https://issues.apache.org/jira/browse/YARN-3640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541790#comment-14541790 ] Rohith commented on YARN-3640: -- I am able to reproduce this always..!!! NodeManager JVM continues to run after SHUTDOWN event is triggered. --- Key: YARN-3640 URL: https://issues.apache.org/jira/browse/YARN-3640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.7.0 Reporter: Rohith Attachments: hadoop-rohith-nodemanager-test123.log, nm_141.out, nm_143.out We faced strange issue in the cluster that NodeManager did not exitted when the SHUTDOWN event is called from NodeStatusUpdaterImpl. Taken the thread dump and verified it, but did not get much idea why NM jvm not exited. At a time, for 3 NodeManger got this problem, and all the 3 NodeManager thread dump looks similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541812#comment-14541812 ] Peng Zhang commented on YARN-3585: -- As YARN-3640, Rohith has encountered the same problem. And we all see leveldb thread in thread stack. I think it's probably related with NM recovery. Decommission is not the key matter. [~devaraj.k] Do you enable NM recovery in your env? Nodemanager cannot exit when decommission with NM recovery enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3170) YARN architecture document needs updating
[ https://issues.apache.org/jira/browse/YARN-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula updated YARN-3170: --- Attachment: YARN-3170-006.patch YARN architecture document needs updating - Key: YARN-3170 URL: https://issues.apache.org/jira/browse/YARN-3170 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Allen Wittenauer Assignee: Brahma Reddy Battula Labels: BB2015-05-TBR Attachments: YARN-3170-002.patch, YARN-3170-003.patch, YARN-3170-004.patch, YARN-3170-005.patch, YARN-3170-006.patch, YARN-3170.patch The marketing paragraph at the top, NextGen MapReduce, etc are all marketing rather than actual descriptions. It also needs some general updates, esp given it reads as though 0.23 was just released yesterday. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3617) Fix unused variable to get CPU frequency on Windows systems
[ https://issues.apache.org/jira/browse/YARN-3617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541854#comment-14541854 ] Hadoop QA commented on YARN-3617: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 8s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 1s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 52s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | | | 39m 35s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732542/YARN-3617.1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 065d8f2 | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7917/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7917/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7917/console | This message was automatically generated. Fix unused variable to get CPU frequency on Windows systems --- Key: YARN-3617 URL: https://issues.apache.org/jira/browse/YARN-3617 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.7.0 Environment: Windows 7 x64 SP1 Reporter: Georg Berendt Assignee: J.Andreina Priority: Minor Attachments: YARN-3617.1.patch Original Estimate: 1h Remaining Estimate: 1h In the class 'WindowsResourceCalculatorPlugin.java' of the YARN project, there is an unused variable for CPU frequency. /** {@inheritDoc} */ @Override public long getCpuFrequency() { refreshIfNeeded(); return -1; } Please change '-1' to use 'cpuFrequencyKhz'. org/apache/hadoop/yarn/util/WindowsResourceCalculatorPlugin.java -- This message was sent by Atlassian JIRA (v6.3.4#6332)
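The description above spells the fix out; a sketch of the corrected method (assuming the existing cpuFrequencyKhz field that the issue refers to) would be:
{code}
/** {@inheritDoc} */
@Override
public long getCpuFrequency() {
  refreshIfNeeded();
  // Return the value refreshed from the OS instead of the hard-coded -1.
  return cpuFrequencyKhz;
}
{code}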
[jira] [Updated] (YARN-3489) RMServerUtils.validateResourceRequests should only obtain queue info once
[ https://issues.apache.org/jira/browse/YARN-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3489: --- Attachment: YARN-3489-branch-2.7.02.patch RMServerUtils.validateResourceRequests should only obtain queue info once - Key: YARN-3489 URL: https://issues.apache.org/jira/browse/YARN-3489 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Labels: BB2015-05-RFC Attachments: YARN-3489-branch-2.7.02.patch, YARN-3489-branch-2.7.patch, YARN-3489.01.patch, YARN-3489.02.patch, YARN-3489.03.patch Since the label support was added we now get the queue info for each request being validated in SchedulerUtils.validateResourceRequest. If validateResourceRequests needs to validate a lot of requests at a time (e.g.: large cluster with lots of varied locality in the requests) then it will get the queue info for each request. Since we build the queue info this generates a lot of unnecessary garbage, as the queue isn't changing between requests. We should grab the queue info once and pass it down rather than building it again for each request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
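A rough sketch of the refactoring described above (method and helper names are illustrative, not the actual patch): look the queue up once per batch, then reuse that QueueInfo for every request instead of rebuilding it per request.
{code}
// Sketch only (names are illustrative): fetch the queue info once per batch of requests.
public static void validateResourceRequests(List<ResourceRequest> asks,
    Resource maxResource, String queueName, YarnScheduler scheduler)
    throws InvalidResourceRequestException {
  QueueInfo queueInfo = null;
  try {
    // One lookup for the whole batch; the queue does not change between requests.
    queueInfo = scheduler.getQueueInfo(queueName, false, false);
  } catch (IOException e) {
    // Ignore: the per-request validation below can cope with a null QueueInfo.
  }
  for (ResourceRequest ask : asks) {
    // validateAgainstQueue is a hypothetical per-request check that now takes the
    // pre-fetched QueueInfo instead of looking it up again for each request.
    validateAgainstQueue(ask, maxResource, queueInfo);
  }
}
{code}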
[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541885#comment-14541885 ] Jason Lowe commented on YARN-3638: -- Isn't this the Absolute Used Capacity metric that is shown for each leaf queue? Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler, yarn Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541950#comment-14541950 ] Hari Sekhon commented on YARN-3638: --- [~jlowe] yes I believe it is % of absolute capacity that is shown, which is useful to seeing how much you're bursting over. It would be nice if RM would also show the % of the total cluster's capacity that the leaf queue was consuming. You could also extend this idea show the % of total cluster capacity that each job is consuming too. Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler, yarn Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.
[ https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541980#comment-14541980 ] Hudson commented on YARN-3629: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/]) YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt NodeID is always printed as null in node manager initialization log. -- Key: YARN-3629 URL: https://issues.apache.org/jira/browse/YARN-3629 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Fix For: 2.8.0 Attachments: YARN-3629-1.patch In Node manager log during startup the following logs is printed 2015-05-12 11:20:02,347 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager for *null* : physical-memory=4096 virtual-memory=8602 virtual-cores=8 This line is printed from NodeStatusUpdaterImpl.serviceInit. But the nodeid assignment is happening only in NodeStatusUpdaterImpl.serviceStart {code} protected void serviceStart() throws Exception { // NodeManager is the last service to start, so NodeId is available. this.nodeId = this.context.getNodeId(); {code} Assigning the node id in serviceinit is not feasible since it is generated by ContainerManagerImpl.serviceStart. The log can be moved to service start to give right information to user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541973#comment-14541973 ] Hudson commented on YARN-3539: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/]) YARN-3539. Updated timeline server documentation and marked REST APIs evolving. Contributed by Steve Loughran. (zjshen: rev fcd0702c10ce574b887280476aba63d6682d5271) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.7.1 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, YARN-3539.11.patch, timeline_get_api_examples.txt The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3629) NodeID is always printed as null in node manager initialization log.
[ https://issues.apache.org/jira/browse/YARN-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541999#comment-14541999 ] Hudson commented on YARN-3629: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/]) YARN-3629. NodeID is always printed as null in node manager (devaraj: rev 5c2f05cd9bad9bf9beb0f4ca18f4ae1bc3e84499) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt NodeID is always printed as null in node manager initialization log. -- Key: YARN-3629 URL: https://issues.apache.org/jira/browse/YARN-3629 Project: Hadoop YARN Issue Type: Bug Reporter: nijel Assignee: nijel Fix For: 2.8.0 Attachments: YARN-3629-1.patch In Node manager log during startup the following logs is printed 2015-05-12 11:20:02,347 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager for *null* : physical-memory=4096 virtual-memory=8602 virtual-cores=8 This line is printed from NodeStatusUpdaterImpl.serviceInit. But the nodeid assignment is happening only in NodeStatusUpdaterImpl.serviceStart {code} protected void serviceStart() throws Exception { // NodeManager is the last service to start, so NodeId is available. this.nodeId = this.context.getNodeId(); {code} Assigning the node id in serviceinit is not feasible since it is generated by ContainerManagerImpl.serviceStart. The log can be moved to service start to give right information to user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3539) Compatibility doc to state that ATS v1 is a stable REST API
[ https://issues.apache.org/jira/browse/YARN-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541993#comment-14541993 ] Hudson commented on YARN-3539: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/]) YARN-3539. Updated timeline server documentation and marked REST APIs evolving. Contributed by Steve Loughran. (zjshen: rev fcd0702c10ce574b887280476aba63d6682d5271) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenSelector.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntities.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomains.java * hadoop-common-project/hadoop-common/src/site/markdown/Compatibility.md * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/package-info.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/package-info.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainersInfo.java * hadoop-project/src/site/site.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/TimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppAttemptsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEvents.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDelegationTokenResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineDomain.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/ContainerInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/TimelineServer.md * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java Compatibility doc to state that ATS v1 is a stable REST API --- Key: YARN-3539 URL: https://issues.apache.org/jira/browse/YARN-3539 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.7.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.7.1 Attachments: HADOOP-11826-001.patch, HADOOP-11826-002.patch, TimelineServer.html, YARN-3539-003.patch, YARN-3539-004.patch, YARN-3539-005.patch, YARN-3539-006.patch, YARN-3539-007.patch, YARN-3539-008.patch, YARN-3539-009.patch, YARN-3539-010.patch, YARN-3539.11.patch, timeline_get_api_examples.txt The ATS v2 discussion and YARN-2423 have raised the question: how stable are the ATSv1 APIs? The existing compatibility document actually states that the History Server is [a stable REST API|http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#REST_APIs], which effectively means that ATSv1 has already been declared as a stable API. Clarify this by patching the compatibility document appropriately -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541969#comment-14541969 ] Jason Lowe commented on YARN-3638: -- bq. I believe it is % of absolute capacity that is shown No, the Absolute Used Capacity field is the amount of total cluster capacity being used by this queue. From the 2.6 code in CSQueueUtils.updateQueueStatistics: {code} absoluteUsedCapacity = Resources.divide(calculator, clusterResource, usedResources, clusterResource); {code} Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler, yarn Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
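To make the two numbers concrete (figures invented purely for illustration): on a 100 GB cluster, a queue configured at 20% capacity that is currently running 30 GB of containers shows Used Capacity = 30/20 = 150%, while Absolute Used Capacity = 30/100 = 30%, i.e. exactly the usedResources over clusterResource ratio computed by the Resources.divide call quoted above.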
[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods
[ https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541972#comment-14541972 ] Hudson commented on YARN-3613: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2124 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2124/]) YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods. (nijel via kasha) (kasha: rev fe0df596271340788095cb43a1944e19ac4c2cf7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods -- Key: YARN-3613 URL: https://issues.apache.org/jira/browse/YARN-3613 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: nijel Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3613-1.patch, yarn-3613-2.patch In TestContainerManagerSecurity, individual tests init and start Yarn cluster. This duplication can be avoided by moving that to setup. Further, one could merge the two @Test methods to avoid bringing up another mini-cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
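For readers unfamiliar with the pattern, a rough sketch of what moving the cluster bring-up into setup looks like; the field names and constructor arguments are illustrative, not the committed patch:
{code}
private MiniYARNCluster yarnCluster;
private Configuration conf = new YarnConfiguration();

@Before
public void setUp() {
  // Bring the mini cluster up once, instead of inside every test method.
  yarnCluster = new MiniYARNCluster(TestContainerManagerSecurity.class.getName(), 1, 1, 1);
  yarnCluster.init(conf);
  yarnCluster.start();
}

@After
public void tearDown() {
  if (yarnCluster != null) {
    yarnCluster.stop();
  }
}
{code}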
[jira] [Commented] (YARN-3579) getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String
[ https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541978#comment-14541978 ] Hadoop QA commented on YARN-3579: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 16s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 7m 43s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 52s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 52s | The applied patch generated 1 new checkstyle issues (total was 34, now 34). | | {color:red}-1{color} | whitespace | 0m 1s | The patch has 15 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | | | 39m 40s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732573/0004-YARN-3579.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 065d8f2 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/whitespace.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7920/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7920/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7920/console | This message was automatically generated. getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 0003-YARN-3579.patch, 0004-YARN-3579.patch CommonNodeLabelsManager#getLabelsToNodes returns label name as string. It is not passing information such as Exclusivity etc back to REST interface apis. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
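Roughly, the change under review is a signature-level one; an illustrative (not verbatim) before/after, where the new method name is only a guess:
{code}
// Existing: attributes such as exclusivity are lost once the label is flattened to a String.
Map<String, Set<NodeId>> getLabelsToNodes();

// Proposed direction: return NodeLabel objects so exclusivity and other attributes
// can reach the REST layer. Method name here is illustrative.
Map<NodeLabel, Set<NodeId>> getLabelsInfoToNodes();
{code}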
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542000#comment-14542000 ] Jason Lowe commented on YARN-3641: -- I think the patch approach is OK, but I'm not sure I agree with the problem analysis. We kill -9 the NM during rolling upgrades, which obviously will not cleanly shutdown the state store, yet we don't have the IO error lock problem. The issue is that the old NM process must still be running, which is why leveldb refuses to open the still-in-use database. In that sense this JIRA appears to be a duplicate of the same problems described in YARN-3585 and YARN-3640. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-3641.patch If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code is as below in NodeManager.java: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see we stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. Any of services get stopped with exception could cause stopRecoveryStore() get skipped which means levelDB store is not get closed. So next time NM start, it will get failed with exception above. 
We should put stopRecoveryStore(); in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
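A minimal sketch of that suggestion, keeping the existing method shape (not necessarily identical to the attached patch):
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();
    DefaultMetricsSystem.shutdown();
  } finally {
    // Always close the leveldb state store, even if stopping a sub-service threw,
    // so the next NM start does not fail on a stale LOCK file.
    stopRecoveryStore();
  }
}
{code}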
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542028#comment-14542028 ] Junping Du commented on YARN-3641: -- bq. We kill -9 the NM during rolling upgrades, which obviously will not cleanly shutdown the state store, yet we don't have the IO error lock problem. Yes. I also suspect that if old NM is still running. The bad news is our original environment is gone, may need sometime to reproduce this to see if the same problem of YARN-3585. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-3641.patch If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code is as below in NodeManager.java: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see we stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. Any of services get stopped with exception could cause stopRecoveryStore() get skipped which means levelDB store is not get closed. So next time NM start, it will get failed with exception above. We should put stopRecoveryStore(); in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542045#comment-14542045 ] Junping Du commented on YARN-41: Thanks [~jlowe] for sharing this perspective. I think YARN-3212 is facing the same situation as your last comment there. However, in this case we can make things simpler if we do not care about running applications on nodes with work preserving enabled: we simply never unregister the NM from the RM when work preserving is enabled. This is not only simpler to implement but also simpler for users to understand; otherwise the NM daemon's shutdown behavior looks random, since the node sometimes disappears from the RM and sometimes does not, and that behavior is then controlled by where containers happen to be allocated rather than by configuration. Thoughts? The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3628) ContainerMetrics should support always-flush mode.
[ https://issues.apache.org/jira/browse/YARN-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542036#comment-14542036 ] Karthik Kambatla commented on YARN-3628: bq. So the empty content is shown for the active container metrics until it is finished. Where are we showing this? jmx or a specific metrics sink? I am not convinced we should support a period of 0 ms, let alone by default: each container will be constantly publishing its usage metrics. Assuming non-positive period implies flushOnExit seems like the right approach to me. Also, since this was released as part of 2.7, we should avoid incompatible changes. ContainerMetrics should support always-flush mode. -- Key: YARN-3628 URL: https://issues.apache.org/jira/browse/YARN-3628 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3628.000.patch ContainerMetrics should support always-flush mode. It will be good to set ContainerMetrics as always-flush mode if yarn.nodemanager.container-metrics.period-ms is configured as 0. Currently both 0 and -1 mean flush on completion. Also the current default value for yarn.nodemanager.container-metrics.period-ms is -1 and the default value for yarn.nodemanager.container-metrics.enable is true. So the empty content is shown for the active container metrics until it is finished. The default value for yarn.nodemanager.container-metrics.period-ms should not be -1. flushOnPeriod is always false if flushPeriodMs is -1, the content will only be shown when the container is finished. {code} if (finished || flushOnPeriod) { registry.snapshot(collector.addRecord(registry.info()), all); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
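For clarity, one way the non-positive-period interpretation suggested above could look; this is illustrative only, and the surrounding field names may differ from the real ContainerMetrics code:
{code}
// Treat any non-positive yarn.nodemanager.container-metrics.period-ms as
// "flush on completion": no periodic snapshots while the container is running.
boolean flushOnPeriod = flushPeriodMs > 0;

if (finished || flushOnPeriod) {
  registry.snapshot(collector.addRecord(registry.info()), all);
}
{code}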
[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542078#comment-14542078 ] Devaraj K commented on YARN-3585: - Thanks for reply. I have enabled NM recovery in my env. Nodemanager cannot exit when decommission with NM recovery enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3613) TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods
[ https://issues.apache.org/jira/browse/YARN-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14541992#comment-14541992 ] Hudson commented on YARN-3613: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #184 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/184/]) YARN-3613. TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods. (nijel via kasha) (kasha: rev fe0df596271340788095cb43a1944e19ac4c2cf7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt TestContainerManagerSecurity should init and start Yarn cluster in setup instead of individual methods -- Key: YARN-3613 URL: https://issues.apache.org/jira/browse/YARN-3613 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: nijel Priority: Minor Labels: newbie Fix For: 2.8.0 Attachments: YARN-3613-1.patch, yarn-3613-2.patch In TestContainerManagerSecurity, individual tests init and start Yarn cluster. This duplication can be avoided by moving that to setup. Further, one could merge the two @Test methods to avoid bringing up another mini-cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542052#comment-14542052 ] Hadoop QA commented on YARN-3641: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 39s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 35s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 6m 0s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 42m 5s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732578/YARN-3641.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 065d8f2 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/whitespace.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7921/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7921/console | This message was automatically generated. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-3641.patch If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. 
The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542051#comment-14542051 ] Jason Lowe commented on YARN-41: I think avoiding the unregister during shutdown _if_ the NM is under supervision (i.e.: we know it will be restarted momentarily) is fine. I was only bringing up the point since you mentioned the latest patch already covered this, but that patch is checking for active applications to decide whether to unregister. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
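As a sketch of that idea (illustrative only, assuming the existing supervision setting; unRegisterNM() below is a hypothetical helper, not an actual method name):
{code}
// Skip unregistering from the RM when the NM is known to be under supervision,
// i.e. it will be restarted momentarily and should keep its containers.
boolean underSupervision = getConfig().getBoolean(
    YarnConfiguration.NM_RECOVERY_SUPERVISED,
    YarnConfiguration.DEFAULT_NM_RECOVERY_SUPERVISED);
if (!underSupervision) {
  unRegisterNM();
}
{code}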
[jira] [Updated] (YARN-3628) ContainerMetrics should support always-flush mode.
[ https://issues.apache.org/jira/browse/YARN-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3628: Issue Type: Improvement (was: Bug) ContainerMetrics should support always-flush mode. -- Key: YARN-3628 URL: https://issues.apache.org/jira/browse/YARN-3628 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3628.000.patch ContainerMetrics should support always-flush mode. It will be good to set ContainerMetrics as always-flush mode if yarn.nodemanager.container-metrics.period-ms is configured as 0. Currently both 0 and -1 mean flush on completion. Also the current default value for yarn.nodemanager.container-metrics.period-ms is -1 and the default value for yarn.nodemanager.container-metrics.enable is true. So the empty content is shown for the active container metrics until it is finished. The default value for yarn.nodemanager.container-metrics.period-ms should not be -1. flushOnPeriod is always false if flushPeriodMs is -1, the content will only be shown when the container is finished. {code} if (finished || flushOnPeriod) { registry.snapshot(collector.addRecord(registry.info()), all); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3638) Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using
[ https://issues.apache.org/jira/browse/YARN-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542563#comment-14542563 ] Wangda Tan commented on YARN-3638: -- I think this is useful. Maybe one possible way is add a switch in RM scheduler UI to change used-capacity and absolute-used-capacity showing in queue bars. Yarn Resource Manager Scheduler page - show percentage of total cluster that each queue is using Key: YARN-3638 URL: https://issues.apache.org/jira/browse/YARN-3638 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler, yarn Affects Versions: 2.6.0 Environment: HDP 2.2 Reporter: Hari Sekhon Priority: Minor Request to show % of total cluster resources each queue is currently consuming for jobs on the Yarn Resource Manager Scheduler page. Currently the Yarn Resource Manager Scheduler page shows the % of total used for root queue and the % of each given queue's configured capacity that is used (often showing say 150% if the max capacity is greater than configured capacity to allow bursting where there are free resources). This is fine, but it would be good to additionally show the % of total cluster that each given queue is consuming and not just the % of that queue's configured capacity. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542553#comment-14542553 ] Hadoop QA commented on YARN-3626: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 32s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 46s | The applied patch generated 3 new checkstyle issues (total was 211, now 214). | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 23s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | mapreduce tests | 0m 46s | Tests passed in hadoop-mapreduce-client-common. | | {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 5m 59s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 46m 57s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12732645/YARN-3626.6.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cdec12d | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-mapreduce-client-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7924/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7924/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7924/console | This message was automatically generated. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. 
On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
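In rough Java terms, the ordering difference being fixed looks like the following; all names here are illustrative and this is not the actual patch:
{code}
// When mapreduce.job.user.classpath.first is set, localized jar resources should be
// placed before the system classpath entries rather than appended after them.
List<String> classpath = new ArrayList<>();
if (userClasspathFirst) {
  classpath.addAll(localizedJarEntries);      // user/localized jars win
  classpath.addAll(systemClasspathEntries);
} else {
  classpath.addAll(systemClasspathEntries);
  classpath.addAll(localizedJarEntries);      // previous Windows behaviour: always appended
}
String windowsClasspath = String.join(File.pathSeparator, classpath);
{code}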
[jira] [Created] (YARN-3642) Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java
Lee Hounshell created YARN-3642: --- Summary: Hadoop2 yarn.resourcemanager.scheduler.address not loaded by RMProxy.java Key: YARN-3642 URL: https://issues.apache.org/jira/browse/YARN-3642 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: yarn-site.xml: configuration property nameyarn.nodemanager.aux-services/name valuemapreduce_shuffle/value /property property nameyarn.nodemanager.aux-services.mapreduce.shuffle.class/name valueorg.apache.hadoop.mapred.ShuffleHandler/value /property property nameyarn.resourcemanager.hostname/name valueqadoop-nn001.apsalar.com/value /property property nameyarn.resourcemanager.scheduler.address/name valueqadoop-nn001.apsalar.com:8030/value /property property nameyarn.resourcemanager.address/name valueqadoop-nn001.apsalar.com:8032/value /property property nameyarn.resourcemanager.webap.address/name valueqadoop-nn001.apsalar.com:8088/value /property property nameyarn.resourcemanager.resource-tracker.address/name valueqadoop-nn001.apsalar.com:8031/value /property property nameyarn.resourcemanager.admin.address/name valueqadoop-nn001.apsalar.com:8033/value /property property nameyarn.log-aggregation-enable/name valuetrue/value /property property descriptionWhere to aggregate logs to./description nameyarn.nodemanager.remote-app-log-dir/name value/var/log/hadoop/apps/value /property property nameyarn.web-proxy.address/name valueqadoop-nn001.apsalar.com:8088/value /property /configuration core-site.xml: configuration property namefs.defaultFS/name valuehdfs://qadoop-nn001.apsalar.com/value /property property namehadoop.proxyuser.hdfs.hosts/name value*/value /property property namehadoop.proxyuser.hdfs.groups/name value*/value /property /configuration hdfs-site.xml: configuration property namedfs.replication/name value2/value /property property namedfs.namenode.name.dir/name valuefile:/hadoop/nn/value /property property namedfs.datanode.data.dir/name valuefile:/hadoop/dn/dfs/value /property property namedfs.http.address/name valueqadoop-nn001.apsalar.com:50070/value /property property namedfs.secondary.http.address/name valueqadoop-nn002.apsalar.com:50090/value /property /configuration mapred-site.xml: configuration property namemapred.job.tracker/name valueqadoop-nn001.apsalar.com:8032/value /property property namemapreduce.framework.name/name valueyarn/value /property property namemapreduce.jobhistory.address/name valueqadoop-nn001.apsalar.com:10020/value descriptionthe JobHistoryServer address./description /property property namemapreduce.jobhistory.webapp.address/name valueqadoop-nn001.apsalar.com:19888/value descriptionthe JobHistoryServer web address/description /property /configuration hbase-site.xml: configuration property namehbase.master/name valueqadoop-nn001.apsalar.com:6/value /property property namehbase.rootdir/name valuehdfs://qadoop-nn001.apsalar.com:8020/hbase/value /property property namehbase.cluster.distributed/name valuetrue/value /property property namehbase.zookeeper.property.dataDir/name value/opt/local/zookeeper/value /property property namehbase.zookeeper.property.clientPort/name value2181/value /property property namehbase.zookeeper.quorum/name valueqadoop-nn001.apsalar.com/value /property property namezookeeper.session.timeout/name value18/value /property /configuration Reporter: Lee Hounshell There is an issue with Hadoop 2.7.0 when in distributed operation the datanode is unable to reach the yarn scheduler. 
In our yarn-site.xml, we have defined this address as: {code} <property> <name>yarn.resourcemanager.scheduler.address</name> <value>qadoop-nn001.apsalar.com:8030</value> </property> {code} But when running an oozie job, the problem manifests when looking at the job logs for the yarn container. We see logs similar to the following showing the connection problem: [main] org.apache.hadoop.http.HttpServer2: Jetty bound to port 64065 2015-05-13 17:49:33,930 INFO [main] org.mortbay.log: jetty-6.1.26 2015-05-13 17:49:33,971 INFO [main] org.mortbay.log: Extract
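As a quick way to check which scheduler address the client side actually resolves (a diagnostic sketch, not part of the original report; it only assumes the standard YarnConfiguration keys):
{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class PrintSchedulerAddress {
  public static void main(String[] args) {
    // Loads yarn-site.xml from the classpath; prints the scheduler address a client
    // would use, so a defaulted 0.0.0.0:8030 (config not picked up) shows up here.
    YarnConfiguration conf = new YarnConfiguration();
    InetSocketAddress addr = conf.getSocketAddr(
        YarnConfiguration.RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS,
        YarnConfiguration.DEFAULT_RM_SCHEDULER_PORT);
    System.out.println("Resolved yarn.resourcemanager.scheduler.address = " + addr);
  }
}
{code}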
[jira] [Resolved] (YARN-2221) WebUI: RM scheduler page's queue filter status will affect appllication page
[ https://issues.apache.org/jira/browse/YARN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong resolved YARN-2221. - Resolution: Duplicate WebUI: RM scheduler page's queue filter status will affect appllication page Key: YARN-2221 URL: https://issues.apache.org/jira/browse/YARN-2221 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Peng Zhang Priority: Minor Apps queue filter added by clicking queue bar in scheduler page will affect display of applications page. No filter query is shown on applications page, this makes confusions. Also we cannot reset the filter query on application page, and we must come back to scheduler page, click root queue to reset. Reproduce steps: {code} 1) Configure two queues under root( A B) 2) Run some apps using queue A and B respectively 3) Click “A” queue in scheduler page 4) Click “Applications”, only apps of queue A show 5) Click “B” queue in scheduler page 6) Click “Applications”, only apps of queue B show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2221) WebUI: RM scheduler page's queue filter status will affect appllication page
[ https://issues.apache.org/jira/browse/YARN-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542390#comment-14542390 ] Xuan Gong commented on YARN-2221: - Actually, they are duplicates. Closing this ticket as a duplicate. We could fix them together at https://issues.apache.org/jira/browse/YARN-2238 WebUI: RM scheduler page's queue filter status will affect appllication page Key: YARN-2221 URL: https://issues.apache.org/jira/browse/YARN-2221 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Peng Zhang Priority: Minor Apps queue filter added by clicking queue bar in scheduler page will affect display of applications page. No filter query is shown on applications page, this makes confusions. Also we cannot reset the filter query on application page, and we must come back to scheduler page, click root queue to reset. Reproduce steps: {code} 1) Configure two queues under root( A B) 2) Run some apps using queue A and B respectively 3) Click “A” queue in scheduler page 4) Click “Applications”, only apps of queue A show 5) Click “B” queue in scheduler page 6) Click “Applications”, only apps of queue B show {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3585: -- Priority: Critical (was: Major) Target Version/s: 2.7.1 Marking it as critical for 2.7.1 whichever way we go.. Nodemanager cannot exit when decommission with NM recovery enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3641: -- Target Version/s: 2.7.1 (was: 2.8.0) Marking it as critical for 2.7.1 whichever way we go.. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-3641.patch If NM' services not get stopped properly, we cannot start NM with enabling NM restart with work preserving. The exception is as following: {noformat} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555) Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930) at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) ... 5 more 2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 / {noformat} The related code is as below in NodeManager.java: {code} @Override protected void serviceStop() throws Exception { if (isStopping.getAndSet(true)) { return; } super.serviceStop(); stopRecoveryStore(); DefaultMetricsSystem.shutdown(); } {code} We can see we stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. Any of services get stopped with exception could cause stopRecoveryStore() get skipped which means levelDB store is not get closed. So next time NM start, it will get failed with exception above. We should put stopRecoveryStore(); in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3634) TestMRTimelineEventHandling and TestApplication are broken
[ https://issues.apache.org/jira/browse/YARN-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542423#comment-14542423 ] Sangjin Lee commented on YARN-3634: --- Thanks [~djp]! TestMRTimelineEventHandling and TestApplication are broken -- Key: YARN-3634 URL: https://issues.apache.org/jira/browse/YARN-3634 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: YARN-2928 Reporter: Sangjin Lee Assignee: Sangjin Lee Fix For: YARN-2928 Attachments: YARN-3634-YARN-2928.001.patch, YARN-3634-YARN-2928.002.patch, YARN-3634-YARN-2928.003.patch, YARN-3634-YARN-2928.004.patch TestMRTimelineEventHandling is broken. Relevant error message: {noformat} 2015-05-12 06:28:56,415 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:57,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:58,416 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:28:59,417 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:00,418 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:01,419 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:02,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:03,420 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:04,421 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. 
Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:05,422 INFO [AsyncDispatcher event handler] ipc.Client (Client.java:handleConnectionFailure(882)) - Retrying connect to server: asf904.gq1.ygridcore.net/67.195.81.148:0. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 2015-05-12 06:29:05,424 ERROR [AsyncDispatcher event handler] collector.NodeTimelineCollectorManager (NodeTimelineCollectorManager.java:postPut(121)) - Failed to communicate with NM Collector Service for application_1431412130291_0001 2015-05-12 06:29:05,425 WARN [AsyncDispatcher event handler] containermanager.AuxServices (AuxServices.java:logWarningWhenAuxServiceThrowExceptions(261)) - The auxService name is timeline_collector and it got an error at event: CONTAINER_INIT org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.net.ConnectException: Call From asf904.gq1.ygridcore.net/67.195.81.148 to asf904.gq1.ygridcore.net:0 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:
[jira] [Updated] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-3626: -- Attachment: YARN-3626.6.patch Fix broken unit tests On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3585) Nodemanager cannot exit when decommission with NM recovery enabled
[ https://issues.apache.org/jira/browse/YARN-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542432#comment-14542432 ] Jason Lowe commented on YARN-3585: -- This is very likely a case where the leveldb state store was not closed properly on shutdown. That was probably triggered by another exception that occurred during shutdown that short-circuited the shutdown of other services (like the state store). See YARN-3641. Could you check the NM logs for the case where it hung and see if another exception was logged during shutdown that may explain how the leveldb store failed to close? Nodemanager cannot exit when decommission with NM recovery enabled -- Key: YARN-3585 URL: https://issues.apache.org/jira/browse/YARN-3585 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Peng Zhang Priority: Critical With NM recovery enabled, after decommission, nodemanager log show stop but process cannot end. non daemon thread: {noformat} DestroyJavaVM prio=10 tid=0x7f3460011800 nid=0x29ec waiting on condition [0x] leveldb prio=10 tid=0x7f3354001800 nid=0x2a97 runnable [0x] VM Thread prio=10 tid=0x7f3460167000 nid=0x29f8 runnable Gang worker#0 (Parallel GC Threads) prio=10 tid=0x7f346002 nid=0x29ed runnable Gang worker#1 (Parallel GC Threads) prio=10 tid=0x7f3460022000 nid=0x29ee runnable Gang worker#2 (Parallel GC Threads) prio=10 tid=0x7f3460024000 nid=0x29ef runnable Gang worker#3 (Parallel GC Threads) prio=10 tid=0x7f3460025800 nid=0x29f0 runnable Gang worker#4 (Parallel GC Threads) prio=10 tid=0x7f3460027800 nid=0x29f1 runnable Gang worker#5 (Parallel GC Threads) prio=10 tid=0x7f3460029000 nid=0x29f2 runnable Gang worker#6 (Parallel GC Threads) prio=10 tid=0x7f346002b000 nid=0x29f3 runnable Gang worker#7 (Parallel GC Threads) prio=10 tid=0x7f346002d000 nid=0x29f4 runnable Concurrent Mark-Sweep GC Thread prio=10 tid=0x7f3460120800 nid=0x29f7 runnable Gang worker#0 (Parallel CMS Threads) prio=10 tid=0x7f346011c800 nid=0x29f5 runnable Gang worker#1 (Parallel CMS Threads) prio=10 tid=0x7f346011e800 nid=0x29f6 runnable VM Periodic Task Thread prio=10 tid=0x7f346019f800 nid=0x2a01 waiting on condition {noformat} and jni leveldb thread stack {noformat} Thread 12 (Thread 0x7f33dd842700 (LWP 10903)): #0 0x003d8340b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f33dfce2a3b in leveldb::(anonymous namespace)::PosixEnv::BGThreadWrapper(void*) () from /tmp/libleveldbjni-64-1-6922178968300745716.8 #2 0x003d83407851 in start_thread () from /lib64/libpthread.so.0 #3 0x003d830e811d in clone () from /lib64/libc.so.6 {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542430#comment-14542430 ] Vinod Kumar Vavilapalli commented on YARN-41: - It will be a much easier discussion if someone here can write down a truth table with various dimensions and when we want to/don't want to have the NM unregister. The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager, resourcemanager Reporter: Ravi Teja Ch N V Assignee: Devaraj K Labels: BB2015-05-TBR Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch, YARN-41-1.patch, YARN-41-2.patch, YARN-41-3.patch, YARN-41-4.patch, YARN-41.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3643) Provide a way to store only running applications in the state store
[ https://issues.apache.org/jira/browse/YARN-3643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena reassigned YARN-3643: -- Assignee: Varun Saxena Provide a way to store only running applications in the state store --- Key: YARN-3643 URL: https://issues.apache.org/jira/browse/YARN-3643 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Today, we have a config that determines the number of applications that can be stored in the state-store. Since there is no easy way to figure out the maximum number of running applications at any point in time, users are forced to use a conservative estimate. Our default ends up being even more conservative. It would be nice to allow storing all running applications with a conservative upper bound for it. This should allow for shorter recovery times in most deployments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
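To make the trade-off concrete, here is a minimal sketch (hypothetical class and method names, not RM code) of how a cap on stored completed applications is typically enforced; the relevant properties are, to the best of my knowledge, yarn.resourcemanager.max-completed-applications and yarn.resourcemanager.state-store.max-completed-applications, which should be verified against yarn-default.xml:
{code}
// Sketch only: evict the oldest completed application once the configured cap is
// reached. All names here are illustrative; the point is that a small cap drops
// history early while a large cap stores apps that will never need recovery.
import java.util.ArrayDeque;
import java.util.Deque;

public class CompletedAppCap {
  private final int maxCompletedApps;                  // e.g. the state-store cap value
  private final Deque<String> completedAppIds = new ArrayDeque<>();

  public CompletedAppCap(int maxCompletedApps) {
    this.maxCompletedApps = maxCompletedApps;
  }

  public void appCompleted(String appId) {
    completedAppIds.addLast(appId);
    while (completedAppIds.size() > maxCompletedApps) {
      String evicted = completedAppIds.removeFirst();  // oldest completed app goes first
      System.out.println("would remove " + evicted + " from the state store");
    }
  }
}
{code}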
[jira] [Updated] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2921: - Summary: Fix MockRM/MockAM#waitForState sleep too long (was: MockRM#waitForState methods can be too slow and flaky) Fix MockRM/MockAM#waitForState sleep too long - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0, 2.7.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, YARN-2921.008.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
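A minimal sketch of the direction the summary describes (placeholder names, not the real MockRM/MockAM API): poll the state frequently with a short sleep and a deadline instead of sleeping one or two seconds per probe:
{code}
import java.util.function.Supplier;

public final class PollingWait {
  // Placeholder for the MockRM/MockAM-style wait: check often with a short sleep so
  // tests finish quickly and are less likely to miss a transient state.
  public static <T> void waitForState(Supplier<T> currentState, T expected,
      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!expected.equals(currentState.get())) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Timed out waiting for " + expected
            + ", current: " + currentState.get());
      }
      Thread.sleep(50);   // short poll interval instead of a 1-2 second sleep
    }
  }
}
{code}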
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542609#comment-14542609 ] Xuan Gong commented on YARN-3626: - Committed into trunk/branch-2/branch-2.7. Thanks, Craig. On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Fix For: 2.7.1 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
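For readers following the classpath discussion, a small sketch (not the committed patch; names are illustrative) of the ordering rule the fix enforces: when the user-classpath-first setting is on, localized resources must come before the system entries rather than being appended:
{code}
// Illustrative sketch: build the classpath with localized entries first when the
// user-classpath-first flag is set; always appending them is the buggy behaviour.
import java.util.ArrayList;
import java.util.List;

public class ClasspathOrder {
  public static List<String> buildClasspath(List<String> systemEntries,
      List<String> localizedEntries, boolean userClasspathFirst) {
    List<String> cp = new ArrayList<>();
    if (userClasspathFirst) {
      cp.addAll(localizedEntries);   // localized jars win on class resolution
      cp.addAll(systemEntries);
    } else {
      cp.addAll(systemEntries);
      cp.addAll(localizedEntries);
    }
    return cp;
  }
}
{code}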
[jira] [Updated] (YARN-3579) CommonNodeLabelsManager should support NodeLabel instead of string label name when getting node-to-label/label-to-label mappings
[ https://issues.apache.org/jira/browse/YARN-3579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3579: - Summary: CommonNodeLabelsManager should support NodeLabel instead of string label name when getting node-to-label/label-to-label mappings (was: getLabelsToNodes in CommonNodeLabelsManager should support NodeLabel instead of label name as String) CommonNodeLabelsManager should support NodeLabel instead of string label name when getting node-to-label/label-to-label mappings Key: YARN-3579 URL: https://issues.apache.org/jira/browse/YARN-3579 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.6.0 Reporter: Sunil G Assignee: Sunil G Priority: Minor Attachments: 0001-YARN-3579.patch, 0002-YARN-3579.patch, 0003-YARN-3579.patch, 0004-YARN-3579.patch CommonNodeLabelsManager#getLabelsToNodes returns label names as strings. It does not pass information such as exclusivity back to the REST interface APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
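A brief sketch of the API shape this change is after (the SimpleNodeLabel type below is a stand-in, not the real org.apache.hadoop.yarn.api.records.NodeLabel): the mapping is keyed by a structured label that carries attributes such as exclusivity instead of a bare name:
{code}
// Sketch only: structured labels keep attributes like exclusivity attached to the
// mapping, so downstream layers (e.g. REST DAOs) no longer lose that information.
import java.util.Map;
import java.util.Set;

interface LabelMappings {
  // Before: keyed by the label name only, attributes are lost.
  Map<String, Set<String>> getLabelsToNodesByName();

  // After: keyed by a structured label object.
  Map<SimpleNodeLabel, Set<String>> getLabelsToNodes();
}

final class SimpleNodeLabel {
  final String name;
  final boolean exclusive;
  SimpleNodeLabel(String name, boolean exclusive) {
    this.name = name;
    this.exclusive = exclusive;
  }
}
{code}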
[jira] [Updated] (YARN-3521) Support return structured NodeLabel objects in REST API
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3521: - Summary: Support return structured NodeLabel objects in REST API (was: Support return structured NodeLabel objects in REST API when call getClusterNodeLabels) Support return structured NodeLabel objects in REST API --- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch, 0005-YARN-3521.patch, 0006-YARN-3521.patch, 0007-YARN-3521.patch In YARN-3413, the yarn cluster CLI returns NodeLabel instead of String; we should make the same change on the REST API side to keep them consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2921) Fix MockRM/MockAM#waitForState sleep too long
[ https://issues.apache.org/jira/browse/YARN-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542678#comment-14542678 ] Karthik Kambatla commented on YARN-2921: Thanks Tsuyoshi for fixing this. Just curious - do we know how much improvement this leads to when running the RM tests? Fix MockRM/MockAM#waitForState sleep too long - Key: YARN-2921 URL: https://issues.apache.org/jira/browse/YARN-2921 Project: Hadoop YARN Issue Type: Improvement Components: test Affects Versions: 2.6.0, 2.7.0 Reporter: Karthik Kambatla Assignee: Tsuyoshi Ozawa Fix For: 2.8.0 Attachments: YARN-2921.001.patch, YARN-2921.002.patch, YARN-2921.003.patch, YARN-2921.004.patch, YARN-2921.005.patch, YARN-2921.006.patch, YARN-2921.007.patch, YARN-2921.008.patch, YARN-2921.008.patch MockRM#waitForState methods currently sleep for too long (2 seconds and 1 second). This leads to slow tests and sometimes failures if the App/AppAttempt moves to another state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3630) YARN should suggest a heartbeat interval for applications
[ https://issues.apache.org/jira/browse/YARN-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542747#comment-14542747 ] Wangda Tan commented on YARN-3630: -- +1 for the general idea, [~xinxianyin]. I think one very good point you mentioned in https://issues.apache.org/jira/browse/YARN-3630?focusedCommentId=14539662page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14539662 is that showing the events waiting in the scheduler event handler queue in the web UI is even more important for figuring out whether the scheduler is overloaded. That could be addressed in a separate JIRA. YARN should suggest a heartbeat interval for applications - Key: YARN-3630 URL: https://issues.apache.org/jira/browse/YARN-3630 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, scheduler Affects Versions: 2.7.0 Reporter: Zoltán Zvara Assignee: Xianyin Xin Priority: Minor It seems applications - for example Spark - currently do not adapt their heartbeat intervals to the RM. The RM should be able to suggest a desired heartbeat interval to applications. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
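A hypothetical sketch of what honoring such a suggestion could look like on the AM side; the suggested value stands in for a field that would have to be added to the allocate response and does not exist in YARN today:
{code}
// Hypothetical sketch of the proposal: the AM keeps its usual allocate interval but
// backs off to whatever the RM suggests. The "rmSuggestedIntervalMs" parameter is a
// stand-in for a field the RM response would carry; it is not part of YARN today.
public final class HeartbeatIntervalPicker {
  private static final long DEFAULT_INTERVAL_MS = 1000;   // a typical static AM default

  public static long pick(Long rmSuggestedIntervalMs) {
    if (rmSuggestedIntervalMs != null && rmSuggestedIntervalMs > 0) {
      return rmSuggestedIntervalMs;   // RM asks the AM to slow down (or speed up)
    }
    return DEFAULT_INTERVAL_MS;       // fall back to the AM's own default
  }
}
{code}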
[jira] [Commented] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543187#comment-14543187 ] Xianyin Xin commented on YARN-3639: --- Yes, you're right [~aw]. It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node. -- Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin Attachments: YARN-3639-recovery_log_1_app.txt If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens during the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then does the client connect to the new namenode. The entire recovery costs about 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3639: -- Summary: It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time. (was: It takes too long time for RM to recover all apps if the original active RM and namenode is deployed on the same node.) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time. Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin Attachments: YARN-3639-recovery_log_1_app.txt If the node on which the active RM runs dies and the active namenode is running on the same node, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens during the recovery process. The HDFS client created by the renewer first tries to connect to the original namenode, which times out after 10~20s, and only then does the client connect to the new namenode. The entire recovery costs about 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA reassigned YARN-2336: --- Assignee: Akira AJISAKA (was: Kenji Kikushima) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree -- Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Akira AJISAKA Labels: BB2015-05-RFC Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, YARN-2336.005.patch, YARN-2336.patch When we have sub-queues in the Fair Scheduler, the REST API returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] in YARN-1050. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543192#comment-14543192 ] Akira AJISAKA commented on YARN-2336: - bq. Should we remove childQueue when childQueue is null for the consistency? Agree. I'll remove it from FairSchedulerLeafQueueInfo. Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree -- Key: YARN-2336 URL: https://issues.apache.org/jira/browse/YARN-2336 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Kenji Kikushima Assignee: Akira AJISAKA Labels: BB2015-05-RFC Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336-4.patch, YARN-2336.005.patch, YARN-2336.patch When we have sub-queues in the Fair Scheduler, the REST API returns JSON with a missing '[' bracket for childQueues. This issue was found by [~ajisakaa] in YARN-1050. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
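For readers who have not hit this, a made-up illustration of the shape of the inconsistency being described (field names abbreviated and elided): a single child queue can come back as a bare object while multiple children come back as an array, so consumers cannot rely on the '[' bracket being present:
{noformat}
Shape without the bracket (single child queue serialized as an object):
  "childQueues": { "queueName": "root.a", ... }

Expected shape (always a JSON array, even with one element):
  "childQueues": [ { "queueName": "root.a", ... } ]
{noformat}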
[jira] [Updated] (YARN-3639) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time.
[ https://issues.apache.org/jira/browse/YARN-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated YARN-3639: -- Description: If the active RM and NN go down at the same time, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens during the recovery process. The HDFS client created by the renewer first tries to connect to the original NN, which times out after 10~20s, and only then does the client connect to the new NN. The entire recovery costs about 15*#apps seconds according to our test. (was: If the node on which the active RM runs dies and if the active namenode is running on the same node, the new RM will take long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens in the recovering process. The HDFS client created by the renewer would firstly try to connect to the original namenode, the result of which is time-out after 10~20s, and then the client tries to connect to the new namenode. The entire recovery cost 15*#apps seconds according our test.) It takes too long time for RM to recover all apps if the original active RM and NN go down at the same time. Key: YARN-3639 URL: https://issues.apache.org/jira/browse/YARN-3639 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Xianyin Xin Attachments: YARN-3639-recovery_log_1_app.txt If the active RM and NN go down at the same time, the new RM will take a long time to recover all apps. After analysis, we found the root cause is renewing HDFS tokens during the recovery process. The HDFS client created by the renewer first tries to connect to the original NN, which times out after 10~20s, and only then does the client connect to the new NN. The entire recovery costs about 15*#apps seconds according to our test. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
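As a rough illustration of the scale implied by that description (the application count is made up for the example, not measured here): at roughly 15 s lost per application on the dead-NN connection timeout, recovering 1,000 applications would take on the order of 15,000 s, i.e. a little over four hours, which is why avoiding the per-app retry against the old NN dominates recovery time.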
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14543198#comment-14543198 ] Rohith commented on YARN-3641: -- Apologies for coming late to this JIRA. I think {{DefaultMetricsSystem.shutdown();}} should also be called in the finally block; otherwise a custom implementation of MetricsSinkAdapter, like the one in HADOOP-11932, could hang the JVM. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Fix For: 2.7.1 Attachments: YARN-3641.patch If the NM's services are not stopped properly, we cannot start the NM with work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 /
{noformat}
The related code in NodeManager.java is:
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  super.serviceStop();
  stopRecoveryStore();
  DefaultMetricsSystem.shutdown();
}
{code}
We stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. If any of those services throws while stopping, stopRecoveryStore() can be skipped, which means the leveldb store is not closed. So the next time the NM starts, it fails with the exception above. We should put stopRecoveryStore() in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
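Putting the description's suggestion and the comment above together, the shape of the fix would be roughly the following (a sketch of the idea, not the committed patch verbatim):
{code}
// Sketch: stop the recovery store and the metrics system even when a sub-service
// throws during stop, per the description and the DefaultMetricsSystem.shutdown()
// comment above.
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  try {
    super.serviceStop();                  // may throw if any sub-service fails to stop
  } finally {
    try {
      stopRecoveryStore();                // always release the leveldb state store
    } finally {
      DefaultMetricsSystem.shutdown();    // always stop the metrics system as well
    }
  }
}
{code}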
[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be
[ https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542598#comment-14542598 ] Xuan Gong commented on YARN-3626: - +1 LGTM. Will commit On Windows localized resources are not moved to the front of the classpath when they should be -- Key: YARN-3626 URL: https://issues.apache.org/jira/browse/YARN-3626 Project: Hadoop YARN Issue Type: Bug Components: yarn Environment: Windows Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch In response to the mapreduce.job.user.classpath.first setting the classpath is ordered differently so that localized resources will appear before system classpath resources when tasks execute. On Windows this does not work because the localized resources are not linked into their final location when the classpath jar is created. To compensate for that localized jar resources are added directly to the classpath generated for the jar rather than being discovered from the localized directories. Unfortunately, they are always appended to the classpath, and so are never preferred over system resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3632) Ordering policy should be allowed to reorder an application when demand changes
[ https://issues.apache.org/jira/browse/YARN-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-3632: --- Component/s: capacityscheduler Ordering policy should be allowed to reorder an application when demand changes --- Key: YARN-3632 URL: https://issues.apache.org/jira/browse/YARN-3632 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-3632.0.patch At present, ordering policies have the option to have an application re-ordered (for allocation and preemption) when it is allocated to or a container is recovered from the application. Some ordering policies may also need to reorder when demand changes, if demand is part of the ordering comparison; this needs to be made available (and used by the FairOrderingPolicy when sizeBasedWeight is true). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
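A generic sketch of why such a hook is needed (plain Java, unrelated to the scheduler classes): ordered collections do not re-sort an element when the fields its comparator reads change, so the policy has to remove and re-insert the application around the demand update:
{code}
// Generic sketch: a TreeSet only orders elements at insertion time, so changing the
// demand of an already-inserted app silently corrupts the ordering unless the app is
// removed first and re-added after the change.
import java.util.Comparator;
import java.util.TreeSet;

public class ReorderOnDemandChange {
  static final class App {
    final String id;
    long demand;
    App(String id, long demand) { this.id = id; this.demand = demand; }
  }

  private final TreeSet<App> order = new TreeSet<>(
      Comparator.comparingLong((App a) -> a.demand).thenComparing((App a) -> a.id));

  public void add(App app) { order.add(app); }

  // Equivalent in spirit to a "demand changed" notification to the ordering policy.
  public void demandChanged(App app, long newDemand) {
    order.remove(app);     // remove while the old demand is still in effect
    app.demand = newDemand;
    order.add(app);        // re-insert so the new demand participates in ordering
  }
}
{code}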
[jira] [Commented] (YARN-3521) Support return structured NodeLabel objects in REST API
[ https://issues.apache.org/jira/browse/YARN-3521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542694#comment-14542694 ] Hudson commented on YARN-3521: -- FAILURE: Integrated in Hadoop-trunk-Commit #7821 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7821/]) YARN-3521. Support return structured NodeLabel objects in REST API (Sunil G via wangda) (wangda: rev 7f19e7a2549a098236d06b29b7076bb037533f05) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeLabelInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/NodeIDsInfo.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsEntryList.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsEntry.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LabelsToNodesInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesNodeLabels.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeToLabelsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NodeLabelsInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java Support return structured NodeLabel objects in REST API --- Key: YARN-3521 URL: https://issues.apache.org/jira/browse/YARN-3521 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Sunil G Fix For: 2.8.0 Attachments: 0001-YARN-3521.patch, 0002-YARN-3521.patch, 0003-YARN-3521.patch, 0004-YARN-3521.patch, 0005-YARN-3521.patch, 0006-YARN-3521.patch, 0007-YARN-3521.patch In YARN-3413, yarn cluster CLI returns NodeLabel instead of String, we should make the same change in REST API side to make them consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
[ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542742#comment-14542742 ] Hudson commented on YARN-3641: -- SUCCESS: Integrated in Hadoop-trunk-Commit #7823 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7823/]) YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * hadoop-yarn-project/CHANGES.txt NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services. --- Key: YARN-3641 URL: https://issues.apache.org/jira/browse/YARN-3641 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, rolling upgrade Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Junping Du Priority: Critical Fix For: 2.7.1 Attachments: YARN-3641.patch If the NM's services are not stopped properly, we cannot start the NM with work-preserving NM restart enabled. The exception is as follows:
{noformat}
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
... 5 more
2015-05-12 00:34:45,262 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103 /
{noformat}
The related code in NodeManager.java is:
{code}
@Override
protected void serviceStop() throws Exception {
  if (isStopping.getAndSet(true)) {
    return;
  }
  super.serviceStop();
  stopRecoveryStore();
  DefaultMetricsSystem.shutdown();
}
{code}
We stop all NM registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) first. If any of those services throws while stopping, stopRecoveryStore() can be skipped, which means the leveldb store is not closed. So the next time the NM starts, it fails with the exception above. We should put stopRecoveryStore() in a finally block. -- This message was sent by Atlassian JIRA (v6.3.4#6332)