[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-810: Issue Type: Improvement (was: Bug) Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens even though the only guarantee that YARN/CGroups makes is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that cfs_period_us is set to .1s, not 1s. We can place hard CPU limits on processes. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ...
{noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times to burn through all the cores, since CGroups was behaving as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about introducing a variable to YarnConfiguration: bq.
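The proposed YarnConfiguration key is cut off above, but the mechanism itself can be sketched. The following is a minimal, hypothetical Java sketch (class and method names are invented; this is not the attached YARN-810 patch) of how a CGroups resources handler could translate a container's vcore allocation into a CFS ceiling by writing cpu.cfs_period_us and cpu.cfs_quota_us:
{code}
// Hypothetical sketch only: names are invented and this is not the YARN-810 patch.
// It shows how a CGroups handler could derive a CFS ceiling from a container's
// vcore request and write it into the container's cgroup directory.
import java.io.FileWriter;
import java.io.IOException;

public class CfsCeilingSketch {
  // 100000 us (0.1s), matching the cpu.cfs_period_us value observed above.
  private static final int CFS_PERIOD_US = 100000;

  /**
   * Caps the container at (containerVCores / vcoresPerPcore) physical cores,
   * e.g. 1 vcore with a 1:4 pcore:vcore ratio => quota of 25000 us per 100000 us.
   */
  public static void setCpuCeiling(String containerCgroupDir,
                                   int containerVCores,
                                   int vcoresPerPcore) throws IOException {
    int quotaUs = CFS_PERIOD_US * containerVCores / vcoresPerPcore;
    writeValue(containerCgroupDir + "/cpu.cfs_period_us", Integer.toString(CFS_PERIOD_US));
    writeValue(containerCgroupDir + "/cpu.cfs_quota_us", Integer.toString(quotaUs));
  }

  private static void writeValue(String file, String value) throws IOException {
    // The NM already owns the cgroup hierarchy, so a plain write suffices here.
    try (FileWriter w = new FileWriter(file)) {
      w.write(value);
    }
  }
}
{code}
Writing -1 back into cpu.cfs_quota_us, as in the manual test above, would remove the ceiling again and restore the default soft-cap behavior.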
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181053#comment-14181053 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656675/YARN-810.patch against trunk revision d71d40a. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5515//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens even though the only guarantee that YARN/CGroups makes is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that cfs_period_us is set to .1s, not 1s. We can place hard CPU limits on processes.
I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... {noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181055#comment-14181055 ] Hadoop QA commented on YARN-2724: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676532/YARN-2724.2.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5514//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5514//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5514//console This message is automatically generated. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. 
This is likely due to the fact that command-13.json is expected to be of length 13934, but it is not, as the file was never read. I think it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-
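The fix the reporter describes amounts to making LogLength describe whatever bytes are actually written for the entry. As a rough illustration only (this is not the attached YARN-2724 patch, and the class and method names are invented), an aggregator could fall back to the error text and record its length when the file cannot be read:
{code}
// Illustration only, not the actual AggregatedLogFormat code: when a log file cannot
// be read, the error text itself becomes the entry payload, and LogLength is the
// length of that text rather than the unread file's size on disk.
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class LogEntrySketch {
  public static void writeLogEntry(DataOutputStream out, File logFile) throws IOException {
    byte[] payload;
    try {
      payload = Files.readAllBytes(logFile.toPath());   // normal case: the file contents
    } catch (IOException e) {
      String msg = "Error aggregating log file. Log file : "
          + logFile.getAbsolutePath() + " (" + e.getMessage() + ")";
      payload = msg.getBytes(StandardCharsets.UTF_8);   // unreadable case: the message
    }
    out.writeUTF("LogType: " + logFile.getName());
    // LogLength must describe what is written next; otherwise the reader mis-parses
    // every entry that follows, as seen in the snippet above.
    out.writeUTF("LogLength: " + payload.length);
    out.writeUTF("Log Contents:");
    out.write(payload);
  }
}
{code}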
[jira] [Updated] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2724: Attachment: YARN-2724.3.patch fix -1 on findBug If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181074#comment-14181074 ] Hadoop QA commented on YARN-2724: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676535/YARN-2724.3.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5516//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5516//console This message is automatically generated. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. 
I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181090#comment-14181090 ] Hadoop QA commented on YARN-2701: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676533/YARN-2701.addendum.3.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5517//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5517//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch, YARN-2701.addendum.3.patch When LinuxContainerExecutor performs startLocalizer, it uses native code in container-executor.c: {code} if (stat(npath, &sb) != 0) { if (mkdir(npath, perm) != 0) { {code} It uses a check-then-create approach to create the appDir under /usercache, but if two containers try to do this at the same time, a race condition may occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
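The underlying problem is a classic check-then-act (TOCTOU) race. The actual fix belongs in the native container-executor.c, but the pattern is the same in any language: attempt the creation unconditionally and treat "already exists" as success, instead of checking first. A hypothetical Java sketch of that pattern (names invented, not the attached patch):
{code}
// Illustration only: the real fix is in native code, but the race-free pattern is the
// same everywhere: create first, and treat "already exists" as success, rather than
// the racy stat()-then-mkdir() check quoted above.
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class AppDirSketch {
  public static void ensureAppDir(String npath) throws IOException {
    Set<PosixFilePermission> perm = PosixFilePermissions.fromString("rwxr-x---");
    Path dir = Paths.get(npath);
    try {
      Files.createDirectory(dir, PosixFilePermissions.asFileAttribute(perm));
    } catch (FileAlreadyExistsException e) {
      // Another localizer won the race; that is fine as long as the directory exists.
      if (!Files.isDirectory(dir)) {
        throw e;
      }
    }
  }
}
{code}
In the native code the equivalent is calling mkdir() directly and treating an EEXIST errno as success.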
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181252#comment-14181252 ] Hudson commented on YARN-2198: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #721 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/721/]) YARN-2198. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor. Contributed by Remus Rusanu (jianhe: rev 3b12fd6cfbf4cc91ef8e8616c7aafa9de006cde5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * hadoop-common-project/hadoop-common/src/main/winutils/winutils.sln * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-common-project/hadoop-common/src/main/native/native.vcxproj * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/winutils.mc * hadoop-common-project/hadoop-common/src/main/winutils/service.c * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.c * hadoop-common-project/hadoop-common/src/main/winutils/config.cpp * hadoop-common-project/hadoop-common/src/main/winutils/hadoopwinutilsvc.idl * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-common-project/hadoop-common/src/main/winutils/main.c * hadoop-common-project/hadoop-common/src/main/winutils/winutils.vcxproj * hadoop-common-project/hadoop-common/pom.xml * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.vcxproj * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.h * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-common-project/hadoop-common/src/main/winutils/client.c * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * .gitignore * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181260#comment-14181260 ] Hudson commented on YARN-2700: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #721 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/721/]) YARN-2700 TestSecureRMRegistryOperations failing on windows: auth problems (stevel: rev 90e5ca24fbd3bb2da2a3879cc9b73f0b1d7f3e03) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/AbstractSecureRegistryTest.java * hadoop-yarn-project/CHANGES.txt TestSecureRMRegistryOperations failing on windows: auth problems Key: YARN-2700 URL: https://issues.apache.org/jira/browse/YARN-2700 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows Server, Win7 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2700-001.patch TestSecureRMRegistryOperations failing on windows: unable to create the root /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2732) Fix syntax error in SecureContainer.apt.vm
[ https://issues.apache.org/jira/browse/YARN-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181257#comment-14181257 ] Hudson commented on YARN-2732: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #721 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/721/]) YARN-2732. Fixed syntax error in SecureContainer.apt.vm. Contributed by Jian He. (zjshen: rev b94b8b30f282563ee2ecdd25761b2345aaf06c9b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/CHANGES.txt Fix syntax error in SecureContainer.apt.vm -- Key: YARN-2732 URL: https://issues.apache.org/jira/browse/YARN-2732 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2732.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181304#comment-14181304 ] Naganarasimha G R commented on YARN-2495: - Thanks for reviewing, Wangda: bq. 2) It seems NM_LABELS_FETCH_INTERVAL_MS not been used in the patch, did you forget to do that? -- Earlier I was planning to make only the script-based node labels dynamic and keep the configuration-based ones static. Now, based on your comment 4, I will make it dynamic and change the configuration name too. bq. 3) Regarding ResourceTrackerProtocol, I think NodeHeartbeatRequest should only report labels when labels changed. So there're 3 possible values of node labels in NodeHeartbeatRequest ... And RegisterNodeManagerRequest should report label every time registering. -- Yes, this was my plan and I will be doing it the same way. But I was thinking about one scenario: labels get changed, and a call to NodeLabelsProvider.getLabels() returns the new labels, but the heartbeat fails for some reason. In that case NodeLabelsProvider will not be able to detect this, and on the next request getLabels() will return null. So we should have some mechanism by which the NodeLabelsProvider is informed whether the RM accepted the change in labels, so that the appropriate set of labels is provided on the next call to getLabels() (if needed we can also keep the RM-rejected labels for logging purposes). Planning to have 3 methods in NodeLabelsProvider: * getNodeLabels() : to get the labels used for registration * getNodeLabelsOnModify() : to get the changed labels used for heartbeat * rmUpdateNodeLabelsStatus(boolean success) : to indicate that the next call to getNodeLabelsOnModify() can be reset to null bq. 4.1 Why this class extends from CompositeService? Did you want to add more component to it? If not, AbstractService should be enough. If the purpose of the NodeLabelsFetcherService is only create a NodeLabelsProvider, and the NodeLabelsProvider will take care of periodically read configuration from yarn-site.xml. I suggest to rename NodeLabelsFetcherService to NodeLabelsProviderFactory, and not extends from any Service, because the NodeLabelsProvider should be a Service. Rename NodeLabelsProvider to NodeLabelsProviderService if your purpose is as what I mentioned. -- Your idea seems better; I will try to do it the way you have specified, so NodeLabelsFetcherService will either become a factory or I will make it obsolete. ConfigurationNodeLabelsProvider: I will make it dynamic, i.e. it will periodically read yarn-site.xml and get the labels. {quote} 6) More implementation suggestions: Since we need central node labels configuration, I suggest to leverage what we already have in RM admin CLI directly – user can use RM admin CLI add/remove node labels. We can disable this when we're ready to do non-central node label configuration. And there should be an option to tell if distributed node label configuration is used. If it's distributed, AdminService should disable admin change labels on nodes via RM admin CLI. I suggest to do this in a separated JIRA. {quote} -- I presume central node labels configuration means the cluster's valid node labels stored on the RM side for validating labels; if so, OK, I will do it the same way as the RM Admin CLI. For ??If it's distributed, AdminService should disable admin change labels on nodes via RM admin CLI?? I will add a JIRA, but I was wondering how to do this: via a new configuration parameter?
I was earlier under the impression that MemoryRMNodeLabelsManager is for distributed configuration and RMNodeLabelsManager is for centralized configuration, and that some factory will take care of this. I will handle the other comments. Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch The target of this JIRA is to allow admins to specify labels on each NM; this covers - Users can set labels on each NM (by setting yarn-site.xml or using the script suggested by [~aw]) - The NM will send labels to the RM via the ResourceTracker API - The RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
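The three NodeLabelsProvider methods proposed in the comment above could look roughly like the following. This is only a sketch of the proposal, not code from any attached patch; the return type of Set of label strings is an assumption:
{code}
// Rough sketch of the three methods proposed in the comment above; method names and
// semantics follow the discussion, everything else (types, javadoc) is assumed.
import java.util.Set;

public interface NodeLabelsProvider {
  /** Labels to report in RegisterNodeManagerRequest when the NM registers. */
  Set<String> getNodeLabels();

  /**
   * Labels to report in NodeHeartbeatRequest; returns null when nothing has
   * changed since the last update the RM accepted.
   */
  Set<String> getNodeLabelsOnModify();

  /**
   * Callback telling the provider whether the RM accepted the last reported
   * change, so getNodeLabelsOnModify() is reset to null only on success.
   */
  void rmUpdateNodeLabelsStatus(boolean success);
}
{code}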
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181386#comment-14181386 ] Hudson commented on YARN-2700: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1910 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1910/]) YARN-2700 TestSecureRMRegistryOperations failing on windows: auth problems (stevel: rev 90e5ca24fbd3bb2da2a3879cc9b73f0b1d7f3e03) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/AbstractSecureRegistryTest.java TestSecureRMRegistryOperations failing on windows: auth problems Key: YARN-2700 URL: https://issues.apache.org/jira/browse/YARN-2700 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows Server, Win7 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2700-001.patch TestSecureRMRegistryOperations failing on windows: unable to create the root /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2692) ktutil test hanging on some machines/ktutil versions
[ https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181381#comment-14181381 ] Hudson commented on YARN-2692: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1910 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1910/]) YARN-2692 ktutil test hanging on some machines/ktutil versions (stevel) (stevel: rev 85a88649c3f3fb7280aa511b2035104bcef28a6f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/TestSecureLogins.java * hadoop-yarn-project/CHANGES.txt ktutil test hanging on some machines/ktutil versions Key: YARN-2692 URL: https://issues.apache.org/jira/browse/YARN-2692 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2692-001.patch a couple of the registry security tests run native {{ktutil}}; this is primarily to debug the keytab generation. [~cnauroth] reports that some versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181378#comment-14181378 ] Hudson commented on YARN-2198: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1910 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1910/]) YARN-2198. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor. Contributed by Remus Rusanu (jianhe: rev 3b12fd6cfbf4cc91ef8e8616c7aafa9de006cde5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java * hadoop-common-project/hadoop-common/src/main/winutils/winutils.sln * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * hadoop-common-project/hadoop-common/src/main/winutils/main.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/winutils.mc * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java * hadoop-common-project/hadoop-common/src/main/native/native.vcxproj * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java * hadoop-common-project/hadoop-common/src/main/winutils/service.c * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.h * hadoop-common-project/hadoop-common/src/main/winutils/winutils.vcxproj * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-common-project/hadoop-common/src/main/winutils/hadoopwinutilsvc.idl * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-common-project/hadoop-common/src/main/winutils/client.c * hadoop-common-project/hadoop-common/src/main/winutils/config.cpp * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/CHANGES.txt * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.vcxproj * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java * .gitignore * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-common-project/hadoop-common/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee:
[jira] [Commented] (YARN-2732) Fix syntax error in SecureContainer.apt.vm
[ https://issues.apache.org/jira/browse/YARN-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181383#comment-14181383 ] Hudson commented on YARN-2732: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1910 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1910/]) YARN-2732. Fixed syntax error in SecureContainer.apt.vm. Contributed by Jian He. (zjshen: rev b94b8b30f282563ee2ecdd25761b2345aaf06c9b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/CHANGES.txt Fix syntax error in SecureContainer.apt.vm -- Key: YARN-2732 URL: https://issues.apache.org/jira/browse/YARN-2732 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2732.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181415#comment-14181415 ] Steve Loughran commented on YARN-2678: -- this is what a record now looks like {code} { type : JSONServiceRecord, description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/IPC, addresses : [ { port : 48551, host : nn.example.com } ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ { uri : http://nn.example.com:40743; } ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ { uri : http://nn.example.com:40743/ws/v1/slider/mgmt; } ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ { uri : http://nn.example.com:40743/ws/v1/slider/publisher; } ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ { uri : http://nn.example.com:40743/ws/v1/slider/registry; } ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ { uri : http://nn.example.com:40743/ws/v1/slider/publisher/slider; } ] }, { api : org.apache.slider.publisher.exports, addressType : uri, protocolType : REST, addresses : [ { uri : http://nn.example.com:40743/ws/v1/slider/publisher/exports; } ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://nn.example.com:52705/ws/v1/slider/agents; } ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ { uri : https://nn.example.com:33425/ws/v1/slider/agents; } ] } ], yarn:persistence : application, yarn:id : application_1414052463672_0028 } {code} Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. 
This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on
[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems
[ https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181448#comment-14181448 ] Hudson commented on YARN-2700: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1935 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1935/]) YARN-2700 TestSecureRMRegistryOperations failing on windows: auth problems (stevel: rev 90e5ca24fbd3bb2da2a3879cc9b73f0b1d7f3e03) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/AbstractSecureRegistryTest.java * hadoop-yarn-project/CHANGES.txt TestSecureRMRegistryOperations failing on windows: auth problems Key: YARN-2700 URL: https://issues.apache.org/jira/browse/YARN-2700 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Affects Versions: 2.6.0 Environment: Windows Server, Win7 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2700-001.patch TestSecureRMRegistryOperations failing on windows: unable to create the root /registry path with permissions problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2732) Fix syntax error in SecureContainer.apt.vm
[ https://issues.apache.org/jira/browse/YARN-2732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181445#comment-14181445 ] Hudson commented on YARN-2732: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1935 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1935/]) YARN-2732. Fixed syntax error in SecureContainer.apt.vm. Contributed by Jian He. (zjshen: rev b94b8b30f282563ee2ecdd25761b2345aaf06c9b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/CHANGES.txt Fix syntax error in SecureContainer.apt.vm -- Key: YARN-2732 URL: https://issues.apache.org/jira/browse/YARN-2732 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2732.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2692) ktutil test hanging on some machines/ktutil versions
[ https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181443#comment-14181443 ] Hudson commented on YARN-2692: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1935 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1935/]) YARN-2692 ktutil test hanging on some machines/ktutil versions (stevel) (stevel: rev 85a88649c3f3fb7280aa511b2035104bcef28a6f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/RegistryTestHelper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/TestSecureLogins.java * hadoop-yarn-project/CHANGES.txt ktutil test hanging on some machines/ktutil versions Key: YARN-2692 URL: https://issues.apache.org/jira/browse/YARN-2692 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2692-001.patch a couple of the registry security tests run native {{ktutil}}; this is primarily to debug the keytab generation. [~cnauroth] reports that some versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181440#comment-14181440 ] Hudson commented on YARN-2198: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1935 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1935/]) YARN-2198. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor. Contributed by Remus Rusanu (jianhe: rev 3b12fd6cfbf4cc91ef8e8616c7aafa9de006cde5) * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/Shell.java * hadoop-common-project/hadoop-common/src/main/winutils/main.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/hadoopwinutilsvc.idl * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.h * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/yarn/server/nodemanager/windows_secure_container_executor.c * hadoop-common-project/hadoop-common/src/main/native/native.vcxproj * hadoop-common-project/hadoop-common/src/main/winutils/winutils.sln * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/util/ProcessTree.java * hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/nativeio/NativeIO.c * hadoop-common-project/hadoop-common/src/main/winutils/winutils.vcxproj * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/winutils/client.c * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.vcxproj * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-common-project/hadoop-common/pom.xml * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-common-project/hadoop-common/src/main/winutils/winutils.mc * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java * hadoop-common-project/hadoop-common/src/main/winutils/config.cpp * .gitignore * hadoop-common-project/hadoop-common/src/main/winutils/service.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java * hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181474#comment-14181474 ] Sunil G commented on YARN-2647: --- Thank you Wangda. Sure. I will use the QueueInfo itself. bq. yarn queue -list short-queue-name or full-queue-name Here, as you have mentioned, the sub-option will be passed only with a queue name. I do not expect the complete list command to print only queue ACLs/node labels from all queues. Hope this is what you also expected. The patch is coming into shape, and I will upload it in a short while. Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181502#comment-14181502 ] Sangjin Lee commented on YARN-2183: --- Thanks for the review [~kasha]! {quote} I understand we need a check to prevent the race. I wonder if we can just re-use the existing check in CleanerTask#run instead of an explicit check in CleanerService#runCleanerTask? From what I remember, that would make the code in CleanerTask#run cleaner as well. (no pun) {quote} The main motivation for this somewhat elaborate double check was the situation where an on-demand cleaner run comes in just as a scheduled cleaner run starts. Without this check, we would have two cleaner runs back to back, which is somewhat wasteful. Having said that, I think it is debatable how important it is to avoid that situation and whether it is an optimization worth doing. One could argue that this is a bit too fine-grained an optimization. Thoughts? {quote} I poked around a little more, and here is what I think. SharedCacheManager creates an instance of AppChecker; the rest of the SCM pieces (Store, CleanerService) should just use the same instance. This instance can be passed either in the constructor or through an SCMContext similar to RMContext. Or, we could add SCM#getAppChecker. In its current form, CleanerTask#cleanResourceReferences fetches the references from the store, checks if the apps are running, and asks the store to remove the references. Moving the whole method to the store would simplify the code more. {quote} Yes, I agree that moving cleanResourceReferences() to the store would simplify the code here. There is one caveat, however. Currently CleanerTask.cleanResourceReferences() is generic: i.e. it does not depend on the type of the store. But if we move this to the store, then I think it would need to be abstract at the level of SCMStore, and each store implementation would need to implement its own. The main reason is that the concurrency/safety semantics would differ from store impl to store impl. In the case of the in-memory store, it would use synchronization on the interned key. But in the case of other stores, that does not apply, and they would need their own implementations, mostly because how they handle concurrency will be different. So it would mean largely copying and pasting the same logic with small differences in how concurrency is handled. That does seem to be a downside of this approach. What do you think? Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
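To make the trade-off above concrete, here is a minimal sketch of what pushing cleanResourceReferences() down into the store could look like. All names and signatures here are assumptions for illustration (AppChecker, SCMStore, the in-memory map), not the actual SCM classes in the attached patches; the point is only that the store-specific concurrency handling (synchronizing on the interned key) lives inside the store implementation.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; the real SCM classes and signatures may differ.
interface AppChecker {
  boolean isApplicationActive(String appId);
}

abstract class SCMStore {
  // Moved from CleanerTask in this sketch: each concrete store overrides it so
  // it can apply its own concurrency/safety semantics.
  public abstract void cleanResourceReferences(AppChecker checker);
}

class InMemorySCMStore extends SCMStore {
  private final Map<String, Set<String>> resourceReferences = new ConcurrentHashMap<>();

  @Override
  public void cleanResourceReferences(AppChecker checker) {
    for (Map.Entry<String, Set<String>> entry : resourceReferences.entrySet()) {
      // The in-memory store synchronizes on the interned key, as discussed above.
      synchronized (entry.getKey().intern()) {
        entry.getValue().removeIf(appId -> !checker.isApplicationActive(appId));
      }
    }
  }
}
{code}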
[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181560#comment-14181560 ] Stephen Chu commented on YARN-2722: --- Hi [~ywskycn], thanks for making this change. Java 6 doesn't support TLSv1.2. Robert noted this in HADOOP-11217 as well. Should we be adding TLSv1.2 in this patch? Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle - Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2722-1.patch We should disable SSLv3 in HttpFS to protect against the POODLEbleed vulnerability. See [CVE-2014-3566 |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] We have {{context = SSLContext.getInstance(TLS);}} in SSLFactory, but when I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181582#comment-14181582 ] Sumit Kumar commented on YARN-2505: --- bq. I would think that should be a post with a (list of) label(s) and a list of node ids. I think it would be enough to provide support for applying a label on a list of node ids. From a use-case perspective, such labeling should mean categorizing certain nodes into a group. Maybe I do not see much of a use case for putting multiple nodes into multiple groups at the same time. If at all such a complicated case arises, users could make multiple calls, each with a single label and a list of nodes. bq. I don't think there's a compelling purpose at the moment for a node label type, it's a string/textual label and I think it is sensible to just model it as such. I agree with you. Given that we already have support for _applicationTags_, there is no immediate need for a _type_ for a label. Though at some point we should merge the _applicationTags_ and _label_ features into one. What do you think? Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2734) If a sub-folder is encountered by log aggregator it results in invalid aggregated file
Sumit Mohanty created YARN-2734: --- Summary: If a sub-folder is encountered by log aggregator it results in invalid aggregated file Key: YARN-2734 URL: https://issues.apache.org/jira/browse/YARN-2734 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Fix For: 2.6.0 See YARN-2724 for some more context on how the error surfaces during yarn logs call. If aggregator sees a sub-folder today it results in the following error when reading the logs: {noformat} Container: container_1413512973198_0019_01_02 on c6401.ambari.apache.org_45454 LogType: cmd_data LogLength: 4096 Log Contents: Error aggregating log file. Log file : /hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data/hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data (Is a directory) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181588#comment-14181588 ] Wei Yan commented on YARN-2722: --- Thanks, [~schu]. You're right, we shouldn't add TLSv1.2. And according to this jdk document: https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https. JDK6 actually only supports TLSv1. I verified in a cluster that TLSv1.1 should also be removed when using jdk 6. Will confirm with Robert later. Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle - Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2722-1.patch We should disable SSLv3 in HttpFS to protect against the POODLEbleed vulnerability. See [CVE-2014-3566 |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] We have {{context = SSLContext.getInstance(TLS);}} in SSLFactory, but when I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2734) If a sub-folder is encountered by log aggregator it results in invalid aggregated file
[ https://issues.apache.org/jira/browse/YARN-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-2734: --- Assignee: Xuan Gong If a sub-folder is encountered by log aggregator it results in invalid aggregated file -- Key: YARN-2734 URL: https://issues.apache.org/jira/browse/YARN-2734 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Fix For: 2.6.0 See YARN-2724 for some more context on how the error surfaces during yarn logs call. If aggregator sees a sub-folder today it results in the following error when reading the logs: {noformat} Container: container_1413512973198_0019_01_02 on c6401.ambari.apache.org_45454 LogType: cmd_data LogLength: 4096 Log Contents: Error aggregating log file. Log file : /hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data/hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data (Is a directory) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2724: Attachment: YARN-2724.4.patch If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
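As a side note on the framing problem described above, a rough sketch of the idea behind the fix: whatever the aggregator ends up writing for an unreadable file, the recorded LogLength has to match those bytes, so that the entries that follow stay parseable. The method and class names below are hypothetical and deliberately simplified; the real AggregatedLogFormat code is more involved.
{code}
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Sketch only: keep LogLength consistent with the content actually written,
// even when reading the log file fails (e.g. "Permission denied").
class LogEntryWriterSketch {
  static void writeEntry(DataOutputStream out, File logFile) throws IOException {
    byte[] content;
    try {
      content = Files.readAllBytes(logFile.toPath());
    } catch (IOException e) {
      String msg = "Error aggregating log file. Log file : " + logFile + " (" + e.getMessage() + ")";
      content = msg.getBytes(StandardCharsets.UTF_8); // the error text becomes the content
    }
    out.writeUTF(logFile.getName());              // LogType
    out.writeUTF(String.valueOf(content.length)); // LogLength matches the content
    out.write(content);                           // Log Contents
  }
}
{code}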
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181616#comment-14181616 ] Jian He commented on YARN-2701: --- lgtm too, thanks Binglin and Zhihai for reviewing the patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch, YARN-2701.addendum.1.patch, YARN-2701.addendum.2.patch, YARN-2701.addendum.3.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
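The fix for this race lives in the native container-executor code, but the underlying pattern is easy to illustrate: instead of check-then-create (stat followed by mkdir, which two localizers can interleave), attempt the creation unconditionally and treat an already-existing directory as success. A hedged Java analogue of that pattern, with a made-up class name, purely for illustration:
{code}
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;

// Illustrative only: the actual fix is in container-executor.c, not Java.
final class RaceFreeMkdir {
  // stat + mkdir is racy; creating first and tolerating "already exists" is not,
  // because the creation attempt is a single atomic operation.
  static void ensureDir(Path dir, String perm) throws IOException {
    try {
      Files.createDirectory(dir,
          PosixFilePermissions.asFileAttribute(PosixFilePermissions.fromString(perm)));
    } catch (FileAlreadyExistsException e) {
      // Another container created it first; that is the expected benign outcome.
    }
  }

  public static void main(String[] args) throws IOException {
    ensureDir(Paths.get("/tmp/usercache-demo"), "rwxr-x---"); // hypothetical path
  }
}
{code}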
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181646#comment-14181646 ] Wangda Tan commented on YARN-2495: -- 1) bq. But was thinking about one sceanario labels got changed and on call to NodeLabelsProvider.getLabels() it returns the new labels but the heartbeat failed due to some reason. If the heartbeat fails, the resource tracker on the NM side will not get a NodeHeartbeatResponse. But another case I'm thinking of is that labels reported by NMs can be invalid and rejected by the RM. The NM should be notified about such cases. So I would suggest doing it this way: - Keep getNodeLabels in NodeHeartbeatRequest and RegisterNodeManagerRequest. - Add a reject node labels list in NodeHeartbeatRequest -- we may not have to handle this list for now, but we can keep it on the interface. - Add a lastNodeLabels in NodeStatusUpdater; it will save the last node labels list fetched from the NodeLabelFetcher. And in the while loop of {{startStatusUpdater}}, we will check whether the new list fetched from the NodeLabelFetcher is different from our last node labels list. If different, we will set it; if the same, we will skip it and set the labels to null in the next heartbeat. And the interface of NodeLabelsProvider should be simple, just a getNodeLabels(); NodeStatusUpdater will take care of the other details. 2) bq. and for If it's distributed, AdminService should disable admin change labels on nodes via RM admin CLI will add a jira, but was wondering how to do this ? by configuration with new parameter? Yes, we should add a new parameter for it; we may not need it immediately, but we should have one in the future. bq. I was earlier under the impression as MemoryRMNodeLabelsManager = is for distributed Configuration and RMNodeLabelsManager is for Centrallized configuration. and some factory will take care of this Not really; the difference between them is that one persists labels to the filesystem and the other does not. We still have to do something for the distributed configuration. Any thoughts? [~vinodkv] Thanks, Wangda Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
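A minimal sketch of the diff-and-send idea described above, under assumed names (lastNodeLabels, a labelsForNextHeartbeat helper); the actual NodeStatusUpdater and NodeLabelsProvider interfaces in the patch may look different.
{code}
import java.util.Set;

// Hypothetical sketch of "only report labels to the RM when they changed".
class NodeLabelsHeartbeatHelper {
  private Set<String> lastNodeLabels; // last labels successfully handed to a heartbeat

  // Returns the labels to put in the next heartbeat, or null to skip sending them.
  synchronized Set<String> labelsForNextHeartbeat(Set<String> fetchedFromProvider) {
    if (fetchedFromProvider == null || fetchedFromProvider.equals(lastNodeLabels)) {
      return null; // unchanged: set labels to null in this heartbeat
    }
    lastNodeLabels = fetchedFromProvider;
    return fetchedFromProvider; // changed: report the new labels
  }
}
{code}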
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181661#comment-14181661 ] Wangda Tan commented on YARN-2647: -- Hi [~sunilg], Maybe my previous comment make you confused, what in my mind is, {code} yarn queue -list (or -liststatus): OUTPUT: root: ACL: Labels: LINUX, LARGE_MEM Status: RUNNING Capacity: 80% ... root.queue-a: ACL: Labels: LINUX, ... ... {code} {code} yarn queue -list root.queueA OUTPUT: root.queue-a: ACL: Labels: LINUX, ... Capacity: 80% ... {code} {code} yarn queue -list root.queueA -show-node-label OUTPUT: root.queue-a: ACL: Labels: LINUX, ... END {code} Does this make sense to you? Or do you have any other suggestions? Thanks, Wangda Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2734) If a sub-folder is encountered by log aggregator it results in invalid aggregated file
[ https://issues.apache.org/jira/browse/YARN-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181664#comment-14181664 ] Xuan Gong commented on YARN-2734: - Currently, if the current path is a sub-folder, we will throw an IOException. Instead of exception, we should check explicitly to skip sub-dirs. If a sub-folder is encountered by log aggregator it results in invalid aggregated file -- Key: YARN-2734 URL: https://issues.apache.org/jira/browse/YARN-2734 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Fix For: 2.6.0 See YARN-2724 for some more context on how the error surfaces during yarn logs call. If aggregator sees a sub-folder today it results in the following error when reading the logs: {noformat} Container: container_1413512973198_0019_01_02 on c6401.ambari.apache.org_45454 LogType: cmd_data LogLength: 4096 Log Contents: Error aggregating log file. Log file : /hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data/hadoop/yarn/log/application_1413512973198_0019/container_1413512973198_0019_01_02/cmd_data (Is a directory) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
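A minimal sketch of the "check explicitly and skip sub-dirs" idea when collecting candidate log files; the class and method names below are invented for illustration and are not the actual NM aggregator code.
{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch only: filter out sub-directories such as cmd_data/ up front, instead of
// letting them reach the aggregator and fail later with "(Is a directory)".
final class ContainerLogFiles {
  static List<File> candidateLogFiles(File containerLogDir) {
    List<File> candidates = new ArrayList<>();
    File[] entries = containerLogDir.listFiles();
    if (entries == null) {
      return candidates; // not a directory, or an I/O error listing it
    }
    for (File entry : entries) {
      if (entry.isFile()) { // explicitly skip sub-directories
        candidates.add(entry);
      }
    }
    return candidates;
  }
}
{code}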
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181665#comment-14181665 ] Zhijie Shen commented on YARN-2724: --- +1 for the latest patch. Will commit it later today to give [~mitdesai] and [~vinodkv] a chance to look at it. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reassigned YARN-2726: Assignee: Wangda Tan (was: Naganarasimha G R) CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException(Configuration issue: + label= + label + is accessible from queue= + queue + but has no capacity set.); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181673#comment-14181673 ] Wangda Tan commented on YARN-2726: -- Taking this over.. CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException(Configuration issue: + label= + label + is accessible from queue= + queue + but has no capacity set.); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
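For readability, here is the proposed check from the description again as a compilable fragment, with the string literals restored (they were stripped by the mail formatting above); the UNDEFINED constant and the surrounding class are stand-ins for CapacitySchedulerConfiguration.
{code}
// Reconstructed from the description above; quotes restored, logic unchanged.
final class LabelCapacityCheck {
  private static final float UNDEFINED = -1.0f; // stand-in for the real sentinel

  static void checkLabelCapacity(float capacity, String label, String queue) {
    if (capacity == UNDEFINED) {
      throw new IllegalArgumentException("Configuration issue: " + "label=" + label
          + " is accessible from queue=" + queue + " but has no capacity set.");
    }
  }
}
{code}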
[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181677#comment-14181677 ] Wei Yan commented on YARN-2722: --- Hi, [~schu]. Discussed with Robert offline, and we also need to remove TLSv1.1 and only support TLSv1. Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle - Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2722-1.patch We should disable SSLv3 in HttpFS to protect against the POODLEbleed vulnerability. See [CVE-2014-3566 |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
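As a rough illustration of what "only support TLSv1" means at the JSSE level (this is not the actual SSLFactory change in the patch), the enabled protocols can be pinned explicitly; merely asking for a TLS context is not enough, which matches the observation in the description that SSLv3 connections still succeeded.
{code}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Illustrative only: the real change belongs in Hadoop's SSLFactory/shuffle setup.
final class Tlsv1Only {
  static SSLEngine newEngine() throws Exception {
    SSLContext ctx = SSLContext.getInstance("TLSv1");
    ctx.init(null, null, null); // default key/trust managers, default SecureRandom
    SSLEngine engine = ctx.createSSLEngine();
    // Pin the enabled protocols to TLSv1 only: JDK 6 supports neither TLSv1.1
    // nor TLSv1.2, and SSLv3 must be excluded because of POODLE.
    engine.setEnabledProtocols(new String[] { "TLSv1" });
    return engine;
  }
}
{code}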
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181684#comment-14181684 ] Karthik Kambatla commented on YARN-2183: bq. Without this check, we would have two cleaner runs back to back which is somewhat wasteful. I don't entirely remember my train of thought, but I can take a look again and see if we can implement it in a simpler way and get the same guarantee. May be, after the next patch. bq. Currently CleanerTask.cleanResourceReferences() is generic: i.e. it does not depend on the type of the store. Can we keep this method as is, but mark it protected. The store implementations can choose to use it. Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181688#comment-14181688 ] Mit Desai commented on YARN-2724: - I'll take a look shortly. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181690#comment-14181690 ] Hadoop QA commented on YARN-2724: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676633/YARN-2724.4.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5518//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5518//console This message is automatically generated. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. 
I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181701#comment-14181701 ] Sunil G commented on YARN-2647: --- Thank you [~gp.leftnoteasy] Yes. This is more or less what I also have in mind. I have a point here. {code} yarn queue -list -show-node-label {code} I do not feel the above option is needed to show node labels only for all queues. Here I will anyway show the complete queue details of all queues. Also, as you have shown, a row-based display is better since we have a variable number of configuration items for node labels. My initial approach was column-based, which would cause frequent line breaks. The display you have shown makes more sense; I will use it and make the changes in my patch now. Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2473) YARN never cleans up container directories from a full disk
[ https://issues.apache.org/jira/browse/YARN-2473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-2473. -- Resolution: Duplicate Closing as a duplicate of YARN-90. YARN never cleans up container directories from a full disk --- Key: YARN-2473 URL: https://issues.apache.org/jira/browse/YARN-2473 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Varun Vasudev Priority: Blocker After YARN-1781 when a container ends up filling a local disk the nodemanager will mark it as a bad disk and remove it from the list of good local dirs. When the container eventually completes the files that filled the disk will not be removed because the NM thinks the directory is bad. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181709#comment-14181709 ] Gour Saha commented on YARN-2678: - Steve it looks good. On the addresses front, do you plan to expose host and port attributes in addition to uri (show below)? Clients can avoid parsing. {noformat} ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://nn.example.com:52705/ws/v1/slider/agents;, host : nn.example.com, port : 52705 } ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ { uri : https://nn.example.com:33425/ws/v1/slider/agents;, host : nn.example.com, port : 33425 } ] } ], ... {noformat} Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... 
internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181717#comment-14181717 ] Wangda Tan commented on YARN-2647: -- bq. I do not feel the above option is needed to show node labels only for all queues. Makes sense; the basic functionality should let the user get queue statuses, and the user doesn't need to get only the node-label/ACL info. So the command line should be yarn queue -list queue-name or queue-path; if the user doesn't specify a queue name, all queues' statuses will be printed. Add yarn queue CLI to get queue info including labels of such queue --- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181723#comment-14181723 ] Sangjin Lee commented on YARN-2183: --- You mean, moving the method to SCMStore but mark it protected? If so, for CleanerTask to be able to call it, it cannot be protected, right? One thing we can do is to move it to SCMStore as a public method, but let implementations override/augment it. Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2726: - Attachment: YARN-2726-20141023-1.patch CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Attachments: YARN-2726-20141023-1.patch Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException(Configuration issue: + label= + label + is accessible from queue= + queue + but has no capacity set.); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Summary: Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY (was: Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
zhihai xu created YARN-2735: --- Summary: diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Description: Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest/host with multiple node labels will make user limit, etc. computation becomes more tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager was: Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest/host with multiple node labels will make user limit, etc. computation becomes more tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2735: Attachment: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181764#comment-14181764 ] zhihai xu commented on YARN-2735: - I attached a patch to remove the unnecessary initialization for diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
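For context, the redundant pattern being removed looks roughly like the following (field names follow the JIRA summary; the actual DirectoryCollection code may differ): the fields get a value at declaration and are then assigned again in the constructor, so one of the two initializations can go.
{code}
// Illustrative only; mirrors the double-initialization pattern described above.
class DirectoryCollectionSketch {
  // Initialized once here...
  private float diskUtilizationPercentageCutoff = 100.0F;
  private long diskUtilizationSpaceCutoff = 0;

  DirectoryCollectionSketch(float percentageCutoff, long spaceCutoff) {
    // ...and assigned again here; keeping only one of the two removes the
    // unnecessary duplicate initialization.
    this.diskUtilizationPercentageCutoff = percentageCutoff;
    this.diskUtilizationSpaceCutoff = spaceCutoff;
  }
}
{code}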
[jira] [Commented] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181772#comment-14181772 ] Karthik Kambatla commented on YARN-2183: Yes. Sorry for the confusion. Just looked at the code again. My suggestion is to move cleanResourceReferences to SCMStore and mark it @Private public final. Does that make sense? Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181771#comment-14181771 ] Naganarasimha G R commented on YARN-2495: - Hi [~wangda], Actually what I meant was to update the HeartBeatResponse about the labels' acceptance by the RM, and once NodeStatusUpdater gets the response (+ve or -ve) from the RM, it can set the LabelsProvider with the appropriate flag. But your logic seems much better, because I was handling thread synchronization unnecessarily in ConfNodeLabelsProvider. Having this logic in NodeStatusUpdater removes the burden on each type of NodeLabelsProvider to have this sync logic, and the NodeLabelsProvider interface will be simple (earlier my thinking was that labels should not be handled by NodeStatusUpdater, hence I kept it in NodeLabelsProvider). I was actually about to upload the patch with my logic; as it is not as per your latest comments, I will upload another one by tomorrow afternoon (IST) after correcting it as per your comments. bq. Add a reject node labels list in NodeHeartbeatRequest – we may not have to handle this list for now. But we can keep it on the interface You meant NodeHeartbeatResponse, right? Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2736) Job.getHistoryUrl returns empty string
Kannan Rajah created YARN-2736: -- Summary: Job.getHistoryUrl returns empty string Key: YARN-2736 URL: https://issues.apache.org/jira/browse/YARN-2736 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Kannan Rajah Priority: Critical The getHistoryUrl() method in the Job class is returning an empty string. Example code: job = Job.getInstance(conf); job.setJobName("MapReduceApp"); job.setJarByClass(MapReduceApp.class); job.setMapperClass(Mapper1.class); job.setCombinerClass(Reducer1.class); job.setReducerClass(Reducer1.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); job.setNumReduceTasks(1); job.setOutputFormatClass(TextOutputFormat.class); job.setInputFormatClass(TextInputFormat.class); FileInputFormat.addInputPath(job, inputPath); FileOutputFormat.setOutputPath(job, outputPath); job.waitForCompletion(true); job.getHistoryUrl(); It always returns an empty string. Looks like getHistoryUrl() support was removed in YARN-321. getTrackingURL() returns the correct URL, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181791#comment-14181791 ] Zhijie Shen commented on YARN-2703: --- [~xgong], thanks for the patch. Some comments about it. 1. It's better to write the timestamp directly. When reading it, it's flexible to convert it into whatever format we want. {code} // Write the uploaded TimeStamp out.writeUTF(Times.format(uploadedTime)); {code} 2. Is it necessary to sort the files? The goal here is to add the timestamp to the same log file name uploaded in different iterations. Doesn't the following change the order of the uploaded files within the same iteration? Previously it's alphabetical, while now it's chronological. For example, stderr1 - stdout1 - stderr2 - stdout2 will be changed to stderr1 - stdout1 - stdout2 - stderr2, which may not be a better order. {code} // sort the files by lastModifiedTime. List<File> candidatesList = new ArrayList<File>(candidates); Collections.sort(candidatesList, new Comparator<File>() { public int compare(File s1, File s2) { return s1.lastModified() < s2.lastModified() ? -1 : s1.lastModified() > s2.lastModified() ? 1 : 0; } }); return candidatesList; {code} 3. No need to ask the caller to pass in the uploaded time. We can directly execute {{out.writeLong(System.currentTimeMillis());}} {code} public void write(DataOutputStream out, Set<File> pendingUploadFiles, long uploadedTime) throws IOException { {code} 4. Can you correct the log message below in TestLogAggregationService, and add logTime as well? {code} LOG.info("LogType: " + fileType); LOG.info("LogType: " + fileLength); {code} Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
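A small sketch combining points 1 and 3 above, i.e. persisting the raw timestamp and formatting it only on the read path (stream handling is simplified, the display pattern is arbitrary, and this is not the actual LogValue code):
{code}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class UploadTimeSketch {
  static void writeUploadTime(DataOutputStream out) throws IOException {
    // Write the raw millis directly; no need for the caller to pass the time in.
    out.writeLong(System.currentTimeMillis());
  }

  static String readUploadTime(DataInputStream in) throws IOException {
    long uploadedTime = in.readLong();
    // Convert to whatever display format is desired only when reading.
    return new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy").format(new Date(uploadedTime));
  }
}
{code}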
[jira] [Updated] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2505: -- Attachment: YARN-2505.7.patch Add forgotten generic type definitions, should fix javac warnings... Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2724: Attachment: YARN-2724.5.patch Same patch If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch, YARN-2724.5.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
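The fix suggested in the description can be sketched as follows: when a local log file cannot be read, the error text itself becomes the log contents and its length (not the unread file's length) is recorded, so the aggregated stream stays parseable. This is only an illustration with assumed method names, not the actual AggregatedLogFormat writer:
{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class UnreadableLogEntrySketch {
  static void writeErrorEntry(DataOutputStream out, String fileName, String errorMessage)
      throws IOException {
    byte[] contents = errorMessage.getBytes(StandardCharsets.UTF_8);
    out.writeUTF("LogType: " + fileName);
    // The declared length must describe what is actually written -- the error
    // text -- not the size of the file that was never read.
    out.writeUTF("LogLength: " + contents.length);
    out.writeUTF("Log Contents: ");
    out.write(contents);
  }
}
{code}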
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181798#comment-14181798 ] Craig Welch commented on YARN-2505: --- -re I think it would be enough to provide support for applying a label on a list of node ids Fair enough - I was thinking of the suggested api as a superset of this, but maybe this is all we really need. I like the idea, not sure I can get to it just now - I'll see, if not, perhaps we can do a followon jira for it - let's see -re Though at some point of time we should merge applicationTags and label features into one. What do you think? I'm not sure actually, there are clearly some similarities, but I think they are distinct things... Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181809#comment-14181809 ] Jason Lowe commented on YARN-2314: -- bq. IIUC, mayBeCloseProxy can be invoked by MR/NMClient, but proxy.scheduledForClose is always false. So it won’t call the following stopProxy. proxy.scheduledForClose is not always false, as it can be set to true by removeProxy. removeProxy is called by the cache when an entry needs to be evicted from the cache. If the cache never fills then we never will call removeProxy by the very design of the cache. This patch doesn't change the behavior in that sense. I suppose we could change the patch so that it only caches the proxy objects but not their underlying connections. However I have my doubts that's where the real expense is in creating the proxy -- it's much more likely to be establishing the RPC connection to the NM. bq. once ContainerManagementProtocolProxy#tryCloseProxy is called, internally it’ll call rpc.stopProxy, will it eventually call ClientCache#stopClient ClientCache#stopClient will not necessarily shut down the connection. It will only shutdown the connection if there are no references to the protocol by any other objects, but the very nature of the ContainerManagementProtocolProxy cache is to keep around references. Therefore stopClient will never actually do anything in practice as long as we are caching proxy objects. That's why I mentioned earlier that the RPC layer itself needs to change to add the ability to shutdown connections or change the way the ClientCache behaves to really fix this if we want to continue to cache proxy objects at a higher layer. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, YARN-2314v2.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, tez-yarn-2314.xlsx ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
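For readers following along, a rough sketch of the cache behaviour described here (names like scheduledForClose and removeProxy follow the discussion, but this is not the real ContainerManagementProtocolProxy code): entries are closed only on eviction, and an entry that is still in use is merely flagged so that the last caller closes it.
{code}
import java.util.LinkedHashMap;
import java.util.Map;

class ProxyCacheSketch<P> {
  static class Entry<P> {
    P proxy;
    int activeCallers;
    boolean scheduledForClose;
  }

  private final int maxSize;
  private final Map<String, Entry<P>> cache = new LinkedHashMap<String, Entry<P>>();

  ProxyCacheSketch(int maxSize) { this.maxSize = maxSize; }

  synchronized void removeProxy(String nmAddress) {
    Entry<P> e = cache.remove(nmAddress);
    if (e == null) {
      return;
    }
    if (e.activeCallers > 0) {
      // Still in use: just mark it; mayBeCloseProxy closes it when the last caller returns.
      e.scheduledForClose = true;
    } else {
      stopProxy(e.proxy);
    }
  }

  synchronized void mayBeCloseProxy(Entry<P> e) {
    e.activeCallers--;
    if (e.scheduledForClose && e.activeCallers == 0) {
      stopProxy(e.proxy);
    }
  }

  synchronized void evictIfNeeded() {
    // Eviction (and therefore removeProxy) only happens once the cache is full --
    // which is why a cache that never fills never closes anything.
    while (cache.size() > maxSize) {
      String eldest = cache.keySet().iterator().next();
      removeProxy(eldest);
    }
  }

  private void stopProxy(P proxy) {
    // Placeholder for rpc.stopProxy(proxy).
  }
}
{code}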
[jira] [Commented] (YARN-2735) diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection
[ https://issues.apache.org/jira/browse/YARN-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181814#comment-14181814 ] Hadoop QA commented on YARN-2735: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676677/YARN-2735.000.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5520//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5520//console This message is automatically generated. diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection --- Key: YARN-2735 URL: https://issues.apache.org/jira/browse/YARN-2735 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Attachments: YARN-2735.000.patch diskUtilizationPercentageCutoff and diskUtilizationSpaceCutoff are initialized twice in DirectoryCollection -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181838#comment-14181838 ] Hadoop QA commented on YARN-2724: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676687/YARN-2724.5.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5522//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5522//console This message is automatically generated. If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch, YARN-2724.5.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. 
I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181859#comment-14181859 ] Wangda Tan commented on YARN-2495: -- Hi Naga, bq. you meant NodeHeartBeatResponse right ? Yes Looking forward your patch. Wangda Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch Target of this JIRA is to allow admin specify labels in each NM, this covers - User can set labels in each NM (by setting yarn-site.xml or using script suggested by [~aw]) - NM will send labels to RM via ResourceTracker API - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181869#comment-14181869 ] Jian He commented on YARN-2314: --- Jason, thanks for your explanation. bq. If the cache never fills then we never will call removeProxy by the very design of the cache. I was thinking the client could have a way to explicitly stopProxy and remove the entry from the cache, rather than remove the entry only if it hits the cache limit. But looks like this is by design. And yes, this is the existing behavior. ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, YARN-2314v2.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, tez-yarn-2314.xlsx ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181908#comment-14181908 ] Hadoop QA commented on YARN-2505: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676686/YARN-2505.7.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5521//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5521//console This message is automatically generated. Support get/add/remove/change labels in RM REST API --- Key: YARN-2505 URL: https://issues.apache.org/jira/browse/YARN-2505 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Craig Welch Attachments: YARN-2505.1.patch, YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181929#comment-14181929 ] Jian He commented on YARN-2209: --- bq. No need to change to split a single statement into the following two. This is required, because the following finally block needs this temporary variable. bq. Why does it not need to take the remaining operations after code change? Because the allocate call throws an exception, the response object is empty. bq. Is the change in ResourceCalculator.java related? It's causing excessive logging in production clusters; I intentionally removed it. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Attachment: YARN-2694-20141023-1.patch Updated patch Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch, YARN-2694-20141023-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest/host with multiple node labels will make user limit, etc. computation becomes more tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.7.patch Thanks zhijie for the review! addressed other comments Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch, YARN-2209.7.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2737) Misleading msg in LogCLI when app is not successfully submitted
Jian He created YARN-2737: - Summary: Misleading msg in LogCLI when app is not successfully submitted Key: YARN-2737 URL: https://issues.apache.org/jira/browse/YARN-2737 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He {{LogCLiHelpers#logDirNotExist}} prints the msg {{Log aggregation has not completed or is not enabled.}} if the app log file doesn't exist. This is misleading when the application was not submitted successfully; clearly, we won't have logs for such an application. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
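A possible direction, sketched with assumed names (this is not the existing LogCLiHelpers code, and the exact wording is only an example):
{code}
// Hypothetical helper: mention the missing directory and the possibility that the
// application was never submitted successfully, instead of only blaming aggregation.
static void logDirNotExist(String remoteAppLogDir) {
  System.out.println(remoteAppLogDir + " does not exist.");
  System.out.println("Log aggregation has not completed or is not enabled, "
      + "or the application was never successfully submitted.");
}
{code}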
[jira] [Updated] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2703: Attachment: YARN-2703.3.patch Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch, YARN-2703.3.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181998#comment-14181998 ] Xuan Gong commented on YARN-2703: - Addressed all comments Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch, YARN-2703.3.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182004#comment-14182004 ] Hadoop QA commented on YARN-2694: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676721/YARN-2694-20141023-1.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5523//console This message is automatically generated. Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch, YARN-2694-20141023-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest/host with multiple node labels will make user limit, etc. computation becomes more tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2678: - Attachment: yarnregistry.pdf Updated TLA specification; # covers new structure # declares that {{serialize()}} and {{deserialize()}} functions exist to go from {{ServiceRecord}} instances to record data (strings), as well as a {{containsValidServiceRecord()}} predicate to check whether or not a string contains a service record. This lets us define the record{{-}}data marshalling behaviour without covering the implementation details Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... 
internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
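For readers of the spec update above, the marshalling contract it describes could be expressed roughly like this in Java (the method names mirror the comment; the signatures and the placeholder ServiceRecord type are assumptions, not the registry's actual API):
{code}
public interface ServiceRecordMarshalSketch {

  /** Stand-in for the registry's ServiceRecord type; fields elided. */
  class ServiceRecord { }

  /** serialize(): go from a ServiceRecord instance to record data (a string). */
  String serialize(ServiceRecord record);

  /** deserialize(): parse record data back into a ServiceRecord. */
  ServiceRecord deserialize(String data);

  /** containsValidServiceRecord(): check whether a string contains a service record. */
  boolean containsValidServiceRecord(String data);
}
{code}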
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Attachment: YARN-2694-20141023-2.patch I can compile this locally, I haven't found any error message in the console log of Jenkins result. So I suspect it just Jenkins process crashed. Resubmit same patch. Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch, YARN-2694-20141023-1.patch, YARN-2694-20141023-2.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest/host with multiple node labels will make user limit, etc. computation becomes more tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182018#comment-14182018 ] Hadoop QA commented on YARN-2726: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676662/YARN-2726-20141023-1.patch against trunk revision d71d40a. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.fs.permission.TestStickyBitTTests org.apache.hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrationTeTests org.apache.hadoop.hdfs.server.datanode.TestDataNodeRollingUpgTestsTests org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolTestsTests org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporTestsTests org.apache.hadoop.hdfs.server.datanode.TestFsDatasetCaTestTests org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksWithNotEnoughRTestsTests org.apache.hadoop.hdfs.server.blockmanagement.TestDatanodeMaTests org.apache.hadoop.hdfs.TestEncryptionZonesWiTests org.apache.hadoop.hdfs.TestDFSClientRetrTestTests org.apache.hadoop.hdfs.TestFileCreaTestsTests org.apache.hadoop.hdfs.TestDatanodeTests org.apache.hadoop.hdfs.TestLeaseReTests org.apache.hadoop.hdfs.TestDatanodeBlockScTests org.apache.hadoop.hdfs.qjournal.client.TestQJMWithTests org.apache.hadoop.hdfs.TestGetTests org.apache.hadoop.tracing.TestTraceAdmin {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5519//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5519//console This message is automatically generated. CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Attachments: YARN-2726-20141023-1.patch Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. 
To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException("Configuration issue: label=" + label + " is accessible from queue=" + queue + " but has no capacity set."); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2678: - Attachment: YARN-2678-001.patch Updated patch # marshalling to ZK node without header, but check for type string performed before any attempt to parse the content # maps used to define addresses # updated doc to match (with full example generated off live AM) Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: YARN-2678-001.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182022#comment-14182022 ] Steve Loughran commented on YARN-2678: -- Gour: I don't want to split out hostname and port from a URI. Parsing URLs is ubiquitous, every language has a toolkit to do it. Mandating that they must be separate only creates the possibility of conflicting values between the {{uri}} field and the explicit ones. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: YARN-2678-001.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
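As a concrete illustration of that point, any client can recover host and port from the published uri field with its language's standard URI parser, so separate fields only add room for inconsistency (example values taken from the description):
{code}
import java.net.URI;

public class AddressParseExample {
  public static void main(String[] args) {
    URI uri = URI.create("https://c6408.ambari.apache.org:46958/ws/v1/slider/agents");
    System.out.println(uri.getHost()); // c6408.ambari.apache.org
    System.out.println(uri.getPort()); // 46958
  }
}
{code}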
[jira] [Updated] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2726: - Attachment: YARN-2726-20141023-2.patch Jenkins issue resubmit patch CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Attachments: YARN-2726-20141023-1.patch, YARN-2726-20141023-2.patch Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException(Configuration issue: + label= + label + is accessible from queue= + queue + but has no capacity set.); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2722: -- Attachment: YARN-2722-2.patch Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle - Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2722-1.patch, YARN-2722-2.patch We should disable SSLv3 in HttpFS to protect against the POODLEbleed vulnerability. See [CVE-2014-3566 |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] We have {{context = SSLContext.getInstance("TLS");}} in SSLFactory, but when I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
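For context, a hedged sketch of why requesting "TLS" alone is not enough and how SSLv3 can be filtered out of the enabled protocol list (illustrative only; this is not the YARN-2722 patch itself):
{code}
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class DisableSslv3Sketch {
  public static SSLEngine newEngine() throws Exception {
    // Asking for "TLS" can still leave SSLv3 among the enabled protocols on some JVMs.
    SSLContext context = SSLContext.getInstance("TLS");
    context.init(null, null, null);
    SSLEngine engine = context.createSSLEngine();

    // Re-enable only the non-SSLv3 protocols.
    List<String> enabled = new ArrayList<String>();
    for (String protocol : engine.getEnabledProtocols()) {
      if (!"SSLv3".equals(protocol)) {
        enabled.add(protocol);
      }
    }
    engine.setEnabledProtocols(enabled.toArray(new String[enabled.size()]));
    return engine;
  }
}
{code}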
[jira] [Updated] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2678: - Attachment: HADOOP-2678-002.patch Patch merging in YARN-2677 patch which is forgiving of non-DNS entries in the path (like usernames) Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182074#comment-14182074 ] Gour Saha commented on YARN-2678: - bq. Gour: I don't want to split out hostname and port from a URI. Parsing URLs is ubiquitous, every language has a toolkit to do it. Mandating that they must be separate only creates the possibility of conflicting values between the uri field and the explicit ones. Ok makes sense. Slider agents are doing it today anyway. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... 
internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
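To illustrate the point in the comment above, that clients can recover host and port from the registered uri with a standard URI parser rather than relying on separate keys, here is a minimal sketch using only java.net.URI; it is not Slider or registry code, and the endpoint value is simply the example from the record above.
{code:java}
import java.net.URI;

public class RegistryAddressParser {
    public static void main(String[] args) {
        // One of the "addresses" entries of addressType "uri" from the record above.
        URI endpoint =
            URI.create("https://c6408.ambari.apache.org:46958/ws/v1/slider/agents");

        // Host and port fall out of the standard parser, so the registry record
        // does not need to carry them as separate keys.
        String host = endpoint.getHost(); // c6408.ambari.apache.org
        int port = endpoint.getPort();    // 46958
        String path = endpoint.getPath(); // /ws/v1/slider/agents

        System.out.println(host + ":" + port + path);
    }
}
{code}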
[jira] [Commented] (YARN-2722) Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle
[ https://issues.apache.org/jira/browse/YARN-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182094#comment-14182094 ] Hadoop QA commented on YARN-2722: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676744/YARN-2722-2.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5527//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5527//console This message is automatically generated. Disable SSLv3 (POODLEbleed vulnerability) in YARN shuffle - Key: YARN-2722 URL: https://issues.apache.org/jira/browse/YARN-2722 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2722-1.patch, YARN-2722-2.patch We should disable SSLv3 in HttpFS to protect against the POODLEbleed vulnerability. See [CVE-2014-3566 |http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2014-3566] We have {{context = SSLContext.getInstance(TLS);}} in SSLFactory, but when I checked, I could still connect with SSLv3. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
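For reference, the usual way to rule out SSLv3 on the server side is to restrict the enabled protocols explicitly rather than relying on SSLContext.getInstance("TLS") alone. The following is a minimal sketch, not the attached patch; the protocol list and the null initialization are assumptions for illustration.
{code:java}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class TlsOnlyExample {
    static SSLEngine newTlsOnlyEngine() throws Exception {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, null, null); // default key/trust managers, for illustration only

        SSLEngine engine = context.createSSLEngine();
        // getInstance("TLS") alone can still leave SSLv3 enabled on older JREs;
        // whitelisting the TLS versions explicitly is what actually disables it.
        engine.setEnabledProtocols(new String[] {"TLSv1", "TLSv1.1", "TLSv1.2"});
        return engine;
    }
}
{code}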
[jira] [Commented] (YARN-2678) Recommended improvements to Yarn Registry
[ https://issues.apache.org/jira/browse/YARN-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182095#comment-14182095 ] Hadoop QA commented on YARN-2678: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676750/HADOOP-2678-002.patch against trunk revision 828429d. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5529//console This message is automatically generated. Recommended improvements to Yarn Registry - Key: YARN-2678 URL: https://issues.apache.org/jira/browse/YARN-2678 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Gour Saha Assignee: Steve Loughran Attachments: HADOOP-2678-002.patch, YARN-2678-001.patch, yarnregistry.pdf In the process of binding to Slider AM from Slider agent python code here are some of the items I stumbled upon and would recommend as improvements. This is how the Slider's registry looks today - {noformat} jsonservicerec{ description : Slider Application Master, external : [ { api : org.apache.slider.appmaster, addressType : host/port, protocolType : hadoop/protobuf, addresses : [ [ c6408.ambari.apache.org, 34837 ] ] }, { api : org.apache.http.UI, addressType : uri, protocolType : webui, addresses : [ [ http://c6408.ambari.apache.org:43314; ] ] }, { api : org.apache.slider.management, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/mgmt; ] ] }, { api : org.apache.slider.publisher, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher; ] ] }, { api : org.apache.slider.registry, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/registry; ] ] }, { api : org.apache.slider.publisher.configurations, addressType : uri, protocolType : REST, addresses : [ [ http://c6408.ambari.apache.org:43314/ws/v1/slider/publisher/slider; ] ] } ], internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:46958/ws/v1/slider/agents; ] ] }, { api : org.apache.slider.agents.oneway, addressType : uri, protocolType : REST, addresses : [ [ https://c6408.ambari.apache.org:57513/ws/v1/slider/agents; ] ] } ], yarn:persistence : application, yarn:id : application_1412974695267_0015 } {noformat} Recommendations: 1. I would suggest to either remove the string {color:red}jsonservicerec{color} or if it is desirable to have a non-null data at all times then loop the string into the json structure as a top-level attribute to ensure that the registry data is always a valid json document. 2. The {color:red}addresses{color} attribute is currently a list of list. I would recommend to convert it to a list of dictionary objects. In the dictionary object it would be nice to have the host and port portions of objects of addressType uri as separate key-value pairs to avoid parsing on the client side. The URI should also be retained as a key say uri to avoid clients trying to generate it by concatenating host, port, resource-path, etc. Here is a proposed structure - {noformat} { ... 
internal : [ { api : org.apache.slider.agents.secure, addressType : uri, protocolType : REST, addresses : [ { uri : https://c6408.ambari.apache.org:46958/ws/v1/slider/agents;, host : c6408.ambari.apache.org, port: 46958 } ] } ], } {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201410232244.txt Thanks very much [~leftnoteasy]. I have attached a patch which uses PriorityQueue instead of an internal queue class. Please note that since the algorithm for building up needy queues is different, the rounding is also different, so some of the tests' expected values needed to change. I stepped through several of the tests and they seem to be working as I expect. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt, YARN-2056.201410232244.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
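For readers following the discussion, a java.util.PriorityQueue ordered by unmet share can replace a hand-rolled queue class roughly as in the sketch below; QueueNeed and the ordering are illustrative assumptions, not code from the attached patch.
{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

public class NeedyQueueOrderingSketch {
    // Hypothetical holder for a queue's unmet (needed) share.
    static class QueueNeed {
        final String queueName;
        final double neededShare;
        QueueNeed(String queueName, double neededShare) {
            this.queueName = queueName;
            this.neededShare = neededShare;
        }
    }

    public static void main(String[] args) {
        // Most needy queue first; java.util.PriorityQueue replaces an internal queue class.
        PriorityQueue<QueueNeed> needy = new PriorityQueue<QueueNeed>(8,
            new Comparator<QueueNeed>() {
                @Override
                public int compare(QueueNeed x, QueueNeed y) {
                    return Double.compare(y.neededShare, x.neededShare); // descending need
                }
            });

        needy.add(new QueueNeed("a", 0.30));
        needy.add(new QueueNeed("b", 0.05));
        needy.add(new QueueNeed("c", 0.15));

        while (!needy.isEmpty()) {
            QueueNeed next = needy.poll(); // queues are considered in need order
            System.out.println(next.queueName + " needs " + next.neededShare);
        }
    }
}
{code}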
[jira] [Commented] (YARN-2664) Improve RM webapp to expose info about reservations.
[ https://issues.apache.org/jira/browse/YARN-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182112#comment-14182112 ] Matteo Mazzucchelli commented on YARN-2664: --- I notice that the data sent to the HTML page are in a CSV format and that most of the values are zero. I think the best way to handle these data would be to send only the important (non-zero) values as JSON. {code:java} [ { "key": "reservation_1413792787395_0018", "values": [{"date": "Mon Oct 24 10:13:37 CEST 2014", "value": 0}, {"date": "Mon Oct 24 10:14:18 CEST 2014", "value": 5}] }, ... ] {code} Therefore, only each timestamp and the value associated with it will be sent. Improve RM webapp to expose info about reservations. Key: YARN-2664 URL: https://issues.apache.org/jira/browse/YARN-2664 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Attachments: PlannerPage_screenshot.pdf, YARN-2664.patch YARN-1051 provides a new functionality in the RM to ask for reservation on resources. Exposing this through the webapp GUI is important. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
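As a rough sketch of the filtering step implied above (keep only the non-zero samples before serializing them into the JSON shown), independent of RLESparseResourceAllocation or any attached patch:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class ReservationSeriesSketch {
    /** Keep only the non-zero points of a reservation's time series (illustration only). */
    static Map<Long, Integer> nonZeroPoints(Map<Long, Integer> allocationByTime) {
        Map<Long, Integer> filtered = new LinkedHashMap<Long, Integer>();
        for (Map.Entry<Long, Integer> e : allocationByTime.entrySet()) {
            if (e.getValue() != 0) {
                // timestamp -> value pairs; these become the "date"/"value" objects
                // in the JSON proposed above.
                filtered.put(e.getKey(), e.getValue());
            }
        }
        return filtered;
    }
}
{code}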
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182124#comment-14182124 ] Hadoop QA commented on YARN-2209: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676728/YARN-2209.7.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5524//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5524//console This message is automatically generated. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch, YARN-2209.7.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
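For context, the AM-side consequence of replacing the resync command with an exception looks roughly like the sketch below. The real handling lives in the YARN AM/client libraries; the method shapes here are simplified assumptions, and the exception's package is assumed to be the standard org.apache.hadoop.yarn.exceptions location where YARN-1365 introduced it.
{code:java}
// Sketch of how an AM could react to the exception that replaces the resync
// command (simplified; not the attached patch).
abstract class AmResyncSketch {
    abstract void registerWithRM() throws Exception;
    abstract void sendAllocate() throws Exception;

    void heartbeatOnce() throws Exception {
        try {
            sendAllocate();
        } catch (org.apache.hadoop.yarn.exceptions.ApplicationMasterNotRegisteredException e) {
            // The RM restarted and lost this AM's registration: re-register and retry,
            // instead of interpreting a special "resync" command in the allocate response.
            registerWithRM();
            sendAllocate();
        }
    }
}
{code}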
[jira] [Updated] (YARN-2183) Cleaner service for cache manager
[ https://issues.apache.org/jira/browse/YARN-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-2183: -- Attachment: YARN-2183-trunk-v6.patch v.6 patch posted. To see the patch in context, go to https://github.com/ctrezzo/hadoop/compare/apache:trunk...sharedcache-3-YARN-2183-cleaner To see the changes between v.5 and v.6, go to https://github.com/ctrezzo/hadoop/commit/ffcc098749d16950732d833141db356efe116ed3 Cleaner service for cache manager - Key: YARN-2183 URL: https://issues.apache.org/jira/browse/YARN-2183 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2183-trunk-v1.patch, YARN-2183-trunk-v2.patch, YARN-2183-trunk-v3.patch, YARN-2183-trunk-v4.patch, YARN-2183-trunk-v5.patch, YARN-2183-trunk-v6.patch Implement the cleaner service for the cache manager along with metrics for the service. This service is responsible for cleaning up old resource references in the manager and removing stale entries from the cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2724) If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed
[ https://issues.apache.org/jira/browse/YARN-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182133#comment-14182133 ] Zhijie Shen commented on YARN-2724: --- [~mitdesai], any comments so far? If an unreadable file is encountered during log aggregation then aggregated file in HDFS badly formed - Key: YARN-2724 URL: https://issues.apache.org/jira/browse/YARN-2724 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.1 Reporter: Sumit Mohanty Assignee: Xuan Gong Attachments: YARN-2724.1.patch, YARN-2724.2.patch, YARN-2724.3.patch, YARN-2724.4.patch, YARN-2724.5.patch Look into the log output snippet. It looks like there is an issue during aggregation when an unreadable file is encountered. Likely, this results in bad encoding. {noformat} LogType: command-13.json LogLength: 13934 Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) errors-3.txt0gc.log-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K-15575K(184320K), 0.0488700 secs] 163840K-15575K(1028096K), 0.0492510 secs] [Times: user=0.06 sys=0.01, real=0.05 secs] 2014-10-21T04:45:14.939+: 8.027: [GC2014-10-21T04:45:14.939+: 8.027: [ParNew: 179415K-11865K(184320K), 0.0941310 secs] 179415K-17228K(1028096K), 0.0943140 secs] [Times: user=0.13 sys=0.04, real=0.09 secs] 2014-10-21T04:46:42.099+: 95.187: [GC2014-10-21T04:46:42.099+: 95.187: [ParNew: 175705K-12802K(184320K), 0.0466420 secs] 181068K-18164K(1028096K), 0.0468490 secs] [Times: user=0.06 sys=0.00, real=0.04 secs] {noformat} Specifically, look at the text after the exception text. There should be two more entries for log files but none exist. This is likely due to the fact that command-13.json is expected to be of length 13934 but its is not as the file was never read. I think, it should have been {noformat} LogType: command-13.json LogLength: Length of the exception text Log Contents: Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json/grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-13.json (Permission denied)command-3.json13983Error aggregating log file. Log file : /grid/0/yarn/log/application_1413865041660_0002/container_1413865041660_0002_01_04/command-3.json/grid/0/yarn/log/application_1413865041660_0002/contaierrors-13.txt0660_0002_01_04/command-3.json (Permission denied) {noformat} {noformat} LogType: errors-3.txt LogLength:0 Log Contents: {noformat} {noformat} LogType:gc.log LogLength:??? Log Contents: ..-20141021044514484052014-10-21T04:45:12.046+: 5.134: [GC2014-10-21T04:45:12.046+: 5.134: [ParNew: 163840K- ... {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
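A sketch of the expected behavior described above: when the local file cannot be read, the aggregated entry's LogLength should describe the error text that is actually written, not the size of the unread file. This is plain stream I/O for illustration, not the actual AggregatedLogFormat writer.
{code:java}
import java.io.DataOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;

public class UnreadableLogEntrySketch {
    static void writeErrorEntry(DataOutputStream out, File logFile, IOException cause)
            throws IOException {
        String message = "Error aggregating log file. Log file : "
            + logFile.getAbsolutePath() + " (" + cause.getMessage() + ")";
        byte[] payload = message.getBytes(Charset.forName("UTF-8"));

        // LogLength must describe what is actually written (the error text),
        // not the size of a file that was never read.
        out.writeBytes("LogType: " + logFile.getName() + "\n");
        out.writeBytes("LogLength: " + payload.length + "\n");
        out.writeBytes("Log Contents:\n");
        out.write(payload);
        out.writeBytes("\n");
    }
}
{code}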
[jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182135#comment-14182135 ] Karthik Kambatla commented on YARN-2010: [~jianhe] - can you please verify the changes to TestWorkPreservingRMRestart are reasonable. If RM fails to recover an app, it can never transition to active again -- Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
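Schematically, the behavior under discussion is that a per-application recovery failure should be contained rather than aborting the transition to active. The sketch below only illustrates that idea under assumed method names; it is not the attached patch.
{code:java}
import java.util.List;

// Schematic of containing per-application recovery failures:
// one bad application should not leave the RM stuck in STOPPED state.
abstract class RecoverySketch {
    abstract List<String> storedApplicationIds();
    abstract void recoverApplication(String appId) throws Exception;
    abstract void markAppFailedOnRecovery(String appId, Exception cause);

    void recoverAll() {
        for (String appId : storedApplicationIds()) {
            try {
                recoverApplication(appId);
            } catch (Exception e) {
                // App-specific failures (expired tokens, missing HDFS files, ...) are
                // recorded against the app while the transition to active proceeds.
                markAppFailedOnRecovery(appId, e);
            }
        }
    }
}
{code}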
[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-5.patch Updated patch to fix test failure, findbugs warning, and suppress javac warnings (we call getEventHandler().handle() at several other places, I don't quite get why it leads to a javac warning only here). If RM fails to recover an app, it can never transition to active again -- Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2694) Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182142#comment-14182142 ] Hadoop QA commented on YARN-2694: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676740/YARN-2694-20141023-2.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5526//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5526//console This message is automatically generated. Ensure only single node labels specified in resource request / host, and node label expression only specified when resourceName=ANY --- Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch, YARN-2694-20141021-1.patch, YARN-2694-20141023-1.patch, YARN-2694-20141023-2.patch Currently, node label expression support in the capacity scheduler is only partially completed. Right now a node label expression specified in a Resource Request is only respected when it is specified at the ANY level, and a ResourceRequest/host with multiple node labels makes user limit, etc. computation more tricky. We need to temporarily disable them; changes include: - AMRMClient - ApplicationMasterService - RMAdminCLI - CommonNodeLabelsManager -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2726) CapacityScheduler should explicitly log when an accessible label has no capacity
[ https://issues.apache.org/jira/browse/YARN-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182178#comment-14182178 ] Hadoop QA commented on YARN-2726: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676745/YARN-2726-20141023-2.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5528//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5528//console This message is automatically generated. CapacityScheduler should explicitly log when an accessible label has no capacity Key: YARN-2726 URL: https://issues.apache.org/jira/browse/YARN-2726 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Phil D'Amore Assignee: Wangda Tan Priority: Minor Attachments: YARN-2726-20141023-1.patch, YARN-2726-20141023-2.patch Given: - Node label defined: test-label - Two queues defined: a, b - label accessibility and capacity defined as follows (properties abbreviated for readability): root.a.accessible-node-labels = test-label root.a.accessible-node-labels.test-label.capacity = 100 If you restart the RM or do a 'rmadmin -refreshQueues' you will get a stack trace with the following error buried within: Illegal capacity of -1.0 for label=test-label in queue=root.b This of course occurs because test-label is accessible to b due to inheritance from the root, and -1 is the UNDEFINED value. To my mind this might not be obvious to the admin, and the error message which results does not help guide someone to the source of the issue. I propose that this situation be updated so that when the capacity on an accessible label is undefined, it is explicitly called out instead of falling through to the illegal capacity check. Something like: {code} if (capacity == UNDEFINED) { throw new IllegalArgumentException("Configuration issue: label=" + label + " is accessible from queue=" + queue + " but has no capacity set."); } {code} I'll leave it to better judgement than mine as to whether I'm throwing the appropriate exception there. I think this check should be added to both getNodeLabelCapacities and getMaximumNodeLabelCapacities in CapacitySchedulerConfiguration.java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2713: --- Attachment: yarn-2713-1.patch Here is a straight-forward patch that points RM Home to the first RM in a HA deployment. If the first RM is not Active, it will redirect to the Active automatically. I am not sure if we want a more sophisticated fix that would take us to the Active directly. Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2713-1.patch When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
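A minimal sketch of the approach described above, resolving the link target from the first configured RM id using standard YARN property names; the helper method is hypothetical and the real patch may go through YARN's own utilities instead of raw Configuration lookups.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class RmHomeLinkSketch {
    // Hypothetical helper: pick the webapp address of the first RM id when HA is enabled.
    static String firstRmWebAppAddress(Configuration conf) {
        String rmIds = conf.get("yarn.resourcemanager.ha.rm-ids");
        if (rmIds == null || rmIds.isEmpty()) {
            // Non-HA deployment: the plain webapp address is already correct.
            return conf.get("yarn.resourcemanager.webapp.address");
        }
        String firstId = rmIds.split(",")[0].trim();
        // If this RM happens to be standby, its web UI redirects to the active RM anyway.
        return conf.get("yarn.resourcemanager.webapp.address." + firstId);
    }
}
{code}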
[jira] [Updated] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2713: --- Fix Version/s: (was: 2.7.0) Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2713-1.patch When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182195#comment-14182195 ] Karthik Kambatla commented on YARN-2713: [~xgong] - I believe you are the most familiar with HA-redirections. Will you be able to take a look at this patch? Thanks. Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2713-1.patch When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2664) Improve RM webapp to expose info about reservations.
[ https://issues.apache.org/jira/browse/YARN-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182205#comment-14182205 ] Carlo Curino commented on YARN-2664: Matteo, first of all, thanks for looking into this. The delta-encoding you propose makes lots of sense and it is well aligned with the internal representation of resource allocations (which are mostly based on: *RLESparseResourceAllocation*), so you should be able to extract it from there easily. One thing we need to figure out is whether to use the javascript library I had in the seed patch above or other javascript (or non-javascript) visualization lib. Anything that can consume the json format you propose, and has an amenable licensing for hadoop is ok with me. (if anyone else has suggestions on this please chime in!) Another important problem will be what to visualize. I suspect that showing all jobs accepted over an arbitrary past/future time range is likely going to be too much for any large cluster... Being able to focus on a portion of the plan (e.g., time-range, user, queue) I think is going to be important. This would allow the GUI to lazily fetch the data corresponding to the portion of the plan we are visualizing, instead of dumping out the entire plan (which even with your much better delta-encoding might eventually be too big). My 2 cents.. Improve RM webapp to expose info about reservations. Key: YARN-2664 URL: https://issues.apache.org/jira/browse/YARN-2664 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Attachments: PlannerPage_screenshot.pdf, YARN-2664.patch YARN-1051 provides a new functionality in the RM to ask for reservation on resources. Exposing this through the webapp GUI is important. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182242#comment-14182242 ] Hadoop QA commented on YARN-2713: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676784/yarn-2713-1.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5532//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5532//console This message is automatically generated. Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2713-1.patch When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182245#comment-14182245 ] Hadoop QA commented on YARN-2010: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676770/yarn-2010-5.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5531//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5531//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5531//console This message is automatically generated. If RM fails to recover an app, it can never transition to active again -- Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182247#comment-14182247 ] Hadoop QA commented on YARN-2703: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676733/YARN-2703.3.patch against trunk revision 828429d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.TestDecommTests org.apache.hadoop.hdfs.TestParallelUnixDomaTests org.apache.hadoop.hdfs.TestEncryptionZonesWTests org.apache.hadoop.hdfs.TestClientProtocolForPipelineRecTests org.apache.hadoop.hdfs.TestPTestsTests org.apache.hadoop.hdfs.TestGetBTests org.apache.hadoop.hdfs.TestFileCreTests org.apache.hadoop.hdfs.TestWriTests org.apache.hadoop.hdfs.TestSetrepIncrTests org.apache.hadoop.hdfs.TestRenameWhiTests org.apache.hadoop.hdfs.TestBlockReaderLocTesTests org.apache.hadoop.hdfs.TestEncryptionZoneTeTests org.apache.hadoop.hdfs.web.TestWebHdfsFileSystemContraTesTests org.apache.hadoop.hdfs.web.TestWebHDFTeTests org.apache.hadoop.hdfs.web.TestWebHDFSForHTeTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5525//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5525//console This message is automatically generated. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch, YARN-2703.3.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-6.patch Updated patch to fix the findbugs issue, it was due to an empty if-block that got left around by mistake. If RM fails to recover an app, it can never transition to active again -- Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2495) Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182257#comment-14182257 ] Wangda Tan commented on YARN-2495: -- [~Naganarasimha], One comment before you upload the patch: I suggest adding an option to indicate whether decentralized node label configuration is currently in use. If it is true, the NM will do the follow-on steps, such as creating the NodeLabelProvider, setting labels in the NodeHeartbeatRequest, etc. If that makes sense to you, I suggest we call it ENABLE_DECENTRALIZED_NODELABEL_CONFIGURATION (yarn.node-labels.decentralized-configuration.enabled), or do you have another suggestion? That value will also be used by the RM; the RM needs to do similar things, such as disabling admin changes to node labels via the RM admin CLI, etc. I think you can first focus on the NM side and the ResourceTracker changes in the RM; AdminService-related changes can be split into another JIRA. Thanks, Wangda Allow admin specify labels in each NM (Distributed configuration) - Key: YARN-2495 URL: https://issues.apache.org/jira/browse/YARN-2495 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Attachments: YARN-2495.20141023-1.patch, YARN-2495_20141022.1.patch The target of this JIRA is to allow admins to specify labels on each NM; this covers - User can set labels in each NM (by setting yarn-site.xml or using the script suggested by [~aw]) - NM will send labels to RM via the ResourceTracker API - RM will set labels in NodeLabelManager when the NM registers/updates labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
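To make the suggestion concrete, the NM-side gating could look like the sketch below; the property name is the one proposed above, while NodeLabelProvider and the factory method are placeholders rather than code from the attached patches.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class NodeLabelProviderGateSketch {
    // Property name proposed in the comment above.
    static final String DECENTRALIZED_NODELABEL_CONF =
        "yarn.node-labels.decentralized-configuration.enabled";

    interface NodeLabelProvider { /* placeholder for the provider discussed above */ }

    static NodeLabelProvider maybeCreateProvider(Configuration conf) {
        boolean decentralized = conf.getBoolean(DECENTRALIZED_NODELABEL_CONF, false);
        if (!decentralized) {
            // Centralized configuration: labels are managed via the RM admin CLI,
            // so the NM creates no provider and reports no labels in heartbeats.
            return null;
        }
        return createProviderFromConf(conf); // hypothetical factory (script- or config-based)
    }

    static NodeLabelProvider createProviderFromConf(Configuration conf) {
        return new NodeLabelProvider() { };
    }
}
{code}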
[jira] [Commented] (YARN-2704) Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time
[ https://issues.apache.org/jira/browse/YARN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182291#comment-14182291 ] Jian He commented on YARN-2704: --- Thanks, Vinod, for the review! bq. removeApplicationFromRenewal() is only called when log-aggregation is enabled, so that will affect the new credentials map? removeApplicationFromRenewal() is actually called directly when log-aggregation is disabled; if it's enabled, the apps are added to a delayed map. bq. Just bubble up the IOException instead of wrapping it in YarnRuntimeException. It's inside the run method, so I wrap it in a runtime exception. bq. We don't need to renew the token immediately after obtaining it? This is to get the expiration date; the token itself doesn't carry the expiration date. bq. Make 3600 a constant. And why is it 10 hours? Shouldn't this be a function of the max-life-time in general? This guarantees we have a minimum 10h buffer to distribute the tokens. Any time more than that is not necessary? bq. we should also look for the service name matching the default-file-system. There's no easy way to get the service name based on the file-system object, and the hdfs token service-name varies from case to case: e.g. HA/non-HA, use-ip/use-hostname. bq. The log message "found existing hdfs token" needs to be a debug log Regarding the info/debug level logs: these are all low-frequency logs, by default only once a day (the renew interval), and it's much easier to debug at info level than at debug level. Maybe keep them at info level while stabilizing this feature? Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time -- Key: YARN-2704 URL: https://issues.apache.org/jira/browse/YARN-2704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2704.1.patch In secure mode, YARN requires the hdfs delegation token to do localization and log aggregation on behalf of the user. But the hdfs delegation token will eventually expire after max-token-life-time. So, localization and log aggregation will fail after the token expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
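As a rough illustration of the buffer logic being discussed, the constant name and the use of 10 hours below are placeholders mirroring the comment, not the attached patch.
{code:java}
import java.util.concurrent.TimeUnit;

public class TokenRenewBufferSketch {
    // Placeholder for the constant Vinod asked about; the concrete value and its
    // relation to max-life-time are exactly what is being discussed above.
    static final long DISTRIBUTION_BUFFER_MS = TimeUnit.HOURS.toMillis(10);

    /** When to obtain a fresh HDFS delegation token, given the current token's expiry. */
    static long nextTokenRefreshTime(long tokenExpirationMs) {
        // Refresh early enough that new tokens can reach every NM
        // before the old token hits max-life-time.
        return tokenExpirationMs - DISTRIBUTION_BUFFER_MS;
    }
}
{code}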
[jira] [Commented] (YARN-2010) If RM fails to recover an app, it can never transition to active again
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182302#comment-14182302 ] Hadoop QA commented on YARN-2010: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676796/yarn-2010-6.patch against trunk revision db45f04. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5533//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5533//console This message is automatically generated. If RM fails to recover an app, it can never transition to active again -- Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)