[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156127#comment-14156127 ] Zhijie Shen commented on YARN-2468: --- bq. I would like to check how many log files we can upload this time. If the number is 0, we can skip this time. And this check also happens before LogKey.write(); otherwise, we will write the key but no value. I think Vinod meant that pendingUploadFiles is needed, but doesn't need to be a member variable. getPendingLogFilesToUploadForThisContainer can return this collection, which can then be passed into LogValue.write as an additional parameter. 2. IMHO, the following code can be improved. If we use an iterator, we can remove the unnecessary elements on the fly. {code} for (File file : candidates) { Matcher fileMatcher = filterPattern.matcher(file.getName()); if (fileMatcher.find()) { filteredFiles.add(file); } } if (!exclusion) { return filteredFiles; } else { candidates.removeAll(filteredFiles); return candidates; } {code} This block could be: {code} ... while(candidatesItr.hasNext()) { candidate = candidatesItr.next(); ... if ((not match inclusive) || (match exclusive)) { candidatesItr.remove() } } {code} 3. [~jianhe] mentioned to me before that the following condition is not always a reliable way to determine an AM container. Any idea? It also seems that we don't need shouldUploadLogsForRunningContainer; we can re-use shouldUploadLogs and set wasContainerSuccessful to true. Personally, if it's not trivial to identify the AM container, I prefer to write a TODO comment and leave it until we implement the log retention API. {code} if (containerId.getId() == 1) { return true; } {code} bq. It seems to be, let's validate this via a test-case. Is it addressed by {code} this.conf.setLong(YarnConfiguration.DEBUG_NM_DELETE_DELAY_SEC, 3600); {code} ? If so, would it be better to add a line of comment explaining the rationale behind this config? 5. Can the following code {code} Set<ContainerId> finishedContainers = new HashSet<ContainerId>(); for (ContainerId id : pendingContainerInThisCycle) { finishedContainers.add(id); } {code} be simplified as {code} Set<ContainerId> finishedContainers = new HashSet<ContainerId>(pendingContainerInThisCycle); {code} Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when an application finishes, the NM starts log aggregation. But for long-running service (LRS) applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
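For reference, a minimal sketch of the single-pass iterator approach suggested in point 2 above could look like the following (illustration only, not the attached patch; it assumes candidates is a modifiable java.util.List<File>, filterPattern the compiled java.util.regex.Pattern already in the code, and exclusion the existing boolean flag):
{code}
// Sketch only: remove non-matching files (inclusive mode) or matching files
// (exclusive mode) in place, so no second collection or removeAll() pass is needed.
Iterator<File> candidatesItr = candidates.iterator();
while (candidatesItr.hasNext()) {
  File candidate = candidatesItr.next();
  boolean matches = filterPattern.matcher(candidate.getName()).find();
  if ((!exclusion && !matches) || (exclusion && matches)) {
    candidatesItr.remove();
  }
}
return candidates;
{code}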
[jira] [Created] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
Jun Gong created YARN-2640: -- Summary: TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong When running the test with mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:<rwxr-xr-x> but was:<rwxrwxr-x> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories, so directory dirA had already been created by testDiskSpaceUtilizationLimit. When testCreateDirectories then tried to create dirA with the specified permission, it found that dirA already existed and did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
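If the root cause is simply that dirA survives from an earlier test method, one way to make the test order-independent is to remove the directory before testCreateDirectories asserts on its permissions. A minimal sketch under that assumption (not the attached patch; testDir and dirA stand for the test's existing fields):
{code}
// Sketch only: make sure dirA does not survive from a previous test method,
// so DirectoryCollection actually creates it and applies the expected permissions.
File dirA = new File(testDir, "dirA");
if (dirA.exists()) {
  FileUtil.fullyDelete(dirA);   // org.apache.hadoop.fs.FileUtil
}
{code}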
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156192#comment-14156192 ] Junping Du commented on YARN-1972: -- Hi [~vinodkv], I think we should commit this patch to branch-2.6 given this JIRA is marked as fixed in 2.6. Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user, as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and of the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach taken with the WCE came from a practical trial-and-error process. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself is already dealing with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE's `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. 
This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2. Dedicated high privilege Service Due to the high privilege required by the WCE, we had discussed the need to isolate the high privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use this service to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service onto the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
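The two properties from the Deployment Requirements section above would normally be placed in yarn-site.xml; the sketch below only restates the same settings through the Hadoop Configuration API, and the group name is a placeholder rather than a value from this JIRA:
{code}
// Sketch only: WCE deployment settings described above
// (org.apache.hadoop.conf.Configuration / org.apache.hadoop.yarn.conf.YarnConfiguration).
Configuration conf = new YarnConfiguration();
conf.set("yarn.nodemanager.container-executor.class",
    "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
// A Windows security group that the NM service principal belongs to
// (equivalent of yarn.nodemanager.linux-container-executor.group for the LCE).
conf.set("yarn.nodemanager.windows-secure-container-executor.group", "HadoopNMGroup");
{code}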
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156198#comment-14156198 ] Jun Gong commented on YARN-2617: I investigated why TestDirectoryCollection failed, and it might be because of YARN-2640. Could you help check and review it, please? Thank you. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce PI job. The NM continuously reported completed containers whose application had already finished, even after the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on the NM should guarantee to clean up already completed applications. But it only removes the appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, the NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete the application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
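As a rough illustration of the behaviour the title asks for (not the attached patch), the NM-side heartbeat code could skip completed containers whose application it already considers finished. completedContainers, containerStatusesToReport and isApplicationStillRunning() below are placeholder names standing in for whatever the real patch uses:
{code}
// Sketch only: report a finished container to the RM only while its application
// is still considered running on this NM.
for (ContainerStatus status : completedContainers) {
  ApplicationId appId =
      status.getContainerId().getApplicationAttemptId().getApplicationId();
  if (isApplicationStillRunning(appId)) {   // hypothetical helper
    containerStatusesToReport.add(status);
  }
}
{code}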
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: (was: YARN-2562.5.patch) ContainerId@toString() is unreadable for epoch 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch 0) after YARN-2182. For e.g, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: YARN-2562.5.patch ContainerId@toString() is unreadable for epoch 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch 0) after YARN-2182. For e.g, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2640: --- Attachment: YARN-2640.patch Patch submitted. TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156228#comment-14156228 ] Hadoop QA commented on YARN-2562: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672512/YARN-2562.5.patch against trunk revision 9e40de6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5233//console This message is automatically generated. ContainerId@toString() is unreadable for epoch 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch 0) after YARN-2182. For e.g, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156230#comment-14156230 ] Hadoop QA commented on YARN-2640: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672513/YARN-2640.patch against trunk revision 9e40de6. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5234//console This message is automatically generated. TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2640: --- Attachment: YARN-2640.2.patch TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.2.patch, YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156330#comment-14156330 ] Hudson commented on YARN-2613: -- FAILURE: Integrated in Hadoop-Yarn-trunk #698 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/698/]) YARN-2613. Support retry in NMClient for rolling-upgrades. (Contributed by Jian He) (junping_du: rev 0708827a935d190d439854e08bb5a655d7daa606) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/impl/pb/RpcClientFactoryPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/ContainerManagementProtocolProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/NMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/CHANGES.txt NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch While NM is rolling upgrade, client should retry NM until it comes up. This jira is to add a NMProxy (similar to RMProxy) with retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
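As a hedged illustration of the "NMProxy (similar to RMProxy) with retry" idea described above (not the committed code), a client-side proxy can be built around one of the stock org.apache.hadoop.io.retry policies so that calls keep retrying while the NM restarts; the time values here are placeholders, not the defaults added by this patch:
{code}
// Sketch only: retry a failed call with a fixed sleep until a time limit, which
// lets an NMClient ride over a NodeManager that is being rolling-upgraded.
// Uses org.apache.hadoop.io.retry.RetryPolicies and java.util.concurrent.TimeUnit.
RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
    3 * 60 * 1000L,   // keep trying for up to ~3 minutes
    10 * 1000L,       // wait 10 seconds between attempts
    TimeUnit.MILLISECONDS);
{code}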
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156357#comment-14156357 ] Hudson commented on YARN-1063: -- FAILURE: Integrated in Hadoop-Yarn-trunk #698 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/698/]) YARN-1063. Augmented Hadoop common winutils to have the ability to create containers as domain users. Contributed by Remus Rusanu. (vinodkv: rev 5ca97f1e60b8a7848f6eadd15f6c08ed390a8cda) * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestWinUtils.java * hadoop-common-project/hadoop-common/src/main/winutils/symlink.c * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-yarn-project/CHANGES.txt * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c Winutils needs ability to create task as domain user Key: YARN-1063 URL: https://issues.apache.org/jira/browse/YARN-1063 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Environment: Windows Reporter: Kyle Leckie Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch h1. Summary: Securing a Hadoop cluster requires constructing some form of security boundary around the processes executed in YARN containers. Isolation based on Windows user isolation seems most feasible. This approach is similar to the approach taken by the existing LinuxContainerExecutor. The current patch to winutils.exe adds the ability to create a process as a domain user. h1. Alternative Methods considered: h2. Process rights limited by security token restriction: On Windows, access decisions are made by examining the security token of a process. It is possible to spawn a process with a restricted security token. Any of the rights granted by SIDs of the default token may be restricted. It is possible to see this in action by examining the security token of a sandboxed process launched by a web browser. Typically the launched process will have a fully restricted token and needs to access machine resources through a dedicated broker process that enforces a custom security policy. This broker process mechanism would break compatibility with the typical Hadoop container process. The Container process must be able to utilize standard function calls for disk and network IO. I performed some work looking at ways to ACL the local files to the specific launched process without granting rights to other processes launched on the same machine, but found this to be an overly complex solution. h2. Relying on APP containers: Recent versions of Windows have the ability to launch processes within an isolated container. Application containers are supported for execution of WinRT based executables. This method was ruled out due to the lack of official support for standard Windows APIs. At some point in the future, Windows may support functionality similar to BSD jails or Linux containers; at that point, support for containers should be added. h1. Create As User Feature Description: h2. Usage: A new sub command was added to the set of task commands. 
Here is the syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] Some notes: * The username specified is in the format of user@domain * The machine executing this command must be joined to the domain of the user specified * The domain controller must allow the account executing the command access to the user information. For this join the account to the predefined group labeled Pre-Windows 2000 Compatible Access * The account running the command must have several rights on the local machine. These can be managed manually using secpol.msc: ** Act as part of the operating system - SE_TCB_NAME ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME * The launched process will not have rights to the desktop so will not be able to display any information or create UI. * The launched process will have no network credentials. Any access of network resources that requires domain authentication will fail. h2. Implementation: Winutils performs the following steps: # Enable the required privileges for the current process. # Register as a trusted process with the Local Security Authority (LSA). # Create a new logon for the user passed on the command line. # Load/Create a profile on the local machine for the new logon. # Create a new environment
[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156349#comment-14156349 ] Hudson commented on YARN-2630: -- FAILURE: Integrated in Hadoop-Yarn-trunk #698 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/698/]) YARN-2630. Prevented previous AM container status from being acquired by the current restarted AM. Contributed by Jian He. (zjshen: rev 52bbe0f11bc8e97df78a1ab9b63f4eff65fd7a76) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatResponse.java * hadoop-yarn-project/CHANGES.txt TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, YARN-2630.4.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156343#comment-14156343 ] Hudson commented on YARN-1972: -- FAILURE: Integrated in Hadoop-Yarn-trunk #698 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/698/]) YARN-1972. Added a secure container-executor for Windows. Contributed by Remus Rusanu. (vinodkv: rev ba7f31c2ee8d23ecb183f88920ef06053c0b9769) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/index.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrasturcture to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overwrrides some emthods to the effect of: * change the DCE created user cache directories to be owned by the job user and by the nodemanager group. 
* changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as standalone process instead of an in-process Java method call. This in turn relies on the winutil createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does no delegate the creation of the user cache directories to the native implementation. * it does no require special handling to be able to delete user files The approach on the WCE came from a practical trial-and-error approach. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop environment container executions. The job container itself is already dealing with this via a so called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set the `yarn.nodemanager.container-executor.class` to
[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156334#comment-14156334 ] Hudson commented on YARN-2446: -- FAILURE: Integrated in Hadoop-Yarn-trunk #698 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/698/]) YARN-2446. Augmented Timeline service APIs to start taking in domains as a parameter while posting entities and events. Contributed by Zhijie Shen. (vinodkv: rev 9e40de6af7959ac7bb5f4e4d2833ca14ea457614) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/timeline/TestTimelineRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java * hadoop-yarn-project/CHANGES.txt Using TimelineNamespace to shield the entities of a user Key: YARN-2446 URL: https://issues.apache.org/jira/browse/YARN-2446 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen 
Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the entities, preventing them from being accessed or affected by other users' operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156364#comment-14156364 ] Hadoop QA commented on YARN-2640: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672530/YARN-2640.2.patch against trunk revision 9e40de6. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5235//console This message is automatically generated. TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.2.patch, YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156435#comment-14156435 ] Hudson commented on YARN-1063: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1889 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1889/]) YARN-1063. Augmented Hadoop common winutils to have the ability to create containers as domain users. Contributed by Remus Rusanu. (vinodkv: rev 5ca97f1e60b8a7848f6eadd15f6c08ed390a8cda) * hadoop-common-project/hadoop-common/src/main/winutils/symlink.c * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestWinUtils.java * hadoop-yarn-project/CHANGES.txt Winutils needs ability to create task as domain user Key: YARN-1063 URL: https://issues.apache.org/jira/browse/YARN-1063 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Environment: Windows Reporter: Kyle Leckie Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch h1. Summary: Securing a Hadoop cluster requires constructing some form of security boundary around the processes executed in YARN containers. Isolation based on Windows user isolation seems most feasible. This approach is similar to the approach taken by the existing LinuxContainerExecutor. The current patch to winutils.exe adds the ability to create a process as a domain user. h1. Alternative Methods considered: h2. Process rights limited by security token restriction: On Windows access decisions are made by examining the security token of a process. It is possible to spawn a process with a restricted security token. Any of the rights granted by SIDs of the default token may be restricted. It is possible to see this in action by examining the security tone of a sandboxed process launch be a web browser. Typically the launched process will have a fully restricted token and need to access machine resources through a dedicated broker process that enforces a custom security policy. This broker process mechanism would break compatibility with the typical Hadoop container process. The Container process must be able to utilize standard function calls for disk and network IO. I performed some work looking at ways to ACL the local files to the specific launched without granting rights to other processes launched on the same machine but found this to be an overly complex solution. h2. Relying on APP containers: Recent versions of windows have the ability to launch processes within an isolated container. Application containers are supported for execution of WinRT based executables. This method was ruled out due to the lack of official support for standard windows APIs. At some point in the future windows may support functionality similar to BSD jails or Linux containers, at that point support for containers should be added. h1. Create As User Feature Description: h2. Usage: A new sub command was added to the set of task commands. 
Here is the syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] Some notes: * The username specified is in the format of user@domain * The machine executing this command must be joined to the domain of the user specified * The domain controller must allow the account executing the command access to the user information. For this join the account to the predefined group labeled Pre-Windows 2000 Compatible Access * The account running the command must have several rights on the local machine. These can be managed manually using secpol.msc: ** Act as part of the operating system - SE_TCB_NAME ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME * The launched process will not have rights to the desktop so will not be able to display any information or create UI. * The launched process will have no network credentials. Any access of network resources that requires domain authentication will fail. h2. Implementation: Winutils performs the following steps: # Enable the required privileges for the current process. # Register as a trusted process with the Local Security Authority (LSA). # Create a new logon for the user passed on the command line. # Load/Create a profile on the local machine for the new logon. # Create a new environment
[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156427#comment-14156427 ] Hudson commented on YARN-2630: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1889 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1889/]) YARN-2630. Prevented previous AM container status from being acquired by the current restarted AM. Contributed by Jian He. (zjshen: rev 52bbe0f11bc8e97df78a1ab9b63f4eff65fd7a76) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, YARN-2630.4.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156421#comment-14156421 ] Hudson commented on YARN-1972: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1889 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1889/]) YARN-1972. Added a secure container-executor for Windows. Contributed by Remus Rusanu. (vinodkv: rev ba7f31c2ee8d23ecb183f88920ef06053c0b9769) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/index.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrasturcture to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overwrrides some emthods to the effect of: * change the DCE created user cache directories to be owned by the job user and by the nodemanager group. 
* changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as standalone process instead of an in-process Java method call. This in turn relies on the winutil createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does no delegate the creation of the user cache directories to the native implementation. * it does no require special handling to be able to delete user files The approach on the WCE came from a practical trial-and-error approach. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop environment container executions. The job container itself is already dealing with this via a so called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch as a separate container the same issue had to be resolved and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set the `yarn.nodemanager.container-executor.class` to
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156408#comment-14156408 ] Hudson commented on YARN-2613: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1889 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1889/]) YARN-2613. Support retry in NMClient for rolling-upgrades. (Contributed by Jian He) (junping_du: rev 0708827a935d190d439854e08bb5a655d7daa606) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/ContainerManagementProtocolProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/NMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/impl/pb/RpcClientFactoryPBImpl.java * hadoop-yarn-project/CHANGES.txt NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch While NM is rolling upgrade, client should retry NM until it comes up. This jira is to add a NMProxy (similar to RMProxy) with retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156412#comment-14156412 ] Hudson commented on YARN-2446: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1889 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1889/]) YARN-2446. Augmented Timeline service APIs to start taking in domains as a parameter while posting entities and events. Contributed by Zhijie Shen. (vinodkv: rev 9e40de6af7959ac7bb5f4e4d2833ca14ea457614) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/timeline/TestTimelineRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java Using TimelineNamespace to shield the entities of a user Key: YARN-2446 URL: https://issues.apache.org/jira/browse/YARN-2446 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen 
Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the entities, preventing them from being accessed or affected by other users' operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch > 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: YARN-2562.5-2.patch ContainerId@toString() is unreadable for epoch > 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5-2.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch > 0) after YARN-2182. For example, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156461#comment-14156461 ] Tsuyoshi OZAWA commented on YARN-2640: -- [~hex108], thanks for your contribution. Can we close this jira as a duplicate of YARN-1979? TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.2.patch, YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156462#comment-14156462 ] Tsuyoshi OZAWA commented on YARN-1979: -- [~djp], do you mind taking a look at the latest patch? Some users have reported the same issue, e.g. YARN-2640. TestDirectoryCollection fails when the umask is unusual --- Key: YARN-1979 URL: https://issues.apache.org/jira/browse/YARN-1979 Project: Hadoop YARN Issue Type: Test Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1979.2.patch, YARN-1979.txt I've seen this fail in Windows where the default permissions are matching up to 700. {code} --- Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection --- Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.422 sec FAILURE! java.lang.AssertionError: local dir parent Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA not created with proper permissions expected:rwxr-xr-x but was:rwx-- at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) {code} The clash is between testDiskSpaceUtilizationLimit() and testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156490#comment-14156490 ] Wei Yan commented on YARN-2635: --- All tests passed locally. The TestDirectoryCollection failure looks related to YARN-1979, YARN-2640. TestRMRestart fails with FairScheduler -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2446) Using TimelineNamespace to shield the entities of a user
[ https://issues.apache.org/jira/browse/YARN-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156528#comment-14156528 ] Hudson commented on YARN-2446: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1914/]) YARN-2446. Augmented Timeline service APIs to start taking in domains as a parameter while posting entities and events. Contributed by Zhijie Shen. (vinodkv: rev 9e40de6af7959ac7bb5f4e4d2833ca14ea457614) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/security/TestTimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TimelineStoreTestUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/MemoryTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryManagerOnTimelineStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/timeline/TestTimelineRecords.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelineEntity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/timeline/TimelinePutResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/ApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/security/TimelineACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/applicationhistoryservice/TestApplicationHistoryServer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/webapp/TestTimelineWebServicesWithSSL.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/TimelineDataManager.java Using TimelineNamespace to shield the entities of a user Key: YARN-2446 URL: https://issues.apache.org/jira/browse/YARN-2446 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie 
Shen Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2446.1.patch, YARN-2446.2.patch, YARN-2446.3.patch Given YARN-2102 adds TimelineNamespace, we can make use of it to shield the entities, preventing them from being accessed or affected by other users' operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156537#comment-14156537 ] Hudson commented on YARN-1972: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1914/]) YARN-1972. Added a secure container-executor for Windows. Contributed by Remus Rusanu. (vinodkv: rev ba7f31c2ee8d23ecb183f88920ef06053c0b9769) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/index.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/WindowsSecureContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ContainerLocalizer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestContainerExecutor.java Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. 
* changes the actual container run command to use the 'createAsUser' command of winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files. The WCE approach came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself is already dealing with this via a so-called 'classpath jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launched as a separate container, the same issue had to be resolved and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set the `yarn.nodemanager.container-executor.class` to
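For illustration, a minimal yarn-site.xml sketch of that setting; the fully qualified class name is not stated in the truncated sentence above and is assumed from the WindowsSecureContainerExecutor.java path listed in the commit:
{code}
<!-- Sketch only: enable the Windows Secure Container Executor on a NodeManager.
     The class name below is inferred from the commit's file listing, not quoted from the text. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
{code}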
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156524#comment-14156524 ] Hudson commented on YARN-2613: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1914/]) YARN-2613. Support retry in NMClient for rolling-upgrades. (Contributed by Jian He) (junping_du: rev 0708827a935d190d439854e08bb5a655d7daa606) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/ServerProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/factories/impl/pb/RpcClientFactoryPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/ContainerManagementProtocolProxy.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/TestNMProxy.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/NMProxy.java NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch, YARN-2613.2.patch, YARN-2613.3.patch While NM is rolling upgrade, client should retry NM until it comes up. This jira is to add a NMProxy (similar to RMProxy) with retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1063) Winutils needs ability to create task as domain user
[ https://issues.apache.org/jira/browse/YARN-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156552#comment-14156552 ] Hudson commented on YARN-1063: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1914/]) YARN-1063. Augmented Hadoop common winutils to have the ability to create containers as domain users. Contributed by Remus Rusanu. (vinodkv: rev 5ca97f1e60b8a7848f6eadd15f6c08ed390a8cda) * hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/util/TestWinUtils.java * hadoop-common-project/hadoop-common/src/main/winutils/chown.c * hadoop-common-project/hadoop-common/src/main/winutils/symlink.c * hadoop-common-project/hadoop-common/src/main/winutils/libwinutils.c * hadoop-common-project/hadoop-common/src/main/winutils/include/winutils.h * hadoop-common-project/hadoop-common/src/main/winutils/task.c * hadoop-yarn-project/CHANGES.txt Winutils needs ability to create task as domain user Key: YARN-1063 URL: https://issues.apache.org/jira/browse/YARN-1063 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Environment: Windows Reporter: Kyle Leckie Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1063.2.patch, YARN-1063.3.patch, YARN-1063.4.patch, YARN-1063.5.patch, YARN-1063.6.patch, YARN-1063.patch h1. Summary: Securing a Hadoop cluster requires constructing some form of security boundary around the processes executed in YARN containers. Isolation based on Windows user isolation seems most feasible. This approach is similar to the approach taken by the existing LinuxContainerExecutor. The current patch to winutils.exe adds the ability to create a process as a domain user. h1. Alternative Methods considered: h2. Process rights limited by security token restriction: On Windows access decisions are made by examining the security token of a process. It is possible to spawn a process with a restricted security token. Any of the rights granted by SIDs of the default token may be restricted. It is possible to see this in action by examining the security token of a sandboxed process launched by a web browser. Typically the launched process will have a fully restricted token and needs to access machine resources through a dedicated broker process that enforces a custom security policy. This broker process mechanism would break compatibility with the typical Hadoop container process. The Container process must be able to utilize standard function calls for disk and network IO. I performed some work looking at ways to ACL the local files to the specific launched process without granting rights to other processes launched on the same machine, but found this to be an overly complex solution. h2. Relying on APP containers: Recent versions of Windows have the ability to launch processes within an isolated container. Application containers are supported for execution of WinRT-based executables. This method was ruled out due to the lack of official support for standard Windows APIs. At some point in the future Windows may support functionality similar to BSD jails or Linux containers; at that point support for containers should be added. h1. Create As User Feature Description: h2. Usage: A new sub command was added to the set of task commands. 
Here is the syntax: winutils task createAsUser [TASKNAME] [USERNAME] [COMMAND_LINE] Some notes: * The username specified is in the format of user@domain * The machine executing this command must be joined to the domain of the user specified * The domain controller must allow the account executing the command access to the user information. For this join the account to the predefined group labeled Pre-Windows 2000 Compatible Access * The account running the command must have several rights on the local machine. These can be managed manually using secpol.msc: ** Act as part of the operating system - SE_TCB_NAME ** Replace a process-level token - SE_ASSIGNPRIMARYTOKEN_NAME ** Adjust memory quotas for a process - SE_INCREASE_QUOTA_NAME * The launched process will not have rights to the desktop so will not be able to display any information or create UI. * The launched process will have no network credentials. Any access of network resources that requires domain authentication will fail. h2. Implementation: Winutils performs the following steps: # Enable the required privileges for the current process. # Register as a trusted process with the Local Security Authority (LSA). # Create a new logon for the user passed on the command line. # Load/Create a profile on the local machine for the new logon. # Create a new
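A hypothetical invocation following the syntax above; the task name, user, and command line are made-up placeholder values, not taken from the patch:
{code}
REM Sketch only: launch a command as the given domain user via the new sub-command.
REM task_001, testuser@EXAMPLE.COM and the quoted command line are placeholders.
winutils task createAsUser task_001 testuser@EXAMPLE.COM "cmd /c whoami"
{code}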
[jira] [Commented] (YARN-2630) TestDistributedShell#testDSRestartWithPreviousRunningContainers fails
[ https://issues.apache.org/jira/browse/YARN-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156543#comment-14156543 ] Hudson commented on YARN-2630: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1914 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1914/]) YARN-2630. Prevented previous AM container status from being acquired by the current restarted AM. Contributed by Jian He. (zjshen: rev 52bbe0f11bc8e97df78a1ab9b63f4eff65fd7a76) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/NodeHeartbeatResponse.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/NodeHeartbeatResponsePBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto TestDistributedShell#testDSRestartWithPreviousRunningContainers fails - Key: YARN-2630 URL: https://issues.apache.org/jira/browse/YARN-2630 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.6.0 Attachments: YARN-2630.1.patch, YARN-2630.2.patch, YARN-2630.3.patch, YARN-2630.4.patch The problem is that after YARN-1372, in work-preserving AM restart, the re-launched AM will also receive previously failed AM container. But DistributedShell logic is not expecting this extra completed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156618#comment-14156618 ] Junping Du commented on YARN-1979: -- Thanks [~ozawa] for reminding me about this. Yes, I did forget this JIRA. +1. Committing it now. TestDirectoryCollection fails when the umask is unusual --- Key: YARN-1979 URL: https://issues.apache.org/jira/browse/YARN-1979 Project: Hadoop YARN Issue Type: Test Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Attachments: YARN-1979.2.patch, YARN-1979.txt I've seen this fail in Windows where the default permissions are matching up to 700. {code} --- Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection --- Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.422 sec FAILURE! java.lang.AssertionError: local dir parent Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA not created with proper permissions expected:rwxr-xr-x but was:rwx-- at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) {code} The clash is between testDiskSpaceUtilizationLimit() and testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields
[ https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2615: - Attachment: YARN-2615-v2.patch In v2 patch, - Fix test failures and audit warning. - Add more tests for RMDelegationToken and TimelineDelegationToken. ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields Key: YARN-2615 URL: https://issues.apache.org/jira/browse/YARN-2615 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2615-v2.patch, YARN-2615.patch As three TokenIdentifiers get updated in YARN-668, ClientToAMTokenIdentifier and DelegationTokenIdentifier should also be updated in the same way to allow fields get extended in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156647#comment-14156647 ] Hudson commented on YARN-1979: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6174 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6174/]) YARN-1979. TestDirectoryCollection fails when the umask is unusual. (Contributed by Vinod Kumar Vavilapalli and Tsuyoshi OZAWA) (junping_du: rev c7cee9b4551918d5d35bf4e9dc73982a050c73ba) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDirectoryCollection.java TestDirectoryCollection fails when the umask is unusual --- Key: YARN-1979 URL: https://issues.apache.org/jira/browse/YARN-1979 Project: Hadoop YARN Issue Type: Test Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.7.0 Attachments: YARN-1979.2.patch, YARN-1979.txt I've seen this fail in Windows where the default permissions are matching up to 700. {code} --- Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection --- Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.422 sec FAILURE! java.lang.AssertionError: local dir parent Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA not created with proper permissions expected:rwxr-xr-x but was:rwx-- at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) {code} The clash is between testDiskSpaceUtilizationLimit() and testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156653#comment-14156653 ] Tsuyoshi OZAWA commented on YARN-1979: -- Thanks Vinod for the contribution and Junping for the review! TestDirectoryCollection fails when the umask is unusual --- Key: YARN-1979 URL: https://issues.apache.org/jira/browse/YARN-1979 Project: Hadoop YARN Issue Type: Test Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Fix For: 2.7.0 Attachments: YARN-1979.2.patch, YARN-1979.txt I've seen this fail in Windows where the default permissions are matching up to 700. {code} --- Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection --- Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.422 sec FAILURE! java.lang.AssertionError: local dir parent Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA not created with proper permissions expected:rwxr-xr-x but was:rwx-- at org.junit.Assert.fail(Assert.java:93) at org.junit.Assert.failNotEquals(Assert.java:647) at org.junit.Assert.assertEquals(Assert.java:128) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) {code} The clash is between testDiskSpaceUtilizationLimit() and testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields
[ https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156651#comment-14156651 ] Hadoop QA commented on YARN-2615: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672553/YARN-2615-v2.patch against trunk revision c7cee9b. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5237//console This message is automatically generated. ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields Key: YARN-2615 URL: https://issues.apache.org/jira/browse/YARN-2615 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2615-v2.patch, YARN-2615.patch As three TokenIdentifiers get updated in YARN-668, ClientToAMTokenIdentifier and DelegationTokenIdentifier should also be updated in the same way to allow fields get extended in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields
[ https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156655#comment-14156655 ] Tsuyoshi OZAWA commented on YARN-2615: -- [~djp], currently it seems the YARN build is broken on Jenkins CI. I faced the same issue on YARN-2562. ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields Key: YARN-2615 URL: https://issues.apache.org/jira/browse/YARN-2615 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2615-v2.patch, YARN-2615.patch As three TokenIdentifiers get updated in YARN-668, ClientToAMTokenIdentifier and DelegationTokenIdentifier should also be updated in the same way to allow fields get extended in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156761#comment-14156761 ] Jian He commented on YARN-2617: --- YARN-2640 seems resolved in YARN-1979 already. NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce task (PI). NM continuously reported completed containers whose Application had already finished, after the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2617) NM does not need to send finished container whose APP is not running to RM
[ https://issues.apache.org/jira/browse/YARN-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156772#comment-14156772 ] Hudson commented on YARN-2617: -- FAILURE: Integrated in Hadoop-trunk-Commit #6176 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6176/]) YARN-2617. Fixed NM to not send duplicate container status whose app is not running. Contributed by Jun Gong (jianhe: rev 3ef1cf187faeb530e74606dd7113fd1ba08140d7) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NM does not need to send finished container whose APP is not running to RM -- Key: YARN-2617 URL: https://issues.apache.org/jira/browse/YARN-2617 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.6.0 Attachments: YARN-2617.2.patch, YARN-2617.3.patch, YARN-2617.4.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.5.patch, YARN-2617.6.patch, YARN-2617.patch We ([~chenchun]) are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce task (PI). NM continuously reported completed containers whose Application had already finished, after the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In the patch for YARN-1372, ApplicationImpl on NM should guarantee to clean up already completed applications. But it will only remove appId from 'app.context.getApplications()' when ApplicationImpl receives the event 'ApplicationEventType.APPLICATION_LOG_HANDLING_FINISHED'; however, NM might not receive this event for a long time, or might never receive it. * For NonAggregatingLogHandler, it waits for YarnConfiguration.NM_LOG_RETAIN_SECONDS, which is 3 * 60 * 60 sec by default, before it is scheduled to delete Application logs and send the event. * For LogAggregationService, it might fail (e.g. if the user does not have HDFS write permission), and then it will not send the event. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
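For reference, a minimal yarn-site.xml sketch of the retention setting described in the YARN-2617 report above; yarn.nodemanager.log.retain-seconds is assumed to be the property behind YarnConfiguration.NM_LOG_RETAIN_SECONDS, and 10800 matches the 3 * 60 * 60 second default cited there:
{code}
<!-- Sketch only: how long the NM keeps local container logs when log aggregation is disabled.
     The property name is assumed to back YarnConfiguration.NM_LOG_RETAIN_SECONDS. -->
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>10800</value>
</property>
{code}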
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Attachment: YARN-2527.patch Thanks for the code, [~zjshen]. I have updated the patch based on the comment. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Attachment: (was: YARN-2527.patch) NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156805#comment-14156805 ] Karthik Kambatla commented on YARN-2624: The patch looks good to me. Would like input from someone more familiar with the NM restart code. [~jlowe], [~djp] - can either of you take a look? We would like to get this committed soon. Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2254) TestRMWebServicesAppsModification should run against both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156803#comment-14156803 ] Hudson commented on YARN-2254: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6177 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6177/]) YARN-2254. TestRMWebServicesAppsModification should run against both CS and FS. (Zhihai Xu via kasha) (kasha: rev 5e0b49da9caa53814581508e589f3704592cf335) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java TestRMWebServicesAppsModification should run against both CS and FS --- Key: YARN-2254 URL: https://issues.apache.org/jira/browse/YARN-2254 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: test Fix For: 2.7.0 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch TestRMWebServicesAppsModification skips the test, if the scheduler is not CapacityScheduler. change TestRMWebServicesAppsModification to support both CapacityScheduler and FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Attachment: YARN-2527.patch NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
[ https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156814#comment-14156814 ] Siqi Li commented on YARN-1414: --- Sure, I will submit a rebased patch shortly. with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs - Key: YARN-1414 URL: https://issues.apache.org/jira/browse/YARN-1414 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Fix For: 2.2.0 Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156824#comment-14156824 ] Jason Lowe commented on YARN-2624: -- Thanks for catching and fixing this, Anubhav! My apologies for missing this scenario in the original JIRA. +1 lgtm. Committing this. Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156835#comment-14156835 ] Karthik Kambatla commented on YARN-2624: Thanks for super-quick turnaround, Jason. Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156836#comment-14156836 ] Ray Chiang commented on YARN-2635: -- Looks good to me. Ran cleanly in my tree. +1 TestRMRestart fails with FairScheduler -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156841#comment-14156841 ] Hudson commented on YARN-2624: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6178 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6178/]) YARN-2624. Resource Localization fails on a cluster due to existing cache directories. Contributed by Anubhav Dhoot (jlowe: rev 29f520052e2b02f44979980e446acc0dccd96d54) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2638) Let TestRM run with all types of schedulers (FIFO, Capacity, Fair)
[ https://issues.apache.org/jira/browse/YARN-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156867#comment-14156867 ] Ray Chiang commented on YARN-2638: -- This patch fixes the test for me. +1 Let TestRM run with all types of schedulers (FIFO, Capacity, Fair) -- Key: YARN-2638 URL: https://issues.apache.org/jira/browse/YARN-2638 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2638-1.patch TestRM fails when using FairScheduler or FifoScheduler. The failures are not shown in trunk as the trunk uses the default capacity scheduler. We need to let TestRM run with all types of schedulers, to make sure any new change won't break any scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156869#comment-14156869 ] Anubhav Dhoot commented on YARN-2624: - Thanks [~jlowe]! Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2254) TestRMWebServicesAppsModification should run against both CS and FS
[ https://issues.apache.org/jira/browse/YARN-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156888#comment-14156888 ] zhihai xu commented on YARN-2254: - thanks [~kasha] for reviewing and committing the patch. TestRMWebServicesAppsModification should run against both CS and FS --- Key: YARN-2254 URL: https://issues.apache.org/jira/browse/YARN-2254 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Assignee: zhihai xu Priority: Minor Labels: test Fix For: 2.7.0 Attachments: YARN-2254.000.patch, YARN-2254.001.patch, YARN-2254.002.patch, YARN-2254.003.patch, YARN-2254.004.patch TestRMWebServicesAppsModification skips the test, if the scheduler is not CapacityScheduler. change TestRMWebServicesAppsModification to support both CapacityScheduler and FairScheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156890#comment-14156890 ] Hadoop QA commented on YARN-2527: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672583/YARN-2527.patch against trunk revision 5e0b49d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5238//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5238//console This message is automatically generated. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
[ https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156892#comment-14156892 ] Siqi Li commented on YARN-1414: --- I just found out that this problem has been fixed in the trunk. I am going to close this jira with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs - Key: YARN-1414 URL: https://issues.apache.org/jira/browse/YARN-1414 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Fix For: 2.2.0 Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2639) TestClientToAMTokens should run with all types of schedulers
[ https://issues.apache.org/jira/browse/YARN-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2639: -- Attachment: YARN-2639-2.patch re-trigger the jenkins TestClientToAMTokens should run with all types of schedulers Key: YARN-2639 URL: https://issues.apache.org/jira/browse/YARN-2639 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2639-1.patch, YARN-2639-2.patch TestClientToAMTokens fails with FairScheduler now. We should let TestClientToAMTokens run with all kinds of schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156905#comment-14156905 ] Karthik Kambatla commented on YARN-2180: Looks mostly good, but for these minor comments: # App-checker and the store implementations aren't related: ## the app-checker config should be appended to SHARED_CACHE_PREFIX and IN_MEMORY_STORE ## the variable names should be updated accordingly. ## InMemorySCMStore#createAppCheckerService should move to a util class - how about changing SharedCacheStructureUtil to SharedCacheUtil and adding this method there? # Can we create a follow-up blocker sub-task to revisit all the config names before we include sharedcache work in a release? In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch, YARN-2180-trunk-v6.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
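To make comment #1 concrete, the key layout being asked for might look roughly like the sketch below; the yarn.sharedcache. prefix follows the existing shared-cache configuration work, while the specific suffixes and constant names are only illustrative guesses, not the patch's actual names.
{code}
public final class SharedCacheConfigSketch {
  // Store-agnostic settings hang directly off the shared-cache prefix ...
  public static final String SHARED_CACHE_PREFIX = "yarn.sharedcache.";
  public static final String SCM_APP_CHECKER_CLASS =
      SHARED_CACHE_PREFIX + "app-checker.class";           // hypothetical key

  // ... while store-specific settings stay under the in-memory store prefix.
  public static final String IN_MEMORY_STORE_PREFIX =
      SHARED_CACHE_PREFIX + "store.in-memory.";            // hypothetical key
  public static final String IN_MEMORY_STALENESS_PERIOD_MINS =
      IN_MEMORY_STORE_PREFIX + "staleness-period-mins";    // hypothetical key
}
{code}
Under this layout, the createAppCheckerService helper would live next to these constants in the renamed SharedCacheUtil rather than in InMemorySCMStore, as the review suggests.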
[jira] [Commented] (YARN-2635) TestRMRestart fails with FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156933#comment-14156933 ] Karthik Kambatla commented on YARN-2635: +1. Committing this. TestRMRestart fails with FairScheduler -- Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2635: --- Summary: TestRMRestart should run with all schedulers (was: TestRMRestart fails with FairScheduler) TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2638) TestRM should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2638: --- Summary: TestRM should run with all schedulers (was: Let TestRM run with all types of schedulers (FIFO, Capacity, Fair)) TestRM should run with all schedulers - Key: YARN-2638 URL: https://issues.apache.org/jira/browse/YARN-2638 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2638-1.patch TestRM fails when using FairScheduler or FifoScheduler. The failures not shown in trunk as the trunk uses the default capacity scheduler. We need to let TestRM run with all types of schedulers, to make sure any new change wouldn't break any scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2634) Test failure for TestClientRMTokens
[ https://issues.apache.org/jira/browse/YARN-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He reassigned YARN-2634: - Assignee: Jian He Test failure for TestClientRMTokens --- Key: YARN-2634 URL: https://issues.apache.org/jira/browse/YARN-2634 Project: Hadoop YARN Issue Type: Test Reporter: Junping Du Assignee: Jian He Priority: Blocker The test get failed as below: {noformat} --- Test set: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens --- Tests run: 6, Failures: 3, Errors: 2, Skipped: 0, Time elapsed: 60.184 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 22.693 sec FAILURE! java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:272) testShortCircuitRenewCancelDifferentHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 20.087 sec FAILURE! java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostDifferentPort(TestClientRMTokens.java:283) testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.031 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:148) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:101) at org.apache.hadoop.security.token.Token.renew(Token.java:377) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:309) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:241) testShortCircuitRenewCancelSameHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.061 sec FAILURE! java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelSameHostDifferentPort(TestClientRMTokens.java:261) testShortCircuitRenewCancelWildcardAddress(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.07 sec ERROR! 
java.lang.NullPointerException: null at org.apache.hadoop.net.NetUtils.isLocalAddress(NetUtils.java:684) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:149) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields
[ https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156951#comment-14156951 ] Jian He commented on YARN-2615: --- looks good, only few minor things: - {{ClientToAMTokenIdentifierForTest}}, the same code overrides from {{ClientToAMTokenIdentifier}} may be removed ? similarly for {{RMDelegationTokenIdentifierForTest}} - this code can be removed. {code} byte[] tokenIdentifierContent = token.getIdentifier(); ClientToAMTokenIdentifier tokenIdentifier = new ClientToAMTokenIdentifier(); DataInputBuffer dib = new DataInputBuffer(); dib.reset(tokenIdentifierContent, tokenIdentifierContent.length); tokenIdentifier.readFields(dib); {code} ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields Key: YARN-2615 URL: https://issues.apache.org/jira/browse/YARN-2615 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2615-v2.patch, YARN-2615.patch As three TokenIdentifiers get updated in YARN-668, ClientToAMTokenIdentifier and DelegationTokenIdentifier should also be updated in the same way to allow fields get extended in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14156963#comment-14156963 ] Karthik Kambatla commented on YARN-2635: Just saw YARN-2638 as well. On second thought, it might be better to club the two JIRAs and implement a base class for RM tests that run against all schedulers. And, schedulerType in these tests should probably be an enum so subclasses don't have to know the order. TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
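A rough sketch of what such a shared base class could look like, assuming JUnit 4's Parameterized runner (the class and method names here are illustrative, not taken from any attached patch); subclasses would extend it, pass the scheduler type through to the constructor, and call createConfiguration() instead of hard-coding a scheduler:
{code}
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public abstract class ParameterizedSchedulerTestBase {

  // An enum (not an int index) means subclasses never depend on the
  // ordering of the @Parameters collection.
  public enum SchedulerType { CAPACITY, FAIR, FIFO }

  private final SchedulerType schedulerType;

  public ParameterizedSchedulerTestBase(SchedulerType type) {
    this.schedulerType = type;
  }

  @Parameters
  public static Collection<Object[]> schedulers() {
    return Arrays.asList(new Object[][] {
        {SchedulerType.CAPACITY}, {SchedulerType.FAIR}, {SchedulerType.FIFO}});
  }

  /** Build a configuration that points the RM at the scheduler under test. */
  protected YarnConfiguration createConfiguration() {
    YarnConfiguration conf = new YarnConfiguration();
    Class<? extends ResourceScheduler> clazz;
    switch (schedulerType) {
      case FAIR:
        clazz = FairScheduler.class;
        break;
      case FIFO:
        clazz = FifoScheduler.class;
        break;
      default:
        clazz = CapacityScheduler.class;
    }
    conf.setClass(YarnConfiguration.RM_SCHEDULER, clazz, ResourceScheduler.class);
    return conf;
  }
}
{code}
A test like TestRMRestart would then only declare a constructor such as public TestRMRestart(SchedulerType type) { super(type); } and otherwise stay scheduler-agnostic.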
[jira] [Resolved] (YARN-2639) TestClientToAMTokens should run with all types of schedulers
[ https://issues.apache.org/jira/browse/YARN-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2639. Resolution: Duplicate Can we fix this also as part of YARN-2635. TestClientToAMTokens should run with all types of schedulers Key: YARN-2639 URL: https://issues.apache.org/jira/browse/YARN-2639 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2639-1.patch, YARN-2639-2.patch TestClientToAMTokens fails with FairScheduler now. We should let TestClientToAMTokens run with all kinds of schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2639) TestClientToAMTokens should run with all types of schedulers
[ https://issues.apache.org/jira/browse/YARN-2639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157012#comment-14157012 ] Hadoop QA commented on YARN-2639: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672593/YARN-2639-2.patch against trunk revision 29f5200. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5239//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5239//console This message is automatically generated. TestClientToAMTokens should run with all types of schedulers Key: YARN-2639 URL: https://issues.apache.org/jira/browse/YARN-2639 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2639-1.patch, YARN-2639-2.patch TestClientToAMTokens fails with FairScheduler now. We should let TestClientToAMTokens run with all kinds of schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198: -- Attachment: YARN-1198.11.patch Attaching patch .11, this is based on .10 (nee .7), the preferred approach, with the a factoring change to decrease the impact - the HeadroomProvider is now limited to just the CapacityScheduler area / FiCaSchedulerApp. It's actually possible to remove the HeadroomProvider altogether in favor of adding more members to the scheduler app, but I think it actually looks better factored this way (the functional result would be the same). Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.10.patch, YARN-1198.11.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch, YARN-1198.9.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
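For context, the provider indirection being described could be sketched along these lines (the class, field, and method names are guesses from the comment, not the contents of the .11 patch): the scheduler app holds a small provider object that the queue keeps up to date, so headroom is derived from current limits whenever the AM asks, rather than being a value cached at allocation time.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

/**
 * Hypothetical provider held by the FiCaSchedulerApp. The queue pushes the
 * latest user limit and remaining queue capacity into it; the app derives a
 * fresh headroom from those inputs on every heartbeat.
 */
public class CapacityHeadroomProvider {
  private volatile Resource userLimit = Resources.none();
  private volatile Resource queueAvailable = Resources.none();

  /** Called by the queue whenever its limits are recomputed. */
  public void update(Resource newUserLimit, Resource newQueueAvailable) {
    this.userLimit = newUserLimit;
    this.queueAvailable = newQueueAvailable;
  }

  /** Called from the app when responding to an AM heartbeat. */
  public Resource getHeadroom(Resource userConsumed) {
    Resource fromUserLimit = Resources.subtract(userLimit, userConsumed);
    Resource headroom = Resources.componentwiseMin(fromUserLimit, queueAvailable);
    // Clamp negatives to zero; an over-consuming user simply has no headroom.
    return Resources.componentwiseMax(headroom, Resources.none());
  }
}
{code}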
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157067#comment-14157067 ] Hadoop QA commented on YARN-1198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672614/YARN-1198.11.patch against trunk revision a56f3ec. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5240//console This message is automatically generated. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.10.patch, YARN-1198.11.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch, YARN-1198.9.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2634) Test failure for TestClientRMTokens
[ https://issues.apache.org/jira/browse/YARN-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157079#comment-14157079 ] Jian He commented on YARN-2634: --- [~djp], I took latest trunk and ran locally, it actually passes. Would you mind checking again ? thx Test failure for TestClientRMTokens --- Key: YARN-2634 URL: https://issues.apache.org/jira/browse/YARN-2634 Project: Hadoop YARN Issue Type: Test Reporter: Junping Du Assignee: Jian He Priority: Blocker The test get failed as below: {noformat} --- Test set: org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens --- Tests run: 6, Failures: 3, Errors: 2, Skipped: 0, Time elapsed: 60.184 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens testShortCircuitRenewCancelDifferentHostSamePort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 22.693 sec FAILURE! java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostSamePort(TestClientRMTokens.java:272) testShortCircuitRenewCancelDifferentHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 20.087 sec FAILURE! java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelDifferentHostDifferentPort(TestClientRMTokens.java:283) testShortCircuitRenewCancel(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.031 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:148) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.renew(RMDelegationTokenIdentifier.java:101) at org.apache.hadoop.security.token.Token.renew(Token.java:377) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:309) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancel(TestClientRMTokens.java:241) testShortCircuitRenewCancelSameHostDifferentPort(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.061 sec FAILURE! 
java.lang.AssertionError: expected:getProxy but was:null at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.checkShortCircuitRenewCancel(TestClientRMTokens.java:319) at org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens.testShortCircuitRenewCancelSameHostDifferentPort(TestClientRMTokens.java:261) testShortCircuitRenewCancelWildcardAddress(org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens) Time elapsed: 0.07 sec ERROR! java.lang.NullPointerException: null at org.apache.hadoop.net.NetUtils.isLocalAddress(NetUtils.java:684) at org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier$Renewer.getRmClient(RMDelegationTokenIdentifier.java:149) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: YARN-2408-5.patch Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features Attachments: YARN-2408-5.patch, YARN-2408.4.patch I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: (was: YARN-2408-5.patch) Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2408) Resource Request REST API for YARN
[ https://issues.apache.org/jira/browse/YARN-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renan DelValle updated YARN-2408: - Attachment: (was: YARN-2408.4.patch) Resource Request REST API for YARN -- Key: YARN-2408 URL: https://issues.apache.org/jira/browse/YARN-2408 Project: Hadoop YARN Issue Type: New Feature Components: webapp Reporter: Renan DelValle Labels: features I’m proposing a new REST API for YARN which exposes a snapshot of the Resource Requests that exist inside of the Scheduler. My motivation behind this new feature is to allow external software to monitor the amount of resources being requested to gain more insightful information into cluster usage than is already provided. The API can also be used by external software to detect a starved application and alert the appropriate users and/or sys admin so that the problem may be remedied. Here is the proposed API (a JSON counterpart is also available):
{code:xml}
<resourceRequests>
  <MB>7680</MB>
  <VCores>7</VCores>
  <appMaster>
    <applicationId>application_1412191664217_0001</applicationId>
    <applicationAttemptId>appattempt_1412191664217_0001_01</applicationAttemptId>
    <queueName>default</queueName>
    <totalMB>6144</totalMB>
    <totalVCores>6</totalVCores>
    <numResourceRequests>3</numResourceRequests>
    <requests>
      <request>
        <MB>1024</MB>
        <VCores>1</VCores>
        <numContainers>6</numContainers>
        <relaxLocality>true</relaxLocality>
        <priority>20</priority>
        <resourceNames>
          <resourceName>localMachine</resourceName>
          <resourceName>/default-rack</resourceName>
          <resourceName>*</resourceName>
        </resourceNames>
      </request>
    </requests>
  </appMaster>
  <appMaster>
    ...
  </appMaster>
</resourceRequests>
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.10.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157133#comment-14157133 ] Xuan Gong commented on YARN-2468: - new patch addressed all the comments Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157199#comment-14157199 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672626/YARN-2468.10.patch against trunk revision a56f3ec. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5241//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5241//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198: -- Attachment: YARN-1198.11-with-1857.patch Patch combining the last .11 with the latest 1857 patch, to make it easy to check them out together. Tests changed/added for both issues are present and pass (unchanged) Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.10.patch, YARN-1198.11-with-1857.patch, YARN-1198.11.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch, YARN-1198.9.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157217#comment-14157217 ] Hadoop QA commented on YARN-1198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672649/YARN-1198.11-with-1857.patch against trunk revision f679ca3. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5243//console This message is automatically generated. Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.10.patch, YARN-1198.11-with-1857.patch, YARN-1198.11.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch, YARN-1198.9.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2414) RM web UI: app page will crash if app is failed before any attempt has been created
[ https://issues.apache.org/jira/browse/YARN-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157234#comment-14157234 ] Jason Lowe commented on YARN-2414: -- Ran into this as well. Any update, [~leftnoteasy]? RM web UI: app page will crash if app is failed before any attempt has been created --- Key: YARN-2414 URL: https://issues.apache.org/jira/browse/YARN-2414 Project: Hadoop YARN Issue Type: Bug Components: webapp Reporter: Zhijie Shen Assignee: Wangda Tan {code} 2014-08-12 16:45:13,573 ERROR org.apache.hadoop.yarn.webapp.Dispatcher: error handling URI: /cluster/app/application_1407887030038_0001 java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:153) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:900) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:84) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:460) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1191) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at 
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: java.lang.NullPointerException at
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157245#comment-14157245 ] Zhijie Shen commented on YARN-2527: --- +1, will commit the patch NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157246#comment-14157246 ] Ray Chiang commented on YARN-2635: -- Tested TestRM/TestRMRestart/TestClientToAMTokens. All three tests now pass cleanly using FairScheduler. +1 TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch, YARN-2635-2.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157248#comment-14157248 ] Craig Welch commented on YARN-1680: --- [~airbots] thanks for your updated WIP patch - I've not looked at it extensively yet, but at first glance it looks good to me. On the original patch I noticed that there seems to be a facility for blacklisting racks as well as nodes, and I was concerned that that needed to be addressed as well. It may be in this patch, but it did not look like it to me. I do think it can be without too much difficulty - I think putting the additions (and removals) into sets and then checking to see if the node's rack is in the set during the node iteration would do the trick (I may be off here, but that looks like it would work to me.) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each.Total cluster capacity is 32GB.Cluster slow start is set to 1. Job is running reducer task occupied 29GB of cluster.One NodeManager(NM-4) is become unstable(3 Map got killed), MRAppMaster blacklisted unstable NodeManager(NM-4). All reducer task are running in cluster now. MRAppMaster does not preempt the reducers because for Reducer preemption calculation, headRoom is considering blacklisted nodes memory. This makes jobs to hang forever(ResourceManager does not assing any new containers on blacklisted nodes but returns availableResouce considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
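To make the set-based suggestion concrete, a minimal sketch of the check follows (the helper and method names are illustrative; the real patch does this inside the scheduler): split the application's blacklist into host names and rack names up front, then charge a node's free capacity against the headroom deduction if either its name or its rack is blacklisted.
{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;
import org.apache.hadoop.yarn.util.resource.Resources;

public class BlacklistDeductionSketch {
  /**
   * Sum the free capacity on nodes the application has blacklisted, either
   * directly by host name or through their rack. The caller subtracts the
   * result from the headroom reported to that application's AM.
   */
  public static Resource blacklistedAvailable(
      List<? extends SchedulerNode> nodes, Set<String> blacklist) {
    Set<String> hosts = new HashSet<String>();
    Set<String> racks = new HashSet<String>();
    for (String entry : blacklist) {
      // Assumption: rack entries use the path-like form, e.g. "/default-rack".
      if (entry.startsWith("/")) {
        racks.add(entry);
      } else {
        hosts.add(entry);
      }
    }
    Resource deduction = Resources.createResource(0, 0);
    for (SchedulerNode node : nodes) {
      if (hosts.contains(node.getNodeName())
          || racks.contains(node.getRackName())) {
        Resources.addTo(deduction, node.getAvailableResource());
      }
    }
    return deduction;
  }
}
{code}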
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157265#comment-14157265 ] Karthik Kambatla commented on YARN-1879: Thanks for working on this, Tsuyoshi. Review comments on the latest patch: # Are there cases when we don't want RetryCache enabled? IMO, we should always use the RetryCache (no harm). If we decide on having a config, the default should be true. # I would set DEFAULT_RM_RETRY_CACHE_EXPIRY_MS to {{10 * 60 * 1000}} instead of 60, and the corresponding comment (10 mins) can be removed or moved to the same line. # TestApplicationMasterServiceRetryCache has a few lines longer than 80 chars. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
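Expressed as code, the suggested defaults would look something like the sketch below; the key strings and constant names are assumptions based on the review comment, not the committed configuration.
{code}
public final class RetryCacheConfigSketch {
  // Mirrors YarnConfiguration.RM_PREFIX ("yarn.resourcemanager.").
  private static final String RM_PREFIX = "yarn.resourcemanager.";

  // Hypothetical key; per comment #1, if a flag is kept at all it should
  // default to true so the retry cache is effectively always on.
  public static final String RM_RETRY_CACHE_ENABLED =
      RM_PREFIX + "retry-cache.enabled";
  public static final boolean DEFAULT_RM_RETRY_CACHE_ENABLED = true;

  // Hypothetical key; 10 minutes written as an expression, which makes a
  // separate "(10 mins)" comment unnecessary.
  public static final String RM_RETRY_CACHE_EXPIRY_MS =
      RM_PREFIX + "retry-cache.expiry-ms";
  public static final long DEFAULT_RM_RETRY_CACHE_EXPIRY_MS = 10 * 60 * 1000;
}
{code}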
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157269#comment-14157269 ] Hudson commented on YARN-2527: -- FAILURE: Integrated in Hadoop-trunk-Commit #6182 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6182/]) YARN-2527. Fixed the potential NPE in ApplicationACLsManager and added test cases for it. Contributed by Benoy Antony. (zjshen: rev 1c93025a1b370db46e345161dbc15e03f829823f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/server/security/ApplicationACLsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/server/security/TestApplicationACLsManager.java NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Fix For: 2.6.0 Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157271#comment-14157271 ] Craig Welch commented on YARN-1680: --- [~john.jian.fang] I should probably not have referred to the cluster level adjustments as blacklisting. What I see is a mechanism (state machine, events, including adding and removing nodes and the unhealthy state/the health monitor) that, I think, ultimately result in the CapacityScheduler.addNode() and removeNode() calls, which modify the clusterResource value. In any case, the blacklisting functionality we are addressing here definitely looks to be application specific needs to be addressed at that level. The issue isn't, so far as I know, related to any blacklisting/node health issues outside the one in play here, as those should work properly for headroom as they adjust the cluster resource. The problem is that the application blacklist activity does not adjust the cluster resource and was previously not involved in the headroom calculation. If it's not the case that cluster level adjustments are being made for nodes then this blacklisting will result in duplication among applications as they independently discover problems with nodes and blacklist them, but that is not a new characteristic of the way the system works. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each.Total cluster capacity is 32GB.Cluster slow start is set to 1. Job is running reducer task occupied 29GB of cluster.One NodeManager(NM-4) is become unstable(3 Map got killed), MRAppMaster blacklisted unstable NodeManager(NM-4). All reducer task are running in cluster now. MRAppMaster does not preempt the reducers because for Reducer preemption calculation, headRoom is considering blacklisted nodes memory. This makes jobs to hang forever(ResourceManager does not assing any new containers on blacklisted nodes but returns availableResouce considers cluster free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157274#comment-14157274 ] Ray Chiang commented on YARN-2635: -- Oops, pending Jenkins of course. TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch, YARN-2635-2.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157286#comment-14157286 ] Craig Welch commented on YARN-1680: --- This does bring up what I think could be an issue; I'm not sure if it is what you were getting at before or not, [~john.jian.fang], but we could well be introducing a new bug here unless we are careful. I don't see any connection between the scheduler level resource adjustments and the application level adjustments, so if an application had problems with a node and blacklisted it, and then the cluster did as well, the resource value of the node would effectively be removed from the headroom twice (once when the application adds it to its new blacklist reduction, and a second time when the cluster removes its value from the clusterResource). I think this could be a problem and that it could be addressed, but it's something to think about, and I don't think the current approach addresses it. [~airbots], [~jlowe], thoughts? availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks were killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because, for the reducer preemption calculation, headroom still includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, yet it returns an availableResource that still counts that free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
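To make the double-counting concern concrete, here is a small, self-contained example with made-up numbers (not taken from the JIRA or any patch): one 8 GB node is first blacklisted by the application and later also removed from the cluster resource by the cluster-level health check.
{code}
public class HeadroomDoubleCountExample {
  public static void main(String[] args) {
    long clusterMB = 4 * 8192;     // 4 NMs x 8 GB
    long usedMB = 20 * 1024;       // memory already allocated to the app
    long appBlacklistMB = 8192;    // application-level blacklist deduction

    // Application-level deduction only: the node is subtracted once.
    long headroom = clusterMB - usedMB - appBlacklistMB;
    System.out.println("headroom after app-level blacklist: " + headroom + " MB");

    // If the cluster later marks the same node unhealthy, clusterResource
    // shrinks by the same 8 GB, so the node is effectively subtracted twice.
    clusterMB -= 8192;
    long doubleCounted = clusterMB - usedMB - appBlacklistMB;
    System.out.println("headroom after cluster-level removal too: " + doubleCounted + " MB");
  }
}
{code}
The second value goes negative (or would be clamped to zero), understating the resources actually available to the application.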
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157291#comment-14157291 ] Hadoop QA commented on YARN-2468: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672626/YARN-2468.10.patch against trunk revision f679ca3. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5244//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5244//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.10.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch, YARN-2468.9.1.patch, YARN-2468.9.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157296#comment-14157296 ] Hadoop QA commented on YARN-2635: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672637/YARN-2635-2.patch against trunk revision 6ac1051. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5242//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5242//console This message is automatically generated. TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch, YARN-2635-2.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2598) GHS should show N/A instead of null for the inaccessible information
[ https://issues.apache.org/jira/browse/YARN-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2598: -- Attachment: YARN-2598.2.patch Rebase the patch against the latest trunk GHS should show N/A instead of null for the inaccessible information Key: YARN-2598 URL: https://issues.apache.org/jira/browse/YARN-2598 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2598.1.patch, YARN-2598.2.patch When the user doesn't have the access to an application, the app attempt information is not visible to the user. ClientRMService will output N/A, but GHS is showing null, which is not user-friendly. {code} 14/09/24 22:07:20 INFO impl.TimelineClientImpl: Timeline service address: http://nn.example.com:8188/ws/v1/timeline/ 14/09/24 22:07:20 INFO client.RMProxy: Connecting to ResourceManager at nn.example.com/240.0.0.11:8050 14/09/24 22:07:21 INFO client.AHSProxy: Connecting to Application History server at nn.example.com/240.0.0.11:10200 Application Report : Application-Id : application_1411586934799_0001 Application-Name : Sleep job Application-Type : MAPREDUCE User : hrt_qa Queue : default Start-Time : 1411586956012 Finish-Time : 1411586989169 Progress : 100% State : FINISHED Final-State : SUCCEEDED Tracking-URL : null RPC Port : -1 AM Host : null Aggregate Resource Allocation : N/A Diagnostics : null {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
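As a rough illustration of the substitution being proposed (names here are illustrative, not the attached patch), a helper like the following could be applied to each nullable field when the report is rendered for a caller without access:
{code}
final class ReportStrings {
  static final String UNAVAILABLE = "N/A";

  // Replace a missing value with "N/A" so GHS output matches ClientRMService.
  static String orUnavailable(String value) {
    return value == null ? UNAVAILABLE : value;
  }
}

// Hypothetical usage when building the report:
//   trackingUrl = ReportStrings.orUnavailable(attempt == null ? null : attempt.getTrackingUrl());
//   amHost      = ReportStrings.orUnavailable(attempt == null ? null : attempt.getHost());
{code}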
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157302#comment-14157302 ] Steve Loughran commented on YARN-913: - Failing test is still the (believed unrelated) Running org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 379.565 sec FAILURE! - in org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell testDSRestartWithPreviousRunningContainers(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell) Time elapsed: 38.715 sec FAILURE! java.lang.AssertionError: client failed at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSRestartWithPreviousRunningContainers(TestDistributedShell.java:319) Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, YARN-913-016.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2628) Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free
[ https://issues.apache.org/jira/browse/YARN-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157311#comment-14157311 ] Hudson commented on YARN-2628: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6183 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6183/]) YARN-2628. Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free. Contributed by Varun Vasudev (jianhe: rev 054f28552687e9b9859c0126e16a2066e20ead3f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * hadoop-yarn-project/CHANGES.txt Capacity scheduler with DominantResourceCalculator carries out reservation even though slots are free - Key: YARN-2628 URL: https://issues.apache.org/jira/browse/YARN-2628 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.5.1 Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-2628.0.patch, apache-yarn-2628.1.patch We've noticed that if you run the CapacityScheduler with the DominantResourceCalculator, sometimes apps will end up with containers in a reserved state even though free slots are available. The root cause seems to be this piece of code from CapacityScheduler.java - {noformat} // Try to schedule more if there are no reservations to fulfill if (node.getReservedContainer() == null) { if (Resources.greaterThanOrEqual(calculator, getClusterResource(), node.getAvailableResource(), minimumAllocation)) { if (LOG.isDebugEnabled()) { LOG.debug("Trying to schedule on node: " + node.getNodeName() + ", available: " + node.getAvailableResource()); } root.assignContainers(clusterResource, node, false); } } else { LOG.info("Skipping scheduling since node " + node.getNodeID() + " is reserved by application " + node.getReservedContainer().getContainerId().getApplicationAttemptId()); } {noformat} The code is meant to check if a node has any slots available for containers. Since it uses the greaterThanOrEqual function, we end up in a situation where greaterThanOrEqual returns true even though we may not have enough CPU or memory to actually run the container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
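The description above implies that a per-dimension check is needed rather than a dominant-share comparison. A sketch of that idea follows; it is illustrative and not necessarily how the committed patch fixes it.
{code}
// With DominantResourceCalculator, greaterThanOrEqual() can return true when
// only the dominant dimension (say, memory) is large enough. Checking each
// dimension explicitly avoids scheduling attempts that can only end in a
// reservation.
private static boolean hasRoomFor(Resource available, Resource minimumAllocation) {
  return available.getMemory() >= minimumAllocation.getMemory()
      && available.getVirtualCores() >= minimumAllocation.getVirtualCores();
}

// ...and in the scheduling loop, roughly:
// if (node.getReservedContainer() == null
//     && hasRoomFor(node.getAvailableResource(), minimumAllocation)) {
//   root.assignContainers(clusterResource, node, false);
// }
{code}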
[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
[ https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157336#comment-14157336 ] Sandy Ryza commented on YARN-1414: -- Awesome with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs - Key: YARN-1414 URL: https://issues.apache.org/jira/browse/YARN-1414 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2598) GHS should show N/A instead of null for the inaccessible information
[ https://issues.apache.org/jira/browse/YARN-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157349#comment-14157349 ] Hadoop QA commented on YARN-2598: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672667/YARN-2598.2.patch against trunk revision 054f285. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5245//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5245//console This message is automatically generated. GHS should show N/A instead of null for the inaccessible information Key: YARN-2598 URL: https://issues.apache.org/jira/browse/YARN-2598 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2598.1.patch, YARN-2598.2.patch When the user doesn't have the access to an application, the app attempt information is not visible to the user. ClientRMService will output N/A, but GHS is showing null, which is not user-friendly. {code} 14/09/24 22:07:20 INFO impl.TimelineClientImpl: Timeline service address: http://nn.example.com:8188/ws/v1/timeline/ 14/09/24 22:07:20 INFO client.RMProxy: Connecting to ResourceManager at nn.example.com/240.0.0.11:8050 14/09/24 22:07:21 INFO client.AHSProxy: Connecting to Application History server at nn.example.com/240.0.0.11:10200 Application Report : Application-Id : application_1411586934799_0001 Application-Name : Sleep job Application-Type : MAPREDUCE User : hrt_qa Queue : default Start-Time : 1411586956012 Finish-Time : 1411586989169 Progress : 100% State : FINISHED Final-State : SUCCEEDED Tracking-URL : null RPC Port : -1 AM Host : null Aggregate Resource Allocation : N/A Diagnostics : null {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157431#comment-14157431 ] Benoy Antony commented on YARN-2527: Thanks a lot, [~zjshen]. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Fix For: 2.6.0 Attachments: YARN-2527.patch, YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is as below {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2562) ContainerId@toString() is unreadable for epoch 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2562: - Attachment: YARN-2562.5-4.patch ContainerId@toString() is unreadable for epoch 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5-2.patch, YARN-2562.5-4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch 0) after YARN-2182. For e.g, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
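For readers following the discussion, here is one way the formatting could work: keep the pre-YARN-2182 string for epoch 0 and prefix the epoch only when it is non-zero. This is a sketch with assumed field widths, not the format the attached patch necessarily adopts.
{code}
// Illustrative only: container_<ts>_<app>_<attempt>_<id> for epoch 0,
// container_e<epoch>_<ts>_<app>_<attempt>_<id> otherwise.
static String containerIdToString(long clusterTimestamp, int appId,
    int attemptId, long containerId, int epoch) {
  StringBuilder sb = new StringBuilder("container_");
  if (epoch > 0) {
    sb.append('e').append(epoch).append('_');
  }
  sb.append(clusterTimestamp).append('_');
  sb.append(String.format("%04d", appId)).append('_');
  sb.append(String.format("%02d", attemptId)).append('_');
  sb.append(String.format("%06d", containerId));
  return sb.toString();
}
{code}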
[jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
[ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157461#comment-14157461 ] Santosh Marella commented on YARN-556: -- Referencing YARN-2476 here to ensure the specific scenario mentioned there is fixed as part of this JIRA. RM Restart phase 2 - Work preserving restart Key: YARN-556 URL: https://issues.apache.org/jira/browse/YARN-556 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Bikas Saha Assignee: Bikas Saha Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch, YARN-1372.prelim.patch YARN-128 covered storing the state needed for the RM to recover critical information. This umbrella jira will track changes needed to recover the running state of the cluster so that work can be preserved across RM restarts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2562) ContainerId@toString() is unreadable for epoch 0 after YARN-2182
[ https://issues.apache.org/jira/browse/YARN-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157490#comment-14157490 ] Hadoop QA commented on YARN-2562: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672691/YARN-2562.5-4.patch against trunk revision 054f285. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5246//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5246//console This message is automatically generated. ContainerId@toString() is unreadable for epoch 0 after YARN-2182 - Key: YARN-2562 URL: https://issues.apache.org/jira/browse/YARN-2562 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-2562.1.patch, YARN-2562.2.patch, YARN-2562.3.patch, YARN-2562.4.patch, YARN-2562.5-2.patch, YARN-2562.5-4.patch, YARN-2562.5.patch ContainerID string format is unreadable for RMs that restarted at least once (epoch 0) after YARN-2182. For e.g, container_1410901177871_0001_01_05_17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157515#comment-14157515 ] Karthik Kambatla commented on YARN-2635: By the way, these tests take a long time to run. Do we want to run against all three schedulers? Or, would it be enough to run against CS and FS? TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
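One way to run the same test class against several schedulers is JUnit 4's Parameterized runner; the sketch below is illustrative and not necessarily how the attached patch does it.
{code}
import java.util.Arrays;
import java.util.Collection;

import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.ResourceScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.junit.Before;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class TestRMRestart {
  private final Class<? extends ResourceScheduler> schedulerClass;
  private YarnConfiguration conf;

  public TestRMRestart(Class<? extends ResourceScheduler> schedulerClass) {
    this.schedulerClass = schedulerClass;
  }

  @Parameters
  public static Collection<Object[]> schedulers() {
    // CS and FS only, per the comment above; add FifoScheduler.class here if
    // all three schedulers should be exercised.
    return Arrays.asList(new Object[][] {
        {CapacityScheduler.class}, {FairScheduler.class}});
  }

  @Before
  public void setup() {
    conf = new YarnConfiguration();
    conf.setClass(YarnConfiguration.RM_SCHEDULER, schedulerClass,
        ResourceScheduler.class);
  }

  // ...existing test methods unchanged; they pick up 'conf' from setup().
}
{code}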
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157524#comment-14157524 ] Craig Welch commented on YARN-1198: --- FYI, it's not possible to call getAndCalculateHeadroom because nothing can synchronize on the queue during the allocation call without deadlocking - this is why it's necessary to break out the headroom calculation the way it is done here and store some items (such as the LeafQueue.User, which comes from the user manager and syncs on the queue) to avoid any synchronization on the queue itself during the final headroom calculation in the allocate/getHeadroom step. It's not a bad thing to do anyway, since it reduces the number of operations (somewhat) in that final headroom calculation - but it is also why we can't just call getAndCalculateHeadroom as such (unchanged) in allocate(). Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.10.patch, YARN-1198.11-with-1857.patch, YARN-1198.11.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch, YARN-1198.8.patch, YARN-1198.9.patch Today headroom calculation (for the app) takes place only when * A new node is added/removed from the cluster * A new container is getting assigned to the application. However there are potentially a lot of situations which are not considered in this calculation * If a container finishes then the headroom for that application will change and the AM should be notified accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly, if a container is assigned to either application (app1/app2) then both AMs should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to the same user and submitted in the same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible..) * Also, when an admin user refreshes the queue, headroom has to be updated. These are all potential bugs in the headroom calculations -- This message was sent by Atlassian JIRA (v6.3.4#6332)
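A rough sketch of the pattern being described: capture the inputs of the headroom formula while the scheduler already holds the queue lock (during assignment), so that the later allocate()/getHeadroom() step never synchronizes on the LeafQueue again. Class and field names here are illustrative assumptions, not the patch itself.
{code}
import org.apache.hadoop.yarn.api.records.Resource;

class HeadroomSnapshot {
  private final Resource userLimit;    // snapshot from LeafQueue.User at assign time
  private final Resource queueMaxCap;
  private final Resource userConsumed;

  HeadroomSnapshot(Resource userLimit, Resource queueMaxCap, Resource userConsumed) {
    this.userLimit = userLimit;
    this.queueMaxCap = queueMaxCap;
    this.userConsumed = userConsumed;
  }

  // headroom = min(userLimit, queueMaxCap) - userConsumed, floored at zero,
  // computed without touching (or locking) the queue.
  Resource getHeadroom() {
    int mem = Math.max(0,
        Math.min(userLimit.getMemory(), queueMaxCap.getMemory())
            - userConsumed.getMemory());
    int vcores = Math.max(0,
        Math.min(userLimit.getVirtualCores(), queueMaxCap.getVirtualCores())
            - userConsumed.getVirtualCores());
    return Resource.newInstance(mem, vcores);
  }
}
{code}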
[jira] [Created] (YARN-2641) improve node decommission latency in RM.
zhihai xu created YARN-2641: --- Summary: improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu improve node decommission latency in RM. Currently, node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of on nodeHeartbeat (ResourceTrackerService). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
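A sketch of the idea in this JIRA (not a committed change): when the exclude list is refreshed, push a DECOMMISSION event for each newly excluded running node immediately instead of waiting for its next heartbeat to be rejected by ResourceTrackerService. The helper class name is made up; the RM types used are the existing ones.
{code}
import java.util.Map;

import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.NodesListManager;
import org.apache.hadoop.yarn.server.resourcemanager.RMContext;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEvent;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType;

class EagerDecommissioner {
  // Walk the currently tracked nodes and decommission any that are no longer
  // valid according to the refreshed include/exclude lists.
  static void decommissionExcludedNodes(RMContext rmContext,
      NodesListManager nodesListManager) {
    for (Map.Entry<NodeId, RMNode> entry : rmContext.getRMNodes().entrySet()) {
      if (!nodesListManager.isValidNode(entry.getValue().getHostName())) {
        rmContext.getDispatcher().getEventHandler().handle(
            new RMNodeEvent(entry.getKey(), RMNodeEventType.DECOMMISSION));
      }
    }
  }
}
{code}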
[jira] [Commented] (YARN-2640) TestDirectoryCollection.testCreateDirectories failed
[ https://issues.apache.org/jira/browse/YARN-2640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157570#comment-14157570 ] Jun Gong commented on YARN-2640: [~ozawa], thank you for telling me. Close it now. TestDirectoryCollection.testCreateDirectories failed Key: YARN-2640 URL: https://issues.apache.org/jira/browse/YARN-2640 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-2640.2.patch, YARN-2640.patch When running test mvn test -Dtest=TestDirectoryCollection, it failed: {code} Running org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.538 sec FAILURE! - in org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) Time elapsed: 0.969 sec FAILURE! java.lang.AssertionError: local dir parent not created with proper permissions expected:rwxr-xr-x but was:rwxrwxr-x at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:104) {code} I found it was because testDiskSpaceUtilizationLimit ran before testCreateDirectories when running test, then directory dirA was created in test function testDiskSpaceUtilizationLimit. When testCreateDirectories tried to create dirA with specified permission, it found dirA has already been there and it did nothing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong resolved YARN-2612. Resolution: Duplicate Some completed containers are not reported to NM Key: YARN-2612 URL: https://issues.apache.org/jira/browse/YARN-2612 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Jun Gong Fix For: 2.6.0 We are testing RM work preserving restart and found the following logs when we ran a simple MapReduce PI job. Some completed containers that were already pulled by the AM were never reported back to the NM, so the NM continuously reported the completed containers even though the AM had finished. {code} 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... 2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {code} In YARN-1372, the NM will report completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate, which means that the AM does not ack the RM, the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce task 'pi': 1) The RM sends completed containers to the AM. After receiving them, the AM thinks it has done its work and does not need resources, so it does not call allocate. 2) When the AM finishes, it could not ack to the RM because the AM itself has not finished yet. We think that when RMAppAttempt calls BaseFinalTransition, it means the AppAttempt has finished, so the RM could send this AppAttempt's completed containers to the NM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2635) TestRMRestart should run with all schedulers
[ https://issues.apache.org/jira/browse/YARN-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14157579#comment-14157579 ] Hadoop QA commented on YARN-2635: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672709/yarn-2635-3.patch against trunk revision 054f285. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5247//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5247//console This message is automatically generated. TestRMRestart should run with all schedulers Key: YARN-2635 URL: https://issues.apache.org/jira/browse/YARN-2635 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2635-1.patch, YARN-2635-2.patch, yarn-2635-3.patch If we change the scheduler from Capacity Scheduler to Fair Scheduler, the TestRMRestart would fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)