[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049665#comment-14049665 ] Hadoop QA commented on YARN-2241: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653540/YARN-2241.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4174//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4174//console This message is automatically generated. ZKRMStateStore: On startup, show nicer messages when znodes already exist - Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor Attachments: YARN-2241.patch, YARN-2241.patch When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
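For context on the fix discussed above, here is a minimal sketch of how NodeExistsException handling on startup might look. This is an illustration, not the attached YARN-2241.patch; the helper name createIfAbsent and the zkClient and LOG fields are assumptions.
{code}
private void createIfAbsent(String path, byte[] data, List<ACL> acl,
    CreateMode mode) throws Exception {
  try {
    zkClient.create(path, data, acl, mode);
  } catch (KeeperException.NodeExistsException e) {
    // Expected on RM restart: the znode survives from the previous run,
    // so log a short informational message instead of a full stack trace.
    LOG.info(path + " znode already exists, skipping creation");
  }
}
{code}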
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049758#comment-14049758 ] Steve Loughran commented on YARN-2139: -- looks like a good first draft # unless this does address HDFS, call out that this is local disk IO # this is really disk IO bandwidth, so it should use an option like vlocaldiskIObandwidth. This will avoid confusion (make clear it's not HDFS), and add scope for the addition of future options: IOPs and actual allocation of entire disks to containers # what's the testability of this feature? Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2246) Job History Link in RM UI is redirecting to the URL which contains Job Id twice
[ https://issues.apache.org/jira/browse/YARN-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2246: -- Component/s: webapp Job History Link in RM UI is redirecting to the URL which contains Job Id twice --- Key: YARN-2246 URL: https://issues.apache.org/jira/browse/YARN-2246 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 3.0.0, 0.23.11, 2.5.0 Reporter: Devaraj K Assignee: Devaraj K Attachments: MAPREDUCE-4064-1.patch, MAPREDUCE-4064.patch {code:xml} http://xx.x.x.x:19888/jobhistory/job/job_1332435449546_0001/jobhistory/job/job_1332435449546_0001 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (YARN-2246) Job History Link in RM UI is redirecting to the URL which contains Job Id twice
[ https://issues.apache.org/jira/browse/YARN-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen moved MAPREDUCE-4064 to YARN-2246: -- Component/s: (was: mrv2) Affects Version/s: (was: 0.23.1) 2.5.0 0.23.11 3.0.0 Key: YARN-2246 (was: MAPREDUCE-4064) Project: Hadoop YARN (was: Hadoop Map/Reduce) Job History Link in RM UI is redirecting to the URL which contains Job Id twice --- Key: YARN-2246 URL: https://issues.apache.org/jira/browse/YARN-2246 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 0.23.11, 2.5.0 Reporter: Devaraj K Assignee: Devaraj K Attachments: MAPREDUCE-4064-1.patch, MAPREDUCE-4064.patch {code:xml} http://xx.x.x.x:19888/jobhistory/job/job_1332435449546_0001/jobhistory/job/job_1332435449546_0001 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2246) Job History Link in RM UI is redirecting to the URL which contains Job Id twice
[ https://issues.apache.org/jira/browse/YARN-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049813#comment-14049813 ] Zhijie Shen commented on YARN-2246: --- Moved the ticket to YARN, as the root cause sounds like a YARN issue. Job History Link in RM UI is redirecting to the URL which contains Job Id twice --- Key: YARN-2246 URL: https://issues.apache.org/jira/browse/YARN-2246 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 3.0.0, 0.23.11, 2.5.0 Reporter: Devaraj K Assignee: Devaraj K Attachments: MAPREDUCE-4064-1.patch, MAPREDUCE-4064.patch {code:xml} http://xx.x.x.x:19888/jobhistory/job/job_1332435449546_0001/jobhistory/job/job_1332435449546_0001 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2233: Attachment: apache-yarn-2233.1.patch {quote} 1. bq. It should be noted that when cancelling a token, the token to be cancelled is specified by setting a header. Any reason for specifying the token in a header? If there's something non-intuitive, maybe we should have some in-code comments for other developers? {quote} I've added comments to the code explaining why this is. Jetty doesn't allow request bodies for DELETE methods. {quote} 2. RPC get delegation token API doesn't have these fields, but it seems to be nice to have. We may want to file a Jira. {noformat} +long currentExpiration = ident.getIssueDate() + tokenRenewInterval; +long maxValidity = ident.getMaxDate(); {noformat} {quote} Fixed this. I've left the fields out for now to match the RPC response. I'll file tickets to add the information to both interfaces. {quote} 3. Is it possible to reuse KerberosTestUtils in hadoop-auth? {quote} I missed this. hadoop-auth doesn't export test jars for us to use. I've changed the pom.xml to start generating test-jars for hadoop-auth and used KerberosTestUtils from there. {quote} 4. Is this supposed to test invalid request body? It doesn't look like the invalid body construction in the later tests. {noformat} +response = +resource().path(ws).path(v1).path(cluster) + .path("delegation-token").accept(contentType) + .entity(dtoken, mediaType).post(ClientResponse.class); +assertEquals(Status.BAD_REQUEST, response.getClientResponseStatus()); {noformat} {quote} This is actually a test with the renewer missing from the request body, hence the BAD_REQUEST. {quote} 1. No need of == true. {noformat} +if (usePrincipal == true) { {noformat} Similarly, {noformat} +if (KerberosAuthenticationHandler.TYPE.equals(authType) == false) { {noformat} {quote} Fixed. {quote} 2. If I remember it correctly, callerUGI.doAs will throw UndeclaredThrowableException, which wraps the real raised exception. However, UndeclaredThrowableException is an RE, so this code cannot capture it. {noformat} +try { + resp = + callerUGI +.doAs(new PrivilegedExceptionAction<GetDelegationTokenResponse>() { + @Override + public GetDelegationTokenResponse run() throws IOException, + YarnException { +GetDelegationTokenRequest createReq = +GetDelegationTokenRequest.newInstance(renewer); +return rm.getClientRMService().getDelegationToken(createReq); + } +}); +} catch (Exception e) { + LOG.info("Create delegation token request failed", e); + throw e; +} {noformat} {quote} I'm unsure about this. RE is a sub-class of Exception. Why won't this code work? {quote} 3. Cannot return respToken simply? The framework should generate OK status automatically, right? {noformat} +return Response.status(Status.OK).entity(respToken).build(); {noformat} {quote} There are a few cases where we need to send a FORBIDDEN response back and the GenericExceptionHandler doesn't return FORBIDDEN responses. {quote} 4. You can call tk.decodeIdentifier directly. {noformat} +RMDelegationTokenIdentifier ident = new RMDelegationTokenIdentifier(); +ByteArrayInputStream buf = new ByteArrayInputStream(tk.getIdentifier()); +DataInputStream in = new DataInputStream(buf); +ident.readFields(in); {noformat} {quote} Fixed. Thanks for this, cleaned up a bunch of boilerplate code.
Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
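To illustrate review item 4 in the message above (calling Token#decodeIdentifier() instead of deserializing by hand), here is a hedged before/after sketch; the variable tk is assumed to be a Token<RMDelegationTokenIdentifier>.
{code}
// Before: decode the token identifier by hand.
RMDelegationTokenIdentifier ident = new RMDelegationTokenIdentifier();
ByteArrayInputStream buf = new ByteArrayInputStream(tk.getIdentifier());
DataInputStream in = new DataInputStream(buf);
ident.readFields(in);

// After: Token#decodeIdentifier() performs the same deserialization in one call.
RMDelegationTokenIdentifier ident2 = tk.decodeIdentifier();
{code}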
[jira] [Assigned] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
[ https://issues.apache.org/jira/browse/YARN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K reassigned YARN-2243: --- Assignee: Devaraj K Good catch [~yuzhih...@gmail.com]. Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Devaraj K Priority: Minor {code} public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, String user, Queue queue, ActiveUsersManager activeUsersManager, RMContext rmContext) { Preconditions.checkNotNull("RMContext should not be null", rmContext); {code} Order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
[ https://issues.apache.org/jira/browse/YARN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-2243: Attachment: YARN-2243.patch Attaching trivial patch to fix this issue. Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Devaraj K Priority: Minor Attachments: YARN-2243.patch {code} public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, String user, Queue queue, ActiveUsersManager activeUsersManager, RMContext rmContext) { Preconditions.checkNotNull("RMContext should not be null", rmContext); {code} Order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.2#6252)
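For readers following along, the likely shape of the fix, sketched from the Guava signature Preconditions.checkNotNull(T reference, Object errorMessage); this is an illustration, not a copy of the attached YARN-2243.patch.
{code}
// Buggy: the message literal is passed as the reference to check, so a null
// rmContext never triggers the NullPointerException.
Preconditions.checkNotNull("RMContext should not be null", rmContext);

// Fixed: the reference to check comes first, the error message second.
Preconditions.checkNotNull(rmContext, "RMContext should not be null");
{code}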
[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049836#comment-14049836 ] Hudson commented on YARN-2204: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #601 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/601/]) YARN-2204. Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607231) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler --- Key: YARN-2204 URL: https://issues.apache.org/jira/browse/YARN-2204 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2204.patch, YARN-2204_addendum.patch, YARN-2204_addendum.patch TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049833#comment-14049833 ] Hudson commented on YARN-2022: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #601 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/601/]) YARN-2022 Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy (Sunil G via mayank) (mayank: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607227) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
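Working through the numbers in the scenario above (a rough reading, assuming both queues are at 50% of the 16 GB cluster):
{noformat}
Guaranteed capacity : Queue A = Queue B = 8 GB
Queue A usage       : J1 = 2 + 4x1 = 6 GB, J2 = 2 + 4x1 = 6 GB, J3 = 2 + 2x1 = 4 GB (16 GB total)
Queue B demand (J4) : 2 GB AM + 2 x 1 GB maps = 4 GB
To be preempted     : ~4 GB from Queue A
Before this change  : J3 is killed entirely, including its 2 GB AM
After this change   : 1 GB map containers are preempted from J3/J2 instead, and the AMs keep running
{noformat}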
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049834#comment-14049834 ] Hudson commented on YARN-1713: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #601 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/601/]) YARN-1713. Added get-new-app and submit-app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607216) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ContainerLaunchContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CredentialsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LocalResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NewApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Implement getnewapplication and submitapp as part of RM web service --- Key: YARN-1713 URL: https://issues.apache.org/jira/browse/YARN-1713 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Fix For: 2.5.0 Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, apache-yarn-1713.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049839#comment-14049839 ] Tsuyoshi OZAWA commented on YARN-2242: -- [~zjshen], thank you for the notification. I agree with [~gtCarrera] - these patches can be helpful for users. I also checked the patches and confirmed that they are independent of each other, so we can work on them separately. [~djp], could you also review YARN-2013? Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch Now, each time the AM container crashes during launch, both the console and the web UI only report a ShellExitCodeException. This is not only unhelpful, but sometimes confusing. With the help of the log aggregator, container logs are actually aggregated and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
[ https://issues.apache.org/jira/browse/YARN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049838#comment-14049838 ] Hadoop QA commented on YARN-2243: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653564/YARN-2243.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4175//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4175//console This message is automatically generated. Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Devaraj K Priority: Minor Attachments: YARN-2243.patch {code} public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, String user, Queue queue, ActiveUsersManager activeUsersManager, RMContext rmContext) { Preconditions.checkNotNull(RMContext should not be null, rmContext); {code} Order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2013: - Attachment: YARN-2013.3-2.patch Attached the same patch to confirm that it doesn't have any conflicts. The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception and puts the exception stack into the diagnostics. Therefore, the exception stack is always the same. {code} String diagnostics = "Exception from container-launch: \n" + StringUtils.stringifyException(e) + "\n" + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has an empty message as there's no message from stderr. Hence the diagnostics is not of much use for users to analyze the reason of the container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2013) The diagnostics is always the ExitCodeException stack when the container crashes
[ https://issues.apache.org/jira/browse/YARN-2013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049861#comment-14049861 ] Hadoop QA commented on YARN-2013: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653571/YARN-2013.3-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4176//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4176//console This message is automatically generated. The diagnostics is always the ExitCodeException stack when the container crashes Key: YARN-2013 URL: https://issues.apache.org/jira/browse/YARN-2013 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Zhijie Shen Assignee: Tsuyoshi OZAWA Attachments: YARN-2013.1.patch, YARN-2013.2.patch, YARN-2013.3-2.patch, YARN-2013.3.patch When a container crashes, ExitCodeException will be thrown from Shell. Default/LinuxContainerExecutor captures the exception, put the exception stack into the diagnostic. Therefore, the exception stack is always the same. {code} String diagnostics = Exception from container-launch: \n + StringUtils.stringifyException(e) + \n + shExec.getOutput(); container.handle(new ContainerDiagnosticsUpdateEvent(containerId, diagnostics)); {code} In addition, it seems that the exception always has a empty message as there's no message from stderr. Hence the diagnostics is not of much use for users to analyze the reason of container crash. -- This message was sent by Atlassian JIRA (v6.2#6252)
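To make the discussion above concrete, here is a minimal sketch (not the attached YARN-2013 patch) of how the diagnostics string could surface more than the generic stack; it assumes the existing e (the ExitCodeException), shExec, and containerId variables from the executor code quoted in the description.
{code}
StringBuilder diagnostics =
    new StringBuilder("Exception from container-launch.\n");
// Surface the exit code and any captured shell output instead of only the
// stringified exception, which is always the same stack trace.
diagnostics.append("Exit code: ").append(e.getExitCode()).append("\n");
if (!shExec.getOutput().isEmpty()) {
  diagnostics.append("Shell output: ").append(shExec.getOutput()).append("\n");
}
container.handle(
    new ContainerDiagnosticsUpdateEvent(containerId, diagnostics.toString()));
{code}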
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049864#comment-14049864 ] Sunil G commented on YARN-2022: --- Thank you Mayank, Vinod and Wangda Tan for the reviews. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049906#comment-14049906 ] Hudson commented on YARN-2022: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1819 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1819/]) YARN-2022 Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy (Sunil G via mayank) (mayank: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607227) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049907#comment-14049907 ] Hudson commented on YARN-1713: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1819 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1819/]) YARN-1713. Added get-new-app and submit-app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607216) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ContainerLaunchContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CredentialsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LocalResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NewApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Implement getnewapplication and submitapp as part of RM web service --- Key: YARN-1713 URL: https://issues.apache.org/jira/browse/YARN-1713 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Fix For: 2.5.0 Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, apache-yarn-1713.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049909#comment-14049909 ] Hudson commented on YARN-2204: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1819 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1819/]) YARN-2204. Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607231) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler --- Key: YARN-2204 URL: https://issues.apache.org/jira/browse/YARN-2204 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2204.patch, YARN-2204_addendum.patch, YARN-2204_addendum.patch TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2243) Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor
[ https://issues.apache.org/jira/browse/YARN-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049959#comment-14049959 ] Ted Yu commented on YARN-2243: -- Thanks for taking care of this. Order of arguments for Preconditions.checkNotNull() is wrong in SchedulerApplicationAttempt ctor Key: YARN-2243 URL: https://issues.apache.org/jira/browse/YARN-2243 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Devaraj K Priority: Minor Attachments: YARN-2243.patch {code} public SchedulerApplicationAttempt(ApplicationAttemptId applicationAttemptId, String user, Queue queue, ActiveUsersManager activeUsersManager, RMContext rmContext) { Preconditions.checkNotNull("RMContext should not be null", rmContext); {code} Order of arguments is wrong for Preconditions.checkNotNull(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049974#comment-14049974 ] Hudson commented on YARN-2022: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1792 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1792/]) YARN-2022 Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy (Sunil G via mayank) (mayank: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607227) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/ProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/capacity/TestProportionalCapacityPreemptionPolicy.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049977#comment-14049977 ] Hudson commented on YARN-2204: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1792 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1792/]) YARN-2204. Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow. (Anubhav Dhoot via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607231) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitor.java TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler --- Key: YARN-2204 URL: https://issues.apache.org/jira/browse/YARN-2204 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2204.patch, YARN-2204_addendum.patch, YARN-2204_addendum.patch TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1713) Implement getnewapplication and submitapp as part of RM web service
[ https://issues.apache.org/jira/browse/YARN-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049975#comment-14049975 ] Hudson commented on YARN-1713: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1792 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1792/]) YARN-1713. Added get-new-app and submit-app functionality to RM web services. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607216) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/GenericExceptionHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ApplicationSubmissionContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ContainerLaunchContextInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/CredentialsInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/LocalResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/NewApplication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/ResourceInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesAppsModification.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm Implement getnewapplication and submitapp as part of RM web service --- Key: YARN-1713 URL: https://issues.apache.org/jira/browse/YARN-1713 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Fix For: 2.5.0 Attachments: apache-yarn-1713.10.patch, apache-yarn-1713.3.patch, apache-yarn-1713.4.patch, apache-yarn-1713.5.patch, apache-yarn-1713.6.patch, apache-yarn-1713.7.patch, apache-yarn-1713.8.patch, apache-yarn-1713.9.patch, apache-yarn-1713.cumulative.2.patch, apache-yarn-1713.cumulative.3.patch, apache-yarn-1713.cumulative.4.patch, apache-yarn-1713.cumulative.patch, apache-yarn-1713.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050042#comment-14050042 ] Zhijie Shen commented on YARN-2233: --- Almost good to me. Just some nits: 1. This won't happen inside renewDelegationToken, as it is already validated before. {code} +if (tokenData.getToken().isEmpty()) { + throw new BadRequestException(Empty token in request); +} {code} 2. It seems that some of the fields in DelegationToken are no longer necessary. 3. assertValidToken seems not to be necessary. Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-2065: - Attachment: YARN-2065-003.patch Correction: this is the version that compiles against trunk. I've tested this on the Slider minicluster test that kills a container while the AM is down and verified that HBase comes back up to the right # of nodes ... this patch fixes it. AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably - it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050260#comment-14050260 ] Zhijie Shen commented on YARN-675: -- Sure, reassign it to you In YarnClient, pull AM logs on AM container failure --- Key: YARN-675 URL: https://issues.apache.org/jira/browse/YARN-675 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Zhijie Shen Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to pull its logs from the NM to the client so that they can be displayed immediately to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-675) In YarnClient, pull AM logs on AM container failure
[ https://issues.apache.org/jira/browse/YARN-675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-675: - Assignee: Li Lu (was: Zhijie Shen) In YarnClient, pull AM logs on AM container failure --- Key: YARN-675 URL: https://issues.apache.org/jira/browse/YARN-675 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Li Lu Similar to MAPREDUCE-4362, when an AM container fails, it would be helpful to pull its logs from the NM to the client so that they can be displayed immediately to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050270#comment-14050270 ] Hadoop QA commented on YARN-2233: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653563/apache-yarn-2233.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-auth hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4177//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4177//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4177//console This message is automatically generated. Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050272#comment-14050272 ] Hadoop QA commented on YARN-2065: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653606/YARN-2065-003.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4178//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4178//console This message is automatically generated. AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050286#comment-14050286 ] Jian He commented on YARN-2065: --- thanks for the testing, Steve! AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.1.patch RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2208: Attachment: YARN-2208.2.patch AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.1.patch ApplicationMaster needs to find a way to update the AMRMToken periodically -- Key: YARN-2212 URL: https://issues.apache.org/jira/browse/YARN-2212 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2212.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2237) MRAppMaster changes for AMRMToken roll-up
[ https://issues.apache.org/jira/browse/YARN-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2237: Attachment: YARN-2237.1.patch MRAppMaster changes for AMRMToken roll-up - Key: YARN-2237 URL: https://issues.apache.org/jira/browse/YARN-2237 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2237.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050301#comment-14050301 ] Wei Yan commented on YARN-2139: --- Thanks for the comments, [~ste...@apache.org]. For the HDFS read/write problem you mentioned, we leave that to the network part, since we also need to handle the HDFS replication traffic. I agree that we should avoid confusion with HDFS ''fs''. The idea of vdisks follows vcores, where each physical cpu core is measured as some number of vcores. One concern about using real numbers is that users cannot specify their task requirements easily. One way to solve that is to provide several levels (low, moderate, high, etc.) instead of real numbers. This is also similar to the discussion in YARN-1024 on how to measure cpu capacity. We can define how many IOPs/how much bandwidth maps to 1 vdisk. For testability, currently I have: (1) For fairshare, start several tasks with the same operations, put them on a single node, and check whether their I/O performance follows fair sharing; (2) For I/O performance isolation of a given task, in a fully loaded cluster, replay the given task several times and verify that its I/O performance is stable. Here the task does lots of local disk reads and direct writes, so most of its time is spent on I/O. Any good testing ideas? Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050380#comment-14050380 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653615/YARN-2208.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.TestRMAppAttemptTransitions org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4179//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4179//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4179//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2208: Attachment: YARN-2208.3.patch Fix findbugs and testcase failures AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.2.patch RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens -- Key: YARN-2211 URL: https://issues.apache.org/jira/browse/YARN-2211 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2211.1.patch, YARN-2211.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050521#comment-14050521 ] Steve Loughran commented on YARN-2065: -- With Jenkins happy, I'm +1 on this patch; it fixes what it says it does AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2139) Add support for disk IO isolation/scheduling for containers
[ https://issues.apache.org/jira/browse/YARN-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050530#comment-14050530 ] Karthik Kambatla commented on YARN-2139: bq. this is really disk io bandwidth, so it should use an option, like vlocaldiskIObandwidth. This will avoid confusion (make clear its not HDFS), and add scope for the addition of future options: IOPs and actual allocation of entire disks to containers Good point. The document should probably discuss this in more detail. I think we should separate out the resource model used for requests and scheduling from the way we enforce it. For the former, I believe vdisks is a good candidate. Users find it hard to specify disk IO requirements in terms of IOPS and bandwidth; e.g. my MR task *needs* 200 MBps. vdisks, on the other hand, represent a share of the node and the IO parallelism (in a somewhat vague sense) the task can make use of. Furthermore, it is hard to guarantee a particular bandwidth or performance as they depend on the amount of parallelism and degree of randomness the disk accesses have. That said, I see value in making the enforcement pluggable. This JIRA could add the cgroups-based disk-share enforcement. In the future, we could explore other options. Add support for disk IO isolation/scheduling for containers --- Key: YARN-2139 URL: https://issues.apache.org/jira/browse/YARN-2139 Project: Hadoop YARN Issue Type: New Feature Reporter: Wei Yan Assignee: Wei Yan Attachments: Disk_IO_Scheduling_Design_1.pdf -- This message was sent by Atlassian JIRA (v6.2#6252)
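For illustration of what a cgroups-based disk-share enforcement could look like, here is a minimal sketch that maps a container's vdisks share onto the cgroup v1 blkio.weight knob. The cgroup mount path, the helper class, and the vdisk-to-weight scaling are all assumptions for the example, not the YARN-2139 design.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: scale a container's vdisk share of the node into the
// blkio.weight range (100-1000) and write it into an assumed per-container
// cgroup. Real enforcement in YARN would go through the container executor.
public class BlkioShareEnforcer {
  // Assumed cgroup v1 mount point for YARN containers.
  private static final Path BLKIO_ROOT = Paths.get("/sys/fs/cgroup/blkio/hadoop-yarn");

  public void setShare(String containerId, int vdisks, int vdisksPerNode) throws IOException {
    int weight = Math.max(100, Math.min(1000, (1000 * vdisks) / Math.max(1, vdisksPerNode)));
    Path cgroup = BLKIO_ROOT.resolve(containerId);
    Files.createDirectories(cgroup);
    Files.write(cgroup.resolve("blkio.weight"), Integer.toString(weight).getBytes());
    // The container's processes would then be attached by writing their PIDs
    // into cgroup.procs; omitted here.
  }
}
{code}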
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050533#comment-14050533 ] Anubhav Dhoot commented on YARN-2175: - We have seen it happen when the source file system had issues. Some jobs would intermittently take a long time to fail and would succeed on rerun because the jars were put in a new distributed cache location when rerun. Without this timeout we have no lever to mitigate underlying HDFS/hardware issues in production until the root cause is identified and fixed. Also, in comparison with mapreduce.task.timeout, this seems very focused on a specific operation - localization. I would expect this timeout to be defaulted to a large value in production (say 30 min) and used only as mitigation when an issue occurs in production. Container localization has no timeouts and tasks can be stuck there for a long time --- Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we can open others if we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050539#comment-14050539 ] Karthik Kambatla commented on YARN-2175: I'll let Anubhav provide details on the particular instance where we ran into this. bq. We should try to address the right individual problem with its solution before we put a band-aid that may still be useful for issues that we cannot just address directly if any. Having worked on several MR1 production issues, I see your point. I agree we should look into and address individual problems. That said, I also believe in failsafes to avoid bringing down a production cluster or failing a critical job altogether in the face of hardware issues. That gives us time to fix the individual issues correctly when we encounter them, instead of hurrying for a hot fix. Container localization has no timeouts and tasks can be stuck there for a long time --- Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we can open others if we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050543#comment-14050543 ] Karthik Kambatla commented on YARN-2175: In MR1, mapred.task.timeout handles localization as well and that has worked very well for our customers. Should we do the same for MR2 as well? Container localization has no timeouts and tasks can be stuck there for a long time --- Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization for example could take a long time and there is no automated way to kill an task if its stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for various states within the NodeManager to limit various states. The RM does not care about most of these and its only between AM and the NM. We can start by making these global configurable defaults and in future we can make it fancier by letting AM override them in the start container request. This jira will be used to limit localization time and we can open others if we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
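Whatever form the final patch takes, the lever being discussed is simple: arm a timer when localization starts and kill the container if it has not finished when the timer fires. A minimal sketch, assuming a hypothetical config key and a caller-supplied kill action (neither is defined by YARN-2175 at this point):
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch only: a timeout guard around localization. The property name and
// the kill callback are hypothetical, not part of any committed patch.
public class LocalizationTimeoutGuard {
  // Hypothetical key; the discussed default is a large value such as 30 minutes.
  static final String LOCALIZATION_TIMEOUT_MS = "yarn.nodemanager.localizer.timeout-ms";

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  // Schedule a kill of the container unless localization finishes first.
  public ScheduledFuture<?> arm(long timeoutMs, Runnable killContainer) {
    return scheduler.schedule(killContainer, timeoutMs, TimeUnit.MILLISECONDS);
  }

  // Called when localization completes; cancels the pending kill.
  public void disarm(ScheduledFuture<?> pendingKill) {
    pendingKill.cancel(false);
  }
}
{code}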
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050556#comment-14050556 ] Varun Vasudev commented on YARN-2233: - The test case failure is due to YARN-2232, which fixes a bug that one of the test cases relies on. Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2065) AM cannot create new containers after restart-NM token from previous attempt used
[ https://issues.apache.org/jira/browse/YARN-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050562#comment-14050562 ] Hudson commented on YARN-2065: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5808 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5808/]) YARN-2065 AM cannot create new containers after restart (stevel: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607441) * /hadoop/common/trunk * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java AM cannot create new containers after restart-NM token from previous attempt used - Key: YARN-2065 URL: https://issues.apache.org/jira/browse/YARN-2065 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Steve Loughran Assignee: Jian He Fix For: 2.5.0 Attachments: YARN-2065-002.patch, YARN-2065-003.patch, YARN-2065.1.patch Slider AM Restart failing (SLIDER-34). The AM comes back up, but it cannot create new containers. The Slider minicluster test {{TestKilledAM}} can replicate this reliably -it kills the AM, then kills a container while the AM is down, which triggers a reallocation of a container, leading to this failure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2233: Attachment: apache-yarn-2233.2.patch {quote} 1. This won't happen inside renewDelegationToken, as it is already validated before. {noformat} +if (tokenData.getToken().isEmpty()) { + throw new BadRequestException("Empty token in request"); +} {noformat} 2. It seems that some of the fields in DelegationToken are no longer necessary. 3. assertValidToken seems not to be necessary. {quote} Fixed all 3. I also fixed the Findbugs warnings that were caused. Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, apache-yarn-2233.2.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050572#comment-14050572 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653638/YARN-2208.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4180//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4180//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2240) yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read
[ https://issues.apache.org/jira/browse/YARN-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050580#comment-14050580 ] Mit Desai commented on YARN-2240: - Aggregated Logs Comment [~vinodkv], here is the error on which it fails. {noformat} 2014-06-10 22:06:34,940 [LogAggregationService #1922] ERROR logaggregation.AggregatedLogFormat: Error aggregating log file. Log file : /grid/0/tmp/yarn-logs/application_1401475649625_135179/container_1401475649625_135179_01_01/history.txt.appattempt_1401475649625_135179_01/grid/0/tmp/yarn-logs/application_1401475649625_135179/container_1401475649625_135179_01_01/history.txt.appattempt_1401475649625_135179_01 (Permission denied) java.io.FileNotFoundException: /grid/0/tmp/yarn-logs/application_1401475649625_135179/container_1401475649625_135179_01_01/history.txt.appattempt_1401475649625_135179_01 (Permission denied) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:138) at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:215) at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:204) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogValue.write(AggregatedLogFormat.java:196) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogWriter.append(AggregatedLogFormat.java:311) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainer(AppLogAggregatorImpl.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:166) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:140) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$2.run(LogAggregationService.java:354) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) {noformat} I managed to get into the logs and found that the length for the logs it was reporting was 111K and the corrupted aggregated would read something like this. The portion of the aggregated logs where there is the problem is here. {noformat} [...] LogType: history.txt.appattempt_1401475649625_135179_01 LogLength: 111686 Log Contents: Error aggregating log file. Log file : /grid/0/tmp/yarn-logs/application_1401475649625_135179/container_1401475649625_135179_01_01/history.txt.appattempt_1401475649625_135179_01/grid/0/tmp/yarn-logs/application_1401475649625_135179/container_1401475649625_135179_01_01/history.txt.appattempt_1401475649625_135179_01 (Permission denied)stderr0!stderr_dag_1401475649625_135179_10stderr_dag_1401475649625_135179_1_post0stdout0!stdout_dag_1401475649625_135179_10stdout_dag_1401475649625_135179_1_post0syslog102042014-06-10 22:05:58,519 INFO [main] org.apache.tez.dag.app.DAGAppMaster: Created DAGAppMaster for application appattempt_1401475649625_135179_01 [...] {noformat} yarn logs can get corrupted if the aggregator does not have permissions to the log file it tries to read Key: YARN-2240 URL: https://issues.apache.org/jira/browse/YARN-2240 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai When the log aggregator is aggregating the logs, it writes the file length first. 
Then it tries to open the log file, and if it does not have permission to do so, it ends up just writing an error message to the aggregated logs. The mismatch between the declared file length and the actual length makes the aggregated logs corrupted. -- This message was sent by Atlassian JIRA (v6.2#6252)
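To make the corruption mechanism concrete, a minimal sketch of the write-length-then-copy pattern follows; the class and stream handling are illustrative assumptions, not the actual AggregatedLogFormat code:
{code:java}
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: the declared length is written before the file is opened, so
// a later open/read failure leaves fewer bytes than promised and readers of
// the aggregated file go out of sync.
class NaiveLogAppender {
  void append(DataOutputStream out, File logFile) throws IOException {
    out.writeLong(logFile.length());                       // length written up front
    try (InputStream in = new FileInputStream(logFile)) {  // may fail: Permission denied
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
    } catch (IOException e) {
      // An error string does not pad out to the declared length, hence the
      // mismatch described in this JIRA.
      out.writeBytes("Error aggregating log file: " + e.getMessage());
    }
  }
}
{code}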
[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages when znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050621#comment-14050621 ] Karthik Kambatla commented on YARN-2241: +1 ZKRMStateStore: On startup, show nicer messages when znodes already exist - Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor Attachments: YARN-2241.patch, YARN-2241.patch When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
Varun Vasudev created YARN-2247: --- Summary: Allow RM web services users to authenticate using delegation tokens Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2247: Attachment: apache-yarn-2247.0.patch Allow RM web services users to authenticate using delegation tokens --- Key: YARN-2247 URL: https://issues.apache.org/jira/browse/YARN-2247 Project: Hadoop YARN Issue Type: Sub-task Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2247.0.patch The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2241) ZKRMStateStore: On startup, show nicer messages if znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2241: --- Summary: ZKRMStateStore: On startup, show nicer messages if znodes already exist (was: ZKRMStateStore: On startup, show nicer messages when znodes already exist) ZKRMStateStore: On startup, show nicer messages if znodes already exist --- Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor Attachments: YARN-2241.patch, YARN-2241.patch When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2241) ZKRMStateStore: On startup, show nicer messages if znodes already exist
[ https://issues.apache.org/jira/browse/YARN-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050695#comment-14050695 ] Hudson commented on YARN-2241: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5811 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5811/]) YARN-2241. ZKRMStateStore: On startup, show nicer messages if znodes already exist. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607473) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java ZKRMStateStore: On startup, show nicer messages if znodes already exist --- Key: YARN-2241 URL: https://issues.apache.org/jira/browse/YARN-2241 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Minor Fix For: 2.5.0 Attachments: YARN-2241.patch, YARN-2241.patch When using the RMZKStateStore, if you restart the RM, you get a bunch of stack traces with messages like {{org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /rmstore}}. This is expected as these nodes already exist from before. We should catch these and print nicer messages. -- This message was sent by Atlassian JIRA (v6.2#6252)
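The shape of the change is straightforward; a minimal sketch (not the committed ZKRMStateStore code) of catching NodeExistsException during startup and logging a friendlier message:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: on RM restart the store znodes already exist, so treat
// NodeExistsException as expected rather than printing a stack trace.
class ZnodeBootstrap {
  private static final Logger LOG = LoggerFactory.getLogger(ZnodeBootstrap.class);

  void createIfAbsent(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
    try {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      LOG.info(path + " znode already exists, skipping creation");
    }
  }
}
{code}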
[jira] [Created] (YARN-2248) Capacity Scheduler changes for moving apps between queues
Janos Matyas created YARN-2248: -- Summary: Capacity Scheduler changes for moving apps between queues Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Priority: Minor We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050729#comment-14050729 ] Vinod Kumar Vavilapalli commented on YARN-2248: --- Do you mind attaching a patch against latest YARN trunk? Thanks.. Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Priority: Minor We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2248) Capacity Scheduler changes for moving apps between queues
[ https://issues.apache.org/jira/browse/YARN-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-2248: - Assignee: Janos Matyas [~matyix], tx for opening this. Assigning it to you.. Capacity Scheduler changes for moving apps between queues - Key: YARN-2248 URL: https://issues.apache.org/jira/browse/YARN-2248 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Janos Matyas Assignee: Janos Matyas Priority: Minor We would like to have the capability (same as the Fair Scheduler has) to move applications between queues. We have made a baseline implementation and tests to start with - and we would like the community to review, come up with suggestions and finally have this contributed. The current implementation is available for 2.4.1 - so the first thing is that we'd need to identify the target version as there are differences between 2.4.* and 3.* interfaces. The story behind is available at http://blog.sequenceiq.com/blog/2014/07/02/move-applications-between-queues/ and the baseline implementation and test at: https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/ExtendedCapacityScheduler.java#L924 https://github.com/sequenceiq/hadoop-common/blob/branch-2.4.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/a/TestExtendedCapacitySchedulerAppMove.java -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2232) ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode
[ https://issues.apache.org/jira/browse/YARN-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050738#comment-14050738 ] Vinod Kumar Vavilapalli commented on YARN-2232: --- Looks good, +1. Checking this in.. ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode - Key: YARN-2232 URL: https://issues.apache.org/jira/browse/YARN-2232 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2232.0.patch, apache-yarn-2232.1.patch, apache-yarn-2232.2.patch The ClientRMService doesn't allow delegation token owners to cancel their own tokens. The root cause is this piece of code from the cancelDelegationToken function - {noformat} String user = getRenewerForToken(token); ... private String getRenewerForToken(Token<RMDelegationTokenIdentifier> token) throws IOException { UserGroupInformation user = UserGroupInformation.getCurrentUser(); UserGroupInformation loginUser = UserGroupInformation.getLoginUser(); // we can always renew our own tokens return loginUser.getUserName().equals(user.getUserName()) ? token.decodeIdentifier().getRenewer().toString() : user.getShortUserName(); } {noformat} It ends up passing the user short name to the cancelToken function whereas AbstractDelegationTokenSecretManager::cancelToken expects the full user name. This bug occurs in secure mode and is not an issue with simple auth. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2232) ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode
[ https://issues.apache.org/jira/browse/YARN-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050753#comment-14050753 ] Hudson commented on YARN-2232: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5812 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5812/]) YARN-2232. Fixed ResourceManager to allow DelegationToken owners to be able to cancel their own tokens in secure mode. Contributed by Varun Vasudev. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607484) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMService.java ClientRMService doesn't allow delegation token owner to cancel their own token in secure mode - Key: YARN-2232 URL: https://issues.apache.org/jira/browse/YARN-2232 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.5.0 Attachments: apache-yarn-2232.0.patch, apache-yarn-2232.1.patch, apache-yarn-2232.2.patch The ClientRMService doesn't allow delegation token owners to cancel their own tokens. The root cause is this piece of code from the cancelDelegationToken function - {noformat} String user = getRenewerForToken(token); ... private String getRenewerForToken(Token<RMDelegationTokenIdentifier> token) throws IOException { UserGroupInformation user = UserGroupInformation.getCurrentUser(); UserGroupInformation loginUser = UserGroupInformation.getLoginUser(); // we can always renew our own tokens return loginUser.getUserName().equals(user.getUserName()) ? token.decodeIdentifier().getRenewer().toString() : user.getShortUserName(); } {noformat} It ends up passing the user short name to the cancelToken function whereas AbstractDelegationTokenSecretManager::cancelToken expects the full user name. This bug occurs in secure mode and is not an issue with simple auth. -- This message was sent by Atlassian JIRA (v6.2#6252)
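For readers following along, the mismatch described above can be addressed by returning the caller's full user name instead of the short name; a sketch of that idea (not necessarily identical to the committed apache-yarn-2232 patch):
{code:java}
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.client.RMDelegationTokenIdentifier;

// Sketch only: return the full user name (e.g. user/host@REALM) so that
// AbstractDelegationTokenSecretManager#cancelToken's owner check matches
// in secure mode.
class RenewerResolution {
  static String getRenewerForToken(Token<RMDelegationTokenIdentifier> token) throws IOException {
    UserGroupInformation user = UserGroupInformation.getCurrentUser();
    UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
    // The RM itself may cancel/renew on behalf of the original renewer.
    if (loginUser.getUserName().equals(user.getUserName())) {
      return token.decodeIdentifier().getRenewer().toString();
    }
    // Full name rather than the short name fixes the secure-mode mismatch.
    return user.getUserName();
  }
}
{code}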
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070115.patch Hi [~djp], in this new patch I modified an existing unit test, which tests allocation-time AM crashes, to verify that the correct information has been added to the diagnostic information. I added the checks in a new private method that can be reused in the future to verify more diagnostic information. For now, I'm checking that the diagnostics contain the proxy URL and point users to the logs. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070115-1.patch Better interface design for the private verification function on diagnostic information in the unit test. Now this function only takes the diagnostic info in. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050784#comment-14050784 ] Hudson commented on YARN-2022: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5813 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5813/]) YARN-2022. Fixing CHANGES.txt to be correctly placed. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1607486) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
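The selection rule this JIRA asks for, preempting AM containers only as a last resort, amounts to ordering the candidate containers before killing any. A sketch with a stand-in Candidate type (not the actual ProportionalCapacityPreemptionPolicy code):
{code:java}
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch only: order preemption candidates so non-AM containers are taken
// first, and among those prefer the most recently started ones.
class AmLastOrdering {
  static final class Candidate {
    final boolean isAmContainer;
    final long creationTime;
    Candidate(boolean isAmContainer, long creationTime) {
      this.isAmContainer = isAmContainer;
      this.creationTime = creationTime;
    }
  }

  static void orderForPreemption(List<Candidate> candidates) {
    Collections.sort(candidates, new Comparator<Candidate>() {
      @Override
      public int compare(Candidate a, Candidate b) {
        if (a.isAmContainer != b.isAmContainer) {
          return a.isAmContainer ? 1 : -1;               // AM containers go last
        }
        return Long.compare(b.creationTime, a.creationTime);  // youngest first
      }
    });
  }
}
{code}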
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050826#comment-14050826 ] Hadoop QA commented on YARN-2242: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653699/YARN-2242-070115.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4181//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4181//console This message is automatically generated. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050838#comment-14050838 ] Hadoop QA commented on YARN-2242: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653703/YARN-2242-070115-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4182//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4182//console This message is automatically generated. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: (was: YARN-2229.3.patch) Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
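For reference, the 32-bit layout described in this issue (10 epoch bits above 22 sequence bits) can be illustrated as below; this is a worked example of the format, not the YARN-2052/YARN-2229 code, and it shows why the epoch overflows after 1024 RM restarts:
{code:java}
// Worked example of the current 32-bit container id layout: 10 epoch bits,
// 22 sequence bits. The epoch wraps after 2^10 = 1024 RM restarts, which is
// the motivation for widening the id to a long.
class ContainerIdBits {
  static final int EPOCH_BITS = 10;
  static final int SEQ_BITS = 22;
  static final int SEQ_MASK = (1 << SEQ_BITS) - 1;   // 0x3FFFFF

  static int pack(int epoch, int sequence) {
    return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
  }

  static int epochOf(int id) {
    return id >>> SEQ_BITS;
  }

  static int sequenceOf(int id) {
    return id & SEQ_MASK;
  }
}
{code}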
[jira] [Created] (YARN-2249) AM resync may send container release request before container is actually recovered
Jian He created YARN-2249: - Summary: AM resync may send container release request before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He AM resync on RM restart will send outstanding resource requests, container release list etc. back to the new RM. It is possible that container release request is processed before the container is actually recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2249) AM resync may send container release request before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Description: AM resync on RM restart will send outstanding resource requests, container release list etc. back to the new RM. It is possible that RM receives the container release request before the container is actually recovered.(was: AM resync on RM restart will send outstanding resource requests, container release list etc. back to the new RM. It is possible that container release request is processed before the container is actually recovered. ) AM resync may send container release request before container is actually recovered --- Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He AM resync on RM restart will send outstanding resource requests, container release list etc. back to the new RM. It is possible that RM receives the container release request before the container is actually recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2249) RM may receive container release request on AM resync before container is actually recovered
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Summary: RM may receive container release request on AM resync before container is actually recovered (was: AM resync may send container release request before container is actually recovered) RM may receive container release request on AM resync before container is actually recovered Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He AM resync on RM restart will send outstanding resource requests, container release list etc. back to the new RM. It is possible that RM receives the container release request before the container is actually recovered. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2069: Attachment: YARN-2069-trunk-3.patch Rebasing and Updating the patch. Thanks, Mayank Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050913#comment-14050913 ] Jian He commented on YARN-1366: --- looks good, +1 AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.3.patch Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
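To make the packed format in the description concrete, here is a minimal sketch of the 32-bit layout it describes: upper 10 bits for the RM restart epoch, lower 22 bits for the container sequence number. The class and method names are illustrative only (this is not Hadoop's actual ContainerId implementation); the sketch also makes the 1024-restart overflow concern visible.
{code}
// Illustrative sketch, not Hadoop's ContainerId: pack the RM restart epoch
// into the upper 10 bits and the container sequence number into the lower
// 22 bits of a 32-bit id, as described for YARN-2052.
public final class PackedContainerId {
  private static final int SEQ_BITS = 22;
  private static final int SEQ_MASK = (1 << SEQ_BITS) - 1;   // 0x3FFFFF
  private static final int EPOCH_MASK = (1 << 10) - 1;       // 0x3FF

  private final int id;

  public PackedContainerId(int epoch, int sequenceNumber) {
    // Only 10 bits are kept for the epoch, so it wraps after 2^10 = 1024
    // RM restarts -- the overflow concern that motivates a long id.
    this.id = ((epoch & EPOCH_MASK) << SEQ_BITS) | (sequenceNumber & SEQ_MASK);
  }

  public int getEpoch()          { return id >>> SEQ_BITS; }
  public int getSequenceNumber() { return id & SEQ_MASK; }
  public int getId()             { return id; }
}
{code}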
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050933#comment-14050933 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653729/YARN-2229.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4184//console This message is automatically generated. Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) Add cross-user preemption within CapacityScheduler's leaf-queue
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050947#comment-14050947 ] Hadoop QA commented on YARN-2069: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653724/YARN-2069-trunk-3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4183//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4183//console This message is automatically generated. Add cross-user preemption within CapacityScheduler's leaf-queue --- Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch Preemption today only works across queues and moves around resources across queues per demand and usage. We should also have user-level preemption within a queue, to balance capacity across users in a predictable manner. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050961#comment-14050961 ] Tsuyoshi OZAWA commented on YARN-2229: -- [~jianhe], can you check whether the trunk code with the v3 patch compiles or not? It works on my local machine. It might be a Jenkins CI problem. Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050965#comment-14050965 ] Bikas Saha commented on YARN-1366: -- Why are we returning the old allocateResponse to the user? What is the user expected to do with this allocateResponse that has a RESYNC command in it? Should we make a second call to allocate (after re-registering) and then send that response back up to the user? {code}+// re register with RM +registerApplicationMaster(); +return allocateResponse; + }{code} There needs to be some clear documentation that if the user has not removed container requests that have already been satisfied, then the re-register may end up sending the entire ask list to the RM (including matched requests). Which would mean the RM could end up giving it a lot of new allocated containers. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
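For illustration, here is a hedged sketch of the alternative raised above: on seeing the RESYNC command, re-register and then issue a second allocate so the caller receives a fresh response instead of the one carrying the command. The helpers doAllocate and registerApplicationMaster are placeholders rather than the actual AMRMClient internals in the patch, and the sketch assumes the AllocateResponse#getAMCommand / AMCommand.AM_RESYNC API.
{code}
// Sketch only; doAllocate and registerApplicationMaster are placeholder helpers.
AllocateResponse allocate(float progress) throws Exception {
  AllocateResponse response = doAllocate(progress);
  if (response.getAMCommand() == AMCommand.AM_RESYNC) {
    // The RM has restarted and lost this AM's state: re-register first.
    registerApplicationMaster();
    // Then allocate again so the caller never sees the internal RESYNC
    // response. Caveat from the comment above: if already-satisfied
    // container requests were not removed, the full ask list is re-sent
    // and the RM may hand out many new containers.
    response = doAllocate(progress);
  }
  return response;
}
{code}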
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050975#comment-14050975 ] Jian He commented on YARN-1366: --- bq. Should we make a second call to allocate (after re-registering) and then send that response back up to the user? Follow-up JIRA YARN-2209 will replace the resync command with a proper exception, in which case the returned allocate response should be null. bq. There needs to be some clear documentation Makes sense. Rohith, can you add some documentation about this? Thanks! AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050977#comment-14050977 ] Jian He commented on YARN-1366: --- bq. in which case the returned allocate response should be null I meant an empty response.. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050978#comment-14050978 ] Bikas Saha commented on YARN-1366: -- Does a null response make sense for the user? AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050990#comment-14050990 ] Jian He commented on YARN-2229: --- it failed with a "cannot find symbol" error for the method Long.compare(this.getEpoch(), other.getEpoch()); Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051013#comment-14051013 ] Rohith commented on YARN-1366: -- bq. can you add some documentation about this Shall I add it in the javadoc for AMRMClient#allocate()? If it belongs in the information guide, could you please guide me on how to add the documentation? AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051019#comment-14051019 ] Junping Du commented on YARN-2242: -- Patch looks good to me overall. One minor fix: a "." is missing at the end of the diagnostics message. +1. Will commit it with this minor fix. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051045#comment-14051045 ] Tsuyoshi OZAWA commented on YARN-2229: -- Thank you, Jian. I found that Long.compare(long x, long y) was introduced in JDK 1.7. I'll update the patch not to use that method. Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
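For reference, Long.compare(long, long) only exists from JDK 7 onward, so on JDK 6 the comparison has to be written out by hand. A minimal sketch (the helper name compareLongs is illustrative, not necessarily what the updated patch uses):
{code}
// JDK 6-compatible equivalent of JDK 7's Long.compare(x, y):
// returns a negative value, zero, or a positive value.
public static int compareLongs(long x, long y) {
  return (x < y) ? -1 : ((x == y) ? 0 : 1);
}

// Usage mirroring the failing call above:
// int result = compareLongs(this.getEpoch(), other.getEpoch());
{code}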
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051049#comment-14051049 ] Jian He commented on YARN-1366: --- Adding javadoc for AMRMClient#allocate() should be enough. bq. Does a null response make sense for the user? Doing one more allocate makes the response more consistent. AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2229) Making ContainerId long type
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.4.patch Updated a patch not to use Long.compare(long, long). Making ContainerId long type Key: YARN-2229 URL: https://issues.apache.org/jira/browse/YARN-2229 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2229.1.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch On YARN-2052, we changed containerId format: upper 10 bits are for epoch, lower 22 bits are for sequence number of Ids. This is for preserving semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM restarts 1024 times. To avoid the problem, its better to make containerId long. We need to define the new format of container Id with preserving backward compatibility on this JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2233) Implement web services to create, renew and cancel delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051058#comment-14051058 ] Hadoop QA commented on YARN-2233: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12653661/apache-yarn-2233.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-auth hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4185//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4185//console This message is automatically generated. Implement web services to create, renew and cancel delegation tokens Key: YARN-2233 URL: https://issues.apache.org/jira/browse/YARN-2233 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Priority: Blocker Attachments: apache-yarn-2233.0.patch, apache-yarn-2233.1.patch, apache-yarn-2233.2.patch Implement functionality to create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051068#comment-14051068 ] Li Lu commented on YARN-2242: - Thanks [~djp]! Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051066#comment-14051066 ] Junping Du commented on YARN-2242: -- bq. Zhijie Shen, thank you for the notification. I agree with Li Lu - these patches can be helpful for users. And, I checked the patch and confirmed that these patches are independent of each other. Therefore, we can work separately. Junping Du, could you also review YARN-2013? Sure. [~ozawa], will look at YARN-2013 later. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2242) Improve exception information on AM launch crashes
[ https://issues.apache.org/jira/browse/YARN-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2242: Attachment: YARN-2242-070115-2.patch Add the missing period at the end of the diagnostic information. Improve exception information on AM launch crashes -- Key: YARN-2242 URL: https://issues.apache.org/jira/browse/YARN-2242 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2242-070114-1.patch, YARN-2242-070114.patch, YARN-2242-070115-1.patch, YARN-2242-070115-2.patch, YARN-2242-070115.patch Now on each time AM Container crashes during launch, both the console and the webpage UI only report a ShellExitCodeExecption. This is not only unhelpful, but sometimes confusing. With the help of log aggregator, container logs are actually aggregated, and can be very helpful for debugging. One possible way to improve the whole process is to send a pointer to the aggregated logs to the programmer when reporting exception information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2062: --- Target Version/s: 2.6.0 (was: 2.5.0) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1354: --- Target Version/s: 2.6.0 (was: 2.5.0) Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1354-v1.patch, YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch The set of active applications in the nodemanager context need to be recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1342: --- Target Version/s: 2.6.0 (was: 2.5.0) Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1342.patch, YARN-1342v2.patch, YARN-1342v3-and-YARN-1987.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2110) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2110: --- Target Version/s: 2.6.0 (was: 2.5.0) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler --- Key: YARN-2110 URL: https://issues.apache.org/jira/browse/YARN-2110 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Chen He Labels: test Attachments: YARN-2110-v2.patch, YARN-2110.patch The TestAMRestart#testAMRestartWithExistingContainers does a cast to CapacityScheduler in a couple of places {code} ((CapacityScheduler) rm1.getResourceScheduler()) {code} If run with FairScheduler as default scheduler the test throws {code} java.lang.ClassCastException {code}. -- This message was sent by Atlassian JIRA (v6.2#6252)
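One common way to make such a test independent of the cluster's default scheduler is to pin the scheduler in the test's own configuration. The sketch below assumes the standard YarnConfiguration.RM_SCHEDULER key and a MockRM-style test setup; it is illustrative only and not necessarily the approach taken in the attached patches.
{code}
// Illustrative sketch: force CapacityScheduler for this test so the cast
// cannot fail when the cluster default happens to be FairScheduler.
YarnConfiguration conf = new YarnConfiguration();
conf.setClass(YarnConfiguration.RM_SCHEDULER,
    CapacityScheduler.class, ResourceScheduler.class);
MockRM rm1 = new MockRM(conf);
// ((CapacityScheduler) rm1.getResourceScheduler()) is now always valid.
{code}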
[jira] [Updated] (YARN-1106) The RM should point the tracking url to the RM app page if its empty
[ https://issues.apache.org/jira/browse/YARN-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1106: --- Target Version/s: 2.6.0 (was: 3.0.0, 2.5.0) The RM should point the tracking url to the RM app page if its empty Key: YARN-1106 URL: https://issues.apache.org/jira/browse/YARN-1106 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.0.0, 2.1.0-beta, 0.23.9 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1106.patch, YARN-1106.patch It would be nice if the Resourcemanager set the tracking url to the RM app page if the application master doesn't pass one or passes the empty string. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1408: --- Target Version/s: 2.6.0 (was: 2.5.0) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable=true, * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA to queue a which uses the full cluster capacity Step 2: Submit a jobB to queue b which would use less than 20% of the cluster capacity A JobA task which uses queue b capacity has been preempted and killed. This caused the following problem: 1. A new container was allocated for jobA in Queue A as per a node update from an NM. 2. This container was preempted immediately. Here the ACQUIRED at KILLED invalid state exception occurred when the next AM heartbeat reached the RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the task to time out after 30 minutes, as this container had already been killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
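For reference, the preemption setup listed in the report can also be expressed programmatically; the property keys below are exactly the ones given above (setting them in yarn-site.xml is equivalent).
{code}
// Enable the capacity preemption monitor with the keys from the report.
Configuration conf = new YarnConfiguration();
conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
conf.set("yarn.resourcemanager.scheduler.monitor.policies",
    "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
        + "ProportionalCapacityPreemptionPolicy");
{code}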
[jira] [Updated] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1856: --- Target Version/s: 2.6.0 (was: 2.5.0) cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1809) Synchronize RM and Generic History Service Web-UIs
[ https://issues.apache.org/jira/browse/YARN-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1809: --- Target Version/s: 2.6.0 (was: 2.5.0) Synchronize RM and Generic History Service Web-UIs -- Key: YARN-1809 URL: https://issues.apache.org/jira/browse/YARN-1809 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1809.1.patch, YARN-1809.2.patch, YARN-1809.3.patch, YARN-1809.4.patch, YARN-1809.5.patch, YARN-1809.5.patch, YARN-1809.6.patch, YARN-1809.7.patch, YARN-1809.8.patch, YARN-1809.9.patch After YARN-953, the web-UI of the generic history service provides more information than that of the RM, namely the details about app attempts and containers. It's good to provide similar web-UIs, but retrieve the data from separate sources, i.e., the RM cache and the history store respectively. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1844) yarn.log.server.url should have a default value
[ https://issues.apache.org/jira/browse/YARN-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1844: --- Target Version/s: 2.6.0 (was: 2.5.0) yarn.log.server.url should have a default value --- Key: YARN-1844 URL: https://issues.apache.org/jira/browse/YARN-1844 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Currently yarn.log.server.url must be configured properly by a user when log aggregation is enabled so that logs continue to be served from their original URL after they've been aggregated. It would be nice if a default value for this property could be provided that would work out of the box for at least simple cluster setups (i.e.: already point to the JHS or AHS accordingly). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1681) When banned.users is not set in LCE's container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS will receive unclear error message
[ https://issues.apache.org/jira/browse/YARN-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1681: --- Target Version/s: 2.6.0 (was: 2.5.0) When banned.users is not set in LCE's container-executor.cfg, submit job with user in DEFAULT_BANNED_USERS will receive unclear error message --- Key: YARN-1681 URL: https://issues.apache.org/jira/browse/YARN-1681 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.2.0 Reporter: Zhichun Wu Assignee: Zhichun Wu Priority: Minor Labels: container, usability Attachments: YARN-1681.patch When using LCE in a secure setup, if banned.users is not set in container-executor.cfg, submitting a job with a user in DEFAULT_BANNED_USERS (mapred, hdfs, bin, 0) will produce an unclear error message. For example, if we use hdfs to submit an MR job, we may see the following on the YARN app overview page: {code} appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: Application application_1391353981633_0003 initialization failed (exitCode=139) with output: {code} while the preferred error message would look like: {code} appattempt_1391353981633_0003_02 exited with exitCode: -1000 due to: Application application_1391353981633_0003 initialization failed (exitCode=139) with output: Requested user hdfs is banned {code} Just a minor bug, and I would like to start contributing to hadoop-common with it :) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2157) Document YARN metrics
[ https://issues.apache.org/jira/browse/YARN-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2157: --- Target Version/s: 2.6.0 (was: 2.5.0) Document YARN metrics - Key: YARN-2157 URL: https://issues.apache.org/jira/browse/YARN-2157 Project: Hadoop YARN Issue Type: Improvement Components: documentation Reporter: Akira AJISAKA Assignee: Akira AJISAKA Attachments: YARN-2157.2.patch, YARN-2157.patch YARN-side of HADOOP-6350. Add YARN metrics to Metrics document. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2162: --- Target Version/s: 2.6.0 (was: 2.5.0) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage Key: YARN-2162 URL: https://issues.apache.org/jira/browse/YARN-2162 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Ashwin Shankar Labels: scheduler minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2209: --- Target Version/s: 2.6.0 (was: 2.5.0) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate that the application should re-register on RM restart. We should do the same for the AMS#allocate call as well. -- This message was sent by Atlassian JIRA (v6.2#6252)