[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148834#comment-14148834 ] Zhijie Shen commented on YARN-2468:
---
The patch is generally good. Some minor comments and a few questions about the code.

1. Should the first method be @VisibleForTesting? And is the second one necessary?
{code}
-  private static String getNodeString(NodeId nodeId) {
+  public static String getNodeString(NodeId nodeId) {
     return nodeId.toString().replace(":", "_");
   }
-
+
+  public static String getNodeString(String nodeId) {
+    return nodeId.replace(":", "_");
+  }
{code}
2. Add a TODO saying the test will be fixed in a follow-up JIRA, in case we forget it?
{code}
+  @Ignore
   @Test
   public void testNoLogs() throws Exception {
{code}
3. Based on my understanding, uploadedFiles holds the candidate files to upload? If so, can we rename the variable and the related methods?
{code}
+  private Set<File> uploadedFiles = new HashSet<File>();
{code}
4. I assume this variable is meant to capture all the existing log files on HDFS, isn't it? If so, its computation seems problematic, because it doesn't exclude the files that should be excluded. And what's the effect on alreadyUploadedLogs?
{code}
+  private Set<String> allExistingFileMeta = new HashSet<String>();
{code}
{code}
  Iterable<String> mask =
      Iterables.filter(alreadyUploadedLogs, new Predicate<String>() {
        @Override
        public boolean apply(String next) {
          return currentExistingLogFiles.contains(next);
        }
      });
{code}
5. Make the old LogValue constructor delegate to the new one?
6. Does LogValue.write need to be changed at all?
7. It's recommended to close Closeable objects via IOUtils, but it seems AggregatedLogFormat already had this issue before this patch. Let's file a separate ticket for it.
{code}
+  if (this.fsDataOStream != null) {
+    this.fsDataOStream.close();
+  }
{code}
8. nodeId seems to be unused. No need to pass it into AppLogAggregatorImpl.
{code}
+  private final NodeId nodeId;
{code}
9. remoteNodeLogDirForApp doesn't affect remoteNodeTmpLogFileForApp, which depends only on remoteNodeLogFileForApp. Since remoteNodeLogFileForApp is determined at construction time, remoteNodeTmpLogFileForApp should stay final and be computed once in the constructor as well, and the constructor parameter remoteNodeLogDirForApp should be renamed back to remoteNodeLogFileForApp.
{code}
-  private final Path remoteNodeTmpLogFileForApp;
+  private Path remoteNodeTmpLogFileForApp;
{code}
{code}
-  private Path getRemoteNodeTmpLogFileForApp() {
+  private Path getRemoteNodeTmpLogFileForApp(Path remoteNodeLogDirForApp) {
     return new Path(remoteNodeLogFileForApp.getParent(),
-        (remoteNodeLogFileForApp.getName() + TMP_FILE_SUFFIX));
+        (remoteNodeLogFileForApp.getName() + LogAggregationUtils.TMP_FILE_SUFFIX));
   }
{code}
10. One typo:
{code}
  // if any of the previous uoloaded logs have been deleted,
{code}
11. One question: if a file fails to upload in LogValue.write(), uploadedFiles will not reflect the missing file; will it ever be uploaded again?
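For readers following comment 4: here is a plain-Java rendering of what the quoted Guava filter computes. This is an illustration only; the variable names (alreadyUploadedLogs, currentExistingLogFiles) come from the snippet above, and the surrounding class is invented.
{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class UploadedLogMask {
  // Keep only the previously uploaded log names that still exist on disk;
  // names deleted since the last upload cycle drop out of the mask.
  public static Set<String> maskDeleted(Set<String> alreadyUploadedLogs,
      Set<String> currentExistingLogFiles) {
    Set<String> mask = new HashSet<String>(alreadyUploadedLogs);
    mask.retainAll(currentExistingLogFiles);
    return mask;
  }

  public static void main(String[] args) {
    Set<String> uploaded = new HashSet<String>(Arrays.asList("syslog", "stderr"));
    Set<String> existing = new HashSet<String>(Arrays.asList("syslog"));
    System.out.println(maskDeleted(uploaded, existing)); // prints [syslog]
  }
}
{code}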
Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch Currently, when an application finishes, the NM starts log aggregation. But for long-running service (LRS) applications this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) are written into a single file, so the files could grow larger and larger.
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148904#comment-14148904 ] Tsuyoshi OZAWA commented on YARN-1458: -- [~kkambatl], would you mind backporting the patch to branch-2.5? It looks like a critical problem since 2.2.0. FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor thread blocks when clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid:
{code}
"ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
	- waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
	at java.lang.Thread.run(Thread.java:744)
……
"FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
	- locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
	at
java.lang.Thread.run(Thread.java:744)
{code}
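For context on the livelock itself: a simplified, self-contained model of the fair-share search that the blocked FairSchedulerUpdateThread is running. This is not the real ComputeFairShares code; it only illustrates why an all-zero weight set keeps the search from ever converging while the scheduler lock is held.
{code}
// Simplified model of the ComputeFairShares search (illustrative only).
// Fair shares are found by growing a weight-to-resource ratio r until
// sum(weight_i * r) reaches cluster capacity. With all-zero weights the
// sum stays 0 for every r, so the real search can never make progress.
public class ZeroWeightLivelock {
  static long resourceUsedWithRatio(double r, double[] weights) {
    long used = 0;
    for (double w : weights) {
      used += (long) (w * r);
    }
    return used;
  }

  public static void main(String[] args) {
    double[] weights = {0.0, 0.0};   // every runnable app has weight 0
    long capacity = 100;
    double r = 1.0;
    for (int i = 0; i < 64; i++) {   // the real loop has no such bound
      if (resourceUsedWithRatio(r, weights) >= capacity) {
        System.out.println("converged at r=" + r);
        return;
      }
      r *= 2;
    }
    System.out.println("no progress after 64 doublings: livelock");
  }
}
{code}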
[jira] [Created] (YARN-2612) Some completed containers are not reported to NM
hex108 created YARN-2612: Summary: Some completed containers are not reported to NM Key: YARN-2612 URL: https://issues.apache.org/jira/browse/YARN-2612 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: hex108 Fix For: 2.6.0 In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
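A minimal sketch of what solution 2 could look like, using generic Java types rather than the real RMContainer/ContainerStatus classes; the attached patch is the authoritative version.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeUpdateSketch {
  // stand-in for the scheduler's map of live containers
  private final Map<String, Object> liveContainers = new HashMap<String, Object>();

  // Returns the container ids to ack back to the NM in the heartbeat response.
  List<String> nodeUpdate(List<String> completedContainerIds) {
    List<String> acked = new ArrayList<String>();
    for (String id : completedContainerIds) {
      Object rmContainer = liveContainers.remove(id);
      if (rmContainer == null) {
        // The RM has already forgotten this container (e.g. the AM pulled
        // it); ack it anyway instead of just logging "Null container
        // completed", so the NM stops re-reporting it.
        acked.add(id);
        continue;
      }
      // ... normal completed-container handling would go here ...
      acked.add(id);
    }
    return acked;
  }

  public static void main(String[] args) {
    NodeUpdateSketch s = new NodeUpdateSketch();
    System.out.println(s.nodeUpdate(Arrays.asList("container_1", "container_2")));
  }
}
{code}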
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hex108 updated YARN-2612: --- Attachment: YARN-2612.patch
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hex108 updated YARN-2612: --- Description: In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148951#comment-14148951 ] Steve Loughran commented on YARN-913: - The distributed shell test failure *appears* to be YARN-2607, i.e. independent of this patch; HADOOP-10668 covers the TestZKFailoverControllerStress intermittent failure. {{TestSecureRMRegistryOperations}} fails in the setup phase: the setup of the registry path in a {{zookee...@example.com.doAs()}} clause fails with a permissions error, as if the first test case had set up the path without write access. More diagnostics are needed here, such as the identity of the user making the call; maybe start the test with some diagnostics of the path.
{code}
testAnonReadAccess(org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations) Time elapsed: 0.099 sec ERROR!
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.fs.PathAccessDeniedException: `/registry/users / [ 1, 'world,'anyone 31, 'sasl,'zookee...@example.com 31, 'sasl,'zookee...@example.com 31, 'sasl,'zookee...@example.com ]': Permission denied: KeeperErrorCode = NoAuth for /registry/users
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:113)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:688)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:672)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
	at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:668)
	at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:453)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:443)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:423)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
	at org.apache.hadoop.yarn.registry.client.services.zk.CuratorService.zkMkPath(CuratorService.java:539)
	at org.apache.hadoop.yarn.registry.client.services.zk.CuratorService.maybeCreate(CuratorService.java:426)
	at org.apache.hadoop.yarn.registry.server.services.RegistryAdminService.createRootRegistryPaths(RegistryAdminService.java:201)
	at org.apache.hadoop.yarn.registry.server.services.RegistryAdminService.serviceStart(RegistryAdminService.java:187)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:106)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations$1.run(TestSecureRMRegistryOperations.java:98)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1640)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations.startRMRegistryOperations(TestSecureRMRegistryOperations.java:97)
	at org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations.testAnonReadAccess(TestSecureRMRegistryOperations.java:130)
{code}
Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK- could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves.
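On the "more diagnostics needed" point above, here is a sketch of the kind of bootstrap-time dump that could identify the caller and the ACLs actually present on the registry path. It assumes an already-started CuratorFramework client; the class and method names are invented for illustration.
{code}
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.zookeeper.data.ACL;

public final class RegistryPathDiagnostics {
  private RegistryPathDiagnostics() {}

  // Log who we are and what ACLs the path carries before attempting the
  // privileged create, so NoAuth failures like the one above are explainable.
  public static void dump(CuratorFramework curator, String path)
      throws Exception {
    System.out.println("Current user: "
        + UserGroupInformation.getCurrentUser());
    List<ACL> acls = curator.getACL().forPath(path);
    for (ACL acl : acls) {
      System.out.println(path + " ACL: " + acl.getId()
          + " perms=" + acl.getPerms());
    }
  }
}
{code}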
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: YARN-2612.2.patch Also changed the Capacity and FIFO schedulers.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun Chen updated YARN-2612: --- Description: We are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce 'pi' job. Some completed containers that had already been pulled by the AM were never reported back to the NM, so the NM continuously reported those completed containers after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. To solve this problem, we have two options: 1) When RMAppAttempt calls FinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM. 2) In FairScheduler#nodeUpdate, if a completed container sent by the NM has no corresponding RMContainer, the RM just acks it back to the NM. We prefer solution 2 because it is clearer and more concise; however, the RM might ack the same completed containers to the NM multiple times.
[jira] [Commented] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148983#comment-14148983 ] Hadoop QA commented on YARN-2612: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671413/YARN-2612.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5143//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5143//console This message is automatically generated.
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149002#comment-14149002 ] Remus Rusanu commented on YARN-2198: The findbugs issue is HADOOP-11122. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, YARN-2198.separation.patch, YARN-2198.trunk.10.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates into running the entire NM as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low-privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed at high privilege. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC, etc. My proposal, though, is to use Windows LPC (Local Procedure Calls), a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils, which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication, and the privileged NT service can use the authorization API (AuthZ) to validate the caller.
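To make the proposed NM-to-NT-service boundary concrete, here is a hypothetical sketch of what the Java side of such a JNI binding might look like. Every name in it is invented for illustration; the real interface is defined by the patch.
{code}
public final class PrivilegedServiceClient {
  static {
    // assumed JNI library name; the LPC client code would live in native code
    System.loadLibrary("winutils");
  }

  private PrivilegedServiceClient() {}

  // Ask the privileged NT service, over LPC, to launch a container process
  // on behalf of the given user; the return value is a hypothetical status
  // code. Method name and parameters are invented for this sketch.
  public static native int launchContainer(String user, String command,
      String workDir);
}
{code}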
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.3.patch .3.patch is the port of YARN-2198.trunk.10.patch Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2 -- Key: YARN-2357 URL: https://issues.apache.org/jira/browse/YARN-2357 Project: Hadoop YARN Issue Type: Task Components: nodemanager Affects Versions: 2.4.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Labels: security, windows Attachments: YARN-2357.1.patch, YARN-2357.2.patch, YARN-2357.3.patch As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to trunk, they need to be backported to branch-2.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: (was: YARN-2612.2.patch)
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Attachment: (was: YARN-2612.patch)
[jira] [Commented] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149006#comment-14149006 ] Hadoop QA commented on YARN-2612: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671420/YARN-2612.2.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5144//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5144//console This message is automatically generated.
[jira] [Updated] (YARN-2612) Some completed containers are not reported to NM
[ https://issues.apache.org/jira/browse/YARN-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2612: --- Description: We are testing RM work-preserving restart and found the following logs when we ran a simple MapReduce 'pi' job. Some completed containers that had already been pulled by the AM were never reported back to the NM, so the NM continuously reported those completed containers after the AM had finished.
{code}
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:42,228 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:43,230 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
2014-09-26 17:00:44,233 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
{code}
In YARN-1372, the NM keeps reporting completed containers to the RM until it gets an ACK from the RM. If the AM does not call allocate (which means the AM never acks the RM), the RM will not ack the NM. We ([~chenchun]) have observed these two cases when running the MapReduce 'pi' job: 1) The RM sends completed containers to the AM; after receiving them, the AM considers its work done and needs no more resources, so it does not call allocate again. 2) When the AM finishes, it cannot ack the RM because the AM itself has not finished yet. We think that when RMAppAttempt calls BaseFinalTransition, the AppAttempt has finished, so the RM can send this AppAttempt's completed containers to the NM.
[jira] [Commented] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149029#comment-14149029 ] Devaraj K commented on YARN-2610: - It seems this has been done purposefully. [~rchiang], please have a look at the discussion in MAPREDUCE-2993. Hamlet doesn't close table tags --- Key: YARN-2610 URL: https://issues.apache.org/jira/browse/YARN-2610 Project: Hadoop YARN Issue Type: Bug Reporter: Ray Chiang Assignee: Ray Chiang Labels: supportability Attachments: YARN-2610-01.patch, YARN-2610-02.patch Revisiting a subset of MAPREDUCE-2993. The th, td, thead, tfoot, and tr tags are not configured to close properly in Hamlet. While this is allowed in HTML 4.01, missing closing table tags tend to wreak havoc with a lot of HTML processors (although not usually browsers).
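A small demonstration of the symptom described above: markup with unclosed table tags is tolerated by browsers but rejected outright by a strict XML-based processor. This is a generic illustration, not Hamlet code.
{code}
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class UnclosedTableTags {
  static void tryParse(String label, String html) {
    try {
      DocumentBuilder b =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      b.parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
      System.out.println(label + ": parsed OK");
    } catch (Exception e) {
      System.out.println(label + ": " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    tryParse("unclosed", "<table><tr><td>cell</table>");       // fails
    tryParse("closed", "<table><tr><td>cell</td></tr></table>"); // parses
  }
}
{code}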
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: (was: YARN-2357.3.patch)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.3.patch
[jira] [Issue Comment Deleted] (YARN-2610) Hamlet doesn't close table tags
[ https://issues.apache.org/jira/browse/YARN-2610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-2610: Comment: was deleted (was: It seems this has been done purposefully. [~rchiang] Please have look into the discussion in jira MAPREDUCE-2993. )
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149053#comment-14149053 ] Hudson commented on YARN-2608: -- FAILURE: Integrated in Hadoop-Yarn-trunk #692 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/692/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler* lock:
{code}
public void onReload(AllocationConfiguration queueInfo) {
  synchronized (FairScheduler.this) {
    ...
  }
}
{code}
After that, it requires the *QueueManager queues* lock:
{code}
private FSQueue getQueue(String name, boolean create, FSQueueType queueType) {
  name = ensureRootPrefix(name);
  synchronized (queues) {
    ...
  }
}
{code}
Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. That thread holds the *QueueManager queues* lock first, and then wants the *FairScheduler* lock because it calls FairScheduler.getClock() when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService* lock first and then waits for the *FairScheduler* lock. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler* lock first and then waits for the *AllocationFileLoaderService* lock. Deadlock may happen here.
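A minimal two-lock model of deadlock case 1, in generic Java rather than the real scheduler classes: each thread takes the two locks in the opposite order, which is the classic deadlock recipe; taking them in one consistent order, or not holding both at once, removes the hazard.
{code}
public class LockOrderDeadlock {
  private final Object schedulerLock = new Object();
  private final Object queuesLock = new Object();

  void onReload() {                  // mimics AllocationReloadListener.onReload()
    synchronized (schedulerLock) {
      pause();                       // widen the race window for the demo
      synchronized (queuesLock) { /* rebuild queues */ }
    }
  }

  void assignToQueue() {             // mimics FairScheduler.assignToQueue()
    synchronized (queuesLock) {
      pause();
      synchronized (schedulerLock) { /* getClock(), create FSLeafQueue */ }
    }
  }

  private static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }

  // Running this will usually hang; the hang is the deadlock being modeled.
  public static void main(String[] args) {
    LockOrderDeadlock d = new LockOrderDeadlock();
    new Thread(d::onReload).start();
    new Thread(d::assignToQueue).start();
  }
}
{code}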
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149051#comment-14149051 ] Hudson commented on YARN-2523: -- FAILURE: Integrated in Hadoop-Yarn-trunk #692 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/692/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field.
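One generic way to keep such a counter from going negative is to decrement it only when the node was actually recorded as decommissioned. The sketch below is illustrative only; it is not the committed patch, which per the commit above touches NodesListManager and RMNodeImpl.
{code}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DecommissionMetric {
  private final AtomicInteger decommissionedNMs = new AtomicInteger();
  private final Set<String> decommissionedHosts =
      ConcurrentHashMap.newKeySet();

  public void nodeDecommissioned(String host) {
    // count each host at most once
    if (decommissionedHosts.add(host)) {
      decommissionedNMs.incrementAndGet();
    }
  }

  public void nodeRejoined(String host) {
    // guarded decrement: a refreshNodes + restart sequence cannot
    // drive the counter below zero
    if (decommissionedHosts.remove(host)) {
      decommissionedNMs.decrementAndGet();
    }
  }

  public int decommissionedCount() {
    return decommissionedNMs.get();
  }
}
{code}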
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.15.patch Updated. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch
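For readers unfamiliar with these annotations: Hadoop's retry framework provides org.apache.hadoop.io.retry.Idempotent and org.apache.hadoop.io.retry.AtMostOnce for marking protocol methods so the RPC retry policy knows whether a call is safe to replay. The interface below only shows the shape of such a change; which annotation the patch applies to each ApplicationMasterProtocol method is determined by the patch itself.
{code}
import java.io.IOException;

import org.apache.hadoop.io.retry.AtMostOnce;
import org.apache.hadoop.io.retry.Idempotent;

// Illustrative interface; not ApplicationMasterProtocol itself.
public interface ExampleProtocol {
  @Idempotent
  String readSomething(String key) throws IOException;  // safe to retry freely

  @AtMostOnce
  void applySomething(String key) throws IOException;   // retried at most once
}
{code}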
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149114#comment-14149114 ] Hadoop QA commented on YARN-1879: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671436/YARN-1879.15.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5145//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5145//console This message is automatically generated.
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-011.patch Print out detailed diags (inc ACLs) on permissions problems during registry bootstrap Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149194#comment-14149194 ] Hudson commented on YARN-2608: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1883 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1883/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler's lock*: {code} public void onReload(AllocationConfiguration queueInfo) { synchronized (FairScheduler.this) { } } {code} after that, it requires the *QueueManager's queues lock*: {code} private FSQueue getQueue(String name, boolean create, FSQueueType queueType) { name = ensureRootPrefix(name); synchronized (queues) { } } {code} Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. This thread holds the *QueueManager's queues lock* first, and then tries to take the *FairScheduler's lock*, as it needs to call the FairScheduler.getClock() function when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's lock* first, and then waits for the *FairScheduler's lock*. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler's lock* first, and then waits for the *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
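To make the lock-order inversion concrete, here is a minimal, self-contained Java sketch; the class and field names are invented stand-ins for the FairScheduler monitor and QueueManager's queues map, not the actual YARN code:
{code}
// Hypothetical stand-ins: "scheduler" plays the role of the FairScheduler
// monitor, "queues" the role of QueueManager's queues map monitor.
public class LockOrderInversionDemo {
  private final Object scheduler = new Object();
  private final Object queues = new Object();

  // Reload path: scheduler lock first, then queues lock.
  void onReload() {
    synchronized (scheduler) {
      synchronized (queues) {
        // rebuild the queue configuration
      }
    }
  }

  // Submission path: queues lock first, then scheduler lock.
  void assignToQueue() {
    synchronized (queues) {
      synchronized (scheduler) {
        // e.g. read the scheduler clock while creating a leaf queue
      }
    }
  }
}
{code}
If one thread enters onReload() while another enters assignToQueue(), each can end up holding one monitor while waiting on the other. The standard cures are to acquire the locks in a single global order on both paths, or to shrink the inner critical section (for example, pass the clock value in rather than locking the scheduler to read it).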
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149192#comment-14149192 ] Hudson commented on YARN-2523: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1883 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1883/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2523) ResourceManager UI showing negative value for Decommissioned Nodes field
[ https://issues.apache.org/jira/browse/YARN-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149271#comment-14149271 ] Hudson commented on YARN-2523: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1908/]) YARN-2523. ResourceManager UI showing negative value for Decommissioned Nodes field. Contributed by Rohith (jlowe: rev 8269bfa613999f71767de3c0369817b58cfe1416) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java ResourceManager UI showing negative value for Decommissioned Nodes field -- Key: YARN-2523 URL: https://issues.apache.org/jira/browse/YARN-2523 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0 Reporter: Nishan Shetty Assignee: Rohith Fix For: 2.6.0 Attachments: YARN-2523.1.patch, YARN-2523.2.patch, YARN-2523.patch, YARN-2523.patch 1. Decommission one NodeManager by configuring its IP in the excludehost file 2. Remove the IP from the excludehost file 3. Execute the -refreshNodes command and restart the decommissioned NodeManager Observe that the RM UI shows a negative value for the Decommissioned Nodes field -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2608) FairScheduler: Potential deadlocks in loading alloc files and clock access
[ https://issues.apache.org/jira/browse/YARN-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149273#comment-14149273 ] Hudson commented on YARN-2608: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1908 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1908/]) YARN-2608. FairScheduler: Potential deadlocks in loading alloc files and clock access. (Wei Yan via kasha) (kasha: rev f4357240a6f81065d91d5f443ed8fc8cd2a14a8f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt FairScheduler: Potential deadlocks in loading alloc files and clock access -- Key: YARN-2608 URL: https://issues.apache.org/jira/browse/YARN-2608 Project: Hadoop YARN Issue Type: Bug Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.6.0 Attachments: YARN-2608-1.patch, YARN-2608-2.patch, YARN-2608-3.patch Two potential deadlocks exist inside the FairScheduler. 1. AllocationFileLoaderService reloads the queue configuration, which calls the FairScheduler.AllocationReloadListener.onReload() function and requires the *FairScheduler's lock*: {code} public void onReload(AllocationConfiguration queueInfo) { synchronized (FairScheduler.this) { } } {code} after that, it requires the *QueueManager's queues lock*: {code} private FSQueue getQueue(String name, boolean create, FSQueueType queueType) { name = ensureRootPrefix(name); synchronized (queues) { } } {code} Another thread, FairScheduler.assignToQueue, may also need to create a new queue when a new job is submitted. This thread holds the *QueueManager's queues lock* first, and then tries to take the *FairScheduler's lock*, as it needs to call the FairScheduler.getClock() function when creating a new FSLeafQueue. Deadlock may happen here. 2. The AllocationFileLoaderService holds the *AllocationFileLoaderService's lock* first, and then waits for the *FairScheduler's lock*. Another thread (like AdminService.refreshQueues) may call FairScheduler's reinitialize function, which holds the *FairScheduler's lock* first, and then waits for the *AllocationFileLoaderService's lock*. Deadlock may happen here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149310#comment-14149310 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671457/YARN-913-011.patch against trunk revision 662fc11. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 36 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1265 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5146//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-common.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5146//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5146//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. 
Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149565#comment-14149565 ] Thomas Graves commented on YARN-1769: - Thanks for the review Jason. I'll update the patch and remove some of the logging or make it truly debug. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
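To illustrate the proposed "swap a reservation for an allocation" idea, here is a self-contained sketch; all types and method names here are invented for illustration and are far simpler than the real CapacityScheduler classes:
{code}
// Illustrative only: when a heartbeating node has enough free space for a
// request that currently only holds a reservation elsewhere, allocate here
// and release the old reservation instead of skipping the node.
class ReservationSwapSketch {
  interface Node { int availableMb(); void allocate(int memoryMb); }
  interface Reservation { void unreserve(); }

  private Reservation current; // reservation held on some other node, may be null

  void onNodeHeartbeat(Node node, int requestedMb) {
    if (node.availableMb() >= requestedMb) {
      if (current != null) {
        current.unreserve(); // free the capacity reserved elsewhere
        current = null;
      }
      node.allocate(requestedMb); // satisfy the request immediately
    }
  }
}
{code}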
[jira] [Commented] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149580#comment-14149580 ] Subru Krishnan commented on YARN-2611: -- With the fixes included in the previously attached patch, YARN-1051 got an all [clear | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148765] from Jenkins. The only test case that fails, _TestMRCJCFileInputFormat_, is independent of this patch and is tracked in MAPREDUCE-6094. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch Patch that scopes down the YARN integration as mentioned above. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149664#comment-14149664 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671475/YARN-1964.patch against trunk revision a6049aa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5147//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5147//console This message is automatically generated. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149672#comment-14149672 ] Carlo Curino commented on YARN-2611: Minor: the equals() method for ReservationInterval could be simplified Other than that the patch looks good. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
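For reference, a simplified equals() along those lines might look like the sketch below. It assumes ReservationInterval is a plain pair of long fields (startTime, endTime); it is illustrative rather than the committed code, and hashCode must stay consistent with whatever fields equals compares:
{code}
@Override
public boolean equals(Object obj) {
  if (this == obj) {
    return true;
  }
  if (!(obj instanceof ReservationInterval)) {
    return false;
  }
  ReservationInterval other = (ReservationInterval) obj;
  // Two intervals are equal iff both endpoints match.
  return startTime == other.startTime && endTime == other.endTime;
}
{code}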
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Target Version/s: 2.6.0 Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2387: --- Priority: Blocker (was: Major) Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2387: Attachment: YARN-2387.patch Attaching the patch Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. {noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
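The shape of the fix is to serialize access to the shared proto/builder state. A minimal sketch of the idea follows; the method and field names are taken from the stack trace above, the bodies are abbreviated, and this is not the committed patch:
{code}
// Every method that reads or merges the shared builder state becomes
// synchronized, so a concurrent toString()/getProto() can no longer
// observe a half-merged builder. Bodies abbreviated for illustration.
public synchronized ContainerStatusProto getProto() {
  mergeLocalToProto();
  proto = viaProto ? proto : builder.build();
  viaProto = true;
  return proto;
}

private synchronized void mergeLocalToBuilder() {
  // copy locally cached fields (container id, state, diagnostics, ...)
  // into the protobuf builder
}
{code}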
[jira] [Commented] (YARN-1963) Support priorities across applications within the same queue
[ https://issues.apache.org/jira/browse/YARN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149696#comment-14149696 ] Sunil G commented on YARN-1963: --- Thank you [~maysamyabandeh] for providing us the use cases. 1. bq.use case seems to be mentioned in Item 3 of Section 1.5.3 Yes. Changing the priority of an application at runtime will help to overcome the scenario you mentioned. I will incorporate this by providing more scenarios and their impacts. 2. bq.priority can also be incorporated to the fair share calculation Application priority will be supported by both schedulers, and there are sub-JIRAs opened for the same; however, we can realign them w.r.t the same base design, and I will include changes from Fair as well. As of now, priority labels and the internal implementation will be common; however, separate ACL/per-queue priority-label configurations will be required at the scheduler level. In future, when both schedulers share the same config and common code, this can be pulled out as common code. For now, configurations and their specific implementations can be done separately for the two schedulers. Sub-JIRAs will be split accordingly. Support priorities across applications within the same queue - Key: YARN-1963 URL: https://issues.apache.org/jira/browse/YARN-1963 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Reporter: Arun C Murthy Assignee: Sunil G Attachments: YARN Application Priorities Design.pdf It will be very useful to support priorities among applications within the same queue, particularly in production scenarios. It allows for finer-grained controls without having to force admins to create a multitude of queues, plus allows existing applications to continue using existing queues which are usually part of institutional memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149701#comment-14149701 ] Karthik Kambatla commented on YARN-1458: I am open to backporting this to branch-2.5, but we don't have a 2.5.2 release planned yet. We should probably discuss 2.5.2 and the need for it on the dev lists. FairScheduler: Zero weight can lead to livelock --- Key: YARN-1458 URL: https://issues.apache.org/jira/browse/YARN-1458 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.2.0 Environment: Centos 2.6.18-238.19.1.el5 X86_64 hadoop2.2.0 Reporter: qingwu.fu Assignee: zhihai xu Labels: patch Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch Original Estimate: 408h Remaining Estimate: 408h The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submitted lots of jobs; it is not easy to reproduce. We ran the test cluster for days to reproduce it. The output of the jstack command on the resourcemanager pid (a sketch of a zero-weight guard follows the trace): {code} ResourceManager Event Processor prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x43aa9000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) - waiting to lock 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) at java.lang.Thread.run(Thread.java:744) …… FairSchedulerUpdateThread daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 runnable [0x433a2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) - locked 0x00070026b6e0 (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) at
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
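The livelock risk comes from computing fair shares when the active schedulables have zero total weight: no weight-to-resource ratio can distribute anything, so the search in ComputeFairShares makes no progress. A guard of the following shape avoids that; the snippet is illustrative (the Schedulable, getWeights, and setFairShare names mirror the stack trace above but are assumptions here), not the committed fix:
{code}
// Illustrative fragment: bail out of the share computation when the
// total weight is zero, instead of searching for a ratio that cannot
// converge.
double totalWeight = 0.0;
for (Schedulable s : schedulables) {
  totalWeight += s.getWeights().getWeight(type);
}
if (totalWeight <= 0.0) {
  for (Schedulable s : schedulables) {
    s.setFairShare(Resources.none()); // nothing can be distributed
  }
  return;
}
// ... fall through to the existing ratio search ...
{code}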
[jira] [Resolved] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan resolved YARN-2611. -- Resolution: Fixed Thanks [~curino] for reviewing the patch. The _ReservationInterval.equals()_ is autogenerated by eclipse. I just committed this to branch yarn-1051. Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2611.patch This JIRA is to fix jenkins findbugs warnings and test case failures for trunk merge patch as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1051: --- Attachment: socc14-paper15.pdf Pre-camera ready version of SoCC paper. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149727#comment-14149727 ] Hadoop QA commented on YARN-1051: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671498/socc14-paper15.pdf against trunk revision 55302cc. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5149//console This message is automatically generated. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, workflows, and helps for gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-2577: --- Assignee: Miklos Christine Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
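As an illustration of the convention being documented (this snippet is an example for this write-up, not part of the patch): an ACL value has the form "user1,user2 group1,group2", with a single space separating the user list from the group list, so a groups-only ACL starts with a leading space:
{code}
<!-- Example allocation-file snippet only. The leading space in
     aclSubmitApps means "no users, just the group admins";
     aclAdministerApps lists users, a space, then groups. -->
<queue name="prod">
  <aclSubmitApps> admins</aclSubmitApps>
  <aclAdministerApps>alice,bob admins</aclAdministerApps>
</queue>
{code}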
[jira] [Commented] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149732#comment-14149732 ] Allen Wittenauer commented on YARN-2577: +1 lgtm. Will commit to trunk and branch-2. Thanks! Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2387) Resource Manager crashes with NPE due to lack of synchronization
[ https://issues.apache.org/jira/browse/YARN-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149744#comment-14149744 ] Hadoop QA commented on YARN-2387: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671494/YARN-2387.patch against trunk revision 55302cc. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5148//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5148//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5148//console This message is automatically generated. Resource Manager crashes with NPE due to lack of synchronization Key: YARN-2387 URL: https://issues.apache.org/jira/browse/YARN-2387 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-2387.patch We recently came across a 0.23 RM crashing with an NPE. Here is the stacktrace for it. 
{noformat} 2014-08-06 05:56:52,165 [ResourceManager Event Processor] FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToBuilder(ContainerStatusPBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.mergeLocalToProto(ContainerStatusPBImpl.java:68) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:53) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerStatusPBImpl.getProto(ContainerStatusPBImpl.java:34) at org.apache.hadoop.yarn.api.records.ProtoBase.toString(ProtoBase.java:55) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.api.records.impl.pb.ContainerPBImpl.toString(ContainerPBImpl.java:353) at java.lang.String.valueOf(String.java:2854) at java.lang.StringBuilder.append(StringBuilder.java:128) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1405) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:790) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:688) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:82) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:339) at java.lang.Thread.run(Thread.java:722) 2014-08-06 05:56:52,166 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {noformat} On investigating the issue, we found that ContainerStatusPBImpl has methods that are called by different threads and are not synchronized. The 2.x code looks the same. We need to make these methods synchronized so that we do not encounter this problem in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2577) Clarify ACL delimiter and how to configure ACL groups only
[ https://issues.apache.org/jira/browse/YARN-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149747#comment-14149747 ] Hudson commented on YARN-2577: -- FAILURE: Integrated in Hadoop-trunk-Commit #6121 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6121/]) YARN-2577. Clarify ACL delimiter and how to configure ACL groups only (Mikos Christine via aw) (aw: rev ac70c27473251b389f32f4a33085d6a9ee3a0b3c) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm * hadoop-yarn-project/CHANGES.txt Clarify ACL delimiter and how to configure ACL groups only -- Key: YARN-2577 URL: https://issues.apache.org/jira/browse/YARN-2577 Project: Hadoop YARN Issue Type: Improvement Components: documentation, fairscheduler Affects Versions: 2.5.1 Reporter: Miklos Christine Assignee: Miklos Christine Priority: Trivial Labels: newbie Fix For: 2.6.0 Attachments: YARN-2577.patch Reading through the Fair Scheduler documentation, it would be great to explicitly state that the delimiter for the fair scheduler ACLs is the space character. If specifying only ACL groups, users should begin the value with the space character. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149814#comment-14149814 ] Zhijie Shen commented on YARN-2527: --- This is the code in ContainerLaunchContextPBImpl. It seems that the ACLs will never be null from the CLC. {code} public Map<ApplicationAccessType, String> getApplicationACLs() { initApplicationACLs(); return this.applicationACLS; } private void initApplicationACLs() { if (this.applicationACLS != null) { return; } ContainerLaunchContextProtoOrBuilder p = viaProto ? proto : builder; List<ApplicationACLMapProto> list = p.getApplicationACLsList(); this.applicationACLS = new HashMap<ApplicationAccessType, String>(list.size()); for (ApplicationACLMapProto aclProto : list) { this.applicationACLS.put(ProtoUtils.convertFromProtoFormat(aclProto.getAccessType()), aclProto.getAcl()); } } {code} I'm still thinking it may be a race condition: the app is already in RMContext, but its ACLs have not yet been put into ApplicationACLsManager. This needs to be confirmed by [~miguenther]. In any case, the NPE happens, and ApplicationACLsManager should be self-sufficient in handling the potential null case. Let's do the fix as suggested. Will review the patch and come back to you asap. NPE in ApplicationACLsManager - Key: YARN-2527 URL: https://issues.apache.org/jira/browse/YARN-2527 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Benoy Antony Assignee: Benoy Antony Attachments: YARN-2527.patch, YARN-2527.patch NPE in _ApplicationACLsManager_ can result in a 500 Internal Server Error. The relevant stacktrace snippet from the ResourceManager logs is below: {code} Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) at org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) at org.apache.hadoop.yarn.webapp.View.render(View.java:235) {code} This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
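A sketch of the self-sufficient handling suggested above might look like the following. The signature matches how checkAccess appears in the stack trace, but the body is illustrative (the owner-only fallback is an assumption), not the attached patch:
{code}
// Null-safe lookup: if no ACLs were registered for this application
// (e.g. the race described above), fall back to an owner-only ACL
// instead of dereferencing a null map entry. Illustrative, not the patch.
public boolean checkAccess(UserGroupInformation callerUGI,
    ApplicationAccessType accessType, String owner, ApplicationId appId) {
  Map<ApplicationAccessType, AccessControlList> acls = applicationACLS.get(appId);
  AccessControlList acl = (acls == null) ? null : acls.get(accessType);
  if (acl == null) {
    acl = new AccessControlList(owner); // assumed fallback: owner (plus admins) only
  }
  return adminAclsManager.isAdmin(callerUGI) || acl.isUserAllowed(callerUGI);
}
{code}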
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149838#comment-14149838 ] Karthik Kambatla commented on YARN-2179: Thanks Chris. The latest patch looks good to me. Just two more nits, sorry for not noticing sooner. # Rename CacheStructureUtil to SharedCache(Structure)Util? # Mark RemoteAppChecker, Util-class, SharedCacheManager Private-Unstable. [~vinodkv] - do you have any other comments on this patch? Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an scm that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated YARN-1769: Attachment: YARN-1769.patch Patch with log statements changed to debug CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm for reservations is to reserve as many containers as currently required, and then it will start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations, you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v5.patch Attached v5. This is a slight rebase so that it applies cleanly on top of trunk+YARN-2179. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2372) There are Chinese Characters in the FairScheduler's document
[ https://issues.apache.org/jira/browse/YARN-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149881#comment-14149881 ] Hudson commented on YARN-2372: -- FAILURE: Integrated in Hadoop-trunk-Commit #6125 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6125/]) YARN-2372. There are Chinese Characters in the FairScheduler's document (Fengdong Yu via aw) (aw: rev 32870db0fb91e115b5e44edb7b313368e8e81b1e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm There are Chinese Characters in the FairScheduler's document Key: YARN-2372 URL: https://issues.apache.org/jira/browse/YARN-2372 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.4.1 Reporter: Fengdong Yu Assignee: Fengdong Yu Priority: Minor Fix For: 2.6.0 Attachments: YARN-2372.patch, YARN-2372.patch, YARN-2372.patch, YARN-2372.patch, YARN-2372.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149891#comment-14149891 ] Zhijie Shen commented on YARN-2606: --- Talked to Vinod offline briefly. It seems that the YARN daemons are not supposed to make external calls until the start stage. Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit a stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
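In other words, under the service lifecycle, anything that touches HDFS should wait for the start stage, after the daemon's secure login. A hedged sketch of the pattern follows; the service class, config key, and paths are hypothetical, and this is not the actual patch:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.service.AbstractService;

// Illustrative lifecycle pattern, not the actual patch: read config in
// serviceInit, but defer the first HDFS access to serviceStart so it
// happens after the daemon's secure (Kerberos) login has completed.
public class HistoryStoreService extends AbstractService {
  private FileSystem fs;
  private Path root;

  public HistoryStoreService() {
    super(HistoryStoreService.class.getName());
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    root = new Path(conf.get("sample.history.dir", "/tmp/history")); // hypothetical key
    super.serviceInit(conf); // no external calls here
  }

  @Override
  protected void serviceStart() throws Exception {
    fs = root.getFileSystem(getConfig()); // first external call, post-login
    fs.mkdirs(root);
    super.serviceStart();
  }
}
{code}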
[jira] [Commented] (YARN-2606) Application History Server tries to access hdfs before doing secure login
[ https://issues.apache.org/jira/browse/YARN-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149905#comment-14149905 ] Mit Desai commented on YARN-2606: - I see. Thanks for the info. I did not know about that. I will post a refreshed patch once I have made the changes and tested it. Application History Server tries to access hdfs before doing secure login - Key: YARN-2606 URL: https://issues.apache.org/jira/browse/YARN-2606 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2606.patch While testing the Application Timeline Server, the server would not come up in a secure cluster, as it would keep trying to access hdfs without having done the secure login. It would repeatedly try authenticating and finally hit stack overflow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-012.patch Tightened down code, docs, and javadocs; moved classes around to position things. The registry security test failure on Jenkins didn't arise in the last patch submission; there's no obvious reason for that (more precisely, for why it arose in the first place). Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
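For background, the registration pattern under discussion, an app publishing its endpoint to ZK directly (exactly what the proposal moves behind the RM for security), looks roughly like this sketch with a plain Curator client; the path and payload format are invented for illustration:
{code}
import java.nio.charset.StandardCharsets;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustration of the registry concept only: publish a service
// endpoint under a well-known ZK path, then look it up by name.
public class RegistrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    String path = "/registry/users/alice/services/hbase/master"; // invented path
    byte[] endpoint = "host1.example.com:16000".getBytes(StandardCharsets.UTF_8);
    zk.create().creatingParentsIfNeeded().forPath(path, endpoint);

    // A client that knows only the logical name finds the endpoint:
    String found = new String(zk.getData().forPath(path), StandardCharsets.UTF_8);
    System.out.println("hbase master at " + found);
    zk.close();
  }
}
{code}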
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: (was: YARN-2566.000.patch) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
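The fix direction the description implies, trying each local dir instead of only the first, might look roughly like this sketch (a simplified signature and helper, not the actual patch):
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

// Sketch: fall back to the next local dir when a copy fails
// (e.g. disk full), instead of failing localization outright.
public class TokenCopySketch {
  public static Path copyTokenFile(FileContext lfs, Path tokenSrc,
      List<Path> localDirs, String tokenFileName) throws IOException {
    IOException last = null;
    for (Path dir : localDirs) {
      Path dst = new Path(dir, tokenFileName);
      try {
        lfs.util().copy(tokenSrc, dst);
        return dst; // the first dir with enough space wins
      } catch (IOException e) {
        last = e;   // remember the failure and try the next localDir
      }
    }
    throw new IOException("Failed to copy " + tokenSrc
        + " to any of " + localDirs, last);
  }
}
{code}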
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.000.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149937#comment-14149937 ] zhihai xu commented on YARN-2566: - [The Findbugs warnings link | https://builds.apache.org/job/PreCommit-YARN-Build/5037//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html] does not exist. Reattaching the patch to restart the test. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at 
org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149941#comment-14149941 ] Hadoop QA commented on YARN-1769: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671525/YARN-1769.patch against trunk revision 3a1f981. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5150//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5150//console This message is automatically generated. CapacityScheduler: Improve reservations Key: YARN-1769 URL: https://issues.apache.org/jira/browse/YARN-1769 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Thomas Graves Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch Currently the CapacityScheduler uses reservations in order to handle requests for large containers and the fact that there might not currently be enough space available on a single host. The current algorithm is to reserve as many containers as currently required, and then start to reserve more above that after a certain number of re-reservations (currently biased against larger containers). Any time it hits the limit on the number reserved, it stops looking at any other nodes. This results in potentially missing nodes that have enough space to fulfill the request. The other place for improvement is that reservations currently count against your queue capacity. If you have reservations you could hit the various limits, which would then stop you from looking further at that node. The above 2 cases can cause an application requesting a larger container to take a long time to get its resources. We could improve upon both of those by simply continuing to look at incoming nodes to see if we could potentially swap out a reservation for an actual allocation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2179: --- Attachment: YARN-2179-trunk-v9.patch [~kasha] [~vinodkv] Attached v9 to address the last comments. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2591: - Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149952#comment-14149952 ] Jian He commented on YARN-668: -- Looks good overall, minor comments: - In ContainerTokenIdentifier, check null as well? Similarly for all {{getUser}}? {code} public ContainerId getContainerID() { return new ContainerIdPBImpl(proto.getContainerId()); } {code} - This change can be reverted back {code} // LogAggregationContext is set as null Assert.assertNull(getLogAggregationContextFromContainerToken(rm1, nm1, null)); {code} TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
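A self-contained sketch of the null-check pattern being requested (FakeProto stands in for the generated protobuf class; this is not necessarily the shape of the final patch):
{code}
// Sketch of the null-check pattern for PB-backed getters.
public class NullCheckSketch {
  // FakeProto mimics a generated protobuf message with hasXxx()/getXxx().
  static final class FakeProto {
    private final String containerId; // null when the field is unset
    FakeProto(String containerId) { this.containerId = containerId; }
    boolean hasContainerId() { return containerId != null; }
    String getContainerId() { return containerId; }
  }

  static String getContainerID(FakeProto proto) {
    // return null instead of wrapping an unset field
    return proto.hasContainerId() ? proto.getContainerId() : null;
  }

  public static void main(String[] args) {
    System.out.println(getContainerID(new FakeProto(null)));          // null
    System.out.println(getContainerID(new FakeProto("container_1"))); // container_1
  }
}
{code}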
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150005#comment-14150005 ] Hadoop QA commented on YARN-2179: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671538/YARN-2179-trunk-v9.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-sharedcachemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5153//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5153//console This message is automatically generated. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Attachment: YARN-668-v10.patch Nice catch, [~jianhe]! Fixed these issues in the v10 patch. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150026#comment-14150026 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671536/YARN-2566.000.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5152//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5152//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5152//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: (was: YARN-2180-trunk-v5.patch) In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-2180: --- Attachment: YARN-2180-trunk-v5.patch Re-attached v5 to accommodate the SharedCacheStructureUtil rename. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2206) Update document for applications REST API response examples
[ https://issues.apache.org/jira/browse/YARN-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150044#comment-14150044 ] Hadoop QA commented on YARN-2206: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12652326/YARN-2206.patch against trunk revision aa5d925. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5155//console This message is automatically generated. Update document for applications REST API response examples --- Key: YARN-2206 URL: https://issues.apache.org/jira/browse/YARN-2206 Project: Hadoop YARN Issue Type: Improvement Components: documentation Affects Versions: 2.4.0 Reporter: Kenji Kikushima Assignee: Kenji Kikushima Priority: Minor Attachments: YARN-2206.patch In ResourceManagerRest.apt.vm, Applications API responses are missing some elements. - JSON response should have applicationType and applicationTags. - XML response should have applicationTags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2481) YARN should allow defining the location of java
[ https://issues.apache.org/jira/browse/YARN-2481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150049#comment-14150049 ] Arun C Murthy commented on YARN-2481: - [~ashahab] YARN already allows JAVA_HOME to be overridden... take a look at {{ApplicationConstants.Environment.JAVA_HOME}} and {{YarnConfiguration.DEFAULT_NM_ENV_WHITELIST}} for the code-path. YARN should allow defining the location of java --- Key: YARN-2481 URL: https://issues.apache.org/jira/browse/YARN-2481 Project: Hadoop YARN Issue Type: New Feature Reporter: Abin Shahab YARN right now uses the location of JAVA_HOME on the host to launch containers. This does not work with Docker containers, which have their own filesystem namespace and OS. If the location of the Java binary of the container to be launched is configurable, YARN can launch containers that have Java in a different location than the host. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
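For instance, an AM can override JAVA_HOME per container through the launch context environment; a minimal sketch, with a placeholder path:
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

// Sketch: point a container at a JVM inside its own filesystem
// (e.g. a Docker image) instead of the host's JAVA_HOME.
public class JavaHomeSketch {
  public static ContainerLaunchContext withCustomJavaHome() {
    Map<String, String> env = new HashMap<>();
    env.put(ApplicationConstants.Environment.JAVA_HOME.name(),
        "/usr/lib/jvm/container-java"); // placeholder path for illustration
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);
    return ctx;
  }
}
{code}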
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150060#comment-14150060 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671529/YARN-913-012.patch against trunk revision b40f433. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 36 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1266 javac compiler warnings (more than the trunk's current 1265 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureRMRegistryOperations org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5151//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5151//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5151//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. 
If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
Jian He created YARN-2613: - Summary: NMClient doesn't have retries for supporting rolling-upgrades Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
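The RMProxy-style retry idea might be sketched with the generic hadoop-common retry utilities (NMProtocolSketch is an invented stand-in for the real client protocol):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;

// Sketch: wrap an NM-facing protocol in a retry proxy so calls
// survive the window where the NM is down for a rolling upgrade.
public class NMProxySketch {
  interface NMProtocolSketch { // stand-in for the real protocol
    String status() throws java.io.IOException;
  }

  public static NMProtocolSketch wrap(NMProtocolSketch raw) {
    RetryPolicy policy = RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        30, 2, TimeUnit.SECONDS); // roughly one minute of retries
    return (NMProtocolSketch) RetryProxy.create(
        NMProtocolSketch.class, raw, policy);
  }
}
{code}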
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150093#comment-14150093 ] Xuan Gong commented on YARN-2468: - bq. One question: if one file is failed at uploading in LogValue.write(), uploadedFiles will not reflect the missing uploaded file, and it will not be uploaded again? Good catch. Fixed this in the new patch. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
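A plausible shape for that fix, purely illustrative with invented names: record a file into the uploaded set only after its copy succeeds, so a failed upload stays eligible for the next aggregation cycle.
{code}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

// Sketch: only mark files as uploaded once the write succeeded,
// so a failure leaves them eligible for the next aggregation cycle.
public class UploadTrackingSketch {
  private final Set<File> uploadedFiles = new HashSet<>();

  interface Uploader { void upload(File f) throws Exception; }

  public void writeAll(Set<File> candidates, Uploader uploader) {
    for (File f : candidates) {
      try {
        uploader.upload(f);
        uploadedFiles.add(f); // recorded only on success
      } catch (Exception e) {
        // f stays out of uploadedFiles and is retried later
      }
    }
  }
}
{code}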
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.8.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.001.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150108#comment-14150108 ] zhihai xu commented on YARN-2566: - Uploaded a new patch, YARN-2566.001.patch, to fix the findbugs issue by catching IOException instead of Exception. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even though there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at 
org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Attachment: YARN-2583.1.patch Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time, and if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS: we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will keep being uploaded to HDFS. The number of log files for this application will grow larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which causes problems because by then the app-log-dir for this application in HDFS has been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
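One way the deletion logic could account for LRS (a hedged sketch, not the attached patch): delete individual rolled log files past the cutoff while keeping the application's log directory alive.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: per-file deletion for a still-running LRS app; remove
// only rolled log files older than the cutoff, keep the directory.
public class LrsLogDeletionSketch {
  public static void deleteOldLogs(FileSystem fs, Path appLogDir,
      long cutoffMillis) throws IOException {
    for (FileStatus file : fs.listStatus(appLogDir)) {
      if (file.isFile() && file.getModificationTime() < cutoffMillis) {
        fs.delete(file.getPath(), false); // non-recursive: single file
      }
    }
    // appLogDir itself is left in place for future uploads
  }
}
{code}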
[jira] [Updated] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2591: -- Target Version/s: 2.6.0 AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2591: -- Attachment: YARN-2591.1.patch Created a patch to throw ForbiddenException if access is denied by ApplicationACLsManager. AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2591.1.patch AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
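Roughly the check in question; the ACL manager and exception types are the standard YARN ones, but the surrounding method is invented for illustration:
{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.server.security.ApplicationACLsManager;
import org.apache.hadoop.yarn.webapp.ForbiddenException;

// Sketch: turn a denied ACL check into a 403 instead of letting it
// surface as a 500.
public class AclCheckSketch {
  public static void checkAccess(ApplicationACLsManager aclsManager,
      UserGroupInformation callerUGI, String appOwner, ApplicationId appId) {
    if (callerUGI != null && !aclsManager.checkAccess(
        callerUGI, ApplicationAccessType.VIEW_APP, appOwner, appId)) {
      throw new ForbiddenException("User " + callerUGI.getShortUserName()
          + " does not have privilege to see application " + appId);
    }
  }
}
{code}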
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150155#comment-14150155 ] Hadoop QA commented on YARN-668: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671553/YARN-668-v10.patch against trunk revision c7c8e38. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5154//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5154//console This message is automatically generated. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150162#comment-14150162 ] Jian He commented on YARN-2594: --- The current patch looks good to me; thanks all for the discussion! Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150177#comment-14150177 ] Karthik Kambatla commented on YARN-2594: As I commented earlier, the current approach is fine with me. My review comments still apply: we should avoid using readLock in other get methods that access RMAppImpl#currentAttempt. RMAppAttemptImpl should handle the thread-safety of its fields. Can we also file follow-up JIRAs to clean up synchronization in SchedulerApplicationAttempt? Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
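A hedged illustration of the pattern Karthik is asking for — reading a volatile reference instead of taking RMAppImpl's readLock in a simple getter; the field and method names are simplified, not the committed code:
{code}
// Sketch excerpt from a class like RMAppImpl: currentAttempt is published
// through a volatile reference, so getters need no readLock, and
// RMAppAttemptImpl guards its own internal state.
private volatile RMAppAttempt currentAttempt;

public ApplicationResourceUsageReport getApplicationResourceUsageReport() {
  RMAppAttempt attempt = this.currentAttempt; // one volatile read, no lock
  // Delegate to the attempt, which handles its own thread-safety.
  return attempt == null ? null : attempt.getApplicationResourceUsageReport();
}
{code}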
[jira] [Updated] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2613: -- Attachment: YARN-2613.1.patch NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150185#comment-14150185 ] Jian He commented on YARN-2613: --- Patch: - Created a new NMProxy class for instantiating the ContainerManagementProtocol proxy with a retry implementation, and created a new common base ServerProxy class. - Updated existing code to use the new NMProxy class. - Manually tested on a single-node cluster: submit an MR job and kill the NM; the MR job will retry the NM. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
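A hedged sketch of what such a retrying proxy can look like using Hadoop's stock retry machinery; the policy parameters and the rawProxy argument are illustrative, not the patch's actual values:
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryProxy;
import org.apache.hadoop.yarn.api.ContainerManagementProtocol;

public class NMProxySketch {
  // Wrap the NM-facing protocol proxy so calls ride over an NM restart
  // during a rolling upgrade instead of failing immediately.
  public static ContainerManagementProtocol createRetriableProxy(
      ContainerManagementProtocol rawProxy) {
    RetryPolicy retryPolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        TimeUnit.MINUTES.toMillis(3),   // give the NM 3 minutes to come back
        TimeUnit.SECONDS.toMillis(10),  // probing every 10 seconds
        TimeUnit.MILLISECONDS);
    return (ContainerManagementProtocol) RetryProxy.create(
        ContainerManagementProtocol.class, rawProxy, retryPolicy);
  }
}
{code}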
[jira] [Commented] (YARN-1615) Fix typos in FSSchedulerApp.java
[ https://issues.apache.org/jira/browse/YARN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150186#comment-14150186 ] Hadoop QA commented on YARN-1615: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12623903/YARN-1615.patch against trunk revision 6b7673e. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5157//console This message is automatically generated. Fix typos in FSSchedulerApp.java Key: YARN-1615 URL: https://issues.apache.org/jira/browse/YARN-1615 Project: Hadoop YARN Issue Type: Bug Components: documentation, scheduler Affects Versions: 2.2.0 Reporter: Akira AJISAKA Assignee: Akira AJISAKA Priority: Trivial Labels: newbie Attachments: YARN-1615.patch In FSSchedulerApp.java there're 4 typos: {code} * containers over rack-local or off-switch containers. To acheive this * we first only allow node-local assigments for a given prioirty level, * then relax the locality threshold once we've had a long enough period * without succesfully scheduling. We measure both the number of missed {code} They should be fixed as follows: {code} * containers over rack-local or off-switch containers. To achieve this * we first only allow node-local assignments for a given priority level, * then relax the locality threshold once we've had a long enough period * without successfully scheduling. We measure both the number of missed {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2179) Initial cache manager structure and context
[ https://issues.apache.org/jira/browse/YARN-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150200#comment-14150200 ] Karthik Kambatla commented on YARN-2179: +1. Initial cache manager structure and context --- Key: YARN-2179 URL: https://issues.apache.org/jira/browse/YARN-2179 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2179-trunk-v1.patch, YARN-2179-trunk-v2.patch, YARN-2179-trunk-v3.patch, YARN-2179-trunk-v4.patch, YARN-2179-trunk-v5.patch, YARN-2179-trunk-v6.patch, YARN-2179-trunk-v7.patch, YARN-2179-trunk-v8.patch, YARN-2179-trunk-v9.patch Implement the initial shared cache manager structure and context. The SCMContext will be used by a number of manager services (i.e. the backing store and the cleaner service). The AppChecker is used to gather the currently running applications on SCM startup (necessary for an SCM that is backed by an in-memory store). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
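A hedged sketch of the role AppChecker plays in that description; the methods shown are a guess at its shape, not the patch's exact signatures:
{code}
import java.util.Collection;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: the SCM consults this on startup so an in-memory store can
// re-learn which applications are still running and must not have their
// cache references dropped.
public interface AppChecker {
  // True if the application is still active in the cluster.
  boolean isApplicationActive(ApplicationId id) throws YarnException;

  // All currently active applications, gathered once at SCM startup.
  Collection<ApplicationId> getActiveApplications() throws YarnException;
}
{code}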
[jira] [Commented] (YARN-2591) AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data
[ https://issues.apache.org/jira/browse/YARN-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150220#comment-14150220 ] Hadoop QA commented on YARN-2591: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671578/YARN-2591.1.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5158//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5158//console This message is automatically generated. AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data --- Key: YARN-2591 URL: https://issues.apache.org/jira/browse/YARN-2591 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 3.0.0, 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2591.1.patch AHSWebServices should return FORBIDDEN(403) if the request user doesn't have access to the history data. Currently, it is going to return INTERNAL_SERVER_ERROR(500). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150221#comment-14150221 ] Zhijie Shen commented on YARN-2468: --- +1 for the latest patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150244#comment-14150244 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671569/YARN-2468.8.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart org.apache.hadoop.yarn.server.TestContainerManageTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5159//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5159//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150245#comment-14150245 ] Hadoop QA commented on YARN-2613: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671581/YARN-2613.1.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.TestPBLocalizerRPC The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.TestContainerManageTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5160//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5160//console This message is automatically generated. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150255#comment-14150255 ] Hadoop QA commented on YARN-2566: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671570/YARN-2566.001.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5156//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5156//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file. If the copy fails for the first localDir due to not enough disk space, the localization fails even when there is plenty of disk space in the other localDirs.
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at
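The failure mode above suggests the fix direction: do not hard-code the first localDir. A hedged sketch, where the helper name and parameters are illustrative rather than the patch's actual code:
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class TokenCopySketch {
  // Sketch: try each localDir in turn and fail only if none works, instead
  // of failing outright when the first dir is out of space.
  static Path copyTokenFile(FileContext lfs, Path tokenSrc,
      List<String> localDirs) throws IOException {
    IOException lastFailure = null;
    for (String dir : localDirs) {
      try {
        Path dst = new Path(dir, tokenSrc.getName());
        lfs.util().copy(tokenSrc, dst); // may throw, e.g. on ENOSPC
        return dst;                     // succeeded on this dir
      } catch (IOException e) {
        lastFailure = e;                // remember and try the next dir
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("No usable localDir for " + tokenSrc);
  }
}
{code}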
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150266#comment-14150266 ] Zhijie Shen commented on YARN-2468: --- The test failures seem to be related to address binding conflicts. Restarting the build. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-2609: --- Attachment: YARN-2609.patch Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-2609: --- Attachment: YARN-2609.docx Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.docx, YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2609) Example of use for the ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-2609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150285#comment-14150285 ] Carlo Curino commented on YARN-2609: Per [~kasha]'s request, we provide a simple way to see YARN-1051 in action (adding to the mapreduce-examples) and a brief usage document. Please refer to the documents associated with YARN-1051 for more context and the design vision. This patch can be improved/extended for the actual commit; it is just to facilitate the evaluation of YARN-1051 for the ongoing merge-to-trunk vote. Example of use for the ReservationSystem Key: YARN-2609 URL: https://issues.apache.org/jira/browse/YARN-2609 Project: Hadoop YARN Issue Type: Improvement Reporter: Carlo Curino Assignee: Carlo Curino Priority: Minor Attachments: YARN-2609.docx, YARN-2609.patch This JIRA provides a simple new example in mapreduce-examples that requests a reservation and submits a Pi computation in the reservation. This is meant just to show how to interact with the reservation system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
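For orientation, a hedged sketch of the client-side interaction the example demonstrates; the sizes, times, and queue name are made up, and the attached .docx and patch describe the real flow:
{code}
import java.util.Collections;
import org.apache.hadoop.yarn.api.protocolrecords.ReservationSubmissionRequest;
import org.apache.hadoop.yarn.api.records.ReservationDefinition;
import org.apache.hadoop.yarn.api.records.ReservationId;
import org.apache.hadoop.yarn.api.records.ReservationRequest;
import org.apache.hadoop.yarn.api.records.ReservationRequestInterpreter;
import org.apache.hadoop.yarn.api.records.ReservationRequests;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ReservationExampleSketch {
  // Sketch: reserve 10 containers of 1GB/1 vcore between arrival and
  // deadline, then run the Pi job inside the returned reservation.
  static ReservationId reserveForPi(YarnClient yarnClient,
      long arrivalMs, long deadlineMs) throws Exception {
    ReservationRequest ask = ReservationRequest.newInstance(
        Resource.newInstance(1024, 1), 10);
    ReservationRequests asks = ReservationRequests.newInstance(
        Collections.singletonList(ask), ReservationRequestInterpreter.R_ALL);
    ReservationDefinition definition = ReservationDefinition.newInstance(
        arrivalMs, deadlineMs, asks, "pi-reservation");
    ReservationSubmissionRequest request =
        ReservationSubmissionRequest.newInstance(definition, "default");
    // The Pi job is then submitted with this ReservationId set on it.
    return yarnClient.submitReservation(request).getReservationId();
  }
}
{code}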
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150292#comment-14150292 ] Hadoop QA commented on YARN-2468: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671569/YARN-2468.8.patch against trunk revision 6b7673e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5161//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5161//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch, YARN-2468.7.1.patch, YARN-2468.7.patch, YARN-2468.8.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2614) Cleanup synchronized method in SchedulerApplicationAttempt
Wangda Tan created YARN-2614: Summary: Cleanup synchronized method in SchedulerApplicationAttempt Key: YARN-2614 URL: https://issues.apache.org/jira/browse/YARN-2614 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Wangda Tan According to discussions in YARN-2594, there are some methods in SchedulerApplicationAttempt that will be accessed by other modules, which can lead to a potential deadlock in the RM; we should clean them up as much as we can. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
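One hedged illustration of the kind of cleanup meant here — returning an immutable snapshot taken under the monitor rather than letting other modules call into synchronized getters and risk lock-order inversion; the names are simplified, not a proposal for the actual patch:
{code}
// Sketch excerpt from a class like SchedulerApplicationAttempt: copy the
// mutable Resource while holding the monitor and hand out the copy, so
// callers never hold this object's lock while touching other locks.
public synchronized Resource getCurrentConsumptionSnapshot() {
  return Resource.newInstance(currentConsumption.getMemory(),
      currentConsumption.getVirtualCores());
}
{code}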
[jira] [Commented] (YARN-2594) Potential deadlock in RM when querying ApplicationResourceUsageReport
[ https://issues.apache.org/jira/browse/YARN-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150301#comment-14150301 ] Wangda Tan commented on YARN-2594: -- Thanks [~jianhe] and [~kasha] for the review, I created YARN-2614 to track the SchedulerApplicationAttempt synchronization cleanups. Potential deadlock in RM when querying ApplicationResourceUsageReport - Key: YARN-2594 URL: https://issues.apache.org/jira/browse/YARN-2594 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karam Singh Assignee: Wangda Tan Priority: Blocker Attachments: YARN-2594.patch The ResourceManager sometimes becomes unresponsive: there is no exception in the ResourceManager log, and it contains only the following type of messages: {code} 2014-09-19 19:13:45,241 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 53000 2014-09-19 19:30:26,312 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 54000 2014-09-19 19:47:07,351 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 55000 2014-09-19 20:03:48,460 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 56000 2014-09-19 20:20:29,542 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 57000 2014-09-19 20:37:10,635 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 58000 2014-09-19 20:53:51,722 INFO event.AsyncDispatcher (AsyncDispatcher.java:handle(232)) - Size of event-queue is 59000 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150300#comment-14150300 ] Hudson commented on YARN-668: - SUCCESS: Integrated in Hadoop-trunk-Commit #6130 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6130/]) YARN-668. Changed NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use protobuf object as the payload. Contributed by Junping Du. (jianhe: rev 5391919b09ce9549d13c897aa89bb0a0536760fe) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/NMTokenIdentifierNewForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/ContainerTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/NMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/AMRMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/proto/test_token.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/proto/test_amrm_token.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/TestContainer.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/AMRMTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/pom.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Fix For: 2.6.0 Attachments: YARN-668-demo.patch, YARN-668-v10.patch, YARN-668-v2.patch, YARN-668-v3.patch, YARN-668-v4.patch, YARN-668-v5.patch, YARN-668-v6.patch, YARN-668-v7.patch, YARN-668-v8.patch, YARN-668-v9.patch, YARN-668.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. 
TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150324#comment-14150324 ] Karthik Kambatla commented on YARN-2180: Thanks for the updates, Chris. The overall approach looks good. Review comments: # SharedCacheManager#createSCMStoreService should use ReflectionUtils.newInstance. RMProxy is an example. # Thinking out loud here. YarnConfiguration (and yarn-default): I was wondering if we need a separate prefix for manager. Do we have more configs coming later that are specific to manager? yarn.sharedcache.store is not ambiguous. # SCMStore ## A couple of lines are longer than 80 chars. ## For resources that are not in the store, isn't the access time trivially zero? I am okay with returning -1 for those cases, but will returning zero help at call sites? ## Nit: Would keep the methods concerning references all together. # InMemorySCMStore configuration - do we need a separate configuration class for the in-memory store? Why not include it in YarnConfiguration, similar to the RMStore implementations? Depending on what we decide, we might want to change the actual config names. # InMemorySCMStore ## Can we rename map to something more descriptive? cacheResources? ## Nit: Move the bootstrapping code to a different method for readability? ## Isn't the following synchronized block prone to races when different threads lock on different objects? {code} synchronized (initialApps) { initialApps = getInitialApps(conf); } {code} ## We can leave it as is for now, but the implementation of AppChecker should come from some util method based on whether it is embedded or not. If we are open to it, we can add that method now and make it return RemoteAppChecker by default. ## Nit: The following should fit on two lines: {code} Map<String, String> getInitialCachedResources(FileSystem fs, Configuration conf) throws IOException { {code} ## Use containsKey instead of the following? {code} String mapped = initialCachedEntries.get(key); if (mapped != null) { {code} ## clearCache() - we should annotate each TODO with a follow-up JIRA, so we don't forget. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
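On the synchronized-block race quoted above, a hedged sketch of the usual fix — locking on a dedicated final object rather than on a field that the block itself reassigns; only getInitialApps and initialApps come from the quoted code, the rest is illustrative:
{code}
// Sketch: two threads entering the original block could end up holding the
// monitors of two different objects once initialApps is reassigned; a final
// lock object makes the critical section actually mutually exclusive.
private final Object initialAppsLock = new Object();
private List<ApplicationId> initialApps;

void bootstrap(Configuration conf) {
  synchronized (initialAppsLock) {      // stable lock object
    initialApps = getInitialApps(conf); // getInitialApps from the patch
  }
}
{code}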
[jira] [Commented] (YARN-2180) In-memory backing store for cache manager
[ https://issues.apache.org/jira/browse/YARN-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150328#comment-14150328 ] Karthik Kambatla commented on YARN-2180: Forgot to mention - we should annotate the new classes as @Private and @Evolving|@Unstable as appropriate. In-memory backing store for cache manager - Key: YARN-2180 URL: https://issues.apache.org/jira/browse/YARN-2180 Project: Hadoop YARN Issue Type: Sub-task Reporter: Chris Trezzo Assignee: Chris Trezzo Attachments: YARN-2180-trunk-v1.patch, YARN-2180-trunk-v2.patch, YARN-2180-trunk-v3.patch, YARN-2180-trunk-v4.patch, YARN-2180-trunk-v5.patch Implement an in-memory backing store for the cache manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
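Concretely, with Hadoop's standard classification annotations; the class shown is from this patch's discussion, but the particular annotation pairing is only a suggestion:
{code}
import org.apache.hadoop.classification.InterfaceAudience.Private;
import org.apache.hadoop.classification.InterfaceStability.Evolving;

@Private   // internal to YARN, not a user-facing API
@Evolving  // may change incompatibly between minor releases
public class InMemorySCMStore extends SCMStore {
  // ...
}
{code}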
[jira] [Commented] (YARN-2613) NMClient doesn't have retries for supporting rolling-upgrades
[ https://issues.apache.org/jira/browse/YARN-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150334#comment-14150334 ] Hadoop QA commented on YARN-2613: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12671581/YARN-2613.1.patch against trunk revision 5f16c98. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5162//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5162//console This message is automatically generated. NMClient doesn't have retries for supporting rolling-upgrades - Key: YARN-2613 URL: https://issues.apache.org/jira/browse/YARN-2613 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2613.1.patch While the NM is undergoing a rolling upgrade, the client should retry the NM until it comes back up. This jira is to add an NMProxy (similar to RMProxy) with a retry implementation to support rolling upgrades. -- This message was sent by Atlassian JIRA (v6.3.4#6332)