[jira] [Created] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
Varun Vasudev created YARN-2711: --- Summary: TestDefaultContainerExecutor#testContainerLaunchError fails on Windows Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2711: Attachment: apache-yarn-2711.0.patch Patch with fix attached. TestDefaultContainerExecutor#testContainerLaunchError fails on Windows -- Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2711.0.patch The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
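For context on the failure above: the test hard-codes {{/bin/echo}}, which does not exist on Windows. A minimal sketch of a platform-aware choice of launch command is below; it assumes the existing {{Shell.WINDOWS}} flag, and the class name and concrete paths are illustrative only, not necessarily what apache-yarn-2711.0.patch does.
{code}
// Illustrative sketch only: choose a launch command that exists on the test platform
// instead of hard-coding /bin/echo. Shell.WINDOWS is an existing Hadoop utility flag;
// the class name and paths here are assumptions for the example.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Shell;

public class PlatformAwareLaunchCommand {
  static Path launchCommand() {
    // cmd.exe exists on Windows installs; /bin/echo exists on Unix-like systems.
    return Shell.WINDOWS
        ? new Path("C:/Windows/System32/cmd.exe")
        : new Path("/bin/echo");
  }

  public static void main(String[] args) {
    System.out.println("Using launch command: " + launchCommand());
  }
}
{code}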
[jira] [Commented] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176746#comment-14176746 ] Hadoop QA commented on YARN-2711: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675786/apache-yarn-2711.0.patch against trunk revision da80c4d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5458//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5458//console This message is automatically generated. TestDefaultContainerExecutor#testContainerLaunchError fails on Windows -- Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2711.0.patch The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2691: - Attachment: YARN-2691.patch User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch Support for handling Application-Priority label coming from client to ApplicationSubmissionContext. Common api support for user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-3.patch Re-uploading the last patch, that has a single {{catch(Exception)}}. [~vinodkv] - would you still prefer having multiple catch-blocks, one for each exception. IMO, catching {{ConnectException}} doesn't seem very readable; we could add a comment on why we are adding that catch, but we might not be able to enumerate all possible cases. That said, I am okay with catching ConnectException and Exception separately. Please advise. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
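To make the trade-off discussed in the comment above concrete, here is a sketch contrasting a single catch-all with separate catch blocks during app recovery. The method and helper names are hypothetical stand-ins, not the actual RMAppManager code from the attached patch.
{code}
// Illustrative only: two catch strategies for per-app recovery failures.
import java.net.ConnectException;

public class RecoveryCatchStyles {

  // Style 1: one catch-all, as in the re-uploaded patch description. Any failure to
  // recover an app is recorded and the RM keeps transitioning to active.
  void recoverWithSingleCatch(String appId) {
    try {
      recoverApplication(appId);
    } catch (Exception e) {
      markAppFailedOnRecovery(appId, e);
    }
  }

  // Style 2: treat connectivity problems (likely transient) differently from
  // per-app recovery errors, so the RM can fall back to standby and retry later.
  void recoverWithSeparateCatches(String appId) throws ConnectException {
    try {
      recoverApplication(appId);
    } catch (ConnectException e) {
      throw e;  // transient store/HDFS connectivity issue: propagate
    } catch (Exception e) {
      markAppFailedOnRecovery(appId, e);  // app-specific failure: skip the app
    }
  }

  private void recoverApplication(String appId) throws Exception { /* ... */ }
  private void markAppFailedOnRecovery(String appId, Exception e) { /* ... */ }
}
{code}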
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177003#comment-14177003 ] Karthik Kambatla commented on YARN-2579: [~rohithsharma] - can you help me understand the issue here better? {{resetDispatcher}} is called either in transitionToStandby or transitionToActive, both of which are synchronized methods. Under what conditions can {{resetDispatcher}} be called by two threads simultaneously? Both RM's state is Active , but 1 RM is not really active. -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state displayed as Active, but one RM's ActiveServices were stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
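A minimal illustration of the point behind the question above: when both transition methods are synchronized on the same object, {{resetDispatcher}} cannot be entered concurrently through these paths. Class and method names here are illustrative, not the real ResourceManager code.
{code}
// Illustrative sketch: two synchronized methods on one object serialize access.
public class HaTransitions {
  public synchronized void transitionToActive() {
    resetDispatcher();
    // ... start active services ...
  }

  public synchronized void transitionToStandby() {
    resetDispatcher();
    // ... stop active services ...
  }

  private void resetDispatcher() {
    // Only reachable while holding the monitor taken by either synchronized caller.
  }
}
{code}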
[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177087#comment-14177087 ] Wangda Tan commented on YARN-2398: -- [~ozawa], the log you attached will be resolved by YARN-2705; it's not the same as the original error: https://issues.apache.org/jira/browse/YARN-2398?focusedCommentId=14090771page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14090771 TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe Attachments: TestResourceTrackerOnHA-output.txt TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reopened YARN-2710: -- RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failures like the following can happen in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177095#comment-14177095 ] Wangda Tan commented on YARN-2710: -- [~jianhe], [~ozawa], [~haosd...@gmail.com]: I just tried again on the latest trunk: running mvn clean test -Dtest=TestApplicationClientProtocolOnHA succeeds, but running -Dtest=TestResourceTrackerOnHA fails. Attached the log from running TestApplicationClientProtocolOnHA; even though it succeeded, the cannot-connect / EOF errors still show up. I guess it might be an issue caused by network configuration. And to [~ozawa], as I commented in YARN-2398, this is not the same as YARN-2398; reopening the ticket so people can report here if they hit the same problem. Thanks, Wangda RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failures like the following can happen in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177099#comment-14177099 ] Vinod Kumar Vavilapalli commented on YARN-2010: --- Sorry missed this. Lost context, so please help clarify. This time, we got a ConnectException to Zookeeper due to which we are skipping apps? That doesn't sound right either. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
Tsuyoshi OZAWA created YARN-2712: Summary: Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery only partially covers FairScheduler. We should add test cases for it. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2712: - Issue Type: Sub-task (was: Test) Parent: YARN-556 Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery only partially covers FairScheduler. We should add test cases for it. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2704) Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time
[ https://issues.apache.org/jira/browse/YARN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2704: -- Attachment: YARN-2704.1.patch Uploaded a patch: - make the RM automatically request an hdfs delegation token on behalf of the user if 1) the user doesn't provide a delegation token on app submission; or 2) the hdfs delegation token is about to expire within 10 hours. - NMs heartbeat with the RM to get the new tokens and use them for localization and log-aggregation. - A config is added to enable/disable this feature. - This approach also requires the NameNode to be configured with the RM as a proxy user. Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time -- Key: YARN-2704 URL: https://issues.apache.org/jira/browse/YARN-2704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2704.1.patch In secure mode, YARN requires the hdfs-delegation token to do localization and log aggregation on behalf of the user. But the hdfs delegation token will eventually expire after max-token-life-time. So, localization and log aggregation will fail after the token expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
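A sketch of the token-request decision described above, for readability; the helper shape and parameter names are hypothetical, and the 10-hour threshold is taken from the comment rather than from the patch itself.
{code}
// Illustrative sketch of the decision described in the comment, not the actual patch.
import java.util.concurrent.TimeUnit;

public class HdfsTokenPolicy {
  private static final long RENEW_AHEAD_MS = TimeUnit.HOURS.toMillis(10);

  /** Should the RM fetch an hdfs delegation token on behalf of the user? */
  boolean shouldRequestToken(boolean userProvidedToken, long tokenExpirationTime,
      long now, boolean featureEnabled) {
    if (!featureEnabled) {
      return false;          // feature can be switched off by configuration
    }
    if (!userProvidedToken) {
      return true;           // case 1: no token supplied at app submission
    }
    // case 2: the token is about to expire within the renew-ahead window
    return tokenExpirationTime - now <= RENEW_AHEAD_MS;
  }
}
{code}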
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177145#comment-14177145 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Thanks for the update, and sorry again for the delay :). The general approach looks very good to me; still reviewing tests and other details. One quick suggestion: you don't need to re-implement an ordered list, whose insertion time complexity is O(n); you can use Java's PriorityQueue or org.apache.hadoop.utils.PriorityQueue instead. Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177147#comment-14177147 ] Wangda Tan commented on YARN-2056: -- Oh, sorry, JIRA interpreted O\(n\) as O(n), which is not what I originally meant :-p Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
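For reference, a small example of the suggestion above: {{java.util.PriorityQueue}} gives O(log n) insertion and O(log n) removal of the head, so a hand-rolled sorted list with O(n) inserts is unnecessary. The {{TempQueue}} type and its ordering below are made up for the example.
{code}
// Illustrative example of ordering queues with java.util.PriorityQueue.
import java.util.Comparator;
import java.util.PriorityQueue;

public class PreemptionOrdering {
  static class TempQueue {
    final String name;
    final double overCapacity;  // how far the queue is over its ideal share
    TempQueue(String name, double overCapacity) {
      this.name = name;
      this.overCapacity = overCapacity;
    }
  }

  public static void main(String[] args) {
    PriorityQueue<TempQueue> mostOverCapacityFirst = new PriorityQueue<>(
        Comparator.comparingDouble((TempQueue q) -> q.overCapacity).reversed());

    mostOverCapacityFirst.add(new TempQueue("a", 0.10));
    mostOverCapacityFirst.add(new TempQueue("b", 0.35));
    mostOverCapacityFirst.add(new TempQueue("c", 0.20));

    // Polls in order of how far over capacity each queue is: b, c, a.
    while (!mostOverCapacityFirst.isEmpty()) {
      System.out.println(mostOverCapacityFirst.poll().name);
    }
  }
}
{code}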
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177167#comment-14177167 ] Karthik Kambatla commented on YARN-2010: In this particular case, we are unable to renew HDFS delegation token due to ConnectException to HDFS. We are not yet clear why this happens. Even if this a transient HDFS issue, both RMs fail to transition to active and the individual RMActiveServices instances transition to STOPPED state. Any subsequent attempts to transition the RM to active fail because RMActiveServices is not INITED, as in the Standby case. I spent some more time thinking about this, and think there might be merit to catch exceptions separately. ConnectException hopefully is due to a transient issue, I don't think we can do much in case of a permanent issue. When we run into this, we should probably cleanly transition to standby, so subsequent attempts to transition to active may succeed. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177169#comment-14177169 ] Wangda Tan commented on YARN-2314: -- [~jlowe], thanks for the update, the patch looks good to me, +1! [~rajesh.balamohan], thanks for your performance report based on this. 20-30ms is still a latency that cannot be totally ignored for interactive tasks. At least we have a way to cache connections via a configuration option in this patch. Wangda ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, YARN-2314v2.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, tez-yarn-2314.xlsx ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
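For readers following the proxy-cache discussion, here is a sketch of a size-bounded, least-recently-used cache of the kind a configurable proxy cache implies. It is not the actual ContainerManagementProtocolProxy code; the proxy type and class name are placeholders.
{code}
// Illustrative sketch: bound the number of cached NM proxies with an LRU map.
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedProxyCache<P> {
  private final Map<String, P> cache;

  public BoundedProxyCache(final int maxSize) {
    // accessOrder=true makes the LinkedHashMap behave as an LRU structure.
    this.cache = new LinkedHashMap<String, P>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, P> eldest) {
        // The real code would also close the evicted connection here so that
        // thread counts stay bounded on large clusters.
        return size() > maxSize;
      }
    };
  }

  public synchronized P get(String nmAddress) {
    return cache.get(nmAddress);
  }

  public synchronized void put(String nmAddress, P proxy) {
    cache.put(nmAddress, proxy);
  }
}
{code}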
[jira] [Updated] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2703: Attachment: YARN-2703.1.patch Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times, but we cannot figure out which logs came first. We could add an extra logUploadedTime to give users a better understanding of the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177185#comment-14177185 ] Xuan Gong commented on YARN-2703: - Add logUploadedTime into LogValue context for better display. This patch is based on YARN-2582. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times, but we cannot figure out which logs came first. We could add an extra logUploadedTime to give users a better understanding of the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
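As a rough illustration of the idea in the description, the aggregated log writer could record an upload timestamp alongside each log type so repeated uploads of the same file can be ordered on display. The header format below is hypothetical, not the actual LogValue change in YARN-2703.1.patch.
{code}
// Illustrative sketch: write an upload timestamp with each aggregated log entry.
import java.io.DataOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogUploadHeader {
  static void writeHeader(DataOutputStream out, String logType, long fileLength)
      throws IOException {
    String uploadedTime =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss Z").format(new Date());
    out.writeBytes("LogType: " + logType + "\n");
    out.writeBytes("LogUploadedTime: " + uploadedTime + "\n");
    out.writeBytes("LogLength: " + fileLength + "\n");
    out.writeBytes("Log Contents:\n");
  }
}
{code}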
[jira] [Created] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
Karthik Kambatla created YARN-2713: -- Summary: Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.6.patch Given that we already broken compatibility for rolling upgrades, the patch should be fine in that sense. Updated the patch against latest trunk. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-102014.patch Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
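For context on what a retry mechanism here can look like, below is a generic bounded-retry sketch around a put call: retry while the timeline server is unreachable, up to a configurable number of attempts with a fixed interval. Names and defaults are illustrative, not the actual TimelineClientImpl change in the attached patch.
{code}
// Illustrative sketch of bounded retries for a timeline put operation.
import java.io.IOException;
import java.net.ConnectException;

public class RetryingPut {
  private final int maxRetries;
  private final long retryIntervalMs;

  public RetryingPut(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  interface PutOperation {
    void run() throws IOException;
  }

  void putWithRetries(PutOperation op) throws IOException, InterruptedException {
    int attempt = 0;
    while (true) {
      try {
        op.run();
        return;
      } catch (ConnectException e) {
        // Server likely down or restarting; retry until the budget is exhausted.
        if (++attempt > maxRetries) {
          throw e;
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}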
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101914.patch) Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177274#comment-14177274 ] Zhijie Shen commented on YARN-2582: --- Looks good to me overall. Just one nit: make the following methods static? {code} private void containerLogNotFound(String containerId) { System.out.println("Logs for container " + containerId + " are not present in this log-file."); } private void logDirNotExist(String remoteAppLogDir) { System.out.println(remoteAppLogDir + " does not exist."); System.out.println("Log aggregation has not completed or is not enabled."); } private void emptyLogDir(String remoteAppLogDir) { System.out.println(remoteAppLogDir + " does not have any log files."); } {code} Same for {code} private void createContainerLogInLocalDir(Path appLogsDir, ContainerId containerId, FileSystem fs) throws Exception { {code} and {code} private void uploadContainerLogIntoRemoteDir(UserGroupInformation ugi, Configuration configuration, List<String> rootLogDirs, NodeId nodeId, ContainerId containerId, Path appDir, FileSystem fs) throws Exception { {code} Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177296#comment-14177296 ] Xuan Gong commented on YARN-2582: - Thanks for the review. Uploaded a new patch to address all the comments Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2582: Attachment: YARN-2582.3.patch Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2704) Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time
[ https://issues.apache.org/jira/browse/YARN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177299#comment-14177299 ] Hadoop QA commented on YARN-2704: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675875/YARN-2704.1.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1269 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5460//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5460//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5460//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5460//console This message is automatically generated. Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time -- Key: YARN-2704 URL: https://issues.apache.org/jira/browse/YARN-2704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2704.1.patch In secure mode, YARN requires the hdfs-delegation token to do localization and log aggregation on behalf of the user. But the hdfs delegation token will eventually expire after max-token-life-time. So, localization and log aggregation will fail after the token expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2690: Attachment: YARN-2690.002.patch Done. I had kept it that way to make it easier to review and was planning to move them in a later patch. But it belongs logically here, so updated. Testing was done on the reservation unit tests and testReservationApis. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177302#comment-14177302 ] Sunil G commented on YARN-2691: --- A quick nit: ApplicationPriority can be comparable. This will help later for comparison and error checking. User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch Support for handling Application-Priority label coming from client to ApplicationSubmissionContext. Common api support for user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
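A small sketch of the nit above: a priority type implementing Comparable so callers can compare and validate priorities directly. This is a hypothetical shape, not the actual API in the attached patch.
{code}
// Illustrative sketch of a comparable application priority.
public class ApplicationPrioritySketch
    implements Comparable<ApplicationPrioritySketch> {
  private final int priority;

  public ApplicationPrioritySketch(int priority) {
    if (priority < 0) {
      throw new IllegalArgumentException("priority must be non-negative");
    }
    this.priority = priority;
  }

  public int getPriority() {
    return priority;
  }

  @Override
  public int compareTo(ApplicationPrioritySketch other) {
    return Integer.compare(this.priority, other.priority);
  }
}
{code}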
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177318#comment-14177318 ] Hadoop QA commented on YARN-2673: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675889/YARN-2673-102014.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5462//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5462//console This message is automatically generated. Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177322#comment-14177322 ] Xuan Gong commented on YARN-2701: - [~zxu] Thanks for the feedback. bq. Do we need to check the directory permission? I think we do. We need to make sure the directory has the right permission. bq. If we want to check the permission, can we change the permission if it doesn't match? I do not think we need to do that. If we really wanted to, just changing the permission would not be enough; we would need to go through all the sub-directories and do the necessary checks, and that does not sound easy. I am thinking that we just keep it this way (check but do not change the permission). If we have further requirements, we can spend more time investigating. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch When using LinuxContainerExecutor to do startLocalizer, we use native code in container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We use a check-then-create approach to create the appDir under /usercache, but if two containers try to do this at the same time, a race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
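Since the race above comes from a check-then-create (stat then mkdir) sequence, one common pattern is to attempt the create and tolerate "already exists", then verify permissions. The actual fix would live in the native container-executor.c (for example by treating EEXIST from mkdir() as success); the Java sketch below is only an analogue to illustrate the pattern and is Unix/POSIX-oriented.
{code}
// Illustrative Java analogue of avoiding the stat-then-mkdir race: create first,
// tolerate "already exists", then verify. Not the actual container-executor.c fix.
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class RaceFreeAppDir {
  static void createAppDir(String dir) throws IOException {
    Path path = Paths.get(dir);
    Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwxr-x---");
    try {
      Files.createDirectory(path, PosixFilePermissions.asFileAttribute(perms));
    } catch (FileAlreadyExistsException e) {
      // Another container created it first; fall through and verify below.
    }
    // Verify the directory ended up with the expected permissions.
    if (!Files.getPosixFilePermissions(path).equals(perms)) {
      throw new IOException("Unexpected permissions on " + path);
    }
  }
}
{code}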
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177326#comment-14177326 ] Zhijie Shen commented on YARN-2673: --- +1 will commit the patch Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2714) Localizer thread might stuck if NM is OOM
Ming Ma created YARN-2714: - Summary: Localizer thread might stuck if NM is OOM Key: YARN-2714 URL: https://issues.apache.org/jira/browse/YARN-2714 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma When the NM JVM runs out of memory, it is normally an uncaught exception and the process will exit. But the RPC server used by the node manager catches OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit and can recover from the OutOfMemoryError situation. However, in some rare situations when this happens, one of the NM localizer threads didn't get the RPC response from the node manager and just waited there. The reason the node manager RPC server doesn't respond is that the RPC server responder thread swallowed the OutOfMemoryError and didn't process the outstanding RPC response. On the RPC client side, the RPC timeout is set to 0 and it relies on ping to detect RPC server availability. {noformat} Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383): State: WAITING Blocked count: 27 Waited count: 84 Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3 Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:503) org.apache.hadoop.ipc.Client.call(Client.java:1396) org.apache.hadoop.ipc.Client.call(Client.java:1363) org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) com.sun.proxy.$Proxy36.heartbeat(Unknown Source) org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995) {noformat} The consequence of this depends on which ContainerExecutor the NM uses. If it uses DefaultContainerExecutor, given that its startLocalizer method is synchronized, it will block other localizer threads. If it uses LinuxContainerExecutor, at least other localizer threads can still proceed, but in theory it can slowly drain all available localizer threads. There are a couple of ways to fix it; some of these fixes are complementary. 1. Fix it at the hadoop-common layer. It seems the RPC server hosted by worker services such as the NM doesn't really need to catch OutOfMemoryError; the service JVM can just exit. Even for the NN and RM, given we have HA, it might be ok to do so. 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients will time out if the RPC server drops the response. 3. Fix it in the YARN localization service. For example, a) fix DefaultContainerExecutor so that synchronization isn't required for the startLocalizer method; b) the download executor thread used by ContainerLocalizer currently catches all exceptions; we can fix ContainerLocalizer so that when the download executor thread catches OutOfMemoryError, it exits its host process. IMHO, fixing it at the RPC server layer is better as it addresses other scenarios. Appreciate any input others might have. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
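On mitigation 2 above (giving YARN RPC clients a finite timeout instead of relying only on ping): a minimal sketch is below. It assumes the Hadoop-common client timeout key ipc.client.rpc-timeout.ms; whether that key is honoured on this particular code path in the version in question is an assumption that would need verifying.
{code}
// Illustrative sketch: configure a finite client-side RPC timeout. The key name is
// the Hadoop-common setting; treat its applicability on this path as an assumption.
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutConfig {
  public static Configuration withRpcTimeout(Configuration conf, int timeoutMs) {
    // 0 (the value in the scenario above) means "no timeout, rely on ping".
    conf.setInt("ipc.client.rpc-timeout.ms", timeoutMs);
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = withRpcTimeout(new Configuration(), 60000);
    System.out.println("RPC timeout: "
        + conf.getInt("ipc.client.rpc-timeout.ms", 0) + " ms");
  }
}
{code}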
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177357#comment-14177357 ] Hudson commented on YARN-2673: -- FAILURE: Integrated in Hadoop-trunk-Commit #6293 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6293/]) YARN-2673. Made timeline client put APIs retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev 89427419a3c5eaab0f73bae98d675979b9efab5f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177366#comment-14177366 ] Hadoop QA commented on YARN-2582: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675907/YARN-2582.3.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5463//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5463//console This message is automatically generated. Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177383#comment-14177383 ] Hadoop QA commented on YARN-2209: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675884/YARN-2209.6.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1288 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5461//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5461//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5461//console This message is automatically generated. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177387#comment-14177387 ] Hadoop QA commented on YARN-2690: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675908/YARN-2690.002.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5464//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5464//console This message is automatically generated. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177392#comment-14177392 ] Zhijie Shen commented on YARN-2582: --- +1 for the last patch. Will commit it. Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2690: Attachment: YARN-2690.002.patch Uploading again to kick Jenkins. The previous failures were bind-related issues, seemingly unrelated to this patch. Reran the tests and they passed locally. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch This patch simplifies the use case by exposing only one Docker configuration param: the image. Now the user must configure the image completely so that all required resources and environment variables are defined in the image. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.6.patch Fixed test failures Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177464#comment-14177464 ] Hudson commented on YARN-2582: -- FAILURE: Integrated in Hadoop-trunk-Commit #6294 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6294/]) YARN-2582. Fixed Log CLI and Web UI for showing aggregated logs of LRS. Contributed Xuan Gong. (zjshen: rev e90718fa5a0e7c18592af61534668acebb9db51b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/TestAggregatedLogsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/log/AggregatedLogsBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestLogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/LogsCLI.java Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
Zhijie Shen created YARN-2715: - Summary: Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177477#comment-14177477 ] zhihai xu commented on YARN-2701: - Hi [~xgong], thanks for the detailed explanation. The explanation sounds reasonable to me. Some nits:
1. Since we only check permission for the final directory component, I think we also need to check finalComponent in the first call to check_permission. Change
{code}
} else if (check_permission(sb.st_mode, perm) == -1) {
{code}
to
{code}
} else if (finalComponent == 1 && check_permission(sb.st_mode, perm) == -1) {
{code}
2. Can we create a new function check_dir to remove the duplicate code which verifies the existing directory in two places? We can also remove the function check_permission by moving the check_permission code into check_dir. This is the check_dir function:
{code}
int check_dir(char* npath, mode_t st_mode, mode_t desired, int finalComponent) {
  // Check whether it is a directory
  if (!S_ISDIR(st_mode)) {
    fprintf(LOGFILE, "Path %s is file not dir\n", npath);
    return -1;
  } else if (finalComponent == 1) {
    int filePermInt = st_mode & (S_IRWXU | S_IRWXG | S_IRWXO);
    int desiredInt = desired & (S_IRWXU | S_IRWXG | S_IRWXO);
    if (filePermInt != desiredInt) {
      fprintf(LOGFILE, "Path %s does not have desired permission.\n", npath);
      return -1;
    }
  }
  return 0;
}
{code}
3. Can we move free(npath); from create_validate_dirs to mkdirs? It is better to free the memory in the same function (mkdirs) that allocated it. In mkdirs:
{code}
if (create_validate_dirs(npath, perm, path, 0) == -1) {
  free(npath);
  return -1;
}
{code}
4. A little more optimization to remove redundant code: we can merge the two pieces of code around fprintf(LOGFILE, "Can't create directory %s in %s - %s\n", npath, path, strerror(errno)); by checking if (errno != EEXIST || stat(npath, &sb) != 0). The code after the change will look like the following:
{code}
int create_validate_dir(char* npath, mode_t perm, char* path, int finalComponent) {
  struct stat sb;
  if (stat(npath, &sb) != 0) {
    if (mkdir(npath, perm) != 0) {
      if (errno != EEXIST || stat(npath, &sb) != 0) {
        fprintf(LOGFILE, "Can't create directory %s in %s - %s\n", npath, path, strerror(errno));
        return -1;
      }
      // The directory npath should exist.
      if (check_dir(npath, sb.st_mode, perm, finalComponent) == -1) {
        return -1;
      }
    }
  } else if (check_dir(npath, sb.st_mode, perm, finalComponent) == -1) {
    return -1;
  }
  return 0;
}
{code}
5. Can we change the name create_validate_dirs to create_validate_dir, since we only create one directory in create_validate_dirs? Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102014.patch I've done a patch for this issue. In this patch, I refactored the retry logic in the Jersey retry filter and built a more generalized retry wrapper for the timeline client. Both the Jersey retry filter and the delegation token call can use this wrapper to retry according to the retry settings (added in YARN-2673). To use the retry wrapper, the user only needs to implement a TimelineClientRetryOp, providing a) the operation that should be retried and b) a verifier that tells, for a given exception e, whether a retry should happen. I've also added a unit test for retries on getting a delegation token. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
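To make the abstraction described above concrete, a rough sketch of such a wrapper is shown below. Only the TimelineClientRetryOp name comes from the comment; the driver class, method names, and signatures are illustrative guesses, not the code in the attached patch.
{code}
import java.io.IOException;

// Rough shape of the retry abstraction described in the comment above.
// Only TimelineClientRetryOp is named in the JIRA; everything else here
// (RetryDriver, field names, signatures) is illustrative.
abstract class TimelineClientRetryOp {
  // a) the operation that should be retried
  abstract Object run() throws IOException;

  // b) a verifier telling, for a given exception, whether to retry
  abstract boolean shouldRetryOn(Exception e);
}

final class RetryDriver {
  private final int maxRetries;       // retry settings added in YARN-2673
  private final long retryIntervalMs;

  RetryDriver(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  Object retryOn(TimelineClientRetryOp op) throws IOException {
    int attempts = 0;
    while (true) {
      try {
        return op.run();
      } catch (Exception e) {
        boolean retriesLeft = maxRetries < 0 || attempts < maxRetries;
        if (!op.shouldRetryOn(e) || !retriesLeft) {
          if (e instanceof IOException) {
            throw (IOException) e;
          }
          throw new RuntimeException(e);
        }
        attempts++;
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Retry interrupted", ie);
        }
      }
    }
  }
}
{code}
With this shape, both the Jersey connect-retry path and the getDelegationToken call mentioned in the comment could be expressed as small TimelineClientRetryOp implementations.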
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177507#comment-14177507 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675937/YARN-1964.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5466//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5466//console This message is automatically generated. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177527#comment-14177527 ] Hadoop QA commented on YARN-2690: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675930/YARN-2690.002.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5465//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5465//console This message is automatically generated. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li updated YARN-2556: --- Attachment: yarn2556_wip.patch Thanks [~airbots] for the substantial early work! I have moved the test job into the mapreduce jobclient tests to avoid a circular dependency. I have tested the patch, and it has successfully shown the write time, write counters, and writes per second. I will continue to work on it to add more measurement metrics such as transaction rates, IO rates, and memory usage. Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: chang li Attachments: YARN-2556-WIP.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
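As a rough illustration of the kind of numbers the comment mentions (write time and writes per second), a minimal timing loop could look like the sketch below. The Writer interface and writeEntity call are placeholders, not the actual jobclient test code.
{code}
import java.util.concurrent.TimeUnit;

/**
 * Trivial sketch of measuring timeline write throughput: time N writes
 * and report total time and writes per second. The writeEntity() call is
 * a placeholder for whatever put operation the benchmark job performs.
 */
public final class TimelineWriteBenchSketch {
  interface Writer {
    void writeEntity(int seq) throws Exception;
  }

  public static void run(Writer writer, int numWrites) throws Exception {
    long start = System.nanoTime();
    for (int i = 0; i < numWrites; i++) {
      writer.writeEntity(i);
    }
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    double writesPerSecond = elapsedMs == 0
        ? numWrites : numWrites * 1000.0 / elapsedMs;
    System.out.printf("wrote %d entities in %d ms (%.1f writes/sec)%n",
        numWrites, elapsedMs, writesPerSecond);
  }
}
{code}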
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177546#comment-14177546 ] Hadoop QA commented on YARN-2709: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675947/YARN-2709-102014.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1267 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5469//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5469//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5469//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2714) Localizer thread might get stuck if NM is OOM
[ https://issues.apache.org/jira/browse/YARN-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177551#comment-14177551 ] zhihai xu commented on YARN-2714: - YARN-2578 will address item 2, which tries to fix the RPC to set a 1-minute timeout. For me, 3.b would be a good low-risk fix. Also, 3.a would be a good optimization. Localizer thread might get stuck if NM is OOM - Key: YARN-2714 URL: https://issues.apache.org/jira/browse/YARN-2714 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma When the NM JVM runs out of memory, it is normally an uncaught exception and the process will exit. But the RPC server used by the node manager catches OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit and can recover from the OutOfMemoryError situation. However, in some rare situations when this happens, one of the NM localizer threads didn't get the RPC response from the node manager and just waited there. The reason the node manager RPC server doesn't respond is that the RPC server responder thread swallowed the OutOfMemoryError and didn't process the outstanding RPC response. On the RPC client side, the RPC timeout is set to 0 and it relies on ping to detect RPC server availability. {noformat} Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383): State: WAITING Blocked count: 27 Waited count: 84 Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3 Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:503) org.apache.hadoop.ipc.Client.call(Client.java:1396) org.apache.hadoop.ipc.Client.call(Client.java:1363) org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) com.sun.proxy.$Proxy36.heartbeat(Unknown Source) org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995) {noformat} The consequence of this depends on which ContainerExecutor the NM uses. If it uses DefaultContainerExecutor, given its startLocalizer method is synchronized, it will block other localizer threads. If you use LinuxContainerExecutor, at least other localizer threads can still proceed, but in theory it can slowly drain all available localizer threads. There are a couple of ways to fix it. Some of these fixes are complementary. 1. Fix it at the hadoop-common layer. It seems the RPC server hosted by worker services such as the NM doesn't really need to catch OutOfMemoryError; the service JVM can just exit. Even for the NN and RM, given we have HA, it might be ok to do so. 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients will time out if the RPC server drops the response. 3. Fix it in the YARN localization service. For example, a) fix DefaultContainerExecutor so that synchronization isn't required for the startLocalizer method; b) the Download executor thread used by ContainerLocalizer currently catches any exception;
we can fix ContainerLocalizer so that when the Download executor thread catches OutOfMemoryError, it can exit its host process. IMHO, fixing it at the RPC server layer is better as it addresses other scenarios. Appreciate any input others might have. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177550#comment-14177550 ] Hadoop QA commented on YARN-2703: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675881/YARN-2703.1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5468//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5468//console This message is automatically generated. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177562#comment-14177562 ] Xuan Gong commented on YARN-2701: - addressed all comments Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.4.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177568#comment-14177568 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- bq. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. This is getting complex. I propose the following: - Have a single yarn.resourcemanager.proxyuser.* prefix - Change both YARN RM RPC server and webapps to use the above prefix if explictly configured. Otherwise, fall back to the common configs. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177584#comment-14177584 ] Hadoop QA commented on YARN-2209: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675943/YARN-2209.6.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1283 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5467//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5467//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5467//console This message is automatically generated. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177601#comment-14177601 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675954/YARN-2701.4.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5470//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5470//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102014-1.patch Added a tag to suppress the warnings when getting the delegation token. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Attachment: YARN-2694-20141020-1.patch Attached ver.1 patch and kicked Jenkins. Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch Currently, node label expression support in the capacity scheduler is only partially completed. A node label expression specified in a ResourceRequest is only respected when it is specified at the ANY level, and a ResourceRequest with multiple node labels makes user limit computation tricky. For now we need to temporarily disable them; changes include: - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
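For readers wanting the restriction in code form, a sanity check along the lines below would enforce the two rules in the summary: at most one node label per resource request, and a label expression only when the resource name is ANY. The class, method, exception type, and the assumption that multiple labels are joined with "&&" are illustrative, not the code in the attached patch.
{code}
/**
 * Illustrative validation of the two rules described above. Names are
 * made up for the sketch and do not match the patch; "&&" as the label
 * separator is an assumption.
 */
public final class NodeLabelRequestCheck {
  private static final String ANY = "*";

  public static void validate(String resourceName, String labelExpression) {
    if (labelExpression == null || labelExpression.trim().isEmpty()) {
      return; // no label expression, nothing to check
    }
    if (!ANY.equals(resourceName)) {
      throw new IllegalArgumentException(
          "Node label expression is only supported when resourceName is ANY");
    }
    if (labelExpression.contains("&&")) {
      throw new IllegalArgumentException(
          "Only a single node label may be specified in a resource request");
    }
  }
}
{code}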
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177627#comment-14177627 ] zhihai xu commented on YARN-2701: - thanks [~xgong], the latest patch looks mostly good to me. Just one typo in function check_dir: return 0; should be outside the inner }. Change {code} return -1; } return 0; } } {code} to {code} return -1; } } return 0; } {code} Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177631#comment-14177631 ] Zhijie Shen commented on YARN-2715: --- Vinod, thanks for the comments. It makes sense to me. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.5.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177636#comment-14177636 ] Xuan Gong commented on YARN-2701: - Good catch Fixed Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
Jian He created YARN-2716: - Summary: Refactor ZKRMStateStore retry code with Apache Curator Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
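For context, Curator attaches a retry policy to the client itself, which is what makes the hand-rolled retry loops in ZKRMStateStore unnecessary. A minimal example of the pattern follows; the connect string, path, and retry values are placeholders, not anything from the eventual patch.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class CuratorRetryExample {
  public static void main(String[] args) throws Exception {
    // Retries are handled by the policy; callers just issue operations.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181",                      // placeholder connect string
        new ExponentialBackoffRetry(1000, 3)); // base sleep 1s, 3 retries
    client.start();
    try {
      client.create().creatingParentsIfNeeded()
          .forPath("/rmstore/example", "state".getBytes("UTF-8"));
      byte[] data = client.getData().forPath("/rmstore/example");
      System.out.println(new String(data, "UTF-8"));
    } finally {
      client.close();
    }
  }
}
{code}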
[jira] [Assigned] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter reassigned YARN-2716: --- Assignee: Robert Kanter Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Robert Kanter Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177647#comment-14177647 ] Subru Krishnan commented on YARN-2690: -- Thanks [~adhoot] for updating the patch. +1 from my side. Couple of minor nits: * We could have a protected _ReservationSchedulerConfiguration_ variable in _AbstractReservationSystem_ to avoid invoking _ReservationSchedulerConfiguration reservationConfig = getReservationSchedulerConfiguration()_ everywhere. * It'll be good to have some Javadocs for _ReservationSchedulerConfiguration_ describing what the reservation system configs are. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177655#comment-14177655 ] zhihai xu commented on YARN-2701: - thanks [~xgong], The latest patch LGTM. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.1.patch Made a patch with the following changes: 1. Use yarn.resourcemanager.proxyuser instead of yarn.resourcemanager.webapp.proxyuser as the RM proxy user prefix for both RPC and HTTP channel. 2. Before setting ProxyUsers#sip, use yarn.resourcemanager.proxyuser to overwrite hadoop.proxyuser configurations if it exists. 3. Always read configurations with hadoop.proxyuser for consistency. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
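Point 2 of the patch description above can be sketched with the plain Configuration API: copy every yarn.resourcemanager.proxyuser.* entry onto the corresponding hadoop.proxyuser.* key before ProxyUsers is refreshed, so both the RPC and HTTP channels see one consistent view. The class and method below are an illustration of that approach, not the attached patch.
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public final class RmProxyUserOverrideSketch {
  private static final String RM_PREFIX = "yarn.resourcemanager.proxyuser.";
  private static final String COMMON_PREFIX = "hadoop.proxyuser.";

  /**
   * Copies RM-specific proxyuser settings over the common hadoop.proxyuser
   * ones. Sketch only; the real patch wires this into RM startup.
   */
  public static Configuration overlayRmProxyUsers(Configuration conf) {
    Map<String, String> overrides = new HashMap<String, String>();
    for (Map.Entry<String, String> entry : conf) {
      String key = entry.getKey();
      if (key.startsWith(RM_PREFIX)) {
        overrides.put(COMMON_PREFIX + key.substring(RM_PREFIX.length()),
            entry.getValue());
      }
    }
    Configuration merged = new Configuration(conf);
    for (Map.Entry<String, String> e : overrides.entrySet()) {
      merged.set(e.getKey(), e.getValue());
    }
    return merged;
  }
}
{code}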
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177663#comment-14177663 ] Hadoop QA commented on YARN-2709: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675965/YARN-2709-102014-1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5472//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5472//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177675#comment-14177675 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675971/YARN-2701.5.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5473//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5473//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
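The stat()-then-mkdir() sequence quoted in the description is a classic check-then-create race: both localizers can pass the stat() check and one mkdir() then fails. The direction of the fix is to create unconditionally and treat an already-existing directory as success. The actual code lives in the native container-executor.c; the Java sketch below only illustrates the tolerate-existing pattern and is not the committed change.
{code}
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CreateAppDirSketch {

  /**
   * Race-free variant: create the directory unconditionally and treat an
   * existing directory as success, instead of stat()-then-mkdir(), which can
   * fail when two localizers race on the same appDir under /usercache.
   */
  public static Path ensureAppDir(String userCacheDir, String appId) throws IOException {
    Path appDir = Paths.get(userCacheDir, appId);
    try {
      Files.createDirectories(appDir); // no error if the directory already exists
    } catch (FileAlreadyExistsException e) {
      // A non-directory entry with the same name exists; this is a real error.
      throw new IOException(
          "Cannot create app dir, path exists and is not a directory: " + appDir, e);
    }
    return appDir;
  }
}
{code}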
[jira] [Updated] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2703: Attachment: YARN-2703.2.patch Fix testcase failures Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
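Because the same LogType can now legitimately appear several times in one aggregated file, recording when each copy was uploaded is what makes the listing readable. The sketch below shows one way such a per-file header could be written; the field names and timestamp format are illustrative assumptions, not the actual LogValue change.
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogHeaderSketch {

  /**
   * Prefix each aggregated log file with its type and the time it was
   * uploaded, so repeated uploads of the same LogType can be told apart.
   */
  public static String buildLogHeader(String logType, long uploadTimeMillis) {
    String uploadedAt =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(uploadTimeMillis));
    return "LogType: " + logType + System.lineSeparator()
        + "LogUploadedTime: " + uploadedAt + System.lineSeparator()
        + "LogContents:" + System.lineSeparator();
  }
}
{code}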
[jira] [Updated] (YARN-2692) ktutil test hanging on some machines/ktutil versions
[ https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-2692: Hadoop Flags: Reviewed +1 for the patch. I agree that we're not really losing any test coverage by removing this. {{TestSecureRegistry}} will make use of the same keytab file implicitly. ktutil test hanging on some machines/ktutil versions Key: YARN-2692 URL: https://issues.apache.org/jira/browse/YARN-2692 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2692-001.patch a couple of the registry security tests run native {{ktutil}}; this is primarily to debug the keytab generation. [~cnauroth] reports that some versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177712#comment-14177712 ] Hadoop QA commented on YARN-2694: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675967/YARN-2694-20141020-1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5471//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5471//console This message is automatically generated. Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
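Stated concretely, the restriction being enforced is: a ResourceRequest may carry at most one node label, and a node label expression is honoured only on the ANY resource name. The sketch below shows a request shaped to satisfy that rule; it assumes the YARN 2.6-era ResourceRequest API (newInstance, setNodeLabelExpression), so treat the exact signatures as assumptions.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class NodeLabelRequestSketch {

  /**
   * Build a request that respects the restriction described in YARN-2694:
   * a single node label, attached only to the ANY resource name.
   */
  public static ResourceRequest anyRequestWithLabel(Resource capability, int numContainers,
      String singleNodeLabel) {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(0), ResourceRequest.ANY, capability, numContainers);
    // Expressions combining multiple labels would be rejected under this restriction,
    // as would a label expression on any resource name other than ANY.
    req.setNodeLabelExpression(singleNodeLabel);
    return req;
  }
}
{code}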
[jira] [Updated] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2194: -- Attachment: YARN-2194-1.patch A preliminary patch that implements the systemd-based cpu resource isolation for Redhat 7. A summary: (1) Create a new resource handler SystemdLCEResourceHandler. Users can use this handler by configuring the field yarn.nodemanager.linux-container-executor.resources-handler.class. (2) For each container, create one slice and one scope. The scope is put inside the slice, and cpuShare isolation is also attached to the slice. All containers' slices are organized in a root slice (named hadoop_yarn.slice by default). Will add some testcases later. Add Cgroup support for RedHat 7 --- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. The systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
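The slice/scope layout described in the patch summary maps naturally onto a systemd-run invocation: a transient scope per container, parented under the root slice (hadoop_yarn.slice by default), with CPUShares carrying the cpu weight. The sketch below assembles such a command line; the flags follow standard systemd-run usage and are assumptions about the approach, not code from the attached patch.
{code}
import java.util.ArrayList;
import java.util.List;

public class SystemdScopeCommandSketch {

  /**
   * Assemble a systemd-run invocation that places a container's command into
   * its own transient scope under a parent slice, with a CPUShares weight.
   */
  public static List<String> buildScopeCommand(String parentSlice, String containerId,
      int cpuShares, List<String> containerCommand) {
    List<String> cmd = new ArrayList<String>();
    cmd.add("systemd-run");
    cmd.add("--scope");                              // transient scope unit
    cmd.add("--unit=" + containerId);                // e.g. the container id
    cmd.add("--slice=" + parentSlice);               // e.g. hadoop_yarn.slice
    cmd.add("--property=CPUShares=" + cpuShares);    // relative cpu weight
    cmd.addAll(containerCommand);                    // the actual container launch command
    return cmd;
  }
}
{code}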
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177734#comment-14177734 ] Hadoop QA commented on YARN-2703: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675982/YARN-2703.2.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5475//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5475//console This message is automatically generated. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.6.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177750#comment-14177750 ] Xuan Gong commented on YARN-2701: - Had some off line discussion with [~jianhan]. We think that for now, reverting the previous method changes might be the safest way to solve this issue. Uploaded a new patch to do it Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177752#comment-14177752 ] Hadoop QA commented on YARN-2715: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675975/YARN-2715.1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5474//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5474//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5474//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177753#comment-14177753 ] Xuan Gong commented on YARN-2161: - [~decster] [~aw] To fix YARN-2701, I need to revert the native code changes for mkdirs in container-executor.c Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177756#comment-14177756 ] Xuan Gong commented on YARN-2161: - The changes for mkdirs in container-executor.c introduce a race condition when two containers are trying to check and create the same directory at the same time. Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177761#comment-14177761 ] Xuan Gong commented on YARN-2701: - Sorry. Online discussion with [~jianhe]. And thanks for the review. [~zxu] Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177765#comment-14177765 ] Jian He commented on YARN-2701: --- since the previous method has been used/tested thoroughly, I also prefer reverting the patch for solving the problem for now. thanks [~zxu] for reviewing the previous patch ! Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.2.patch Fix the findbugs warning and the test failure Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177799#comment-14177799 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675996/YARN-2701.6.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5476//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5476//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177814#comment-14177814 ] Jian He commented on YARN-2701: --- +1 Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177823#comment-14177823 ] Beckham007 commented on YARN-2194: -- Do startSystemdSlice/stopSystemdSlice need root privilege? Should container-executor run sudo systemctl start ? Add Cgroup support for RedHat 7 --- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. The systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177834#comment-14177834 ] zhihai xu commented on YARN-2701: - thanks [~jianhe], The latest patch LGTM. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177859#comment-14177859 ] Hadoop QA commented on YARN-2715: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676000/YARN-2715.2.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5477//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5477//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177866#comment-14177866 ] Hudson commented on YARN-2701: -- FAILURE: Integrated in Hadoop-trunk-Commit #6297 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6297/]) YARN-2701. Potential race condition in startLocalizer when using LinuxContainerExecutor. Contributed by Xuan Gong (jianhe: rev 2839365f230165222f63129979ea82ada79ec56e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java Missing file for YARN-2701 (jianhe: rev 4fa1fb3193bf39fcb1bd7f8f8391a78f69c3c302) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockContainerLocalizer.java Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177886#comment-14177886 ] Rohith commented on YARN-2579: -- bq. Under what conditions, can resetDispatcher be called by two threads simultaneously? resetDispatcher is called only once, inside a synchronized block (transitionToStandby or transitionToActive). Here the problem is: *Thread-1:* if an RMFatalEvent is thrown just before active services are stopped in transitionToStandby(), the RMFatalEventDispatcher waits to obtain the lock held by transitionToStandby(); the RMFatalEventDispatcher is BLOCKED on transitionToStandby(). *Thread-2:* from the elector, transitionToStandby() stops the dispatcher in the resetDispatcher() method. (Service)Dispatcher.stop() waits to drain the pending RMFatalEventDispatcher event, but the AsyncDispatcher event handler is WAITING on the dispatcher thread to finish. Both RM's state is Active , but 1 RM is not really active. -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state displayed as Active, but one RM's ActiveServices were stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
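In other words, this is a lock-ordering deadlock: the elector thread holds the RM lock in transitionToStandby() and waits for the dispatcher to drain, while the event being drained is an RMFatalEvent whose handler needs that same lock. The stripped-down sketch below reproduces the shape of the hang; the class and method names are illustrative, not the actual ResourceManager code.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DispatcherDeadlockSketch {

  private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

  // Thread-2: the elector calls this and holds the object lock for its duration.
  public synchronized void transitionToStandby() throws InterruptedException {
    // stop active services ... then resetDispatcher():
    dispatcher.shutdown();
    // Waits for in-flight events to drain; but the in-flight RMFatalEvent
    // handler below is itself blocked trying to enter this synchronized method.
    dispatcher.awaitTermination(1, TimeUnit.MINUTES);
  }

  // Thread-1: an RMFatalEvent raised just before active services are stopped.
  public void onRMFatalEvent() {
    dispatcher.submit(new Runnable() {
      @Override
      public void run() {
        try {
          transitionToStandby(); // BLOCKED: the lock is held by the elector thread
        } catch (InterruptedException ignored) {
          Thread.currentThread().interrupt();
        }
      }
    });
  }
}
{code}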
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177898#comment-14177898 ] Zhijie Shen commented on YARN-2709: --- [~gtCarrera], thanks for the patch. Here are some comments. 1. Any reason why TimelineClientRetryOp, TimelineClientConnectionRetry and TimelineJerseyRetryFilter are not private? 2. Redundant reference. {code} TimelineClientImpl.this.connectionRetry.retryOn(jerseyRetryOp); {code} 3. Not sure why this code can't be put into run() directly? At least it shouldn't be public. {code} public Token<TimelineDelegationTokenIdentifier> getDelegationTokenInternal(final String renewer) throws IOException { {code} 4. It's safer to create connectionRetry before retryFilter, because retryFilter may invoke connectionRetry, though it won't actually in practice. {code} TimelineJerseyRetryFilter retryFilter = new TimelineJerseyRetryFilter(); client = new Client(new URLConnectionClientHandler( new TimelineURLConnectionFactory()), cc); token = new DelegationTokenAuthenticatedURL.Token(); client.addFilter(retryFilter); connectionRetry = new TimelineClientConnectionRetry(conf); {code} 5. Unnecessary import in TimelineClientImpl. 6. I believe the following mock would not normally be necessary. The reason why you want to tack this code on is HADOOP-11215: due to this issue, it will throw a cast exception here. Please leave a comment about the mock code below. {code} doThrow(new ConnectException("Connection refused")).when(client) .getDelegationTokenInternal(any(String.class)); {code} 7. It's not a meaningful renewer. You can use UserGroupInformation.getCurrentUser().getShortUserName() here. {code} Token<TimelineDelegationTokenIdentifier> token = client .getDelegationToken(http://localhost:8/resource?delegation=;); {code} Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
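For readers following the review, the core of the patch is wrapping getDelegationToken in a connection-retry helper so a secured client survives a timeline server that is not yet up. The sketch below is a generic retry-on-ConnectException loop in that spirit; it deliberately uses its own minimal names rather than the TimelineClientConnectionRetry/TimelineClientRetryOp classes from the actual patch.
{code}
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class ConnectionRetrySketch {

  private final int maxRetries;
  private final long retryIntervalMs;

  public ConnectionRetrySketch(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  /**
   * Run the operation, retrying while the server refuses connections, e.g.
   * getDelegationToken on a secured cluster where the timeline server is not up yet.
   */
  public <T> T retryOn(Callable<T> op) throws IOException {
    int attempt = 0;
    while (true) {
      try {
        return op.call();
      } catch (ConnectException e) {
        if (++attempt > maxRetries) {
          throw new IOException("Server unreachable after " + attempt + " attempts", e);
        }
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while retrying", ie);
        }
      } catch (Exception e) {
        throw new IOException(e); // non-retriable failure
      }
    }
  }
}
{code}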
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2709: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177902#comment-14177902 ] Jian He commented on YARN-1972: --- I merged this to branch-2.6 Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of the winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach to the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself is already dealing with this via a so-called 'classpath jar'; see HADOOP-8899 and YARN-316 for details. For the WCE localizer, launched as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. 
For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2. Dedicated high-privilege service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use this service to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service on the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
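The deployment requirements in the description reduce to two yarn-site.xml properties. A minimal programmatic equivalent is sketched below; the executor class and property names are taken from the description above, while the group name is a placeholder for whatever Windows security group the NM service principal belongs to.
{code}
import org.apache.hadoop.conf.Configuration;

public class WSCEConfigSketch {

  /** Set the two properties the WCE description above calls out. */
  public static Configuration windowsSecureExecutorConf() {
    Configuration conf = new Configuration();
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
    // Placeholder group: must be a Windows security group the NM service principal belongs to.
    conf.set("yarn.nodemanager.windows-secure-container-executor.group",
        "HadoopNodeManagers");
    return conf;
  }
}
{code}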
[jira] [Created] (YARN-2717) containerLogNotFound log shows multiple time for the same container
Xuan Gong created YARN-2717: --- Summary: containerLogNotFound log shows multiple time for the same container Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2717: Attachment: YARN-2717.1.patch trivial patch containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177919#comment-14177919 ] Hudson commented on YARN-1879: -- FAILURE: Integrated in Hadoop-trunk-Commit #6298 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6298/]) Missing file for YARN-1879 (jianhe: rev 4a78a752286effbf1a0d8695325f9d7464a09fb4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceProtocolOnHA.java Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177944#comment-14177944 ] Hadoop QA commented on YARN-2717: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676023/YARN-2717.1.patch against trunk revision 4a78a75. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5478//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5478//console This message is automatically generated. containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177959#comment-14177959 ] Zhijie Shen commented on YARN-2717: --- +1, will commit the patch containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2707) Potential null dereference in FSDownload
[ https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov reassigned YARN-2707: --- Assignee: Gera Shegalov Potential null dereference in FSDownload Key: YARN-2707 URL: https://issues.apache.org/jira/browse/YARN-2707 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Gera Shegalov Priority: Minor Here is related code in call(): {code} Pattern pattern = null; String p = resource.getPattern(); if (p != null) { pattern = Pattern.compile(p); } unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern); {code} In unpack(): {code} RunJar.unJar(localrsrc, dst, pattern); {code} unJar() would dereference the pattern without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
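One defensive way to address the report is to make sure unJar never receives a null pattern, falling back to a match-everything pattern when the resource does not specify one. The sketch below is such a guard, assuming the RunJar.unJar(File, File, Pattern) overload shown in the snippet; it is not the fix that was ultimately committed.
{code}
import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.util.RunJar;

public class UnpackSketch {

  private static final Pattern MATCH_ANY = Pattern.compile(".*");

  /**
   * Unpack the localized archive, substituting a match-all pattern when the
   * resource did not specify one, so RunJar.unJar never sees null.
   */
  public static void unpackJar(File localrsrc, File dst, String resourcePattern)
      throws IOException {
    Pattern pattern =
        resourcePattern == null ? MATCH_ANY : Pattern.compile(resourcePattern);
    RunJar.unJar(localrsrc, dst, pattern);
  }
}
{code}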
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177969#comment-14177969 ] Binglin Chang commented on YARN-2161: - Hi [~xgong], sorry for breaking the code. I see that in YARN-2701 you already had fix code but decided to revert it in the end to be safer; however, this breaks the mac build. How about using #ifdef to use the old code when compiling on glibc 2.10 (http://linux.die.net/man/2/openat) and your fixing code otherwise? Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177973#comment-14177973 ] Hudson commented on YARN-2717: -- FAILURE: Integrated in Hadoop-trunk-Commit #6299 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6299/]) YARN-2717. Avoided duplicate logging when container logs are not found. Contributed by Xuan Gong. (zjshen: rev 171f2376d23d51b61b9c9b3804ee86dbd4de033a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/CHANGES.txt containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)