[jira] [Created] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
Varun Vasudev created YARN-2711: --- Summary: TestDefaultContainerExecutor#testContainerLaunchError fails on Windows Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2711: Attachment: apache-yarn-2711.0.patch Patch with fix attached. TestDefaultContainerExecutor#testContainerLaunchError fails on Windows -- Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2711.0.patch The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
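For context on the failure above: the test hard-codes {{/bin/echo}}, which does not exist on Windows. A minimal sketch of a platform-aware choice of launch command is below; it assumes the existing {{Shell.WINDOWS}} flag, and the class name and concrete paths are illustrative only, not necessarily what apache-yarn-2711.0.patch does.
{code}
// Illustrative sketch only: choose a launch command that exists on the test platform
// instead of hard-coding /bin/echo. Shell.WINDOWS is an existing Hadoop utility flag;
// the class name and paths here are assumptions for the example.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Shell;

public class PlatformAwareLaunchCommand {
  static Path launchCommand() {
    // cmd.exe exists on Windows installs; /bin/echo exists on Unix-like systems.
    return Shell.WINDOWS
        ? new Path("C:/Windows/System32/cmd.exe")
        : new Path("/bin/echo");
  }

  public static void main(String[] args) {
    System.out.println("Using launch command: " + launchCommand());
  }
}
{code}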
[jira] [Commented] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14176746#comment-14176746 ] Hadoop QA commented on YARN-2711: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675786/apache-yarn-2711.0.patch against trunk revision da80c4d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5458//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5458//console This message is automatically generated. TestDefaultContainerExecutor#testContainerLaunchError fails on Windows -- Key: YARN-2711 URL: https://issues.apache.org/jira/browse/YARN-2711 Project: Hadoop YARN Issue Type: Bug Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-2711.0.patch The testContainerLaunchError test fails on Windows with the following error - {noformat} java.io.FileNotFoundException: File file:/bin/echo does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) at org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2691: - Attachment: YARN-2691.patch User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch Support for handling Application-Priority label coming from client to ApplicationSubmissionContext. Common api support for user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2010: --- Attachment: yarn-2010-3.patch Re-uploading the last patch, that has a single {{catch(Exception)}}. [~vinodkv] - would you still prefer having multiple catch-blocks, one for each exception. IMO, catching {{ConnectException}} doesn't seem very readable; we could add a comment on why we are adding that catch, but we might not be able to enumerate all possible cases. That said, I am okay with catching ConnectException and Exception separately. Please advise. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
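To make the trade-off discussed in the comment above concrete, here is a sketch contrasting a single catch-all with separate catch blocks during app recovery. The method and helper names are hypothetical stand-ins, not the actual RMAppManager code from the attached patch.
{code}
// Illustrative only: two catch strategies for per-app recovery failures.
import java.net.ConnectException;

public class RecoveryCatchStyles {

  // Style 1: one catch-all, as in the re-uploaded patch description. Any failure to
  // recover an app is recorded and the RM keeps transitioning to active.
  void recoverWithSingleCatch(String appId) {
    try {
      recoverApplication(appId);
    } catch (Exception e) {
      markAppFailedOnRecovery(appId, e);
    }
  }

  // Style 2: treat connectivity problems (likely transient) differently from
  // per-app recovery errors, so the RM can fall back to standby and retry later.
  void recoverWithSeparateCatches(String appId) throws ConnectException {
    try {
      recoverApplication(appId);
    } catch (ConnectException e) {
      throw e;  // transient store/HDFS connectivity issue: propagate
    } catch (Exception e) {
      markAppFailedOnRecovery(appId, e);  // app-specific failure: skip the app
    }
  }

  private void recoverApplication(String appId) throws Exception { /* ... */ }
  private void markAppFailedOnRecovery(String appId, Exception e) { /* ... */ }
}
{code}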
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177003#comment-14177003 ] Karthik Kambatla commented on YARN-2579: [~rohithsharma] - can you help me understand the issue here better? {{resetDispatcher}} is called either in transitionToStandby or transitionToActive, both of which are synchronized methods. Under what conditions can {{resetDispatcher}} be called by two threads simultaneously? Both RM's state is Active , but 1 RM is not really active. -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state displayed as Active, but one RM's ActiveServices were stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
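A minimal illustration of the point behind the question above: when both transition methods are synchronized on the same object, {{resetDispatcher}} cannot be entered concurrently through these paths. Class and method names here are illustrative, not the real ResourceManager code.
{code}
// Illustrative sketch: two synchronized methods on one object serialize access.
public class HaTransitions {
  public synchronized void transitionToActive() {
    resetDispatcher();
    // ... start active services ...
  }

  public synchronized void transitionToStandby() {
    resetDispatcher();
    // ... stop active services ...
  }

  private void resetDispatcher() {
    // Only reachable while holding the monitor taken by either synchronized caller.
  }
}
{code}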
[jira] [Commented] (YARN-2398) TestResourceTrackerOnHA crashes
[ https://issues.apache.org/jira/browse/YARN-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177087#comment-14177087 ] Wangda Tan commented on YARN-2398: -- [~ozawa], the log you attached will be resolved by YARN-2705; it's not the same as the original error: https://issues.apache.org/jira/browse/YARN-2398?focusedCommentId=14090771page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14090771 TestResourceTrackerOnHA crashes --- Key: YARN-2398 URL: https://issues.apache.org/jira/browse/YARN-2398 Project: Hadoop YARN Issue Type: Bug Reporter: Jason Lowe Attachments: TestResourceTrackerOnHA-output.txt TestResourceTrackerOnHA is currently crashing and failing trunk builds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan reopened YARN-2710: -- RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failures like the following can happen in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2710) RM HA tests failed intermittently on trunk
[ https://issues.apache.org/jira/browse/YARN-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177095#comment-14177095 ] Wangda Tan commented on YARN-2710: -- [~jianhe], [~ozawa], [~haosd...@gmail.com]: I just tried again on the latest trunk: running mvn clean test -Dtest=TestApplicationClientProtocolOnHA succeeds, but running -Dtest=TestResourceTrackerOnHA fails. Attached the log from running TestApplicationClientProtocolOnHA; even though it succeeded, the cannot-connect / EOF errors still show up. I guess it might be an issue caused by network configuration. And to [~ozawa], as I commented in YARN-2398, this is not the same as YARN-2398; reopening the ticket so people can report here if they hit the same problem. Thanks, Wangda RM HA tests failed intermittently on trunk -- Key: YARN-2710 URL: https://issues.apache.org/jira/browse/YARN-2710 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Wangda Tan Attachments: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA-output.txt Failures like the following can happen in TestApplicationClientProtocolOnHA, TestResourceTrackerOnHA, etc. {code} org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetApplicationAttemptsOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 9.491 sec ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy17.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationAttempts(ApplicationClientProtocolPBClientImpl.java:372) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getApplicationAttempts(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationAttempts(YarnClientImpl.java:583) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetApplicationAttemptsOnHA(TestApplicationClientProtocolOnHA.java:137) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177099#comment-14177099 ] Vinod Kumar Vavilapalli commented on YARN-2010: --- Sorry missed this. Lost context, so please help clarify. This time, we got a ConnectException to Zookeeper due to which we are skipping apps? That doesn't sound right either. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 
8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369) ... 13 more {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
Tsuyoshi OZAWA created YARN-2712: Summary: Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Test Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery only partially covers FairScheduler. We should add test cases for it. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2712) Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart
[ https://issues.apache.org/jira/browse/YARN-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2712: - Issue Type: Sub-task (was: Test) Parent: YARN-556 Adding tests about FSQueue and headroom of FairScheduler to TestWorkPreservingRMRestart --- Key: YARN-2712 URL: https://issues.apache.org/jira/browse/YARN-2712 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA TestWorkPreservingRMRestart#testSchedulerRecovery only partially covers FairScheduler. We should add test cases for it. {code} // Until YARN-1959 is resolved if (scheduler.getClass() != FairScheduler.class) { assertEquals(availableResources, schedulerAttempt.getHeadroom()); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2704) Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time
[ https://issues.apache.org/jira/browse/YARN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2704: -- Attachment: YARN-2704.1.patch Uploaded a patch: - make the RM automatically request an hdfs delegation token on behalf of the user if 1) the user doesn't provide a delegation token on app submission; or 2) the hdfs delegation token is about to expire within 10 hours. - NMs heartbeat with the RM to get the new tokens and use them for localization and log-aggregation. - A config is added to enable/disable this feature. - This approach also requires the NameNode to be configured with the RM as a proxy user. Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time -- Key: YARN-2704 URL: https://issues.apache.org/jira/browse/YARN-2704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2704.1.patch In secure mode, YARN requires the hdfs-delegation token to do localization and log aggregation on behalf of the user. But the hdfs delegation token will eventually expire after max-token-life-time. So, localization and log aggregation will fail after the token expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
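A sketch of the token-request decision described above, for readability; the helper shape and parameter names are hypothetical, and the 10-hour threshold is taken from the comment rather than from the patch itself.
{code}
// Illustrative sketch of the decision described in the comment, not the actual patch.
import java.util.concurrent.TimeUnit;

public class HdfsTokenPolicy {
  private static final long RENEW_AHEAD_MS = TimeUnit.HOURS.toMillis(10);

  /** Should the RM fetch an hdfs delegation token on behalf of the user? */
  boolean shouldRequestToken(boolean userProvidedToken, long tokenExpirationTime,
      long now, boolean featureEnabled) {
    if (!featureEnabled) {
      return false;          // feature can be switched off by configuration
    }
    if (!userProvidedToken) {
      return true;           // case 1: no token supplied at app submission
    }
    // case 2: the token is about to expire within the renew-ahead window
    return tokenExpirationTime - now <= RENEW_AHEAD_MS;
  }
}
{code}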
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177145#comment-14177145 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Thanks for the update, and sorry again for the delay :). The general approach looks very good to me; still reviewing tests and other details. One quick suggestion: you don't need to re-implement an ordered list, whose insertion time complexity is O(n); you can use Java's PriorityQueue or org.apache.hadoop.utils.PriorityQueue instead. Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177147#comment-14177147 ] Wangda Tan commented on YARN-2056: -- Oh, sorry, JIRA interpreted O\(n\) as O(n), which is not what I originally meant :-p Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt, YARN-2056.201410132225.txt, YARN-2056.201410141330.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
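For reference, a small example of the suggestion above: {{java.util.PriorityQueue}} gives O(log n) insertion and O(log n) removal of the head, so a hand-rolled sorted list with O(n) inserts is unnecessary. The {{TempQueue}} type and its ordering below are made up for the example.
{code}
// Illustrative example of ordering queues with java.util.PriorityQueue.
import java.util.Comparator;
import java.util.PriorityQueue;

public class PreemptionOrdering {
  static class TempQueue {
    final String name;
    final double overCapacity;  // how far the queue is over its ideal share
    TempQueue(String name, double overCapacity) {
      this.name = name;
      this.overCapacity = overCapacity;
    }
  }

  public static void main(String[] args) {
    PriorityQueue<TempQueue> mostOverCapacityFirst = new PriorityQueue<>(
        Comparator.comparingDouble((TempQueue q) -> q.overCapacity).reversed());

    mostOverCapacityFirst.add(new TempQueue("a", 0.10));
    mostOverCapacityFirst.add(new TempQueue("b", 0.35));
    mostOverCapacityFirst.add(new TempQueue("c", 0.20));

    // Polls in order of how far over capacity each queue is: b, c, a.
    while (!mostOverCapacityFirst.isEmpty()) {
      System.out.println(mostOverCapacityFirst.poll().name);
    }
  }
}
{code}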
[jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt
[ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177167#comment-14177167 ] Karthik Kambatla commented on YARN-2010: In this particular case, we are unable to renew HDFS delegation token due to ConnectException to HDFS. We are not yet clear why this happens. Even if this a transient HDFS issue, both RMs fail to transition to active and the individual RMActiveServices instances transition to STOPPED state. Any subsequent attempts to transition the RM to active fail because RMActiveServices is not INITED, as in the Standby case. I spent some more time thinking about this, and think there might be merit to catch exceptions separately. ConnectException hopefully is due to a transient issue, I don't think we can do much in case of a permanent issue. When we run into this, we should probably cleanly transition to standby, so subsequent attempts to transition to active may succeed. RM can't transition to active if it can't recover an app attempt Key: YARN-2010 URL: https://issues.apache.org/jira/browse/YARN-2010 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: bc Wong Assignee: Karthik Kambatla Priority: Critical Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch If the RM fails to recover an app attempt, it won't come up. We should make it more resilient. Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start. {noformat} 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116) ... 4 more Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265) ... 
5 more Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273) at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) ... 8 more Caused by: java.lang.IllegalArgumentException: Missing argument at javax.crypto.spec.SecretKeySpec.init(SecretKeySpec.java:93) at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188) at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711) at
[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster
[ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177169#comment-14177169 ] Wangda Tan commented on YARN-2314: -- [~jlowe], thanks for the update, the patch looks good to me, +1! [~rajesh.balamohan], thanks for your performance report based on this. 20-30ms is still a latency that cannot be totally ignored for interactive tasks. At least we have a way to cache connections via a configuration option in this patch. Wangda ContainerManagementProtocolProxy can create thousands of threads for a large cluster Key: YARN-2314 URL: https://issues.apache.org/jira/browse/YARN-2314 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.1.0-beta Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Attachments: YARN-2314.patch, YARN-2314v2.patch, disable-cm-proxy-cache.patch, nmproxycachefix.prototype.patch, tez-yarn-2314.xlsx ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable. However the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
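For readers following the proxy-cache discussion, here is a sketch of a size-bounded, least-recently-used cache of the kind a configurable proxy cache implies. It is not the actual ContainerManagementProtocolProxy code; the proxy type and class name are placeholders.
{code}
// Illustrative sketch: bound the number of cached NM proxies with an LRU map.
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedProxyCache<P> {
  private final Map<String, P> cache;

  public BoundedProxyCache(final int maxSize) {
    // accessOrder=true makes the LinkedHashMap behave as an LRU structure.
    this.cache = new LinkedHashMap<String, P>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, P> eldest) {
        // The real code would also close the evicted connection here so that
        // thread counts stay bounded on large clusters.
        return size() > maxSize;
      }
    };
  }

  public synchronized P get(String nmAddress) {
    return cache.get(nmAddress);
  }

  public synchronized void put(String nmAddress, P proxy) {
    cache.put(nmAddress, proxy);
  }
}
{code}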
[jira] [Updated] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2703: Attachment: YARN-2703.1.patch Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times, but we cannot figure out which logs came first. We could add an extra logUploadedTime to give users a better understanding of the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177185#comment-14177185 ] Xuan Gong commented on YARN-2703: - Add logUploadedTime into LogValue context for better display. This patch is based on YARN-2582. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times, but we cannot figure out which logs came first. We could add an extra logUploadedTime to give users a better understanding of the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
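As a rough illustration of the idea in the description, the aggregated log writer could record an upload timestamp alongside each log type so repeated uploads of the same file can be ordered on display. The header format below is hypothetical, not the actual LogValue change in YARN-2703.1.patch.
{code}
// Illustrative sketch: write an upload timestamp with each aggregated log entry.
import java.io.DataOutputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogUploadHeader {
  static void writeHeader(DataOutputStream out, String logType, long fileLength)
      throws IOException {
    String uploadedTime =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss Z").format(new Date());
    out.writeBytes("LogType: " + logType + "\n");
    out.writeBytes("LogUploadedTime: " + uploadedTime + "\n");
    out.writeBytes("LogLength: " + fileLength + "\n");
    out.writeBytes("Log Contents:\n");
  }
}
{code}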
[jira] [Created] (YARN-2713) Broken RM Home link in NM Web UI when RM HA is enabled
Karthik Kambatla created YARN-2713: -- Summary: Broken RM Home link in NM Web UI when RM HA is enabled Key: YARN-2713 URL: https://issues.apache.org/jira/browse/YARN-2713 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla When RM HA is enabled, the 'RM Home' link in the NM WebUI is broken. It points to the NM-host:RM-port instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.6.patch Given that we already broken compatibility for rolling upgrades, the patch should be fine in that sense. Updated the patch against latest trunk. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: YARN-2673-102014.patch Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
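For context on what a retry mechanism here can look like, below is a generic bounded-retry sketch around a put call: retry while the timeline server is unreachable, up to a configurable number of attempts with a fixed interval. Names and defaults are illustrative, not the actual TimelineClientImpl change in the attached patch.
{code}
// Illustrative sketch of bounded retries for a timeline put operation.
import java.io.IOException;
import java.net.ConnectException;

public class RetryingPut {
  private final int maxRetries;
  private final long retryIntervalMs;

  public RetryingPut(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  interface PutOperation {
    void run() throws IOException;
  }

  void putWithRetries(PutOperation op) throws IOException, InterruptedException {
    int attempt = 0;
    while (true) {
      try {
        op.run();
        return;
      } catch (ConnectException e) {
        // Server likely down or restarting; retry until the budget is exhausted.
        if (++attempt > maxRetries) {
          throw e;
        }
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}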
[jira] [Updated] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2673: Attachment: (was: YARN-2673-101914.patch) Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177274#comment-14177274 ] Zhijie Shen commented on YARN-2582: --- Looks good to me overall. Just one nit: make the following methods static? {code} private void containerLogNotFound(String containerId) { System.out.println("Logs for container " + containerId + " are not present in this log-file."); } private void logDirNotExist(String remoteAppLogDir) { System.out.println(remoteAppLogDir + " does not exist."); System.out.println("Log aggregation has not completed or is not enabled."); } private void emptyLogDir(String remoteAppLogDir) { System.out.println(remoteAppLogDir + " does not have any log files."); } {code} Same for {code} private void createContainerLogInLocalDir(Path appLogsDir, ContainerId containerId, FileSystem fs) throws Exception { {code} and {code} private void uploadContainerLogIntoRemoteDir(UserGroupInformation ugi, Configuration configuration, List<String> rootLogDirs, NodeId nodeId, ContainerId containerId, Path appDir, FileSystem fs) throws Exception { {code} Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177296#comment-14177296 ] Xuan Gong commented on YARN-2582: - Thanks for the review. Uploaded a new patch to address all the comments Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2582: Attachment: YARN-2582.3.patch Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2704) Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time
[ https://issues.apache.org/jira/browse/YARN-2704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177299#comment-14177299 ] Hadoop QA commented on YARN-2704: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675875/YARN-2704.1.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1269 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 2 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5460//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5460//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5460//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5460//console This message is automatically generated. Localization and log-aggregation will fail if hdfs delegation token expired after token-max-life-time -- Key: YARN-2704 URL: https://issues.apache.org/jira/browse/YARN-2704 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He Attachments: YARN-2704.1.patch In secure mode, YARN requires the hdfs-delegation token to do localization and log aggregation on behalf of the user. But the hdfs delegation token will eventually expire after max-token-life-time. So, localization and log aggregation will fail after the token expires. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2690: Attachment: YARN-2690.002.patch Done. I had kept it that way to make it easier to review and was planning to move them in a later patch. But it belongs logically here, so updated. Testing was done on the reservation unit tests and testReservationApis. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2691) User level API support for priority label
[ https://issues.apache.org/jira/browse/YARN-2691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177302#comment-14177302 ] Sunil G commented on YARN-2691: --- A quick nit: ApplicationPriority can be comparable. This will help later for comparison and error checking. User level API support for priority label - Key: YARN-2691 URL: https://issues.apache.org/jira/browse/YARN-2691 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Sunil G Assignee: Rohith Attachments: YARN-2691.patch Support for handling Application-Priority label coming from client to ApplicationSubmissionContext. Common api support for user. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
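A small sketch of the nit above: a priority type implementing Comparable so callers can compare and validate priorities directly. This is a hypothetical shape, not the actual API in the attached patch.
{code}
// Illustrative sketch of a comparable application priority.
public class ApplicationPrioritySketch
    implements Comparable<ApplicationPrioritySketch> {
  private final int priority;

  public ApplicationPrioritySketch(int priority) {
    if (priority < 0) {
      throw new IllegalArgumentException("priority must be non-negative");
    }
    this.priority = priority;
  }

  public int getPriority() {
    return priority;
  }

  @Override
  public int compareTo(ApplicationPrioritySketch other) {
    return Integer.compare(this.priority, other.priority);
  }
}
{code}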
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177318#comment-14177318 ] Hadoop QA commented on YARN-2673: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675889/YARN-2673-102014.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5462//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5462//console This message is automatically generated. Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177322#comment-14177322 ] Xuan Gong commented on YARN-2701: - [~zxu] Thanks for the feedback. bq. Do we need to check the directory permission? I think we do. We need to make sure the directory has the right permission. bq. If we want to check the permission, can we change the permission if it doesn't match? I do not think we need to do that. If we really wanted to, just changing the permission would not be enough; we would need to go through all the sub-directories and do the necessary checks, and that does not sound easy. I am thinking that we just keep it this way (check but do not change the permission). If we have further requirements, we can spend more time investigating. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch When using LinuxContainerExecutor to do startLocalizer, we use native code in container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We use a check-then-create approach to create the appDir under /usercache, but if two containers try to do this at the same time, a race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
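Since the race above comes from a check-then-create (stat then mkdir) sequence, one common pattern is to attempt the create and tolerate "already exists", then verify permissions. The actual fix would live in the native container-executor.c (for example by treating EEXIST from mkdir() as success); the Java sketch below is only an analogue to illustrate the pattern and is Unix/POSIX-oriented.
{code}
// Illustrative Java analogue of avoiding the stat-then-mkdir race: create first,
// tolerate "already exists", then verify. Not the actual container-executor.c fix.
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class RaceFreeAppDir {
  static void createAppDir(String dir) throws IOException {
    Path path = Paths.get(dir);
    Set<PosixFilePermission> perms = PosixFilePermissions.fromString("rwxr-x---");
    try {
      Files.createDirectory(path, PosixFilePermissions.asFileAttribute(perms));
    } catch (FileAlreadyExistsException e) {
      // Another container created it first; fall through and verify below.
    }
    // Verify the directory ended up with the expected permissions.
    if (!Files.getPosixFilePermissions(path).equals(perms)) {
      throw new IOException("Unexpected permissions on " + path);
    }
  }
}
{code}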
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177326#comment-14177326 ] Zhijie Shen commented on YARN-2673: --- +1 will commit the patch Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2714) Localizer thread might stuck if NM is OOM
Ming Ma created YARN-2714: - Summary: Localizer thread might stuck if NM is OOM Key: YARN-2714 URL: https://issues.apache.org/jira/browse/YARN-2714 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma When the NM JVM runs out of memory, it is normally an uncaught exception and the process will exit. But the RPC server used by the node manager catches OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit and can recover from the OutOfMemoryError situation. However, in some rare situations when this happens, one of the NM localizer threads didn't get the RPC response from the node manager and just waited there. The reason the node manager RPC server doesn't respond is that the RPC server responder thread swallowed the OutOfMemoryError and didn't process the outstanding RPC response. On the RPC client side, the RPC timeout is set to 0 and it relies on ping to detect RPC server availability. {noformat} Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383): State: WAITING Blocked count: 27 Waited count: 84 Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3 Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:503) org.apache.hadoop.ipc.Client.call(Client.java:1396) org.apache.hadoop.ipc.Client.call(Client.java:1363) org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) com.sun.proxy.$Proxy36.heartbeat(Unknown Source) org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995) {noformat} The consequence of this depends on which ContainerExecutor the NM uses. If it uses DefaultContainerExecutor, given that its startLocalizer method is synchronized, it will block other localizer threads. If it uses LinuxContainerExecutor, at least other localizer threads can still proceed, but in theory it can slowly drain all available localizer threads. There are a couple of ways to fix it; some of these fixes are complementary. 1. Fix it at the hadoop-common layer. It seems the RPC server hosted by worker services such as the NM doesn't really need to catch OutOfMemoryError; the service JVM can just exit. Even for the NN and RM, given we have HA, it might be ok to do so. 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients will time out if the RPC server drops the response. 3. Fix it in the YARN localization service. For example, a) fix DefaultContainerExecutor so that synchronization isn't required for the startLocalizer method; b) the download executor thread used by ContainerLocalizer currently catches all exceptions; we can fix ContainerLocalizer so that when the download executor thread catches OutOfMemoryError, it exits its host process. IMHO, fixing it at the RPC server layer is better as it addresses other scenarios. Appreciate any input others might have. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
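On mitigation 2 above (giving YARN RPC clients a finite timeout instead of relying only on ping): a minimal sketch is below. It assumes the Hadoop-common client timeout key ipc.client.rpc-timeout.ms; whether that key is honoured on this particular code path in the version in question is an assumption that would need verifying.
{code}
// Illustrative sketch: configure a finite client-side RPC timeout. The key name is
// the Hadoop-common setting; treat its applicability on this path as an assumption.
import org.apache.hadoop.conf.Configuration;

public class RpcTimeoutConfig {
  public static Configuration withRpcTimeout(Configuration conf, int timeoutMs) {
    // 0 (the value in the scenario above) means "no timeout, rely on ping".
    conf.setInt("ipc.client.rpc-timeout.ms", timeoutMs);
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = withRpcTimeout(new Configuration(), 60000);
    System.out.println("RPC timeout: "
        + conf.getInt("ipc.client.rpc-timeout.ms", 0) + " ms");
  }
}
{code}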
[jira] [Commented] (YARN-2673) Add retry for timeline client put APIs
[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177357#comment-14177357 ] Hudson commented on YARN-2673: -- FAILURE: Integrated in Hadoop-trunk-Commit #6293 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6293/]) YARN-2673. Made timeline client put APIs retry if ConnectException happens. Contributed by Li Lu. (zjshen: rev 89427419a3c5eaab0f73bae98d675979b9efab5f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Add retry for timeline client put APIs -- Key: YARN-2673 URL: https://issues.apache.org/jira/browse/YARN-2673 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, YARN-2673-101414.patch, YARN-2673-101714.patch, YARN-2673-102014.patch Timeline client now does not handle the case gracefully when the server is down. Jobs from distributed shell may fail due to ATS restart. We may need to add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177366#comment-14177366 ] Hadoop QA commented on YARN-2582: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675907/YARN-2582.3.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5463//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5463//console This message is automatically generated. Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177383#comment-14177383 ] Hadoop QA commented on YARN-2209: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675884/YARN-2209.6.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1288 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5461//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5461//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5461//console This message is automatically generated. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177387#comment-14177387 ] Hadoop QA commented on YARN-2690: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675908/YARN-2690.002.patch against trunk revision d5084b9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The test build failed in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5464//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5464//console This message is automatically generated. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177392#comment-14177392 ] Zhijie Shen commented on YARN-2582: --- +1 for the last patch. Will commit it. Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2690: Attachment: YARN-2690.002.patch Uploading again to kick Jenkins. The previous failures were bind-related issues, seemingly unrelated to this patch. Reran the tests and they passed locally. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abin Shahab updated YARN-1964: -- Attachment: YARN-1964.patch This patch simplifies the use case by exposing only one Docker configuration param: the image. Now the user must configure the image completely so that all required resources and environment variables are defined in the image. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.6.patch Fixed test failures Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2582) Log related CLI and Web UI changes for Aggregated Logs in LRS
[ https://issues.apache.org/jira/browse/YARN-2582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177464#comment-14177464 ] Hudson commented on YARN-2582: -- FAILURE: Integrated in Hadoop-trunk-Commit #6294 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6294/]) YARN-2582. Fixed Log CLI and Web UI for showing aggregated logs of LRS. Contributed Xuan Gong. (zjshen: rev e90718fa5a0e7c18592af61534668acebb9db51b) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogAggregationUtils.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/logaggregation/TestAggregatedLogsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/log/AggregatedLogsBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/cli/TestLogsCLI.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/LogsCLI.java Log related CLI and Web UI changes for Aggregated Logs in LRS - Key: YARN-2582 URL: https://issues.apache.org/jira/browse/YARN-2582 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2582.1.patch, YARN-2582.2.patch, YARN-2582.3.patch After YARN-2468, we have change the log layout to support log aggregation for Long Running Service. Log CLI and related Web UI should be modified accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
Zhijie Shen created YARN-2715: - Summary: Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177477#comment-14177477 ] zhihai xu commented on YARN-2701: - Hi [~xgong], thanks for the detailed explanation. The explanation sounds reasonable to me. Some nits:
1. Since we only check permission for the final directory component, I think we also need to check finalComponent in the first call to check_permission. Change
{code}
} else if (check_permission(sb.st_mode, perm) == -1) {
{code}
to
{code}
} else if (finalComponent == 1 && check_permission(sb.st_mode, perm) == -1) {
{code}
2. Can we create a new function check_dir to remove the duplicate code which verifies the existing directory in two places? We can also remove the function check_permission by moving the check_permission code into check_dir. This is the check_dir function:
{code}
int check_dir(char* npath, mode_t st_mode, mode_t desired, int finalComponent) {
  // Check whether it is a directory
  if (!S_ISDIR(st_mode)) {
    fprintf(LOGFILE, "Path %s is file not dir\n", npath);
    return -1;
  } else if (finalComponent == 1) {
    int filePermInt = st_mode & (S_IRWXU | S_IRWXG | S_IRWXO);
    int desiredInt = desired & (S_IRWXU | S_IRWXG | S_IRWXO);
    if (filePermInt != desiredInt) {
      fprintf(LOGFILE, "Path %s does not have desired permission.\n", npath);
      return -1;
    }
  }
  return 0;
}
{code}
3. Can we move free(npath); from create_validate_dirs to mkdirs? It is better to free the memory in the same function (mkdirs) that allocated it. In mkdirs:
{code}
if (create_validate_dirs(npath, perm, path, 0) == -1) {
  free(npath);
  return -1;
}
{code}
4. A little more optimization to remove redundant code: we can merge the two pieces of code around fprintf(LOGFILE, "Can't create directory %s in %s - %s\n", npath, path, strerror(errno)); by checking if (errno != EEXIST || stat(npath, &sb) != 0). The code after the change will look like the following:
{code}
int create_validate_dir(char* npath, mode_t perm, char* path, int finalComponent) {
  struct stat sb;
  if (stat(npath, &sb) != 0) {
    if (mkdir(npath, perm) != 0) {
      if (errno != EEXIST || stat(npath, &sb) != 0) {
        fprintf(LOGFILE, "Can't create directory %s in %s - %s\n", npath, path, strerror(errno));
        return -1;
      }
      // The directory npath should exist.
      if (check_dir(npath, sb.st_mode, perm, finalComponent) == -1) {
        return -1;
      }
    }
  } else if (check_dir(npath, sb.st_mode, perm, finalComponent) == -1) {
    return -1;
  }
  return 0;
}
{code}
5. Can we change the name create_validate_dirs to create_validate_dir, since we only create one directory in create_validate_dirs? Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102014.patch I've done a patch for this issue. In this patch, I refactored the retry logic in the Jersey retry filter and built a more generalized retry wrapper for the timeline client. Both the Jersey retry filter and the delegation token call can use this wrapper to retry according to the retry settings (added in YARN-2673). To use the retry wrapper, the user only needs to implement a TimelineClientRetryOp, providing a) the operation that should be retried and b) a verifier that tells, for a given exception e, whether a retry should happen. I've also added a unit test for retries on getting a delegation token. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
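To make the abstraction described above concrete, a rough sketch of such a wrapper is shown below. Only the TimelineClientRetryOp name comes from the comment; the driver class, method names, and signatures are illustrative guesses, not the code in the attached patch.
{code}
import java.io.IOException;

// Rough shape of the retry abstraction described in the comment above.
// Only TimelineClientRetryOp is named in the JIRA; everything else here
// (RetryDriver, field names, signatures) is illustrative.
abstract class TimelineClientRetryOp {
  // a) the operation that should be retried
  abstract Object run() throws IOException;

  // b) a verifier telling, for a given exception, whether to retry
  abstract boolean shouldRetryOn(Exception e);
}

final class RetryDriver {
  private final int maxRetries;       // retry settings added in YARN-2673
  private final long retryIntervalMs;

  RetryDriver(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  Object retryOn(TimelineClientRetryOp op) throws IOException {
    int attempts = 0;
    while (true) {
      try {
        return op.run();
      } catch (Exception e) {
        boolean retriesLeft = maxRetries < 0 || attempts < maxRetries;
        if (!op.shouldRetryOn(e) || !retriesLeft) {
          if (e instanceof IOException) {
            throw (IOException) e;
          }
          throw new RuntimeException(e);
        }
        attempts++;
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Retry interrupted", ie);
        }
      }
    }
  }
}
{code}
With this shape, both the Jersey connect-retry path and the getDelegationToken call mentioned in the comment could be expressed as small TimelineClientRetryOp implementations.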
[jira] [Commented] (YARN-1964) Create Docker analog of the LinuxContainerExecutor in YARN
[ https://issues.apache.org/jira/browse/YARN-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177507#comment-14177507 ] Hadoop QA commented on YARN-1964: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675937/YARN-1964.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5466//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5466//console This message is automatically generated. Create Docker analog of the LinuxContainerExecutor in YARN -- Key: YARN-1964 URL: https://issues.apache.org/jira/browse/YARN-1964 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.2.0 Reporter: Arun C Murthy Assignee: Abin Shahab Attachments: YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, YARN-1964.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-branch-2.2.0-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch, yarn-1964-docker.patch Docker (https://www.docker.io/) is, increasingly, a very popular container technology. In context of YARN, the support for Docker will provide a very elegant solution to allow applications to *package* their software into a Docker container (entire Linux file system incl. custom versions of perl, python etc.) and use it as a blueprint to launch all their YARN containers with requisite software environment. This provides both consistency (all YARN containers will have the same software environment) and isolation (no interference with whatever is installed on the physical machine). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177527#comment-14177527 ] Hadoop QA commented on YARN-2690: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675930/YARN-2690.002.patch against trunk revision 8942741. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5465//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5465//console This message is automatically generated. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li updated YARN-2556: --- Attachment: yarn2556_wip.patch Thanks [~airbots] for the substantial early work! I have moved the test job into the mapreduce jobclient tests to avoid a circular dependency. I have tested the patch, and it has successfully shown the write time, write counters, and writes per second. I will continue to work on it to add more measurement metrics such as transaction rates, IO rates, and memory usage. Tool to measure the performance of the timeline server -- Key: YARN-2556 URL: https://issues.apache.org/jira/browse/YARN-2556 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Jonathan Eagles Assignee: chang li Attachments: YARN-2556-WIP.patch, yarn2556_wip.patch We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity. I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second, I/O for both read and write would be a good start. This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
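As a rough illustration of the kind of numbers the comment mentions (write time and writes per second), a minimal timing loop could look like the sketch below. The Writer interface and writeEntity call are placeholders, not the actual jobclient test code.
{code}
import java.util.concurrent.TimeUnit;

/**
 * Trivial sketch of measuring timeline write throughput: time N writes
 * and report total time and writes per second. The writeEntity() call is
 * a placeholder for whatever put operation the benchmark job performs.
 */
public final class TimelineWriteBenchSketch {
  interface Writer {
    void writeEntity(int seq) throws Exception;
  }

  public static void run(Writer writer, int numWrites) throws Exception {
    long start = System.nanoTime();
    for (int i = 0; i < numWrites; i++) {
      writer.writeEntity(i);
    }
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    double writesPerSecond = elapsedMs == 0
        ? numWrites : numWrites * 1000.0 / elapsedMs;
    System.out.printf("wrote %d entities in %d ms (%.1f writes/sec)%n",
        numWrites, elapsedMs, writesPerSecond);
  }
}
{code}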
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177546#comment-14177546 ] Hadoop QA commented on YARN-2709: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675947/YARN-2709-102014.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1267 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5469//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5469//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5469//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2714) Localizer thread might get stuck if NM is OOM
[ https://issues.apache.org/jira/browse/YARN-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177551#comment-14177551 ] zhihai xu commented on YARN-2714: - YARN-2578 will address item 2, which tries to fix the RPC to set a 1-minute timeout. For me, 3.b would be a good low-risk fix. Also, 3.a would be a good optimization. Localizer thread might get stuck if NM is OOM - Key: YARN-2714 URL: https://issues.apache.org/jira/browse/YARN-2714 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma When the NM JVM runs out of memory, it is normally an uncaught exception and the process will exit. But the RPC server used by the node manager catches OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit and can recover from the OutOfMemoryError situation. However, in some rare situations when this happens, one of the NM localizer threads didn't get the RPC response from the node manager and just waited there. The reason the node manager RPC server doesn't respond is that the RPC server responder thread swallowed the OutOfMemoryError and didn't process the outstanding RPC response. On the RPC client side, the RPC timeout is set to 0 and it relies on ping to detect RPC server availability. {noformat} Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383): State: WAITING Blocked count: 27 Waited count: 84 Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3 Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:503) org.apache.hadoop.ipc.Client.call(Client.java:1396) org.apache.hadoop.ipc.Client.call(Client.java:1363) org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) com.sun.proxy.$Proxy36.heartbeat(Unknown Source) org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169) org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107) org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995) {noformat} The consequence of this depends on which ContainerExecutor the NM uses. If it uses DefaultContainerExecutor, given its startLocalizer method is synchronized, it will block other localizer threads. If you use LinuxContainerExecutor, at least other localizer threads can still proceed, but in theory it can slowly drain all available localizer threads. There are a couple of ways to fix it. Some of these fixes are complementary. 1. Fix it at the hadoop-common layer. It seems the RPC server hosted by worker services such as the NM doesn't really need to catch OutOfMemoryError; the service JVM can just exit. Even for the NN and RM, given we have HA, it might be ok to do so. 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients will time out if the RPC server drops the response. 3. Fix it in the YARN localization service. For example, a) fix DefaultContainerExecutor so that synchronization isn't required for the startLocalizer method; b) the Download executor thread used by ContainerLocalizer currently catches any exception;
we can fix ContainerLocalizer so that when the Download executor thread catches OutOfMemoryError, it can exit its host process. IMHO, fixing it at the RPC server layer is better as it addresses other scenarios. Appreciate any input others might have. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177550#comment-14177550 ] Hadoop QA commented on YARN-2703: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675881/YARN-2703.1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.logaggregation.TestAggregatedLogFormat org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5468//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5468//console This message is automatically generated. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177562#comment-14177562 ] Xuan Gong commented on YARN-2701: - addressed all comments Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.4.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177568#comment-14177568 ] Vinod Kumar Vavilapalli commented on YARN-2715: --- bq. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. This is getting complex. I propose the following: - Have a single yarn.resourcemanager.proxyuser.* prefix - Change both YARN RM RPC server and webapps to use the above prefix if explictly configured. Otherwise, fall back to the common configs. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177584#comment-14177584 ] Hadoop QA commented on YARN-2209: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675943/YARN-2209.6.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1283 javac compiler warnings (more than the trunk's current 1266 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5467//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5467//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5467//console This message is automatically generated. Replace AM resync/shutdown command with corresponding exceptions Key: YARN-2209 URL: https://issues.apache.org/jira/browse/YARN-2209 Project: Hadoop YARN Issue Type: Improvement Reporter: Jian He Assignee: Jian He Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch, YARN-2209.5.patch, YARN-2209.6.patch, YARN-2209.6.patch YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate application to re-register on RM restart. we should do the same for AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177601#comment-14177601 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675954/YARN-2701.4.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5470//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5470//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2709: Attachment: YARN-2709-102014-1.patch Added a tag to suppress the warnings when getting the delegation token. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2694: - Attachment: YARN-2694-20141020-1.patch Attached ver.1 patch and kicked Jenkins. Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch Currently, node label expression support in the capacity scheduler is only partially completed. A node label expression specified in a ResourceRequest is only respected when it is specified at the ANY level, and a ResourceRequest with multiple node labels makes user limit computation tricky. For now we need to temporarily disable them; changes include: - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
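For readers wanting the restriction in code form, a sanity check along the lines below would enforce the two rules in the summary: at most one node label per resource request, and a label expression only when the resource name is ANY. The class, method, exception type, and the assumption that multiple labels are joined with "&&" are illustrative, not the code in the attached patch.
{code}
/**
 * Illustrative validation of the two rules described above. Names are
 * made up for the sketch and do not match the patch; "&&" as the label
 * separator is an assumption.
 */
public final class NodeLabelRequestCheck {
  private static final String ANY = "*";

  public static void validate(String resourceName, String labelExpression) {
    if (labelExpression == null || labelExpression.trim().isEmpty()) {
      return; // no label expression, nothing to check
    }
    if (!ANY.equals(resourceName)) {
      throw new IllegalArgumentException(
          "Node label expression is only supported when resourceName is ANY");
    }
    if (labelExpression.contains("&&")) {
      throw new IllegalArgumentException(
          "Only a single node label may be specified in a resource request");
    }
  }
}
{code}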
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177627#comment-14177627 ] zhihai xu commented on YARN-2701: - thanks [~xgong], the latest patch looks mostly good to me. Just one typo in function check_dir: return 0; should be outside the inner }. Change {code} return -1; } return 0; } } {code} to {code} return -1; } } return 0; } {code} Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177631#comment-14177631 ] Zhijie Shen commented on YARN-2715: --- Vinod, thanks for the comments. It makes sense to me. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.5.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177636#comment-14177636 ] Xuan Gong commented on YARN-2701: - Good catch Fixed Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
Jian He created YARN-2716: - Summary: Refactor ZKRMStateStore retry code with Apache Curator Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
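For context, Curator attaches a retry policy to the client itself, which is what makes the hand-rolled retry loops in ZKRMStateStore unnecessary. A minimal example of the pattern follows; the connect string, path, and retry values are placeholders, not anything from the eventual patch.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class CuratorRetryExample {
  public static void main(String[] args) throws Exception {
    // Retries are handled by the policy; callers just issue operations.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181",                      // placeholder connect string
        new ExponentialBackoffRetry(1000, 3)); // base sleep 1s, 3 retries
    client.start();
    try {
      client.create().creatingParentsIfNeeded()
          .forPath("/rmstore/example", "state".getBytes("UTF-8"));
      byte[] data = client.getData().forPath("/rmstore/example");
      System.out.println(new String(data, "UTF-8"));
    } finally {
      client.close();
    }
  }
}
{code}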
[jira] [Assigned] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter reassigned YARN-2716: --- Assignee: Robert Kanter Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Robert Kanter Per suggestion by [~kasha] in YARN-2131, it's nice to use curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2690) Make ReservationSystem and its dependent classes independent of Scheduler type
[ https://issues.apache.org/jira/browse/YARN-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177647#comment-14177647 ] Subru Krishnan commented on YARN-2690: -- Thanks [~adhoot] for updating the patch. +1 from my side. Couple of minor nits: * We could have a protected _ReservationSchedulerConfiguration_ variable in _AbstractReservationSystem_ to avoid invoking _ReservationSchedulerConfiguration reservationConfig = getReservationSchedulerConfiguration()_ everywhere. * It'll be good to have some Javadocs for _ReservationSchedulerConfiguration_ describing what the reservation system configs are. Make ReservationSystem and its dependent classes independent of Scheduler type Key: YARN-2690 URL: https://issues.apache.org/jira/browse/YARN-2690 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2690.001.patch, YARN-2690.002.patch, YARN-2690.002.patch A lot of common reservation classes depend on CapacityScheduler and specifically its configuration. This jira is to make them ready for other Schedulers by abstracting out the configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177655#comment-14177655 ] zhihai xu commented on YARN-2701: - thanks [~xgong], The latest patch LGTM. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.1.patch Made a patch with the following changes: 1. Use yarn.resourcemanager.proxyuser instead of yarn.resourcemanager.webapp.proxyuser as the RM proxy user prefix for both RPC and HTTP channel. 2. Before setting ProxyUsers#sip, use yarn.resourcemanager.proxyuser to overwrite hadoop.proxyuser configurations if it exists. 3. Always read configurations with hadoop.proxyuser for consistency. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
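Point 2 of the patch description above can be sketched with the plain Configuration API: copy every yarn.resourcemanager.proxyuser.* entry onto the corresponding hadoop.proxyuser.* key before ProxyUsers is refreshed, so both the RPC and HTTP channels see one consistent view. The class and method below are an illustration of that approach, not the attached patch.
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

public final class RmProxyUserOverrideSketch {
  private static final String RM_PREFIX = "yarn.resourcemanager.proxyuser.";
  private static final String COMMON_PREFIX = "hadoop.proxyuser.";

  /**
   * Copies RM-specific proxyuser settings over the common hadoop.proxyuser
   * ones. Sketch only; the real patch wires this into RM startup.
   */
  public static Configuration overlayRmProxyUsers(Configuration conf) {
    Map<String, String> overrides = new HashMap<String, String>();
    for (Map.Entry<String, String> entry : conf) {
      String key = entry.getKey();
      if (key.startsWith(RM_PREFIX)) {
        overrides.put(COMMON_PREFIX + key.substring(RM_PREFIX.length()),
            entry.getValue());
      }
    }
    Configuration merged = new Configuration(conf);
    for (Map.Entry<String, String> e : overrides.entrySet()) {
      merged.set(e.getKey(), e.getValue());
    }
    return merged;
  }
}
{code}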
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177663#comment-14177663 ] Hadoop QA commented on YARN-2709: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675965/YARN-2709-102014-1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5472//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5472//console This message is automatically generated. Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177675#comment-14177675 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675971/YARN-2701.5.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5473//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5473//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
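The stat()-then-mkdir() sequence quoted in the description is a classic check-then-create race: both localizers can pass the stat() check and one mkdir() then fails. The direction of the fix is to create unconditionally and treat an already-existing directory as success. The actual code lives in the native container-executor.c; the Java sketch below only illustrates the tolerate-existing pattern and is not the committed change.
{code}
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CreateAppDirSketch {

  /**
   * Race-free variant: create the directory unconditionally and treat an
   * existing directory as success, instead of stat()-then-mkdir(), which can
   * fail when two localizers race on the same appDir under /usercache.
   */
  public static Path ensureAppDir(String userCacheDir, String appId) throws IOException {
    Path appDir = Paths.get(userCacheDir, appId);
    try {
      Files.createDirectories(appDir); // no error if the directory already exists
    } catch (FileAlreadyExistsException e) {
      // A non-directory entry with the same name exists; this is a real error.
      throw new IOException(
          "Cannot create app dir, path exists and is not a directory: " + appDir, e);
    }
    return appDir;
  }
}
{code}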
[jira] [Updated] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2703: Attachment: YARN-2703.2.patch Fix testcase failures Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
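Because the same LogType can now legitimately appear several times in one aggregated file, recording when each copy was uploaded is what makes the listing readable. The sketch below shows one way such a per-file header could be written; the field names and timestamp format are illustrative assumptions, not the actual LogValue change.
{code}
import java.text.SimpleDateFormat;
import java.util.Date;

public class LogHeaderSketch {

  /**
   * Prefix each aggregated log file with its type and the time it was
   * uploaded, so repeated uploads of the same LogType can be told apart.
   */
  public static String buildLogHeader(String logType, long uploadTimeMillis) {
    String uploadedAt =
        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(uploadTimeMillis));
    return "LogType: " + logType + System.lineSeparator()
        + "LogUploadedTime: " + uploadedAt + System.lineSeparator()
        + "LogContents:" + System.lineSeparator();
  }
}
{code}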
[jira] [Updated] (YARN-2692) ktutil test hanging on some machines/ktutil versions
[ https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Nauroth updated YARN-2692: Hadoop Flags: Reviewed +1 for the patch. I agree that we're not really losing any test coverage by removing this. {{TestSecureRegistry}} will make use of the same keytab file implicitly. ktutil test hanging on some machines/ktutil versions Key: YARN-2692 URL: https://issues.apache.org/jira/browse/YARN-2692 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2692-001.patch a couple of the registry security tests run native {{ktutil}}; this is primarily to debug the keytab generation. [~cnauroth] reports that some versions of {{kinit}} hang. Fix: rm the tests. [YARN-2689] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2694) Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY
[ https://issues.apache.org/jira/browse/YARN-2694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177712#comment-14177712 ] Hadoop QA commented on YARN-2694: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675967/YARN-2694-20141020-1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5471//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5471//console This message is automatically generated. Ensure only single node labels specified in resource request, and node label expression only specified when resourceName=ANY Key: YARN-2694 URL: https://issues.apache.org/jira/browse/YARN-2694 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2694-20141020-1.patch Currently, node label expression supporting in capacity scheduler is partial completed. Now node label expression specified in Resource Request will only respected when it specified at ANY level. And a ResourceRequest with multiple node labels will make user limit computation becomes tricky. Now we need temporarily disable them, changes include, - AMRMClient - ApplicationMasterService -- This message was sent by Atlassian JIRA (v6.3.4#6332)
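Stated concretely, the restriction being enforced is: a ResourceRequest may carry at most one node label, and a node label expression is honoured only on the ANY resource name. The sketch below shows a request shaped to satisfy that rule; it assumes the YARN 2.6-era ResourceRequest API (newInstance, setNodeLabelExpression), so treat the exact signatures as assumptions.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class NodeLabelRequestSketch {

  /**
   * Build a request that respects the restriction described in YARN-2694:
   * a single node label, attached only to the ANY resource name.
   */
  public static ResourceRequest anyRequestWithLabel(Resource capability, int numContainers,
      String singleNodeLabel) {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(0), ResourceRequest.ANY, capability, numContainers);
    // Expressions combining multiple labels would be rejected under this restriction,
    // as would a label expression on any resource name other than ANY.
    req.setNodeLabelExpression(singleNodeLabel);
    return req;
  }
}
{code}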
[jira] [Updated] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2194: -- Attachment: YARN-2194-1.patch A preliminary patch that implements the systemd-based cpu resource isolation for Redhat 7. A summary: (1) Create a new resource handler SystemdLCEResourceHandler. Users can use this handler by configuring the field yarn.nodemanager.linux-container-executor.resources-handler.class. (2) For each container, create one slice and one scope. The scope is put inside the slice, and cpuShare isolation is also attached to the slice. All containers' slices are organized in a root slice (named hadoop_yarn.slice by default). Will add some testcases later. Add Cgroup support for RedHat 7 --- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. The systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
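The slice/scope layout described in the patch summary maps naturally onto a systemd-run invocation: a transient scope per container, parented under the root slice (hadoop_yarn.slice by default), with CPUShares carrying the cpu weight. The sketch below assembles such a command line; the flags follow standard systemd-run usage and are assumptions about the approach, not code from the attached patch.
{code}
import java.util.ArrayList;
import java.util.List;

public class SystemdScopeCommandSketch {

  /**
   * Assemble a systemd-run invocation that places a container's command into
   * its own transient scope under a parent slice, with a CPUShares weight.
   */
  public static List<String> buildScopeCommand(String parentSlice, String containerId,
      int cpuShares, List<String> containerCommand) {
    List<String> cmd = new ArrayList<String>();
    cmd.add("systemd-run");
    cmd.add("--scope");                              // transient scope unit
    cmd.add("--unit=" + containerId);                // e.g. the container id
    cmd.add("--slice=" + parentSlice);               // e.g. hadoop_yarn.slice
    cmd.add("--property=CPUShares=" + cpuShares);    // relative cpu weight
    cmd.addAll(containerCommand);                    // the actual container launch command
    return cmd;
  }
}
{code}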
[jira] [Commented] (YARN-2703) Add logUploadedTime into LogValue for better display
[ https://issues.apache.org/jira/browse/YARN-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177734#comment-14177734 ] Hadoop QA commented on YARN-2703: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675982/YARN-2703.2.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5475//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5475//console This message is automatically generated. Add logUploadedTime into LogValue for better display Key: YARN-2703 URL: https://issues.apache.org/jira/browse/YARN-2703 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2703.1.patch, YARN-2703.2.patch Right now, the container can upload its logs multiple times. Sometimes, containers write different logs into the same log file. After the log aggregation, when we query those logs, it will show: LogType: stderr LogContext: LogType: stdout LogContext: LogType: stderr LogContext: LogType: stdout LogContext: The same files could be displayed multiple times. But we can not figure out which logs come first. We could add extra loguploadedTime to let users have better understanding on the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2701: Attachment: YARN-2701.6.patch Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177750#comment-14177750 ] Xuan Gong commented on YARN-2701: - Had some off line discussion with [~jianhan]. We think that for now, reverting the previous method changes might be the safest way to solve this issue. Uploaded a new patch to do it Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177752#comment-14177752 ] Hadoop QA commented on YARN-2715: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675975/YARN-2715.1.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokenAuthentication {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5474//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5474//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5474//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177753#comment-14177753 ] Xuan Gong commented on YARN-2161: - [~decster] [~aw] To fix YARN-2701, I need to revert the native code changes for mkdirs in container-executor.c Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177756#comment-14177756 ] Xuan Gong commented on YARN-2161: - The changes for mkdirs in container-executor.c introduce a race condition when two containers are trying to check and create the same directory at the same time. Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177761#comment-14177761 ] Xuan Gong commented on YARN-2701: - Sorry. Online discussion with [~jianhe]. And thanks for the review. [~zxu] Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177765#comment-14177765 ] Jian He commented on YARN-2701: --- since the previous method has been used/tested thoroughly, I also prefer reverting the patch for solving the problem for now. thanks [~zxu] for reviewing the previous patch ! Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2715: -- Attachment: YARN-2715.2.patch Fix the findbugs warning and the test failure Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177799#comment-14177799 ] Hadoop QA commented on YARN-2701: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12675996/YARN-2701.6.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5476//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5476//console This message is automatically generated. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177814#comment-14177814 ] Jian He commented on YARN-2701: --- +1 Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2194) Add Cgroup support for RedHat 7
[ https://issues.apache.org/jira/browse/YARN-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177823#comment-14177823 ] Beckham007 commented on YARN-2194: -- Do startSystemdSlice/stopSystemdSlice need root privilege? Should container-executor run sudo systemctl start ? Add Cgroup support for RedHat 7 --- Key: YARN-2194 URL: https://issues.apache.org/jira/browse/YARN-2194 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Attachments: YARN-2194-1.patch In previous versions of RedHat, we can build custom cgroup hierarchies with use of the cgconfig command from the libcgroup package. From RedHat 7, package libcgroup is deprecated and it is not recommended to use it since it can easily create conflicts with the default cgroup hierarchy. The systemd is provided and recommended for cgroup management. We need to add support for this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177834#comment-14177834 ] zhihai xu commented on YARN-2701: - thanks [~jianhe], The latest patch LGTM. Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2715) Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set.
[ https://issues.apache.org/jira/browse/YARN-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177859#comment-14177859 ] Hadoop QA commented on YARN-2715: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676000/YARN-2715.2.patch against trunk revision e90718f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5477//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5477//console This message is automatically generated. Proxy user is problem for RPC interface if yarn.resourcemanager.webapp.proxyuser is not set. Key: YARN-2715 URL: https://issues.apache.org/jira/browse/YARN-2715 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2715.1.patch, YARN-2715.2.patch After YARN-2656, if people set hadoop.proxyuser for the client--RM RPC interface, it's not going to work, because ProxyUsers#sip is a singleton per daemon. After YARN-2656, RM has both channels that want to set this configuration: RPC and HTTP. RPC interface sets it first by reading hadoop.proxyuser, but it is overwritten by HTTP interface, who sets it to empty because yarn.resourcemanager.webapp.proxyuser doesn't exist. The fix for it could be similar to what we've done for YARN-2676: make the HTTP interface anyway source hadoop.proxyuser first, then yarn.resourcemanager.webapp.proxyuser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2701) Potential race condition in startLocalizer when using LinuxContainerExecutor
[ https://issues.apache.org/jira/browse/YARN-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177866#comment-14177866 ] Hudson commented on YARN-2701: -- FAILURE: Integrated in Hadoop-trunk-Commit #6297 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6297/]) YARN-2701. Potential race condition in startLocalizer when using LinuxContainerExecutor. Contributed by Xuan Gong (jianhe: rev 2839365f230165222f63129979ea82ada79ec56e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java Missing file for YARN-2701 (jianhe: rev 4fa1fb3193bf39fcb1bd7f8f8391a78f69c3c302) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/MockContainerLocalizer.java Potential race condition in startLocalizer when using LinuxContainerExecutor -- Key: YARN-2701 URL: https://issues.apache.org/jira/browse/YARN-2701 Project: Hadoop YARN Issue Type: Bug Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-2701.1.patch, YARN-2701.2.patch, YARN-2701.3.patch, YARN-2701.4.patch, YARN-2701.5.patch, YARN-2701.6.patch When using LinuxContainerExecutor do startLocalizer, we are using native code container-executor.c. {code} if (stat(npath, sb) != 0) { if (mkdir(npath, perm) != 0) { {code} We are using check and create method to create the appDir under /usercache. But if there are two containers trying to do this at the same time, race condition may happen. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177886#comment-14177886 ] Rohith commented on YARN-2579: -- bq. Under what conditions, can resetDispatcher be called by two threads simultaneously? resetDispatcher is called only once, inside a synchronized block (transitionToStandby or transitionToActive). Here the problem is: *Thread-1:* if an RMFatalEvent is thrown just before active services are stopped in transitionToStandby(), the RMFatalEventDispatcher waits to obtain the lock held by transitionToStandby(); the RMFatalEventDispatcher is BLOCKED on transitionToStandby(). *Thread-2:* from the elector, transitionToStandby() stops the dispatcher in the resetDispatcher() method. (Service)Dispatcher.stop() waits to drain the pending RMFatalEventDispatcher event, but the AsyncDispatcher event handler is WAITING on the dispatcher thread to finish. Both RM's state is Active , but 1 RM is not really active. -- Key: YARN-2579 URL: https://issues.apache.org/jira/browse/YARN-2579 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.1 Reporter: Rohith Assignee: Rohith Attachments: YARN-2579.patch, YARN-2579.patch I encountered a situation where both RMs' web pages were accessible and their state displayed as Active, but one RM's ActiveServices were stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
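In other words, this is a lock-ordering deadlock: the elector thread holds the RM lock in transitionToStandby() and waits for the dispatcher to drain, while the event being drained is an RMFatalEvent whose handler needs that same lock. The stripped-down sketch below reproduces the shape of the hang; the class and method names are illustrative, not the actual ResourceManager code.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DispatcherDeadlockSketch {

  private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

  // Thread-2: the elector calls this and holds the object lock for its duration.
  public synchronized void transitionToStandby() throws InterruptedException {
    // stop active services ... then resetDispatcher():
    dispatcher.shutdown();
    // Waits for in-flight events to drain; but the in-flight RMFatalEvent
    // handler below is itself blocked trying to enter this synchronized method.
    dispatcher.awaitTermination(1, TimeUnit.MINUTES);
  }

  // Thread-1: an RMFatalEvent raised just before active services are stopped.
  public void onRMFatalEvent() {
    dispatcher.submit(new Runnable() {
      @Override
      public void run() {
        try {
          transitionToStandby(); // BLOCKED: the lock is held by the elector thread
        } catch (InterruptedException ignored) {
          Thread.currentThread().interrupt();
        }
      }
    });
  }
}
{code}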
[jira] [Commented] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177898#comment-14177898 ] Zhijie Shen commented on YARN-2709: --- [~gtCarrera], thanks for the patch. Here are some comments. 1. Any reason why TimelineClientRetryOp, TimelineClientConnectionRetry and TimelineJerseyRetryFilter are not private? 2. Redundant reference. {code} TimelineClientImpl.this.connectionRetry.retryOn(jerseyRetryOp); {code} 3. Not sure why this code can't be put into run() directly? At least it shouldn't be public. {code} public Token<TimelineDelegationTokenIdentifier> getDelegationTokenInternal(final String renewer) throws IOException { {code} 4. It's safer to create connectionRetry before retryFilter, because retryFilter may invoke connectionRetry, though it won't actually in practice. {code} TimelineJerseyRetryFilter retryFilter = new TimelineJerseyRetryFilter(); client = new Client(new URLConnectionClientHandler( new TimelineURLConnectionFactory()), cc); token = new DelegationTokenAuthenticatedURL.Token(); client.addFilter(retryFilter); connectionRetry = new TimelineClientConnectionRetry(conf); {code} 5. Unnecessary import in TimelineClientImpl. 6. I believe the following mock would not normally be necessary. The reason why you want to tack this code on is HADOOP-11215: due to this issue, it will throw a cast exception here. Please leave a comment about the mock code below. {code} doThrow(new ConnectException("Connection refused")).when(client) .getDelegationTokenInternal(any(String.class)); {code} 7. It's not a meaningful renewer. You can use UserGroupInformation.getCurrentUser().getShortUserName() here. {code} Token<TimelineDelegationTokenIdentifier> token = client .getDelegationToken(http://localhost:8/resource?delegation=;); {code} Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Bug Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
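For readers following the review, the core of the patch is wrapping getDelegationToken in a connection-retry helper so a secured client survives a timeline server that is not yet up. The sketch below is a generic retry-on-ConnectException loop in that spirit; it deliberately uses its own minimal names rather than the TimelineClientConnectionRetry/TimelineClientRetryOp classes from the actual patch.
{code}
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

public class ConnectionRetrySketch {

  private final int maxRetries;
  private final long retryIntervalMs;

  public ConnectionRetrySketch(int maxRetries, long retryIntervalMs) {
    this.maxRetries = maxRetries;
    this.retryIntervalMs = retryIntervalMs;
  }

  /**
   * Run the operation, retrying while the server refuses connections, e.g.
   * getDelegationToken on a secured cluster where the timeline server is not up yet.
   */
  public <T> T retryOn(Callable<T> op) throws IOException {
    int attempt = 0;
    while (true) {
      try {
        return op.call();
      } catch (ConnectException e) {
        if (++attempt > maxRetries) {
          throw new IOException("Server unreachable after " + attempt + " attempts", e);
        }
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while retrying", ie);
        }
      } catch (Exception e) {
        throw new IOException(e); // non-retriable failure
      }
    }
  }
}
{code}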
[jira] [Updated] (YARN-2709) Add retry for timeline client getDelegationToken method
[ https://issues.apache.org/jira/browse/YARN-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2709: -- Issue Type: Sub-task (was: Bug) Parent: YARN-1530 Add retry for timeline client getDelegationToken method --- Key: YARN-2709 URL: https://issues.apache.org/jira/browse/YARN-2709 Project: Hadoop YARN Issue Type: Sub-task Reporter: Li Lu Assignee: Li Lu Attachments: YARN-2709-102014-1.patch, YARN-2709-102014.patch As mentioned in YARN-2673, we need to add retry mechanism to timeline client for secured clusters. This means if the timeline server is not available, a timeline client needs to retry to get a delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177902#comment-14177902 ] Jian He commented on YARN-1972: --- I merged this to branch-2.6 Implement secure Windows Container Executor --- Key: YARN-1972 URL: https://issues.apache.org/jira/browse/YARN-1972 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Fix For: 2.6.0 Attachments: YARN-1972.1.patch, YARN-1972.2.patch, YARN-1972.3.patch, YARN-1972.delta.4.patch, YARN-1972.delta.5-branch-2.patch, YARN-1972.delta.5.patch, YARN-1972.trunk.4.patch, YARN-1972.trunk.5.patch h1. Windows Secure Container Executor (WCE) YARN-1063 adds the necessary infrastructure to launch a process as a domain user as a solution for the problem of having a security boundary between processes executed in YARN containers and the Hadoop services. The WCE is a container executor that leverages the winutils capabilities introduced in YARN-1063 and launches containers as an OS process running as the job submitter user. A description of the S4U infrastructure used by YARN-1063 and the alternatives considered can be read on that JIRA. The WCE is based on the DefaultContainerExecutor. It relies on the DCE to drive the flow of execution, but it overrides some methods to the effect of: * changes the DCE-created user cache directories to be owned by the job user and by the nodemanager group. * changes the actual container run command to use the 'createAsUser' command of the winutils task instead of 'create' * runs the localization as a standalone process instead of an in-process Java method call. This in turn relies on the winutils createAsUser feature to run the localization as the job user. When compared to LinuxContainerExecutor (LCE), the WCE has some minor differences: * it does not delegate the creation of the user cache directories to the native implementation. * it does not require special handling to be able to delete user files The approach to the WCE came from practical trial and error. I had to iron out some issues around the Windows script shell limitations (command line length) to get it to work, the biggest issue being the huge CLASSPATH that is commonplace in Hadoop container executions. The job container itself is already dealing with this via a so-called 'classpath jar'; see HADOOP-8899 and YARN-316 for details. For the WCE localizer, launched as a separate container, the same issue had to be resolved, and I used the same 'classpath jar' approach. h2. Deployment Requirements To use the WCE one needs to set `yarn.nodemanager.container-executor.class` to `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` and set `yarn.nodemanager.windows-secure-container-executor.group` to a Windows security group name that the nodemanager service principal is a member of (the equivalent of the LCE `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE does not require any configuration outside of Hadoop's own yarn-site.xml. For the WCE to work, the nodemanager must run as a service principal that is a member of the local Administrators group, or as LocalSystem. This is derived from the need to invoke the LoadUserProfile API, which mentions these requirements in its specification. This is in addition to the SE_TCB privilege mentioned in YARN-1063, but this requirement automatically implies that the SE_TCB privilege is held by the nodemanager. 
For the Linux speakers in the audience, the requirement is basically to run the NM as root. h2. Dedicated high-privilege service Due to the high privilege required by the WCE, we had discussed the need to isolate the high-privilege operations into a separate process, an 'executor' service that is solely responsible for starting the containers (including the localizer). The NM would have to authenticate, authorize and communicate with this service via an IPC mechanism and use this service to launch the containers. I still believe we'll end up deploying such a service, but the effort to onboard such a new platform-specific service on the project is not trivial. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
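The deployment requirements in the description reduce to two yarn-site.xml properties. A minimal programmatic equivalent is sketched below; the executor class and property names are taken from the description above, while the group name is a placeholder for whatever Windows security group the NM service principal belongs to.
{code}
import org.apache.hadoop.conf.Configuration;

public class WSCEConfigSketch {

  /** Set the two properties the WCE description above calls out. */
  public static Configuration windowsSecureExecutorConf() {
    Configuration conf = new Configuration();
    conf.set("yarn.nodemanager.container-executor.class",
        "org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor");
    // Placeholder group: must be a Windows security group the NM service principal belongs to.
    conf.set("yarn.nodemanager.windows-secure-container-executor.group",
        "HadoopNodeManagers");
    return conf;
  }
}
{code}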
[jira] [Created] (YARN-2717) containerLogNotFound log shows multiple time for the same container
Xuan Gong created YARN-2717: --- Summary: containerLogNotFound log shows multiple time for the same container Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2717: Attachment: YARN-2717.1.patch trivial patch containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177919#comment-14177919 ] Hudson commented on YARN-1879: -- FAILURE: Integrated in Hadoop-trunk-Commit #6298 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6298/]) Missing file for YARN-1879 (jianhe: rev 4a78a752286effbf1a0d8695325f9d7464a09fb4) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceProtocolOnHA.java Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Fix For: 2.6.0 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.25.patch, YARN-1879.26.patch, YARN-1879.27.patch, YARN-1879.28.patch, YARN-1879.29.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177944#comment-14177944 ] Hadoop QA commented on YARN-2717: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12676023/YARN-2717.1.patch against trunk revision 4a78a75. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5478//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5478//console This message is automatically generated. containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177959#comment-14177959 ] Zhijie Shen commented on YARN-2717: --- +1, will commit the patch containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2707) Potential null dereference in FSDownload
[ https://issues.apache.org/jira/browse/YARN-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov reassigned YARN-2707: --- Assignee: Gera Shegalov Potential null dereference in FSDownload Key: YARN-2707 URL: https://issues.apache.org/jira/browse/YARN-2707 Project: Hadoop YARN Issue Type: Bug Reporter: Ted Yu Assignee: Gera Shegalov Priority: Minor Here is related code in call(): {code} Pattern pattern = null; String p = resource.getPattern(); if (p != null) { pattern = Pattern.compile(p); } unpack(new File(dTmp.toUri()), new File(dFinal.toUri()), pattern); {code} In unpack(): {code} RunJar.unJar(localrsrc, dst, pattern); {code} unJar() would dereference the pattern without checking whether it is null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
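One defensive way to address the report is to make sure unJar never receives a null pattern, falling back to a match-everything pattern when the resource does not specify one. The sketch below is such a guard, assuming the RunJar.unJar(File, File, Pattern) overload shown in the snippet; it is not the fix that was ultimately committed.
{code}
import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.util.RunJar;

public class UnpackSketch {

  private static final Pattern MATCH_ANY = Pattern.compile(".*");

  /**
   * Unpack the localized archive, substituting a match-all pattern when the
   * resource did not specify one, so RunJar.unJar never sees null.
   */
  public static void unpackJar(File localrsrc, File dst, String resourcePattern)
      throws IOException {
    Pattern pattern =
        resourcePattern == null ? MATCH_ANY : Pattern.compile(resourcePattern);
    RunJar.unJar(localrsrc, dst, pattern);
  }
}
{code}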
[jira] [Commented] (YARN-2161) Fix build on macosx: YARN parts
[ https://issues.apache.org/jira/browse/YARN-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177969#comment-14177969 ] Binglin Chang commented on YARN-2161: - Hi [~xgong], sorry for breaking the code. I see that in YARN-2701 you already had fix code but decided to revert it in the end to be safer; however, this breaks the mac build. How about using #ifdef to use the old code when compiling on glibc 2.10 (http://linux.die.net/man/2/openat) and your fixing code otherwise? Fix build on macosx: YARN parts --- Key: YARN-2161 URL: https://issues.apache.org/jira/browse/YARN-2161 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2161.v1.patch, YARN-2161.v2.patch When compiling on macosx with -Pnative, there are several warnings and errors; fixing this would help hadoop developers with a macosx env. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2717) containerLogNotFound log shows multiple time for the same container
[ https://issues.apache.org/jira/browse/YARN-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177973#comment-14177973 ] Hudson commented on YARN-2717: -- FAILURE: Integrated in Hadoop-trunk-Commit #6299 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6299/]) YARN-2717. Avoided duplicate logging when container logs are not found. Contributed by Xuan Gong. (zjshen: rev 171f2376d23d51b61b9c9b3804ee86dbd4de033a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/LogCLIHelpers.java * hadoop-yarn-project/CHANGES.txt containerLogNotFound log shows multiple time for the same container --- Key: YARN-2717 URL: https://issues.apache.org/jira/browse/YARN-2717 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.6.0 Attachments: YARN-2717.1.patch containerLogNotFound is called multiple times when the container log for the same container does not exist -- This message was sent by Atlassian JIRA (v6.3.4#6332)