[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692695#comment-13692695
 ] 

Karthik Kambatla commented on YARN-884:
---

The test TestAMAuthorization fails on trunk as well. I don't think the patch can 
affect the test in any way.

> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
> Attachments: yarn-884-1.patch
>
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692689#comment-13692689
 ] 

Hadoop QA commented on YARN-884:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589529/yarn-884-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1395//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1395//console

This message is automatically generated.

> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
> Attachments: yarn-884-1.patch
>
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-885) TestBinaryTokenFile (and others) fail

2013-06-24 Thread Kam Kasravi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692687#comment-13692687
 ] 

Kam Kasravi commented on YARN-885:
--

Changing ContainerLocalizer.runLocalization so that the local context uses the 
same tokens as the user context seems to fix this problem. 
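A minimal sketch of that idea, assuming the localizer's tokens are read from its 
token file and added to the UGI that makes the heartbeat RPC; this is 
illustrative only (not the actual ContainerLocalizer.runLocalization change), 
and the file handling is simplified:

{noformat}
import java.io.DataInputStream;
import java.io.FileInputStream;

import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class LocalizerUgiSketch {
  // Build the UGI the localizer would use for its RPCs, carrying the same
  // tokens as the user context instead of an empty "local" context.
  public static UserGroupInformation buildLocalizerUgi(String user, String tokenFile)
      throws Exception {
    Credentials creds = new Credentials();
    DataInputStream in = new DataInputStream(new FileInputStream(tokenFile));
    try {
      creds.readTokenStorageStream(in);   // tokens written for this localizer
    } finally {
      in.close();
    }
    UserGroupInformation remoteUser = UserGroupInformation.createRemoteUser(user);
    for (Token<? extends TokenIdentifier> token : creds.getAllTokens()) {
      remoteUser.addToken(token);         // same tokens as the user context
    }
    return remoteUser;
  }
}
{noformat}

The heartbeat call would then run inside remoteUser.doAs(...), which is what the 
comment above suggests avoids the SIMPLE-auth fallback error.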

> TestBinaryTokenFile (and others) fail
> -
>
> Key: YARN-885
> URL: https://issues.apache.org/jira/browse/YARN-885
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.4-alpha
>Reporter: Kam Kasravi
>
> Seeing the following stack trace and the unit test goes into an infinite loop:
> 2013-06-24 17:03:58,316 ERROR [LocalizerRunner for 
> container_1372118631537_0001_01_01] security.UserGroupInformation 
> (UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
> as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: Server asks us to fall 
> back to SIMPLE auth, but this client is configured to only allow secure 
> connections.
> 2013-06-24 17:03:58,317 WARN  [LocalizerRunner for 
> container_1372118631537_0001_01_01] ipc.Client (Client.java:run(579)) - 
> Exception encountered while connecting to the server : java.io.IOException: 
> Server asks us to fall back to SIMPLE auth, but this client is configured to 
> only allow secure connections.
> 2013-06-24 17:03:58,318 ERROR [LocalizerRunner for 
> container_1372118631537_0001_01_01] security.UserGroupInformation 
> (UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
> as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: java.io.IOException: 
> Server asks us to fall back to SIMPLE auth, but this client is configured to 
> only allow secure connections.
> java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:56)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:247)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:103)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:859)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692675#comment-13692675
 ] 

Hadoop QA commented on YARN-874:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589527/YARN-874.2.txt
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1394//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1394//console

This message is automatically generated.

> Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
> -
>
> Key: YARN-874
> URL: https://issues.apache.org/jira/browse/YARN-874
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>Priority: Blocker
> Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt
>
>
> HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692668#comment-13692668
 ] 

Omkar Vinit Joshi commented on YARN-884:


[~kkambatl] makes sense...


> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
> Attachments: yarn-884-1.patch
>
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692666#comment-13692666
 ] 

Sandy Ryza commented on YARN-763:
-

Can we move all of this into the switch statement, replace break with return, 
and get rid of the stop variable?  Unless the thinking is that returning from a 
method in the middle is bad, I think this would be a lot cleaner.
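A self-contained illustration of that suggestion; the enum and method names 
below are invented for the example, and the real AMRMClientAsync heartbeat loop 
differs:

{noformat}
public class SwitchReturnSketch {
  enum Command { SHUTDOWN, RESYNC, NONE }

  // Current shape (sketch): a stop flag set inside the switch.
  static void handleWithFlag(Command cmd) {
    boolean stop = false;
    switch (cmd) {
      case SHUTDOWN:
      case RESYNC:
        stop = true;
        break;
      default:
        break;
    }
    if (stop) {
      return;                  // stop heartbeating
    }
    processNormally(cmd);
  }

  // Suggested shape: return from the switch directly, no flag needed.
  static void handleWithReturn(Command cmd) {
    switch (cmd) {
      case SHUTDOWN:
      case RESYNC:
        return;                // stop heartbeating immediately
      default:
        processNormally(cmd);  // normal heartbeat handling
    }
  }

  static void processNormally(Command cmd) {
    System.out.println("processing " + cmd);
  }
}
{noformat}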

> AMRMClientAsync should stop heartbeating after receiving shutdown from RM
> -
>
> Key: YARN-763
> URL: https://issues.apache.org/jira/browse/YARN-763
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-763.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692664#comment-13692664
 ] 

Omkar Vinit Joshi commented on YARN-874:


Tested YARN-872-2 on a local cluster... with the patch it is running now. 

> Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
> -
>
> Key: YARN-874
> URL: https://issues.apache.org/jira/browse/YARN-874
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>Priority: Blocker
> Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt
>
>
> HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-884:
--

Attachment: yarn-884-1.patch

Uploading a straightforward patch.

> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
> Attachments: yarn-884-1.patch
>
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692641#comment-13692641
 ] 

Hadoop QA commented on YARN-763:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589525/YARN-763.1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1152 javac 
compiler warnings (more than the trunk's current 1151 warnings).

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client:

  org.apache.hadoop.yarn.client.api.impl.TestNMClient

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1393//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1393//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1393//console

This message is automatically generated.

> AMRMClientAsync should stop heartbeating after receiving shutdown from RM
> -
>
> Key: YARN-763
> URL: https://issues.apache.org/jira/browse/YARN-763
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-763.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-874) Tracking YARN/MR test failures after HADOOP-9421 and YARN-827

2013-06-24 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-874:
-

Attachment: YARN-874.2.txt

Updated patch with a new test validating the common changes.

> Tracking YARN/MR test failures after HADOOP-9421 and YARN-827
> -
>
> Key: YARN-874
> URL: https://issues.apache.org/jira/browse/YARN-874
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Vinod Kumar Vavilapalli
>Priority: Blocker
> Attachments: YARN-874.1.txt, YARN-874.2.txt, YARN-874.txt
>
>
> HADOOP-9421 and YARN-827 broke some YARN/MR tests. Tracking those..

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Resolved] (YARN-758) Fair scheduler has some bug that causes TestRMRestart to fail

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza resolved YARN-758.
-

Resolution: Not A Problem

> Fair scheduler has some bug that causes TestRMRestart to fail
> -
>
> Key: YARN-758
> URL: https://issues.apache.org/jira/browse/YARN-758
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Sandy Ryza
>
> YARN-757 got fixed by changing the scheduler from Fair to default (which is 
> capacity).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692639#comment-13692639
 ] 

Karthik Kambatla commented on YARN-884:
---

If AM_EXPIRY < NM_EXPIRY,
# the user has explicitly set AM_EXPIRY to be smaller than NM_EXPIRY
# I agree it is possible that the RM might expire the first attempt and start 
another attempt, in case the NM fails to connect to the RM for a time 't' such 
that AM_EXPIRY < t < NM_EXPIRY. However, the user has asked for a shorter 
expiry interval for a reason.

If AM_EXPIRY > NM_EXPIRY,
# When the NM dies, the AMs on it would have died as well. However, IIUC, the 
RM wouldn't schedule another attempt until AM_EXPIRY elapses. Correct me if I 
am wrong.
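A minimal sketch of the clamping being discussed, assuming the straightforward 
Math.min over the existing YarnConfiguration keys; the attached yarn-884-1.patch 
may implement it differently:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmExpiryClampSketch {
  // The AM cannot outlive the NM it runs on, so never wait longer than the NM
  // expiry before declaring the AM attempt dead.
  public static long effectiveAmExpiryMs(Configuration conf) {
    long amExpiry = conf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS);
    long nmExpiry = conf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS);
    return Math.min(amExpiry, nmExpiry);
  }
}
{noformat}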


> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-808) ApplicationReport does not clearly tell that the attempt is running or not

2013-06-24 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692632#comment-13692632
 ] 

Xuan Gong commented on YARN-808:


How about we expose the current attempt Id with its attempt status, as well as 
the previous attempt Id with its status, if they exist?
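Purely as an illustration of the shape of that proposal; neither this class nor 
any getCurrentAttempt()/getPreviousAttempt() accessors on ApplicationReport 
exist in YARN, the names below are made up:

{noformat}
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

// Hypothetical value object an ApplicationReport could carry for the current
// and previous attempts.
public final class AttemptStatusSketch {
  public enum AttemptState { RUNNING, FAILED, KILLED, FINISHED }

  private final ApplicationAttemptId attemptId;
  private final AttemptState state;

  public AttemptStatusSketch(ApplicationAttemptId attemptId, AttemptState state) {
    this.attemptId = attemptId;
    this.state = state;
  }

  public ApplicationAttemptId getAttemptId() { return attemptId; }
  public AttemptState getState() { return state; }
}
{noformat}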

> ApplicationReport does not clearly tell that the attempt is running or not
> --
>
> Key: YARN-808
> URL: https://issues.apache.org/jira/browse/YARN-808
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Xuan Gong
>
> When an app attempt fails and is being retried, ApplicationReport immediately 
> gives the new attemptId and non-null values of host etc. There is no way for 
> clients to know that the attempt is running other than connecting to it and 
> timing out on invalid host. Solution would be to expose the attempt state or 
> return a null value for host instead of "N/A"

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM

2013-06-24 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-763:
---

Attachment: YARN-763.1.patch

> AMRMClientAsync should stop heartbeating after receiving shutdown from RM
> -
>
> Key: YARN-763
> URL: https://issues.apache.org/jira/browse/YARN-763
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bikas Saha
>Assignee: Xuan Gong
> Attachments: YARN-763.1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report

2013-06-24 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692601#comment-13692601
 ] 

Xuan Gong commented on YARN-873:


At the command line, if we type yarn application -status $UnKnowAppId, it will 
output "Application with id $UnKnowAppId doesn't exist in RM".
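On the Java API side, the only signal today (per this ticket) is a null report; 
a minimal sketch of what a caller has to do, assuming the YarnClient API in 
2.1.0-beta:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class UnknownAppCheck {
  public static void main(String[] args) throws Exception {
    ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    try {
      ApplicationReport report = client.getApplicationReport(appId);
      if (report == null) {
        // Today this null is the only programmatic hint that the app is unknown.
        System.err.println("Application " + appId + " doesn't exist in RM");
      } else {
        System.out.println("State: " + report.getYarnApplicationState());
      }
    } finally {
      client.stop();
    }
  }
}
{noformat}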

> YARNClient.getApplicationReport(unknownAppId) returns a null report
> ---
>
> Key: YARN-873
> URL: https://issues.apache.org/jira/browse/YARN-873
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.1.0-beta
>Reporter: Bikas Saha
>Assignee: Xuan Gong
>
> How can the client find out that app does not exist?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-885) TestBinaryTokenFile (and others) fail

2013-06-24 Thread Kam Kasravi (JIRA)
Kam Kasravi created YARN-885:


 Summary: TestBinaryTokenFile (and others) fail
 Key: YARN-885
 URL: https://issues.apache.org/jira/browse/YARN-885
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.4-alpha
Reporter: Kam Kasravi


Seeing the following stack trace and the unit test goes into an infinite loop:

2013-06-24 17:03:58,316 ERROR [LocalizerRunner for 
container_1372118631537_0001_01_01] security.UserGroupInformation 
(UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: Server asks us to fall 
back to SIMPLE auth, but this client is configured to only allow secure 
connections.
2013-06-24 17:03:58,317 WARN  [LocalizerRunner for 
container_1372118631537_0001_01_01] ipc.Client (Client.java:run(579)) - 
Exception encountered while connecting to the server : java.io.IOException: 
Server asks us to fall back to SIMPLE auth, but this client is configured to 
only allow secure connections.
2013-06-24 17:03:58,318 ERROR [LocalizerRunner for 
container_1372118631537_0001_01_01] security.UserGroupInformation 
(UserGroupInformation.java:doAs(1480)) - PriviledgedActionException 
as:kamkasravi (auth:SIMPLE) cause:java.io.IOException: java.io.IOException: 
Server asks us to fall back to SIMPLE auth, but this client is configured to 
only allow secure connections.
java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.unwrapAndThrowException(YarnRemoteExceptionPBImpl.java:135)
at 
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:56)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:247)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:181)
at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:103)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:859)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692572#comment-13692572
 ] 

Chris Douglas commented on YARN-569:


{{TestAMAuthorization}} also fails on trunk, YARN-878

> CapacityScheduler: support for preemption (using a capacity monitor)
> 
>
> Key: YARN-569
> URL: https://issues.apache.org/jira/browse/YARN-569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
> preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
> YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
> YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch
>
>
> There is a tension between the fast-paced, reactive role of the 
> CapacityScheduler, which needs to respond quickly to applications' resource 
> requests and node updates, and the more introspective, time-based 
> considerations needed to observe and correct for capacity balance. To this 
> purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
> directly, we opted to add support for preemption by means of a "Capacity 
> Monitor", which can be run optionally as a separate service (much like the 
> NMLivelinessMonitor).
> The capacity monitor (similarly to equivalent functionality in the fair 
> scheduler) runs at intervals (e.g., every 3 seconds), observes the state of 
> the assignment of resources to queues from the capacity scheduler, performs 
> off-line computation to determine whether preemption is needed and how best 
> to "edit" the current schedule to improve capacity, and generates events that 
> produce four possible actions:
> # Container de-reservations
> # Resource-based preemptions
> # Container-based preemptions
> # Container killing
> The actions listed above are progressively more costly, and it is up to the 
> policy to use them as desired to achieve the rebalancing goals. 
> Note that due to the "lag" in the effect of these actions the policy should 
> operate at the macroscopic level (e.g., preempt tens of containers from a 
> queue) and not try to tightly and consistently micromanage container 
> allocations. 
> - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
> - 
> Preemption policies are by design pluggable; in the following we present an 
> initial policy (ProportionalCapacityPreemptionPolicy) we have been 
> experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
> follows:
> # it gathers from the scheduler the state of the queues, in particular their 
> current capacity, guaranteed capacity and pending requests (*)
> # if there are pending requests from queues that are under capacity, it 
> computes a new ideal balanced state (**)
> # it computes the set of preemptions needed to repair the current schedule 
> and achieve capacity balance (accounting for natural completion rates, and 
> respecting bounds on the amount of preemption we allow for each round)
> # it selects which applications to preempt from each over-capacity queue (the 
> last one in the FIFO order)
> # it removes reservations from the most recently assigned app until the 
> amount of resource to reclaim is obtained, or until no more reservations exist
> # (if not enough) it issues preemptions for containers from the same 
> application (reverse chronological order, last assigned container first), 
> again until enough is reclaimed or until no containers except the AM 
> container are left
> # (if not enough) it moves on to unreserving and preempting from the next 
> application 
> # containers that have been asked to preempt are tracked across executions; 
> if a container remains among those to be preempted for more than a certain 
> time, it is moved into the list of containers to be forcibly killed 
> Notes:
> (*) at the moment, in order to avoid double-counting of the requests, we only 
> look at the "ANY" part of pending resource requests, which means we might not 
> preempt on behalf of AMs that ask only for specific locations but not ANY. 
> (**) The ideal balanced state is one in which each queue has at least its 
> guaranteed capacity, and the spare capacity is distributed among queues (that 
> want some) as a weighted fair share, where the weighting is based on the 
> guaranteed capacity of a queue and the function runs to a fixed point.  
> Tunables of the ProportionalCapacityPreemptionPolicy:
> # observe-only mode (i.e., log the actions it would take, but behave as 
> read-only)
> # how frequently to run the policy
> # how long to wait between preemption and kill of a container
> # wh
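A greatly simplified, self-contained sketch of the proportional "how much to 
reclaim from each over-capacity queue" step described above; the real 
ProportionalCapacityPreemptionPolicy in the attached patches works on Resource 
objects and also handles reservations, per-round bounds, and the 
natural-completion discount, none of which appear here:

{noformat}
import java.util.HashMap;
import java.util.Map;

public class ProportionalPreemptionSketch {

  // Returns, per queue, how much usage to reclaim so that under-served demand
  // can be met, taking proportionally more from queues that sit further above
  // their guarantee.
  public static Map<String, Long> toPreempt(Map<String, Long> guaranteed,
                                            Map<String, Long> used,
                                            Map<String, Long> pending) {
    // Total demand from queues sitting below their guarantee.
    long unsatisfied = 0;
    for (String q : guaranteed.keySet()) {
      long under = Math.max(0, guaranteed.get(q) - used.get(q));
      unsatisfied += Math.min(pending.get(q), under);
    }

    // Amount held above guarantees, per queue and in total.
    long totalOver = 0;
    Map<String, Long> over = new HashMap<String, Long>();
    for (String q : guaranteed.keySet()) {
      long o = Math.max(0, used.get(q) - guaranteed.get(q));
      over.put(q, o);
      totalOver += o;
    }

    // Reclaim at most what is both demanded and available, proportionally to
    // how far each queue is over its guarantee.
    long target = Math.min(unsatisfied, totalOver);
    Map<String, Long> preempt = new HashMap<String, Long>();
    for (Map.Entry<String, Long> e : over.entrySet()) {
      long share = (totalOver == 0) ? 0 : target * e.getValue() / totalOver;
      preempt.put(e.getKey(), share);
    }
    return preempt;
  }
}
{noformat}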

[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692581#comment-13692581
 ] 

Hadoop QA commented on YARN-883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589510/YARN-883-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1392//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1392//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1392//console

This message is automatically generated.

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883-1.patch, YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692542#comment-13692542
 ] 

Hadoop QA commented on YARN-569:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589506/YARN-569.10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1391//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1391//console

This message is automatically generated.

> CapacityScheduler: support for preemption (using a capacity monitor)
> 
>
> Key: YARN-569
> URL: https://issues.apache.org/jira/browse/YARN-569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
> preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
> YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
> YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch
>
>
> There is a tension between the fast-paced, reactive role of the 
> CapacityScheduler, which needs to respond quickly to applications' resource 
> requests and node updates, and the more introspective, time-based 
> considerations needed to observe and correct for capacity balance. To this 
> purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
> directly, we opted to add support for preemption by means of a "Capacity 
> Monitor", which can be run optionally as a separate service (much like the 
> NMLivelinessMonitor).
> The capacity monitor (similarly to equivalent functionality in the fair 
> scheduler) runs at intervals (e.g., every 3 seconds), observes the state of 
> the assignment of resources to queues from the capacity scheduler, performs 
> off-line computation to determine whether preemption is needed and how best 
> to "edit" the current schedule to improve capacity, and generates events that 
> produce four possible actions:
> # Container de-reservations
> # Resource-based preemptions
> # Container-based preemptions
> # Container killing
> The actions listed above are progressively more costly, and it is up to the 
> policy to use them as desired to achieve the rebalancing goals. 
> Note that due to the "lag" in the effect of these actions the policy should 
> operate at the macroscopic level (e.g., preempt tens of containers from a 
> queue) and not try to tightly and consistently micromanage container 
> allocations. 
> - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
> - 
> Preemption policies are by design pluggable; in the following we present an 
> initial policy (ProportionalCapacityPreemptionPolicy) we have been 
> experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
> follows:
> # it gathers from the scheduler the state of the queues, in particular their 
> current capacity, guaranteed capacity and pending requests (*)
> # if there are pending requests from queues that are under capacity, it 
> computes a new ideal balanced state (**)
> # it computes the set of preemptions needed to repair the current schedule 
> and achieve capacity balance (accounting for natural completion rates, and 
> respecting bounds on the amount of preemption we allow for each round)
> # it selects which applications to preempt from each over-capacity queue (the 
> last one in the FIFO order)
> # it removes reservations from 

[jira] [Updated] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-883:


Attachment: YARN-883-1.patch

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883-1.patch, YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692526#comment-13692526
 ] 

Hadoop QA commented on YARN-649:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589505/YARN-649-3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerResync
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.application.TestApplication
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.TestContainersMonitor
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService
  
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerShutdown
  
org.apache.hadoop.yarn.server.nodemanager.containermanager.TestContainerManager

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1390//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1390//console

This message is automatically generated.

> Make container logs available over HTTP in plain text
> -
>
> Key: YARN-649
> URL: https://issues.apache.org/jira/browse/YARN-649
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
> YARN-752-1.patch
>
>
> It would be good to make container logs available over the REST API for 
> MAPREDUCE-4362 and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692525#comment-13692525
 ] 

Omkar Vinit Joshi commented on YARN-884:


Probably these two are unrelated. First, if the NM goes down then obviously any 
AM running on it has gone down too, but vice versa is not true. In a 
work-preserving environment we would like to restart/resume the AM, which will 
not be possible if we configure the AM liveness interval as the smallest of 
{am,nm}. For example, the NM might be facing problems connecting to the RM and 
may end up heartbeating with the RM just before the RM takes the decision to 
start a new application attempt, marking the earlier one as failed... even if 
the AM heartbeats immediately after that it would be a waste... right??

I think we need am = largest of {am,nm}

thoughts?

> AM expiry interval should be set to smaller of {am, 
> nm}.liveness-monitor.expiry-interval-ms
> ---
>
> Key: YARN-884
> URL: https://issues.apache.org/jira/browse/YARN-884
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.0.4-alpha
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>  Labels: configuration
>
> As the AM can't outlive the NM on which it is running, it is a good idea to 
> disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
> than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)

2013-06-24 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated YARN-569:
---

Attachment: YARN-569.10.patch

> CapacityScheduler: support for preemption (using a capacity monitor)
> 
>
> Key: YARN-569
> URL: https://issues.apache.org/jira/browse/YARN-569
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, 
> preemption.2.patch, YARN-569.10.patch, YARN-569.1.patch, YARN-569.2.patch, 
> YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, 
> YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch
>
>
> There is a tension between the fast-paced, reactive role of the 
> CapacityScheduler, which needs to respond quickly to applications' resource 
> requests and node updates, and the more introspective, time-based 
> considerations needed to observe and correct for capacity balance. To this 
> purpose, instead of hacking the delicate mechanisms of the CapacityScheduler 
> directly, we opted to add support for preemption by means of a "Capacity 
> Monitor", which can be run optionally as a separate service (much like the 
> NMLivelinessMonitor).
> The capacity monitor (similarly to equivalent functionality in the fair 
> scheduler) runs at intervals (e.g., every 3 seconds), observes the state of 
> the assignment of resources to queues from the capacity scheduler, performs 
> off-line computation to determine whether preemption is needed and how best 
> to "edit" the current schedule to improve capacity, and generates events that 
> produce four possible actions:
> # Container de-reservations
> # Resource-based preemptions
> # Container-based preemptions
> # Container killing
> The actions listed above are progressively more costly, and it is up to the 
> policy to use them as desired to achieve the rebalancing goals. 
> Note that due to the "lag" in the effect of these actions the policy should 
> operate at the macroscopic level (e.g., preempt tens of containers from a 
> queue) and not try to tightly and consistently micromanage container 
> allocations. 
> - Preemption policy  (ProportionalCapacityPreemptionPolicy): 
> - 
> Preemption policies are by design pluggable; in the following we present an 
> initial policy (ProportionalCapacityPreemptionPolicy) we have been 
> experimenting with.  The ProportionalCapacityPreemptionPolicy behaves as 
> follows:
> # it gathers from the scheduler the state of the queues, in particular their 
> current capacity, guaranteed capacity and pending requests (*)
> # if there are pending requests from queues that are under capacity, it 
> computes a new ideal balanced state (**)
> # it computes the set of preemptions needed to repair the current schedule 
> and achieve capacity balance (accounting for natural completion rates, and 
> respecting bounds on the amount of preemption we allow for each round)
> # it selects which applications to preempt from each over-capacity queue (the 
> last one in the FIFO order)
> # it removes reservations from the most recently assigned app until the 
> amount of resource to reclaim is obtained, or until no more reservations exist
> # (if not enough) it issues preemptions for containers from the same 
> application (reverse chronological order, last assigned container first), 
> again until enough is reclaimed or until no containers except the AM 
> container are left
> # (if not enough) it moves on to unreserving and preempting from the next 
> application 
> # containers that have been asked to preempt are tracked across executions; 
> if a container remains among those to be preempted for more than a certain 
> time, it is moved into the list of containers to be forcibly killed 
> Notes:
> (*) at the moment, in order to avoid double-counting of the requests, we only 
> look at the "ANY" part of pending resource requests, which means we might not 
> preempt on behalf of AMs that ask only for specific locations but not ANY. 
> (**) The ideal balanced state is one in which each queue has at least its 
> guaranteed capacity, and the spare capacity is distributed among queues (that 
> want some) as a weighted fair share, where the weighting is based on the 
> guaranteed capacity of a queue and the function runs to a fixed point.  
> Tunables of the ProportionalCapacityPreemptionPolicy:
> # observe-only mode (i.e., log the actions it would take, but behave as 
> read-only)
> # how frequently to run the policy
> # how long to wait between preemption and kill of a container
> # which fraction of the containers I would like to obtain should I preempt 
> (has to do with

[jira] [Updated] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-649:


Attachment: YARN-649-3.patch

> Make container logs available over HTTP in plain text
> -
>
> Key: YARN-649
> URL: https://issues.apache.org/jira/browse/YARN-649
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
> YARN-752-1.patch
>
>
> It would be good to make container logs available over the REST API for 
> MAPREDUCE-4362 and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-649) Make container logs available over HTTP in plain text

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692501#comment-13692501
 ] 

Sandy Ryza commented on YARN-649:
-

Uploading a patch that takes Vinod's comments into account. It
* Fixes the SecureIOUtils hole (doh!)
* Makes separate ContainerLogsUtils#getContainerLogFile and getContainerLogDirs
* Throws appropriate error codes instead of just returning a string
* Uses StreamingOutput to avoid unbounded buffering
* Marks the API as evolving

I still need to add documentation.

Regarding logs for old jobs, is there a reason that the implementation choice 
would change the API?
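A hedged sketch of the StreamingOutput point above, assuming a JAX-RS resource, 
a SecureIOUtils ownership check, and hypothetical resolveLogFile/resolveOwner 
helpers; it is illustrative only, not the YARN-649 patch itself:

{noformat}
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.StreamingOutput;

import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SecureIOUtils;

@Path("/ws/v1/node")
public class ContainerLogResourceSketch {

  @GET
  @Path("/containerlogs/{containerid}/{filename}")
  @Produces(MediaType.TEXT_PLAIN)
  public Response getContainerLogFile(@PathParam("containerid") final String containerId,
                                      @PathParam("filename") final String filename) {
    // Resolving the on-disk log file and its owner is elided; both helpers are
    // placeholders for whatever ContainerLogsUtils ends up providing.
    final File logFile = resolveLogFile(containerId, filename);
    final String owner = resolveOwner(containerId);

    StreamingOutput stream = new StreamingOutput() {
      @Override
      public void write(OutputStream os) throws IOException {
        // SecureIOUtils re-checks file ownership before opening, which is the
        // kind of check the "SecureIOUtils hole" fix refers to.
        InputStream in = SecureIOUtils.openForRead(logFile, owner, null);
        try {
          IOUtils.copyBytes(in, os, 4096, false); // stream in 4 KB chunks
          os.flush();
        } finally {
          IOUtils.closeStream(in);
        }
      }
    };
    return Response.ok(stream).build();
  }

  private File resolveLogFile(String containerId, String filename) {
    throw new UnsupportedOperationException("placeholder");
  }

  private String resolveOwner(String containerId) {
    throw new UnsupportedOperationException("placeholder");
  }
}
{noformat}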

> Make container logs available over HTTP in plain text
> -
>
> Key: YARN-649
> URL: https://issues.apache.org/jira/browse/YARN-649
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-649-2.patch, YARN-649-3.patch, YARN-649.patch, 
> YARN-752-1.patch
>
>
> It would be good to make container logs available over the REST API for 
> MAPREDUCE-4362 and so that they can be accessed programmatically in general.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692476#comment-13692476
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

I re-deployed my test cluster with YARN-600, YARN-799, and your latest patch 
(.2.patch) from YARN-864. I simulated the timeout using kill -STOP (as 
described above), and your patch worked! :)

I'm going to let the cluster run for 24h before declaring victory, but this 
looks promising. I'll follow up tomorrow, when I know more.

Cheers,
Chris

> YARN NM leaking containers with CGroups
> ---
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>Reporter: Chris Riccomini
> Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> misunderstanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_02 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:12

[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692475#comment-13692475
 ] 

Hadoop QA commented on YARN-883:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12589493/YARN-883.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSLeafQueue
  
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1389//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1389//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1389//console

This message is automatically generated.

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-884) AM expiry interval should be set to smaller of {am, nm}.liveness-monitor.expiry-interval-ms

2013-06-24 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-884:
-

 Summary: AM expiry interval should be set to smaller of {am, 
nm}.liveness-monitor.expiry-interval-ms
 Key: YARN-884
 URL: https://issues.apache.org/jira/browse/YARN-884
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 2.0.4-alpha
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla


As the AM can't outlive the NM on which it is running, it is a good idea to 
disallow setting the am.liveness-monitor.expiry-interval-ms to a value higher 
than nm.liveness-monitor.expiry-interval-ms

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692436#comment-13692436
 ] 

Sandy Ryza commented on YARN-883:
-

Submitted a patch that adds an FSQueueMetrics class, which extends QueueMetrics. 
Verified that the metrics show up on a pseudo-distributed cluster.
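
For readers skimming the thread, here is a minimal, self-contained sketch of the 
shape such a class could take. It is plain Java and does not use the actual 
Hadoop metrics2 wiring; the class and method names below are illustrative only, 
not the ones in the attached patch.

{code:java}
// Illustrative only: a QueueMetrics-like base class extended with the Fair
// Scheduler-specific values described in this issue (fair share, minimum
// share, maximum share). The real patch registers these with the Hadoop
// metrics2 system rather than holding plain fields.
class QueueMetricsSketch {
  protected final String queueName;

  QueueMetricsSketch(String queueName) {
    this.queueName = queueName;
  }
}

class FSQueueMetricsSketch extends QueueMetricsSketch {
  private long fairShareMB;
  private long minShareMB;
  private long maxShareMB;

  FSQueueMetricsSketch(String queueName) {
    super(queueName);
  }

  void setFairShareMB(long mb) { fairShareMB = mb; }
  void setMinShareMB(long mb)  { minShareMB = mb; }
  void setMaxShareMB(long mb)  { maxShareMB = mb; }

  @Override
  public String toString() {
    return queueName + ": fairShareMB=" + fairShareMB
        + ", minShareMB=" + minShareMB + ", maxShareMB=" + maxShareMB;
  }
}

public class FSQueueMetricsExample {
  public static void main(String[] args) {
    // Values a scheduler update could push for one queue.
    FSQueueMetricsSketch metrics = new FSQueueMetricsSketch("root.default");
    metrics.setFairShareMB(4096);
    metrics.setMinShareMB(1024);
    metrics.setMaxShareMB(8192);
    System.out.println(metrics);
  }
}
{code}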

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated YARN-883:


Attachment: YARN-883.patch

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Moved] (YARN-883) Expose Fair Scheduler-specific queue metrics

2013-06-24 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza moved MAPREDUCE-5350 to YARN-883:


  Component/s: (was: scheduler)
   scheduler
Affects Version/s: (was: 2.0.5-alpha)
   2.0.5-alpha
  Key: YARN-883  (was: MAPREDUCE-5350)
  Project: Hadoop YARN  (was: Hadoop Map/Reduce)

> Expose Fair Scheduler-specific queue metrics
> 
>
> Key: YARN-883
> URL: https://issues.apache.org/jira/browse/YARN-883
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.5-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-883.patch
>
>
> When the Fair Scheduler is enabled, QueueMetrics should include fair share, 
> minimum share, and maximum share.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692388#comment-13692388
 ] 

Jian He commented on YARN-864:
--

Hi Chris,
that failure was due to the reboot starting even before the stop had fully completed.
I uploaded a new patch and tested it locally. Let me know if that works, thanks.
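
A minimal sketch of the ordering being described, under the assumption that the 
fix amounts to not re-initializing until stop has fully completed. The class 
below is hypothetical and is not the NodeManager code.

{code:java}
import java.util.concurrent.CountDownLatch;

// Hypothetical service, not the NodeManager: reboot() must not re-initialize
// until stop() has fully completed, which is the ordering described above.
public class RebootOrderingExample {
  static class RestartableService {
    private final CountDownLatch stopped = new CountDownLatch(1);

    void stop() {
      // ... stop dispatcher threads, release resources, etc. ...
      stopped.countDown();     // signal that stop has fully completed
    }

    void reboot() throws InterruptedException {
      stopped.await();         // wait for stop() before re-initializing
      // ... create and start the replacement service here ...
    }
  }

  public static void main(String[] args) throws InterruptedException {
    RestartableService service = new RestartableService();
    new Thread(service::stop).start();
    service.reboot();          // returns only after stop() has signalled
    System.out.println("rebooted only after stop completed");
  }
}
{code}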

> YARN NM leaking containers with CGroups
> ---
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>Reporter: Chris Riccomini
> Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> mis-understanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_02 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)

[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-864:
-

Attachment: YARN-864.2.patch

> YARN NM leaking containers with CGroups
> ---
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>Reporter: Chris Riccomini
> Attachments: rm-log, YARN-864.1.patch, YARN-864.2.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> mis-understanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_02 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanage

[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692296#comment-13692296
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

With your patch applied, the new error (in the NM) is:

{noformat}
19:33:36,741  INFO NodeStatusUpdaterImpl:365 - Node is out of sync with 
ResourceManager, hence rebooting.
19:33:36,764  INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 
14751 for container-id container_1372091455469_0002_01_02: 779.3 MB of 1.3 
GB physical memory used; 1.6 GB of 10 GB virtual memory used
19:33:37,239  INFO NodeManager:315 - Rebooting the node manager.
19:33:37,261  INFO NodeManager:229 - Containers still running on shutdown: 
[container_1372091455469_0002_01_02]
19:33:37,278 FATAL AsyncDispatcher:137 - Error in dispatcher thread
org.apache.hadoop.metrics2.MetricsException: Metrics source JvmMetrics already 
exists!
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:126)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:107)
at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
at 
org.apache.hadoop.metrics2.source.JvmMetrics.create(JvmMetrics.java:79)
at 
org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:49)
at 
org.apache.hadoop.yarn.server.nodemanager.metrics.NodeManagerMetrics.create(NodeManagerMetrics.java:45)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.(NodeManager.java:75)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.createNewNodeManager(NodeManager.java:357)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.reboot(NodeManager.java:316)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:348)
at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
at java.lang.Thread.run(Thread.java:619)
{noformat}

For the record, you can reproduce this yourself by:

1. Start a YARN RM and NM.
2. Run a YARN job on the cluster that uses at least one container.
3. Run kill -STOP on the NM process.
4. Wait 65 seconds (enough for the NM to time out).
5. Run kill -CONT on the NM process.

You will see the NM trigger a reboot since it's out of sync with the RM.
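
For context on the FATAL above: the failure mode is re-registering a metrics 
source under a name that is still registered in a process-wide registry. A 
hypothetical, map-based illustration (not Hadoop's DefaultMetricsSystem) is:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical, map-based registry (not DefaultMetricsSystem): registering a
// second source under a name that is still present throws, which is what an
// in-process reboot that rebuilds its metrics sources runs into.
public class DuplicateSourceExample {
  private static final Map<String, Object> SOURCES = new HashMap<String, Object>();

  static void register(String name, Object source) {
    if (SOURCES.containsKey(name)) {
      throw new IllegalStateException("Metrics source " + name + " already exists!");
    }
    SOURCES.put(name, source);
  }

  public static void main(String[] args) {
    register("JvmMetrics", new Object());  // first instance registers fine
    register("JvmMetrics", new Object());  // the "rebooted" instance throws here
  }
}
{code}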

> YARN NM leaking containers with CGroups
> ---
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>Reporter: Chris Riccomini
> Attachments: rm-log, YARN-864.1.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> mis-understanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_02 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at 
> org.apache.hadoop.yarn.event.AsyncDis

[jira] [Created] (YARN-882) Specify per user quota for private/application cache and user log files

2013-06-24 Thread Omkar Vinit Joshi (JIRA)
Omkar Vinit Joshi created YARN-882:
--

 Summary: Specify per user quota for private/application cache and 
user log files
 Key: YARN-882
 URL: https://issues.apache.org/jira/browse/YARN-882
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Omkar Vinit Joshi


At present there is no limit on the number or total size of the files localized 
by a single user. Similarly, there is no limit on the size of the log files 
created by a user's running containers.
We need to restrict the user in both respects. For LocalizedResources, this is a 
serious concern in a secured environment, where a malicious user can start one 
container and localize resources whose total size >= 
DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter, localization will either 
fail (if no extra space is left on disk) or the deletion service will keep 
removing localized files belonging to other containers/applications. 
The limit for logs/localized resources should be decided by the RM and sent to 
the NM via the secured containerToken. All these configurations should be per 
container instead of per user or per NM.
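
To make the proposed enforcement concrete, here is a hedged sketch of a 
per-container localized-size check. The class, method names, and the limit value 
are hypothetical and are not the NodeManager's actual localization code.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical sketch: refuse further localization once the bytes already
// localized for one container exceed a limit handed down by the RM (e.g.
// inside the container token). Names and the limit are illustrative only.
public class ContainerLocalizationCap {

  static long localizedBytes(Path containerDir) throws IOException {
    try (Stream<Path> files = Files.walk(containerDir)) {
      return files.filter(Files::isRegularFile)
                  .mapToLong(p -> p.toFile().length())
                  .sum();
    }
  }

  static void checkCap(Path containerDir, long limitBytes) throws IOException {
    long used = localizedBytes(containerDir);
    if (used > limitBytes) {
      throw new IOException("Container localized " + used + " bytes, over the "
          + "per-container limit of " + limitBytes + " bytes");
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: pass a container's local directory; enforce a 1 GiB cap.
    checkCap(Paths.get(args[0]), 1L << 30);
    System.out.println("within the per-container limit");
  }
}
{code}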

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (YARN-882) Specify per user quota for private/application cache and user log files

2013-06-24 Thread Omkar Vinit Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Vinit Joshi updated YARN-882:
---

Description: 
At present there is no limit on the number or total size of the files localized 
by a single user. Similarly, there is no limit on the size of the log files 
created by a user's running containers.

We need to restrict the user in both respects.
For LocalizedResources, this is a serious concern in a secured environment, 
where a malicious user can start one container and localize resources whose 
total size >= DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter, 
localization will either fail (if no extra space is left on disk) or the 
deletion service will keep removing localized files belonging to other 
containers/applications. 
The limit for logs/localized resources should be decided by the RM and sent to 
the NM via the secured containerToken. All these configurations should be per 
container instead of per user or per NM.

  was:
At present there is no limit on the number of files / size of the files 
localized by single user. Similarly there is no limit on the size of the log 
files created by user via running containers.
We need to restrict the user for this. For LocalizedResources; this has serious 
concerns in case of secured environment where malicious user can start one 
container and localize resources whose total size >= 
DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter it will either fail (if 
no extra space is present on disk) or deletion service will keep removing 
localized files for other containers/applications. 
The limit for logs/localized resource should be decided by RM and sent to NM 
via secured containerToken. All these configurations should per container 
instead of per user or per nm.


> Specify per user quota for private/application cache and user log files
> ---
>
> Key: YARN-882
> URL: https://issues.apache.org/jira/browse/YARN-882
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Omkar Vinit Joshi
>Assignee: Omkar Vinit Joshi
>
> At present there is no limit on the number or total size of the files 
> localized by a single user. Similarly, there is no limit on the size of the 
> log files created by a user's running containers.
> We need to restrict the user in both respects.
> For LocalizedResources, this is a serious concern in a secured environment, 
> where a malicious user can start one container and localize resources whose 
> total size >= DEFAULT_NM_LOCALIZER_CACHE_TARGET_SIZE_MB. Thereafter, 
> localization will either fail (if no extra space is left on disk) or the 
> deletion service will keep removing localized files belonging to other 
> containers/applications. 
> The limit for logs/localized resources should be decided by the RM and sent 
> to the NM via the secured containerToken. All these configurations should be 
> per container instead of per user or per NM.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-339) TestResourceTrackerService is failing intermittently

2013-06-24 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692270#comment-13692270
 ] 

Ravi Prakash commented on YARN-339:
---

Hi Vinod! Nope, I can't reproduce this anymore. Closing as fixed. Please 
re-open if you think the patch should still go in. Thanks, Jian He and Vinod!

> TestResourceTrackerService is failing intermittently
> 
>
> Key: YARN-339
> URL: https://issues.apache.org/jira/browse/YARN-339
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0, 0.23.5
>Reporter: Ravi Prakash
>Assignee: Jian He
> Attachments: YARN-339.patch
>
>
> The test after testReconnectNode() is failing usually. This might be a race 
> condition in Metrics2 code. 
> Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 3.127 sec <<< 
> FAILURE!
> testDecommissionWithIncludeHosts(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)
>   Time elapsed: 55 sec  <<< ERROR!
> org.apache.hadoop.metrics2.MetricsException: Metrics source ClusterMetrics 
> already exists!
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:134)
>   at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:115)
>   at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:217)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.registerMetrics(ClusterMetrics.java:71)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.getMetrics(ClusterMetrics.java:58)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testDecommissionWithIncludeHosts(TestResourceTrackerService.java:74)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-736) Add a multi-resource fair sharing metric

2013-06-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692241#comment-13692241
 ] 

Hudson commented on YARN-736:
-

Integrated in Hadoop-trunk-Commit #4005 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/4005/])
YARN-736. Add a multi-resource fair sharing metric. (sandyr via tucu) 
(Revision 1496153)

 Result = SUCCESS
tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1496153
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/AppSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/Schedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/ComputeFairShares.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/DominantResourceFairnessPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FairSharePolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/FifoPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FakeSchedulable.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestComputeFairShares.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/policies/TestDominantResourceFairnessPolicy.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/FairScheduler.apt.vm


> Add a multi-resource fair sharing metric
> 
>
> Key: YARN-736
> URL: https://issues.apache.org/jira/browse/YARN-736
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Fix For: 2.2.0
>
> Attachments: YARN-736-1.patch, YARN-736-2.patch, YARN-736-3.patch, 
> YARN-736-4.patch, YARN-736.patch
>
>
> Currently, at a regular interval, the fair scheduler computes a fair memory 
> share for each queue and application inside it.  This fair share is not used 
> for scheduling decisions, but is displayed in the web UI, exposed as a 
> metric, and used for preemption decisions.
> With DRF and multi-resource scheduling, assigning a memory share as the fair 
> share metric to every queue no longer makes sense.  It's not obvious what the 
> replacement should be, but probably something like fractional fairness within 
> a queue, or distance from an ideal cluster state.
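
As one possible reading of the "fractional fairness" idea above, a hedged sketch 
that reports the largest usage-to-fair-share ratio across resources for a queue. 
The names are hypothetical and this is not the metric the committed patch adds.

{code:java}
// Hypothetical metric: the largest usage-to-fair-share ratio across the
// resources a queue uses. 1.0 means exactly at fair share; above 1.0, over it.
public class FractionalFairnessExample {

  static double fairnessRatio(long usedMB, long fairShareMB,
                              int usedVcores, int fairShareVcores) {
    double memRatio = fairShareMB == 0 ? 0.0 : (double) usedMB / fairShareMB;
    double cpuRatio = fairShareVcores == 0 ? 0.0 : (double) usedVcores / fairShareVcores;
    return Math.max(memRatio, cpuRatio);
  }

  public static void main(String[] args) {
    // A queue using 6 GB of an 8 GB fair share and 6 of 4 fair vcores:
    // CPU dominates, so the ratio is 1.5 (50% over its fair share).
    System.out.println(fairnessRatio(6144, 8192, 6, 4));
  }
}
{code}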

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-881) Priority#compareTo method seems to be wrong.

2013-06-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692224#comment-13692224
 ] 

Sandy Ryza commented on YARN-881:
-

There are places in the code that rely on the current ordering, 
AppSchedulingInfo, for example.  The thinking may have been that we most 
commonly want to traverse priorities from high to low, which is more 
straightforward if the higher ones are at the front of the list.
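
To make the ordering discussion concrete, a small self-contained example with a 
hypothetical priority class (not the YARN Priority type). It shows that the 
ordering described above, ascending by the raw int, puts the highest priority 
(lowest int) at the front of a sorted list, while the variant raised in this 
JIRA would reverse that.

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical priority type (not org.apache.hadoop.yarn.api.records.Priority),
// where a lower int means a higher priority.
public class PriorityOrderingExample {
  static class P implements Comparable<P> {
    final int value;
    P(int value) { this.value = value; }

    @Override
    public int compareTo(P other) {
      // Ordering described above: ascending by the raw int, so the highest
      // priority (lowest int) sorts to the front of the list.
      return this.value - other.value;
      // The alternative raised in YARN-881 would be: other.value - this.value,
      // which reverses the sort order.
    }

    @Override
    public String toString() { return Integer.toString(value); }
  }

  public static void main(String[] args) {
    List<P> priorities = new ArrayList<P>(Arrays.asList(new P(3), new P(1), new P(2)));
    Collections.sort(priorities);
    System.out.println(priorities);  // [1, 2, 3]: highest priority first
  }
}
{code}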

> Priority#compareTo method seems to be wrong.
> 
>
> Key: YARN-881
> URL: https://issues.apache.org/jira/browse/YARN-881
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jian He
>Assignee: Jian He
>
> if lower int value means higher priority, shouldn't we "return 
> other.getPriority() - this.getPriority() " 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-736) Add a multi-resource fair sharing metric

2013-06-24 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692212#comment-13692212
 ] 

Alejandro Abdelnur commented on YARN-736:
-

+1

> Add a multi-resource fair sharing metric
> 
>
> Key: YARN-736
> URL: https://issues.apache.org/jira/browse/YARN-736
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.0.4-alpha
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
> Attachments: YARN-736-1.patch, YARN-736-2.patch, YARN-736-3.patch, 
> YARN-736-4.patch, YARN-736.patch
>
>
> Currently, at a regular interval, the fair scheduler computes a fair memory 
> share for each queue and application inside it.  This fair share is not used 
> for scheduling decisions, but is displayed in the web UI, exposed as a 
> metric, and used for preemption decisions.
> With DRF and multi-resource scheduling, assigning a memory share as the fair 
> share metric to every queue no longer makes sense.  It's not obvious what the 
> replacement should be, but probably something like fractional fairness within 
> a queue, or distance from an ideal cluster state.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-881) Priority#compareTo method seems to be wrong.

2013-06-24 Thread Jian He (JIRA)
Jian He created YARN-881:


 Summary: Priority#compareTo method seems to be wrong.
 Key: YARN-881
 URL: https://issues.apache.org/jira/browse/YARN-881
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He


if lower int value means higher priority, shouldn't we "return 
other.getPriority() - this.getPriority() " 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups

2013-06-24 Thread Chris Riccomini (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692138#comment-13692138
 ] 

Chris Riccomini commented on YARN-864:
--

Hey Jian,

Awesome. I've patched and started the cluster with YARN-600, YARN-799, and 
YARN-864. I'll keep you posted.

Cheers,
Chris

> YARN NM leaking containers with CGroups
> ---
>
> Key: YARN-864
> URL: https://issues.apache.org/jira/browse/YARN-864
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.0.5-alpha
> Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and 
> YARN-600.
>Reporter: Chris Riccomini
> Attachments: rm-log, YARN-864.1.patch
>
>
> Hey Guys,
> I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm 
> seeing containers getting leaked by the NMs. I'm not quite sure what's going 
> on -- has anyone seen this before? I'm concerned that maybe it's a 
> mis-understanding on my part about how YARN's lifecycle works.
> When I look in my AM logs for my app (not an MR app master), I see:
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. 
> This means that container container_1371141151815_0008_03_02 was killed 
> by YARN, either due to being released by the application master or being 
> 'lost' due to node failures etc.
> 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container 
> container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a 
> new container for the task.
> The AM has been running steadily the whole time. Here's what the NM logs say:
> {noformat}
> 05:34:59,783  WARN AsyncDispatcher:109 - Interrupted Exception while stopping
> java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1143)
> at java.lang.Thread.join(Thread.java:1196)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
> at 
> org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,314  WARN ContainersMonitorImpl:463 - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>  is interrupted. Exiting.
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598
> 05:35:00,434  WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup 
> at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> 05:35:00,434  WARN ContainerLaunch:247 - Failed to launch container.
> java.io.IOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:205)
> at org.apache.hadoop.util.Shell.run(Shell.java:129)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230)
> at 
> org

[jira] [Commented] (YARN-871) Failed to run MR example against latest trunk

2013-06-24 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692108#comment-13692108
 ] 

Zhijie Shen commented on YARN-871:
--

[~devaraj.k], the posted exception seems to be related to HADOOP-9421 and 
YARN-827. YARN-874 is tracking the issue.

> Failed to run MR example against latest trunk
> -
>
> Key: YARN-871
> URL: https://issues.apache.org/jira/browse/YARN-871
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
> Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log
>
>
> Built the latest trunk, deployed a single node cluster and ran examples, such 
> as
> {code}
>  hadoop jar 
> hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>  teragen 10 out1
> {code}
> The job failed with the following console message:
> {code}
> 13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
> 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in 
> uber mode : false
> 13/06/21 12:51:31 INFO mapreduce.Job:  map 0% reduce 0%
> 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with 
> state FAILED due to: Application application_1371844267731_0001 failed 2 
> times due to AM Container for appattempt_1371844267731_0001_02 exited 
> with  exitCode: 127 due to: 
> .Failing this attempt.. Failing the application.
> 13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (YARN-880) Configuring map/reduce memory equal to nodemanager's memory, hangs the job execution

2013-06-24 Thread Nishan Shetty (JIRA)
Nishan Shetty created YARN-880:
--

 Summary: Configuring map/reduce memory equal to nodemanager's 
memory, hangs the job execution
 Key: YARN-880
 URL: https://issues.apache.org/jira/browse/YARN-880
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.1-alpha
Reporter: Nishan Shetty
Priority: Critical


Scenario:
=
Cluster is installed with 2 NodeManagers.

Configuration:

NM memory (yarn.nodemanager.resource.memory-mb): 8 GB
map and reduce memory: 8 GB
AppMaster memory: 2 GB

If a map task is reserved on the same NodeManager where the AppMaster of the 
same job is running, the job execution hangs.
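
A hedged illustration of why the execution hangs under this configuration, 
assuming the map container is reserved on the node where the AM already holds 
2 GB (illustrative arithmetic only, not scheduler code):

{code:java}
// Illustrative arithmetic only, not scheduler code: with the AM holding 2 GB
// on an 8 GB node, only 6 GB remain, so an 8 GB map container reserved on
// that same node can never be satisfied while the AM is running.
public class ReservationHangExample {
  public static void main(String[] args) {
    int nodeCapacityMB = 8 * 1024;
    int amMB = 2 * 1024;
    int mapContainerMB = 8 * 1024;

    int freeMB = nodeCapacityMB - amMB;              // 6144 MB left on the node
    boolean mapFits = mapContainerMB <= freeMB;      // false: the job hangs
    System.out.println("free=" + freeMB + " MB, map container fits=" + mapFits);
  }
}
{code}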

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-871) Failed to run MR example against latest trunk

2013-06-24 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691868#comment-13691868
 ] 

Devaraj K commented on YARN-871:


{code:xml}
2013-06-24 20:58:05,102 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
application_1372087479441_0002 failed 2 times due to Error launching 
appattempt_1372087479441_0002_02. Got exception: java.io.IOException: 
Failed on local exception: java.io.IOException: java.io.IOException: Server 
asks us to fall back to SIMPLE auth, but this client is configured to only 
allow secure connections.; Host Details : local host is: 
"HOST-10-18-91-57/10.18.91.57"; destination host is: "HOST-10-18-91-57":12356; 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
at org.apache.hadoop.ipc.Client.call(Client.java:1318)
at org.apache.hadoop.ipc.Client.call(Client.java:1266)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at $Proxy23.startContainer(Unknown Source)
at 
org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainer(ContainerManagementProtocolPBClientImpl.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:110)
at 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:228)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: java.io.IOException: Server asks us to fall 
back to SIMPLE auth, but this client is configured to only allow secure 
connections.
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:589)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
at 
org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:552)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:635)
at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:258)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1367)
at org.apache.hadoop.ipc.Client.call(Client.java:1285)
... 9 more
Caused by: java.io.IOException: Server asks us to fall back to SIMPLE auth, but 
this client is configured to only allow secure connections.
at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:250)
at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:464)
at org.apache.hadoop.ipc.Client$Connection.access$1500(Client.java:258)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:628)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:625)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:624)
... 12 more
. Failing the application.
{code}

> Failed to run MR example against latest trunk
> -
>
> Key: YARN-871
> URL: https://issues.apache.org/jira/browse/YARN-871
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zhijie Shen
> Attachments: yarn-zshen-resourcemanager-ZShens-MacBook-Pro.local.log
>
>
> Built the latest trunk, deployed a single node cluster and ran examples, such 
> as
> {code}
>  hadoop jar 
> hadoop-3.0.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0-SNAPSHOT.jar
>  teragen 10 out1
> {code}
> The job failed with the following console message:
> {code}
> 13/06/21 12:51:25 INFO mapreduce.Job: Running job: job_1371844267731_0001
> 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 running in 
> uber mode : false
> 13/06/21 12:51:31 INFO mapreduce.Job:  map 0% reduce 0%
> 13/06/21 12:51:31 INFO mapreduce.Job: Job job_1371844267731_0001 failed with 
> state FAILED due to: Application application_1371844267731_0001 failed 2 
> times due to AM Container for appattempt_1371844267731_0001_02 exited 
> with  exitCode: 127 due to: 
> .Failing this attempt.. Failing the application.
> 13/06/21 12:51:31 INFO mapreduce.Job: Counters: 0
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JI