[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034876#comment-14034876 ] Bikas Saha commented on YARN-2052:
--
With 32 bits for the epoch number we have 4 billion restarts before it overflows. We are probably safe without any handling.

> ContainerId creation after work preserving restart is broken
>
> Key: YARN-2052
> URL: https://issues.apache.org/jira/browse/YARN-2052
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Tsuyoshi OZAWA
> Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch
>
> Container ids are made unique by using the app identifier and appending a monotonically increasing sequence number to it. Since container creation is a high churn activity the RM does not store the sequence number per app. So after restart it does not know what the new sequence number should be for new allocations.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034812#comment-14034812 ] Yi Tian commented on YARN-2083:
--
[~ywskycn], thanks for your advice. YARN-2083-3.patch works fine on trunk, and YARN-2083-2.patch works fine on branch-2.4.1. Is it possible to apply this patch to the YARN project?

> In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
>
> Key: YARN-2083
> URL: https://issues.apache.org/jira/browse/YARN-2083
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.3.0
> Reporter: Yi Tian
> Labels: assignContainer, fair, scheduler
> Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083-3.patch, YARN-2083.patch
>
> In fair scheduler, FSParentQueue and FSLeafQueue do an assignContainerPreCheck to guarantee the queue is not over its limit. But the fitsIn function in Resource.java does not return false when the usedResource equals the maxResource. I think we should create a new function "fitsInWithoutEqual" instead of "fitsIn" in this case.
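To make the fitsIn vs. fitsInWithoutEqual distinction above concrete, here is a minimal sketch; this is not the actual Hadoop Resource API (which tracks memory and vcores), and all names other than fitsIn are illustrative:

```java
// Minimal sketch of the precheck discussed above. A single long stands in
// for the real YARN Resource.
public class ResourceCheck {
    // Existing fitsIn semantics: passes when smaller <= bigger, including equality.
    static boolean fitsIn(long smaller, long bigger) {
        return smaller <= bigger;
    }

    // Proposed strict variant: equality no longer fits, so a queue whose
    // usedResource has already reached maxResource is rejected.
    static boolean fitsInWithoutEqual(long smaller, long bigger) {
        return smaller < bigger;
    }

    // assignContainerPreCheck as described in the issue: refuse to assign
    // more containers once the queue has reached its cap.
    static boolean assignContainerPreCheck(long usedResource, long maxResource) {
        return fitsInWithoutEqual(usedResource, maxResource);
    }
}
```

With the original fitsIn, a queue sitting exactly at its cap still passes the precheck and can be assigned one more container, which is the bug this issue reports.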
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083:
--
Fix Version/s: (was: 2.4.1)
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034777#comment-14034777 ] Hadoop QA commented on YARN-2083:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650950/YARN-2083-3.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4019//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4019//console
This message is automatically generated.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034746#comment-14034746 ] Tsuyoshi OZAWA commented on YARN-2052:
--
{quote}
We should make it a long in the same release as the epoch number addition so that we don't have to worry about that.
{quote}
+1 to doing this in the same release; we'll plan the improvement in another JIRA. That said, I think it's important that we decide the behavior when the overflow happens. We have two options: simply aborting the RM for now, or starting apps from a clean state after the restart. Since we're planning to make the id a long right after this JIRA, we can take the aborting approach for simplicity, to prevent unexpected behavior. [~bikassaha], [~jianhe], what do you think about this?
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083:
--
Attachment: YARN-2083-3.patch

Small change to adapt to YARN-1474 (Make schedulers services).
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034732#comment-14034732 ] Bikas Saha commented on YARN-2052:
--
Ah, I did not see the rest of the comment. Yes, integer overflow is a problem. We should make it a long in the same release as the epoch number addition so that we don't have to worry about that.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034731#comment-14034731 ] Bikas Saha commented on YARN-2052:
--
Why would ContainerId#compareTo fail? Existing containerIds should remain unchanged after RM restart. Only new container ids should have a different epoch number.
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034725#comment-14034725 ] Hadoop QA commented on YARN-2144:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650937/YARN-2144.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4018//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4018//console
This message is automatically generated.

> Add logs when preemption occurs
>
> Key: YARN-2144
> URL: https://issues.apache.org/jira/browse/YARN-2144
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: capacityscheduler
> Affects Versions: 2.5.0
> Reporter: Tassapol Athiapinya
> Assignee: Wangda Tan
> Attachments: AM-page-preemption-info.png, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch, YARN-2144.patch
>
> There should be easy-to-read logs when preemption does occur.
> 1. For debugging purposes, the RM should log this.
> 2. For administrative purposes, the RM webpage should have a page to show recent preemption events.
> RM logs should have the following properties:
> * Logs are retrievable while an application is still running, and often flushed.
> * Can distinguish between AM container preemption and task container preemption, with the container ID shown.
> * Should be INFO level logs.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034722#comment-14034722 ] Tsuyoshi OZAWA commented on YARN-2052:
--
I meant starting apps from a clean state after the restart, as in RM restart phase 1. If the sequence numbers are reset to zero, some applications can behave unexpectedly because {{ContainerId#compareTo}} no longer works correctly. If the apps start from a clean state, we can avoid that situation.
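A toy example of the compareTo hazard described above; this simplified class is not the real ContainerId, but the ordering idea is the same. If the sequence counter restarts at zero, a post-restart container can receive the same (appId, sequence) pair as a pre-restart one, and ordering can no longer tell them apart:

```java
// Simplified stand-in for ContainerId: ordering is by (appId, sequence).
public class SimpleContainerId implements Comparable<SimpleContainerId> {
    final int appId;
    final int sequence;

    SimpleContainerId(int appId, int sequence) {
        this.appId = appId;
        this.sequence = sequence;
    }

    @Override
    public int compareTo(SimpleContainerId other) {
        if (appId != other.appId) {
            return Integer.compare(appId, other.appId);
        }
        // After a restart that resets sequence numbers, two distinct
        // containers can collide here and compare as equal.
        return Integer.compare(sequence, other.sequence);
    }
}
```

A container allocated before the restart with sequence 7 and a different container allocated after the restart with sequence 7 compare as equal, so any sorted structure keyed by id conflates them.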
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034716#comment-14034716 ] Jian He commented on YARN-2052:
--
bq. One simple way is to fallback to RM-restart implemented in YARN-128

Can you clarify what you mean?
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034702#comment-14034702 ] Tsuyoshi OZAWA commented on YARN-2052:
--
[~bikassaha], yes, I think it's the same.
[jira] [Updated] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2144:
--
Attachment: YARN-2144.patch

Rebased patch to latest trunk.
[jira] [Commented] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034700#comment-14034700 ] Bikas Saha commented on YARN-1373:
--
Sorry, I am not clear on how this is a dup. This JIRA is tracking new behavior in the RM that will transition a recovered RMAppImpl/RMAppAttemptImpl app (one that is still actually running) to a RUNNING state instead of a terminal recovered state. This is to ensure that the state machines are in the correct state for the running AM to resync and continue as running. This is not related to killing the app master process on the NM.

> Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
>
> Key: YARN-1373
> URL: https://issues.apache.org/jira/browse/YARN-1373
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Bikas Saha
> Assignee: Omkar Vinit Joshi
>
> Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. Instead, it will have to transition the last attempt to a running state such that it can proceed as normal once the running attempt has resynced with the ApplicationMasterService (YARN-1365 and YARN-1366). If the RM had started the application container before dying then the AM would be up and trying to contact the RM. The RM may have died before launching the container. For this case, the RM should wait for the AM liveliness period and issue a kill container for the stored master container. It should transition this attempt to some RECOVER_ERROR state and proceed to start a new attempt.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034691#comment-14034691 ] Bikas Saha commented on YARN-2052:
--
bq. Had an offline discussion with Vinod. Maybe it's still better to persist some sequence number to indicate the number of RM restarts when RM starts up.

Is this the same as the epoch number that was mentioned earlier in this JIRA? https://issues.apache.org/jira/browse/YARN-2052?focusedCommentId=13996675&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13996675. Seems to me that it's the same, with the epoch number renamed to num-rm-restarts.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034637#comment-14034637 ] Tsuyoshi OZAWA commented on YARN-2052:
--
Basically, I agree with the approach. If we take the sequence-number approach, we should define the behavior when the sequence number overflows. One simple way is to fall back to the RM restart behavior implemented in YARN-128. After changing the containerId/appId from integer to long, it'll happen very rarely. [~jianhe], what do you think about the behavior?
[jira] [Commented] (YARN-2144) Add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034624#comment-14034624 ] Jian He commented on YARN-2144:
--
The patch needs a rebase, can you update it please? Thanks.
[jira] [Commented] (YARN-2147) client lacks delegation token exception details when application submit fails
[ https://issues.apache.org/jira/browse/YARN-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034612#comment-14034612 ] Daryn Sharp commented on YARN-2147:
--
I don't think the patch handles the use case it's designed for. If job submission failed with a bland "Read timed out", then logging all the tokens in the RM log doesn't help the end user, nor does the RM log even answer the question of which token timed out. What you really want to do is change {{DelegationTokenRenewer#handleAppSubmitEvent}} to trap exceptions from {{renewToken}}. Wrap the exception with a more descriptive exception that stringifies to the user as "Can't renew token <token>: Read timed out".

> client lacks delegation token exception details when application submit fails
>
> Key: YARN-2147
> URL: https://issues.apache.org/jira/browse/YARN-2147
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Jason Lowe
> Assignee: Chen He
> Priority: Minor
> Attachments: YARN-2147-v2.patch, YARN-2147.patch
>
> When a client submits an application and the delegation token process fails, the client can lack critical details needed to understand the nature of the error. Only the message of the error exception is conveyed to the client, which sometimes isn't enough to debug.
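Daryn's suggestion can be sketched roughly as follows. This is a hypothetical standalone sketch, not the real DelegationTokenRenewer code; the Renewer interface, renewWithContext name, and string token are all illustrative stand-ins:

```java
import java.io.IOException;

// Sketch of trapping the renew failure and rethrowing with the token's
// identity, so the client sees which token failed rather than a bare
// "Read timed out".
public class TokenRenewWrapper {
    // Stand-in for the real renewToken call path.
    interface Renewer {
        void renew(String token) throws IOException;
    }

    static void renewWithContext(Renewer renewer, String token) throws IOException {
        try {
            renewer.renew(token);
        } catch (IOException e) {
            // Wrap with a descriptive message that names the offending token,
            // preserving the original exception as the cause.
            throw new IOException("Can't renew token " + token + ": " + e.getMessage(), e);
        }
    }
}
```

The key point is that the wrapped message, not just the RM log, is what travels back to the submitting client.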
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034588#comment-14034588 ] Hadoop QA commented on YARN-1341:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650914/YARN-1341v5.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4017//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4017//console
This message is automatically generated.

> Recover NMTokens upon nodemanager restart
>
> Key: YARN-1341
> URL: https://issues.apache.org/jira/browse/YARN-1341
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034541#comment-14034541 ] Jian He commented on YARN-2052:
--
One more problem with the randomId approach: if the user wants to kill a container, the user has to be aware of the random ID.

Had an offline discussion with Vinod. Maybe it's still better to persist a sequence number indicating the number of RM restarts when the RM starts up. Today containerId#id is an int (32 bits); we reserve some bits in the front for the number of RM restarts, e.g. the 32 bits divided as 8 bits for the number of RM restarts and 24 bits for the number of containers. Each time the RM restarts, we increase the RM sequence number. We should also have a follow-up JIRA to change the containerId/appId from integer to long and deprecate the old one. [~ozawa], do you agree?
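The 8/24 bit split proposed above can be sketched like this (illustrative only; the actual field layout chosen on this JIRA may differ):

```java
// 32-bit container id: the high 8 bits carry the RM restart (epoch) count,
// the low 24 bits carry the per-application container sequence number.
public class EpochContainerId {
    static final int SEQ_BITS = 24;
    static final int SEQ_MASK = (1 << SEQ_BITS) - 1; // 0x00FFFFFF

    static int pack(int epoch, int sequence) {
        return (epoch << SEQ_BITS) | (sequence & SEQ_MASK);
    }

    static int epochOf(int id) {
        return id >>> SEQ_BITS;
    }

    static int sequenceOf(int id) {
        return id & SEQ_MASK;
    }
}
```

With this layout, ids minted after a restart always compare greater than any id from an earlier epoch, which is what keeps ContainerId ordering consistent; the cost is capping the scheme at 2^8 restarts and 2^24 containers per app per epoch, hence the follow-up to widen the field to a long.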
[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1341:
--
Attachment: YARN-1341v5.patch

Thanks for taking a look, Junping! I've updated the patch to trunk.
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034474#comment-14034474 ] Tsuyoshi OZAWA commented on YARN-2052:
--
Vinod, OK. I'll create a new JIRA to address it.
{quote}
Another question is how are we going to show the containerId string? Specifically, the toString() method. If we just say "original containerId string + UUID", it'll be inconvenient for debugging as the UUID has no meaning.
{quote}
From a developer's point of view, you're right. One idea is to show the RM_ID instead of a UUID, validating the RM_ID at startup and confirming it does not include an underscore. One concern with this approach is that we'd break backward compatibility of yarn-site.xml. If we can accept that, it's the better approach.
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034456#comment-14034456 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650880/YARN-2171v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4016//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4016//console This message is automatically generated. 
> AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch, YARN-2171v2.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
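The fix direction implied by this issue, letting allocate() read the node count without taking the scheduler lock, can be sketched as follows. This is an illustration, not the actual CapacityScheduler code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: getNumClusterNodes() need not contend on the scheduler lock if the
// node count lives in an atomic counter updated wherever nodes are added or
// removed (those paths already hold the scheduler lock).
public class NodeTracker {
    private final AtomicInteger numNodes = new AtomicInteger();

    public void addNode()    { numNodes.incrementAndGet(); }
    public void removeNode() { numNodes.decrementAndGet(); }

    // Safe to call from AM heartbeats without blocking on the scheduler lock.
    public int getNumClusterNodes() { return numNodes.get(); }
}
```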
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034452#comment-14034452 ] Jian He commented on YARN-2052: --- Another question is how are we going to show the containerId string? specifically the toString() method. If we just say "original containerId string+UUID", it'll be inconvenient for debugging as the UUID has no meaning. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034448#comment-14034448 ] Vinod Kumar Vavilapalli commented on YARN-2052: --- bq. BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. Ideally this should be container-allocation timestamp and we should depend on that instead of comparing container-IDs. IAC, let's fix it separately.. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
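Vinod's point above, that the preemption policy should depend on container-allocation timestamps rather than comparing container IDs, can be sketched roughly like this (illustrative names only; this is not the actual CheckpointAMPreemptionPolicy code):

```java
import java.util.Comparator;

// Sketch: once ids carry restart information they stop being a reliable
// allocation order, so order containers for preemption by an explicit
// allocation timestamp instead of by id.
public class AllocationOrder {
    public static class ContainerInfo {
        final long containerId;
        final long allocatedAtMillis;
        public ContainerInfo(long containerId, long allocatedAtMillis) {
            this.containerId = containerId;
            this.allocatedAtMillis = allocatedAtMillis;
        }
    }

    // Newest allocations first, a common choice when picking preemption victims.
    public static final Comparator<ContainerInfo> NEWEST_FIRST =
        Comparator.comparingLong((ContainerInfo c) -> c.allocatedAtMillis).reversed();
}
```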
[jira] [Updated] (YARN-2173) Enabling HTTPS for the reader REST APIs of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2173: -- Summary: Enabling HTTPS for the reader REST APIs of TimelineServer (was: Enabling HTTPS for the reader REST APIs) > Enabling HTTPS for the reader REST APIs of TimelineServer > - > > Key: YARN-2173 > URL: https://issues.apache.org/jira/browse/YARN-2173 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API of TimelineServer
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2174: -- Summary: Enabling HTTPs for the writer REST API of TimelineServer (was: Enabling HTTPs for the writer REST API) > Enabling HTTPs for the writer REST API of TimelineServer > > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Since we'd like to allow the application to put the timeline data at the > client, the AM and even the containers, we need to provide the way to > distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1373) Transition RMApp and RMAppAttempt state to RUNNING after restart for recovered running apps
[ https://issues.apache.org/jira/browse/YARN-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1373. --- Resolution: Duplicate Assignee: Omkar Vinit Joshi (was: Anubhav Dhoot) bq. Currently the RM moves recovered app attempts to a terminal recovered state and starts a new attempt. This is no longer an issue - hasn't been since YARN-1210. Even in non-work-preserving RM restart, the RM explicitly never kills the AMs, it's the nodes that kill all containers - this was done in YARN-1210. The state-machines are already set up correctly and so no changes are needed here. Closing as duplicate of YARN-1210. > Transition RMApp and RMAppAttempt state to RUNNING after restart for > recovered running apps > --- > > Key: YARN-1373 > URL: https://issues.apache.org/jira/browse/YARN-1373 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Omkar Vinit Joshi > > Currently the RM moves recovered app attempts to a terminal recovered > state and starts a new attempt. Instead, it will have to transition the last > attempt to a running state such that it can proceed as normal once the > running attempt has resynced with the ApplicationMasterService (YARN-1365 and > YARN-1366). If the RM had started the application container before dying then > the AM would be up and trying to contact the RM. The RM may have died > before launching the container. For this case, the RM should wait for the AM > liveliness period and issue a kill for the stored master container. > It should transition this attempt to some RECOVER_ERROR state and proceed to > start a new attempt. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034405#comment-14034405 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it and will have it ready soon. > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2176) CapacityScheduler loops over all running applications rather than actively requesting apps
Jason Lowe created YARN-2176: Summary: CapacityScheduler loops over all running applications rather than actively requesting apps Key: YARN-2176 URL: https://issues.apache.org/jira/browse/YARN-2176 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Jason Lowe The capacity scheduler performance is primarily dominated by LeafQueue.assignContainers, and that currently loops over all applications that are running in the queue. It would be more efficient if we looped over just the applications that are actively asking for resources rather than all applications, as there could be thousands of applications running but only a few hundred that are currently asking for resources. -- This message was sent by Atlassian JIRA (v6.2#6252)
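The proposed improvement can be sketched as follows (illustrative, not actual CapacityScheduler code): keep a separate collection of applications with outstanding asks, and have assignContainers iterate only that set instead of every running application in the queue.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: track only applications with non-empty resource requests so the
// scheduling loop touches a few hundred "active" apps rather than thousands
// of running ones. Insertion order is preserved for rough fairness.
public class ActiveAppTracker {
    private final Set<String> activeApps = new LinkedHashSet<>();

    // Called when an app's outstanding ask becomes non-empty / empty.
    public void markActive(String appId)   { activeApps.add(appId); }
    public void markInactive(String appId) { activeApps.remove(appId); }

    public Iterable<String> appsToSchedule() { return activeApps; }
    public int size() { return activeApps.size(); }
}
```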
[jira] [Assigned] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot reassigned YARN-2175: --- Assignee: Anubhav Dhoot > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot >Assignee: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization, for example, could take a long time, > and there is no way to kill a task if it's stuck in these states. These may > have nothing to do with the task itself and could be an issue within the > platform. > Ideally there should be configurable limits for the various states within the > NodeManager. The RM does not care about most of these; > it's only between the AM and the NM. We can start by making these global > configurable defaults, and in the future we can make it fancier by letting the AM > override them in the start container request. > This jira will be used to limit localization time, and we can open others if we > feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
[ https://issues.apache.org/jira/browse/YARN-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2175: Affects Version/s: 2.4.0 > Container localization has no timeouts and tasks can be stuck there for a > long time > --- > > Key: YARN-2175 > URL: https://issues.apache.org/jira/browse/YARN-2175 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Anubhav Dhoot > > There are no timeouts that can be used to limit the time taken by various > container startup operations. Localization, for example, could take a long time, > and there is no way to kill a task if it's stuck in these states. These may > have nothing to do with the task itself and could be an issue within the > platform. > Ideally there should be configurable limits for the various states within the > NodeManager. The RM does not care about most of these; > it's only between the AM and the NM. We can start by making these global > configurable defaults, and in the future we can make it fancier by letting the AM > override them in the start container request. > This jira will be used to limit localization time, and we can open others if we > feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2175) Container localization has no timeouts and tasks can be stuck there for a long time
Anubhav Dhoot created YARN-2175: --- Summary: Container localization has no timeouts and tasks can be stuck there for a long time Key: YARN-2175 URL: https://issues.apache.org/jira/browse/YARN-2175 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot There are no timeouts that can be used to limit the time taken by various container startup operations. Localization, for example, could take a long time, and there is no way to kill a task if it's stuck in these states. These may have nothing to do with the task itself and could be an issue within the platform. Ideally there should be configurable limits for the various states within the NodeManager. The RM does not care about most of these; it's only between the AM and the NM. We can start by making these global configurable defaults, and in the future we can make it fancier by letting the AM override them in the start container request. This jira will be used to limit localization time, and we can open others if we feel we need to limit other operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
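The configurable per-state limit described above might look roughly like this; the names and the millisecond units are illustrative, not actual YARN configuration:

```java
// Sketch: each container state records when it was entered, and a monitor
// periodically compares the elapsed time against a configurable limit to
// decide whether a stuck operation (e.g. localization) should be killed.
public class StateDeadline {
    private final long limitMillis;  // hypothetical configurable default
    private final long enteredAtMillis;

    public StateDeadline(long limitMillis, long enteredAtMillis) {
        this.limitMillis = limitMillis;
        this.enteredAtMillis = enteredAtMillis;
    }

    // true once the container has been stuck in this state past the limit
    public boolean expired(long nowMillis) {
        return nowMillis - enteredAtMillis > limitMillis;
    }
}
```

A monitor thread in the NodeManager could then sweep live containers and fire a kill event for any whose current state has expired.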
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171v2.patch The point of the unit test was to catch regressions at a high level. If anyone changes the code such that calling allocate() will grab the scheduler lock then the test will fail, whether that's a regression in this particular method or some new method that's added that ApplicationMasterService or CapacityScheduler itself calls and grabs the lock. I added a separate unit test to exercise the getNumClusterNodes method. The AHS unit test failure seems unrelated, and it passes for me locally even with this change. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch, YARN-2171v2.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034359#comment-14034359 ] Anubhav Dhoot commented on YARN-1367: - I am still working on it. Will have an update soon > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034268#comment-14034268 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- That looks fine. I was suggesting we create one more document at hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/. You can create that doc and add it to the patch together with addressing my review in the last comment. Tx again for working on this, it's almost there.. > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > > h1. Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrastructure to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 and the > alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overrides some methods to the effect of: > * changes the DCE-created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of the winutils task instead of 'create' > * runs the localization as a standalone process instead of an in-process Java > method call. This in turn relies on the winutils createAsUser feature to run > the localization as the job user. 
> > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does not delegate the creation of the user cache directories to the > native implementation. > * it does not require special handling to be able to delete user files > The approach to the WCE came from practical trial and error. I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop environment container executions. The job > container itself is already dealing with this via a so-called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved, and I used the same > 'classpath jar' approach. > h2. Deployment Requirements > To use the WCE one needs to set > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group that the nodemanager service principal is a > member of (equivalent of the LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For the WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these requirements in > its specification. This is in addition to the SE_TCB privilege mentioned in > YARN-1063, but this requirement automatically implies that the SE_TCB > privilege is held by the nodemanager. For the Linux speakers in the audience, > the requirement is basically to run the NM as root. > h2. 
Dedicated high privilege Service > Due to the high privilege required by the WCE we had discussed the need to > isolate the high privilege operations into a separate process, an 'executor' > service that is solely responsible for starting the containers (including the > localizer). The NM would have to authenticate, authorize and communicate with > this service via an IPC mechanism and use this service to launch the > containers. I still believe we'll end up deploying such a service, but the > effort to onboard such a new platform-specific service onto the project is > not trivial. -- This message was sent by Atlassian JIRA (v6.2#6252)
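For reference, the two settings named in the Deployment Requirements section above would appear in yarn-site.xml roughly as follows; the group value is a placeholder for a real Windows security group, not a suggested name:

```xml
<!-- Sketch of the WCE settings described in the issue text.
     "yarn-executors" is a placeholder for an actual Windows security group
     that the nodemanager service principal is a member of. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <value>yarn-executors</value>
</property>
```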
[jira] [Commented] (YARN-1365) ApplicationMasterService to allow Register and Unregister of an app that was running before restart
[ https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034201#comment-14034201 ] Jian He commented on YARN-1365: --- bq. allocateresponse would also use exceptions instead of AM commands. right, please open a new jira for that. For my other comment "My point was we can do the same for both addApplication and addApplicationAttempt to not send dup events", I can open a new jira for this too. We can keep this patch minimal. > ApplicationMasterService to allow Register and Unregister of an app that was > running before restart > --- > > Key: YARN-1365 > URL: https://issues.apache.org/jira/browse/YARN-1365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1365.001.patch, YARN-1365.002.patch, > YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, > YARN-1365.005.patch, YARN-1365.initial.patch > > > For an application that was running before restart, the > ApplicationMasterService currently throws an exception when the app tries to > make the initial register or final unregister call. These should succeed and > the RMApp state machine should transition to completed like normal. > Unregistration should succeed for an app that the RM considers complete since > the RM may have died after saving completion in the store but before > notifying the AM that the AM is free to exit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034186#comment-14034186 ] Jian He commented on YARN-1367: --- [~adhoot], mind updating the patch please? I'm happy to work on it if you are busy. > After restart NM should resync with the RM without killing containers > - > > Key: YARN-1367 > URL: https://issues.apache.org/jira/browse/YARN-1367 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1367.prototype.patch > > > After RM restart, the RM sends a resync response to NMs that heartbeat to it. > Upon receiving the resync response, the NM kills all containers and > re-registers with the RM. The NM should be changed to not kill the container > and instead inform the RM about all currently running containers including > their allocations etc. After the re-register, the NM should send all pending > container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034179#comment-14034179 ] Remus Rusanu commented on YARN-1972: Thanks for the update Vinod. I have updated the item description to act as documentation. Do you think anything more is needed? > Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1972) Implement secure Windows Container Executor
[ https://issues.apache.org/jira/browse/YARN-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034160#comment-14034160 ] Vinod Kumar Vavilapalli commented on YARN-1972: --- bq. All in all a very high privilege required for NM. We are considering a future iteration in which we extract the privileged operations into a dedicated NT service (=daemon) and bestow the high privileges only to this service. Thanks. Let's document this in a Windows-specific docs page. bq. You are launching so many commands for every container - to chown files, to copy files etc. bq. We'll measure. [..] I don't think that moving the localization into native code would result in much benefit over a proper Java implementation. I'd file an investigation ticket to track this. bq. DCE and WCE no longer create user file cache, this is done solely by the localizer initDirs. DCE Test modified to reflect this. DCE.createUserCacheDirs renamed to createUserAppCacheDirs accordingly The division of responsibility between launching multiple commands before starting the localizer and the stuff that happens inside the localizer: Unfortunately, this still isn't ideal. Having userCache created by the ContainerExecutor but not file-cache is asymmetric and confusing. I propose that we split this refactoring into a separate JIRA and stick to your original code. Apologies for the back-and-forth on this one. bq. There is more feedback to address (DRY between LCE and WCE localization launch, proper place for localization classpath jar). So, you will work on them here itself, right? Looks fine otherwise, except for the above comments and a request for some basic documentation. 
> Implement secure Windows Container Executor > --- > > Key: YARN-1972 > URL: https://issues.apache.org/jira/browse/YARN-1972 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-1972.1.patch, YARN-1972.2.patch > > > h1. Windows Secure Container Executor (WCE) > YARN-1063 adds the necessary infrasturcture to launch a process as a domain > user as a solution for the problem of having a security boundary between > processes executed in YARN containers and the Hadoop services. The WCE is a > container executor that leverages the winutils capabilities introduced in > YARN-1063 and launches containers as an OS process running as the job > submitter user. A description of the S4U infrastructure used by YARN-1063 > alternatives considered can be read on that JIRA. > The WCE is based on the DefaultContainerExecutor. It relies on the DCE to > drive the flow of execution, but it overwrrides some emthods to the effect of: > * change the DCE created user cache directories to be owned by the job user > and by the nodemanager group. > * changes the actual container run command to use the 'createAsUser' command > of winutils task instead of 'create' > * runs the localization as standalone process instead of an in-process Java > method call. This in turn relies on the winutil createAsUser feature to run > the localization as the job user. > > When compared to LinuxContainerExecutor (LCE), the WCE has some minor > differences: > * it does no delegate the creation of the user cache directories to the > native implementation. > * it does no require special handling to be able to delete user files > The approach on the WCE came from a practical trial-and-error approach. 
I had > to iron out some issues around the Windows script shell limitations (command > line length) to get it to work, the biggest issue being the huge CLASSPATH > that is commonplace in Hadoop container executions. The job > container itself already deals with this via a so-called 'classpath > jar', see HADOOP-8899 and YARN-316 for details. For the WCE localizer launch > as a separate container the same issue had to be resolved, and I used the same > 'classpath jar' approach. > h2. Deployment Requirements > To use the WCE one needs to set > `yarn.nodemanager.container-executor.class` to > `org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor` > and set `yarn.nodemanager.windows-secure-container-executor.group` to a > Windows security group name that the nodemanager service principal is a > member of (the equivalent of the LCE > `yarn.nodemanager.linux-container-executor.group`). Unlike the LCE, the WCE > does not require any configuration outside of Hadoop's own yarn-site.xml. > For the WCE to work the nodemanager must run as a service principal that is a > member of the local Administrators group or LocalSystem. This is derived from > the need to invoke the LoadUserProfile API, which mentions these re
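The deployment requirements above boil down to two yarn-site.xml properties. A minimal sketch, using the property names and executor class quoted in the description; the group name `hadoop-nodes` is a placeholder for whatever Windows security group the nodemanager service principal actually belongs to:

```xml
<!-- Minimal yarn-site.xml fragment for the WCE, per the requirements
     above; "hadoop-nodes" is a placeholder group name. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.WindowsSecureContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.windows-secure-container-executor.group</name>
  <value>hadoop-nodes</value>
</property>
```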
[jira] [Commented] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034154#comment-14034154 ] Hadoop QA commented on YARN-2083: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650834/YARN-2083-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFSQueue {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4015//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4015//console This message is automatically generated. 
> In fair scheduler, Queue should not been assigned more containers when its > usedResource had reach the maxResource limit > --- > > Key: YARN-2083 > URL: https://issues.apache.org/jira/browse/YARN-2083 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Yi Tian > Labels: assignContainer, fair, scheduler > Fix For: 2.4.1 > > Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch > > > In fair scheduler, FSParentQueue and FSLeafQueue do an > assignContainerPreCheck to guarantee this queue is not over its limit. > But the fitsIn function in Resource.java does not return false when > usedResource equals maxResource. > I think we should create a new function "fitsInWithoutEqual" instead of > "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
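The fitsIn vs. fitsInWithoutEqual distinction this issue describes can be sketched with a simplified resource model; the plain memory/vcores class below is a stand-in, not Hadoop's actual Resource/Resources utilities:

```java
// Sketch of the strict-fit check proposed in YARN-2083, using a
// simplified Resource model (memory + vcores) rather than the real
// Hadoop classes.
public class FitsInSketch {
    static final class Resource {
        final long memory; final int vcores;
        Resource(long memory, int vcores) { this.memory = memory; this.vcores = vcores; }
    }

    // Existing semantics: equal usage still "fits", so a queue whose
    // usedResource has reached maxResource still passes the pre-check.
    static boolean fitsIn(Resource smaller, Resource bigger) {
        return smaller.memory <= bigger.memory && smaller.vcores <= bigger.vcores;
    }

    // Proposed strict variant: a queue at its limit no longer passes.
    static boolean fitsInWithoutEqual(Resource smaller, Resource bigger) {
        return smaller.memory < bigger.memory && smaller.vcores < bigger.vcores;
    }

    public static void main(String[] args) {
        Resource used = new Resource(8192, 8);
        Resource max  = new Resource(8192, 8);
        System.out.println(fitsIn(used, max));             // true: old check still admits containers
        System.out.println(fitsInWithoutEqual(used, max)); // false: queue is at its limit
    }
}
```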
[jira] [Updated] (YARN-365) Each NM heartbeat should not generate an event for the Scheduler
[ https://issues.apache.org/jira/browse/YARN-365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-365: Attachment: YARN-365.branch-0.23.patch Patch for branch-0.23. RM unit tests pass, and I manually tested it as well on a single-node cluster forcing the scheduler to run slower than the heartbeat interval. > Each NM heartbeat should not generate an event for the Scheduler > > > Key: YARN-365 > URL: https://issues.apache.org/jira/browse/YARN-365 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Affects Versions: 0.23.5 >Reporter: Siddharth Seth >Assignee: Xuan Gong > Fix For: 2.1.0-beta > > Attachments: Prototype2.txt, Prototype3.txt, YARN-365.1.patch, > YARN-365.10.patch, YARN-365.2.patch, YARN-365.3.patch, YARN-365.4.patch, > YARN-365.5.patch, YARN-365.6.patch, YARN-365.7.patch, YARN-365.8.patch, > YARN-365.9.patch, YARN-365.branch-0.23.patch > > > Follow up from YARN-275 > https://issues.apache.org/jira/secure/attachment/12567075/Prototype.txt -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034060#comment-14034060 ] Vinod Kumar Vavilapalli commented on YARN-2171: --- The code changes look fine enough to me. The test is not so useful beyond validating this ticket, but that's okay. I see that we don't have any test validating the number of nodes itself explicitly; shall we add that here? > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-868: - Target Version/s: 2.5.0 (was: 2.1.0-beta) > YarnClient should set the service address in tokens returned by > getRMDelegationToken() > -- > > Key: YARN-868 > URL: https://issues.apache.org/jira/browse/YARN-868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Hitesh Shah > > Either the client should set this information into the token or the client > layer should expose an api that returns the service address. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2083) In fair scheduler, Queue should not been assigned more containers when its usedResource had reach the maxResource limit
[ https://issues.apache.org/jira/browse/YARN-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated YARN-2083: -- Attachment: YARN-2083-2.patch Moved test code to TestFSQueue.java > In fair scheduler, Queue should not been assigned more containers when its > usedResource had reach the maxResource limit > --- > > Key: YARN-2083 > URL: https://issues.apache.org/jira/browse/YARN-2083 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.3.0 >Reporter: Yi Tian > Labels: assignContainer, fair, scheduler > Fix For: 2.4.1 > > Attachments: YARN-2083-1.patch, YARN-2083-2.patch, YARN-2083.patch > > > In fair scheduler, FSParentQueue and FSLeafQueue do an > assignContainerPreCheck to guarantee this queue is not over its limit. > But the fitsIn function in Resource.java does not return false when > usedResource equals maxResource. > I think we should create a new function "fitsInWithoutEqual" instead of > "fitsIn" in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034034#comment-14034034 ] Hadoop QA commented on YARN-2171: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650819/YARN-2171.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4014//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4014//console This message is automatically generated. 
> AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Description: We need to differentiate the access controls of reading and writing operations, and we need to think about cross-entity access control. For example, if we are executing a workflow of MR jobs, which writes the timeline data of this workflow, we don't want other users to pollute the timeline data of the workflow by putting something under it. (was: Like ApplicationACLsManager, we should also allow configured user/group to access the timeline data.) > More generalized timeline ACLs > -- > > Key: YARN-2102 > URL: https://issues.apache.org/jira/browse/YARN-2102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > We need to differentiate the access controls of reading and writing > operations, and we need to think about cross-entity access control. For > example, if we are executing a workflow of MR jobs, which writes the > timeline data of this workflow, we don't want other users to pollute the > timeline data of the workflow by putting something under it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2102) More generalized timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2102: -- Summary: More generalized timeline ACLs (was: Extend access control for configured user/group list) > More generalized timeline ACLs > -- > > Key: YARN-2102 > URL: https://issues.apache.org/jira/browse/YARN-2102 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Like ApplicationACLsManager, we should also allow configured user/group to > access the timeline data. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034021#comment-14034021 ] Junping Du commented on YARN-1341: -- [~jlowe], Thanks for the patch here. I am currently reviewing it, and it looks like some of the code, such as LeveldbIterator and NMStateStoreService, has already been committed in other patches. Would you resync the patch here against trunk? Thanks! > Recover NMTokens upon nodemanager restart > - > > Key: YARN-1341 > URL: https://issues.apache.org/jira/browse/YARN-1341 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, > YARN-1341v4-and-YARN-1987.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2174: -- Description: Since we'd like to allow the application to put the timeline data at the client, the AM and even the containers, we need to provide a way to distribute the keystore. > Enabling HTTPs for the writer REST API > -- > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Since we'd like to allow the application to put the timeline data at the > client, the AM and even the containers, we need to provide a way to > distribute the keystore. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2174) Enabling HTTPs for the writer REST API
[ https://issues.apache.org/jira/browse/YARN-2174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2174: - Assignee: Zhijie Shen > Enabling HTTPs for the writer REST API > -- > > Key: YARN-2174 > URL: https://issues.apache.org/jira/browse/YARN-2174 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Zhijie Shen > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034019#comment-14034019 ] Ashwin Shankar commented on YARN-2162: -- [~maysamyabandeh], yes that was the intention. Changed title and description to make it clear. > Fair Scheduler :ability to optionally configure minResources and maxResources > in terms of percentage > > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2174) Enabling HTTPs for the writer REST API
Zhijie Shen created YARN-2174: - Summary: Enabling HTTPs for the writer REST API Key: YARN-2174 URL: https://issues.apache.org/jira/browse/YARN-2174 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Summary: Fair Scheduler :ability to optionally configure minResources and maxResources in terms of percentage (was: Fair Scheduler :ability to configure minResources and maxResources in terms of percentage) > Fair Scheduler :ability to optionally configure minResources and maxResources > in terms of percentage > > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2173) Enabling HTTPS for the reader REST APIs
Zhijie Shen created YARN-2173: - Summary: Enabling HTTPS for the reader REST APIs Key: YARN-2173 URL: https://issues.apache.org/jira/browse/YARN-2173 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2162) Fair Scheduler :ability to configure minResources and maxResources in terms of percentage
[ https://issues.apache.org/jira/browse/YARN-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2162: - Description: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can optionally configure these properties in terms of percentage of cluster capacity. was: minResources and maxResources in fair scheduler configs are expressed in terms of absolute numbers X mb, Y vcores. As a result, when we expand or shrink our hadoop cluster, we need to recalculate and change minResources/maxResources accordingly, which is pretty inconvenient. We can circumvent this problem if we can (optionally) configure these properties in terms of percentage of cluster capacity. > Fair Scheduler :ability to configure minResources and maxResources in terms > of percentage > - > > Key: YARN-2162 > URL: https://issues.apache.org/jira/browse/YARN-2162 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Ashwin Shankar > Labels: scheduler > > minResources and maxResources in fair scheduler configs are expressed in > terms of absolute numbers X mb, Y vcores. > As a result, when we expand or shrink our hadoop cluster, we need to > recalculate and change minResources/maxResources accordingly, which is pretty > inconvenient. > We can circumvent this problem if we can optionally configure these > properties in terms of percentage of cluster capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved YARN-409. - Resolution: Duplicate > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033950#comment-14033950 ] Sandy Ryza commented on YARN-409: - definitely. will close this because there seems to be more activity there. > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2171: - Attachment: YARN-2171.patch Patch to use AtomicInteger for the number of nodes so we can avoid grabbing the lock to access the value. I also added a unit test to verify allocate doesn't try to grab the capacity scheduler lock. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-2171.patch > > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
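The locking fix described in this patch, an atomically maintained node count so allocate() never needs the scheduler lock, can be sketched as follows; the class and method names below are illustrative, not the actual CapacityScheduler code:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of tracking the cluster node count in an AtomicInteger so that
// read-only callers (the AM allocate() path) never take the scheduler
// lock just to read the value.
public class NodeCountSketch {
    private final AtomicInteger numNodes = new AtomicInteger();

    // Called on the (already synchronized) node add/remove paths.
    public void addNode()    { numNodes.incrementAndGet(); }
    public void removeNode() { numNodes.decrementAndGet(); }

    // Hot path: a plain atomic read, no lock acquisition.
    public int getNumClusterNodes() { return numNodes.get(); }

    public static void main(String[] args) {
        NodeCountSketch s = new NodeCountSketch();
        s.addNode(); s.addNode(); s.removeNode();
        System.out.println(s.getNumClusterNodes()); // 1
    }
}
```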
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033909#comment-14033909 ] Hudson commented on YARN-1339: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033905#comment-14033905 ] Hudson commented on YARN-2159: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
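The suggested rephrasing can be illustrated with a plain string builder in place of the scheduler's actual log call; all values below are made up:

```java
// Illustrative rendering of the suggested log phrasing ("... available
// after allocation"), with plain parameters standing in for the
// scheduler's fields.
public class AllocationLogSketch {
    static String assignedMessage(String containerId, String capacity, String host,
                                  int numContainers, String used, String available) {
        return "Assigned container " + containerId + " of capacity " + capacity
            + " on host " + host + ", which has " + numContainers + " containers, "
            + used + " used and " + available + " available after allocation";
    }

    public static void main(String[] args) {
        System.out.println(assignedMessage("container_x", "<memory:1024, vCores:1>",
            "machine.host.domain.com:8041", 18,
            "<memory:18432, vCores:18>", "<memory:0, vCores:0>"));
    }
}
```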
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033911#comment-14033911 ] Hudson commented on YARN-1885: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoo
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033906#comment-14033906 ] Hudson commented on YARN-2167: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1804 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1804/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we use a > LeveldbIterator to read the NM's localization state, but it is not closed in a > finally block. We should close this connection to the DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
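The fix described above is the classic close-in-finally pattern: an iterator backed by a native resource must be released on every exit path, including when the scan throws mid-way. A minimal sketch, where StubIterator is a hypothetical stand-in for LeveldbIterator (not the actual Hadoop class):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FinallyCloseSketch {
    // Hypothetical stand-in for a leveldb-backed iterator holding a native handle.
    static class StubIterator implements Closeable, Iterator<String> {
        private final Iterator<String> delegate;
        boolean closed = false;
        StubIterator(List<String> data) { this.delegate = data.iterator(); }
        public boolean hasNext() { return delegate.hasNext(); }
        public String next() { return delegate.next(); }
        public void close() { closed = true; }
    }

    // Returns true if the iterator was closed even though the scan failed partway.
    static boolean scanClosesIterator(List<String> keys) {
        StubIterator iter = new StubIterator(keys);
        try {
            while (iter.hasNext()) {
                String key = iter.next();
                if (key.startsWith("bad")) {
                    // Simulates a corrupt entry aborting the scan.
                    throw new IOException("corrupt entry: " + key);
                }
            }
        } catch (IOException e) {
            // Swallowed for the sketch; real code would rethrow after cleanup.
        } finally {
            iter.close();  // runs on every exit path, so the handle never leaks
        }
        return iter.closed;
    }

    public static void main(String[] args) {
        System.out.println(scanClosesIterator(Arrays.asList("ok1", "bad2"))); // prints "true"
    }
}
```

Without the finally block, the IOException would skip the close() call and leak the underlying DB handle, which is exactly what the patch guards against.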
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. > My team has completed its implementation and our tests showed it is working > in a rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it works in a rather solid way. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. My team has completed its implementation and our tests showed it is working in a rather solid way. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. > My team has completed its implementation and our tests showed it works in a > rather solid way. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop application. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2172) Suspend/Resume Hadoop Jobs
[ https://issues.apache.org/jira/browse/YARN-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Chen updated YARN-2172: --- Description: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop YARN. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. was: In a multi-application cluster environment, jobs running inside Hadoop YARN may be of lower-priority than jobs running outside Hadoop YARN like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. 
> Suspend/Resume Hadoop Jobs > -- > > Key: YARN-2172 > URL: https://issues.apache.org/jira/browse/YARN-2172 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager, webapp >Affects Versions: 2.2.0 > Environment: CentOS 6.5, Hadoop 2.2.0 >Reporter: Richard Chen > Labels: hadoop, jobs, resume, suspend > Fix For: 2.2.0 > > Original Estimate: 336h > Remaining Estimate: 336h > > In a multi-application cluster environment, jobs running inside Hadoop YARN > may be of lower-priority than jobs running outside Hadoop YARN like HBase. To > give way to other higher-priority jobs inside Hadoop, a user or some > cluster-level resource scheduling service should be able to suspend and/or > resume some particular jobs within Hadoop YARN. > When target jobs inside Hadoop are suspended, those already allocated and > running task containers will continue to run until their completion or active > preemption by other ways. But no more new containers would be allocated to > the target jobs. In contrast, when suspended jobs are put into resume mode, > they will continue to run from the previous job progress and have new task > containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2172) Suspend/Resume Hadoop Jobs
Richard Chen created YARN-2172: -- Summary: Suspend/Resume Hadoop Jobs Key: YARN-2172 URL: https://issues.apache.org/jira/browse/YARN-2172 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager, webapp Affects Versions: 2.2.0 Environment: CentOS 6.5, Hadoop 2.2.0 Reporter: Richard Chen Fix For: 2.2.0 In a multi-application cluster environment, jobs running inside Hadoop application may be of lower-priority than jobs running inside other applications like HBase. To give way to other higher-priority jobs inside Hadoop, a user or some cluster-level resource scheduling service should be able to suspend and/or resume some particular jobs within Hadoop application. When target jobs inside Hadoop are suspended, those already allocated and running task containers will continue to run until their completion or active preemption by other ways. But no more new containers would be allocated to the target jobs. In contrast, when suspended jobs are put into resume mode, they will continue to run from the previous job progress and have new task containers allocated to complete the rest of the jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-409) Allow apps to be killed via the RM REST API
[ https://issues.apache.org/jira/browse/YARN-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033874#comment-14033874 ] Romain Rigaux commented on YARN-409: dup of https://issues.apache.org/jira/browse/YARN-1702? > Allow apps to be killed via the RM REST API > --- > > Key: YARN-409 > URL: https://issues.apache.org/jira/browse/YARN-409 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > The RM REST API currently allows getting information about running > applications. Adding the capability to kill applications would allow systems > like Hue to perform their functions over HTTP. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
[ https://issues.apache.org/jira/browse/YARN-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033864#comment-14033864 ] Jason Lowe commented on YARN-2171: -- When the CapacityScheduler scheduler thread is running full-time due to a constant stream of events (e.g.: large number of running applications with a large number of cluster nodes) then the CapacityScheduler lock is held by that scheduler loop most of the time. As AMs heartbeat into the RM to try to get their resources, the capacity scheduler code goes out of its way to try to avoid having the AMs grab the scheduler lock. Unfortunately this call was missed, and it takes the lock just to read a single integer value. Therefore the AMs end up piling up on the scheduler lock, filling all of the IPC handlers of the ApplicationMasterService, and the rest back up on the call queue. Once the scheduler releases the lock it will quickly try to grab it again, so only a few AMs end up getting through the "gate" and the IPC handlers fill again with the next batch of AMs blocking on the scheduler lock. This causes the average RPC response times to skyrocket for AMs. AMs experience large delays getting their allocations, which in turn leads to lower cluster utilization and increased application runtimes. > AMs block on the CapacityScheduler lock during allocate() > - > > Key: YARN-2171 > URL: https://issues.apache.org/jira/browse/YARN-2171 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 0.23.10, 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Critical > > When AMs heartbeat into the RM via the allocate() call they are blocking on > the CapacityScheduler lock when trying to get the number of nodes in the > cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
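The contention pattern described above can be sketched as follows: if a frequently polled value (here, the cluster node count) is only readable under the same lock the scheduler loop holds, every AM heartbeat queues behind the scheduler. Publishing the value as a volatile field lets readers skip the lock entirely. This is a minimal sketch of the general technique, with illustrative names, not the actual CapacityScheduler fields or the committed fix:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockFreeReadSketch {
    private final ReentrantLock schedulerLock = new ReentrantLock();
    // volatile guarantees readers see the latest write without locking
    private volatile int numClusterNodes = 0;

    // Mutations still happen under the scheduler lock, as part of the
    // larger scheduling state update.
    public void addNode() {
        schedulerLock.lock();
        try {
            numClusterNodes++;
        } finally {
            schedulerLock.unlock();
        }
    }

    // Heartbeat path: a plain volatile read, no lock acquisition,
    // so AMs never block behind the scheduler loop for this value.
    public int getNumClusterNodes() {
        return numClusterNodes;
    }

    public static void main(String[] args) {
        LockFreeReadSketch s = new LockFreeReadSketch();
        s.addNode();
        s.addNode();
        System.out.println(s.getNumClusterNodes()); // prints "2"
    }
}
```

The trade-off is that the read may be momentarily stale relative to in-flight scheduling, which is acceptable for a monotonic count like cluster size.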
[jira] [Created] (YARN-2171) AMs block on the CapacityScheduler lock during allocate()
Jason Lowe created YARN-2171: Summary: AMs block on the CapacityScheduler lock during allocate() Key: YARN-2171 URL: https://issues.apache.org/jira/browse/YARN-2171 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.0, 0.23.10 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical When AMs heartbeat into the RM via the allocate() call they are blocking on the CapacityScheduler lock when trying to get the number of nodes in the cluster via getNumClusterNodes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033831#comment-14033831 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/had
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033825#comment-14033825 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
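The suggested rephrasing above can be sketched as a message builder: reporting the node's state "after allocation" makes an exhausted resource ("vCores:0 available") read naturally. The method name and arguments here are illustrative, not the actual SchedulerNode API:

```java
public class AllocationLogSketch {
    // Builds the log line in the phrasing suggested in the JIRA comment.
    static String buildAllocationLog(String containerId, String host,
            int numContainers, String used, String available) {
        return "Assigned container " + containerId + " on host " + host
            + ", which has " + numContainers + " containers, " + used
            + " used and " + available + " available after allocation";
    }

    public static void main(String[] args) {
        // Hypothetical values for illustration only.
        System.out.println(buildAllocationLog("container_1", "host.example.com:8041",
            18, "<memory:18432, vCores:18>", "<memory:0, vCores:0>"));
    }
}
```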
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033826#comment-14033826 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we have > LeveldbIterator to read NM's localization state but it is not get closed in > finally block. We should close this connection to DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033829#comment-14033829 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1777 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1777/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
[ https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jun Gong updated YARN-2170: --- Attachment: YARN-2170.patch > Fix components' version information in the web page 'About the Cluster' > --- > > Key: YARN-2170 > URL: https://issues.apache.org/jira/browse/YARN-2170 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jun Gong >Priority: Minor > Attachments: YARN-2170.patch > > > In the web page 'About the Cluster', a YARN component's build version (e.g. > ResourceManager) is currently shown as the Hadoop version. This is caused by > mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2170) Fix components' version information in the web page 'About the Cluster'
Jun Gong created YARN-2170: -- Summary: Fix components' version information in the web page 'About the Cluster' Key: YARN-2170 URL: https://issues.apache.org/jira/browse/YARN-2170 Project: Hadoop YARN Issue Type: Bug Reporter: Jun Gong Priority: Minor In the web page 'About the Cluster', a YARN component's build version (e.g. ResourceManager) is currently shown as the Hadoop version. This is caused by mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java. -- This message was sent by Atlassian JIRA (v6.2#6252)
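The bug class described above can be illustrated in miniature: a static getVersion() pinned to the common build versus an instance _getVersion() that resolves the component's own properties. Calling the static one from component code makes every component report the common version. This sketch uses invented version strings and a simplified VersionInfo; it is not the actual Hadoop VersionInfo implementation:

```java
public class VersionInfoSketch {
    static class VersionInfo {
        private final String component;
        VersionInfo(String component) { this.component = component; }

        // Instance lookup: resolves against this component's own info
        // (hard-coded here for illustration).
        protected String _getVersion() {
            return component.equals("yarn") ? "2.4.0-yarn-build" : "2.4.0";
        }

        // Static lookup: always resolves against the common Hadoop build,
        // regardless of which component page is being rendered.
        static String getVersion() {
            return new VersionInfo("common")._getVersion();
        }
    }

    public static void main(String[] args) {
        VersionInfo yarn = new VersionInfo("yarn");
        System.out.println(VersionInfo.getVersion()); // common version: wrong on a component page
        System.out.println(yarn._getVersion());       // component's own version: the intended call
    }
}
```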
[jira] [Updated] (YARN-2169) NMSimulator of sls should catch more Exception
[ https://issues.apache.org/jira/browse/YARN-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Beckham007 updated YARN-2169: - Attachment: YARN-2169.patch > NMSimulator of sls should catch more Exception > -- > > Key: YARN-2169 > URL: https://issues.apache.org/jira/browse/YARN-2169 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Beckham007 > Attachments: YARN-2169.patch > > > In the method middleStep() of NMSimulator, sending a heartbeat may cause an > InterruptedException or other Exception when the load is heavy. If these exceptions > are not handled, the NMSimulator task cannot be added back to the executor queue, > so the NM will be lost. > In my situation, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some > NMs are lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2169) NMSimulator of sls should catch more Exception
Beckham007 created YARN-2169: Summary: NMSimulator of sls should catch more Exception Key: YARN-2169 URL: https://issues.apache.org/jira/browse/YARN-2169 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Beckham007 In the method middleStep() of NMSimulator, sending a heartbeat may cause an InterruptedException or other Exception when the load is heavy. If these exceptions are not handled, the NMSimulator task cannot be added back to the executor queue, so the NM will be lost. In my situation, the pool size is 4000, the NM count is 2000, and the AM count is 1500. Some NMs are lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
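The failure mode described in YARN-2169 matches the documented behavior of ScheduledExecutorService#scheduleAtFixedRate: if any run of the task throws, all subsequent runs are suppressed. A minimal, self-contained sketch of the guard the report asks for (the names here are illustrative, not the actual SLS code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GuardedHeartbeat {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
        AtomicInteger beats = new AtomicInteger();

        // scheduleAtFixedRate cancels all future executions if one run
        // throws. Catching Exception inside the task body keeps the
        // periodic heartbeat alive, which is the fix the report asks
        // for in NMSimulator#middleStep().
        exec.scheduleAtFixedRate(() -> {
            try {
                int n = beats.incrementAndGet();
                if (n == 1) {
                    // simulate a heartbeat failure under heavy load
                    throw new RuntimeException("heartbeat failed");
                }
            } catch (Exception e) {
                // log and swallow so the task is rescheduled
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        Thread.sleep(150);
        exec.shutdownNow();
        // The task survived its first failure and kept running.
        System.out.println(beats.get() > 1);
    }
}
```

Without the try/catch, the first exception would silently kill the heartbeat task and the simulated NM would eventually be marked lost.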
[jira] [Commented] (YARN-1339) Recover DeletionService state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033677#comment-14033677 ] Hudson commented on YARN-1339: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1339. Recover DeletionService state upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603036) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DeletionService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/proto/yarn_server_nodemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDeletionService.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java > Recover DeletionService state upon nodemanager restart > -- > > Key: YARN-1339 > URL: https://issues.apache.org/jira/browse/YARN-1339 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Fix For: 2.5.0 > > Attachments: YARN-1339.patch, YARN-1339v2.patch, > YARN-1339v3-and-YARN-1987.patch, YARN-1339v4.patch, YARN-1339v5.patch, > YARN-1339v6.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1885) RM may not send the app-finished signal after RM restart to some nodes where the application ran before RM restarts
[ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033679#comment-14033679 ] Hudson commented on YARN-1885: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-1885. Fixed a bug that RM may not send application-clean-up signal to NMs where the completed applications previously ran in case of RM restart. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603028) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestResourceTrackerOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/RegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/api/protocolrecords/impl/pb/RegisterNodeManagerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/proto/yarn_server_common_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestProtocolRecords.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/test/java/org/apache/hadoop/yarn/server/api/protocolrecords/TestRegisterNodeManagerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRunningOnNodeEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptEventType.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/event/RMAppAttemptContainerAcquiredEvent.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeStartedEvent.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoo
[jira] [Commented] (YARN-2159) Better logging in SchedulerNode#allocateContainer
[ https://issues.apache.org/jira/browse/YARN-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033673#comment-14033673 ] Hudson commented on YARN-2159: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2159. Better logging in SchedulerNode#allocateContainer. (Ray Chiang via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603003) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java > Better logging in SchedulerNode#allocateContainer > - > > Key: YARN-2159 > URL: https://issues.apache.org/jira/browse/YARN-2159 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: Ray Chiang >Assignee: Ray Chiang >Priority: Trivial > Labels: newbie, supportability > Fix For: 2.5.0 > > Attachments: YARN2159-01.patch > > > This bit of code: > {quote} > LOG.info("Assigned container " + container.getId() + " of capacity " > + container.getResource() + " on host " + rmNode.getNodeAddress() > + ", which currently has " + numContainers + " containers, " > + getUsedResource() + " used and " + getAvailableResource() > + " available"); > {quote} > results in a line like: > {quote} > 2014-05-30 16:17:43,573 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: > Assigned container container_14000_0009_01_00 of capacity > on host machine.host.domain.com:8041, which currently > has 18 containers, used and > available > {quote} > That message is fine in most cases, but looks pretty bad after the last > available allocation, since it says something like "vCores:0 available". 
> Here is one suggested phrasing > - "which has 18 containers, used and > available after allocation" -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2167) LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block
[ https://issues.apache.org/jira/browse/YARN-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033674#comment-14033674 ] Hudson commented on YARN-2167: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #586 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/586/]) YARN-2167. LeveldbIterator should get closed in NMLeveldbStateStoreService#loadLocalizationState() within finally block. Contributed by Junping Du (jlowe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1603039) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java > LeveldbIterator should get closed in > NMLeveldbStateStoreService#loadLocalizationState() within finally block > > > Key: YARN-2167 > URL: https://issues.apache.org/jira/browse/YARN-2167 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Reporter: Junping Du >Assignee: Junping Du > Fix For: 3.0.0, 2.5.0 > > Attachments: YARN-2167.patch > > > In NMLeveldbStateStoreService#loadLocalizationState(), we have > LeveldbIterator to read NM's localization state but it is not get closed in > finally block. We should close this connection to DB as a common practice. -- This message was sent by Atlassian JIRA (v6.2#6252)
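The YARN-2167 fix is the standard close-in-finally idiom for DB cursors. A minimal sketch with a hypothetical stand-in class (FakeIterator is not the real LeveldbIterator API; it only illustrates the pattern):

```java
import java.io.Closeable;
import java.io.IOException;

public class CloseInFinally {
    // Stand-in for LeveldbIterator: any Closeable cursor over DB entries.
    static class FakeIterator implements Closeable {
        boolean closed = false;
        private int pos = 0;
        boolean hasNext() { return pos < 3; }
        int next() { return pos++; }
        @Override public void close() { closed = true; }
    }

    // The pattern from the fix: iterate inside try, and guarantee
    // close() runs even if iteration throws mid-way.
    static int sumEntries(FakeIterator iter) throws IOException {
        try {
            int sum = 0;
            while (iter.hasNext()) {
                sum += iter.next();
            }
            return sum;
        } finally {
            iter.close();
        }
    }

    public static void main(String[] args) throws IOException {
        FakeIterator iter = new FakeIterator();
        int sum = sumEntries(iter);  // 0 + 1 + 2
        System.out.println(sum + " " + iter.closed);
    }
}
```

On Java 7+ the same guarantee can be had with try-with-resources, since LeveldbIterator would only need to implement Closeable.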
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033662#comment-14033662 ] Hadoop QA commented on YARN-2052: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650774/YARN-2052.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4013//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4013//console This message is automatically generated. 
> ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033654#comment-14033654 ] Tsuyoshi OZAWA commented on YARN-2088: -- LGTM (non-binding). > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (sets, lists) are added to proto builders multiple times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033640#comment-14033640 ] Tsuyoshi OZAWA commented on YARN-2052: -- {quote} BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. {quote} It means that we should use {{compareTo}} instead of calculating the value directly. > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2052) ContainerId creation after work preserving restart is broken
[ https://issues.apache.org/jira/browse/YARN-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2052: - Attachment: YARN-2052.3.patch [~jianhe], thank you for the comment. {code} Application itself may possibly use Container.getId to differentiate the containers, two containers allocated by two RMs may have the same id integer, then the application logic will break. will this be fine? {code} Good point. Added docs to {{ContainerId#getId}}. In addition, implemented {{compareTo}} and {{equals}} to distinguish containers. I think this alternative is acceptable. What do you think? BTW, I think we should update CheckpointAMPreemptionPolicy after this JIRA. {code} Collections.sort(listOfCont, new Comparator<Container>() { @Override public int compare(final Container o1, final Container o2) { return o2.getId().getId() - o1.getId().getId(); } }); {code} > ContainerId creation after work preserving restart is broken > > > Key: YARN-2052 > URL: https://issues.apache.org/jira/browse/YARN-2052 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2052.1.patch, YARN-2052.2.patch, YARN-2052.3.patch > > > Container ids are made unique by using the app identifier and appending a > monotonically increasing sequence number to it. Since container creation is a > high churn activity the RM does not store the sequence number per app. So > after restart it does not know what the new sequence number should be for new > allocations. -- This message was sent by Atlassian JIRA (v6.2#6252)
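The subtraction-based comparator quoted in the CheckpointAMPreemptionPolicy snippet above is exactly the pattern {{compareTo}} is meant to replace: subtracting two ints overflows when the ids are far apart, yielding the wrong sign. A minimal, self-contained illustration (plain ints standing in for container ids, not Hadoop code):

```java
public class ComparatorOverflow {
    public static void main(String[] args) {
        // For these two values the difference wraps around to a
        // positive number, so a subtraction-based comparator would
        // wrongly report a > b.
        int a = Integer.MIN_VALUE + 1;  // a very small id
        int b = 2;                      // a small positive id

        System.out.println(a - b > 0);                 // overflow: wrongly claims a > b
        System.out.println(Integer.compare(a, b) > 0); // correct: a < b
    }
}
```

Using {{Integer.compare}} (or delegating to the type's own {{compareTo}}) avoids the overflow entirely, which is why the comment suggests updating CheckpointAMPreemptionPolicy once ContainerId gains {{compareTo}}.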
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033586#comment-14033586 ] Binglin Chang commented on YARN-2088: - Hi [~djp], could you help review this patch? I am working on YARN-2051, which depends on this code; without it the test fails. > Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder > > > Key: YARN-2088 > URL: https://issues.apache.org/jira/browse/YARN-2088 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Binglin Chang >Assignee: Binglin Chang > Attachments: YARN-2088.v1.patch > > > Some fields (sets, lists) are added to proto builders multiple times; we need to > clear those fields before adding, otherwise the resulting proto contains extra > contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033570#comment-14033570 ] Hadoop QA commented on YARN-2142: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650759/trust.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4012//console This message is automatically generated. > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anders updated YARN-2142: - Attachment: trust.patch Test whether this patch can work > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2142) Add one service to check the nodes' TRUST status
[ https://issues.apache.org/jira/browse/YARN-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] anders updated YARN-2142: - Attachment: trust.patch Test whether this patch can work > Add one service to check the nodes' TRUST status > - > > Key: YARN-2142 > URL: https://issues.apache.org/jira/browse/YARN-2142 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, resourcemanager, scheduler >Affects Versions: 2.2.0 > Environment: OS:Ubuntu 13.04; > JAVA:OpenJDK 7u51-2.4.4-0 >Reporter: anders >Priority: Minor > Labels: patch > Fix For: 2.2.0 > > Attachments: trust.patch, trust.patch > > Original Estimate: 1m > Remaining Estimate: 1m > > Because of our critical computing environment, we must check every node's TRUST > status in the cluster (we can get the TRUST status via the API of the OAT > server), so I added this feature to Hadoop's scheduling. > Through the TRUST check service, a node can obtain its own TRUST status and > then send it to the ResourceManager via the heartbeat for > scheduling. > In the scheduling step, if a node's TRUST status is 'false', it will be > skipped until its TRUST status turns to 'true'. > ***The logic of this feature is similar to the node's health check service. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2074) Preemption of AM containers shouldn't count towards AM failures
[ https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033531#comment-14033531 ] Hadoop QA commented on YARN-2074: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650742/YARN-2074.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4011//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4011//console This message is automatically generated. 
> Preemption of AM containers shouldn't count towards AM failures > --- > > Key: YARN-2074 > URL: https://issues.apache.org/jira/browse/YARN-2074 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Vinod Kumar Vavilapalli >Assignee: Jian He > Attachments: YARN-2074.1.patch, YARN-2074.2.patch, YARN-2074.3.patch, > YARN-2074.4.patch, YARN-2074.5.patch, YARN-2074.6.patch, YARN-2074.6.patch, > YARN-2074.7.patch, YARN-2074.7.patch > > > One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM > containers getting preempted shouldn't count towards AM failures and thus > shouldn't eventually fail applications. > We should explicitly handle AM container preemption/kill as a separate issue > and not count it towards the limit on AM failures. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1480) RM web services getApps() accepts many more filters than ApplicationCLI "list" command
[ https://issues.apache.org/jira/browse/YARN-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033502#comment-14033502 ] Zhijie Shen commented on YARN-1480: --- Hi [~kj-ki], thanks for the patch. Here are some meta comments on it: 1. I looked into the current RMWebServices#getApps(), and below is the list of missing options in ApplicationCLI. "queue" (the current "queue" option is for the "movetoqueue" command) and "tags" are not covered in the patch. If it's not a big addition, is it better to include these two options in the option list? {code} @QueryParam("finalStatus") String finalStatusQuery, @QueryParam("user") String userQuery, @QueryParam("queue") String queueQuery, @QueryParam("limit") String count, @QueryParam("startedTimeBegin") String startedBegin, @QueryParam("startedTimeEnd") String startedEnd, @QueryParam("finishedTimeBegin") String finishBegin, @QueryParam("finishedTimeEnd") String finishEnd, @QueryParam("applicationTags") Set<String> applicationTags {code} 2. ApplicationClientProtocol#getApplications already supports the full filters, while YarnClient does not seem to support the full set of options yet. IMHO, the right way here is to make YarnClient support the full filters and have ApplicationCLI simply call that API. Pulling a long app list from the RM and filtering it locally is inefficient. > RM web services getApps() accepts many more filters than ApplicationCLI > "list" command > -- > > Key: YARN-1480 > URL: https://issues.apache.org/jira/browse/YARN-1480 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Kenji Kikushima > Attachments: YARN-1480-2.patch, YARN-1480-3.patch, YARN-1480-4.patch, > YARN-1480-5.patch, YARN-1480.patch > > > Nowadays RM web services getApps() accepts many more filters than > ApplicationCLI "list" command, which only accepts "state" and "type". IMHO, > ideally, different interfaces should provide consistent functionality. Is it > better to allow more filters in ApplicationCLI? 
-- This message was sent by Atlassian JIRA (v6.2#6252)