[jira] [Commented] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129611#comment-14129611
 ] 

Hadoop QA commented on YARN-2534:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667934/YARN-2534.000.patch
  against trunk revision 4be9517.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4882//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4882//console

This message is automatically generated.

> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal
> -
>
> Key: YARN-2534
> URL: https://issues.apache.org/jira/browse/YARN-2534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2534.000.patch
>
>
> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal for some cases.
> If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, 
> but no individual max share is equal to Integer.MAX_VALUE, then totalMaxShare 
> overflows to a negative value, which causes all the fair shares to be calculated 
> incorrectly.
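
For illustration, a minimal sketch of the overflow and of one possible guard (this is 
not the attached patch; the class and method names below are made up):

{code}
// Summing int max shares into an int wraps past Integer.MAX_VALUE; accumulating
// into a long and clamping keeps totalMaxShare non-negative.
import java.util.Arrays;
import java.util.List;

public class TotalMaxShareSketch {
  static int totalMaxShare(List<Integer> maxShares) {
    long total = 0;
    for (int share : maxShares) {
      total += share;                 // long arithmetic, no wrap-around here
    }
    return (int) Math.min(total, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    List<Integer> shares = Arrays.asList(1073741824, 1073741824);
    System.out.println(shares.get(0) + shares.get(1));   // -2147483648 (int overflow)
    System.out.println(totalMaxShare(shares));           // 2147483647 (clamped)
  }
}
{code}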



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-10 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-611:
---
Attachment: YARN-611.9.rebase.patch

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, 
> YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch, 
> YARN-611.9.rebase.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM is 
> considered "well behaved" and it is safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail; if not, the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?
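
A minimal, self-contained sketch of the window-based check described above (the names 
and the data structure are illustrative only, not the actual RMAppImpl change):

{code}
// A failure only counts against max-retries if it happened within the window;
// older failures are dropped, which effectively resets the count over time.
import java.util.ArrayDeque;
import java.util.Deque;

public class FailureWindowSketch {
  private final int maxRetries;
  private final long windowMs;                  // e.g. retry-count-window-ms
  private final Deque<Long> failureTimes = new ArrayDeque<Long>();

  public FailureWindowSketch(int maxRetries, long windowMs) {
    this.maxRetries = maxRetries;
    this.windowMs = windowMs;
  }

  /** Record a failure at time 'now' and decide whether the app should be failed. */
  public boolean shouldFailApp(long now) {
    failureTimes.addLast(now);
    // Failures that fell out of the window no longer count.
    while (!failureTimes.isEmpty() && now - failureTimes.peekFirst() > windowMs) {
      failureTimes.removeFirst();
    }
    return failureTimes.size() > maxRetries;
  }
}
{code}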



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129610#comment-14129610
 ] 

Jian He commented on YARN-2496:
---

Briefly looked at the patch:
- CSQueueUtils.java has format changes only; we can revert it.
- Why check {{labelManager != null}} everywhere? We only need to check it 
where it's needed.
- We may not need to change the method signature to add one more parameter; 
just pass the queues map into NodeLabelManager#reinitializeQueueLabels to 
avoid a number of test changes.
{code}
parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, 
queues, queues, noop, queueToLabels);
{code}
- The label initialization code is duplicated between ParentQueue and LeafQueue; 
how about creating an AbstractCSQueue and putting the common initialization 
methods there (see the sketch below)?
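
A rough sketch of the shape such a refactoring could take (all names below are 
hypothetical; the actual class may look different):

{code}
// Pull the label fields and their common setup out of ParentQueue/LeafQueue
// into a shared abstract base class.
import java.util.HashSet;
import java.util.Set;

abstract class AbstractCSQueueSketch {
  protected final Set<String> accessibleLabels = new HashSet<String>();
  protected String defaultLabelExpression;

  /** Label initialization shared by parent and leaf queues. */
  protected void setupQueueLabels(Set<String> configuredLabels,
      String configuredDefaultExpression) {
    if (configuredLabels != null) {
      accessibleLabels.addAll(configuredLabels);
    }
    defaultLabelExpression = configuredDefaultExpression;
  }
}

class ParentQueueSketch extends AbstractCSQueueSketch { /* parent-only logic */ }
class LeafQueueSketch extends AbstractCSQueueSketch { /* leaf-only logic */ }
{code}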

> [YARN-796] Changes for capacity scheduler to support allocate resource 
> respect labels
> -
>
> Key: YARN-2496
> URL: https://issues.apache.org/jira/browse/YARN-2496
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-2496.patch
>
>
> This JIRA Includes:
> - Add/parse labels option to {{capacity-scheduler.xml}} similar to other 
> options of queue like capacity/maximum-capacity, etc.
> - Include a "default-label-expression" option in queue config, if an app 
> doesn't specify label-expression, "default-label-expression" of queue will be 
> used.
> - Check if labels can be accessed by the queue when submit an app with 
> labels-expression to queue or update ResourceRequest with label-expression
> - Check labels on NM when trying to allocate ResourceRequest on the NM with 
> label-expression
> - Respect  labels when calculate headroom/user-limit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129609#comment-14129609
 ] 

Hadoop QA commented on YARN-611:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667940/YARN-611.9.patch
  against trunk revision 4be9517.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4883//console

This message is automatically generated.

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, 
> YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM is 
> considered "well behaved" and it is safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail; if not, the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129606#comment-14129606
 ] 

Xuan Gong commented on YARN-611:


bq. The name sliding window: Sliding window of what? We should make this clear 
in the API. How about attempt_failures_sliding_window_size? Or should we call 
it attempt_failures_validity_interval? Any other ideas? Zhijie Shen?

Changed to attempt_failures_validity_interval

bq. Either way, you will have to change all of the following:
yarn_protos.proto: sliding_window_size
ApplicationSubmissionContext: Rename slidingWindowSize, setters and getters
RMAppImpl.slidingWindowSize

DONE

bq. It is not clear what units the window-size is measured in from the API. 
Secs? Millis? We should javadoc this everywhere.

ADDED

bq. RMAppImpl.isAttemptFailureExceedMaxAttempt -> isNumAttemptsBeyondThreshold.

Changed

bq. TestAMRestart: The tests are very brittle because of the sleeps. Can we 
instead use a Clock and use it everywhere? That way you can inject manual 
clock-advance and test deterministically. See SystemClock for main line code 
usage and the ControlledClock for tests.

FIXED
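
For reference, a self-contained analogue of the clock-injection idea (the real tests 
would use org.apache.hadoop.yarn.util.SystemClock/ControlledClock; the types below are 
made up to keep the example standalone):

{code}
// The test advances a manual clock instead of sleeping, so the window check
// is deterministic.
interface Clock {
  long getTime();
}

class ManualClock implements Clock {
  private long time;
  @Override public long getTime() { return time; }
  void advance(long ms) { time += ms; }
}

public class ClockInjectionSketch {
  public static void main(String[] args) {
    ManualClock clock = new ManualClock();
    long validityIntervalMs = 10000L;

    long firstFailureTime = clock.getTime();   // record a failure at "now"
    clock.advance(20000L);                     // jump past the interval, no sleep

    boolean stillCounts =
        clock.getTime() - firstFailureTime <= validityIntervalMs;
    System.out.println("failure still counts: " + stillCounts);   // false
  }
}
{code}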

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, 
> YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM is 
> considered "well behaved" and it is safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail; if not, the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-10 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-611:
---
Attachment: YARN-611.9.patch

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, 
> YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time within which an AM is 
> considered "well behaved" and it is safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail; if not, the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2534:

Fix Version/s: (was: 2.6.0)

> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal
> -
>
> Key: YARN-2534
> URL: https://issues.apache.org/jira/browse/YARN-2534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-2534.000.patch
>
>
> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal for some cases.
> If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, 
> but no individual max share is equal to Integer.MAX_VALUE, then totalMaxShare 
> overflows to a negative value, which causes all the fair shares to be calculated 
> incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129582#comment-14129582
 ] 

zhihai xu commented on YARN-2534:
-

I uploaded a patch YARN-2534.000.patch for review.
I added a test case in this patch to prove this issue exists:
two queues, where QueueA's maxShare is 1073741824 and QueueB's maxShare is 
1073741824, so the sum of the two maxShares is more than Integer.MAX_VALUE.
Without the fix, the test will fail.

> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal
> -
>
> Key: YARN-2534
> URL: https://issues.apache.org/jira/browse/YARN-2534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-2534.000.patch
>
>
> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal for some cases.
> If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, 
> but no individual max share is equal to Integer.MAX_VALUE, then totalMaxShare 
> overflows to a negative value, which causes all the fair shares to be calculated 
> incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated YARN-2534:

Attachment: YARN-2534.000.patch

> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal
> -
>
> Key: YARN-2534
> URL: https://issues.apache.org/jira/browse/YARN-2534
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.5.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.6.0
>
> Attachments: YARN-2534.000.patch
>
>
> FairScheduler: totalMaxShare is not calculated correctly in 
> computeSharesInternal for some cases.
> If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, 
> but no individual max share is equal to Integer.MAX_VALUE, then totalMaxShare 
> overflows to a negative value, which causes all the fair shares to be calculated 
> incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal

2014-09-10 Thread zhihai xu (JIRA)
zhihai xu created YARN-2534:
---

 Summary: FairScheduler: totalMaxShare is not calculated correctly 
in computeSharesInternal
 Key: YARN-2534
 URL: https://issues.apache.org/jira/browse/YARN-2534
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scheduler
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
 Fix For: 2.6.0


FairScheduler: totalMaxShare is not calculated correctly in 
computeSharesInternal for some cases.
If the sum of the max shares of all Schedulables is more than Integer.MAX_VALUE, but 
no individual max share is equal to Integer.MAX_VALUE, then totalMaxShare overflows to 
a negative value, which causes all the fair shares to be calculated incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129557#comment-14129557
 ] 

Tsuyoshi OZAWA commented on YARN-2229:
--

The latest v16 patch is ready for review. [~jianhe], could you check it?

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.
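
For illustration, the bit arithmetic behind the current format and why 1024 epochs is 
the limit (a standalone sketch, not the actual ContainerId implementation):

{code}
// With a 32-bit id split into a 10-bit epoch and a 22-bit sequence number, an
// epoch of 1024 no longer fits and wraps around; widening the id to 64 bits
// leaves plenty of room for the epoch.
public class ContainerIdBitsSketch {
  static final int SEQUENCE_BITS = 22;

  static int packAsInt(int epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & ((1 << SEQUENCE_BITS) - 1));
  }

  static long packAsLong(long epoch, int sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & ((1 << SEQUENCE_BITS) - 1));
  }

  public static void main(String[] args) {
    System.out.println(packAsInt(1, 7));      // 4194311: epoch 1, sequence 7
    System.out.println(packAsInt(1024, 7));   // 7: wraps, collides with epoch 0
    System.out.println(packAsLong(1024, 7));  // 4294967303: fits fine in a long
  }
}
{code}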



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129541#comment-14129541
 ] 

Hadoop QA commented on YARN-2229:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667919/YARN-2229.16.patch
  against trunk revision 83be3ad.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4881//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4881//console

This message is automatically generated.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129533#comment-14129533
 ] 

Vinod Kumar Vavilapalli commented on YARN-2440:
---

This looks good, +1. Checking this in..

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129528#comment-14129528
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667915/YARN-2229.16.patch
  against trunk revision 5ec7fcd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
  
org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerApplicationAttempt

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4880//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4880//console

This message is automatically generated.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-09-10 Thread Wangda Tan (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-796:

Attachment: YARN-796.node-label.consolidate.2.patch

Attached an updated consolidated patch, named 
"YARN-796.node-label.consolidate.2.patch". It contains several bug fixes and 
supports admins changing node labels when the RM is not running. 

Please feel free to try and review.

Thanks,
Wangda

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, 
> Node-labels-Requirements-Design-doc-V1.pdf, 
> Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, 
> YARN-796.node-label.consolidate.1.patch, 
> YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.demo.patch.1, 
> YARN-796.patch, YARN-796.patch4
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2499) [YARN-796] Respect labels in preemption policy of fair scheduler

2014-09-10 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-2499:
---

Assignee: Naganarasimha G R

> [YARN-796] Respect labels in preemption policy of fair scheduler
> 
>
> Key: YARN-2499
> URL: https://issues.apache.org/jira/browse/YARN-2499
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-2495) [YARN-796] Allow admin specify labels in each NM (Distributed configuration)

2014-09-10 Thread Naganarasimha G R (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naganarasimha G R reassigned YARN-2495:
---

Assignee: Naganarasimha G R

> [YARN-796] Allow admin specify labels in each NM (Distributed configuration)
> 
>
> Key: YARN-2495
> URL: https://issues.apache.org/jira/browse/YARN-2495
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Wangda Tan
>Assignee: Naganarasimha G R
>
> The target of this JIRA is to allow admins to specify labels on each NM; this covers:
> - Users can set labels on each NM (by setting yarn-site.xml or using the script 
> suggested by [~aw])
> - The NM will send its labels to the RM via the ResourceTracker API
> - The RM will set the labels in the NodeLabelManager when the NM registers/updates labels



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-09-10 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129480#comment-14129480
 ] 

Craig Welch commented on YARN-796:
--

Good, what you describe wrt the CLI is what I was trying to describe; I just 
might not have been very clear about it. I'm going to go ahead, then, and make 
the changes on the service side to match what we've described.

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, 
> Node-labels-Requirements-Design-doc-V1.pdf, 
> Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, 
> YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, 
> YARN-796.patch, YARN-796.patch4
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: YARN-2229.16.patch

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: (was: YARN-2229.16.patch)

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, 
> YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, 
> YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: YARN-2229.16.patch

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-09-10 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129464#comment-14129464
 ] 

Wangda Tan commented on YARN-796:
-

Hi Craig,
I think when the RM is running, the solution should be exactly as you described: we 
should only check whether the caller is a user on the admin list, and the RM will write 
the file itself, by default as the "yarn" user.
But when the RM is not running and we need to execute a tool to directly modify data 
in the store, we cannot use this approach, because the ACL is retrieved from the local 
configuration file: a malicious user could create a configuration that declares 
itself an admin user and use that configuration to launch the tool. 
IMHO, we don't need to check the ACL when running a standalone tool; it 
will modify the file, and the file's directory already has the right permissions (e.g. it 
belongs to the yarn user), so HDFS will do the check for us. But we should only run 
such a standalone command as the same user that launches the RM.

Thanks,
Wangda

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, 
> Node-labels-Requirements-Design-doc-V1.pdf, 
> Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, 
> YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, 
> YARN-796.patch, YARN-796.patch4
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: (was: YARN-2229.16.patch)

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, 
> YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, 
> YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129459#comment-14129459
 ] 

Jian He commented on YARN-1372:
---

Thanks for updating the patch. Some comments and naming suggestions:
- NodeStatusUpdater has import changes only; we can revert it.
- Indentation format of the second line:
{code}
  public void removeCompletedContainersFromContext(List<ContainerId>
      containerIds) throws
  public RMNodeCleanedupContainerNotifiedEvent(NodeId nodeId,
      ContainerId contId) {
{code}
- Why add {{context.getContainers().remove(cid);}} in the 
removeVeryOldStoppedContainersFromContext method? Won't this remove the 
containers from the context immediately when we send the container statuses across, 
which contradicts the rest of the changes?
- In NodeStatusUpdaterImpl, the previousCompletedContainers cache is not needed any 
more, since we make the NM remove containers from its context only after it gets the 
notification. We can remove this cache; instead, in 
NodeStatusUpdater#getContainerStatuses, while we are looping over all the 
containers, we can check whether the corresponding application exists and, if not, 
remove the container from the context.
- make sure {{context.getNMStateStore().removeContainer(cid);}} is called after 
receiving the notification from RM as well.
- {{RMNodeEventType#CLEANEDUP_CONTAINER_NOTIFIED}}: put it in a new section where 
the source is RMAppAttempt. How about renaming it to FINISHED_CONTAINERS_PULLED_BY_AM, 
and similarly RMNodeCleanedupContainerNotifiedEvent -> 
RMNodeFinishedContainersPulledByAMEvent?
- In RMAppAttemptImpl#BaseFinalTransition, we can clear 
finishedContainersSentToAM, in case the AM unexpectedly crashes.
- I think Map> is more space efficient than: 
{code}
  private Map finishedContainersSentToAM =
  new HashMap();
{code}
- Format convention is to have the method body on a different line from the 
method head:
{code}
public NodeId getNodeId() { return this.nodeId; }
{code}
- RMNodeImpl#cleanupContainersNotified: maybe rename it to 
finishedContainersPulledByAM, and similarly CleanedupContainerNotifiedTransition to 
FinishedContainersPulledByAMTransition.
- NodeHeartbeatResponse#addCleanedupContainersNotified: how about 
addFinishedContainersPulledByAM? Similarly for the getter 
NodeHeartbeatResponse#getCleanedupContainersNotified and the proto file. Also 
add some code comments to explain why this new API is being added.

> Ensure all completed containers are reported to the AMs across RM restart
> -
>
> Key: YARN-1372
> URL: https://issues.apache.org/jira/browse/YARN-1372
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.002_NMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, 
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AM's about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.
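
A very rough sketch of the NM-side bookkeeping this describes (all names are 
hypothetical; this is not the patch):

{code}
// Completed containers stay in a pending set until the RM confirms the AM has
// pulled them; everything still pending is reported on every heartbeat and on
// re-registration after an RM restart.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompletedContainerTrackerSketch {
  private final Set<String> pendingAck = new HashSet<String>();   // container ids

  /** A container finished: remember it until the RM acknowledges the pull. */
  public void containerCompleted(String containerId) {
    pendingAck.add(containerId);
  }

  /** Included in every heartbeat and in re-registration with the RM. */
  public List<String> containersToReport() {
    return new ArrayList<String>(pendingAck);
  }

  /** The RM reported that the AM pulled these containers; stop tracking them. */
  public void ackedByAM(List<String> containerIds) {
    pendingAck.removeAll(containerIds);
  }
}
{code}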



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: YARN-2229.16.patch

Fixed to pass tests.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, 
> YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, 
> YARN-2229.8.patch, YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new format of container Id while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129430#comment-14129430
 ] 

Hadoop QA commented on YARN-2529:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667892/YARN-2529.1.patch
  against trunk revision 5ec7fcd.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4879//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4879//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-applicationhistoryservice.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4879//console

This message is automatically generated.

> Generic history service RPC interface doesn't work when service authorization 
> is enabled
> 
>
> Key: YARN-2529
> URL: https://issues.apache.org/jira/browse/YARN-2529
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2529.1.patch
>
>
> Here's the problem shown in the log:
> {code}
> 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for 
> protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is 
> unauthorized for user zshen (auth:SIMPLE)
> 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: 
> readAndProcess from client 10.22.2.109 threw exception 
> [org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
> interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not 
> known.]
> {code}
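
The usual cause of this error is that the protocol is missing from the PolicyProvider 
handed to the RPC server. The sketch below shows the general registration pattern only; 
it is not the attached patch, and the ACL key string is made up for illustration:

{code}
// A protocol must be listed in the PolicyProvider used by the server, otherwise
// service authorization rejects it as "not known".
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.security.authorize.Service;
import org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB;

public class HistoryPolicyProviderSketch extends PolicyProvider {
  @Override
  public Service[] getServices() {
    return new Service[] {
        // Hypothetical ACL key; the real key is defined by the YARN configuration.
        new Service("security.applicationhistory.protocol.acl",
            ApplicationHistoryProtocolPB.class)
    };
  }
}
{code}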



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129429#comment-14129429
 ] 

Hadoop QA commented on YARN-2229:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667886/YARN-2229.15.patch
  against trunk revision 7f80e14.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.util.TestConverterUtils

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4878//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/4878//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-api.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4878//console

This message is automatically generated.

> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, 
> YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, 
> YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new container Id format while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129384#comment-14129384
 ] 

Hadoop QA commented on YARN-415:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12667877/YARN-415.201409102216.txt
  against trunk revision 7f80e14.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 12 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4877//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4877//console

This message is automatically generated.

> Capture aggregate memory allocation at the app-level for chargeback
> ---
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
> YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
> YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
> YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
> YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
> YARN-415.201409040036.txt, YARN-415.201409092204.txt, 
> YARN-415.201409102216.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129378#comment-14129378
 ] 

Karthik Kambatla commented on YARN-1492:


I am good with fixing the in-memory store so store-specific details don't creep 
into the code elsewhere. 

Personally, I am okay with working on the leveldb and zk stores post-merge. My main 
concern is with providing a way to initialize the store: we don't have a good 
answer for long-running apps, and initialization will not be required once the 
leveldb and zk implementations cover the non-HA and HA cases. I would rather avoid 
that piece completely. I am okay with having an in-memory store that the tests 
exercise and that has trivial recovery. Having a "real" store, though, would 
definitely boost people's confidence at merge time :)

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2464) Provide Hadoop as a local resource (on HDFS) which can be used by other projects

2014-09-10 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129379#comment-14129379
 ] 

Junping Du commented on YARN-2464:
--

[~sseth], I will assign this to myself and work on it if you haven't started 
working on it.

> Provide Hadoop as a local resource (on HDFS) which can be used by other 
> projects
> 
>
> Key: YARN-2464
> URL: https://issues.apache.org/jira/browse/YARN-2464
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>
> DEFAULT_YARN_APPLICATION_CLASSPATH are used by YARN projects to setup their 
> AM / task classpaths if they have a dependency on Hadoop libraries.
> It'll be useful to provide similar access to a Hadoop tarball (Hadoop libs, 
> native libraries) etc, which could be used instead - for applications which 
> do not want to rely upon Hadoop versions from a cluster node. This would also 
> require functionality to update the classpath/env for the apps based on the 
> structure of the tar.
> As an example, MR has support for a full tar (for rolling upgrades). 
> Similarly, Tez ships hadoop libraries along with its build. I'm not sure 
> about the Spark / Storm / HBase model for this - but using a common copy 
> instead of everyone localizing Hadoop libraries would be useful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129359#comment-14129359
 ] 

Hadoop QA commented on YARN-1372:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667876/YARN-1372.003.patch
  against trunk revision 7f80e14.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 7 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4876//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4876//console

This message is automatically generated.

> Ensure all completed containers are reported to the AMs across RM restart
> -
>
> Key: YARN-1372
> URL: https://issues.apache.org/jira/browse/YARN-1372
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.002_NMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, 
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AMs about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129354#comment-14129354
 ] 

Junping Du commented on YARN-2033:
--

Sure. Will commit it soon. Thanks [~zjshen] for the patch and [~vinodkv] for 
review!

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, 
> YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
> YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
> YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
> YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amicable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-10 Thread Chris Trezzo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129342#comment-14129342
 ] 

Chris Trezzo commented on YARN-1492:


Thanks [~kasha]! A couple of questions:

bq. 2. The choice of SCM store should be transparent to the rest of SCM code. 
It would be better to define an interface for the SCMStore similar to the 
RMStateStore today.

To clarify the above point. An interface does exist in the current 
implementation (see SCMStore.java in YARN-2180), and all SCMStore 
implementations should be based off of that. Unfortunately some implementation 
details from the in-memory store have leaked through via the SCMContext object. 
I am working on an update to improve the interface so that an SCMContext object 
is no longer needed and all implementation details are hidden behind 
SCMStore.java. Does your above point mean that you are looking for a state 
machine-based interface like RMStateStore, or do you see additional issues with 
the SCMStore interface outside of the SCMContext fix?

bq. 3. Defaulting to the in-memory store requires providing a way to initialize 
the store with currently running applications and cached jars, which is quite 
involved and not so elegant either. I propose implementing leveldb and zk 
stores. We could default to leveldb on non-HA clusters, and ZK store for HA 
clusters if we choose to embed the SCM in the RM.

Do you see the leveldb and zk stores as blockers to merging into trunk/2.6 or 
would an in-memory store with the interface fixes mentioned above be enough 
initially? Leveldb and ZK stores could be easily added post-merge in an 
incremental way as additional SCMStore implementations.
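
For readers following along, here is a minimal sketch of the kind of pluggable store interface being discussed. The names below (SharedCacheStore, addResourceReference, and so on) are illustrative stand-ins, not the actual SCMStore.java from YARN-2180:

{code}
// Hypothetical sketch of a pluggable shared-cache store; the real SCMStore
// interface in YARN-2180 may differ in names and signatures.
public interface SharedCacheStore {

  /** Record a cached resource keyed by its checksum and return its cache path. */
  String addResource(String checksum, String fileName);

  /** Add a reference from a running application to a cached resource. */
  void addResourceReference(String checksum, String applicationId);

  /** Drop all references held by an application that has finished. */
  void removeApplication(String applicationId);

  /** Rebuild state after an SCM restart (trivial for an in-memory store). */
  void recover() throws Exception;
}
{code}

With an interface of this shape, the leveldb and ZK variants could indeed be added post-merge as additional implementations, as suggested above.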

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled

2014-09-10 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2529:
--
Attachment: YARN-2529.1.patch

I created a patch to make the application history protocol use the timeline policy 
when service authorization is enabled. It's not straightforward to add the test 
cases on top of TestApplicationHistoryClientService, but I've manually verified 
it on my local single-node cluster.

> Generic history service RPC interface doesn't work when service authorization 
> is enabled
> 
>
> Key: YARN-2529
> URL: https://issues.apache.org/jira/browse/YARN-2529
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
> Attachments: YARN-2529.1.patch
>
>
> Here's the problem shown in the log:
> {code}
> 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for 
> protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is 
> unauthorized for user zshen (auth:SIMPLE)
> 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: 
> readAndProcess from client 10.22.2.109 threw exception 
> [org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
> interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not 
> known.]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-121) Yarn services to throw a YarnException on invalid state changes

2014-09-10 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-121:
--
Fix Version/s: (was: 3.0.0)

> Yarn services to throw a YarnException on invalid state changes
> --
>
> Key: YARN-121
> URL: https://issues.apache.org/jira/browse/YARN-121
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> the {{EnsureCurrentState()}} checks of services throw an 
> {{IllegalStateException}} if the state is wrong. If this were changed to 
> {{YarnException}}, wrapper services such as CompositeService could relay it 
> directly, instead of wrapping it in their own.
> The implementation time is mainly in changing the lifecycle test cases of the 
> MAPREDUCE-3939 subtasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-120) Make yarn-common services robust

2014-09-10 Thread Allen Wittenauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allen Wittenauer updated YARN-120:
--
Fix Version/s: (was: 3.0.0)

> Make yarn-common services robust
> 
>
> Key: YARN-120
> URL: https://issues.apache.org/jira/browse/YARN-120
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>  Labels: yarn
> Attachments: MAPREDUCE-4014.patch
>
>
> Review the yarn common services ({{CompositeService}}, 
> {{AbstractLivelinessMonitor}}) and make their service startup _and especially 
> shutdown_ more robust against out-of-lifecycle invocation and partially 
> complete initialization.
> Write tests for these where possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsuyoshi OZAWA updated YARN-2229:
-
Attachment: YARN-2229.15.patch

Talked with Jian offline. Updated the patch to reduce the epoch from 32 bits to 
24 bits and increase the id from 32 bits to 40 bits, since 32 bits for the epoch 
is too much. The protobuf spec allows truncating int32/int64 values:
https://developers.google.com/protocol-buffers/docs/proto
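
As a rough illustration of the 24/40-bit split mentioned above, packing and unpacking an epoch and a sequence number in a single 64-bit value could look like the sketch below. This is not the patch itself; the actual field layout and names in YARN-2229 may differ:

{code}
public final class ContainerIdPacking {
  // 24 bits of epoch, 40 bits of per-epoch sequence number, per the comment above.
  private static final int SEQUENCE_BITS = 40;
  private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

  /** Combine the RM epoch and the per-epoch container sequence number. */
  static long pack(long epoch, long sequence) {
    return (epoch << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  static long epochOf(long containerId) {
    return containerId >>> SEQUENCE_BITS;
  }

  static long sequenceOf(long containerId) {
    return containerId & SEQUENCE_MASK;
  }

  public static void main(String[] args) {
    long id = pack(3, 12345);
    System.out.println(epochOf(id) + " " + sequenceOf(id)); // prints: 3 12345
  }
}
{code}

With 24 epoch bits the RM could restart about 16.7 million times before overflowing, versus 1024 with the earlier 10-bit epoch.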


> ContainerId can overflow with RM restart
> 
>
> Key: YARN-2229
> URL: https://issues.apache.org/jira/browse/YARN-2229
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Tsuyoshi OZAWA
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2229.1.patch, YARN-2229.10.patch, 
> YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, 
> YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, 
> YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, 
> YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, 
> YARN-2229.9.patch
>
>
> On YARN-2052, we changed containerId format: upper 10 bits are for epoch, 
> lower 22 bits are for sequence number of Ids. This is for preserving 
> semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, 
> {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and 
> {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM 
> restarts 1024 times.
> To avoid the problem, it's better to make containerId a long. We need to define 
> the new container Id format while preserving backward compatibility on this 
> JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2533) Redirect stdout and stderr to a file for all applications/frameworks

2014-09-10 Thread Kannan Rajah (JIRA)
Kannan Rajah created YARN-2533:
--

 Summary: Redirect stdout and stderr to a file for all 
applications/frameworks
 Key: YARN-2533
 URL: https://issues.apache.org/jira/browse/YARN-2533
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 2.4.1
Reporter: Kannan Rajah
Priority: Minor


Today, we have the capability to redirect the stdout and stderr of shell commands 
(launched tasks) to a file and also apply a tail length. This logic exists in 
TaskLog and YARNRunner, but those reside in MapReduce-specific packages, so every 
framework has to duplicate it. It would be nice to abstract this at the 
YARN level and apply it to shell commands launched by any framework.

ContainerLaunch.call method looks like a good candidate. Does anyone have 
suggestions or guidelines?
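
To make the proposal concrete, a generic helper might build the launch command along the lines of the sketch below. This is only an assumption of what such an abstraction could look like; the method name buildRedirectedCommand and the exact shell syntax are illustrative, not copied from TaskLog or YARNRunner:

{code}
import java.util.List;

public final class ShellRedirection {

  /**
   * Wrap a launch command so its stdout/stderr go to files, optionally keeping
   * only the last tailLength bytes of each stream (tailLength <= 0 keeps all).
   */
  static String buildRedirectedCommand(List<String> cmd, String stdout,
      String stderr, long tailLength) {
    String joined = String.join(" ", cmd);
    if (tailLength <= 0) {
      return joined + " 1> " + stdout + " 2> " + stderr;
    }
    // Pipe each stream through tail -c so only the tail is retained on disk.
    return "( " + joined + " | tail -c " + tailLength + " > " + stdout
        + " ) 2>&1 | tail -c " + tailLength + " > " + stderr;
  }

  public static void main(String[] args) {
    System.out.println(buildRedirectedCommand(
        List.of("./my_app", "--arg"), "stdout.log", "stderr.log", 4096));
  }
}
{code}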



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-10 Thread Eric Payne (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-415:

Attachment: YARN-415.201409102216.txt

Thanks a lot, [~jianhe].

I have added comment headers for the new APIs in ApplicationResourceUsageReport.

> Capture aggregate memory allocation at the app-level for chargeback
> ---
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
> YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
> YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
> YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
> YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
> YARN-415.201409040036.txt, YARN-415.201409092204.txt, 
> YARN-415.201409102216.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart

2014-09-10 Thread Anubhav Dhoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anubhav Dhoot updated YARN-1372:

Attachment: YARN-1372.003.patch

As per feedback, containers are now removed when the corresponding application 
does not exist. That simplified a lot of code from the second iteration.
Also added unit tests.
Also renamed previousJustFinishedContainers to finishedContainersSentToAM 
to clarify the difference. As discussed earlier, this avoids the problem of a 
failure occurring between the RM acking these containers to the NM and the AM 
successfully processing the set. By waiting for the next allocate call before 
acking to the NM, we guarantee the AM has successfully received this list.
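
For context, the ack-after-next-allocate idea described above boils down to bookkeeping like the following sketch; the class and method names here are simplified stand-ins, not the ones in the patch:

{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FinishedContainerTracker {
  // Completed containers not yet handed to the AM.
  private final List<String> justFinished = new ArrayList<>();
  // Containers handed to the AM on the previous allocate call; only safe to
  // ack to the NMs once the AM calls allocate again, proving it received them.
  private final Set<String> finishedSentToAM = new HashSet<>();

  /** An NM reported a completed container. */
  public synchronized void containerFinished(String containerId) {
    justFinished.add(containerId);
  }

  /** AM heartbeat: ack the previous batch to the NMs, hand out the new batch. */
  public synchronized List<String> allocate() {
    ackToNodeManagers(finishedSentToAM);
    finishedSentToAM.clear();
    List<String> toSend = new ArrayList<>(justFinished);
    finishedSentToAM.addAll(toSend);
    justFinished.clear();
    return toSend;
  }

  private void ackToNodeManagers(Set<String> containerIds) {
    // Placeholder: the real RM would notify each NM so it can drop these
    // containers from its pending-notification list.
  }
}
{code}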

> Ensure all completed containers are reported to the AMs across RM restart
> -
>
> Key: YARN-1372
> URL: https://issues.apache.org/jira/browse/YARN-1372
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Bikas Saha
>Assignee: Anubhav Dhoot
> Attachments: YARN-1372.001.patch, YARN-1372.001.patch, 
> YARN-1372.002_NMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, 
> YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, 
> YARN-1372.prelim.patch, YARN-1372.prelim2.patch
>
>
> Currently the NM informs the RM about completed containers and then removes 
> those containers from the RM notification list. The RM passes on that 
> completed container information to the AM and the AM pulls this data. If the 
> RM dies before the AM pulls this data then the AM may not be able to get this 
> information again. To fix this, NM should maintain a separate list of such 
> completed container notifications sent to the RM. After the AM has pulled the 
> containers from the RM then the RM will inform the NM about it and the NM can 
> remove the completed container from the new list. Upon re-register with the 
> RM (after RM restart) the NM should send the entire list of completed 
> containers to the RM along with any other containers that completed while the 
> RM was dead. This ensures that the RM can inform the AMs about all completed 
> containers. Some container completions may be reported more than once since 
> the AM may have pulled the container but the RM may die before notifying the 
> NM about the pull.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2056) Disable preemption at Queue level

2014-09-10 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129181#comment-14129181
 ] 

Jason Lowe commented on YARN-2056:
--

Sorry for coming in late.  I think there's an issue with this part of the patch:

{code}
  // The per-queue disablePreemption defaults to false (preemption enabled).
  // Inherit parent's per-queue disablePreemption value.
  boolean parentQueueDisablePreemption = false;
  boolean queueDisablePreemption = false;

  if (root.getParent() != null) {
String parentQueuePropName = BASE_YARN_RM_PREEMPTION
 + root.getParent().getQueuePath()
 + SUFFIX_DISABLE_PREEMPTION;
parentQueueDisablePreemption =
this.conf.getBoolean(parentQueuePropName, false);
  }

  String queuePropName = BASE_YARN_RM_PREEMPTION + root.getQueuePath()
 + SUFFIX_DISABLE_PREEMPTION;
  queueDisablePreemption =
 this.conf.getBoolean(queuePropName, parentQueueDisablePreemption);
{code}

I think it only examines the immediate parent for a default value. If 
preemption is disabled at an ancestor two or more levels removed from the leaf 
queue, then it appears we won't honor that.
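
A self-contained sketch of the recursive inheritance being asked for here, where a queue falls back to the nearest ancestor's setting rather than only its immediate parent's, might look like the following. The property-name constants below are stand-ins for BASE_YARN_RM_PREEMPTION and SUFFIX_DISABLE_PREEMPTION, and a plain Map stands in for Configuration/CSQueue:

{code}
import java.util.Map;

public final class PreemptionInheritance {
  // Stand-ins for BASE_YARN_RM_PREEMPTION and SUFFIX_DISABLE_PREEMPTION.
  static final String BASE = "yarn.resourcemanager.preemption.";
  static final String SUFFIX = ".disable_preemption";

  /** queuePath is dotted, e.g. "root.a.b"; conf maps property name to value. */
  static boolean isPreemptionDisabled(Map<String, Boolean> conf, String queuePath) {
    int lastDot = queuePath.lastIndexOf('.');
    // The root queue inherits nothing; the overall default is "preemption enabled".
    boolean inherited = lastDot < 0
        ? false
        : isPreemptionDisabled(conf, queuePath.substring(0, lastDot));
    return conf.getOrDefault(BASE + queuePath + SUFFIX, inherited);
  }

  public static void main(String[] args) {
    Map<String, Boolean> conf = Map.of(BASE + "root.a" + SUFFIX, true);
    // root.a.b.c sets nothing explicitly but inherits "disabled" from root.a.
    System.out.println(isPreemptionDisabled(conf, "root.a.b.c")); // true
  }
}
{code}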

> Disable preemption at Queue level
> -
>
> Key: YARN-2056
> URL: https://issues.apache.org/jira/browse/YARN-2056
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Mayank Bansal
>Assignee: Eric Payne
> Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, 
> YARN-2056.201408310117.txt, YARN-2056.201409022208.txt
>
>
> We need to be able to disable preemption at individual queue level



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129141#comment-14129141
 ] 

Jian He commented on YARN-415:
--

Eric, thanks for your explanation. Sounds good to me.
One nit: I found the new APIs added in ApplicationResourceUsageReport don't 
have code comments. Could you add those too?
I'd like to commit this once that is fixed. Thanks for all your patience!

> Capture aggregate memory allocation at the app-level for chargeback
> ---
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
> YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
> YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
> YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
> YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
> YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM

2014-09-10 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129133#comment-14129133
 ] 

Zhijie Shen commented on YARN-611:
--

bq. How about attempt_failures_sliding_window_size? Or should we call it 
attempt_failures_validity_interval? Any other ideas?

attempt_failures_validity_interval sounds good to me.

> Add an AM retry count reset window to YARN RM
> -
>
> Key: YARN-611
> URL: https://issues.apache.org/jira/browse/YARN-611
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.0.3-alpha
>Reporter: Chris Riccomini
>Assignee: Xuan Gong
> Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, 
> YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, 
> YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch
>
>
> YARN currently has the following config:
> yarn.resourcemanager.am.max-retries
> This config defaults to 2, and defines how many times to retry a "failed" AM 
> before failing the whole YARN job. YARN counts an AM as failed if the node 
> that it was running on dies (the NM will timeout, which counts as a failure 
> for the AM), or if the AM dies.
> This configuration is insufficient for long running (or infinitely running) 
> YARN jobs, since the machine (or NM) that the AM is running on will 
> eventually need to be restarted (or the machine/NM will fail). In such an 
> event, the AM has not done anything wrong, but this is counted as a "failure" 
> by the RM. Since the retry count for the AM is never reset, eventually, at 
> some point, the number of machine/NM failures will result in the AM failure 
> count going above the configured value for 
> yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the 
> job as failed, and shut it down. This behavior is not ideal.
> I propose that we add a second configuration:
> yarn.resourcemanager.am.retry-count-window-ms
> This configuration would define a window of time that would define when an AM 
> is "well behaved", and it's safe to reset its failure count back to zero. 
> Every time an AM fails the RmAppImpl would check the last time that the AM 
> failed. If the last failure was less than retry-count-window-ms ago, and the 
> new failure count is > max-retries, then the job should fail. If the AM has 
> never failed, the retry count is < max-retries, or if the last failure was 
> OUTSIDE the retry-count-window-ms, then the job should be restarted. 
> Additionally, if the last failure was outside the retry-count-window-ms, then 
> the failure count should be set back to 0.
> This would give developers a way to have well-behaved AMs run forever, while 
> still failing mis-behaving AMs after a short period of time.
> I think the work to be done here is to change the RmAppImpl to actually look 
> at app.attempts, and see if there have been more than max-retries failures in 
> the last retry-count-window-ms milliseconds. If there have, then the job 
> should fail, if not, then the job should go forward. Additionally, we might 
> also need to add an endTime in either RMAppAttemptImpl or 
> RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the 
> failure.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-10 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129119#comment-14129119
 ] 

Eric Payne commented on YARN-415:
-

Thanks for clarifying [~jianhe].
{quote}
is this {{currentAttempt.getAppAttemptId().equals(attemptId)}} still necessary 
? since the return value of {{scheduler#getAppResourceUsageReport}} for 
non-active attempt is anyways empty/null.
{quote}
I believe that the check is necessary. Here are a couple of points.
- First, {{RMAppAttemptMetrics#getAggregateAppResourceUsage}} is called from 
multiple places, including {{RMAppImpl#getRMAppMetrics}}, which loops through 
all attempts for any given app. If the app is running and has multiple 
attempts, we want to charge the current attempt for both the running container 
stats and those that finished for that attempt. But, in this scenario, when 
{{RMAppImpl#getRMAppMetrics}} loops through and calls 
{{RMAppAttemptMetrics#getAggregateAppResourceUsage}} for the finished attempts, 
{{RMAppAttemptMetrics#getAggregateAppResourceUsage}} needs to know that the 
attempt ID is not the current attempt so that it doesn't count the running 
container stats again.
- Second, from my tests and my reading of the code, I'm pretty sure that 
{{scheduler#getAppResourceUsageReport}} always returns the 
{{ApplicationResourceUsageReport}} for the current attempt, even if you give it 
a finished attempt. It uses the attemptId to get the app object, and then uses 
that to get the current attempt. I've tested this, and by taking a look at 
{{AbstractYarnScheduler#getApplicationAttempt}} (which is called by 
{{getAppResourceUsageReport}} for both CapacityScheduler and FairScheduler), we 
can see that it only uses the attemptId to get the app, and then calls 
app.getCurrentAttempt().

I hope that helps to clarify this.
Thank you
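
To illustrate the double-counting concern above in isolation: only the current attempt should fold in the stats of still-running containers, while finished attempts report just what they accumulated. The names in this sketch are simplified stand-ins, not the actual RMAppAttemptMetrics code:

{code}
public class AttemptResourceUsage {
  // Memory-seconds accumulated by containers that already finished in this attempt.
  private final long finishedMemorySeconds;

  public AttemptResourceUsage(long finishedMemorySeconds) {
    this.finishedMemorySeconds = finishedMemorySeconds;
  }

  /**
   * @param isCurrentAttempt     true only for the application's current attempt
   * @param runningMemorySeconds live-container usage from the scheduler report,
   *                             which always describes the current attempt
   */
  public long aggregateMemorySeconds(boolean isCurrentAttempt,
      long runningMemorySeconds) {
    long total = finishedMemorySeconds;
    if (isCurrentAttempt) {
      // Adding running-container stats for older attempts would count the same
      // containers once per attempt, inflating the aggregate.
      total += runningMemorySeconds;
    }
    return total;
  }
}
{code}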

> Capture aggregate memory allocation at the app-level for chargeback
> ---
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
> YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
> YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
> YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
> YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
> YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2456) Possible lovelock in CapacityScheduler when RM is recovering apps

2014-09-10 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2456:
--
Summary: Possible lovelock in CapacityScheduler when RM is recovering apps  
(was: Possible deadlock in CapacityScheduler when RM is recovering apps)

> Possible lovelock in CapacityScheduler when RM is recovering apps
> -
>
> Key: YARN-2456
> URL: https://issues.apache.org/jira/browse/YARN-2456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2456.1.patch
>
>
> Consider this scenario:
> 1. RM is configured with a single queue and only one application can be 
> active at a time.
> 2. Submit App1 which uses up the queue's whole capacity
> 3. Submit App2 which remains pending.
> 4. Restart RM.
> 5. App2 is recovered before App1, so App2 is added to the activeApplications 
> list. Now App1 remains pending (because of max-active-app limit)
> 6. All containers of App1 are now recovered when NM registers, and use up the 
> whole queue capacity again.
> 7. Since the queue is full, App2 cannot proceed to allocate AM container.
> 8. In the meanwhile, App1 cannot proceed to become active because of the 
> max-active-app limit 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2456) Possible livelock in CapacityScheduler when RM is recovering apps

2014-09-10 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-2456:
--
Summary: Possible livelock in CapacityScheduler when RM is recovering apps  
(was: Possible lovelock in CapacityScheduler when RM is recovering apps)

> Possible livelock in CapacityScheduler when RM is recovering apps
> -
>
> Key: YARN-2456
> URL: https://issues.apache.org/jira/browse/YARN-2456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2456.1.patch
>
>
> Consider this scenario:
> 1. RM is configured with a single queue and only one application can be 
> active at a time.
> 2. Submit App1 which uses up the queue's whole capacity
> 3. Submit App2 which remains pending.
> 4. Restart RM.
> 5. App2 is recovered before App1, so App2 is added to the activeApplications 
> list. Now App1 remains pending (because of max-active-app limit)
> 6. All containers of App1 are now recovered when NM registers, and use up the 
> whole queue capacity again.
> 7. Since the queue is full, App2 cannot proceed to allocate AM container.
> 8. In the meanwhile, App1 cannot proceed to become active because of the 
> max-active-app limit 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129066#comment-14129066
 ] 

Jian He commented on YARN-2456:
---

Folks, thanks for the comments. Renamed the title as suggested by Wangda.
I agree that too many other factors may affect this issue, e.g. NM resync time. 
This patch really just mitigates the issue rather than solving it completely.

> Possible deadlock in CapacityScheduler when RM is recovering apps
> -
>
> Key: YARN-2456
> URL: https://issues.apache.org/jira/browse/YARN-2456
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Jian He
>Assignee: Jian He
> Attachments: YARN-2456.1.patch
>
>
> Consider this scenario:
> 1. RM is configured with a single queue and only one application can be 
> active at a time.
> 2. Submit App1 which uses up the queue's whole capacity
> 3. Submit App2 which remains pending.
> 4. Restart RM.
> 5. App2 is recovered before App1, so App2 is added to the activeApplications 
> list. Now App1 remains pending (because of max-active-app limit)
> 6. All containers of App1 are now recovered when NM registers, and use up the 
> whole queue capacity again.
> 7. Since the queue is full, App2 cannot proceed to allocate AM container.
> 8. In the meanwhile, App1 cannot proceed to become active because of the 
> max-active-app limit 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-10 Thread Xuan Gong (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuan Gong updated YARN-2459:

Fix Version/s: 2.6.0

> RM crashes if App gets rejected for any reason and HA is enabled
> 
>
> Key: YARN-2459
> URL: https://issues.apache.org/jira/browse/YARN-2459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Fix For: 2.6.0
>
> Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
> YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch
>
>
> If RM HA is enabled and used Zookeeper store for RM State Store.
> If for any reason Any app gets rejected and directly goes to NEW to FAILED
> then final transition makes that to RMApps and Completed Apps memory 
> structure but that doesn't make it to State store.
> Now when RMApps default limit reaches it starts deleting apps from memory and 
> store. In that case it try to delete this app from store and fails which 
> causes RM to crash.
> Thanks,
> Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2458) Add file handling features to the Windows Secure Container Executor LRPC service

2014-09-10 Thread Remus Rusanu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129061#comment-14129061
 ] 

Remus Rusanu commented on YARN-2458:


The solution proposed here is to have the Windows Secure Container Executor use 
its own FileContext and FileSystem. The WSCE filesystem is derived from 
RawLocalFileSystem and overrides the actual creation of directories and the 
setPermission, setOwner and createOutputStream operations. These operations 
are executed via JNI/LRPC by calling corresponding remote methods offered 
by the hadoopwinutilsvc service. This service runs as a privileged user 
(LocalSystem) and thus can execute certain operations forbidden to the NM, like 
writing into the container dirs (owned by the container user). The actual 
implementation of methods like setOwner/setPermission is the same as before: 
whether it is invoked via winutils chown/chmod or via the native Hadoop.dll JNI, 
the code is exactly the same and is shared via libwinutils. These changes simply 
offer a mechanism to execute that code in an elevated process.
The patches also contain some changes around classpath jar creation: previously 
the jar was created directly in the destination dir (the container's private 
dirs). This is now forbidden because the NM does not have the right to do it. 
Instead, the classpath jars are created in the private nmPrivate folder and then 
moved into the container dirs (via a copy/move API offered by hadoopwinutilsvc).
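
The delegation pattern being described could be sketched roughly as below. ElevatedSvcClient and its methods are hypothetical placeholders for the JNI/LRPC client to hadoopwinutilsvc; the real WSCE classes and RPC surface differ:

{code}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.fs.permission.FsPermission;

public class ElevatedLocalFileSystem extends RawLocalFileSystem {
  private final ElevatedSvcClient svc = new ElevatedSvcClient();

  @Override
  public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    // The NM cannot create directories owned by the container user; ask the
    // privileged helper service to do it instead.
    return svc.mkdir(pathToString(f), permission.toShort());
  }

  @Override
  public void setOwner(Path p, String user, String group) throws IOException {
    svc.chown(pathToString(p), user, group);
  }

  @Override
  public void setPermission(Path p, FsPermission permission) throws IOException {
    svc.chmod(pathToString(p), permission.toShort());
  }

  private String pathToString(Path p) {
    return p.toUri().getPath();
  }

  /** Placeholder for the JNI/LRPC client to the elevated helper service. */
  static class ElevatedSvcClient {
    boolean mkdir(String path, short perm) throws IOException { return true; }
    void chown(String path, String user, String group) throws IOException {}
    void chmod(String path, short perm) throws IOException {}
  }
}
{code}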

> Add file handling features to the Windows Secure Container Executor LRPC 
> service
> 
>
> Key: YARN-2458
> URL: https://issues.apache.org/jira/browse/YARN-2458
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager
>Reporter: Remus Rusanu
>Assignee: Remus Rusanu
>  Labels: security, windows
> Attachments: YARN-2458.1.patch, YARN-2458.2.patch
>
>
> In the WSCE design the nodemanager needs to do certain privileged operations 
> like change file ownership to arbitrary users or delete files owned by the 
> task container user after completion of the task. As we want to remove the 
> Administrator privilege  requirement from the nodemanager service, we have to 
> move these operations into the privileged LRPC helper service. 
> Extend the RPC interface to contain methods for change file ownership and 
> manipulate files, add JNI client side and implement the server side. This 
> will piggyback on the existing LRPC service, so there is not much infrastructure 
> to add (run as service, RPC init, authentication and authorization are already 
> solved). It just needs to be implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129058#comment-14129058
 ] 

Hadoop QA commented on YARN-2440:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12667797/apache-yarn-2440.6.patch
  against trunk revision cbfe263.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4875//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4875//console

This message is automatically generated.

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129052#comment-14129052
 ] 

Vinod Kumar Vavilapalli commented on YARN-2033:
---

+1, looks good to me. [~djp], can you please do the honours given you did early 
reviews?

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, 
> YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
> YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
> YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
> YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amicable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129042#comment-14129042
 ] 

Hadoop QA commented on YARN-2033:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667818/YARN-2033.12.patch
  against trunk revision 47bdfa0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 17 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4874//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4874//console

This message is automatically generated.

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, 
> YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
> YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
> YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
> YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amicable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2475) ReservationSystem: replan upon capacity reduction

2014-09-10 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129041#comment-14129041
 ] 

Chris Douglas commented on YARN-2475:
-

{{SimpleCapacityReplanner}}
* The Clock can be initialized in the constructor, declared private and final
* The exception refers to an InventorySizeAdjusmentPolicy
* nit: redundant parenthesis in the main loop, exceeds 80 char
* {{curSessions}} cannot be null; prefer {{!isEmpty()}} to {{size() > 0}}
** Is this check even necessary? {{sort}} and the following loop should be noops
* A brief comment about the natural order of {{ReservationAllocations}} would 
help readability of this loop. It's in the class doc, but something inline 
would be helpful
* An internal {{Resource(0,0)}} could be reused, instead of creating it in the 
loop
* Could the inner loop be more readable? The embedded function calls in the 
{{Resource}} arithmetic are hard to read (pseudo):
{code}
ArrayList<> curSessions = new ArrayList<>(plan.getResourcesAtTime(t));
Collections.sort(curSessions);
for (Iterator<> i = curSessions.iterator(); i.hasNext() && excessCap > 0;) {
  InMemoryReservationAllocation a = (InMemoryReservationAllocation) i.next();
  plan.deleteReservation(a.getReservationId());
  excessCap -= a.getResourcesAtTime(t);
}
{code}
* Why is the enforcement window tied to {{CapacitySchedulerConfiguration}}?

{{TestSimpleCapacityReplanner}}
* Tests should not call {{Thread.sleep}}; instead update the mock
* Passing in a mocked {{Clock}} to the constructor rather than assigning it in 
the test is cleaner (see the sketch after this list)
* Instead of {{assertTrue(cond != null)}} use {{assertNotNull(cond)}} (same for 
positive null check)
* The test should not catch and discard {{PlanningException}}
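
To illustrate the {{Clock}} point, a minimal sketch of the test setup; this 
assumes {{SimpleCapacityReplanner}} gains a constructor that accepts a 
{{Clock}} (names are illustrative):
{code}
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.hadoop.yarn.util.Clock;

// inject a mocked Clock instead of sleeping in the test
Clock clock = mock(Clock.class);
when(clock.getTime()).thenReturn(0L);
SimpleCapacityReplanner replanner = new SimpleCapacityReplanner(clock); // assumed constructor

// later in the test, advance "time" by updating the mock rather than Thread.sleep()
when(clock.getTime()).thenReturn(10000L);
{code}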

> ReservationSystem: replan upon capacity reduction
> -
>
> Key: YARN-2475
> URL: https://issues.apache.org/jira/browse/YARN-2475
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-2475.patch
>
>
> In the context of YARN-1051, if capacity of the cluster drops significantly 
> upon machine failures we need to trigger a reorganization of the planned 
> reservations. As reservations are "absolute" it is possible that they will 
> not all fit, and some need to be rejected a-posteriori.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1710) Admission Control: agents to allocate reservation

2014-09-10 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129038#comment-14129038
 ] 

Chris Douglas commented on YARN-1710:
-

{{GreedyReservationAgent}}
* Consider {{@link}} for {{ReservationRequest}} in class javadoc
* An inline comment could replace the {{adjustContract()}} method
* Most of the javadoc on private methods can be cut
* {{currentReservationStage}} does not need to be declared outside the loop
* {{allocations}} cannot be null
* An internal {{Resource(0, 0)}} could be reused
* {{li}} should be part of the loop ({{for}} not {{while}}). Its initialization 
is unreadable; please use temp vars.
* Generally, embedded calls are difficult to read:
{code}
if (findEarliestTime(allocations.keySet()) > earliestStart) {
  allocations.put(new ReservationInterval(earliestStart,
  findEarliestTime(allocations.keySet())), ReservationRequest
  .newInstance(Resource.newInstance(0, 0), 0));
  // consider to add trailing zeros at the end for simmetry
}
{code}
Assuming the {{ReservationRequest}} is never modified by the plan:
{code}
private final ReservationRequest ZERO_RSRC =
    ReservationRequest.newInstance(Resource.newInstance(0, 0), 0);
// ...
long allocStart = findEarliestTime(allocations.keySet());
if (allocStart > earliestStart) {
  ReservationInterval preAlloc =
      new ReservationInterval(earliestStart, allocStart);
  allocations.put(preAlloc, ZERO_RSRC);
}
{code}
* {{findEarliestTime(allocations.keySet())}} is called several times and should 
be memoized
** Would a {{TreeSet}} be more appropriate, given this access pattern?
* Instead of:
{code}
boolean result = false;
if (oldReservation != null) {
  result = plan.updateReservation(capReservation);
} else {
  result = plan.addReservation(capReservation);
}
return result;
{code}
Consider:
{code}
if (oldReservation != null) {
  return plan.updateReservation(capReservation);
}
return plan.addReservation(capReservation);
{code}
* A comment unpacking the arithmetic for calculating {{curMaxGang}} would help 
readability

{{TestGreedyReservationAgent}}
* Instead of fixing the seed, consider setting and logging it for each run (a 
minimal sketch follows this list).
* {{testStress}} is brittle, as it verifies only the timeout; {{testBig}} and 
{{testSmall}} don't verify anything. Both tests are useful, but probably not as 
part of the build. Dropping the annotation and adding a {{main()}} that calls 
each of them would be one alternative.
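
For the seed point, a minimal sketch of "set and log it per run":
{code}
import java.util.Random;

// pick a fresh seed each run, but log it so a failing run can be reproduced
long seed = System.currentTimeMillis();
System.out.println("TestGreedyReservationAgent seed = " + seed);
Random rand = new Random(seed);
{code}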

> Admission Control: agents to allocate reservation
> -
>
> Key: YARN-1710
> URL: https://issues.apache.org/jira/browse/YARN-1710
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Attachments: YARN-1710.1.patch, YARN-1710.patch
>
>
> This JIRA tracks the algorithms used to allocate a user ReservationRequest 
> coming in from the new reservation API (YARN-1708), in the inventory 
> subsystem (YARN-1709) maintaining the current plan for the cluster. The focus 
> of these "agents" is to quickly find a solution for the set of constraints 
> provided by the user, and the physical constraints of the plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests

2014-09-10 Thread Craig Welch (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129024#comment-14129024
 ] 

Craig Welch commented on YARN-796:
--

So, I'm adding code to check whether a user should be able to modify labels (is 
an admin), and I think we should check the UserGroupInformation but not execute 
the operation using "doAs". Ultimately the process is writing data into HDFS, 
and for permissions reasons I think it should always be written as the same 
user - the user yarn runs as. If we do the doAs there will be a mishmash of 
users in that directory, and to keep it secure there would need to be a group 
with rights that contains all the admin users, which is extra overhead 
(otherwise it has to be world writable, which tends to compromise the security 
model...). I think the same is true if we use other datastores down the line 
for holding the label info - really, our interest in the user is to verify 
access; we don't need or want to perform actions on their behalf (like you 
would when launching a job, etc.) - this is not one of those cases. So I 
propose enforcing the check but executing whatever changes as the user the 
process is running under (the resourcemanager/yarn user, basically - just 
dropping the doAs). This means that entry points will need to do the 
verification, but that's not really an issue; they already have to gather the 
info about who the user is and be aware of the need for doAs today. It means 
that a user will need to be careful to run any tool which directly modifies the 
data in HDFS as an appropriate user, but they already have to do that; it's not 
a new issue created by this approach (it doesn't really make that any better or 
worse, imho). Thoughts?
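
For concreteness, here is a rough sketch of the shape I have in mind (the 
class, field and path names below are illustrative only, not actual patch 
code):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical helper: verify the caller is an admin, but write as the RM user.
public class NodeLabelStoreWriter {
  private final AccessControlList adminAcl;  // built from yarn.admin.acl
  private final FileSystem fs;               // opened as the RM/yarn user
  private final Path labelStorePath;         // hypothetical store location

  public NodeLabelStoreWriter(AccessControlList adminAcl, FileSystem fs,
      Path labelStorePath) {
    this.adminAcl = adminAcl;
    this.fs = fs;
    this.labelStorePath = labelStorePath;
  }

  // the caller identity is gathered at the entry point (RPC/REST layer) and
  // passed in; note that the write below is NOT wrapped in caller.doAs(...)
  public void writeLabels(UserGroupInformation caller, byte[] serializedLabels)
      throws IOException, YarnException {
    if (!adminAcl.isUserAllowed(caller)) {
      throw new YarnException("User " + caller.getShortUserName()
          + " is not an admin and cannot modify node labels");
    }
    // executed as the process user (yarn), so the store directory only needs
    // to be writable by that one user
    FSDataOutputStream out = fs.create(labelStorePath, true);
    try {
      out.write(serializedLabels);
    } finally {
      out.close();
    }
  }
}
{code}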

> Allow for (admin) labels on nodes and resource-requests
> ---
>
> Key: YARN-796
> URL: https://issues.apache.org/jira/browse/YARN-796
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 2.4.1
>Reporter: Arun C Murthy
>Assignee: Wangda Tan
> Attachments: LabelBasedScheduling.pdf, 
> Node-labels-Requirements-Design-doc-V1.pdf, 
> Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, 
> YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, 
> YARN-796.patch, YARN-796.patch4
>
>
> It will be useful for admins to specify labels for nodes. Examples of labels 
> are OS, processor architecture etc.
> We should expose these labels and allow applications to specify labels on 
> resource-requests.
> Obviously we need to support admin operations on adding/removing node labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129022#comment-14129022
 ] 

Jian He commented on YARN-415:
--

Hi [~eepayne], sorry for being unclear. My main question was: is the 
{{currentAttempt.getAppAttemptId().equals(attemptId)}} check still necessary, 
since the return value of scheduler#getAppResourceUsageReport for a non-active 
attempt is empty/null anyway?

> Capture aggregate memory allocation at the app-level for chargeback
> ---
>
> Key: YARN-415
> URL: https://issues.apache.org/jira/browse/YARN-415
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Kendall Thrapp
>Assignee: Andrey Klochkov
> Attachments: YARN-415--n10.patch, YARN-415--n2.patch, 
> YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, 
> YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, 
> YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, 
> YARN-415.201406262136.txt, YARN-415.201407042037.txt, 
> YARN-415.201407071542.txt, YARN-415.201407171553.txt, 
> YARN-415.201407172144.txt, YARN-415.201407232237.txt, 
> YARN-415.201407242148.txt, YARN-415.201407281816.txt, 
> YARN-415.201408062232.txt, YARN-415.201408080204.txt, 
> YARN-415.201408092006.txt, YARN-415.201408132109.txt, 
> YARN-415.201408150030.txt, YARN-415.201408181938.txt, 
> YARN-415.201408181938.txt, YARN-415.201408212033.txt, 
> YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch
>
>
> For the purpose of chargeback, I'd like to be able to compute the cost of an
> application in terms of cluster resource usage.  To start out, I'd like to 
> get the memory utilization of an application.  The unit should be MB-seconds 
> or something similar and, from a chargeback perspective, the memory amount 
> should be the memory reserved for the application, as even if the app didn't 
> use all that memory, no one else was able to use it.
> (reserved ram for container 1 * lifetime of container 1) + (reserved ram for
> container 2 * lifetime of container 2) + ... + (reserved ram for container n 
> * lifetime of container n)
> It'd be nice to have this at the app level instead of the job level because:
> 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't 
> appear on the job history server).
> 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm).
> This new metric should be available both through the RM UI and RM Web 
> Services REST API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129020#comment-14129020
 ] 

Vinod Kumar Vavilapalli commented on YARN-2440:
---

The build machine ran into an issue, which [~gkesavan] helped fix at my 
offline request. Re-kicked the build manually.

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2532) Track pending resources at the application level

2014-09-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129013#comment-14129013
 ] 

Karthik Kambatla commented on YARN-2532:


bq. For the FS at least, this is just FSAppAttempt.getDemand() - 
FSAppAttempt.getResourceUsage()
Yes, it is. Tracking pending resources separately is not necessary for 
YARN-2353. However, demand for a queue or an app-attempt changes when the app 
requests more resources (increase in pending resources) or containers complete 
(consumption goes down). Since we want to track the pending resources 
information for YARN-2333, I thought we might as well do that first and use 
that as a trigger to update the demand in YARN-2353. 
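
To make that concrete, the derived value would look something like this (sketch 
only; not claiming this is how YARN-2333/YARN-2353 will wire it up):
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt;
import org.apache.hadoop.yarn.util.resource.Resources;

// derive pending resources from demand and current consumption
static Resource getPending(FSAppAttempt app) {
  return Resources.subtract(app.getDemand(), app.getResourceUsage());
}
{code}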

> Track pending resources at the application level 
> -
>
> Key: YARN-2532
> URL: https://issues.apache.org/jira/browse/YARN-2532
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> SchedulerApplicationAttempt keeps track of current consumption of an app. It 
> would be nice to have a similar value tracked for pending requests. 
> The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and 
> (2) updating demand in FS in an event-driven style (YARN-2353)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129010#comment-14129010
 ] 

Hadoop QA commented on YARN-2033:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667808/YARN-2033.11.patch
  against trunk revision 47bdfa0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 17 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4873//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4873//console

This message is automatically generated.

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, 
> YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
> YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
> YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
> YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amenable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2532) Track pending resources at the application level

2014-09-10 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128997#comment-14128997
 ] 

Sandy Ryza commented on YARN-2532:
--

For the FS at least, this is just FSAppAttempt.getDemand() - 
FSAppAttempt.getResourceUsage(), no?

> Track pending resources at the application level 
> -
>
> Key: YARN-2532
> URL: https://issues.apache.org/jira/browse/YARN-2532
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Affects Versions: 2.5.1
>Reporter: Karthik Kambatla
>Assignee: Karthik Kambatla
>
> SchedulerApplicationAttempt keeps track of current consumption of an app. It 
> would be nice to have a similar value tracked for pending requests. 
> The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and 
> (2) updating demand in FS in an event-driven style (YARN-2353)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk

2014-09-10 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128985#comment-14128985
 ] 

Jian He commented on YARN-2158:
---

looks good, committing

> TestRMWebServicesAppsModification sometimes fails in trunk
> --
>
> Key: YARN-2158
> URL: https://issues.apache.org/jira/browse/YARN-2158
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Ted Yu
>Assignee: Varun Vasudev
>Priority: Minor
> Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch
>
>
> From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console :
> {code}
> Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec 
> <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification
> testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification)
>   Time elapsed: 2.297 sec  <<< FAILURE!
> java.lang.AssertionError: app state incorrect
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128984#comment-14128984
 ] 

Karthik Kambatla commented on YARN-1492:


Thanks for updating the design, Chris. 

Chris and I discussed the design and current implementation offline. A couple 
of comments in that discussion:
# I like the idea of having a separate daemon for SCM, but if it is not very 
resource (memory) intensive, it might make sense to embed it in the RM by 
default. This takes care of HA etc. for free. We can do this at the end. 
# The choice of SCM store should be transparent to the rest of the SCM code. It 
would be better to define an interface for the SCMStore, similar to the 
RMStateStore today (a rough sketch follows after this list).
# Defaulting to the in-memory store requires providing a way to initialize the 
store with currently running applications and cached jars, which is quite 
involved and not so elegant either. I propose implementing leveldb and zk 
stores. We could default to leveldb on non-HA clusters, and ZK store for HA 
clusters if we choose to embed the SCM in the RM.
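
Regarding point 2, a rough sketch of the kind of interface I mean (the method 
names here are illustrative only, not a finalized API):
{code}
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.ApplicationId;

// Hypothetical pluggable store interface, analogous in spirit to RMStateStore
public interface SCMStore {

  /** Record that a resource with the given checksum is cached at the given path. */
  void addResource(String checksum, String cachePath) throws IOException;

  /** Remove a resource from the store, e.g. when the cleaner evicts it. */
  void removeResource(String checksum) throws IOException;

  /** Track that a running application currently references a cached resource. */
  void addResourceReference(String checksum, ApplicationId appId) throws IOException;

  /** Drop the reference when the application finishes. */
  void removeResourceReference(String checksum, ApplicationId appId) throws IOException;
}
{code}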

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2532) Track pending resources at the application level

2014-09-10 Thread Karthik Kambatla (JIRA)
Karthik Kambatla created YARN-2532:
--

 Summary: Track pending resources at the application level 
 Key: YARN-2532
 URL: https://issues.apache.org/jira/browse/YARN-2532
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scheduler
Affects Versions: 2.5.1
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla


SchedulerApplicationAttempt keeps track of current consumption of an app. It 
would be nice to have a similar value tracked for pending requests. 

The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and 
(2) updating demand in FS in an event-driven style (YARN-2353)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2033:
--
Attachment: YARN-2033.12.patch

Missed the changes in YarnConfiguration in the last patch; added them in the 
new one.

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, 
> YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, 
> YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, 
> YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, 
> YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amenable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128929#comment-14128929
 ] 

Xuan Gong commented on YARN-2459:
-

Also, thanks Mayank for the initial patch.

> RM crashes if App gets rejected for any reason and HA is enabled
> 
>
> Key: YARN-2459
> URL: https://issues.apache.org/jira/browse/YARN-2459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
> YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch
>
>
> With RM HA enabled and the ZooKeeper store used as the RM state store, if an
> app gets rejected for any reason and goes directly from NEW to FAILED, the
> final transition adds it to the RMApps and completed-apps memory structures,
> but it never makes it to the state store.
> Now when the RMApps default limit is reached, the RM starts deleting apps from
> memory and the store. In that case it tries to delete this app from the store,
> fails, and crashes.
> Thanks,
> Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled

2014-09-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128927#comment-14128927
 ] 

Xuan Gong commented on YARN-2459:
-

Committed into trunk and branch-2. Thanks, Jian.

> RM crashes if App gets rejected for any reason and HA is enabled
> 
>
> Key: YARN-2459
> URL: https://issues.apache.org/jira/browse/YARN-2459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.4.1
>Reporter: Mayank Bansal
>Assignee: Mayank Bansal
> Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, 
> YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch
>
>
> With RM HA enabled and the ZooKeeper store used as the RM state store, if an
> app gets rejected for any reason and goes directly from NEW to FAILED, the
> final transition adds it to the RMApps and completed-apps memory structures,
> but it never makes it to the state store.
> Now when the RMApps default limit is reached, the RM starts deleting apps from
> memory and the store. In that case it tries to delete this app from the store,
> fails, and crashes.
> Thanks,
> Mayank



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits

2014-09-10 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-2531:
---

 Summary: CGroups - Admins should be allowed to enforce strict cpu 
limits
 Key: YARN-2531
 URL: https://issues.apache.org/jira/browse/YARN-2531
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev


From YARN-2440 -
{quote} 
The other dimension to this is determinism w.r.t performance. Limiting to 
allocated cores overall (as well as per container later) helps orgs run 
workloads and reason about them deterministically. One of the examples is 
benchmarking apps, but deterministic execution is a desired option beyond 
benchmarks too.
{quote}

It would be nice to have an option that lets admins enforce strict CPU limits 
for apps for things like benchmarking. By default this flag should be off so 
that containers can use available CPU, but an admin can turn the flag on to 
measure worst-case performance, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2033:
--
Attachment: YARN-2033.11.patch

Good catch! I uploaded a new patch with the updated config names.

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.2.patch, 
> YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, 
> YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, 
> YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, 
> YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amenable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128911#comment-14128911
 ] 

Hudson commented on YARN-2526:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1892 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1892/])
YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. 
(Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java


> SLS can deadlock when all the threads are taken by AMSimulators
> ---
>
> Key: YARN-2526
> URL: https://issues.apache.org/jira/browse/YARN-2526
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 2.5.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Fix For: 2.6.0
>
> Attachments: YARN-2526-1.patch
>
>
> The simulation may enter deadlock if all application simulators hold all 
> threads provided by the thread pool, and all wait for AM container 
> allocation. In that case, all AM simulators wait for NM simulators to do 
> heartbeat to allocate resource, and all NM simulators wait for AM simulators 
> to release some threads. The simulator is deadlocked.
> To solve this deadlock, we need to remove the while() loop in the MRAMSimulator.
> {code}
> // waiting until the AM container is allocated
> while (true) {
>   if (response != null && ! response.getAllocatedContainers().isEmpty()) {
> // get AM container
> .
> break;
>   }
>   // this sleep time is different from HeartBeat
>   Thread.sleep(1000);
>   // send out empty request
>   sendContainerRequest();
>   response = responseQueue.take();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128915#comment-14128915
 ] 

Hudson commented on YARN-1471:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1892 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1892/])
Add missing YARN-1471 to the CHANGES.txt (aw: rev 
9b8104575444ed2de9b44fe902f86f7395f249ed)
* hadoop-yarn-project/CHANGES.txt


> The SLS simulator is not running the preemption policy for CapacityScheduler
> 
>
> Key: YARN-1471
> URL: https://issues.apache.org/jira/browse/YARN-1471
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, 
> YARN-1471.patch, YARN-1471.patch
>
>
> The simulator does not run the ProportionalCapacityPreemptionPolicy monitor.  
> This is because the policy needs to interact with a CapacityScheduler, and 
> the wrapping done by the simulator breaks this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128877#comment-14128877
 ] 

Vinod Kumar Vavilapalli commented on YARN-2033:
---

Looks mostly good. Rename yarn.resourcemanager.metrics-publisher.enabled so 
that it also says system-metrics-publisher? Similarly rename 
yarn.resourcemanager.metrics-publisher.dispatcher.pool-size?

> Investigate merging generic-history into the Timeline Store
> ---
>
> Key: YARN-2033
> URL: https://issues.apache.org/jira/browse/YARN-2033
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Vinod Kumar Vavilapalli
>Assignee: Zhijie Shen
> Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, 
> YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.2.patch, YARN-2033.3.patch, 
> YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, 
> YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, 
> YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, 
> YARN-2033_ALL.4.patch
>
>
> Having two different stores isn't amenable to generic insights on what's 
> happening with applications. This is to investigate porting generic-history 
> into the Timeline Store.
> One goal is to try and retain most of the client side interfaces as close to 
> what we have today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128865#comment-14128865
 ] 

Hadoop QA commented on YARN-1492:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12667798/shared_cache_design_v6.pdf
  against trunk revision b67d5ba.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4872//console

This message is automatically generated.

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2530) MapReduce should take cpu into account when doing headroom calculations

2014-09-10 Thread Varun Vasudev (JIRA)
Varun Vasudev created YARN-2530:
---

 Summary: MapReduce should take cpu into account when doing 
headroom calculations
 Key: YARN-2530
 URL: https://issues.apache.org/jira/browse/YARN-2530
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Varun Vasudev
Assignee: Varun Vasudev


Currently the MapReduce AM only uses memory when doing headroom calculations as 
well as calculations about launching reducers. It would be preferable to account 
for CPU as well if the scheduler on the YARN side is using CPU when scheduling. 
YARN-2448 lets AMs know what resources are being considered when scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128843#comment-14128843
 ] 

Vinod Kumar Vavilapalli edited comment on YARN-2440 at 9/10/14 6:16 PM:


bq. As I mentioned before, I think most users would rather not use the 
functionality proposed by this JIRA but instead setup peer cgroups for other 
systems and set their relative cgroup shares appropriately. With this JIRA the 
CPUs could sit idle despite demand from YARN containers, while a peer cgroup 
setup allows CPU guarantees without idle CPUs if the demand is there.
[~jlowe], agree with the general philosophy. Though we are not yet there in 
practice - datanodes, region servers don't yet live in cgroups in many sites. 
Looking back at this JIRA, I see a good use for this. Having the overall YARN 
limit will help ensure that apps' containers don't thrash cpu once we start 
enabling cgroups support.

The other dimension to this is determinism w.r.t performance. Limiting to 
allocated cores overall (as well as per container later) helps orgs run 
workloads and reason about them deterministically. One of the examples is 
benchmarking apps, but deterministic execution is a desired option beyond 
benchmarks too.


was (Author: vinodkv):
bq. As I mentioned before, I think most users would rather not use the 
functionality proposed by this JIRA but instead setup peer cgroups for other 
systems and set their relative cgroup shares appropriately. With this JIRA the 
CPUs could sit idle despite demand from YARN containers, while a peer cgroup 
setup allows CPU guarantees without idle CPUs if the demand is there.
[~jlowe], agree with the general philosophy. Though we are not yet there in 
practice - datanodes, region servers don't yet live in cgroups in many sites. 
Looking back at this JIRA, I see a good use for this. Having the overall YARN 
limit will help ensure that apps' containers don't thrash cpu once we start 
enabling support.

The other dimension to this is determinism w.r.t performance. Limiting to 
allocated cores overall (as well as per container later) helps orgs run 
workloads and reason about them deterministically. One of the examples is 
benchmarking apps, but deterministic execution is a desired option beyond 
benchmarks too.

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2014-09-10 Thread Chris Trezzo (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Trezzo updated YARN-1492:
---
Attachment: shared_cache_design_v6.pdf

Attached v6 design doc to reflect the current implementation.

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128843#comment-14128843
 ] 

Vinod Kumar Vavilapalli commented on YARN-2440:
---

bq. As I mentioned before, I think most users would rather not use the 
functionality proposed by this JIRA but instead setup peer cgroups for other 
systems and set their relative cgroup shares appropriately. With this JIRA the 
CPUs could sit idle despite demand from YARN containers, while a peer cgroup 
setup allows CPU guarantees without idle CPUs if the demand is there.
[~jlowe], agree with the general philosophy. Though we are not yet there in 
practice - datanodes, region servers don't yet live in cgroups in many sites. 
Looking back at this JIRA, I see a good use for this. Having the overall YARN 
limit will help ensure that apps' containers don't thrash cpu once we start 
enabling support.

The other dimension to this is determinism w.r.t performance. Limiting to 
allocated cores overall (as well as per container later) helps orgs run 
workloads and reason about them deterministically. One of the examples is 
benchmarking apps, but deterministic execution is a desired option beyond 
benchmarks too.

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128842#comment-14128842
 ] 

Hadoop QA commented on YARN-2440:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12667797/apache-yarn-2440.6.patch
  against trunk revision b67d5ba.

{color:red}-1 patch{color}.  Trunk compilation may be broken.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4871//console

This message is automatically generated.

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores

2014-09-10 Thread Varun Vasudev (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Vasudev updated YARN-2440:

Attachment: apache-yarn-2440.6.patch

Uploaded new patch to address Vinod's comments.

{quote}
{noformat}
  <property>
    <description>Percentage of CPU that can be allocated
    for containers. This setting allows users to limit the number of
    physical cores that YARN containers use. Currently functional only
    on Linux using cgroups. The default is to use 100% of CPU.
    </description>
    <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
    <value>100</value>
  </property>
{noformat}

"the number of physical cores" part isn't really right. It actually is 75% 
across all cores, for e.g. We have this sort of "number of physical cores" 
description in multiple places, let's fix that? For instance, in 
NodeManagerHardwareUtils, yarn-default.xml etc.

{quote}

Fixed.

{quote}
Also,
NM_CONTAINERS_CPU_PERC -> NM_RESOURCE_PHYSICAL_CPU_LIMIT
Similarly rename DEFAULT_NM_CONTAINERS_CPU_PERC
{quote}

Done, I'd prefer to have percentage as part of the name. I've changed it to 
NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT and 
DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT.
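
For reference, the renamed constants look roughly like this (a sketch of the 
rename, not a verbatim copy of the patch):
{code}
// in YarnConfiguration (NM_PREFIX is "yarn.nodemanager.")
public static final String NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT =
    NM_PREFIX + "resource.percentage-physical-cpu-limit";
public static final int DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT = 100;
{code}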

> Cgroups should allow YARN containers to be limited to allocated cores
> -
>
> Key: YARN-2440
> URL: https://issues.apache.org/jira/browse/YARN-2440
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Varun Vasudev
>Assignee: Varun Vasudev
> Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, 
> apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, 
> apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, 
> screenshot-current-implementation.jpg
>
>
> The current cgroups implementation does not limit YARN containers to the 
> cores allocated in yarn-site.xml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-10 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128819#comment-14128819
 ] 

Vinod Kumar Vavilapalli commented on YARN-2517:
---

I am +1 about a client that makes async calls. The question is whether we need 
a new client class (and thus a public interface) or not.

Clearly, async calls need call-back handlers _just_ for errors. As of today, 
there are no APIs that really need to send back *results* (not error) 
asynchronously. The way you usually handle it is through one of the following
{code}
// Sync call
Result call(Input input);
// Async call - Type (1)
void asyncCall(Input input, CallBackHandler callback);
// Async call - Type (2)
Future<Result> asyncCall(Input input);
{code}
You can do type (1). Having an entire separate client side interface isn't 
mandatory.
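
To be concrete about type (1), something like the following shape would be 
enough without a whole new public client class (names below are purely 
illustrative):
{code}
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;

// hypothetical error-only callback for an async put
public interface TimelinePutErrorCallback {
  void onError(TimelineEntity entity, Throwable cause);
}

// a type (1) method on the existing client could then look like:
// void putEntitiesAsync(TimelineEntity entity, TimelinePutErrorCallback callback);
{code}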

If you guys think there is a lot more functionality coming in an async class in 
the future, can we hear about some of them here?

> Implement TimelineClientAsync
> -
>
> Key: YARN-2517
> URL: https://issues.apache.org/jira/browse/YARN-2517
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2517.1.patch
>
>
> In some scenarios, we'd like to put timeline entities in another thread no to 
> block the current one.
> It's good to have a TimelineClientAsync like AMRMClientAsync and 
> NMClientAsync. It can buffer entities, put them in a separate thread, and 
> have callback to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled

2014-09-10 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-2529:
--
Summary: Generic history service RPC interface doesn't work when service 
authorization is enabled  (was: Generic history service RPC interface doesn't 
work wen service authorization is enabled)

> Generic history service RPC interface doesn't work when service authorization 
> is enabled
> 
>
> Key: YARN-2529
> URL: https://issues.apache.org/jira/browse/YARN-2529
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: timelineserver
>Reporter: Zhijie Shen
>Assignee: Zhijie Shen
>
> Here's the problem shown in the log:
> {code}
> 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for 
> protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is 
> unauthorized for user zshen (auth:SIMPLE)
> 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: 
> readAndProcess from client 10.22.2.109 threw exception 
> [org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
> interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not 
> known.]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2529) Generic history service RPC interface doesn't work wen service authorization is enabled

2014-09-10 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-2529:
-

 Summary: Generic history service RPC interface doesn't work wen 
service authorization is enabled
 Key: YARN-2529
 URL: https://issues.apache.org/jira/browse/YARN-2529
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: timelineserver
Reporter: Zhijie Shen
Assignee: Zhijie Shen


Here's the problem shown in the log:

{code}
14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for 
protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is 
unauthorized for user zshen (auth:SIMPLE)
14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: 
readAndProcess from client 10.22.2.109 threw exception 
[org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not known.]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager

2014-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128687#comment-14128687
 ] 

Hadoop QA commented on YARN-2527:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12667768/YARN-2527.patch
  against trunk revision 3072c83.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/4870//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4870//console

This message is automatically generated.

> NPE in ApplicationACLsManager
> -
>
> Key: YARN-2527
> URL: https://issues.apache.org/jira/browse/YARN-2527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
> Attachments: YARN-2527.patch, YARN-2527.patch
>
>
> NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error.
> The relevant stacktrace snippet from the ResourceManager logs is as below
> {code}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> {code}
> This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager

2014-09-10 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated YARN-2527:
---
Attachment: YARN-2527.patch

Attaching a new patch. Added one more test case to cover the case of a partial 
set of ACLs.

> NPE in ApplicationACLsManager
> -
>
> Key: YARN-2527
> URL: https://issues.apache.org/jira/browse/YARN-2527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
> Attachments: YARN-2527.patch, YARN-2527.patch
>
>
> NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error.
> The relevant stacktrace snippet from the ResourceManager logs is as below
> {code}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> {code}
> This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager

2014-09-10 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128622#comment-14128622
 ] 

Benoy Antony commented on YARN-2527:


Thank you [~zjshen].
I am investigating how that happened and will probably open another jira with 
the root cause.
But I believe the NullPointerException in _ApplicationACLsManager_ should be 
fixed regardless of that. Based on the current logic, the admin and the 
application owner should be able to perform actions on the application 
regardless of ACLs; the NullPointerException currently prevents that.
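
The shape of the fix I have in mind is roughly the following (sketch only; the 
variable and method names are illustrative, not the exact patch):
{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.security.AdminACLsManager;

// null-safe access check: admin and owner always pass, and a missing ACL entry
// (partial set of ACLs) is treated as "not granted" instead of dereferenced
static boolean checkAccess(AdminACLsManager adminAcls, UserGroupInformation callerUGI,
    String applicationOwner, AccessControlList applicationACL /* may be null */) {
  if (adminAcls.isAdmin(callerUGI)
      || callerUGI.getShortUserName().equals(applicationOwner)) {
    return true;
  }
  return applicationACL != null && applicationACL.isUserAllowed(callerUGI);
}
{code}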

> NPE in ApplicationACLsManager
> -
>
> Key: YARN-2527
> URL: https://issues.apache.org/jira/browse/YARN-2527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.0
>Reporter: Benoy Antony
>Assignee: Benoy Antony
> Attachments: YARN-2527.patch
>
>
> NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error.
> The relevant stacktrace snippet from the ResourceManager logs is as below
> {code}
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66)
> at 
> org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76)
> at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
> {code}
> This issue was reported by [~miguenther].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2014-09-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128619#comment-14128619
 ] 

Karthik Kambatla commented on YARN-1458:


Thanks Zhihai and [~qingwu.fu] for working on this, and Sandy for the reviews. 

Just committed this to trunk and branch-2. 

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
>  Labels: patch
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, 
> YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, 
> yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocks when 
> clients submit lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}
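
A simplified, hedged reconstruction of why a zero weight can livelock the update thread (the method below only mimics the shape of ComputeFairShares; names and arithmetic are illustrative): with all weights at zero, the weight-to-resource ratio maps to zero resources no matter how large the ratio grows, so the doubling search for an upper bound can never reach the cluster total while the FairScheduler lock is held.

{code}
// Illustrative reconstruction only -- the real logic lives in ComputeFairShares.
public final class ZeroWeightLivelockDemo {

  /** Resources consumed by all schedulables at a given weight-to-resource ratio. */
  private static long resourceUsedWithRatio(double ratio, double[] weights) {
    long total = 0;
    for (double weight : weights) {
      total += (long) (weight * ratio);   // a zero weight always contributes 0
    }
    return total;
  }

  public static void main(String[] args) {
    double[] weights = {0.0, 0.0};   // every runnable schedulable has weight 0
    long totalResource = 8192;
    double rMax = 1.0;
    int rounds = 0;
    // Without the cap below, this doubling search never exceeds totalResource
    // and spins forever -- which is the livelock observed in the jstack above.
    while (resourceUsedWithRatio(rMax, weights) < totalResource && rounds < 100) {
      rMax *= 2;
      rounds++;
    }
    System.out.println(rounds == 100
        ? "Search never converged: zero weights would livelock the update thread"
        : "Converged after " + rounds + " doublings");
  }
}
{code}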



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock

2014-09-10 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1458:
---
Summary: FairScheduler: Zero weight can lead to livelock  (was: In Fair 
Scheduler, size based weight can cause update thread to hold lock indefinitely)

> FairScheduler: Zero weight can lead to livelock
> ---
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
>  Labels: patch
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, 
> YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, 
> yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocks when 
> clients submit lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128590#comment-14128590
 ] 

Karthik Kambatla commented on YARN-1458:


+1. Committing version 8. 

> In Fair Scheduler, size based weight can cause update thread to hold lock 
> indefinitely
> --
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
>  Labels: patch
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, 
> YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, 
> yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocks when 
> clients submit lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely

2014-09-10 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1458:
---
Target Version/s: 2.6.0  (was: 2.2.0)
   Fix Version/s: (was: 2.2.1)

> In Fair Scheduler, size based weight can cause update thread to hold lock 
> indefinitely
> --
>
> Key: YARN-1458
> URL: https://issues.apache.org/jira/browse/YARN-1458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.2.0
> Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>Reporter: qingwu.fu
>Assignee: zhihai xu
>  Labels: patch
> Attachments: YARN-1458.001.patch, YARN-1458.002.patch, 
> YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, 
> YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, 
> YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, 
> yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocks when 
> clients submit lots of jobs; it is not easy to reproduce. We ran the test 
> cluster for days to reproduce it. The output of the jstack command on the 
> ResourceManager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 
> waiting for monitor entry [0x43aa9000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
> - waiting to lock <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
> at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 
> runnable [0x433a2000]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
> - locked <0x00070026b6e0> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
> at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-10 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128567#comment-14128567
 ] 

Tsuyoshi OZAWA commented on YARN-2517:
--

Thanks for your review, Zhijie. I think batch optimization and persisting 
entities can be done in the sync client, since the async client uses the sync 
client.

Submitting a patch again for merging. Please let me know if you have additional 
review comments.

> Implement TimelineClientAsync
> -
>
> Key: YARN-2517
> URL: https://issues.apache.org/jira/browse/YARN-2517
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2517.1.patch
>
>
> In some scenarios, we'd like to put timeline entities in another thread so as 
> not to block the current one.
> It's good to have a TimelineClientAsync like AMRMClientAsync and 
> NMClientAsync. It can buffer entities, put them from a separate thread, and 
> have callbacks to handle the responses.
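
As a rough idea of the shape such a client could take, here is a minimal sketch assuming a queue-draining dispatcher thread; the class and method names are illustrative, not the attached patch:

{code}
// Minimal, illustrative sketch: entities are buffered in a queue and a single
// dispatcher thread drains them through the existing synchronous TimelineClient,
// reporting outcomes through a callback.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class SimpleTimelineClientAsync {

  public interface Callback {
    void onPutResponse(TimelinePutResponse response);
    void onError(Throwable t);
  }

  private final BlockingQueue<TimelineEntity> queue =
      new LinkedBlockingQueue<TimelineEntity>();
  private final Thread dispatcher;

  public SimpleTimelineClientAsync(final TimelineClient client,
      final Callback callback) {
    this.dispatcher = new Thread(new Runnable() {
      @Override
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          try {
            TimelineEntity entity = queue.take();        // wait for buffered work
            callback.onPutResponse(client.putEntities(entity));
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();          // stop() was called
          } catch (Exception e) {
            callback.onError(e);                         // surface put failures
          }
        }
      }
    }, "TimelineEntityDispatcher");
    this.dispatcher.setDaemon(true);
  }

  public void start() { dispatcher.start(); }

  public void stop() { dispatcher.interrupt(); }

  /** Returns immediately; the entity is put from the dispatcher thread. */
  public void putEntityAsync(TimelineEntity entity) { queue.add(entity); }
}
{code}

A caller would wrap an already-started synchronous TimelineClient and hand in a callback; the put itself then happens off the caller's thread.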



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128494#comment-14128494
 ] 

Hudson commented on YARN-1471:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1867 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1867/])
Add missing YARN-1471 to the CHANGES.txt (aw: rev 
9b8104575444ed2de9b44fe902f86f7395f249ed)
* hadoop-yarn-project/CHANGES.txt


> The SLS simulator is not running the preemption policy for CapacityScheduler
> 
>
> Key: YARN-1471
> URL: https://issues.apache.org/jira/browse/YARN-1471
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, 
> YARN-1471.patch, YARN-1471.patch
>
>
> The simulator does not run the ProportionalCapacityPreemptionPolicy monitor.  
> This is because the policy needs to interact with a CapacityScheduler, and 
> the wrapping done by the simulator breaks this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128491#comment-14128491
 ] 

Hudson commented on YARN-2526:
--

SUCCESS: Integrated in Hadoop-Hdfs-trunk #1867 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1867/])
YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. 
(Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9)
* hadoop-yarn-project/CHANGES.txt
* 
hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java


> SLS can deadlock when all the threads are taken by AMSimulators
> ---
>
> Key: YARN-2526
> URL: https://issues.apache.org/jira/browse/YARN-2526
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 2.5.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Fix For: 2.6.0
>
> Attachments: YARN-2526-1.patch
>
>
> The simulation may enter a deadlock if all application simulators hold all 
> threads provided by the thread pool and all wait for AM container 
> allocation. In that case, all AM simulators wait for NM simulators to 
> heartbeat and allocate resources, while all NM simulators wait for AM 
> simulators to release some threads. The simulator is deadlocked.
> To solve this deadlock, we need to remove the while() loop in MRAMSimulator.
> {code}
> // waiting until the AM container is allocated
> while (true) {
>   if (response != null && ! response.getAllocatedContainers().isEmpty()) {
> // get AM container
> .
> break;
>   }
>   // this sleep time is different from HeartBeat
>   Thread.sleep(1000);
>   // send out empty request
>   sendContainerRequest();
>   response = responseQueue.take();
> }
> {code}
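
A hedged sketch of the non-blocking alternative the description points to (simplified; the surrounding class, field, and method names mirror the quoted snippet but are assumptions, not the committed MRAMSimulator change): each scheduled round checks once whether the AM container has arrived and otherwise returns, so no thread-pool thread is ever parked in a busy-wait and the NM simulators keep getting scheduled.

{code}
// Illustrative fragment only; assumes responseQueue is a BlockingQueue of
// AllocateResponse and sendContainerRequest() exists, as in the quoted snippet.
private boolean amContainerAllocated = false;

protected void processResponseQueue() throws Exception {
  AllocateResponse response = responseQueue.poll();   // non-blocking poll
  if (!amContainerAllocated) {
    if (response != null && !response.getAllocatedContainers().isEmpty()) {
      // The AM container arrived; record it and handle task containers next round.
      amContainerAllocated = true;
    } else {
      // Not allocated yet: send an empty request and return, yielding the
      // pool thread instead of looping here.
      sendContainerRequest();
    }
    return;   // retried on the next scheduled heartbeat round
  }
  // ... normal handling of task containers once the AM container is running ...
}
{code}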



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128357#comment-14128357
 ] 

Hudson commented on YARN-1471:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #676 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/676/])
Add missing YARN-1471 to the CHANGES.txt (aw: rev 
9b8104575444ed2de9b44fe902f86f7395f249ed)
* hadoop-yarn-project/CHANGES.txt


> The SLS simulator is not running the preemption policy for CapacityScheduler
> 
>
> Key: YARN-1471
> URL: https://issues.apache.org/jira/browse/YARN-1471
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Carlo Curino
>Assignee: Carlo Curino
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, 
> YARN-1471.patch, YARN-1471.patch
>
>
> The simulator does not run the ProportionalCapacityPreemptionPolicy monitor.  
> This is because the policy needs to interact with a CapacityScheduler, and 
> the wrapping done by the simulator breaks this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators

2014-09-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128354#comment-14128354
 ] 

Hudson commented on YARN-2526:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #676 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/676/])
YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. 
(Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9)
* 
hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java
* hadoop-yarn-project/CHANGES.txt


> SLS can deadlock when all the threads are taken by AMSimulators
> ---
>
> Key: YARN-2526
> URL: https://issues.apache.org/jira/browse/YARN-2526
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler-load-simulator
>Affects Versions: 2.5.1
>Reporter: Wei Yan
>Assignee: Wei Yan
>Priority: Critical
> Fix For: 2.6.0
>
> Attachments: YARN-2526-1.patch
>
>
> The simulation may enter a deadlock if all application simulators hold all 
> threads provided by the thread pool and all wait for AM container 
> allocation. In that case, all AM simulators wait for NM simulators to 
> heartbeat and allocate resources, while all NM simulators wait for AM 
> simulators to release some threads. The simulator is deadlocked.
> To solve this deadlock, we need to remove the while() loop in MRAMSimulator.
> {code}
> // waiting until the AM container is allocated
> while (true) {
>   if (response != null && ! response.getAllocatedContainers().isEmpty()) {
> // get AM container
> .
> break;
>   }
>   // this sleep time is different from HeartBeat
>   Thread.sleep(1000);
>   // send out empty request
>   sendContainerRequest();
>   response = responseQueue.take();
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data

2014-09-10 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128140#comment-14128140
 ] 

Zhijie Shen edited comment on YARN-1530 at 9/10/14 7:36 AM:


[~bcwalrus], thanks for your interest in the timeline server and for sharing 
your idea. Here are some of my opinions and our previous rationales.

bq. Let's have reliability before speed. I think one of the requirement of ATS 
is: The channel for writing events should be reliable.

I agree reliability is an important requirement of the timeline server, but 
other requirements such as scalability and efficiency should be orthogonal to 
it, so there is no fixed order in which they should come. We can pursue both 
enhancements, can't we?

bq. I'm using reliable here in a strong sense, not the TCP-best-effort style 
reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and 
robust.)

IMHO, it may be unfair to compare the reliability of TCP with that of HDFS or 
Kafka, because they sit on different layers of the communication stack. HDFS 
and Kafka are also built on top of TCP for communication, right? In my previous 
[comments|https://issues.apache.org/jira/browse/YARN-1530?focusedCommentId=14125238&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14125238],
 I mentioned that we need to clearly define *reliability*, and I'd like to 
highlight it here again:

1. The server is reliable: once timeline entities are passed to the timeline 
server, it should prevent them from being lost. After YARN-2032, we're going to 
have the HBase timeline store to ensure this.

2. The client is reliable: once timeline entities are handed over to the 
timeline client, and until the client has successfully put them into the 
timeline server, it should prevent them from being lost on the client side. We 
may use some techniques to cache the entities locally. I opened YARN-2521 to 
track the discussion in this direction.
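
(For illustration, a minimal sketch of this local-caching idea, assuming a file-per-entity journal; all names here are hypothetical, and YARN-2521 may take a different approach.)

{code}
// Rough sketch of "cache locally until acknowledged": each entity is journaled
// to local disk before the put and removed only after the server accepts it,
// so a client crash or a failed put does not lose the entity.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;

public class JournalingTimelinePublisher {
  private final TimelineClient client;
  private final File journalDir;

  public JournalingTimelinePublisher(TimelineClient client, File journalDir) {
    this.client = client;
    this.journalDir = journalDir;
  }

  public void publish(TimelineEntity entity) throws Exception {
    File journal = new File(journalDir, entity.getEntityId() + ".journal");
    writeEntity(journal, entity);     // persist before attempting the put
    client.putEntities(entity);       // may fail; the journal entry survives
    if (!journal.delete()) {          // acknowledged: drop the journal entry
      throw new IOException("Could not remove journal entry " + journal);
    }
  }

  // Serialization is deliberately left open here; any durable local format
  // would do. On restart, un-deleted journal files would be replayed.
  private void writeEntity(File target, TimelineEntity entity) throws IOException {
    // placeholder for the chosen serialization
  }
}
{code}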

Between the client and the server, TCP is a trustworthy transport: if the 
client gets an ACK from the server, we can be confident the server has already 
received the entities.

bq. A normal RPC connection is not. I don't want the ATS to be able to slow 
down my writes, and therefore, my applications, at all.

I'm not sure there's a direct relationship between reliability and non-blocking 
writing. For example, submitting an app via YarnClient to an HA RM is reliable, 
but the user is still likely to be blocked until the app submission is 
acknowledged. Whether writing events is blocking or non-blocking depends on how 
the user uses the client. In YARN-2033, I made the RM put the entities on a 
separate thread to avoid blocking the dispatcher that manages the YARN app 
lifecycle. I can also see that non-blocking writing is a useful optimization, 
which is why I opened YARN-2517 to implement TimelineClientAsync for general 
usage.

bq. Yes, you could make a distributed reliable scalable "ATS service" to accept 
writing events. But that seems a lot of work, while we can leverage existing 
technologies.

AFAIK, the timeline server is stateless, so it should not be difficult to use 
ZooKeeper to manage a number of instances writing to the same HBase cluster. We 
may need to pay attention to load balancing and concurrent writing, but I'm not 
sure it will really be a lot of work. Please let me know if I've neglected some 
important pieces. Within the scope of YARN, we have already accumulated similar 
experience from making the RM HA, and it turned out to be a practical solution. 
Again, this is about scalability, which is orthogonal to reliability. Even if 
we pass the timeline entities to the timeline server via Kafka/HDFS, the single 
server is still going to be the bottleneck for processing a large volume of 
requests, no matter how big the Kafka/HDFS cluster is.

bq. If the channel itself is pluggable, then we have lots of options. Kafka is 
a very good choice, for sites that already deploy Kafka and know how to operate 
it. Using HDFS as a channel is also a good default implementation, for people 
already know how to scale and manage HDFS.

I'm not objecting to having different entity publishing channels, but my 
concern is that the effort to maintain the timeline client will multiply with 
the number of channels. As the timeline server is going to be a long-term 
project, we cannot neglect the additional workload of evolving every channel. 
This is the same concern behind wanting to remove the FS-based history store 
(see YARN-2320). Maybe cooperatively improving the current channel is a more 
cost-efficient choice; it's good to think more before opening a new channel.

In addition, the default solution should be simple and self-contained. A heavy 
solution with complex configuration and large dependencies is likely to steepen 
the learning curve, keep new adopters away, and complicate fast, small-scale 
deployments.



[jira] [Commented] (YARN-2517) Implement TimelineClientAsync

2014-09-10 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128156#comment-14128156
 ] 

Zhijie Shen commented on YARN-2517:
---

I scanned through the patch; the approach is quite close to 
AMRMClientAsync/NMClientAsync. It looks fine to me in general. Later on, we can 
improve the client step by step. For example, according to the discussion on 
the umbrella JIRA, we could persist the queued entities for reliability, and we 
may want to allow multiple threads to put entities.

> Implement TimelineClientAsync
> -
>
> Key: YARN-2517
> URL: https://issues.apache.org/jira/browse/YARN-2517
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhijie Shen
>Assignee: Tsuyoshi OZAWA
> Attachments: YARN-2517.1.patch
>
>
> In some scenarios, we'd like to put timeline entities in another thread so as 
> not to block the current one.
> It's good to have a TimelineClientAsync like AMRMClientAsync and 
> NMClientAsync. It can buffer entities, put them from a separate thread, and 
> have callbacks to handle the responses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)