[jira] [Commented] (YARN-4106) NodeLabels for NM in distributed mode is not updated even after clusterNodelabel addition in RM

2015-09-06 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733251#comment-14733251
 ] 

Bibin A Chundatt commented on YARN-4106:


[~leftnoteasy] Thanks for looking into the issue, and [~Naganarasimha] for 
clarifying the first point.
I will try to update the test case to cover the trigger check too. The test cases 
are working perfectly fine for me locally in Eclipse; I will recheck and update.

> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM 
> 
>
> Key: YARN-4106
> URL: https://issues.apache.org/jira/browse/YARN-4106
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: 0001-YARN-4106.patch, 0002-YARN-4106.patch, 
> 0003-YARN-4106.patch, 0004-YARN-4106.patch
>
>
> NodeLabels for NM in distributed mode is not updated even after 
> clusterNodelabel addition in RM
> Steps to reproduce
> ===
> # Configure node labels in distributed mode:
> yarn.node-labels.configuration-type=distributed
> provider = config
> yarn.nodemanager.node-labels.provider.fetch-interval-ms=12ms
> # Start the RM and the NM
> # Once NM registration is done, add node labels in the RM
> The node labels are not getting updated on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-06 Thread Kuhu Shukla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuhu Shukla updated YARN-2410:
--
Attachment: YARN-2410-v6.patch

Thank you so much [~jlowe] for the detailed feedback. I have made all but 2 of 
the changes, and I would like your further comments on those.

{quote}
Actually I'm not really sure why SendMapOutputParams exists separate from 
ReduceContext. There should be a one-to-one relationship there. 
{quote}

I totally agree. The only reason was findbugs, which does not allow more than 7 
parameters in a function call (or in the constructor that would populate these 
values). If this is not an issue, I can move them into a single class. For now 
I have made SendMapOutputParams an inner class of ReduceContext.

{quote}
Why was reduceContext added as a TestShuffleHandler instance variable? It's 
specific to the new test.
{quote}

The reduceContext variable holds the value set by the setAttachment() method and 
is used by the getAttachment() answer. If I declare it in the test method, it 
needs to be final, which cannot be done because it is assigned by the setter. I 
am looking for another way. Let me know what you think.
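
One possible way to keep it local to the test method is to capture it in a final 
holder. A rough sketch only, assuming the mocked object is a Netty 3 {{Channel}} 
with {{setAttachment()}}/{{getAttachment()}} (types and names taken from the 
discussion above, not from the actual patch; the test class/method names are 
made up for illustration):

{code}
// Rough sketch only: a final AtomicReference lets the anonymous Answer classes
// read and write the attachment without an instance field on TestShuffleHandler.
import java.util.concurrent.atomic.AtomicReference;

import org.jboss.netty.channel.Channel;
import org.junit.Test;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.stubbing.Answer;

import static org.mockito.Mockito.*;

public class AttachmentMockSketch {
  @Test
  public void testAttachmentRoundTrip() {
    final AtomicReference<Object> attachment = new AtomicReference<Object>();
    Channel ch = mock(Channel.class);

    // setAttachment(x) writes x into the holder
    doAnswer(new Answer<Void>() {
      @Override
      public Void answer(InvocationOnMock invocation) {
        attachment.set(invocation.getArguments()[0]);
        return null;
      }
    }).when(ch).setAttachment(any());

    // getAttachment() reads whatever was last stored
    when(ch.getAttachment()).thenAnswer(new Answer<Object>() {
      @Override
      public Object answer(InvocationOnMock invocation) {
        return attachment.get();
      }
    });
  }
}
{code}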

All other items have been done. 

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> A job with 6K reduces, slow start set to 0.95, and about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of the file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account

2015-09-06 Thread Xianyin Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733235#comment-14733235
 ] 

Xianyin Xin commented on YARN-4120:
---

Thanks [~kasha]. How about distinguishing getResourceUsage() (the current gross 
resource usage) from getNetResourceUsage() (the current gross resource usage 
minus preempted resources)? The latter would be used for preemption-related 
calculations and the former for everything else.
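
A minimal sketch of what that split could look like (illustrative only, not an 
actual patch; it just rearranges the existing {{getCurrentConsumption()}} and 
{{getPreemptedResources()}} calls):

{code}
// Illustrative sketch of the proposed split in FSAppAttempt.
public Resource getResourceUsage() {
  // gross usage: what the app is consuming right now, used for shares/ordering
  return getCurrentConsumption();
}

public Resource getNetResourceUsage() {
  // gross usage minus containers already marked for preemption,
  // consulted only by the preemption-related calculations
  return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
}
{code}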

> FSAppAttempt.getResourceUsage() should not take preemptedResource into account
> --
>
> Key: YARN-4120
> URL: https://issues.apache.org/jira/browse/YARN-4120
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Xianyin Xin
>
> When computing resource usage for Schedulables, the following code is involved,
> {{FSAppAttempt.getResourceUsage}}:
> {code}
> public Resource getResourceUsage() {
>   return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
> }
> {code}
> and this value is aggregated into FSLeafQueues and FSParentQueues. In my 
> opinion, taking {{preemptedResource}} into account here is not reasonable, 
> for two main reasons:
> # It is something in the future, i.e., even though these resources are marked 
> as preempted, they are currently used by the app, and they will be subtracted 
> from {{currentConsumption}} once the preemption is finished. It's not 
> reasonable to account for them ahead of time.
> # There's another problem here; consider the following case:
> {code}
>           root
>          /    \
>     queue1    queue2
>      /   \
> queue1.3  queue1.4
> {code}
> Suppose queue1.3 needs resources and it can preempt resources from queue1.4; 
> the preemption happens entirely inside queue1. But when computing the resource 
> usage of queue1, {{queue1.resourceUsage = its_current_resource_usage - 
> preemption}} according to the current code, which is unfair to queue2 when 
> allocating resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2410) Nodemanager ShuffleHandler can possible exhaust file descriptors

2015-09-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14733149#comment-14733149
 ] 

Hadoop QA commented on YARN-2410:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  15m 44s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 44s | There were no new javac warning 
messages. |
| {color:green}+1{color} | javadoc |   9m 51s | There were no new javadoc 
warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   0m 21s | The applied patch generated  
14 new checkstyle issues (total was 60, now 74). |
| {color:red}-1{color} | whitespace |   0m  1s | The patch has 1  line(s) that 
end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 28s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with 
eclipse:eclipse. |
| {color:red}-1{color} | findbugs |   0m 47s | The patch appears to introduce 1 
new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests |   0m 21s | Tests passed in 
hadoop-mapreduce-client-shuffle. |
| | |  37m 15s | |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-mapreduce-client-shuffle |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754414/YARN-2410-v6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 9b68577 |
| checkstyle |  
https://builds.apache.org/job/PreCommit-YARN-Build/9019/artifact/patchprocess/diffcheckstylehadoop-mapreduce-client-shuffle.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/9019/artifact/patchprocess/whitespace.txt
 |
| Findbugs warnings | 
https://builds.apache.org/job/PreCommit-YARN-Build/9019/artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-shuffle.html
 |
| hadoop-mapreduce-client-shuffle test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9019/artifact/patchprocess/testrun_hadoop-mapreduce-client-shuffle.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9019/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9019/console |


This message was automatically generated.

> Nodemanager ShuffleHandler can possible exhaust file descriptors
> 
>
> Key: YARN-2410
> URL: https://issues.apache.org/jira/browse/YARN-2410
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.5.0
>Reporter: Nathan Roberts
>Assignee: Kuhu Shukla
> Attachments: YARN-2410-v1.patch, YARN-2410-v2.patch, 
> YARN-2410-v3.patch, YARN-2410-v4.patch, YARN-2410-v5.patch, YARN-2410-v6.patch
>
>
> The async nature of the ShuffleHandler can cause it to open a huge number of
> file descriptors; when it runs out, it crashes.
> Scenario:
> A job with 6K reduces, slow start set to 0.95, and about 40 map outputs per node.
> Let's say all 6K reduces hit a node at about the same time asking for their
> outputs. Each reducer will ask for all 40 map outputs over a single socket in a
> single request (not necessarily all 40 at once, but with coalescing it is
> likely to be a large number).
> sendMapOutput() will open the file for random reading and then perform an
> async transfer of the particular portion of the file. This will theoretically
> happen 6000*40=240,000 times, which will run the NM out of file descriptors and
> cause it to crash.
> The algorithm should be refactored a little to not open the fds until they're
> actually needed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account

2015-09-06 Thread Xianyin Xin (JIRA)
Xianyin Xin created YARN-4120:
-

 Summary: FSAppAttempt.getResourceUsage() should not take 
preemptedResource into account
 Key: YARN-4120
 URL: https://issues.apache.org/jira/browse/YARN-4120
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Reporter: Xianyin Xin


When computing resource usage for Schedulables, the following code is involved,
{{FSAppAttempt.getResourceUsage}}:
{code}
public Resource getResourceUsage() {
  return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
}
{code}
and this value is aggregated into FSLeafQueues and FSParentQueues. In my opinion, 
taking {{preemptedResource}} into account here is not reasonable, for two main 
reasons:
# It is something in the future, i.e., even though these resources are marked as 
preempted, they are currently used by the app, and they will be subtracted from 
{{currentConsumption}} once the preemption is finished. It's not reasonable to 
account for them ahead of time.
# There's another problem here; consider the following case:
{code}
          root
         /    \
    queue1    queue2
     /   \
queue1.3  queue1.4
{code}
Suppose queue1.3 needs resources and it can preempt resources from queue1.4; the 
preemption happens entirely inside queue1. But when computing the resource usage 
of queue1, {{queue1.resourceUsage = its_current_resource_usage - preemption}} 
according to the current code, which is unfair to queue2 when allocating 
resources.
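
To make the second point concrete, here is a small worked example (the numbers 
are illustrative only):

{code}
// Illustrative numbers only.
// queue1 currently consumes 40, of which 10 is marked for preemption
// (queue1.3 preempting from queue1.4). queue2 also currently consumes 40.
//
// With the current code:
//   queue1.resourceUsage = 40 - 10 = 30
//   queue2.resourceUsage = 40
//
// The scheduler now sees queue1 as less loaded and favors it for the next
// allocation, even though the preemption is entirely internal to queue1 and
// both queues are in fact using 40.
{code}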



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4119) Expose the NM bind address as an env, so that AM can make use of it for exposing tracking URL

2015-09-06 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732313#comment-14732313
 ] 

Varun Saxena commented on YARN-4119:


This is a duplicate of MAPREDUCE-6402.

>  Expose the NM bind address as an env, so that AM can make use of it for 
> exposing tracking URL
> --
>
> Key: YARN-4119
> URL: https://issues.apache.org/jira/browse/YARN-4119
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>
> As described in MAPREDUCE-5938, many security scanning tools advise against 
> binding on all network addresses, so it would be good to bind only on the 
> desired address. As AMs can run on any of the nodes, it would be better for 
> the NM to share its bind address with the container as part of the 
> environment variables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4121) Typos in capacity scheduler documentation.

2015-09-06 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated YARN-4121:
-
Attachment: YARN-4121.00.patch

> Typos in capacity scheduler documentation.
> --
>
> Key: YARN-4121
> URL: https://issues.apache.org/jira/browse/YARN-4121
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Kai Sasaki
>Priority: Trivial
> Attachments: YARN-4121.00.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.

2015-09-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732396#comment-14732396
 ] 

Karthik Kambatla commented on YARN-1680:


MAPREDUCE-6302 would resolve deadlocks, but only reactively. I believe we should 
use it only as a safeguard against future headroom issues.

We should fix this regardless. I went through the discussions here. I vote for 
the scheduler accounting for the blacklisted nodes in the headroom calculation. 
If the app is to subtract these resources from the headroom, it might as well 
maintain the blacklist itself and relieve the scheduler of those details as 
well. Also, as Jian mentioned, it is better to do this in one place (the 
scheduler) than to have each app handle it.

[~vinodkv] - do you agree? 

Accordingly, we would like to make progress on YARN-3446. 
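
In other words, something along these lines in the scheduler's headroom 
computation. A rough sketch only: {{queueLimit}}/{{queueUsed}} are placeholders 
for whatever the scheduler already uses, and {{blacklistedNodesFor()}} is a 
hypothetical helper, not an existing API:

{code}
// Rough sketch only: remove the free capacity of the app's blacklisted nodes
// from the headroom the scheduler reports to the AM, and clamp at zero.
Resource headroom = Resources.subtract(queueLimit, queueUsed);
for (SchedulerNode node : blacklistedNodesFor(application)) {  // hypothetical helper
  // this app cannot be scheduled on these nodes, so their free space is not headroom
  headroom = Resources.subtract(headroom, node.getAvailableResource());
}
headroom = Resources.componentwiseMax(headroom, Resources.none()); // never negative
{code}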

> availableResources sent to applicationMaster in heartbeat should exclude 
> blacklistedNodes free memory.
> --
>
> Key: YARN-1680
> URL: https://issues.apache.org/jira/browse/YARN-1680
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Affects Versions: 2.2.0, 2.3.0
> Environment: SuSE 11 SP2 + Hadoop-2.3 
>Reporter: Rohith Sharma K S
> Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, 
> YARN-1680-v2.patch, YARN-1680.patch
>
>
> There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster 
> slow start is set to 1.
> A running job's reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) 
> becomes unstable (3 maps got killed), so the MRAppMaster blacklists the unstable 
> NodeManager (NM-4). All reducer tasks are now running in the cluster.
> The MRAppMaster does not preempt the reducers because, for the reducer preemption 
> calculation, the headroom includes the blacklisted nodes' memory. This makes jobs 
> hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource it returns considers the whole 
> cluster's free memory). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2170) Fix components' version information in the web page 'About the Cluster'

2015-09-06 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732375#comment-14732375
 ] 

Jun Gong commented on YARN-2170:


Thanks [~zxu] for the review.

> Fix components' version information in the web page 'About the Cluster'
> ---
>
> Key: YARN-2170
> URL: https://issues.apache.org/jira/browse/YARN-2170
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Minor
> Attachments: YARN-2170.patch
>
>
> On the 'About the Cluster' web page, the build version of YARN components (e.g. 
> the ResourceManager) is currently shown as the Hadoop version. It is caused by 
> mistakenly calling getVersion() instead of _getVersion() in VersionInfo.java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4121) Typos in capacity scheduler documentation.

2015-09-06 Thread Kai Sasaki (JIRA)
Kai Sasaki created YARN-4121:


 Summary: Typos in capacity scheduler documentation.
 Key: YARN-4121
 URL: https://issues.apache.org/jira/browse/YARN-4121
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Kai Sasaki
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-4122) Add support for GPU as a resource

2015-09-06 Thread Jun Gong (JIRA)
Jun Gong created YARN-4122:
--

 Summary: Add support for GPU as a resource
 Key: YARN-4122
 URL: https://issues.apache.org/jira/browse/YARN-4122
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jun Gong
Assignee: Jun Gong


Use [cgroups 
devices|https://www.kernel.org/doc/Documentation/cgroups/devices.txt] to 
isolate GPUs for containers. For Docker containers, we could use 'docker run 
--device=...'.

Reference: [SLURM Resources isolation through 
cgroups|http://slurm.schedmd.com/slurm_ug_2011/SLURM_UserGroup2011_cgroups.pdf].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4120) FSAppAttempt.getResourceUsage() should not take preemptedResource into account

2015-09-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732394#comment-14732394
 ] 

Karthik Kambatla commented on YARN-4120:


Good catch, [~xinxianyin]. 

I believe the reason we subtract preempted resources is so that we don't preempt 
more resources from the same queue. We might have to track that information 
separately. [~asuresh], [~ashwinshankar77] - thoughts? 
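
A minimal illustration of that idea (names are hypothetical, not from any 
patch): keep the gross usage untouched and expose the already-marked amount 
through a separate getter that only the preemption pass consults.

{code}
// Hypothetical sketch: gross usage stays as-is; the preemption logic subtracts
// the amount it has already marked for this Schedulable before deciding
// whether to target it again.
Resource grossUsage = schedulable.getResourceUsage();
Resource markedForPreemption = schedulable.getResourcesMarkedForPreemption(); // hypothetical getter
Resource usageSeenByPreemption = Resources.subtract(grossUsage, markedForPreemption);
{code}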

> FSAppAttempt.getResourceUsage() should not take preemptedResource into account
> --
>
> Key: YARN-4120
> URL: https://issues.apache.org/jira/browse/YARN-4120
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Xianyin Xin
>
> When computing resource usage for Schedulables, the following code is involved,
> {{FSAppAttempt.getResourceUsage}}:
> {code}
> public Resource getResourceUsage() {
>   return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
> }
> {code}
> and this value is aggregated into FSLeafQueues and FSParentQueues. In my 
> opinion, taking {{preemptedResource}} into account here is not reasonable, 
> for two main reasons:
> # It is something in the future, i.e., even though these resources are marked 
> as preempted, they are currently used by the app, and they will be subtracted 
> from {{currentConsumption}} once the preemption is finished. It's not 
> reasonable to account for them ahead of time.
> # There's another problem here; consider the following case:
> {code}
>           root
>          /    \
>     queue1    queue2
>      /   \
> queue1.3  queue1.4
> {code}
> Suppose queue1.3 needs resources and it can preempt resources from queue1.4; 
> the preemption happens entirely inside queue1. But when computing the resource 
> usage of queue1, {{queue1.resourceUsage = its_current_resource_usage - 
> preemption}} according to the current code, which is unfair to queue2 when 
> allocating resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4121) Typos in capacity scheduler documentation.

2015-09-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732397#comment-14732397
 ] 

Hadoop QA commented on YARN-4121:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |   2m 57s | Pre-patch trunk compilation is 
healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | release audit |   0m 19s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | site |   2m 59s | Site still builds. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| | |   6m 17s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754395/YARN-4121.00.patch |
| Optional Tests | site |
| git revision | trunk / 9b68577 |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9018/console |


This message was automatically generated.

> Typos in capacity scheduler documentation.
> --
>
> Key: YARN-4121
> URL: https://issues.apache.org/jira/browse/YARN-4121
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: Kai Sasaki
>Priority: Trivial
> Attachments: YARN-4121.00.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3446) FairScheduler HeadRoom calculation should exclude nodes in the blacklist.

2015-09-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732398#comment-14732398
 ] 

Karthik Kambatla commented on YARN-3446:


We are discussing the approach on YARN-1680. Let us finalize it there quickly 
and then make progress here. 

> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> -
>
> Key: YARN-3446
> URL: https://issues.apache.org/jira/browse/YARN-3446
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: YARN-3446.000.patch, YARN-3446.001.patch
>
>
> FairScheduler HeadRoom calculation should exclude nodes in the blacklist.
> The MRAppMaster does not preempt the reducers because, for the reducer 
> preemption calculation, the headroom includes blacklisted nodes. This makes 
> jobs hang forever (the ResourceManager does not assign any new containers on 
> blacklisted nodes, but the availableResource the AM gets from the RM includes 
> the blacklisted nodes' available resources).
> This issue is similar to YARN-1680, which is for the Capacity Scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3433) Jersey tests failing with Port in Use -again

2015-09-06 Thread Brahma Reddy Battula (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732329#comment-14732329
 ] 

Brahma Reddy Battula commented on YARN-3433:


Attached the patch. Kindly review.

> Jersey tests failing with Port in Use -again
> 
>
> Key: YARN-3433
> URL: https://issues.apache.org/jira/browse/YARN-3433
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3433.patch
>
>
> ASF Jenkins Jersey tests are failing with port-in-use exceptions.
> The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't 
> scan for a spare port, so it is too brittle on a busy server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3433) Jersey tests failing with Port in Use -again

2015-09-06 Thread Brahma Reddy Battula (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brahma Reddy Battula updated YARN-3433:
---
Attachment: YARN-3433.patch

> Jersey tests failing with Port in Use -again
> 
>
> Key: YARN-3433
> URL: https://issues.apache.org/jira/browse/YARN-3433
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3433.patch
>
>
> ASF Jenkins Jersey tests are failing with port-in-use exceptions.
> The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't 
> scan for a spare port, so it is too brittle on a busy server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3433) Jersey tests failing with Port in Use -again

2015-09-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732338#comment-14732338
 ] 

Hadoop QA commented on YARN-3433:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | pre-patch |   5m 46s | Findbugs (version ) appears to 
be broken on trunk. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any 
@author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to 
include 1 new or modified test files. |
| {color:green}+1{color} | javac |   7m 41s | There were no new javac warning 
messages. |
| {color:green}+1{color} | release audit |   0m 20s | The applied patch does 
not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 28s | There were no new checkstyle 
issues. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that 
end in whitespace. |
| {color:green}+1{color} | install |   1m 26s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with 
eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m 32s | The patch does not introduce 
any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests |   1m 57s | Tests passed in 
hadoop-yarn-common. |
| | |  19m 45s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12754383/YARN-3433.patch |
| Optional Tests | javac unit findbugs checkstyle |
| git revision | trunk / 9b68577 |
| hadoop-yarn-common test log | 
https://builds.apache.org/job/PreCommit-YARN-Build/9017/artifact/patchprocess/testrun_hadoop-yarn-common.txt
 |
| Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9017/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9017/console |


This message was automatically generated.

> Jersey tests failing with Port in Use -again
> 
>
> Key: YARN-3433
> URL: https://issues.apache.org/jira/browse/YARN-3433
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.0.0
> Environment: ASF Jenkins
>Reporter: Steve Loughran
>Assignee: Brahma Reddy Battula
> Attachments: YARN-3433.patch
>
>
> ASF Jenkins Jersey tests are failing with port-in-use exceptions.
> The YARN-2912 patch tried to fix it, but it defaults to port 9998 and doesn't 
> scan for a spare port, so it is too brittle on a busy server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732430#comment-14732430
 ] 

Karthik Kambatla commented on YARN-2005:


In my comments above: if, per point 2.4, we shouldn't update the systemBlacklist, 
we likely don't need the new method in RMAppAttempt. 

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs

2015-09-06 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732429#comment-14732429
 ] 

Karthik Kambatla commented on YARN-2005:


Thanks for working on this, Anubhav. The approach looks good to me. Minor 
comments/nits on the patch itself:
# Spurious changes/imports in AbstractYarnScheduler, FifoScheduler, RMAppImpl, 
TestAbstractYarnScheduler, TestAMRestart, YarnScheduler.
# In AppSchedulingInfo
## synchronize *only* the common updateBlacklist method? 
## Also, should we just synchronize on the list in question and not all of 
AppSchedulingInfo? I am fine with leaving it as is, if that is a lot of work 
and won't yield any big gains.
## The comments in the common updateBlacklist method refer to userBlacklist 
while the method operates on both system and user blacklists.
## Should transferStateFromPreviousAppSchedulingInfo update the systemBlacklist 
as well? Or, is the decision here to have the system blacklist per app-attempt?
# BlacklistAdditionsRemovals:
## Mark it Private
## Rename to BlacklistUpdates
## Rename members blacklistAdditions and blacklistRemovals to additions and 
removals, and update the getters accordingly? 
# BlackListManager
## Mark it Private
## Rename addNodeContainerFailure to addNode?
## Rename getter?
# In DisabledBlacklistManager, define a static EMPTY_LIST similar to 
SimpleBlacklistManager and use that to avoid creating two ArrayLists for each 
AppAttempt (see the sketch after this list).
# RMAppImpl: If AM blacklisting is not enabled, we don't need to read the 
disable threshold. 
# RMAppAttempt: Instead of getAMBlacklist, should we call it getSystemBlacklist 
to be consistent with the way we refer to it in the scheduler? 
# RMAppAttemptImpl
## EMPTY_SYSTEM_BLACKLIST is unused
## Update any variables based on the method name - getAMBlacklist vs 
getSystemBlacklist
## Shouldn't we blacklist nodes on LaunchFailedTransition - maybe in a 
follow-up JIRA? Can we file one if you agree?
# Is the MockRM change unrelated? I like the change, but maybe we should do it 
in a separate clean-up JIRA. It might have a few other things to clean up :)
# TestAMRestart: A couple of unused variables. 
# Why are the changes to TestCapacityScheduler needed? They don't look related 
to this patch.
# TestRMAppLogAggregationStatus change is unrelated. 
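
For the DisabledBlacklistManager point above, a rough sketch of what reusing a 
shared empty list could look like (the class and method names follow the 
renames suggested in these comments, e.g. {{BlacklistUpdates}} and {{addNode()}}, 
and may not match the final patch):

{code}
// Rough sketch only: a disabled manager that never blacklists anything and
// always hands back the same immutable, empty update, avoiding two new
// ArrayLists per AppAttempt.
import java.util.Collections;
import java.util.List;

public class DisabledBlacklistManager implements BlacklistManager {

  private static final List<String> EMPTY_LIST = Collections.emptyList();
  private static final BlacklistUpdates EMPTY_UPDATES =
      new BlacklistUpdates(EMPTY_LIST, EMPTY_LIST);

  @Override
  public void addNode(String node) {
    // blacklisting disabled: ignore container failures
  }

  @Override
  public BlacklistUpdates getBlacklistUpdates() {
    return EMPTY_UPDATES;
  }
}
{code}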

> Blacklisting support for scheduling AMs
> ---
>
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 0.23.10, 2.4.0
>Reporter: Jason Lowe
>Assignee: Anubhav Dhoot
> Attachments: YARN-2005.001.patch, YARN-2005.002.patch, 
> YARN-2005.003.patch, YARN-2005.004.patch, YARN-2005.005.patch, 
> YARN-2005.006.patch, YARN-2005.006.patch, YARN-2005.007.patch, 
> YARN-2005.008.patch
>
>
> It would be nice if the RM supported blacklisting a node for an AM launch 
> after the same node fails a configurable number of AM attempts.  This would 
> be similar to the blacklisting support for scheduling task attempts in the 
> MapReduce AM but for scheduling AM attempts on the RM side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)