[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-05-11 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280745#comment-15280745
 ] 

Jian He commented on YARN-4635:
---

I guess this is a bit late for 2.8; let's move it to 2.9?

> Add global blacklist tracking for AM container failure.
> ---
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app's blacklist, to track AM 
> container failures that have a global effect. That means we need to 
> differentiate whether a non-succeeded ContainerExitStatus is caused by the 
> NM or is more related to the app. 
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132071#comment-15132071
 ] 

Sunil G commented on YARN-4635:
---

[~vvasudev], I also think the same way. [~djp], could you also share the plan?



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132072#comment-15132072
 ] 

Sunil G commented on YARN-4635:
---

In the meantime, I will provide an early version of the patch based on this 
ticket in YARN-4637, and will start the discussion in that ticket. Thank you.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132322#comment-15132322
 ] 

Sunil G commented on YARN-4635:
---

Hi [~jianhe]
bq. I said this because I feel the API may look simpler and we don't need a 
separate nested AMBlackListingRequest class
For this feature, I think having a dedicated {{AMBlackListingRequest}} class 
will be clearer to the end user. A few minor advantages:

If the enable/disable flag is not present in AMBlackListingRequest, then as 
you mentioned we'll be using the threshold alone, and that single value has 
to convey four cases:
- The user does not configure this information from the AM end at all (a 
default value is needed, perhaps a negative one).
- The user wants to disable blacklisting for this specific application (the 
threshold needs to be 0).
- The user wants blacklisting and configures an explicit threshold value.
- The user wants this feature but does not know a good threshold and wants to 
fall back to the global threshold, so they would have to supply some value 
larger than 1.0f to get that behavior.
With a flag, we can simply turn the feature on or off per app, and a user who 
does not want the feature need not set the blacklist object in the context at 
all (it stays null).

These are not very strong reasons, and as you said we can achieve the current 
behavior either way. No problem with choosing either option. :)
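The four cases above can be made concrete with a small sketch. This is only an illustration of the proposed semantics; the field and method names here are hypothetical, not the actual AMBlackListingRequest API:

```java
// Illustrative sketch of how an explicit enable flag removes the
// overloaded meanings a bare threshold value would have to carry.
public class AmBlacklistingSketch {

    // A request object mirroring the idea behind AMBlackListingRequest:
    //   null request          -> user configured nothing (feature off here)
    //   enabled == false      -> blacklisting disabled for this app
    //   enabled, threshold>0  -> app-specific threshold
    //   enabled, threshold<=0 -> fall back to the cluster-wide threshold
    static final class Request {
        final boolean enabled;
        final float threshold; // fraction of cluster nodes, e.g. 0.2f

        Request(boolean enabled, float threshold) {
            this.enabled = enabled;
            this.threshold = threshold;
        }
    }

    static float effectiveThreshold(Request req, float clusterDefault) {
        if (req == null || !req.enabled) {
            return 0.0f; // feature off for this app
        }
        return req.threshold > 0 ? req.threshold : clusterDefault;
    }
}
```

With the flag, none of the four user intents has to be encoded as a magic threshold value; the null/false/positive/non-positive combinations cover them directly.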



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131994#comment-15131994
 ] 

Sunil G commented on YARN-4635:
---

Hi [~vvasudev]
YARN-4637 will handle the purging mechanism. We have a few possible 
options/policies for purging (time based, NM-event based, etc.). Once this 
patch is in shape, YARN-4637 can make progress based on it.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132004#comment-15132004
 ] 

Varun Vasudev commented on YARN-4635:
-

[~sunilg] - does that mean this won't get committed until YARN-4637 is ready?



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-04 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131938#comment-15131938
 ] 

Varun Vasudev commented on YARN-4635:
-

I agree with [~jianhe] - 

1) KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM are container specific, and 
there is no reason to blacklist the node. The AM will be killed for exceeding 
its pmem and vmem limits irrespective of the other containers on the node.

2) For DISKS_FAILED, the RM should mark the disk as bad and the node should be 
skipped.

In addition, I don't see any mechanism for purging the blacklist in this 
patch. If that's the case, we should work on this in a feature branch and not 
commit to trunk/branch-2 until we have the purge mechanism sorted out.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-03 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130272#comment-15130272
 ] 

Junping Du commented on YARN-4635:
--

Thanks [~jianhe] for the review and comments.
First, I would like to state an assumption: the blacklist mechanism for AM 
launching is not for tracking nodes that do not work at all (unhealthy nodes), 
but for tracking nodes that are suspected, based on previous failures, of 
failing the AM container. We already have the unhealthy-report mechanism for 
serious NM issues, so this one should have a higher bar based on history (in 
some sense, the AM container is more important than other containers).
My responses below are based on that assumption.
bq. why should below container exit status blacklist the node ?
Such a container failure could be due to resource congestion (like 
KILLED_EXCEEDED_PMEM) or an unknown reason (ABORTED, INVALID) that makes this 
NM more suspect than normal nodes.

bq. For DISKS_FAILED which is considered as global blacklist node in this 
jira, I think in this case, the node will report as unhealthy and RM should 
remove the node already.
Some DISKS_FAILED cases can happen because the failed container filled a disk. 
The node could still have other directories available, so it could still 
launch normal containers, but it is not suitable to risk an AM container on.

bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do 
you think it’s ok to just use the threshold number only ? 0 means disabled, 
and numbers larger than 0 means enabled?
If so, the job submitter would have to understand how many nodes the current 
cluster has, and the job parameters would have to be updated whenever the job 
is submitted to a different cluster (with a different number of nodes). IMO, 
that adds complexity for users.
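For context, the disable threshold in the existing blacklist code is expressed as a fraction of the cluster size, which is what keeps a job's setting portable across clusters. A minimal sketch of how such a check is typically evaluated (illustrative names, not the actual SimpleBlacklistManager code):

```java
// Sketch: a fractional disable threshold is evaluated against the current
// cluster size, so the same job configuration ports across clusters of
// different sizes. Method and parameter names are illustrative only.
public class ThresholdSketch {

    // Blacklisting is honored only while the blacklist covers at most
    // disableThreshold * clusterSize nodes; beyond that it is ignored so
    // the AM is not starved of launch targets.
    static boolean blacklistingEffective(int blacklisted, int clusterSize,
                                         float disableThreshold) {
        return blacklisted <= (int) (disableThreshold * clusterSize);
    }
}
```

The submitter only ever supplies the fraction; the node count is supplied by the RM at evaluation time.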



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-03 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131632#comment-15131632
 ] 

Jian He commented on YARN-4635:
---

bq. Some DISKS_FAILED cases can happen because the failed container filled a 
disk. The node could still have other directories available, so it could 
still launch normal containers, but it is not suitable to risk an AM 
container on.
In the current code, the DISKS_FAILED status is set when this condition is 
true:
{code}
  if (!dirsHandler.areDisksHealthy()) {
    ret = ContainerExitStatus.DISKS_FAILED;
    throw new IOException("Most of the disks failed. "
        + dirsHandler.getDisksHealthReport(false));
  }
{code}
The same check, {{dirsHandler.areDisksHealthy}}, is used by the disk-health 
monitor:
{code}
  boolean isHealthy() {
    boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
        : nodeHealthScriptRunner.isHealthy();
    return scriptHealthStatus && dirsHandler.areDisksHealthy();
  }
{code}
Essentially, if this condition is false, the node will be reported as 
unhealthy in the first place, which makes the RM remove the node. The global 
blacklist then becomes not useful in practice, because the node is already 
removed. Maybe I missed something; a unit test can prove this.
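The claim can be modeled in a unit-test-style sketch. This is a toy model of the shared predicate, not the real NodeManager classes:

```java
// Toy model of the argument: DISKS_FAILED and the node-health report both
// key off the same areDisksHealthy() predicate, so whenever a container
// launch would fail with DISKS_FAILED the node is simultaneously reported
// as unhealthy (and would be removed by the RM).
public class DiskHealthSketch {
    // Stand-in for dirsHandler.areDisksHealthy().
    static boolean disksHealthy = true;

    // Container launch path: DISKS_FAILED is set iff disks are unhealthy.
    static boolean launchSetsDisksFailed() {
        return !disksHealthy;
    }

    // Health-monitor path (script health ignored in this toy model).
    static boolean nodeReportedHealthy() {
        return disksHealthy;
    }
}
```

In this model the two outcomes are never independent, which is exactly why the global blacklist would see the node only after the RM has already removed it.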

bq. If so, it means the job submitter have to understand how many nodes the 
current cluster have
Sorry, I don't understand why the job submitter needs to know the number of 
nodes. What I meant is that right now a boolean flag (false) is used to 
indicate that this feature is disabled; alternatively, a 0 threshold can 
achieve the same result (with a logic change on the RM side). I said this 
because I feel the API may look simpler and we don't need a separate nested 
AMBlackListingRequest class; having the threshold set in the 
submissionContext would be enough. But I don't have a strong opinion on this. 
The current way is ok too.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-03 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130324#comment-15130324
 ] 

Sunil G commented on YARN-4635:
---

Thanks [~jianhe] for the comments and thanks [~djp] for the clarifications.

bq. Do you think it’s ok to just use the threshold number only ? 0 means 
disabled, and numbers larger than 0 means enabled
Adding one more minor point: if an app sets the {{AMBlackListingRequest}} 
flag to false, then global blacklisting will not be applied to that app. Such 
control is easier with a flag, I think. How do you feel?



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129036#comment-15129036
 ] 

Hadoop QA commented on YARN-4635:
-

*-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
| 0 | mvndep | 0m 37s | Maven dependency ordering for branch |
| +1 | mvninstall | 10m 27s | trunk passed |
| +1 | compile | 4m 21s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 3m 20s | trunk passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 59s | trunk passed |
| +1 | mvnsite | 2m 20s | trunk passed |
| +1 | mvneclipse | 0m 53s | trunk passed |
| +1 | findbugs | 5m 46s | trunk passed |
| +1 | javadoc | 2m 44s | trunk passed with JDK v1.8.0_66 |
| +1 | javadoc | 4m 50s | trunk passed with JDK v1.7.0_91 |
| 0 | mvndep | 0m 32s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 0s | the patch passed |
| +1 | compile | 4m 7s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 4m 7s | the patch passed |
| +1 | compile | 3m 16s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 3m 16s | the patch passed |
| -1 | checkstyle | 0m 53s | hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 803 unchanged - 4 fixed = 807 total (was 807) |
| +1 | mvnsite | 2m 13s | the patch passed |
| +1 | mvneclipse | 0m 47s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | findbugs | 6m 31s | the patch passed |
| +1 | javadoc | 2m 36s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 4m 38s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 42s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 3m 3s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| -1 | unit | 76m 51s | hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 0m 36s | hadoop-yarn-api in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 51s | hadoop-yarn-common in the patch passed with JDK v1.7.0_91. |
| -1 | unit |

[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128389#comment-15128389
 ] 

Sunil G commented on YARN-4635:
---

Hi [~djp]
Thanks for sharing the patch so fast. Overall it looks fine to me.

A few points:
1. The per-app blacklist manager does not need to handle removing a node from 
its blacklist, but for the global blacklist manager I think we need a 
{{removeNode}} interface in {{BlacklistManager}}. If an AM container is 
launched successfully on a node at some later point after the first failure, 
we can remove that node immediately from the global blacklist. Maybe 
{{RMAppAttemptImpl#checkStatusForNodeBlacklisting}} can check for success too 
(or are we planning to handle this in the ticket where we work out the 
time-based clearing mechanism?). Thoughts?

2. I think {{SimpleBlacklistManager#refreshNodeHostCount}} can pre-compute 
the failure threshold along with updating {{numberOfNodeManagerHosts}}, so 
callers of {{getBlacklistUpdates}} do not have to compute it every time. This 
is a minor suggestion on the existing code.

3.
{code}
+// No thread safe problem as getBlacklistUpdates() in
+// SimpleBlacklistManager do clone operation to blacklistNodes
+List<String> amBlacklistAdditions = new ArrayList<String>();
{code}

There is a chance of duplicates between the global and per-app blacklists, 
correct? So could we use a Set here? One possibility: one AM container fails 
due to ABORTED and is added to the per-app blacklist, and the second attempt 
fails due to DISKS_FAILED and is added to the global one. That would be a 
duplicate scenario. Thoughts?
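The Set-based deduplication suggested above can be sketched as follows. This is a hedged illustration only; the real patch wires this into the RM's attempt logic rather than a standalone helper like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of merging the per-app and global AM blacklists through a Set,
// so a node present in both lists appears only once in the additions
// that are sent to the scheduler.
public class BlacklistMergeSketch {

    static List<String> merge(List<String> perApp, List<String> global) {
        // LinkedHashSet drops duplicates while keeping insertion order:
        // per-app entries first, then any global entries not already seen.
        Set<String> merged = new LinkedHashSet<>(perApp);
        merged.addAll(global);
        return new ArrayList<>(merged);
    }
}
```

With lists, Sunil's scenario (the same node blacklisted once per-app for ABORTED and once globally for DISKS_FAILED) would produce a duplicate addition; with the Set it collapses to one entry.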

 




[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128533#comment-15128533
 ] 

Junping Du commented on YARN-4635:
--

bq. If we can launch an AM container at some later point of time after the 
first failure, we can remove that node immediately from global blacklisting.
In most cases, the AM container won't get a chance to launch again on this 
node, because the blacklist mechanism already prevents it from being 
allocated there. The corner case is two AM containers launched at the same 
time, one failing and the other succeeding. IMO, the successful one shouldn't 
purge the node from the blacklist as if it were a normal node, because a 
failure marked as globally affecting, like DISKS_FAILED, could still hit 
subsequent AM containers. In other words, launching an AM on this node is 
still riskier, and that is not changed by another AM container finishing. We 
can discuss purging nodes from the global list (time based, event based such 
as NM reconnect, etc.) in the dedicated JIRA YARN-4637 that I filed before.

bq. I think SimpleBlacklistManager#refreshNodeHostCount can pre-compute 
failure threshold also along with updating numberOfNodeManagerHosts. So 
whoever is invoking getBlacklistUpdates need not have to compute always. This 
is minor suggestion in existing code.
Sounds good. Updated in the v2 patch.

bq. There are chances of duplicates from global and per-app level blacklists, 
correct?. So could we use a Set here. One possibility, one AM container 
failed due to ABORTED and added to per-app level blacklist, second attempt 
failed due to DISKS_FAILED and added to global. Now this will be a duplicate 
scenario. Thoughts?
Nice catch! The same app with different attempts won't cause this duplicate, 
but the following scenario can: one app's AM fails on a node for a reason 
like ABORTED while, at the same time, another app's AM fails on the same node 
for DISKS_FAILED; the node would then be duplicated across the two lists. 
Fixed in the v2 patch.

There is another issue: the threshold control in BlacklistManager is applied 
to the two lists (global and per-app) separately, so together the two lists 
could unexpectedly blacklist all nodes. We need a thread-safe merge operation 
across the two BlacklistManagers to address this. I marked a TODO in the 
patch and will file a separate JIRA to fix it.
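The combined-threshold problem can be sketched as follows. This is a hypothetical illustration of the TODO, not the actual BlacklistManager API: even when each list is individually under its disable threshold, their union may cover too much of the cluster:

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Sketch: re-check the disable threshold on the *union* of the per-app and
// global blacklists, so the two lists together cannot blacklist (almost)
// every node and starve the AM of launch targets. Illustrative names only.
public class CombinedThresholdSketch {

    static Set<String> mergedAdditions(Set<String> perApp, Set<String> global,
                                       int clusterSize,
                                       float disableThreshold) {
        Set<String> union = new TreeSet<>(perApp);
        union.addAll(global);
        // If together they would cover more than the allowed fraction of
        // the cluster, ignore blacklisting entirely for this allocation.
        if (union.size() > (int) (disableThreshold * clusterSize)) {
            return Collections.emptySet();
        }
        return union;
    }
}
```

This mirrors the scenario discussed later in the thread: with node1-node6 blacklisted per-app, node7-node10 blacklisted globally, and a 0.8 threshold on a 10-node cluster, each list passes its own check but the union covers all 10 nodes.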



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128599#comment-15128599
 ] 

Sunil G commented on YARN-4635:
---

Thanks [~djp]
bq. We can discuss more about purge node from global list, like: time based, 
event (NM reconnect) based, etc. in a dedicated JIRA YARN-4637 
+1. Yes, we can cover the time-based and event-based cases in that JIRA. And 
as you mentioned, the corner case only happens if an AM is launched on a node 
that is later blacklisted due to another app's failure.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128628#comment-15128628
 ] 

Sunil G commented on YARN-4635:
---

bq. it is possible that two lists together could unexpectedly blacklist all 
nodes. 
Hi [~djp]. Is this the case where node1 to node6 are blacklisted by the app 
and node7 to node10 are blacklisted by the global manager (assuming we have 
node1 to node10 and a disableThreshold of 0.8)?

Could we also check {{disableThreshold}} against the total Set we now create, 
and if we cross the limit, clear the app-based/global-based blacklists from 
this list? Could that solve the above scenario?



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128640#comment-15128640
 ] 

Sunil G commented on YARN-4635:
---

Yes. We need a thread-safe way of merging here; otherwise it may cause some 
corner cases.



[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128664#comment-15128664
 ] 

Junping Du commented on YARN-4635:
--

Thanks for the comments, Sunil.
bq. Is this the case where node1 to node6 is blacklisted by app and node7 to 
node10 is blacklisted by the global manager.
Yes. This is correct.

bq. Could we also check disableThreshold on the total Set which we created now. 
And if we cross the limit, clear app-based / global-based blacklists from 
this list. Could this solve the above mentioned scenario?
The thing could be slightly more complicated than this. Several things to 
consider:
- The threshold can be different for global/app, as we already give the app 
flexibility in YARN-4389, so we have to choose one bar (the upper, the lower, 
or always the app's bar). 
- When the combined list goes over the bar chosen above, should we flip both 
lists or only one of them? The flip mechanism itself is also worth discussing 
further, as I think another mechanism, like LRU, could be better.
- If one list gets flipped, how shall we merge it with the other, unflipped 
one? The removed items could overlap with items being added, although they 
belong to different affected scopes, etc.
I would suggest having a further discussion in a separate JIRA.
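One way the threshold/flip interplay discussed above could look is sketched below. All names here are hypothetical (this is not the patch's API): the sketch picks a single threshold bar for the combined list and, when it trips, flips only the app-level list while keeping the global one, which is just one of the policies under discussion.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the open questions above (names are invented):
// choose one threshold bar for the merged list, and flip (clear) a list
// once the merged view covers too much of the cluster.
public class ThresholdFlipSketch {
    static boolean overThreshold(int blacklisted, int clusterNodes, double threshold) {
        return blacklisted > clusterNodes * threshold;
    }

    // One possible policy: when the merged list trips the bar, flip only the
    // app-level list and keep the global one (LRU eviction is an alternative).
    static Set<String> merge(Set<String> app, Set<String> global,
                             int clusterNodes, double threshold) {
        Set<String> merged = new HashSet<>(app);
        merged.addAll(global);
        if (overThreshold(merged.size(), clusterNodes, threshold)) {
            app.clear();                    // flip the app list
            merged = new HashSet<>(global); // re-merge with what remains
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> app = new HashSet<>(Set.of("node1", "node2", "node3"));
        Set<String> global = new HashSet<>(Set.of("node3", "node4"));
        // 4 distinct nodes out of 10 exceeds a 0.33 bar, so the app list flips
        // and only node3 and node4 remain in the merged view.
        System.out.println(merge(app, global, 10, 0.33));
    }
}
```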






[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-02 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15129911#comment-15129911
 ] 

Jian He commented on YARN-4635:
---

I have some questions on the existing code.
Why should the container exit statuses below blacklist the node?
- KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM: I feel these are specific to 
the container only?
- ABORTED: it's used when the AM releases the container, the app finishes, the 
container expires, etc. 
- INVALID: this is the default exit code.

For DISKS_FAILED, which this JIRA treats as a global-blacklist condition, I 
think the node will report itself as unhealthy and the RM should remove the 
node already.

In YARN-4389, AMBlackListingRequest contains a boolean flag and a threshold 
number. Do you think it's OK to just use the threshold number only? 0 means 
disabled, and numbers larger than 0 mean enabled? cc [~sunilg]

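The threshold-only convention suggested above could be sketched like this. The class and field names are hypothetical, not YARN-4389's actual API; the point is simply that a single non-negative number can carry both the on/off state and the limit.

```java
// Hypothetical sketch of the suggested convention: drop the boolean flag and
// let the threshold alone decide whether AM blacklisting is enabled.
public class AmBlacklistConfigSketch {
    private final int failureThreshold; // 0 disables blacklisting; >0 enables it

    public AmBlacklistConfigSketch(int failureThreshold) {
        if (failureThreshold < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.failureThreshold = failureThreshold;
    }

    public boolean isEnabled() { return failureThreshold > 0; }
    public int getThreshold() { return failureThreshold; }

    public static void main(String[] args) {
        System.out.println(new AmBlacklistConfigSketch(0).isEnabled()); // prints false
        System.out.println(new AmBlacklistConfigSketch(3).isEnabled()); // prints true
    }
}
```

This avoids the inconsistent state where the flag is enabled but the threshold is zero, which a separate boolean would otherwise allow.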






[jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.

2016-02-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127562#comment-15127562
 ] 

Hadoop QA commented on YARN-4635:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 31s {color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 42s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 8s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 43s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 34s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s {color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 18s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 41s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 41s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 4s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 4s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 41s {color} | {color:red} hadoop-yarn-project/hadoop-yarn: patch generated 4 new + 804 unchanged - 4 fixed = 808 total (was 808) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 4 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 17s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 32s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 19s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 53s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 62m 6s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 23s {color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 7s {color} | {color:green} hadoop-yarn-common in the patch passed with JDK v1.7.0_91. {color} |
| {color:red}-1{color} |