[jira] [Commented] (YARN-6302) Fail the node if Linux Container Executor is not configured properly

2018-03-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385442#comment-16385442
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim closed the pull request at:

https://github.com/apache/hadoop/pull/200


> Fail the node if Linux Container Executor is not configured properly
> 
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha4
>
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch, YARN-6302.005.branch-2.patch
>
>
> We have a cluster in which one node has a misconfigured Linux Container 
> Executor (LCE). Every AM or regular container launched on that node 
> fails. The node still has resources available, so it keeps 
> failing apps until the administrator notices the issue and decommissions the 
> node. AM blacklisting only helps if the application is already running.
> As a possible improvement, when the LCE is used on the cluster and an NM gets 
> certain errors back from the LCE, such as error 24 (configuration not found), we 
> should either stop allocating anything on the node or shut the node down 
> entirely. That kind of problem normally does not fix itself, and it means that 
> nothing can really run on that node.
> {code}
> Application application_1488920587909_0010 failed 2 times due to AM Container 
> for appattempt_1488920587909_0010_02 exited with exitCode: -1000
> Failing this attempt.Diagnostics: Application application_1488920587909_0010 
> initialization failed (exitCode=24) with output:
> For more detailed output, check the application tracking page: 
> http://node-1.domain.com:8088/cluster/app/application_1488920587909_0010 Then 
> click on links to logs of each attempt.
> . Failing the application.
> {code}
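
The behaviour proposed in the description above could be sketched roughly as follows. This is a minimal illustration, not the committed patch: the class, method, and exception names are hypothetical and are not the Hadoop API. The only value taken from the issue text is exit code 24. The idea is that an ordinary launch failure fails just that container, while a configuration error also marks the node unhealthy so nothing else is scheduled on it.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Sketch only: every name here is hypothetical, not the Hadoop API. */
final class LceHealthSketch {
    /** Exit code quoted in the issue text ("error 24 configuration not found"). */
    static final int CONFIG_NOT_FOUND = 24;

    /** Hypothetical exception for errors that will not fix themselves. */
    static final class MisconfiguredExecutorException extends Exception {
        MisconfiguredExecutorException(String message) {
            super(message);
        }
    }

    /** Hypothetical health flag that a node would report to the scheduler. */
    static final class NodeHealth {
        private final AtomicReference<String> failure = new AtomicReference<>();

        /** Records the first fatal reason; later reports do not overwrite it. */
        void markUnhealthy(String reason) {
            failure.compareAndSet(null, reason);
        }

        boolean isHealthy() {
            return failure.get() == null;
        }

        String report() {
            String reason = failure.get();
            return reason == null ? "healthy" : reason;
        }
    }

    /** Turns the unrecoverable exit code into a dedicated exception. */
    static void checkExitCode(int exitCode) throws MisconfiguredExecutorException {
        if (exitCode == CONFIG_NOT_FOUND) {
            throw new MisconfiguredExecutorException(
                "container executor configuration not found (exitCode=" + exitCode + ")");
        }
    }

    /**
     * Handles the exit code of one container launch and returns true on success.
     * An ordinary failure only fails that container; a configuration error
     * additionally marks the whole node unhealthy.
     */
    static boolean handleLaunchResult(int exitCode, NodeHealth health) {
        try {
            checkExitCode(exitCode);
        } catch (MisconfiguredExecutorException e) {
            health.markUnhealthy(e.getMessage());
            return false;
        }
        return exitCode == 0;
    }

    public static void main(String[] args) {
        NodeHealth health = new NodeHealth();
        handleLaunchResult(1, health);   // ordinary container failure
        System.out.println(health.isHealthy());
        handleLaunchResult(24, health);  // misconfigured executor
        System.out.println(health.isHealthy() + " " + health.report());
    }
}
```

Keeping the first recorded reason, rather than the latest one, is a design choice of this sketch: it preserves the original cause for the administrator even if later launches fail differently.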



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6302) Fail the node if Linux Container Executor is not configured properly

2017-05-09 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16003067#comment-16003067
 ] 

Daniel Templeton commented on YARN-6302:


Thanks, [~miklos.szeg...@cloudera.com].  Committed to branch-2.

> Fail the node if Linux Container Executor is not configured properly
> 
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha3
>
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch, YARN-6302.005.branch-2.patch
>
>


[jira] [Commented] (YARN-6302) Fail the node if Linux Container Executor is not configured properly

2017-04-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15975400#comment-15975400
 ] 

Hudson commented on YARN-6302:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11612 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/11612/])
YARN-6302. Fail the node if Linux Container Executor is not configured (templedf: rev 46940d92e2b17c627eb17a9d8fc6cec9c3715592)
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/mock-container-executer-with-configuration-error
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/ContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/TestContainersMonitorResourceChange.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/scheduler/TestContainerSchedulerQueuing.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerRelaunch.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdater.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java


> Fail the node if Linux Container Executor is not configured properly
> 
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Fix For: 3.0.0-alpha3
>
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch
>
>

[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-04-12 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966488#comment-15966488
 ] 

Daniel Templeton commented on YARN-6302:


+1

> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch
>
>


[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-04-12 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966484#comment-15966484
 ] 

Miklos Szegedi commented on YARN-6302:
--

Thank you, I opened YARN-6475.

> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch
>
>


[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-04-12 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966291#comment-15966291
 ] 

Daniel Templeton commented on YARN-6302:


LGTM.  Can you file issues for the two checkstyle issues?  I'd ask you to fix 
them here, but it's better not to complicate the patch with a refactor like 
that.

> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch, YARN-6302.003.patch
>
>


[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-04-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963655#comment-15963655
 ] 

Hadoop QA commented on YARN-6302:
-

| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 11m 38s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 8 new or modified test files. |
| 0 | mvndep | 0m 41s | Maven dependency ordering for branch |
| +1 | mvninstall | 13m 40s | trunk passed |
| +1 | compile | 10m 17s | trunk passed |
| +1 | checkstyle | 0m 59s | trunk passed |
| +1 | mvnsite | 1m 19s | trunk passed |
| +1 | mvneclipse | 0m 56s | trunk passed |
| +1 | findbugs | 2m 6s | trunk passed |
| +1 | javadoc | 1m 5s | trunk passed |
| 0 | mvndep | 0m 12s | Maven dependency ordering for patch |
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 8m 37s | the patch passed |
| +1 | cc | 8m 37s | the patch passed |
| +1 | javac | 8m 37s | the patch passed |
| -0 | checkstyle | 0m 49s | hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 261 unchanged - 9 fixed = 263 total (was 270) |
| +1 | mvnsite | 1m 1s | the patch passed |
| +1 | mvneclipse | 0m 55s | the patch passed |
| +1 | shellcheck | 2m 33s | The patch generated 0 new + 705 unchanged - 1 fixed = 705 total (was 706) |
| +1 | shelldocs | 0m 28s | There were no new shelldocs issues. |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 2m 3s | the patch passed |
| +1 | javadoc | 0m 45s | the patch passed |
| +1 | unit | 0m 31s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 13m 1s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 83m 54s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:612578f |
| JIRA Issue | YARN-6302 |
| GITHUB PR | https://github.com/apache/hadoop/pull/200 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc shellcheck shelldocs |
| uname | Linux e17f942aac70 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7999318a |
| Default Java | 1.8.0_121 |
| shellcheck | v0.4.6 |
| findbugs | v3.0.0 

[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-04-10 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963573#comment-15963573
 ] 

Daniel Templeton commented on YARN-6302:


Cool.  I like that you've added {{Strings.emptyToNull()}}.  Wanna replace your 
{{nullIfEmpty()}} method with it?
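
For readers unfamiliar with the helper mentioned above: Guava's `Strings.emptyToNull(String)` returns null when its argument is null or empty, and otherwise returns the argument unchanged. The sketch below uses a local stand-in with that contract so it runs without Guava on the classpath. The body of the patch's `nullIfEmpty()` is not shown in this thread, so treating the two as equivalent is an assumption, and the `describe` caller is purely hypothetical.

```java
/** Sketch only: a local stand-in, so the example runs without Guava on the classpath. */
final class EmptyToNullSketch {
    /** Mirrors the documented contract of Guava's Strings.emptyToNull(String). */
    static String emptyToNull(String s) {
        return (s == null || s.isEmpty()) ? null : s;
    }

    /** Hypothetical caller: treat empty diagnostics output as "no output". */
    static String describe(String output) {
        String normalized = emptyToNull(output);
        return normalized == null ? "(no output)" : normalized;
    }

    public static void main(String[] args) {
        System.out.println(describe(""));
        System.out.println(describe("configuration not found"));
    }
}
```

In real code the stand-in would simply be replaced by a call to `com.google.common.base.Strings.emptyToNull`, which removes one hand-written helper from the patch.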

> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Miklos Szegedi
> Assignee: Miklos Szegedi
> Priority: Minor
> Attachments: YARN-6302.000.patch, YARN-6302.001.patch, 
> YARN-6302.002.patch
>
>


[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-31 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15951954#comment-15951954
 ] 

Hadoop QA commented on YARN-6302:
-

| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 15s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 8 new or modified test files. |
| 0 | mvndep | 0m 11s | Maven dependency ordering for branch |
| +1 | mvninstall | 15m 54s | trunk passed |
| +1 | compile | 11m 55s | trunk passed |
| +1 | checkstyle | 1m 12s | trunk passed |
| +1 | mvnsite | 1m 18s | trunk passed |
| +1 | mvneclipse | 0m 50s | trunk passed |
| +1 | findbugs | 2m 9s | trunk passed |
| +1 | javadoc | 1m 2s | trunk passed |
| 0 | mvndep | 0m 12s | Maven dependency ordering for patch |
| +1 | mvninstall | 0m 51s | the patch passed |
| +1 | compile | 9m 55s | the patch passed |
| +1 | cc | 9m 55s | the patch passed |
| +1 | javac | 9m 55s | the patch passed |
| -0 | checkstyle | 0m 59s | hadoop-yarn-project/hadoop-yarn: The patch generated 2 new + 260 unchanged - 9 fixed = 262 total (was 269) |
| +1 | mvnsite | 1m 13s | the patch passed |
| +1 | mvneclipse | 0m 46s | the patch passed |
| +1 | shellcheck | 0m 18s | The patch generated 0 new + 729 unchanged - 1 fixed = 729 total (was 730) |
| +1 | shelldocs | 0m 35s | There were no new shelldocs issues. |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | findbugs | 2m 21s | the patch passed |
| +1 | javadoc | 0m 58s | the patch passed |
| +1 | unit | 0m 38s | hadoop-yarn-api in the patch passed. |
| +1 | unit | 12m 56s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 37s | The patch does not generate ASF License warnings. |
| | | 76m 14s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-6302 |
| GITHUB PR | https://github.com/apache/hadoop/pull/200 |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle cc shellcheck shelldocs |
| uname | Linux 339b5c22c8b4 3.13.0-103-generic #150-Ubuntu SMP Thu Nov 24 10:34:17 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 73835c7 |
| Default Java | 1.8.0_121 |
| shellcheck | v0.4.5 |
| findbugs | 

[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949853#comment-15949853
 ] 

Hadoop QA commented on YARN-6302:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
16s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 7 new or modified test 
files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 12m 
41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  3s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  8m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  8m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  8m 47s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  1m  4s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 4 new + 255 unchanged - 9 fixed = 259 total (was 264) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 18s{color} | {color:green} The patch generated 0 new + 729 unchanged - 1 fixed = 729 total (was 730) {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 28s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 35s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 29s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 1 new + 231 unchanged - 0 fixed = 232 total (was 231) {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 38s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 13m 17s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 77m 19s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.TestNodeHealthService |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-6302 |
| GITHUB PR | https://github.com/apache/hadoop/pull/200 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  cc  shellcheck  shelldocs  |
| uname | Linux 135ef6798f28 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 

[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949617#comment-15949617
 ] 

Hadoop QA commented on YARN-6302:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 16s{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 7 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 11m 41s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 58s{color} | {color:green} trunk passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 11s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green}  9m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  9m 33s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 58s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 4 new + 254 unchanged - 9 fixed = 258 total (was 263) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green}  0m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green}  0m 18s{color} | {color:green} The patch generated 0 new + 729 unchanged - 1 fixed = 729 total (was 730) {color} |
| {color:green}+1{color} | {color:green} shelldocs {color} | {color:green}  0m 27s{color} | {color:green} There were no new shelldocs issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 24s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 29s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 1 new + 231 unchanged - 0 fixed = 232 total (was 231) {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 39s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 12m 58s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 36s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 73m 43s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.TestNodeHealthService |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-6302 |
| GITHUB PR | https://github.com/apache/hadoop/pull/200 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  cc  shellcheck  shelldocs  |
| uname | Linux 3d53594adddf 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 

[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-30 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15949494#comment-15949494
 ] 

Miklos Szegedi commented on YARN-6302:
--

I just did.

> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
> Attachments: YARN-6302.000.patch
>
>
> We have a cluster that has one node with a misconfigured Linux Container 
> Executor. Every time an AM or regular container is launched on that node, 
> it will fail. The node will still have resources available, so it keeps 
> failing apps until the administrator notices the issue and decommissions the 
> node. AM blacklisting only helps if the application is already running.
> As a possible improvement, when the LCE is used on the cluster and an NM gets 
> certain errors back from the LCE, such as error 24 (configuration not found), 
> we should either stop allocating anything on the node or shut down the node 
> entirely. That kind of problem normally does not fix itself, and it means that 
> nothing can really run on that node.
> {code}
> Application application_1488920587909_0010 failed 2 times due to AM Container 
> for appattempt_1488920587909_0010_02 exited with exitCode: -1000
> Failing this attempt.Diagnostics: Application application_1488920587909_0010 
> initialization failed (exitCode=24) with output:
> For more detailed output, check the application tracking page: 
> http://node-1.domain.com:8088/cluster/app/application_1488920587909_0010 Then 
> click on links to logs of each attempt.
> . Failing the application.
> {code}
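
The behaviour proposed in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in the attached patch: the class `LceHealthSketch`, its methods, and the constant name `CONFIG_NOT_FOUND` are hypothetical, and only the exit code value 24 comes from the report above.

```java
// Hypothetical sketch: none of these names are taken from the patch.
final class LceHealthSketch {
  // Exit code 24 is the value quoted in the report; the constant name is assumed.
  static final int CONFIG_NOT_FOUND = 24;

  private boolean healthy = true;
  private String report = "";

  /** Records an executor exit code; a configuration error marks the node unhealthy. */
  void onExecutorExit(int exitCode) {
    if (exitCode == CONFIG_NOT_FOUND) {
      // A configuration error will not fix itself, so stop accepting work.
      healthy = false;
      report = "Linux Container Executor misconfigured (exitCode=" + exitCode + ")";
    }
    // Any other exit code is treated as an ordinary container failure.
  }

  boolean isHealthy() {
    return healthy;
  }

  String getReport() {
    return report;
  }

  public static void main(String[] args) {
    LceHealthSketch node = new LceHealthSketch();
    node.onExecutorExit(1);   // ordinary failure: node stays schedulable
    System.out.println(node.isHealthy());
    node.onExecutorExit(24);  // configuration error: node becomes unhealthy
    System.out.println(node.isHealthy() + " " + node.getReport());
  }
}
```

In this sketch an unhealthy flag stands in for whatever mechanism the NodeManager would use to stop receiving allocations; shutting the node down entirely would be the other option mentioned in the description.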



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-29 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947841#comment-15947841
 ] 

Daniel Templeton commented on YARN-6302:


LGTM.  Can you please post a patch to the JIRA so that Jenkins has something to 
chew on?




[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947838#comment-15947838
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r108779597
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -580,19 +579,19 @@ public int launchContainer(ContainerStartContext ctx)
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
-if (exitCode == LinuxContainerExecutorExitCode.
+if (exitCode == ExitCode.
--- End diff --

I did not run my last git push. It should be fixed now.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947706#comment-15947706
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r108759529
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -551,16 +550,16 @@ public int launchContainer(ContainerStartContext ctx)
   } else {
 LOG.info(
 "Container was marked as inactive. Returning terminated error");
-return ExitCode.TERMINATED.getExitCode();
+return ContainerExecutor.ExitCode.TERMINATED.getExitCode();
--- End diff --

Ah, missed that.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947687#comment-15947687
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r108757816
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -580,19 +579,19 @@ public int launchContainer(ContainerStartContext ctx)
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
-if (exitCode == LinuxContainerExecutorExitCode.
+if (exitCode == ExitCode.
--- End diff --

What am I missing?  It doesn't look like anything changed...





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941518#comment-15941518
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r108025819
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -551,16 +550,16 @@ public int launchContainer(ContainerStartContext ctx)
   } else {
 LOG.info(
 "Container was marked as inactive. Returning terminated error");
-return ExitCode.TERMINATED.getExitCode();
+return ContainerExecutor.ExitCode.TERMINATED.getExitCode();
--- End diff --

There is one ExitCode now in this class as well.
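
A small sketch of why the qualification in the diff above becomes necessary once the subclass declares its own `ExitCode`. The class names, the constant `CONFIG_ERROR`, and the numeric values are illustrative assumptions, not the actual Hadoop definitions.

```java
// Hypothetical stand-in for the base executor class.
class BaseExecutorSketch {
  enum ExitCode {
    TERMINATED(143);  // value chosen for illustration

    private final int code;

    ExitCode(int code) {
      this.code = code;
    }

    int getExitCode() {
      return code;
    }
  }
}

// Hypothetical stand-in for the Linux-specific subclass.
class LinuxExecutorSketch extends BaseExecutorSketch {
  // A nested type with the same simple name hides the inherited one.
  enum ExitCode {
    CONFIG_ERROR(24);  // hypothetical constant name

    private final int code;

    ExitCode(int code) {
      this.code = code;
    }

    int getExitCode() {
      return code;
    }
  }

  static int inactiveContainerCode() {
    // Unqualified ExitCode here would resolve to LinuxExecutorSketch.ExitCode,
    // which has no TERMINATED constant, so the base enum must be qualified.
    return BaseExecutorSketch.ExitCode.TERMINATED.getExitCode();
  }

  static boolean isConfigError(int exitCode) {
    // Unqualified ExitCode refers to the enum declared in this class.
    return exitCode == ExitCode.CONFIG_ERROR.getExitCode();
  }
}
```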





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941517#comment-15941517
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r108025814
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -580,19 +579,19 @@ public int launchContainer(ContainerStartContext ctx)
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
-if (exitCode == LinuxContainerExecutorExitCode.
+if (exitCode == ExitCode.
--- End diff --

Done.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941039#comment-15941039
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107987190
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -551,16 +550,16 @@ public int launchContainer(ContainerStartContext ctx)
   } else {
 LOG.info(
 "Container was marked as inactive. Returning terminated error");
-return ExitCode.TERMINATED.getExitCode();
+return ContainerExecutor.ExitCode.TERMINATED.getExitCode();
--- End diff --

I don't think this is needful, but you can do it if you want.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15941040#comment-15941040
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107987010
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -580,19 +579,19 @@ public int launchContainer(ContainerStartContext ctx)
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
-if (exitCode == LinuxContainerExecutorExitCode.
+if (exitCode == ExitCode.
--- End diff --

Sorry to pick, but can we split these lines on the == instead of the . ?
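
For illustration, the two line-break placements under discussion might look like this. The enum and its constant `EXAMPLE_CODE` are hypothetical placeholders, since the diff above cuts off before the constant name; both methods behave identically and differ only in layout.

```java
final class LineBreakSketch {
  // Placeholder enum; the constant name and value are assumptions.
  enum ExitCode {
    EXAMPLE_CODE(24);

    private final int code;

    ExitCode(int code) {
      this.code = code;
    }

    int getExitCode() {
      return code;
    }
  }

  // Break placed after the dot, as in the diff under review.
  static boolean splitOnDot(int exitCode) {
    return exitCode == ExitCode.
        EXAMPLE_CODE.getExitCode();
  }

  // Break placed at the == operator, as the review requests.
  static boolean splitOnOperator(int exitCode) {
    return exitCode ==
        ExitCode.EXAMPLE_CODE.getExitCode();
  }
}
```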





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935348#comment-15935348
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107277279
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -111,6 +113,58 @@
   private LinuxContainerRuntime linuxContainerRuntime;
 
   /**
+   * The container exit code.
+   */
+  public enum LinuxContainerExecutorExitCode {
--- End diff --

Done.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935336#comment-15935336
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107274199
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -19,13 +19,12 @@
 package org.apache.hadoop.yarn.exceptions;
 
 import org.apache.hadoop.classification.InterfaceAudience.Public;
-import org.apache.hadoop.classification.InterfaceStability.Unstable;
 
 /**
- * This exception is thrown on unrecoverable container launch errors.
+ * This exception is thrown on unrecoverable configuration errors.
+ * An example is container launch error due to configuration.
  */
 @Public
--- End diff --

All right.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935337#comment-15935337
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107274254
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -0,0 +1,43 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.yarn.exceptions;
+
+import org.apache.hadoop.classification.InterfaceAudience.Public;
+
+/**
+ * This exception is thrown on unrecoverable configuration errors.
+ * An example is container launch error due to configuration.
+ */
+@Public
--- End diff --

All right.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935295#comment-15935295
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107266403
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
 ---
@@ -37,8 +37,8 @@ enum command {
 
 enum errorcodes {
   INVALID_ARGUMENT_NUMBER = 1,
-  INVALID_USER_NAME, //2
-  INVALID_COMMAND_PROVIDED, //3
+  //INVALID_USER_NAME 2
--- End diff --

Yeah, I didn't mean it was your fault.  Salvaging this code isn't your 
problem. :)





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935293#comment-15935293
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107266532
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -0,0 +1,43 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.yarn.exceptions;
+
+import org.apache.hadoop.classification.InterfaceAudience.Public;
+
+/**
+ * This exception is thrown on unrecoverable configuration errors.
+ * An example is container launch error due to configuration.
+ */
+@Public
--- End diff --

Should this be @Evolving?





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935294#comment-15935294
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107268039
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -111,6 +113,58 @@
   private LinuxContainerRuntime linuxContainerRuntime;
 
   /**
+   * The container exit code.
+   */
+  public enum LinuxContainerExecutorExitCode {
--- End diff --

Since this is an inner class of LCE, you can safely drop the LCE from the 
enum name, which will make the subsequent code less messy.
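
A minimal sketch of the renaming suggested above, not the code from the pull
request. `ExecutorSketch` is a hypothetical stand-in for LinuxContainerExecutor,
and the constant name `INVALID_CONFIG_FILE` is assumed for illustration; 24 is
the exit code quoted in the issue description. Because the enum is nested, code
inside the outer class can use the short name.

```java
// Hypothetical stand-in for LinuxContainerExecutor; not the real Hadoop class.
class ExecutorSketch {

  /** Exit codes returned by the native executor (only one shown here). */
  enum ExitCode {
    // Assumed constant name; 24 is the value quoted in the issue description.
    INVALID_CONFIG_FILE(24);

    private final int code;

    ExitCode(int code) {
      this.code = code;
    }

    int getCode() {
      return code;
    }
  }

  /** Returns true if the raw process exit code signals a configuration error. */
  static boolean isConfigError(int exitCode) {
    // Short name "ExitCode" instead of "LinuxContainerExecutorExitCode".
    return exitCode == ExitCode.INVALID_CONFIG_FILE.getCode();
  }
}
```

From outside the class the enum still reads unambiguously as
`ExecutorSketch.ExitCode`, so nothing is lost by dropping the prefix.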





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935297#comment-15935297
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107268430
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -525,6 +580,23 @@ public int launchContainer(ContainerStartContext ctx) 
throws IOException {
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
+if (exitCode == LinuxContainerExecutorExitCode.
--- End diff --

Right.  Forgot about that.  You'd basically have to recreate the same code 
in the enum to get an instance from an int.

Maybe add an equals() method to the enum that can compare against ints as 
well?  Maybe not worth it.  Just shortening the enum name may be enough...
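
For illustration only, here is one way the two options discussed above could
look. The enum name `ExitCodeSketch`, its constants, and the methods `matches`
and `fromCode` are all hypothetical and are not taken from the patch. Note that
`java.lang.Enum` declares `equals(Object)` as final, so it cannot be overridden;
the sketch uses a separately named method for the int comparison instead.

```java
import java.util.Optional;

// Hypothetical enum; names and values are assumptions for illustration.
enum ExitCodeSketch {
  SUCCESS(0),
  INVALID_CONFIG_FILE(24);

  private final int code;

  ExitCodeSketch(int code) {
    this.code = code;
  }

  /** Compares this constant against a raw process exit code. */
  boolean matches(int rawExitCode) {
    return code == rawExitCode;
  }

  /** Linear lookup from a raw exit code; empty if the code is unknown. */
  static Optional<ExitCodeSketch> fromCode(int rawExitCode) {
    for (ExitCodeSketch candidate : values()) {
      if (candidate.code == rawExitCode) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

With `matches`, a call site can stay on the raw int, for example
`ExitCodeSketch.INVALID_CONFIG_FILE.matches(exitCode)`, without a lookup step.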





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935296#comment-15935296
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107242843
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -19,13 +19,12 @@
 package org.apache.hadoop.yarn.exceptions;
 
 import org.apache.hadoop.classification.InterfaceAudience.Public;
-import org.apache.hadoop.classification.InterfaceStability.Unstable;
 
 /**
- * This exception is thrown on unrecoverable container launch errors.
+ * This exception is thrown on unrecoverable configuration errors.
+ * An example is container launch error due to configuration.
  */
 @Public
--- End diff --

Maybe make it evolving?





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933947#comment-15933947
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107055947
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
 ---
@@ -37,8 +37,8 @@ enum command {
 
 enum errorcodes {
   INVALID_ARGUMENT_NUMBER = 1,
-  INVALID_USER_NAME, //2
-  INVALID_COMMAND_PROVIDED, //3
+  //INVALID_USER_NAME 2
--- End diff --

INVALID_USER_NAME was forgotten earlier, so I removed it; I just followed the 
pattern that is in the code right now, keeping the original value in a 
comment.
If we want to refactor this right now, I would generate a large pseudorandom 
number for each code, to be able to check the difference and to search for 
the error code like a GUID in a search engine.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933943#comment-15933943
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107055710
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
 ---
@@ -294,6 +295,14 @@ public Integer call() {
   .setUserLocalDirs(userLocalDirs)
   .setContainerLocalDirs(containerLocalDirs)
   .setContainerLogDirs(containerLogDirs).build());
+} catch (ConfigurationException e) {
+  LOG.error("Failed to launch container.", e);
--- End diff --

It will be redundant, since the exception type is usually visible, but I 
fixed it.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933944#comment-15933944
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107055721
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerRelaunch.java
 ---
@@ -115,6 +116,14 @@ public Integer call() {
   .setContainerLocalDirs(containerLocalDirs)
   .setContainerLogDirs(containerLogDirs)
   .build());
+} catch (ConfigurationException e) {
+  LOG.error("Failed to relaunch container.", e);
--- End diff --

It will be redundant, since the exception type is usually visible, but I 
fixed it.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933941#comment-15933941
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107055288
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -80,6 +97,7 @@ long getLastHealthReportTime() {
 long lastReportTime = (nodeHealthScriptRunner == null)
--- End diff --

Done.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933924#comment-15933924
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107053798
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -54,22 +58,35 @@ protected void serviceInit(Configuration conf) throws 
Exception {
* @return the reporting string of health of the node
*/
   String getHealthReport() {
+String healthReport = "";
--- End diff --

Done.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933925#comment-15933925
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107053877
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -54,22 +58,35 @@ protected void serviceInit(Configuration conf) throws 
Exception {
* @return the reporting string of health of the node
*/
   String getHealthReport() {
+String healthReport = "";
 String scriptReport = (nodeHealthScriptRunner == null) ? ""
 : nodeHealthScriptRunner.getHealthReport();
-if (scriptReport.equals("")) {
-  return dirsHandler.getDisksHealthReport(false);
-} else {
-  return scriptReport.concat(SEPARATOR + 
dirsHandler.getDisksHealthReport(false));
+String discReport = dirsHandler.getDisksHealthReport(false);
+String exceptionReport = nodeHealthException != null ?
+nodeHealthException.getMessage() : "";
+
+if (!scriptReport.equals("")) {
+  healthReport = scriptReport;
+}
+if (!discReport.equals("")) {
+  healthReport = healthReport.equals("") ? discReport :
+  healthReport.concat(SEPARATOR + discReport);
 }
+if (!exceptionReport.equals("")) {
+  healthReport = healthReport.equals("") ? exceptionReport :
+  healthReport.concat(SEPARATOR + exceptionReport);
+}
+return healthReport;
   }
 
   /**
* @return true if the node is healthy
*/
   boolean isHealthy() {
-boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
-: nodeHealthScriptRunner.isHealthy();
-return scriptHealthStatus && dirsHandler.areDisksHealthy();
+boolean scriptHealthStatus = nodeHealthScriptRunner == null ||
--- End diff --

Done.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933894#comment-15933894
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107051786
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -31,6 +31,8 @@
 
   private NodeHealthScriptRunner nodeHealthScriptRunner;
   private LocalDirsHandlerService dirsHandler;
+  private Exception nodeHealthException;
+  long nodeHealthExceptionReportTime;
--- End diff --

My mistake. Fixed.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933892#comment-15933892
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107051720
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -525,6 +580,23 @@ public int launchContainer(ContainerStartContext ctx) 
throws IOException {
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
+if (exitCode == LinuxContainerExecutorExitCode.
--- End diff --

I get "enum types cannot be instantiated". I could create a function that 
returns the appropriate enum for an int value, but wouldn't that be 
overkill here?
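
For reference, the lookup function mentioned above could look roughly like the sketch below. The enum name, its constants, and the `fromInt` helper are all hypothetical and are not taken from the patch; only the value 24 ("configuration not found") comes from the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum for illustration; not the patch's LinuxContainerExecutorExitCode.
enum ExitCodeSketch {
  SUCCESS(0),
  CONFIG_NOT_FOUND(24),   // 24 is the code quoted in the issue description
  UNKNOWN(-1);            // fallback for any unmapped value (an assumption)

  // Reverse index from int value to constant, built once at class load.
  private static final Map<Integer, ExitCodeSketch> BY_CODE = new HashMap<>();

  static {
    for (ExitCodeSketch c : values()) {
      BY_CODE.put(c.code, c);
    }
  }

  private final int code;

  ExitCodeSketch(int code) {
    this.code = code;
  }

  int getCode() {
    return code;
  }

  // Returns the matching constant, or UNKNOWN when the value is not mapped.
  static ExitCodeSketch fromInt(int code) {
    return BY_CODE.getOrDefault(code, UNKNOWN);
  }
}
```

With something like this, the caller could compare `ExitCodeSketch.fromInt(exitCode)` against a constant instead of comparing raw int values; whether that is worth the extra code is the open question.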



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933876#comment-15933876
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user szegedim commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107051128
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -0,0 +1,44 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.yarn.exceptions;
+
+import org.apache.hadoop.classification.InterfaceAudience.Public;
+import org.apache.hadoop.classification.InterfaceStability.Unstable;
+
+/**
+ * This exception is thrown on unrecoverable container launch errors.
--- End diff --

Agreed. Fixed the code.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933827#comment-15933827
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107045163
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -54,22 +58,35 @@ protected void serviceInit(Configuration conf) throws 
Exception {
* @return the reporting string of health of the node
*/
   String getHealthReport() {
+String healthReport = "";
--- End diff --

This would be a bit cleaner with a Joiner:

String scriptReport = (nodeHealthScriptRunner == null) ? null : 
nodeHealthScriptRunner.getHealthReport();
String discReport = dirsHandler.getDisksHealthReport(false);
String exceptionReport = nodeHealthException == null ? null : 
nodeHealthException.getMessage();
String healthReport = Joiner.on(SEPARATOR).skipNulls().join(scriptReport, 
discReport.equals("") ? null : discReport, exceptionReport);

The discReport throws a monkey wrench in the works because it returns 
"" instead of null.  There's probably a more elegant solution than what I did 
above...
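
One possible way around the ""-versus-null mismatch, sketched with `java.util.stream` instead of Guava's Joiner (a substitution, not what the patch does): filter out both null and empty strings before joining. The class name, the `join` helper, and the ";" value given to `SEPARATOR` here are assumptions made for illustration.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; names and the separator value are illustrative only.
final class HealthReportJoinSketch {
  static final String SEPARATOR = ";";

  private HealthReportJoinSketch() {
  }

  // Joins the reports that are neither null nor empty, in the order given.
  static String join(String scriptReport, String discReport,
      String exceptionReport) {
    return Stream.of(scriptReport, discReport, exceptionReport)
        .filter(s -> s != null && !s.isEmpty())
        .collect(Collectors.joining(SEPARATOR));
  }
}
```

Because the filter treats null and "" alike, none of the three sources needs special handling, and the result is "" when nothing is reported.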



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933823#comment-15933823
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107029835
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
 ---
@@ -294,6 +295,14 @@ public Integer call() {
   .setUserLocalDirs(userLocalDirs)
   .setContainerLocalDirs(containerLocalDirs)
   .setContainerLogDirs(containerLogDirs).build());
+} catch (ConfigurationException e) {
+  LOG.error("Failed to launch container.", e);
--- End diff --

Since you know it was a configuration error, you may as well say so in the 
error message.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933819#comment-15933819
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107028842
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/ConfigurationException.java
 ---
@@ -0,0 +1,44 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.yarn.exceptions;
+
+import org.apache.hadoop.classification.InterfaceAudience.Public;
+import org.apache.hadoop.classification.InterfaceStability.Unstable;
+
+/**
+ * This exception is thrown on unrecoverable container launch errors.
--- End diff --

No reason to constrain the use of the exception.  Maybe offer the launch 
errors as an example or suggested use?
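
A rough sketch of javadoc wording along those lines, with the launch error offered only as an example. The class below is a stand-in that extends plain `Exception`; the name, the wording, and the constructors are assumptions, not the contents of the patch.

```java
/**
 * Thrown when a configuration problem is detected.
 * One example is an unrecoverable container launch error caused by a
 * misconfigured container executor; other uses are equally valid.
 */
class ConfigurationExceptionSketch extends Exception {
  private static final long serialVersionUID = 1L;

  /** Creates the exception with a message describing the problem. */
  ConfigurationExceptionSketch(String message) {
    super(message);
  }

  /** Creates the exception with a message and the underlying cause. */
  ConfigurationExceptionSketch(String message, Throwable cause) {
    super(message, cause);
  }
}
```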



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933825#comment-15933825
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107029968
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerRelaunch.java
 ---
@@ -115,6 +116,14 @@ public Integer call() {
   .setContainerLocalDirs(containerLocalDirs)
   .setContainerLogDirs(containerLogDirs)
   .build());
+} catch (ConfigurationException e) {
+  LOG.error("Failed to relaunch container.", e);
--- End diff --

Since you know it was a configuration error, you may as well say so in the 
error message.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933821#comment-15933821
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107029030
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -31,6 +31,8 @@
 
   private NodeHealthScriptRunner nodeHealthScriptRunner;
   private LocalDirsHandlerService dirsHandler;
+  private Exception nodeHealthException;
+  long nodeHealthExceptionReportTime;
--- End diff --

My rule of thumb is that if it's not private, it should have javadocs.



[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933828#comment-15933828
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107029448
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -80,6 +97,7 @@ long getLastHealthReportTime() {
 long lastReportTime = (nodeHealthScriptRunner == null)
--- End diff --

This isn't your code, but it's hideous.  Wanna clean it up, too? :)


> Fail the node, if Linux Container Executor is not configured properly
> -
>
> Key: YARN-6302
> URL: https://issues.apache.org/jira/browse/YARN-6302
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Minor
>
> We have a cluster that has one node with misconfigured Linux Container 
> Executor. Every time an AM or regular container is launched on the cluster, 
> it will fail. The node will still have resources available, so it keeps 
> failing apps until the administrator notices the issue and decommissions the 
> node. AM Blacklisting only helps, if the application is already running.
> As a possible improvement, when the LCE is used on the cluster and a NM gets 
> certain errors back from the LCE, like error 24 configuration not found, we 
> should not try to allocate anything on the node anymore or shut down the node 
> entirely. That kind of problem normally does not fix itself and it means that 
> nothing can really run on that node.
> {code}
> Application application_1488920587909_0010 failed 2 times due to AM Container 
> for appattempt_1488920587909_0010_02 exited with exitCode: -1000
> Failing this attempt.Diagnostics: Application application_1488920587909_0010 
> initialization failed (exitCode=24) with output:
> For more detailed output, check the application tracking page: 
> http://node-1.domain.com:8088/cluster/app/application_1488920587909_0010 Then 
> click on links to logs of each attempt.
> . Failing the application.
> {code}






[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933824#comment-15933824
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107029264
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java
 ---
@@ -54,22 +58,35 @@ protected void serviceInit(Configuration conf) throws Exception {
* @return the reporting string of health of the node
*/
   String getHealthReport() {
+String healthReport = "";
 String scriptReport = (nodeHealthScriptRunner == null) ? ""
 : nodeHealthScriptRunner.getHealthReport();
-if (scriptReport.equals("")) {
-  return dirsHandler.getDisksHealthReport(false);
-} else {
-  return scriptReport.concat(SEPARATOR + dirsHandler.getDisksHealthReport(false));
+String discReport = dirsHandler.getDisksHealthReport(false);
+String exceptionReport = nodeHealthException != null ?
+nodeHealthException.getMessage() : "";
+
+if (!scriptReport.equals("")) {
+  healthReport = scriptReport;
+}
+if (!discReport.equals("")) {
+  healthReport = healthReport.equals("") ? discReport :
+  healthReport.concat(SEPARATOR + discReport);
 }
+if (!exceptionReport.equals("")) {
+  healthReport = healthReport.equals("") ? exceptionReport :
+  healthReport.concat(SEPARATOR + exceptionReport);
+}
+return healthReport;
   }
 
   /**
* @return true if the node is healthy
*/
   boolean isHealthy() {
-boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
-: nodeHealthScriptRunner.isHealthy();
-return scriptHealthStatus && dirsHandler.areDisksHealthy();
+boolean scriptHealthStatus = nodeHealthScriptRunner == null ||
--- End diff --

Maybe rename this one to scriptHealthy.
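
For reference, the three-way concatenation in getHealthReport() above could probably be expressed as a join over the non-empty reports. This is only a sketch, not the code in the pull request: the class name, the joinReports helper, and the ";" separator value are assumptions made for illustration (the actual SEPARATOR constant is not shown in the diff).

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-alone sketch; not part of NodeHealthCheckerService.
final class HealthReportSketch {
  // Assumed value; the real SEPARATOR constant is not visible in the diff.
  static final String SEPARATOR = ";";

  /** Joins the non-null, non-empty reports with SEPARATOR, keeping their order. */
  static String joinReports(String... reports) {
    return Stream.of(reports)
        .filter(Objects::nonNull)
        .filter(r -> !r.isEmpty())
        .collect(Collectors.joining(SEPARATOR));
  }
}
```

Calling joinReports(scriptReport, discReport, exceptionReport) would yield the same string as the chained if blocks in the diff, and an empty string when all three reports are empty.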





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933822#comment-15933822
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107028894
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -111,6 +113,58 @@
   private LinuxContainerRuntime linuxContainerRuntime;
 
   /**
+   * The container exit code.
+   */
+  public enum LinuxContainerExecutorExitCode {
--- End diff --

Love it!





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933826#comment-15933826
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107028658
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
 ---
@@ -525,6 +580,23 @@ public int launchContainer(ContainerStartContext ctx) throws IOException {
 logOutput(diagnostics);
 container.handle(new ContainerDiagnosticsUpdateEvent(containerId,
 diagnostics));
+if (exitCode == LinuxContainerExecutorExitCode.
--- End diff --

Would it be cleaner to create a new LinuxContainerExecutorExitCode from 
your exitCode and then test via ==?
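
One way to read this suggestion is a lookup from the raw integer to an enum constant, so that call sites compare enum values with ==. The sketch below is illustrative only: the enum name, the fromCode method, and the UNKNOWN fallback are hypothetical. The values 1 and 3 come from the errorcodes enum quoted from container-executor.h elsewhere in this thread; the constant name for 24 is an assumption based on the "error 24 configuration not found" wording in the issue description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an exit-code enum with a reverse lookup.
enum ExitCodeSketch {
  INVALID_ARGUMENT_NUMBER(1),
  INVALID_COMMAND_PROVIDED(3),
  INVALID_CONFIG_FILE(24),   // assumed name for "error 24 configuration not found"
  UNKNOWN(-1);               // hypothetical fallback for unmapped codes

  // Built once, after all constants exist, so lookups are O(1).
  private static final Map<Integer, ExitCodeSketch> BY_CODE = new HashMap<>();

  static {
    for (ExitCodeSketch c : values()) {
      BY_CODE.put(c.code, c);
    }
  }

  private final int code;

  ExitCodeSketch(int code) {
    this.code = code;
  }

  int getCode() {
    return code;
  }

  /** Maps a raw process exit code to a constant, or UNKNOWN if unmapped. */
  static ExitCodeSketch fromCode(int code) {
    return BY_CODE.getOrDefault(code, UNKNOWN);
  }
}
```

With something like this, the check in launchContainer could read `ExitCodeSketch.fromCode(exitCode) == ExitCodeSketch.INVALID_CONFIG_FILE` instead of comparing against an int extracted from the enum.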





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15933820#comment-15933820
 ] 

ASF GitHub Bot commented on YARN-6302:
--

Github user templedf commented on a diff in the pull request:

https://github.com/apache/hadoop/pull/200#discussion_r107033241
  
--- Diff: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.h
 ---
@@ -37,8 +37,8 @@ enum command {
 
 enum errorcodes {
   INVALID_ARGUMENT_NUMBER = 1,
-  INVALID_USER_NAME, //2
-  INVALID_COMMAND_PROVIDED, //3
+  //INVALID_USER_NAME 2
--- End diff --

This section of code makes me want to weep.





[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-15 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927144#comment-15927144
 ] 

Miklos Szegedi commented on YARN-6302:
--

Not all of these errors can be identified at startup time (directory 
permissions, for example). Moreover, the configuration can go wrong while the 
NodeManager is running, and we would like to cover those cases as well.
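
To illustrate the runtime case, here is a minimal sketch of how a failure seen during a container launch could flip the node to unhealthy. It is not the patch itself: the class name and the reportException method are hypothetical, and only the nodeHealthException field name and the getMessage() fallback to an empty string are taken from the NodeHealthCheckerService diff under review.

```java
// Hypothetical sketch; the real logic lives in NodeHealthCheckerService.
final class NodeHealthSketch {
  // Field name taken from the diff; volatile because launches and health
  // checks may run on different threads (an assumption of this sketch).
  private volatile Exception nodeHealthException;

  /** Hypothetical hook: called when a launch hits an unrecoverable error. */
  void reportException(Exception e) {
    nodeHealthException = e;
  }

  /** Healthy only if script and disks are fine and no exception was recorded. */
  boolean isHealthy(boolean scriptHealthy, boolean disksHealthy) {
    return scriptHealthy && disksHealthy && nodeHealthException == null;
  }

  /** Mirrors the diff: the exception message, or an empty string if none. */
  String exceptionReport() {
    Exception e = nodeHealthException;
    return e != null ? e.getMessage() : "";
  }
}
```

Because the exception is recorded when the launch fails rather than at startup, a configuration that breaks while the NodeManager is running would still be reflected in the next health check.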




[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-15 Thread Haibo Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15927044#comment-15927044
 ] 

Haibo Chen commented on YARN-6302:
--

[~miklos.szeg...@cloudera.com] Can you elaborate on what types of errors 
you are targeting? If they can be validated or caught at NM startup time, we 
could do it in ContainerExecutor.init().




[jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly

2017-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901945#comment-15901945
 ] 

ASF GitHub Bot commented on YARN-6302:
--

GitHub user szegedim opened a pull request:

https://github.com/apache/hadoop/pull/200

YARN-6302 Fail the node, if Linux Container Executor is not configured 
properly

YARN-6302 Fail the node, if Linux Container Executor is not configured 
properly

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/szegedim/hadoop YARN-6302

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hadoop/pull/200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #200


commit cb97a1911c0df3528c49aa0ba96e7bc6233d630a
Author: Miklos Szegedi 
Date:   2017-03-07T22:35:16Z

YARN-6302 Throw on error 24

Change-Id: Ia676061fd49cc7f54dbd9ae22bb999d4ea8a965b

commit 6f7872e99f5be813c74493dd204e14355049659d
Author: Miklos Szegedi 
Date:   2017-03-08T03:37:10Z

YARN-6302 Shutdown on error 24

Change-Id: Ib17d4a357b6fdf1a6d940f0641770054f1f73e81

commit 03f4cd8a1391360ea3d7790b1044421eb05d6d2d
Author: Miklos Szegedi 
Date:   2017-03-08T19:47:03Z

YARN-6302 Mark node unhealthy on error 24

Change-Id: Ib1e7215f9dac6825bda2eb54707782c59f19eb0c



