[ 
https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927144#comment-15927144
 ] 

Miklos Szegedi commented on YARN-6302:
--------------------------------------

Not all of these, like directory permissions can be identified at startup time, 
moreover the configuration can go wrong while nodemanager is running and we 
would like to cover those cases as well.

> Fail the node, if Linux Container Executor is not configured properly
> ---------------------------------------------------------------------
>
>                 Key: YARN-6302
>                 URL: https://issues.apache.org/jira/browse/YARN-6302
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Miklos Szegedi
>            Assignee: Miklos Szegedi
>            Priority: Minor
>
> We have a cluster that has one node with misconfigured Linux Container 
> Executor. Every time an AM or regular container is launched on the cluster, 
> it will fail. The node will still have resources available, so it keeps 
> failing apps until the administrator notices the issue and decommissions the 
> node. AM Blacklisting only helps, if the application is already running.
> As a possible improvement, when the LCE is used on the cluster and a NM gets 
> certain errors back from the LCE, like error 24 configuration not found, we 
> should not try to allocate anything on the node anymore or shut down the node 
> entirely. That kind of problem normally does not fix itself and it means that 
> nothing can really run on that node.
> {code}
> Application application_1488920587909_0010 failed 2 times due to AM Container 
> for appattempt_1488920587909_0010_000002 exited with exitCode: -1000
> Failing this attempt.Diagnostics: Application application_1488920587909_0010 
> initialization failed (exitCode=24) with output:
> For more detailed output, check the application tracking page: 
> http://node-1.domain.com:8088/cluster/app/application_1488920587909_0010 Then 
> click on links to logs of each attempt.
> . Failing the application.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to