[ 
https://issues.apache.org/jira/browse/YARN-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191217#comment-15191217
 ] 

Vinod Kumar Vavilapalli commented on YARN-4790:
-----------------------------------------------

I agree with the problem statement but not necessarily with the proposal. 
Please edit the title so that it describes only the problem; we can then work 
out what the solution should be.

What we need is to *not* penalize applications for system-related issues. When 
YARN finds a node with configuration / permission issues, it should take action 
itself, for example (a) avoid scheduling on that node and (b) alert 
administrators.
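
A minimal sketch of that behaviour, for illustration only. Every name below 
(NodeFailureHandler, FailureCause, classify, onLaunchFailure) is hypothetical 
and not an existing YARN API. The rule that treats "User ... not found" as a 
system failure is an assumption taken from the log in the issue description.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch; none of these types exist in YARN. */
class NodeFailureHandler {

    /** Who is at fault for a container launch failure. */
    enum FailureCause { APPLICATION, SYSTEM }

    // Assumed heuristic: a missing local user means the node is misconfigured.
    private static final Pattern MISSING_USER = Pattern.compile("User \\S+ not found");

    private final Set<String> excludedNodes = new HashSet<>();
    private final List<String> adminAlerts = new ArrayList<>();

    /** Classifies a failure from its diagnostics text. */
    static FailureCause classify(String diagnostics) {
        if (diagnostics != null && MISSING_USER.matcher(diagnostics).find()) {
            return FailureCause.SYSTEM;
        }
        return FailureCause.APPLICATION;
    }

    /**
     * Handles a launch failure on the given node.
     *
     * @return true if the failure should count against the application,
     *         false if the platform absorbed it as a system issue
     */
    boolean onLaunchFailure(String nodeId, String diagnostics) {
        if (classify(diagnostics) == FailureCause.SYSTEM) {
            // (a) stop scheduling on the node; (b) alert once per node.
            if (excludedNodes.add(nodeId)) {
                adminAlerts.add("Node " + nodeId + " excluded: " + diagnostics);
            }
            return false;
        }
        return true;
    }

    /** True if the scheduler may still place containers on the node. */
    boolean isSchedulable(String nodeId) {
        return !excludedNodes.contains(nodeId);
    }

    /** Alerts raised so far, in order. */
    List<String> alerts() {
        return adminAlerts;
    }
}
```

The point of the design is that the platform, not the application, decides 
that the node is bad, so a failure classified as SYSTEM never counts against 
the application's attempt limit.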

Implementing heuristics for app / user level blacklisting to work around 
platform problems should be a last-ditch effort. We did that in Hadoop 1 
MapReduce because we didn't have a clear demarcation between app and system 
failures. That isn't the case with YARN, which is part of the reason why we 
never implemented heuristics-based per-app blacklisting *in YARN*; we left that 
completely up to applications.

> Per user blacklist node for user specific error for container launch failure.
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4790
>                 URL: https://issues.apache.org/jira/browse/YARN-4790
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>            Reporter: Junping Du
>            Assignee: Junping Du
>
> Some container launch failures are user specific. For example, when 
> LinuxContainerExecutor is enabled but the user does not exist on some node, 
> container launch fails with the following information:
> {noformat}
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1434045496283_0036_000002 State change from LAUNCHED to FAILED 
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1434045496283_0036 failed 2 times due to AM Container for 
> appattempt_1434045496283_0036_000002 exited with exitCode: -1000 due to: 
> Application application_1434045496283_0036 initialization failed 
> (exitCode=255) with output: User jdu not found 
> {noformat}
> Obviously, this node is not suitable for launching containers for this user's 
> other applications. We now need a per-user blacklist tracking mechanism 
> rather than a per-application one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
