[jira] [Commented] (YARN-4790) Per user blacklist node for user specific error for container launch failure.

2016-03-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201156#comment-15201156
 ] 

Vinod Kumar Vavilapalli commented on YARN-4790:
---

bq. when enabling LinuxContainerExecutor, but some node doesn't have such user 
exists
As for the root cause reported on this JIRA, it invalidates a fundamental 
assumption of LinuxContainerExecutor: we assume that user accounts 
corresponding to all job submitters are present on all the machines. If they 
are not, it is a gross misconfiguration of the system and should be handled by 
lower layers such as installers / management systems.

If it is really deemed that we should support different user-accounts on 
different hosts (for whatever reason), then the right way to look at solving 
that problem is by recognizing user-accounts as a resource on each host - kind 
of like node-constraints. Blacklisting that node for an app is absolutely the 
wrong way to go about it.
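
To illustrate the node-constraint idea, here is a minimal sketch that treats the set of local user accounts as a per-host attribute and filters candidate nodes before placement. It is not YARN code: the `Node` class, the `eligible_nodes` function, the host names and the user `alice` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a host: its name and the accounts that exist on it."""
    host: str
    user_accounts: set = field(default_factory=set)


def eligible_nodes(nodes, submitting_user):
    """Return the nodes whose local accounts include the submitting user."""
    return [n for n in nodes if submitting_user in n.user_accounts]


# Hypothetical cluster: only host1 has an account for "jdu".
nodes = [
    Node("host1", {"jdu", "alice"}),
    Node("host2", {"alice"}),
]
candidates = [n.host for n in eligible_nodes(nodes, "jdu")]
```

With this framing a host that lacks the account is never a placement candidate for that user in the first place, so no failure has to occur before the node is avoided.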

> Per user blacklist node for user specific error for container launch failure.
> -
>
> Key: YARN-4790
> URL: https://issues.apache.org/jira/browse/YARN-4790
> Project: Hadoop YARN
> Issue Type: Bug
> Components: applications
> Reporter: Junping Du
> Assignee: Junping Du
>
> Some container launch failures are user specific. For example, when 
> LinuxContainerExecutor is enabled but a node does not have the submitting 
> user's account, container launch fails with the following information:
> {noformat}
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1434045496283_0036_02 State change from LAUNCHED to FAILED 
> 2016-02-14 15:37:03,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1434045496283_0036 failed 2 times due to AM Container for 
> appattempt_1434045496283_0036_02 exited with exitCode: -1000 due to: 
> Application application_1434045496283_0036 initialization failed 
> (exitCode=255) with output: User jdu not found 
> {noformat}
> Obviously, this node is not suitable for launching containers for this user's 
> other applications. We need a per-user blacklist tracking mechanism rather 
> than a per-application one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4790) Per user blacklist node for user specific error for container launch failure.

2016-03-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191217#comment-15191217
 ] 

Vinod Kumar Vavilapalli commented on YARN-4790:
---

I agree with the problem statement but not necessarily with the proposal. 
Please edit the title so that it describes only the problem, so that we can 
then figure out whatever the solution is.

What we need is to *not* penalize applications for system-related issues. When 
YARN finds a node with configuration / permission issues, it should itself take 
action to (a) avoid scheduling on that node and (b) alert administrators.
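
A minimal sketch of that behaviour, assuming the platform can recognize a system-level failure from the container diagnostics. The pattern below is modelled on the "User jdu not found" output quoted in this issue; `ClusterState`, `handle_launch_failure` and the host names are hypothetical and do not correspond to any existing YARN API.

```python
import re
from dataclasses import dataclass, field

# Assumption: a missing account shows up in diagnostics as "User <name> not found".
MISSING_USER = re.compile(r"User \S+ not found")


@dataclass
class ClusterState:
    """Hypothetical bookkeeping kept by the platform, not by applications."""
    unschedulable: set = field(default_factory=set)
    alerts: list = field(default_factory=list)


def handle_launch_failure(state, host, diagnostics):
    """Return True if the failure should be charged to the application."""
    if MISSING_USER.search(diagnostics):
        state.unschedulable.add(host)                  # (a) avoid scheduling on that node
        state.alerts.append(f"{host}: {diagnostics}")  # (b) alert administrators
        return False                                   # do not penalize the application
    return True


state = ClusterState()
charged = handle_launch_failure(
    state, "host2", "initialization failed (exitCode=255) with output: User jdu not found"
)
```

The point of the sketch is the return value: a failure classified as a system issue does not count against the application's attempts, and the node is handled by the platform rather than by a per-app or per-user blacklist.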

Implementing heuristics for app / user level blacklisting to work around 
platform problems should be a last-ditch effort. We did that in Hadoop 1 
MapReduce because we had no clear demarcation between app and system failures. 
That isn't the case with YARN, which is part of the reason why we never 
implemented heuristics-based per-app blacklisting *in YARN*; we left that 
completely up to applications.



