[
https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966240#comment-14966240
]
Sunil G commented on YARN-4284:
-------------------------------
Hi [~sjlee0]
As part of YARN-2293, we were looking into a proposal to score NMs based on
their performance (attempt failures, launch failures, disk crashes, etc. would
decrement the score). For every application's AM, it is always best to
schedule onto the highest-ranked NM (the best performer so far).
But this is a very generic proposal, so we thought of achieving it step by
step, and YARN-2005 was the first step, as suggested by [~jlowe]. As for this
improvement, your proposal is very much the next step, and it can give a
better probability of a successful AM container launch for all applications.
Currently there is a chance that a new application's first AM attempt will
still fail and only the second one will succeed, thanks to AM blacklisting.
The difficulty in achieving this is separating general failures (disk crashes,
JVM launch problems) from application-specific errors (some AM containers may
not run on a node because of its memory or some other factor). If we cannot
separate them, there is a chance that one specific problem with an AM on a
node will cause that node to be blacklisted for all other apps, which may be
dangerous.
So, in line with your thoughts, I think we can collect all the other container
launch issues, segregate out the generic errors, and blacklist the node for a
period of time. +1 for this. Thoughts?
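To make the idea concrete, here is a rough sketch of what "segregate to generic
errors, and blacklist for a period of time" could look like. Everything in it
is hypothetical: the class, the FailureKind enum, and the choice of which
failures count as generic are illustrative assumptions, not existing YARN APIs
or the contents of the attached patch.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these names exist in YARN. */
class AmBlacklistSketch {

    /** Illustrative failure categories (an assumption, not YARN's exit statuses). */
    public enum FailureKind {
        DISKS_FAILED, DISKS_FULL, JVM_LAUNCH_FAILED,   // node-level ("generic") problems
        APP_MEMORY_EXCEEDED, APP_ERROR                 // application-specific problems
    }

    // Assumed split: only these failures say something about the node itself.
    private static final Set<FailureKind> GENERIC = EnumSet.of(
        FailureKind.DISKS_FAILED, FailureKind.DISKS_FULL, FailureKind.JVM_LAUNCH_FAILED);

    private final long blacklistMillis;
    // Node id -> time (ms) at which its blacklist entry expires.
    private final Map<String, Long> expiryByNode = new HashMap<>();

    AmBlacklistSketch(long blacklistMillis) {
        this.blacklistMillis = blacklistMillis;
    }

    static boolean isGeneric(FailureKind kind) {
        return GENERIC.contains(kind);
    }

    /** Blacklists the node only when the failure is generic; app-specific failures are ignored. */
    void recordFailure(String node, FailureKind kind, long nowMillis) {
        if (isGeneric(kind)) {
            expiryByNode.put(node, nowMillis + blacklistMillis);
        }
    }

    /** Returns true while the node's blacklist entry has not yet expired. */
    boolean isBlacklisted(String node, long nowMillis) {
        Long expiry = expiryByNode.get(node);
        if (expiry == null) {
            return false;
        }
        if (nowMillis >= expiry) {
            expiryByNode.remove(node);   // entry expired; node is eligible again
            return false;
        }
        return true;
    }
}
```

With this split, an application-specific failure never affects other apps, while
a generic failure keeps new AMs off the node only until the entry expires.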
> condition for AM blacklisting is too narrow
> -------------------------------------------
>
> Key: YARN-4284
> URL: https://issues.apache.org/jira/browse/YARN-4284
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.0
> Reporter: Sangjin Lee
> Assignee: Sangjin Lee
> Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the
> next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is
> limited to {{DISKS_FAILED}}. There are a whole host of other issues that may
> cause the failure, for which we want to locate the AM elsewhere; e.g. disks
> full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in
> blacklisting the nodes on *any failure* (although it might lead to
> blacklisting the node more aggressively than necessary). I would propose
> locating the next app attempt to a different node on any failure.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)