Jason Lowe commented on YARN-2005:
bq. 2 or more different applications failed on a node. Such nodes can be given
lesser priority (lower rank) in scheduling AM for newer application.
The problem with that is a workflow that spams bad applications. Many separate
applications fail in that scenario, so it depends on what you mean by
"different" applications. Is that different users, app names, or ...?
bq. In worst cases, lowest ranked NM can still be scheduled for a new AM.
Also, ranking alone is not sufficient. We've seen instances on busy clusters
where a bad node was the only node with free resources, and all the AM
attempts were scheduled on it in quick succession, causing the overall
application to fail. Relying solely on node weighting is not going to prevent
that problem, since the only eligible node at the time the scheduler wants to
schedule is a bad one. In addition to pure node ordering based on weighting,
the scheduler needs some kind of weight threshold below which it will refuse to
use the node completely. As mentioned, this weight could be modulated with some
time-based metric to allow even poor nodes to be tried if we wait too long.
However, we need a "don't even go there" level to avoid rapid rescheduling of
failed AM attempts on the same node in a busy cluster scenario.
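To make the threshold idea concrete, here is a rough sketch of how it might
look. This is not existing RM code: AmNodeSelector, Node, baseThreshold and
relaxAfterMs are all hypothetical names, the weight is assumed to be a health
score in [0, 1] where higher is better, and the linear relaxation over time is
just one possible choice of time-based metric.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch only; none of these names exist in YARN. */
final class AmNodeSelector {

    /** Candidate node with an assumed health weight in [0, 1]; higher is healthier. */
    static final class Node {
        final String host;
        final double weight;

        Node(String host, double weight) {
            this.host = host;
            this.weight = weight;
        }
    }

    private final double baseThreshold; // the "don't even go there" level
    private final long relaxAfterMs;    // wait after which any node may be tried

    AmNodeSelector(double baseThreshold, long relaxAfterMs) {
        this.baseThreshold = baseThreshold;
        this.relaxAfterMs = relaxAfterMs;
    }

    /** Threshold shrinks linearly from baseThreshold to 0 as the AM keeps waiting. */
    double effectiveThreshold(long waitedMs) {
        long waited = Math.max(0L, waitedMs);
        if (waited >= relaxAfterMs) {
            return 0.0;
        }
        return baseThreshold * (1.0 - (double) waited / relaxAfterMs);
    }

    /**
     * Picks the best-weighted node at or above the current threshold.
     * Returns empty when every node with free resources is too poor,
     * so the AM attempt waits instead of landing on a bad node.
     */
    Optional<Node> pick(List<Node> nodesWithFreeResources, long waitedMs) {
        double threshold = effectiveThreshold(waitedMs);
        return nodesWithFreeResources.stream()
                .filter(n -> n.weight >= threshold)
                .max(Comparator.comparingDouble(n -> n.weight));
    }
}
```

With a base threshold of 0.5, a node weighted 0.2 that happens to be the only
one with free resources is refused at first, so back-to-back AM attempts do not
pile onto it. It only becomes eligible once the attempt has waited long enough
for the threshold to relax down to its weight.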
> Blacklisting support for scheduling AMs
> Key: YARN-2005
> URL: https://issues.apache.org/jira/browse/YARN-2005
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 0.23.10, 2.4.0
> Reporter: Jason Lowe
> It would be nice if the RM supported blacklisting a node for an AM launch
> after the same node fails a configurable number of AM attempts. This would
> be similar to the blacklisting support for scheduling task attempts in the
> MapReduce AM but for scheduling AM attempts on the RM side.