[ https://issues.apache.org/jira/browse/YARN-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159488#comment-15159488 ]
Ray Chiang commented on YARN-3607:
----------------------------------
Two suggestions:
1) Since this is a setting that affects all daemons, it makes sense to have one
setting per daemon type, such as yarn.resourcemanager.fail-fast and
yarn.nodemanager.fail-fast.
2) There are going to be a lot of places in the YARN code where this variable
could be checked. I'm thinking the first task/subtask would be to just add the
variable definitions now and then let the functionality be added where it's
appropriate.
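
A minimal sketch of what "just add the variable definitions" might look like.
The two key names come from suggestion 1; the class name, the helper method,
the default of false, and the use of a plain Map in place of Hadoop's
Configuration class are all assumptions made for illustration, not the actual
patch.

```java
import java.util.HashMap;
import java.util.Map;

public class FailFastSketch {
    // Key names taken from suggestion 1 above.
    public static final String RM_FAIL_FAST = "yarn.resourcemanager.fail-fast";
    public static final String NM_FAIL_FAST = "yarn.nodemanager.fail-fast";

    // Assumption: when a key is unset, keep the daemon running (no fail-fast).
    public static final boolean DEFAULT_FAIL_FAST = false;

    // Hypothetical helper: returns the fail-fast setting for one daemon type.
    // A plain Map stands in for the real configuration object.
    public static boolean shouldFailFast(Map<String, String> conf, String key) {
        String value = conf.get(key);
        return value == null ? DEFAULT_FAIL_FAST : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(RM_FAIL_FAST, "true"); // only the RM opts in to fail-fast

        System.out.println("RM fail-fast: " + shouldFailFast(conf, RM_FAIL_FAST));
        System.out.println("NM fail-fast: " + shouldFailFast(conf, NM_FAIL_FAST));
    }
}
```

With definitions like these in place, each code path that currently decides
case-by-case could later consult the relevant key, which is the incremental
approach suggestion 2 describes.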
> Allow users to choose between failing the daemons vs failing the
> apps/containers
> --------------------------------------------------------------------------------
>
> Key: YARN-3607
> URL: https://issues.apache.org/jira/browse/YARN-3607
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager, resourcemanager, scheduler
> Affects Versions: 2.7.0
> Reporter: Karthik Kambatla
> Assignee: Ray Chiang
>
> We often run into cases where we are faced with the option of failing the
> daemon (fail-fast) vs failing the user's work and keeping the cluster running. There
> is no clear right way to handle these situations - some users would like to
> be conservative and let the daemons run, while others would like to
> fail-fast.
> Today, we handle these case-by-case and go by what the people working on it
> feel is the right way to handle things. Examples include how we handle app
> recovery failures and queue changes on RM restart.
> Users should be able to choose between these two extremes, and have all these
> situations handled the same way.