[
https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gera Shegalov updated YARN-1996:
--------------------------------
Attachment: YARN-1996.v01.patch
> Provide alternative policies for UNHEALTHY nodes.
> -------------------------------------------------
>
> Key: YARN-1996
> URL: https://issues.apache.org/jira/browse/YARN-1996
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, scheduler
> Affects Versions: 2.4.0
> Reporter: Gera Shegalov
> Assignee: Gera Shegalov
> Attachments: YARN-1996.v01.patch
>
>
> Currently, UNHEALTHY nodes can significantly prolong the execution of large,
> expensive jobs, as demonstrated by MAPREDUCE-5817, and degrade cluster
> health even further due to [positive
> feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A set of containers
> that may have caused the node to be deemed unhealthy in the first place starts
> spreading across the cluster, because the current node is declared unusable and
> all its containers are killed and rescheduled on different nodes.
> To mitigate this, we are experimenting with a patch that allows containers
> already running on a node that turns UNHEALTHY to complete (drain), while no
> new containers can be assigned to it until it turns healthy again.
> This mechanism can also be used for graceful decommissioning of an NM. To this
> end, we have to write a health script that can deterministically report
> UNHEALTHY. For example:
> {code}
> # $1: path to a marker file; the script reports ERROR while the file exists
> if [ -e "$1" ] ; then
>   echo "ERROR Node decommissioning via health script hack"
> fi
> {code}
> In the current version of the patch, the behavior is controlled by a boolean
> property {{yarn.nodemanager.unheathy.drain.containers}}. More versatile
> policies are possible in future work. Currently, the health state of a
> node is binary, determined by the disk checker and the health script's
> ERROR outputs. However, we could just as well interpret health script output
> in terms of levels similar to Java logging levels (one of which is ERROR), such
> as WARN and FATAL. Each level can then be treated differently, e.g.:
> - FATAL: unusable like today
> - ERROR: drain
> - WARN: halve the node capacity.
> complemented with some equivalence rules such as 3 WARN messages == ERROR,
> 2*ERROR == FATAL, etc.
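
The level-based policy proposed above can be sketched in shell as follows. This is only an illustration under stated assumptions: the function name `classify_health_output` and the labels UNUSABLE, DRAIN, HALVE_CAPACITY and HEALTHY are hypothetical and are not part of YARN or of the attached patch. The sketch assumes each health script output line starts with its level, and it applies the two example equivalence rules (3 WARN == ERROR, 2*ERROR == FATAL).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map health script output to a node policy.
# Names and labels are illustrative assumptions, not YARN APIs.

classify_health_output() {
  # Reads health script output on stdin and prints one policy label.
  local warn=0 error=0 fatal=0 line
  while IFS= read -r line; do
    case "$line" in
      FATAL*) fatal=$((fatal + 1)) ;;
      ERROR*) error=$((error + 1)) ;;
      WARN*)  warn=$((warn + 1)) ;;
    esac
  done

  # Equivalence rules from the description:
  # every 3 WARN count as one ERROR, every 2 ERROR count as one FATAL.
  error=$((error + warn / 3))
  warn=$((warn % 3))
  fatal=$((fatal + error / 2))
  error=$((error % 2))

  if [ "$fatal" -gt 0 ]; then
    echo UNUSABLE          # FATAL: unusable, like today
  elif [ "$error" -gt 0 ]; then
    echo DRAIN             # ERROR: let running containers finish
  elif [ "$warn" -gt 0 ]; then
    echo HALVE_CAPACITY    # WARN: halve the node capacity
  else
    echo HEALTHY
  fi
}
```

For instance, piping three lines that each start with WARN into `classify_health_output` prints DRAIN, because the three warnings are promoted to a single ERROR. How the remaining levels should combine (for example, one ERROR plus two WARN) is left open in the description; the sketch simply reports the highest level that remains after promotion.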
--
This message was sent by Atlassian JIRA
(v6.2#6252)