[
https://issues.apache.org/jira/browse/YARN-8345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kartik Bhatia resolved YARN-8345.
---------------------------------
Resolution: Duplicate
> NodeHealthCheckerService to differentiate between reason for UnusableNodes
> for client to act suitably on it
> -----------------------------------------------------------------------------------------------------------
>
> Key: YARN-8345
> URL: https://issues.apache.org/jira/browse/YARN-8345
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Kartik Bhatia
> Priority: Major
>
> +*Current Scenario:*+
> NodeHealthCheckerService marks a node unhealthy on the basis of two checks:
> # External health script
> # Directory status
> If a directory is marked as full (per the disk-check configs in
> yarn-site.xml), the NodeManager marks the node unhealthy. Once a node is
> marked unhealthy, MapReduce relaunches all the map tasks that ran on that
> now-unusable node, so even successfully completed tasks are rerun.
> +*Problem:*+
> There is no distinction between the disk limit at which container launches
> should stop on a node and the limit beyond which reducers can no longer
> read map output from that node.
> For example, consider a 3 TB disk with the max disk utilisation percentage
> set to 95% (launching a container requires roughly 0.15 TB for jobs in our
> cluster). On nodes where utilisation reaches, say, 96%, the threshold is
> breached and the NodeManager marks them unhealthy, so all successful
> mappers are relaunched on other nodes. Yet the remaining 4% of disk space
> is still enough for reducers to read the map output from those nodes, so
> this causes unnecessary delay in our jobs. (Relaunched mappers can also
> preempt reducers when space is tight, and there are related issues with
> headroom calculation in the CapacityScheduler.)
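For reference, the disk-utilisation threshold in the example above corresponds to the stock NodeManager disk-health-checker property in yarn-site.xml (the property name is standard YARN; the 95% value is this example's, not YARN's default of 90.0):

```xml
<!-- yarn-site.xml: a NodeManager local/log dir is marked "full" (and the
     node eventually unhealthy) once its utilisation exceeds this
     percentage. 95.0 matches the example above; YARN's default is 90.0. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
```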
>
> +*Correction:*+
> We need a state (say UNUSABLE_WRITE) that lets MapReduce know the node is
> still good for reading data, so that successful mappers are not relaunched.
> This would prevent the delay.
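A minimal sketch of the proposed split, assuming two thresholds instead of one. The class, method, and UNUSABLE_WRITE state below are illustrative names for this proposal, not existing YARN API:

```java
// Hypothetical sketch: classify a node by disk utilisation with two
// thresholds. NodeUsability, classify(), and UNUSABLE_WRITE are
// illustrative names for this proposal, not YARN API.
public class NodeUsability {

    enum State { HEALTHY, UNUSABLE_WRITE, UNHEALTHY }

    // writeLimitPct: beyond this, no new containers should launch here.
    // readLimitPct:  beyond this, even serving map output is unsafe.
    static State classify(double diskUtilPct,
                          double writeLimitPct,
                          double readLimitPct) {
        if (diskUtilPct > readLimitPct) {
            return State.UNHEALTHY;      // relaunch mappers elsewhere
        }
        if (diskUtilPct > writeLimitPct) {
            return State.UNUSABLE_WRITE; // map output stays readable
        }
        return State.HEALTHY;
    }
}
```

With write limit 95% and read limit 99%, the 96%-full node from the example above would be UNUSABLE_WRITE: closed to new containers, but its successful map output remains readable, so those mappers need not rerun.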
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]