[jira] [Comment Edited] (YARN-5140) NM usercache fill up with burst of jobs leading to rapid temp IO FS fill up and potentially NM outage

Oleksandr Kalinin (JIRA) Sun, 31 Jul 2016 06:03:48 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15400654#comment-15400654
 ]


Oleksandr Kalinin edited comment on YARN-5140 at 7/31/16 1:02 PM:
------------------------------------------------------------------

A workaround for this issue is to set yarn.nodemanager.local-dirs explicitly 
so that it points to directories on the DFS disks.

The default value, ${hadoop.tmp.dir}/nm-local-dir, places the NM local 
directory on the local FS of the system disk in most installations. Besides 
the FS fill-up risk explained in the description, this setup does not scale 
and performs poorly under heavy localization as well as during some workload 
phases, such as the Spark on YARN shuffle.

Perhaps these drawbacks of using a single local FS directory should be better 
documented in the description of the yarn.nodemanager.local-dirs parameter.
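
For illustration, such a configuration might look like the following 
yarn-site.xml fragment. The mount points /data/1 and /data/2 are hypothetical; 
substitute the directories where the DFS disks are actually mounted on the 
node.

```xml
<!-- yarn-site.xml (sketch; the paths below are hypothetical examples) -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <!-- Comma-separated list, one directory per DFS disk (assumed mount points) -->
  <value>/data/1/yarn/nm-local-dir,/data/2/yarn/nm-local-dir</value>
</property>
```

Listing one directory per disk moves localization and shuffle IO off the 
system disk and lets the NM spread it across several disks.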


> NM usercache fill up with burst of jobs leading to rapid temp IO FS fill up 
> and potentially NM outage
> -----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5140
>                 URL: https://issues.apache.org/jira/browse/YARN-5140
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>         Environment: Linux RHEL 6.7, Hadoop 2.7.0
>            Reporter: Oleksandr Kalinin
>            Priority: Minor
>
> A burst or rapid rate of submitted jobs with a substantial NM usercache 
> resource localization footprint may lead to rapid fill-up of the NM local 
> temporary IO FS (/tmp by default), with negative consequences for 
> stability.
> The core issue seems to be that the NM continues to localize resources 
> beyond the maximum local cache size 
> (yarn.nodemanager.localizer.cache.target-size-mb, default 10G). Since the 
> maximum local cache size is effectively not taken into account when 
> localizing new resources (note that the default cache cleanup interval is 
> 10 min, controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), 
> this leads to a kind of self-destruction scenario: once /tmp FS utilization 
> reaches the threshold of 90%, the NM automatically de-registers from the RM, 
> effectively causing an NM outage.
> This issue may take many NMs offline at the same time and is thus quite 
> critical in terms of platform stability.
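
For reference, the settings mentioned in the description can be written out 
as a yarn-site.xml fragment with their default values. This is only a sketch: 
it assumes that the 90% threshold is the one controlled by 
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, 
a property the description does not name.

```xml
<!-- yarn-site.xml (sketch; values shown are the defaults) -->
<property>
  <!-- Target size of the local resource cache, in MB (10G) -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>10240</value>
</property>
<property>
  <!-- Interval between cache cleanup runs, in ms (10 min) -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <!-- Assumed source of the 90% utilization threshold in the description -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
```

A shorter cleanup interval may narrow the window in which the cache can 
overshoot its target size, but it does not change the fact that localization 
itself is not bounded by that target.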



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
