Oleksandr Kalinin created YARN-5140:
---------------------------------------

             Summary: NM usercache fill up with burst of jobs leads to NM outage
                 Key: YARN-5140
                 URL: https://issues.apache.org/jira/browse/YARN-5140
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.7.0
         Environment: Linux RHEL 6.7, Hadoop 2.7.0


            Reporter: Oleksandr Kalinin


A burst or sustained high rate of submitted jobs with a substantial NM
usercache resource localization footprint may rapidly fill up the NM local
temporary file system (/tmp by default), with negative consequences for
stability.

The core issue seems to be that the NM continues to localize resources beyond
the maximum local cache size
(yarn.nodemanager.localizer.cache.target-size-mb, default 10G). Since the
maximum local cache size is effectively not taken into account when localizing
new resources (note that the default cache cleanup interval is 10 min,
controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this
leads to a kind of self-destruction scenario: once /tmp FS utilization reaches
the 90% threshold, the NM automatically de-registers from the RM, effectively
causing an NM outage.
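
To make the timing concrete, here is a rough back-of-the-envelope sketch in
Python. It is not NM code. It assumes a constant localization rate, assumes
that nothing is evicted between two cache cleanup runs, and uses hypothetical
function names and example figures (a 50G /tmp with 10G already used) that do
not come from any real cluster. Only the 90% threshold and the 10 min cleanup
interval are taken from the description above.

```python
def seconds_until_threshold(capacity_mb, used_mb, fill_rate_mb_s,
                            threshold_pct=90.0):
    """Seconds until the file system reaches threshold_pct utilization.

    Assumes a constant fill rate and no cleanup in the meantime
    (a simplification, not how the NM actually accounts for space).
    """
    threshold_mb = capacity_mb * threshold_pct / 100.0
    headroom_mb = threshold_mb - used_mb
    if headroom_mb <= 0:
        return 0.0           # already at or above the threshold
    if fill_rate_mb_s <= 0:
        return float("inf")  # nothing is being localized
    return headroom_mb / fill_rate_mb_s


def threshold_hit_before_cleanup(capacity_mb, used_mb, fill_rate_mb_s,
                                 threshold_pct=90.0,
                                 cleanup_interval_ms=600_000):
    """True if the threshold is reached before the next cleanup run.

    Worst case: the previous cleanup has only just finished, so a full
    cleanup interval passes before the cache is trimmed again.
    """
    eta_s = seconds_until_threshold(capacity_mb, used_mb,
                                    fill_rate_mb_s, threshold_pct)
    return eta_s < cleanup_interval_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical node: 50G /tmp, 10G used, jobs localizing 100 MB/s.
    eta = seconds_until_threshold(51200, 10240, 100)
    print("seconds until 90%:", eta)
    print("before next cleanup:",
          threshold_hit_before_cleanup(51200, 10240, 100))
```

With these assumed figures the threshold is reached after about 358 seconds,
well inside the 600 second cleanup interval, which matches the failure mode
described above. At 10 MB/s the same node would take about an hour, so the
periodic cleanup would normally get a chance to run first.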

This issue may take many NMs offline at the same time and is thus quite
critical in terms of platform stability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

