Oleksandr Kalinin created YARN-5140:
---------------------------------------
Summary: NM usercache fill up with burst of jobs leads to NM outage
Key: YARN-5140
URL: https://issues.apache.org/jira/browse/YARN-5140
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.7.0
Environment: Linux RHEL 6.7, Hadoop 2.7.0
Reporter: Oleksandr Kalinin
A burst or high rate of submitted jobs with a substantial NM usercache resource
localization footprint may rapidly fill up the NM local temporary IO FS
(/tmp by default), with negative consequences for stability.
The core issue seems to be that the NM continues to localize resources beyond
the maximum local cache size
(yarn.nodemanager.localizer.cache.target-size-mb, default 10G). Since the
maximum local cache size is effectively not taken into account when localizing
new resources (note that the default cache cleanup interval is 10 min,
controlled by yarn.nodemanager.localizer.cache.cleanup.interval-ms), this
basically leads to a self-destruction scenario: once /tmp FS utilization
reaches the threshold of 90%, the NM automatically de-registers from the RM,
effectively leading to an NM outage.
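To illustrate why the 10 min cleanup interval matters, here is a rough
back-of-the-envelope sketch. It assumes a constant localization rate and no
deletions between cleanup runs, which is a simplification of real NM
behaviour. The function names, disk size, and rate are hypothetical; only the
three default values come from the description above.

```python
# Rough model only: constant localization rate, nothing deleted between
# cleanup runs. Function names and example numbers are hypothetical.

# Defaults quoted in the description above.
CACHE_TARGET_MB = 10 * 1024   # yarn.nodemanager.localizer.cache.target-size-mb
CLEANUP_INTERVAL_S = 600      # yarn.nodemanager.localizer.cache.cleanup.interval-ms (10 min)
DISK_THRESHOLD_PCT = 90       # FS utilization at which the NM drops out


def seconds_until_threshold(disk_mb, used_mb, rate_mb_per_s,
                            threshold_pct=DISK_THRESHOLD_PCT):
    """Seconds until FS utilization reaches the threshold at a constant rate."""
    headroom_mb = disk_mb * threshold_pct / 100 - used_mb
    if headroom_mb <= 0:
        return 0.0                # already at or above the threshold
    if rate_mb_per_s <= 0:
        return float("inf")       # nothing is being localized
    return headroom_mb / rate_mb_per_s


def outruns_cleanup(disk_mb, used_mb, rate_mb_per_s):
    """True if the threshold is reached before one cleanup interval elapses."""
    return seconds_until_threshold(disk_mb, used_mb, rate_mb_per_s) < CLEANUP_INTERVAL_S


if __name__ == "__main__":
    # Hypothetical node: 50 GB /tmp FS, 20 GB already used, 100 MB/s burst.
    disk_mb, used_mb, rate = 50 * 1024, 20 * 1024, 100
    print(seconds_until_threshold(disk_mb, used_mb, rate))
    print(outruns_cleanup(disk_mb, used_mb, rate))
```

Under these assumed numbers the threshold is reached well before a single
cleanup interval elapses, and the localized data by then exceeds
CACHE_TARGET_MB, which matches the failure mode described above.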
This issue may take many NMs offline simultaneously and is thus quite critical
in terms of platform stability.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)