[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375089#comment-15375089 ]
Jason Lowe commented on YARN-5370:
----------------------------------
It's expected behavior in the sense that the debug delay setting causes the NM
to buffer every deletion task for up to the configured amount of time. 100 days
is a lot of time, so if there are many deletions within that period the NM has
to buffer a lot of tasks, as you saw in the heap dump.
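For illustration only, here is a minimal standalone sketch (not the actual
DeletionService code; the class below is hypothetical) of the mechanism: every
task handed to a ScheduledThreadPoolExecutor with a long delay sits in the
executor's internal work queue, on the heap, until the delay expires, so a
100-day delay retains every deletion scheduled in that window.
{code:java}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DelayedDeletionSketch {
  public static void main(String[] args) {
    // Stand-in for the NM's DelServiceSchedThreadPoolExecutor.
    ScheduledThreadPoolExecutor sched = new ScheduledThreadPoolExecutor(4);

    // yarn.nodemanager.delete.debug-delay-sec set to roughly 100 days.
    long debugDelaySec = TimeUnit.DAYS.toSeconds(100);

    // Each container cleanup schedules a task roughly like this. With a
    // 100-day delay, none of them run, so the executor's queue only grows.
    for (int i = 0; i < 1_000_000; i++) {
      final int id = i;
      sched.schedule(
          () -> System.out.println("would delete paths for container " + id),
          debugDelaySec, TimeUnit.SECONDS);
    }

    // All one million tasks are still resident on the heap at this point.
    System.out.println("tasks retained on heap: " + sched.getQueue().size());
    sched.shutdownNow(); // discard pending tasks so the demo exits
  }
}
{code}
The DelServiceSchedThreadPoolExecutor retaining ~3.5 GB in your heap dump is
exactly this kind of executor holding all of the not-yet-expired deletion
tasks.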
The debug delay is, as the name implies, for debugging. If you set it to a
very large value then, depending upon the amount of container churn on the
cluster, a correspondingly large heap will be required given the way it works
today. It's not typical to set this to a very large value since it only needs
to be large enough to give someone a chance to examine/copy off the requisite
files after reproducing the issue. Normally it doesn't take someone 100 days
to get around to examining the files after a problem occurs. ;-)
Theoretically we could extend the functionality to spill tasks to disk or do
something more clever with how they are stored to reduce the memory pressure,
but I question the cost/benefit tradeoff. Again this is a feature intended
just for debugging. I'm also not a big fan of putting in an arbitrary limit on
the value. If someone wants to store files for a few years and has the heap
size and disk space to hold all that, who are we to stop them from trying?
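Purely as a sketch of what spilling might look like (nothing below exists in
the NM today; the class and its methods are hypothetical), pending deletions
could be appended to an on-disk journal and swept periodically instead of
being held as scheduled Runnables on the heap:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keep pending deletions on disk, not on the heap. */
public class SpilledDeletionJournal {
  private final Path journal;

  public SpilledDeletionJournal(Path journal) {
    this.journal = journal;
  }

  /** Record a pending deletion: O(1) heap cost, the state lives on disk. */
  public void schedule(Path target, long delaySec) throws IOException {
    long deleteAfter = Instant.now().getEpochSecond() + delaySec;
    Files.write(journal, List.of(deleteAfter + "\t" + target),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  /** Run periodically: delete expired entries, keep the rest. */
  public void sweep() throws IOException {
    if (!Files.exists(journal)) {
      return;
    }
    long now = Instant.now().getEpochSecond();
    List<String> remaining = new ArrayList<>();
    for (String line : Files.readAllLines(journal)) {
      String[] parts = line.split("\t", 2);
      if (Long.parseLong(parts[0]) <= now) {
        Files.deleteIfExists(Paths.get(parts[1])); // simplified: no recursion
      } else {
        remaining.add(line);
      }
    }
    Files.write(journal, remaining,
        StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
  }
}
{code}
Whether the extra I/O, journal compaction, and recovery handling would be
worth it for a debug-only knob is exactly the cost/benefit question above.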
> Setting yarn.nodemanager.delete.debug-delay-sec to a high number crashes the NM because of OOM
> ----------------------------------------------------------------------------------------
>
> Key: YARN-5370
> URL: https://issues.apache.org/jira/browse/YARN-5370
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev
> cluster for a few reasons, about 3-4 weeks ago. Since then the NM has
> crashed at times because of OOM, so as a temporary fix I have gradually
> increased its heap from 512 MB to 6 GB over the past few weeks, each time a
> crash occurred. Sometimes it won't start smoothly and only begins
> functioning after multiple tries. While analyzing a heap dump of the
> corresponding JVM, I found that DeletionService is occupying almost 99% of
> the total allocated memory (-Xmx), something like this:
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor
> @ 0x6c1d09068| 80 | 3,544,094,696 | 99.13%
> Basically, a huge number of the above-mentioned tasks are scheduled for
> deletion. I usually see NM memory requirements of 2-4 GB for large clusters;
> in my case the cluster is very small and OOM still occurs.
> Is this expected behaviour? Or is there any limit we can enforce on
> yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?