[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:
------------------------------------

To solve this issue, we tried setting yarn.nodemanager.delete.debug-delay-sec 
to a very low value (zero seconds), assuming it would clear the existing 
scheduled deletion tasks. That didn't happen - the new value is not applied 
to tasks that have already been scheduled. We then learned that the 
canRecover() method is called during service start; it reads the state from 
the NM recovery directory (on the local filesystem) and rebuilds all of that 
information in memory, which in turn causes problems starting the services 
and consumes a large amount of memory. We then moved the contents of the NM 
recovery directory elsewhere. From that point on, the NM started smoothly and 
worked as expected. I think logging a warning about such a high value (for 
example, 100+ days), indicating that it can cause a potential crash, could 
save a significant amount of time when troubleshooting this issue.
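
For illustration only, below is a minimal sketch of the kind of warning that 
could be logged at service start. The standalone class name, the placement, 
and the one-day threshold are assumptions, not existing NodeManager code; 
only the property key comes from this issue.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugDelayCheck {
  private static final Logger LOG = LoggerFactory.getLogger(DebugDelayCheck.class);

  // Hypothetical threshold (one day); the actual limit, if any, would be a design decision.
  private static final int WARN_THRESHOLD_SEC = 24 * 60 * 60;

  /** Logs a warning when the debug delete delay is high enough to pile up deletion tasks. */
  public static void warnIfDebugDelayTooHigh(Configuration conf) {
    int delaySec = conf.getInt("yarn.nodemanager.delete.debug-delay-sec", 0);
    if (delaySec > WARN_THRESHOLD_SEC) {
      LOG.warn("yarn.nodemanager.delete.debug-delay-sec is set to {} seconds; "
          + "deletion tasks will be kept in memory for that long and may be "
          + "recovered in bulk from the NM recovery directory on restart, "
          + "which can exhaust the NodeManager heap.", delaySec);
    }
  }
}
{code}

Simply surfacing a message like this in the NM log would have pointed us at 
the misconfiguration much sooner.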

> Setting yarn.nodemanager.delete.debug-delay-sec to a high number crashes NM 
> because of OOM
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-5370
>                 URL: https://issues.apache.org/jira/browse/YARN-5370
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev 
> cluster for some reasons. This was done 3-4 weeks ago. Since then, the NM 
> crashes at times because of OOM. As a temporary fix, I kept increasing the 
> heap from 512 MB to 6 GB gradually over the past few weeks whenever a crash 
> occurred. Sometimes it won't start smoothly and only begins functioning 
> after multiple tries. While analyzing the heap dump of the corresponding 
> JVM, I found that DeletionService.java occupies almost 99% of the total 
> allocated memory (-Xmx), something like this:
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor
>  @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
> Basically, there is a huge number of the above-mentioned tasks scheduled 
> for deletion. Usually, I see NM memory requirements of 2-4 GB for large 
> clusters. In my case, the cluster is very small and OOM still occurs.
> Is this expected behaviour? Or is there any limit we can expose on 
> yarn.nodemanager.delete.debug-delay-sec to avoid these kinds of issues?
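
For illustration, the retention behaviour behind this growth can be 
reproduced with a plain java.util.concurrent.ScheduledThreadPoolExecutor, 
which keeps every scheduled task in its work queue until its delay expires. 
This is a standalone sketch, not the actual DeletionService code; the task 
count and delay are arbitrary.

{code:java}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DelayRetentionDemo {
  public static void main(String[] args) {
    ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(4);

    // Schedule many no-op "deletion" tasks far in the future, similar to what a
    // very large yarn.nodemanager.delete.debug-delay-sec causes on the NM.
    long delayDays = 100;
    for (int i = 0; i < 1_000_000; i++) {
      executor.schedule(() -> { /* pretend to delete local files */ },
          delayDays, TimeUnit.DAYS);
    }

    // Every task is still held in the executor's internal queue, so the memory
    // for all of them stays referenced until the delay elapses.
    System.out.println("Pending tasks: " + executor.getQueue().size());

    executor.shutdownNow();
  }
}
{code}

With a 100-day delay, every deletion scheduled during that window stays 
referenced by the executor, which matches the heap dump above showing 
DelServiceSchedThreadPoolExecutor holding nearly the entire heap.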


