[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:
------------------------------------

To solve this issue, we first tried setting yarn.nodemanager.delete.debug-delay-sec to a very low value (zero seconds), assuming it might clear off the existing scheduled deletion tasks. It didn't: the new value is not applied to tasks that have already been scheduled. We then found that the canRecover() method is called during service start; it pulls this info from the NM recovery directory (on the local filesystem) and rebuilds all of it in memory, which in turn causes problems starting the services and consumes a large amount of memory. Next, we moved the contents of the NM recovery directory somewhere else; from that point onwards, the NM started smoothly and worked as expected. I think logging a warning about such a high value (e.g., 100+ days), indicating that it can cause a potential crash, would save a significant amount of time in troubleshooting this issue.

> Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
> ----------------------------------------------------------------------------------------
>
>          Key: YARN-5370
>          URL: https://issues.apache.org/jira/browse/YARN-5370
>      Project: Hadoop YARN
>   Issue Type: Bug
>     Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster for some reasons, about 3-4 weeks ago. After setting this up, the NM crashes at times because of OOM, so as a temporary fix I kept gradually increasing the heap from 512 MB to 6 GB over the past few weeks whenever a crash occurred. Sometimes it won't start smoothly, and only after multiple tries does it start functioning.
> While analyzing a heap dump of the corresponding JVM, I came to know that DeletionService is occupying almost 99% of the total allocated memory (-Xmx), something like this:
>
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
>
> Basically, there is a huge number of the above-mentioned tasks scheduled for deletion. Usually I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very small, yet OOM still occurs.
> Is this expected behaviour? Or is there any limit we can expose on yarn.nodemanager.delete.debug-delay-sec to avoid these kinds of issues?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
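
The first attempt described in the comment, overriding the delay in yarn-site.xml, can be sketched as below. The property name is the real NM setting; as observed above, lowering it only affects deletion tasks scheduled after the change, not tasks already recovered from the state store:

```xml
<!-- yarn-site.xml: lower the deletion debug delay. Per the comment above,
     tasks that were already scheduled before the NM restart keep their
     original deletion time; only newly scheduled tasks honour this value. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>0</value>
</property>
```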
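The warning proposed in the comment could look roughly like the following sketch. This is not existing YARN code: the class name, method names, and the one-week threshold are illustrative assumptions; only the property name is real.

```java
// Hypothetical sketch of a startup sanity check for the deletion debug
// delay; class, methods, and threshold are assumptions for illustration.
import java.util.logging.Logger;

public class DeletionDelayCheck {
    private static final Logger LOG =
        Logger.getLogger(DeletionDelayCheck.class.getName());

    // Assumed threshold: treat anything above one week as suspicious.
    static final long WARN_THRESHOLD_SEC = 7L * 24 * 3600;

    /** Returns true when the configured debug delay looks dangerously large. */
    public static boolean isExcessive(long debugDelaySec) {
        return debugDelaySec > WARN_THRESHOLD_SEC;
    }

    /** Logs a warning at NM service start when the delay risks OOM on recovery. */
    public static void warnIfExcessive(long debugDelaySec) {
        if (isExcessive(debugDelaySec)) {
            LOG.warning("yarn.nodemanager.delete.debug-delay-sec is set to "
                + debugDelaySec + " seconds; recovered deletion tasks may "
                + "accumulate and exhaust the NodeManager heap");
        }
    }
}
```

A check like this, run once where the NM reads its configuration, would surface the 100+-day setting in the logs instead of leaving the operator to diagnose an OOM from a heap dump.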