[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548239#comment-14548239 ]

Siqi Li commented on YARN-3468:
---

Agreed. Setting yarn.nodemanager.delete.debug-delay-sec to 10 minutes is a bad idea in a production cluster. I will mark this JIRA as Won't Fix.

> NM should not blindly rename usercache/filecache/nmPrivate on restart
> -
>
> Key: YARN-3468
> URL: https://issues.apache.org/jira/browse/YARN-3468
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Siqi Li
> Assignee: Siqi Li
> Attachments: YARN-3468.v1.patch, YARN-3468.v2.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514929#comment-14514929 ]

Vinod Kumar Vavilapalli commented on YARN-3468:
---

What do you mean by NM flapping? You shouldn't really be setting yarn.nodemanager.delete.debug-delay-sec to 10 minutes in a production cluster.

Getting this patch in is okay though. Does it need a test case?
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514880#comment-14514880 ]

Sangjin Lee commented on YARN-3468:
---

It seems the test failures are related to the patch. [~l201514], could you please look into them?
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505561#comment-14505561 ]

Hadoop QA commented on YARN-3468:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12726942/YARN-3468.v2.patch
against trunk revision 424a00d.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7430//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7430//console

This message is automatically generated.
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504151#comment-14504151 ]

Siqi Li commented on YARN-3468:
---

Got it. I will modify the patch as you suggested.
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504093#comment-14504093 ]

Sangjin Lee commented on YARN-3468:
---

Sorry this fell off my radar.

{code}
RemoteIterator subDirStatus = lfs.listStatus(subDirPath);
if (subDirStatus != null && !subDirStatus.hasNext()) {
  // Skip the renaming when the given localSubDir is empty
  return;
}
lfs.rename(subDirPath, new Path(
{code}

Wouldn't it be a bit safer to do

{code}
RemoteIterator subDirStatus = lfs.listStatus(subDirPath);
if (subDirStatus != null && subDirStatus.hasNext()) {
  lfs.rename(subDirPath, new Path(
{code}

? Also, I normally try to avoid returning in the middle of a method, as it can be error-prone (not necessarily here, but as a general rule).
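The restructured check suggested in this comment (rename only when the directory has at least one entry, with no early return) can be sketched outside Hadoop as follows. This is a minimal illustration, not the patch itself: it uses java.nio.file instead of the FileContext (lfs) API quoted above, and the class name GuardedRename, the method renameIfNotEmpty, and the numeric suffix parameter are hypothetical names introduced only for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the NM rename step; not Hadoop code.
final class GuardedRename {

    /**
     * Renames dir to a sibling named "<name>_DEL_<suffix>" only when dir
     * exists and contains at least one entry. Returns true if it was renamed.
     */
    static boolean renameIfNotEmpty(Path dir, long suffix) throws IOException {
        boolean hasEntries = false;
        if (Files.isDirectory(dir)) {
            // Close the listing before moving the directory.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                hasEntries = entries.iterator().hasNext();
            }
        }
        // Single exit point: the rename sits inside the positive branch.
        if (hasEntries) {
            Files.move(dir, dir.resolveSibling(dir.getFileName() + "_DEL_" + suffix));
        }
        return hasEntries;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("nm-local");
        Path usercache = Files.createDirectory(root.resolve("usercache"));
        // An empty directory is left in place, so no new _DEL_ dir appears.
        System.out.println("renamed empty dir: " + renameIfNotEmpty(usercache, 1L));
    }
}
```

Placing the rename inside the positive branch means a null or empty listing simply falls through, which is the shape the comment argues for.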
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487979#comment-14487979 ]

Siqi Li commented on YARN-3468:
---

Does anyone have any comments on this JIRA?
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486380#comment-14486380 ]

Hadoop QA commented on YARN-3468:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12724051/YARN-3468.v1.patch
against trunk revision 5a540c3.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:

org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
org.apache.hadoop.yarn.server.nodemanager.TestNodeManagerReboot

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7265//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7265//console

This message is automatically generated.
[jira] [Commented] (YARN-3468) NM should not blindly rename usercache/filecache/nmPrivate on restart
[ https://issues.apache.org/jira/browse/YARN-3468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486285#comment-14486285 ]

Siqi Li commented on YARN-3468:
---

The renaming causes an enormous number of _DEL_ directories to be created when an NM is flapping and yarn.nodemanager.delete.debug-delay-sec is set to several minutes. This leads to an NM full GC problem on startup.

Here is how things go south. On startup, the NM does 3 renames and 3 mkdirs:

usercache --> usercache_DEL_
filecache --> filecache_DEL_
nmPrivate --> nmPrivate_DEL_
mkdir usercache
mkdir filecache
mkdir nmPrivate

It then scans all DEL dirs and adds them to the deletion service. However, with yarn.nodemanager.delete.debug-delay-sec set to 10 minutes, for example, those DEL dirs are not deleted immediately, but only after 10 minutes. When an NM flaps, it restarts every ~4 sec, so it never stays up long enough (i.e. 10 minutes) to perform the deletion. That's why more and more DEL dirs are created but none are removed. As a result, /data/disk*/yarn/local/ ends up with hundreds of thousands of DEL dirs, and scanning them and adding them to the deletion service exhausts the memory.

My suggestion is to prevent NMs from renaming those directories if they are empty.
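The accumulation described in this comment can be checked with a small back-of-the-envelope model. The sketch below is a simplification under stated assumptions, not NodeManager code: it assumes three renames per start in a single local dir, that every queued deletion is lost if the process dies before the delay elapses, and that the NM stays up for the same fixed time after every start. The class name DelDirBacklog and the method backlog are hypothetical.

```java
// Hypothetical model of the _DEL_ dir backlog in one local dir; not Hadoop code.
final class DelDirBacklog {

    // usercache, filecache and nmPrivate are each renamed once per start.
    static final int RENAMES_PER_START = 3;

    /**
     * Returns how many _DEL_ dirs remain after the given number of starts,
     * assuming the NM runs for uptimeSec after each start and deletions
     * only happen once the NM has been up for deleteDelaySec.
     */
    static long backlog(int starts, long uptimeSec, long deleteDelaySec) {
        long onDisk = 0;
        for (int i = 0; i < starts; i++) {
            // Each start renames the three directories unconditionally.
            onDisk += RENAMES_PER_START;
            // Every _DEL_ dir found on disk is queued with the configured
            // delay; the queue drains only if the process lives that long.
            if (uptimeSec >= deleteDelaySec) {
                onDisk = 0;
            }
        }
        return onDisk;
    }

    public static void main(String[] args) {
        // One hour of flapping: 900 starts, 4 sec apart, 10 minute delay.
        System.out.println(backlog(900, 4, 600));
        // Same flapping with no debug delay configured.
        System.out.println(backlog(900, 4, 0));
    }
}
```

With the figures from the description (a restart every 4 sec and a 10 minute delay), one hour of flapping is 900 starts, and the model leaves 2700 _DEL_ dirs in a single local dir. With a delay of 0 nothing accumulates, which matches the point made later in the thread that the long debug delay is what makes the flapping harmful.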