[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019413#comment-17019413 ] kailiu_dev commented on YARN-2902: -- Hi , [~wilfreds] [~jlowe] Hudson I find a bug,NM has many dirs such as " nm-local-dir/filecache/1889677_tmp" not clear and leak on disk, do you know why? my hadoop version is hadoop-2.7.2 > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Darrell Lowe >Assignee: Varun Saxena >Priority: Major > Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15800488#comment-15800488 ] Wilfred Spiegelenburg commented on YARN-2902: - Oops yes branch-2.6 was the broken one, it has a separate patch. I will port only the ENOENT change to branch-2.6 thank you for confirming this. Opened: YARN-6056 and will provide a patch there > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15798347#comment-15798347 ] Jason Lowe commented on YARN-2902: -- bq. In trunk we "ignore" a non existing directory in delete_as_user() whan the stat returns an ENOENT we do not do that in branch-2. I tracked it back to the backport of this jira into branch-2. The native code part of the change was not ported back into branch-2. I'm not seeing that. The "backport" for branch-2 was a cherry-pick: {noformat} commit c75d8b164f05033d0c6ed4876cf0b8d57224aa8c Author: Jason LoweDate: Thu Oct 29 16:33:29 2015 + YARN-2902. Killing a container that is localizing can orphan resources in the DOWNLOADING state. Contributed by Varun Saxena (cherry picked from commit e2267de2076245bd9857f6a30e3c731df017fef8) {noformat} And the diff between trunk and branch-2 shows no difference at all in the delete_as_user function. Ah, I believe you mean branch-2.6 instead of branch-2. There is a huge difference between trunk and branch-2.6 for delete_as_user, and I agree it appears the ENOENT change was dropped in the 2.6 patch. bq. If it was an oversight does this require a new jira to backport the native code or does it get handled as an addendum to this one. It would be a new JIRA against 2.6.4. This change has already shipped. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15796805#comment-15796805 ] Wilfred Spiegelenburg commented on YARN-2902: - I was looking at a problem around removing non existing directories in the LCE (native code) and saw that there was a difference in behaviour between trunk and branch-2. In trunk we "ignore" a non existing directory in {{delete_as_user()}} whan the stat returns an {{ENOENT}} we do not do that in branch-2. I tracked it back to the backport of this jira into branch-2. The native code part of the change was not ported back into branch-2. Question is was that on purpose or was it an oversight? If it was an oversight does this require a new jira to backport the native code or does it get handled as an addendum to this one. It has been a while since this jira was closed and the jira has been in a number of releases. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.8.0, 2.7.2, 2.6.4, 3.0.0-alpha1 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085669#comment-15085669 ] Hadoop QA commented on YARN-2902: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 0s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} branch-2.6 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} branch-2.6 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 16s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 17s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s {color} | {color:green} branch-2.6 passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 21s {color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.6 failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} branch-2.6 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s {color} | {color:red} The patch has 4465 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 1m 59s {color} | {color:red} The patch has 272 line(s) with tabs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 21s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 4s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 8s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 32s {color} | {color:red} Patch generated 75 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 46m 39s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:date2016-01-06 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12780377/YARN-2902-branch-2.6.01.patch | | JIRA Issue | YARN-2902 | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 9e273af6efbe
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085908#comment-15085908 ] Varun Saxena commented on YARN-2902: The whitespace result seems to be weird. Maybe because some build related changes made in trunk are not in this branch > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085856#comment-15085856 ] Hadoop QA commented on YARN-2902: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 2 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 23s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} branch-2.6 passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s {color} | {color:green} branch-2.6 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} branch-2.6 passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s {color} | {color:green} branch-2.6 passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 19s {color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.6 failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} branch-2.6 passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 24s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 25s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 2s {color} | {color:red} The patch has 4758 line(s) that end in whitespace. Use git apply --whitespace=fix. {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 1m 48s {color} | {color:red} The patch has 272 line(s) with tabs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 15s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 6m 0s {color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 6m 0s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. {color} | | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 32s {color} | {color:red} Patch generated 75 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 31m 7s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:date2016-01-06 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12780377/YARN-2902-branch-2.6.01.patch | | JIRA Issue | YARN-2902 | |
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086401#comment-15086401 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-trunk-Commit #9060 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9060/]) Add YARN-2975, YARN-3893, YARN-2902 and YARN-4354 to Release 2.6.4 entry (junping_du: rev b6c9d3fab9c76b03abd664858f64a4ebf3c2bb20) * hadoop-yarn-project/CHANGES.txt > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086343#comment-15086343 ] Junping Du commented on YARN-2902: -- Thanks [~varun_saxena] for the patch against 2.6 branch. Agree that the whitespace issue is not related to this patch. I have merge it to branch-2.6. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2, 2.6.4 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083031#comment-15083031 ] Junping Du commented on YARN-2902: -- Sounds like a good plan. Thanks! > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081723#comment-15081723 ] Varun Saxena commented on YARN-2902: [~djp], uploaded a patch backporting changes to branch-2.6 Changes in container-executor.c are not required as the change was necessitated by changes made in YARN-3089 which is not there in branch-2.6 If YARN-3089 is ever brought into branch-2.6, it will break fix made by this JIRA(while using LCE). Would a comment on that JIRA be enough to avoid this potential problem ? However, YARN-3089 is unlikely to be backported to branch-2.6 We must backport YARN-4354(as you mentioned) and YARN-4380(the newly added test here fails intermittently without it) if we get this into branch-2.6 Also should I reopen this JIRA to run QA on this patch ? > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082021#comment-15082021 ] Junping Du commented on YARN-2902: -- bq. However, YARN-3089 is unlikely to be backported to branch-2.6 I just pinged the author/review on that JIRA. Let's wait and see. Does that means if we need to pull that fix, the patch here need to be updated? If so, we should wait a while on that patch. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082000#comment-15082000 ] Junping Du commented on YARN-2902: -- Thanks [~varun_saxena] for replying and delivering the patch! bq. Also should I reopen this JIRA to run QA on this patch? It depends on if we have many conflicts in merging with original patch. If answer is yes, then we'd better to reopen the JIRA and run Jenkins test. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081130#comment-15081130 ] Varun Saxena commented on YARN-2902: Yes this issue should exist in 2.6 too. I can help you with backporting it. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082460#comment-15082460 ] Varun Saxena commented on YARN-2902: bq. Does that means if we need to pull that fix, the patch here need to be updated? Yes, the patch would have to be updated with changes in container-executor.c Yeah there were a few conflicts. Will reopen and run QA after decision on YARN-3089. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902-branch-2.6.01.patch, YARN-2902.002.patch, > YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.05.patch, > YARN-2902.06.patch, YARN-2902.07.patch, YARN-2902.08.patch, > YARN-2902.09.patch, YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080584#comment-15080584 ] Junping Du commented on YARN-2902: -- Hi [~jlowe] and [~varun_saxena], shall we get the fix (and YARN-4354) to branch-2.6 also? > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980749#comment-14980749 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-trunk-Commit #8724 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8724/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980888#comment-14980888 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #601 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/601/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981070#comment-14981070 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #613 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/613/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981164#comment-14981164 ] Varun Saxena commented on YARN-2902: Thanks [~jlowe] for the review and commit. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981001#comment-14981001 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2543 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2543/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/CHANGES.txt > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981055#comment-14981055 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Yarn-trunk #1336 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1336/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981233#comment-14981233 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2489 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2489/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981177#comment-14981177 ] Hudson commented on YARN-2902: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #551 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/551/]) YARN-2902. Killing a container that is localizing can orphan resources (jlowe: rev e2267de2076245bd9857f6a30e3c731df017fef8) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DockerContainerExecutor.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/test/test-container-executor.c > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Fix For: 2.7.2 > > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980729#comment-14980729 ] Jason Lowe commented on YARN-2902: -- +1 latest patch lgtm. Committing this. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980163#comment-14980163 ] Hadoop QA commented on YARN-2902: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s {color} | {color:blue} docker + precommit patch detected. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s {color} | {color:green} trunk passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} the patch passed with JDK v1.7.0_85 {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 33s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 58s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_85. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} Patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 28m 1s {color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-29 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12769486/YARN-2902.11.patch | | JIRA Issue | YARN-2902 | | Optional Tests | asflicense javac javadoc mvninstall unit findbugs checkstyle compile cc | | uname | Linux bf5ccb831f38 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality |
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980259#comment-14980259 ] Varun Saxena commented on YARN-2902: Fixed. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.11.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978364#comment-14978364 ] Varun Saxena commented on YARN-2902: [~jlowe], For some reason, QA report was not published to JIRA. After integration with Yetus it seems. Its successful though. Kindly find the reports here. https://builds.apache.org/job/PreCommit-YARN-Build/9593/console {panel} +1 overall || Vote || Subsystem || Runtime || Comment|| | 0 |reexec | 0m 8s | docker + precommit patch detected. | +1 | @author | 0m 0s | The patch does not contain any @author | ||| tags. | +1 |test4tests | 0m 0s | The patch appears to include 3 new or | ||| modified test files. | +1 |mvninstall | 3m 22s| trunk passed | +1 | compile | 0m 24s| trunk passed with JDK v1.8.0_60 | +1 | compile | 0m 25s| trunk passed with JDK v1.7.0_79 | +1 |checkstyle | 0m 12s| trunk passed | +1 |mvneclipse | 0m 15s| trunk passed | +1 | findbugs | 1m 3s | trunk passed | +1 | javadoc | 0m 24s| trunk passed with JDK v1.8.0_60 | +1 | javadoc | 0m 27s| trunk passed with JDK v1.7.0_79 | +1 |mvninstall | 0m 27s| the patch passed | +1 | compile | 0m 26s| the patch passed with JDK v1.8.0_60 | +1 |cc | 0m 26s| the patch passed | +1 | javac | 0m 26s| the patch passed | +1 | compile | 0m 25s| the patch passed with JDK v1.7.0_79 | +1 |cc | 0m 25s| the patch passed | +1 | javac | 0m 25s| the patch passed | +1 |checkstyle | 0m 13s| the patch passed | +1 |mvneclipse | 0m 14s| the patch passed | +1 |whitespace | 0m 0s | Patch has no whitespace issues. | +1 | findbugs | 1m 12s| the patch passed | +1 | javadoc | 0m 22s| the patch passed with JDK v1.8.0_60 | +1 | javadoc | 0m 26s| the patch passed with JDK v1.7.0_79 | +1 | unit | 9m 18s| hadoop-yarn-server-nodemanager in the | ||| patch passed with JDK v1.8.0_60. | +1 | unit | 9m 29s| hadoop-yarn-server-nodemanager in the | ||| patch passed with JDK v1.7.0_79. | +1 |asflicense | 0m 25s| Patch does not generate ASF License | ||| warnings. | || 30m 42s | || Subsystem || Report/Notes || | Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-10-28 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12769215/YARN-2902.10.patch | | JIRA Issue | YARN-2902 | | Optional Tests | asflicense javac javadoc mvninstall unit findbugs checkstyle compile cc | | uname | Linux 0bcb1b5c92b2 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-YARN-Build/patchprocess/apache-yetus-06655ab/dev-support/personality/hadoop.sh | | git revision | trunk / a04b169 | | Default Java | 1.7.0_79 | | Multi-JDK versions | /usr/lib/jvm/java-8-oracle:1.8.0_60 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_79 | | findbugs | v3.0.0 | | JDK v1.7.0_79 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9593/testReport/ | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Max memory used | 227MB | | Powered by | Apache Yetus http://yetus.apache.org | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9593/console | Finished build. {panel} > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch,
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14979288#comment-14979288 ] Jason Lowe commented on YARN-2902: -- Thanks for updating the patch, Varun! Patch looks great except for one thing that I missed earlier. In container-executor.c the code is assuming any error returned by stat must mean the file is missing, and that's not correct. It should be checking errno for ENOENT, as anything else is an error not related to the file being already deleted. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, > YARN-2902.10.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976928#comment-14976928 ] Varun Saxena commented on YARN-2902: Thanks a lot [~jlowe] for the review. I was under the incorrect impression that the resource downloading will not be taken up by other containers again. You are correct we should not FAIL the resource here. It will be taken up by outstanding container when next HB comes. If we do not call handleDownloadingRsrcsOnCleanup, we wont require to synchronize scheduled map as well. Also event.getResource().getLocalPath() can be used here too. This would preclude the need for ScheduledResource class and hence the refactoring associated with it. However, as resource would not be explicitly FAILED in this case, we should probably do some cleanup when reference count of downloading resource becomes 0. Otherwise entry associated with the downloading resource will remain in LocalResourcesTrackerImpl#localResourceMap and this may show up when cache cleanup is done. And we may turn up with the same log {{LOG.error("Attempt to remove resource: " + rsrc + " with non-zero refcount");}} even though the resource is deleted on disk. I think in LocalResourcesTrackerImpl#handle, after handling RELEASE event, we should check if the reference count is 0 and whether state of resource is DOWNLOADING. And if this is so, call LocalResourcesTrackerImpl#removeResource. Thoughts ? > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977077#comment-14977077 ] Jason Lowe commented on YARN-2902: -- bq. I think in LocalResourcesTrackerImpl#handle, after handling RELEASE event, we should check if the reference count is 0 and whether state of resource is DOWNLOADING. And if this is so, call LocalResourcesTrackerImpl#removeResource. Agreed. We can automatically remove the resource if the refcount of a downloaded resource ever goes to zero. And if there's a race where another container is just trying to reference that resource just as we're releasing (and removing) it from a killed container then either we'll keep it because the refcount is nonzero (request comes before release) or we'll create a new resource to track the subsequent request (release comes before request). > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977247#comment-14977247 ] Varun Saxena commented on YARN-2902: [~jlowe], kindly review. The patch applies on branch-2.7 too. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977604#comment-14977604 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 57s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 9m 1s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 28s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 28s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 42s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 6s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 43s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 1m 25s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 9m 21s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 53m 52s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-server-nodemanager | | Failed unit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService | | | hadoop.yarn.server.nodemanager.containermanager.localizer.TestLocalResourcesTrackerImpl | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12769131/YARN-2902.09.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 68ce93c | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9592/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9592/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9592/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9592/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.09.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975095#comment-14975095 ] Jason Lowe commented on YARN-2902: -- Thanks for updating the patch, Varun! I'm not sure the logic in handleDownloadingRsrcsOnCleanup is desired. IIUC when a container is killed then all resources that were still pending will receive a resource failed event. Those resources will in turn send all containers that reference it a failed resource message, and that will effectively kill the container. This could be bad if the killed container was localizing a resource for more than one container. Today when the container is killed I believe another outstanding container will simply acquire the lock on the scheduled resource and proceed to re-attempt a localization on that resource. If we do this logic then we'll kill all those other referencing containers which isn't correct. We do need to cleanup downloading resource objects that have a zero refcount, but not ones that are still referenced by other containers. Those other containers should be able to continue localizing the resource if necessary. With the following code, don't we risk failing to put the resource on the scheduled event queue if something is thrown yet we will still hold the resource lock? In the old code we would always put the resource on the scheduled queue before fielding any errors, and the scheduled queue processing would release the resource lock. {code} if (nRsrc.tryAcquire()) { if (nRsrc.getState() == ResourceState.DOWNLOADING) { [...some code removed for brevity...] try { Path localizationPath = getPathForLocalization(next); synchronized (scheduled) { scheduled.put(nextRsrc, new ScheduledResource(evt, localizationPath)); } return NodeManagerBuilderUtils.newResourceLocalizationSpec(next, localizationPath); } catch (IOException e) { LOG.error("local path for PRIVATE localization could not be " + "found. Disks might have failed.", e); } catch (IllegalArgumentException e) { LOG.error("Inorrect path for PRIVATE localization." + next.getResource().getFile(), e); } catch (URISyntaxException e) { LOG.error( "Got exception in parsing URL of LocalResource:" + next.getResource(), e); } } else { // Need to release acquired lock nRsrc.unlock(); } } {code} Actually do we even need all the code refactoring associated with the new ScheduledResource class? I think we can simply use the local path already accessible via event.getResource().getLocalPath() in the cleanup code. If that's not null then that means we sent the localizer that resource in the heartbeat response, and those would be the ones to target for cleanup. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975185#comment-14975185 ] Jason Lowe commented on YARN-2902: -- Forgot to respond to this comment: bq. That is if NM recovery is not enabled and the deletion task is scheduled. But the deletion task is put in the deletion service's executor queue because all the 4 threads in deletion service's executor(NM delete threads) are occupied. If NM goes down before this task is taken up, the downloading resources wont be deleted. If NM recovery is not enabled then failing to delete when the NM crashes is already a known issue. As for the normal termination scenario we should be stopping the ResourceLocalizationService (via the ContainerManager shutdown) before trying to stop the DeletionService, so I would expect deletions to be queued up before we stop that service. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974316#comment-14974316 ] Varun Saxena commented on YARN-2902: TestNodeStatusUpdaterForLabels failure will be tracked by YARN-4169 > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971133#comment-14971133 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 18m 3s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 9m 4s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 49s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 26s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 44s | The applied patch generated 6 new checkstyle issues (total was 235, now 198). | | {color:green}+1{color} | whitespace | 0m 12s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 41s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 26s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 9m 24s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 53m 30s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12768301/YARN-2902.08.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 124a412 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9546/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9546/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9546/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971168#comment-14971168 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 23s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 2 new or modified test files. | | {color:green}+1{color} | javac | 8m 51s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 16s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 27s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 46s | The applied patch generated 6 new checkstyle issues (total was 235, now 198). | | {color:green}+1{color} | whitespace | 0m 12s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 42s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 22s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | yarn tests | 9m 5s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 53m 48s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestNodeStatusUpdaterForLabels | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12768305/YARN-2902.08.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 124a412 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9547/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9547/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9547/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971092#comment-14971092 ] Varun Saxena commented on YARN-2902: [~jlowe], kindly review. Sorry could not upload the patch earlier due to bandwidth issues. But I think its still on track for 2.7.2 Coming to the patch, the patch handles the deletion in NM itself. At the time of processing container cleanup event(after killing of container), we will transition downloading resource to FAILED. And after localizer exits, deletion will be done in finally block of LocalizerRunner, as per suggestion given above. There is one presumably rare scenario where this deletion wont work. That is if NM recovery is not enabled and the deletion task is scheduled. But the deletion task is put in the deletion service's executor queue because all the 4 threads in deletion service's executor(NM delete threads) are occupied. If NM goes down before this task is taken up, the downloading resources wont be deleted. If you want this handled, we can attempt deletion in container localizer too. I already have code for it(in earlier patches). But do we need to handle this rare case ? Let me know. BTW, patch does not apply cleanly on branch-2.7 so will update that patch once trunk patch is ok to go in. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.08.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951148#comment-14951148 ] Varun Saxena commented on YARN-2902: [~jlowe], bq. The key with the new proposal is that the LocalizerRunner thread is the one issuing the deletes and only after the localizer process exits. +1. This makes sense. I cant think of any case where this won't work. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950399#comment-14950399 ] Varun Saxena commented on YARN-2902: [~jlowe] kindly review > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950919#comment-14950919 ] Jason Lowe commented on YARN-2902: -- Thanks for updating the patch, Varun! Sorry, I'm still a little confused on why we need to complicate the localizer protocol to fix this issue. Seems like this is a hack to help the NM figure out what's going on, but it should already know this stuff. That prompted me to dig around for an alternative solution, and I think I found one. The NM knows the local path where a resource is localized, since it tells the localizer where to put it in the download request. Also each localizer has a LocalizerRunner thread that is tracking it, and it knows which resources were pending when the localizer process exits. That's tracked in the {{scheduled}} map so the runner thread can unlock every pending resource to allow a subsequent localizer to try downloading it again. Seems to me all we need to do is have the LocalizerRunner issue a delete of the local path and temporary download path for each resource that was pending at the time the localizer process died, since we know any pending resources when a localizer exits must have been orphaned. Resources that were successfully localized are pulled out of the {{scheduled}} map, so the only things left should be the ones we need to process for cleanup. That seems like a much simpler implementation as it doesn't change any protocols and doesn't rely on the container localizer doing any cleanup. The NM will automatically do so when it exits. We also don't need delayed deletion support in DeletionService, since we know the container localizer process is dead. Maybe I'm missing something and that approach can't work. If it can then that seems like a preferable solution for 2.7 as it will be a smaller, simpler patch. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950989#comment-14950989 ] Varun Saxena commented on YARN-2902: [~jlowe], thanks for looking at the patch. The reason I inserted delete downloading flag in the protocol was to indicate to localizer that the resources it reported to NM in last HB were not processed by NM. So localizer needs to delete them. That is why an extra list of paths was maintained in localizer(paths which have been reported to NM for download). I was primarily working on the principle that we can delete as much as we can in localizer. So that if NM crashes and its not work preserving, paths can be deleted. And vice versa. So 2 points of deletion can make it almost sure that downloading resources are deleted. But yeah this does make it complex. You are correct that NM will know about these paths as well and can delete them. The extra flag in localizer protocol thus can be removed. I will update the patch. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951010#comment-14951010 ] Varun Saxena commented on YARN-2902: Sorry I mean NM crashes and recovery is not enabled. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951016#comment-14951016 ] Varun Saxena commented on YARN-2902: Just to let you know, one case where this wont work, I mean after removal of flag from protocol. 1. NM recovery is disabled. 2. Container is killed. Associated resources are stuck in downloading state and a deletion task is launched for them. 3. In the meantime localizer downloads a resource and on next HB, Localizer reports a downloaded resource to NM. In NM this will be in downloading state. 4. NM indicates localizer to DIE. Localizer wont delete the resource just downloaded. 5. NM crashes. 6. NM would missing deleting the downloading resource as well as recovery is disabled. This I agree though should be a very rare scenario and we can skip it. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951034#comment-14951034 ] Varun Saxena commented on YARN-2902: [~jlowe], bq. We also don't need delayed deletion support in DeletionService, since we know the container localizer process is dead. Localizer doesnt exit immediately as of now when container is killed, even though we interrupt the thread. We first issue a DIE on next HB and then only localizer exits. >From the time container is killed and localizer exits, a resource maybe >downloaded or may start downloading. We will delete the tmp directory and main >directory. But if download is started by localizer in the meantime, it will recreate the directories (in FSDownload#call). That is why delay has been introduced. Thoughts ? > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951115#comment-14951115 ] Jason Lowe commented on YARN-2902: -- The key with the new proposal is that the LocalizerRunner thread is the one issuing the deletes and only after the localizer process exits. bq. 2. Container is killed. Associated resources are stuck in downloading state and a deletion task is launched for them. That deletion task would not be launched because the localizer has not exited. The LocalizerRunner will still be waiting on the localizer process to exit. bq. Localizer doesnt exit immediately as of now when container is killed, even though we interrupt the thread. Yes, and that's fine. We won't issue the deletion requests until the localizer process eventually exits. The key is this code in LocalizerRunner: {code:title=LocalizerRunner} public void run() { Path nmPrivateCTokensPath = null; Throwable exception = null; try { [...localizer pre-startup code removed for brevity...] if (dirsHandler.areDisksHealthy()) { exec.startLocalizer(new LocalizerStartContext.Builder() .setNmPrivateContainerTokens(nmPrivateCTokensPath) .setNmAddr(localizationServerAddress) .setUser(context.getUser()) .setAppId(ConverterUtils.toString(context.getContainerId() .getApplicationAttemptId().getApplicationId())) .setLocId(localizerId) .setDirsHandler(dirsHandler) .build()); } else { throw new IOException("All disks failed. " + dirsHandler.getDisksHealthReport(false)); } // TODO handle ExitCodeException separately? } catch (FSError fe) { exception = fe; } catch (Exception e) { exception = e; } finally { if (exception != null) { LOG.info("Localizer failed", exception); // On error, report failure to Container and signal ABORT // Notify resource of failed localization ContainerId cId = context.getContainerId(); dispatcher.getEventHandler().handle(new ContainerResourceFailedEvent( cId, null, exception.getMessage())); } for (LocalizerResourceRequestEvent event : scheduled.values()) { event.getResource().unlock(); } delService.delete(null, nmPrivateCTokensPath, new Path[] {}); } {code} startLocalizer won't return until the localizer process exits, so when it iterates the {{scheduled}} map in the finally block to unlock the resources we can issue deletions for the local resource paths at the same time. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943918#comment-14943918 ] Varun Saxena commented on YARN-2902: The new patch does the following over the last patch : # Removed additional param in container executor to ignore missing directory. Now this will be the default behaviour # The patch no longer cancels the deletion task in NM # Localizer wont send the extra heartbeat which it was sending to NM if NM had indicated it to DIE. # Do not wait in localizer for cancelled task to complete on DIE. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942581#comment-14942581 ] Varun Saxena commented on YARN-2902: Release audit warning is unrelated. There is HDFS-9182 already for it. Checkstyle is for file length > 2000. Raised YARN-4223 for pre patch findbugs warning > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942511#comment-14942511 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 19m 52s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 58s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 20s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 17s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 4s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 2m 33s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 40s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 39s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 2s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 8m 47s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 62m 6s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12764917/YARN-2902.07.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 3b85bd7 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/trunkFindbugsWarningshadoop-yarn-server-nodemanager.html | | Release Audit | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9342/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9342/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9342/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, > YARN-2902.07.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902808#comment-14902808 ] Varun Saxena commented on YARN-2902: Ok. Then what I will do is NOT wait for completion of running tasks which have been cancelled. Localizer will try to only delete directories for which download was complete. Tasks which have failed, directories for them will anyways be deleted by FSDownload. We however may need a config in NM for deletion task delay(the one I have added in current patch). Or we can simply have a hardcoded value of 2 minutes. Regarding System exit, it will called after ExecutorService#shutdownNow(which will only interrupt running tasks and not wait for them) anyways. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805279#comment-14805279 ] Varun Saxena commented on YARN-2902: [~jlowe], bq. As far as properly handling DIE so we actually stop downloading and problems canceling active transfers, can't we just have the localizer forcibly tear down the JVM? If we're being told to DIE then I assume we really don't care about pending transfers completing and just want to get out. If the NM is going to clean up after the localizer anyway, seems like we can drastically simplify DIE handling and just exit the JVM. That seems like a change that's targeted enough to be appropriate for 2.7 instead of adding localizer kill support, etc. In container localizer, when processing HB DIE response, we send another localizer status to NM. Is it really required ? What do you think ? I think as soon as we get DIE, we can follow current code of cancelling pending tasks, although not wait for them to complete(as is being done in newly added code in patch) and delete paths reported in last status. And then just return from the loop for a graceful shutdown(after stopping executors). Or are you suggesting System exit ? >From the NM side, we can have a deletion task after some configured delay(same >as right now). We will never cancel this deletion task though unlike code in >patch now. This way localizer should quit quickly and NM can cleanup. I will change the behavior of executor on deletion as well i.e. I will ignore missing paths by default. Wont add flag. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14805281#comment-14805281 ] Varun Saxena commented on YARN-2902: I should say "This way localizer should quit quickly after attempting some cleanup and NM can also attempt cleanup." However we can also let localizer not do any cleanup at all and let NM delete paths. That is also an option. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875721#comment-14875721 ] Jason Lowe commented on YARN-2902: -- bq. In container localizer, when processing HB DIE response, we send another localizer status to NM. Is it really required ? What do you think ? I don't think this is required. If the NM is telling the localizer to DIE then I don't think the NM cares after that point what the localizer is doing. The NM is totally done with it at that point due to failure or lack of knowledge of that localizer. bq. Or are you suggesting System exit ? I was basically suggesting System.exit if we aren't convinced that the localizer can actually tear down in a timely manner. For example, if the graceful shutdown could involve waiting for active transfers to complete because we can't reliably interrupt them, then yes I think System.exit is appropriate. A good compromise would be to put a timeout on shutdown -- if we can't get down within so many seconds then have something (e.g.: a watchdog thread if necessary) call System.exit to get out. Otherwise the localizer could still be running and messing with the filesystem after the NM tries to cleanup afterwards. Worst-case scenario is this could still happen even with these fixes, but it should resolve the leaking issue for the vast majority of cases. We can make it more bulletproof in a followup JIRA for 2.8 or later that actually has the NM tracking localizer pids and proactively killing them if they don't respond in a timely manner to commands. bq. However we can also let localizer not do any cleanup at all and let NM delete paths. I would still like the localizer to try to perform some cleanup if possible, as the NM doesn't track localizers in the state store. Therefore if the NM restarts we may not cleanup everything properly if the localizer doesn't do it on its own. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14790688#comment-14790688 ] Varun Saxena commented on YARN-2902: bq. Container localizers should already have the concept of heartbeating and killing themselves if they don't hear from the NM within X seconds, and likewise the NM should kill localizers that don't heartbeat in a timely fashion. For container localizer I think ipc timeout should work, if configured. Will check it. For NM side, currently it does not kill localizers. We can track PID and kill it as discussed earlier if HB doesnt come for a configured period. We can do similar to what we do now for containers now. SIGTERM followed by SIGKILL, if required. Should we add this then ? This would mean that we will have to relaunch a new localizer again, if container is still running. Or fail the container ? bq. I'm also not sure we need deletion task cancellation. As you point out it's not really necessary. Ok. Will remove it. bq. Also do we really need a flag to say whether we want it to ignore missing paths? Wondering if we should just ignore cases where the path doesn't exist. Yes, we can simply ignore missing paths. Did not change so as not to break previous behavior. Doesnt seem like anyone is depending on this behavior though. bq. What if we have the localizer register the temporary working directory (i.e.: the _tmp paths) as deleteOnExit paths? Currently _tmp paths are deleted in finally in FSDownload#call. Wouldnt that be enough to handle case of normal JVM exit ? bq. With this I don't think we need to change the localizer protocol – DIE means try to cleanup, but NM will always cleanup anyway so no need to wait around and try too hard. Its actually more important that the localizer gets out of the way in a timely manner than it is for it to cleanup since the NM will be the backup in case the localizer fails. For this the only concern I see is what I mentioned about the issue I found in FSDownload, that is the download task running even after cancel because code is uninterruptible at places in FSDownload#call. In this case, we can never know when the cancelled task will complete and create files in the directory. There can be a race which can lead to tmp directory being renamed for instance. We can in this case in deletion task, first put the _tmp dir and then the real one so that first tmp is deleted. Also we can add an extra number of seconds to localizer HB timeout for scheduling file deletion, so that localizer is killed (Assuming we adopt approach mentioned in first point, above) before we attempt deletion. We however would need to change the localizer protocol as well. Currently localizer deletes the entry from its pending resources map as soon as it sends a status. And NM will send a DIE in two cases - 1) If container has been killed 2)NM processes HB and finds one of the status to be FETCH_FAILED. In this case we cannot know if the resources for which fetch was success were actually processed by NM or not. So I maintain a separate list for resources reported to localizer. Hence a flag, so that even those resources can be deleted. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791115#comment-14791115 ] Jason Lowe commented on YARN-2902: -- bq. For NM side, currently it does not kill localizers. We can track PID and kill it as discussed earlier if HB doesnt come for a configured period. Yeah, I think long-term to make this a lot more stable and reliable we're going to need the ability for the NM to kill localizers explicitly rather than via request. As you mentioned, the concern is that the localizer will not actually interrupt and stop the localization. Having the NM forcibly kill the localizer means we can put less trust in the localizer to always get that right. However that's probably a lot of work and churn to the code which makes it less palatable for a 2.7 inclusion. Ideally we should target a minimal change for 2.7 that gets us past the main problems we're having today, and we can add more bulletproofing in followup JIRAs for subsequent releases. As far as properly handling DIE so we actually stop downloading and problems canceling active transfers, can't we just have the localizer forcibly tear down the JVM? If we're being told to DIE then I assume we really don't care about pending transfers completing and just want to get out. If the NM is going to clean up after the localizer anyway, seems like we can drastically simplify DIE handling and just exit the JVM. That seems like a change that's targeted enough to be appropriate for 2.7 instead of adding localizer kill support, etc. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746398#comment-14746398 ] Jason Lowe commented on YARN-2902: -- Wow, the patch has gotten a lot larger. Will take me a while to fully digest all of the changes. bq. Would we want this to go into 2.7.2 ? Ideally yes. We're seeing significant leakage of local resources on many of our nodes due to this problem. Back to the patch itself, at a quick glance I'm lukewarm on a couple of points. I'm not thrilled with adding yet another config that users need to tune properly or it doesn't work correctly. Container localizers should already have the concept of heartbeating and killing themselves if they don't hear from the NM within X seconds, and likewise the NM should kill localizers that don't heartbeat in a timely fashion. It seems to me that's the time interval we should be using to determine how long to wait before giving up and having the NM do the cleanup. I'm also not sure we need deletion task cancellation. As you point out it's not really necessary. The files being deleted should not be reused later, so there should be no harm in attempting to redundantly delete them. If we decide that we really should have cancellation of deletion tasks then that should be implemented as a separate JIRA to help keep this patch more manageable since that's a readily separable feature that can stand on its own. Also do we really need a flag to say whether we want it to ignore missing paths? Wondering if we should just ignore cases where the path doesn't exist. I can see it either way I guess. I wonder if we can solve this with a simpler approach that should work well in most cases. What if we have the localizer register the temporary working directory (i.e.: the _tmp paths) as deleteOnExit paths? Then the localizer should try to clean these up if there's anything at all resembling a "normal" JVM exit, and I believe the JVM ignores cases when the file is already missing. Then we only have to worry about the case where the localizer dies a horrible death and the JVM doesn't get a chance to cleanup. To cover that case, when the NM kills the localizer we also schedule a deletion of the dest and dest_tmp paths. With this I don't think we need to change the localizer protocol -- DIE means try to cleanup, but NM will always cleanup anyway so no need to wait around and try too hard. Its actually more important that the localizer gets out of the way in a timely manner than it is for it to cleanup since the NM will be the backup in case the localizer fails. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746045#comment-14746045 ] Varun Saxena commented on YARN-2902: Would we want this to go into 2.7.2 ? > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742554#comment-14742554 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 47s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 9 new or modified test files. | | {color:green}+1{color} | javac | 10m 10s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 12m 3s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 30s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 25s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:red}-1{color} | whitespace | 7m 9s | The patch has 10 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 47s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 37s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 57s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 13s | Tests passed in hadoop-yarn-common. | | {color:red}-1{color} | yarn tests | 7m 56s | Tests failed in hadoop-yarn-server-nodemanager. | | | | 70m 56s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.yarn.server.nodemanager.TestLinuxContainerExecutorWithMocks | | | hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12755609/YARN-2902.05.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 7269906 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/whitespace.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9103/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9103/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9103/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742662#comment-14742662 ] Varun Saxena commented on YARN-2902: [~jlowe], kindly review. The patch at a very high level does the following : # On container kill, NM(localization service) will create deletion task for all the downloading resources and schedule it to run after a configured delay(new config added for it). Made the decision to not wait for HB from localizer first because we would not want to depend on localizer if there is some problem there and it does not send HB. # On subsequent HB from localizer, NM will indicate to localizer that it can delete the downloading resources by itself after cancelling download tasks. Added a boolean flag in proto for this. # After localizer deletes the resources, it will be send last HB to NM. A boolean flag has been added in proto to indicate this to NM. On receiving this HB, NM will cancel the deletion task so that deletion is not attempted by NM as well. Although its not a problem even if we attempt deletion because if nothing can be deleted, deletion task wont do anything. But if deletion task can be cancelled, then why not. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742658#comment-14742658 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 51s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 10 new or modified test files. | | {color:green}+1{color} | javac | 7m 44s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 33s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 50s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:red}-1{color} | checkstyle | 2m 38s | The applied patch generated 6 new checkstyle issues (total was 344, now 298). | | {color:green}+1{color} | whitespace | 7m 19s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 3s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 55s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 64m 23s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12755626/YARN-2902.06.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 7269906 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9107/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9107/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9107/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742593#comment-14742593 ] Varun Saxena commented on YARN-2902: Sorry havent fixed checkstyle, thats related to more than 2000 lines in file. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742612#comment-14742612 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 32s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 10 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 9s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 48s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:red}-1{color} | checkstyle | 2m 36s | The applied patch generated 10 new checkstyle issues (total was 344, now 306). | | {color:green}+1{color} | whitespace | 6m 49s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 25s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 0m 23s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 2m 0s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 7m 47s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 64m 3s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12755616/YARN-2902.06.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 7269906 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/9105/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9105/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9105/console | This message was automatically generated. > Killing a container that is localizing can orphan resources in the > DOWNLOADING state > > > Key: YARN-2902 > URL: https://issues.apache.org/jira/browse/YARN-2902 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.5.0 >Reporter: Jason Lowe >Assignee: Varun Saxena > Attachments: YARN-2902.002.patch, YARN-2902.03.patch, > YARN-2902.04.patch, YARN-2902.05.patch, YARN-2902.06.patch, YARN-2902.patch > > > If a container is in the process of localizing when it is stopped/killed then > resources are left in the DOWNLOADING state. If no other container comes > along and requests these resources they linger around with no reference > counts but aren't cleaned up during normal cache cleanup scans since it will > never delete resources in the DOWNLOADING state even if their reference count > is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742706#comment-14742706 ] Varun Saxena commented on YARN-2902: Implementation details are as under : (FSDownload.java) * While going through the code, found one issue in current code. FSDownload#call is a mix of interruptible and uninterruptible code. This means that when {{Future#cancel(true)}} is called, the code might not be interrupted and we may continue downloading, even though CancellationException is thrown on Future#get. But in ContainerLocalizer, on CancellationException we send FETCH_FAILURE back to NM and do not do any cleanup(deletion of downloaded resources) expecting FSDownload#call to do the cleanup as it indeed does on other exceptions. FSDownload does not provide any method to do the cleanup either. So to resolve this issue, have added a wait/notify construct to let download task complete(either through exception or normally) and then do the cleanup. Thought of checking for thread being interrupted at the end of FSDownload#call but FileContext#rename in the code resets the interrupted flag so going with this solution. This will increase the time taken for localizer to send last heartbeat(HB) to NM though. *Should I raise another JIRA for this ?* (ContainerLocalizer.java) * A set of paths is maintained for the resources which were successfully downloaded and reported to NM in HB. This is done because if container has been killed, NM would pay no heed to this FETCH_SUCCESS status. So if in HB rsp, NM indicates that resources have to be deleted, we need a reference of what we sent in the HB as well. This extra set is required because we currently do not wait for rsp from NM before deleting entry from pending resources map (delete it at the time of sending HB). Should we change it ? * We use this set of paths to do the deletion of resources reported to NM at the time of localizer DIE. pendingResources map is used for cleanup of resources which have not been reported to NM (ResourceLocalizationService.java) * On container kill, iterate over all the scheduled resources and for the resources which are in downloading state, schedule a deletion task which runs after a delay specified by config yarn.nodemanager.localizer.downloading-rsrcs.deletion.wait-sec * An additional map of localizer id(key) and deletion task future + deletion task id for state store(value) is maintained for cancellation of deletion task later(upon HB from localizer). * On localizer HB, we would try to cancel the deletion task and remove the task from state store. Moreover, a custom deletion task is created which calls the FileDeletionTask and also removes the entry from map above after deletion task completes. This however can lead to a minor race for removal of task from state store(i.e. race due to task completion and cancellation) if the code flow is not interruptible. But I guess that wont be a problem as removal of task by ID from leveldb would simply not do anything if key does not exist. * In case of NM restart(which should be quite rare), the map above would be lost. But that should be fine. Only side effect of this is that running deletion tasks may not be cancelled. But even if we attempt to delete a non existent directory, it should not be a problem. (DeletionService.java and Container Executor(s)) * Made relevant changes in DeletionService to ensure scheduling of deletion tasks with a delay. * Made changes to ensure deletion task's future and task id used for state store are returned back to caller for cancellation later. Also added code for cancellation of running task. * Added a new flag "ignoreMissingDir" in deletion task. Currently executors will throw an error if any of directory missing amongst the list of baseDirs passed for deletion do not exist. But for this use case we would ignore such missing dir and continue with deletion of other directories. Because NM may have inconsistent view of what all needs to be deleted. The flag hence has been added to support this new behavior and continue with old behavior(if flag is false). * Changes have been made in {{container-executor}} as well for support of this flag. An additional command line parameter has been added (for delete as user command). A value of 0 is analogous to the flag being false and 1 means true. Few additional points to note # _tmp directories should be deleted by localizer itself(in FSDownload). Have added it because as you said localizers can sometimes turn rogue. In this case, NM will delete these dirs. Should I handle tmp dirs then ? # I am assuming it would be fine to change command line parameters for a specific container-executor command in 2.8. If we do not want to change command line parameters for {{container-executor}}, there can be workarounds. > Killing a container that is localizing can orphan
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654813#comment-14654813 ] tangjunjie commented on YARN-2902: -- I encounter this issue。 Enviroment: hadoop 2.3.0-cdh5.0.2 {noformat} 2015-08-05 12:11:18,275 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs:///user/xxx/.staging/job_1433219182593_445774/libjars/zookeeper-3.4.5-cdh5.0.2.jar, 1436271746726, FILE, null },pending,[],1624036313203834,DOWNLOADING} with non-zero refcount /var/log/hadoop-yarn/hadoop-cmf-yarn-NODEMANAG {noformat} Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600870#comment-14600870 ] Varun Saxena commented on YARN-2902: Yeah lets go with this approach. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599307#comment-14599307 ] Varun Saxena commented on YARN-2902: Thanks for the review [~jlowe]. bq. I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today. We do not track localizers from executors. But issue is how do we track them ? Get PID of the localizer process and check if localizer has died ? But here the issue can be what if in between checks, localizer dies and PID is taken by some other process. We primarily want localizer to die so that it doesn't download anything after we do the deletion. One option would be to add a status in heartbeat asking localizer to cleanup(stop its downloading threads) and once that is done, indicate NM to do the deletion in another heartbeat. On this HB, NM can do the deletion and Localizer on HB response can DIE. Thoughts ? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14599532#comment-14599532 ] Jason Lowe commented on YARN-2902: -- bq. But here the issue can be what if in between checks, localizer dies and PID is taken by some other process. The NM is tracking containers. If it can track containers then it can track localizers. There may be issues with PID recycling, but they're not specific to tracking localizers. bq. One option would be to add a status in heartbeat asking localizer to cleanup(stop its downloading threads) and once that is done, indicate NM to do the deletion in another heartbeat. That would be a bit more consistent since the process creating the files is the process that cleans them when it aborts. It also helps keep the disk clean if there ever were a rogue localizer that the NM doesn't know about. Expecting the NM to clean up after a localizer that the NM has no idea what it's doing is going to be difficult. However we still have to handle the fallback case where the localizer simply crashes and something needs to clean up after it. Speaking of the localizer crashing, today the localizer can go rogue and stop heartbeating, and I don't think anything will detect this. In that situation we also fail to cleanup since we'll wait for a heartbeat that will never arrive. So how about the following approach: - ContainerLocalizer cleans up temporary files when processing the DIE command - We give the localizer a significant amount of time (e.g.: a minute or two?) to clean up on its own after receiving a DIE command - After the localizer exits (or if it does not exit after the grace period) then the NM cleans up the resources itself. We aren't tracking the PID explicitly, but LocalizerRunner can take actions when the localizer exits (e.g.: either process the requested deletions directly or send an event to a subsystem that will). With this approach rogue localizers will make some attempt to cleanup after themselves (which they don't do today), and we won't wait any longer than necessary before the NM tries to cleanup after well-behaved localizers or localizers that crash. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597938#comment-14597938 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 47s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 37s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 37s | The applied patch generated 9 new checkstyle issues (total was 168, now 138). | | {color:green}+1{color} | whitespace | 0m 3s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 13s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 6s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 32s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741309/YARN-2902.04.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 41ae776 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8326/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8326/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8326/console | This message was automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597863#comment-14597863 ] Varun Saxena commented on YARN-2902: Fixed checkstyle issues. A lot of changes in {{ResourceLocalizationService#findNextResource}} are due to indentation issues reported by checkstyle. Hence, had to change code(indent) which I had not written. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598027#comment-14598027 ] Sangjin Lee commented on YARN-2902: --- I'm OK with this JIRA proceeding as is. We'll need to isolate the public resource case more, and it won't be too late to file a separate issue if we do that later. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14598384#comment-14598384 ] Jason Lowe commented on YARN-2902: -- Thanks for updating the patch, Varun! Is one second enough time for the localizer to tear down if the system is heavily loaded, disks are slow, etc.? I think it would be better for the executor to let us know when a localizer has completed rather than assuming 1 second will be enough time (or too much time). We can tackle this in a followup JIRA since it's a more significant change, as I'm not sure executors are tracking localizers today. There are a number of sleeps in the unit test which we should try to avoid if possible. Is there a reason dispatcher.await() isn't sufficient to avoid the races? At a minimum there should be a comment for each one explaining what we're trying to avoid by sleeping. Nit: I've always interpreted the debug delay to be a delay to execute in debugging just before the NM deletes a file. To be consistent it seems that we should be adding the debug delay to any requested delay. That way the NM will always preserve a file for debugDelay seconds _beyond_ what an NM with debugDelay=0 seconds would do. Nit: The TODO in DeletionService about parent being owned by NM, etc. probably only needs to be in the delete method that actually does the work rather than duplicated in veneer methods. Nit: Should Container killed while downloading be Container killed while localizing? We use localizing elsewhere (e.g.: NM log UI when trying to get logs of a container that is still localizing). Nit: Inorrect path for PRIVATE localization. should be Incorrect path for PRIVATE localization: to fix typo and add trailing space for subsequent filename. Missing a trailing space on the next log message as well. Realize this was just a pre-existing bug, but it would be nice to fix as part of moving the code. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.04.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597268#comment-14597268 ] Varun Saxena commented on YARN-2902: Sorry meant below. 2. On Heartbeat from container localizer, if localizer runner is already stopped, we can indicate the {color:red}container localizer{color} to do the cleanup for downloading resources. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14597265#comment-14597265 ] Varun Saxena commented on YARN-2902: Looking at the public localization code, I do not think public resources can be orphaned because we do not stop localization for them midway on container cleanup. Its difficult to ascertain though from logs as to why localization was failing in the scenario mentioned above for public resources. Whatever little I could look into the code, I could not find anything concrete which can explain the failures. Anyways, the scope of this JIRA, i.e. orphaning of resources would not happen for PUBLIC resources IMHO. And I guess there is no point further delaying this JIRA hoping to find out what went wrong with public resources in scenario above. bq. What's not clear to me is whether the trigger was the public localization timing out or the stopContainer request Reference can become 0 if container is killed while downloading. Coming to the patch, there are two approaches to handle this. # Cleanup for downloading resources can be done by Localization Service while doing container cleanup. # On Heartbeat from container localizer, if localizer runner is already stopped, we can indicate the localizer runner to do the cleanup for downloading resources. The patch attached adopts approach 1. Herein, we wait for container localizer to die before running deletion tasks. Also, downloading resources can either be in local directory or in local directory suffixed by {{_tmp}}. So we try for both. Moreover, localization failed event is sent to all the containers which are referring to the resource which is in downloading state. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596337#comment-14596337 ] Hadoop QA commented on YARN-2902: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 40s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 34s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 33s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 0m 36s | The applied patch generated 25 new checkstyle issues (total was 168, now 187). | | {color:green}+1{color} | whitespace | 0m 4s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 14s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 6m 24s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 43m 37s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12741076/YARN-2902.03.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 445b132 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8309/artifact/patchprocess/diffcheckstylehadoop-yarn-server-nodemanager.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8309/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8309/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8309/console | This message was automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.03.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14590747#comment-14590747 ] Eric Payne commented on YARN-2902: -- Hi [~varun_saxena]. Thank you very much for working on and fixing this issue. We are looking forward to your next patch. Do you have an ETA for when that might be? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591285#comment-14591285 ] Varun Saxena commented on YARN-2902: Yeah this is pending for a long time. I have to primarily update tests. Got taken over by other tasks. Don't have any 2.7.1 related JIRAs' with me right now so will update a patch by this weekend. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526452#comment-14526452 ] Varun Saxena commented on YARN-2902: [~leftnoteasy], sorry for the delay. Was on long leave and have come back today. We are pretty clear on how to handle it for private resources(as per the comment you highlighted) but hadn't updated a patch as need to simulate and investigate further for public resources. I will check it and update ASAP. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524047#comment-14524047 ] Wangda Tan commented on YARN-2902: -- [~varun_saxena], could you update patch as you mentioned: https://issues.apache.org/jira/browse/YARN-2902?focusedCommentId=14336875page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14336875? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388937#comment-14388937 ] Sangjin Lee commented on YARN-2902: --- Sorry it took me a while to get to this. Here is an excerpt of the log when this happened: {noformat} 2015-03-05 00:20:17,529 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1418357586203_2035414 transitioned from INITING to RUNNING 2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1418357586203_2035414_01_000486 transitioned from NEW to LOCALIZING 2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1418357586203_2035414 2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1418357586203_2035414 2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle 2015-03-05 00:20:17,532 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1418357586203_2035414 2015-03-05 00:20:17,532 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar transitioned from INIT to DOWNLOADING 2015-03-05 00:20:17,537 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null } 2015-03-05 00:28:44,388 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=alice IP=10.53.186.122OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1418357586203_2035414 CONTAINERID=container_1418357586203_2035414_01_000486 2015-03-05 00:30:51,731 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Attempt to remove resource: { { hdfs://hadoop-cluster/sharedcache/3/7/9/37904df39b3fa3ad1e23451e3c2ca718caf148b58b214b3daa2b14ea5a17277b/foo.jar, 1425430654133, FILE, null },pending,[],1895876473278238,DOWNLOADING} with non-zero refcount {noformat} Around this time there are many public resources in the downloading state that generated this error. I do think these public resources were truly in the middle of being downloaded (yes this can happen). What's not clear to me is whether the trigger was the public localization timing out or the stopContainer request (see in the log). Sorry there is not much more information I can glean from the log. Let me know if you have more questions. FYI, we're running basically 2.4.0+. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336724#comment-14336724 ] Jason Lowe commented on YARN-2902: -- We still need to do this for APPLICATION resources. It is true that those resources will be cleaned up when the application finishes, but that could be hours or days later. And as for PUBLIC resources, Sangjin confirmed earlier he's seen the orphaning occur with those resources as well, so it must be occurring somehow even for those. [~sjlee0] do you have any ideas on how PUBLIC resources ended up hung in a DOWNLOADING state? I'm wondering if this is specific to the shared cache setup or if there's a code path we're missing. I don't think we should special case the resource types to fix this. Again I think the cleanest approach is to make sure we send an event to the LocalizedResource when a container localizer (or maybe just the container itself) is killed, and let that state machine handle it appropriately (e.g.: try to remove the _tmp file if the resource was in the downloading state, ignore it if it's already localized, etc.). Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336763#comment-14336763 ] Hadoop QA commented on YARN-2902: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch against trunk revision 1a68fc4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6736//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6736//console This message is automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14336691#comment-14336691 ] Varun Saxena commented on YARN-2902: [~jlowe], looked into it. Was able to simulate the issue as well for PRIVATE resources. I think we need to handle only for PRIVATE resources. APPLICATION resources will be cleaned up when application finishes. And PUBLIC resources should not remain orphaned as we do not kill or stop PublicLocalizer in between. To download the resource, FSDownload appends a _tmp at the end of the directory to which resource will be downloaded to. And while processing HB from Container Localizer, NM sends a destination path for the resource to be downloaded in response. We also download one resource at a time. So, we can store this destination path in a queue in LocalizerRunner whenever we are sending a new path for download and remove it when fetch is successful. When container is killed (which causes LocalizerRunner to be cleaned up) we can fetch the path from the front of the queue and submit the associated temp path for deletion to DeletionService, if ref count for the resource is 0. We cannot do this cleanup in ContainerLocalizer as LCE launches it as a new process and kills it when LocalizerRunner is interrupted. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14337925#comment-14337925 ] Sangjin Lee commented on YARN-2902: --- We definitely see this (usually coupled with localization failures) with PUBLIC resources. I don't have easy access to the scenarios at the moment, but will be able to provide it next week. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316414#comment-14316414 ] Varun Saxena commented on YARN-2902: Yes...Sorry but have been keeping busy since last couple of weeks. Will update by this weekend. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314898#comment-14314898 ] Rushabh S Shah commented on YARN-2902: -- [~varun_saxena]: are you still working on this jira ? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293553#comment-14293553 ] Varun Saxena commented on YARN-2902: [~jlowe], I looked into the code. When a container is killed during localizing state, Container cleanup is called which deletes the container related directory which is used for localization. The associated LocalizerRunner Thread is also stopped. Probably at this stage resource should move out of DOWNLOADING state as well. Refer to {{ResourceLocalizationService#handleCleanupContainerResources}} Was it a PUBLIC or PRIVATE resource which was left in the DOWNLOADING state ? Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293609#comment-14293609 ] Jason Lowe commented on YARN-2902: -- bq. Container cleanup is called which deletes the container related directory which is used for localization. Cleaning up a container-related directory has no effect on cleaning up a distributed cache resource. Resources are not localized to a container-specific directory. They are either localized to a public directory (for PUBLIC resources), a user-specific directory (for PRIVATE resources), or an application-specific directory (for APPLICATION resources). Deleting the container directory will only delete symlinks to the resources -- it will never delete the actual resources on disk. Also note that stopping the LocalizerRunner thread doesn't cleanup the disk state of resources in progress, so there's some extra work needed there as well. bq. Was it a PUBLIC or PRIVATE resource which was left in the DOWNLOADING state ? I've definitely seen PRIVATE resources in this state, although it's possible there could have been some PUBLIC ones as well. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14293778#comment-14293778 ] Sangjin Lee commented on YARN-2902: --- I've seen this with PUBLIC resources too, the use case provided by the shared cache (YARN-1492). Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14272981#comment-14272981 ] Varun Saxena commented on YARN-2902: Thanks for the review [~jlowe]. Will look into how to delete file from file system. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268403#comment-14268403 ] Jason Lowe commented on YARN-2902: -- Thanks for the patch, Varun! I think the patch will prevent us from leaking the bookkeeping in resource trackers for resources in the downloading state, but it relies on the periodic retention checking and doesn't address the leaking of data on the disk. The localizer has probably created a partial *_tmp file/dir for the download that didn't complete, and we should be cleaning that up as well. As is we won't try to clean up any leaked DOWNLOADING resource until the retention process runs (on the order of tens of minutes), but we shouldn't need to wait around to reap resources that aren't really downloading. I haven't had time to work this all the way through, but I'm wondering if we're patching the symptoms rather than the root cause. The resource is lingering around in the DOWNLOADING state because a container was killed and we then forgot the corresponding localizer that was associated with the container. When the localizer later hearbeats in the NM tells the unknown localizer to DIE and that ultimately is what leads to a resource lingering around in the DOWNLOADING state. I think we should be properly cleaning up localizers corresponding to killed containers and sending appropriate events to the LocalizedResources. This will then cause the resources to transition out of the DOWNLOADING state to something appropriate, sending the proper events to any other containers that are pending on that resource. At that point we can also clean up any leaked _tmp files/dirs from the failed/killed localizer. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266923#comment-14266923 ] Varun Saxena commented on YARN-2902: Kindly review this. Pending since a long time. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266968#comment-14266968 ] Varun Saxena commented on YARN-2902: Thanks [~jlowe] Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266947#comment-14266947 ] Jason Lowe commented on YARN-2902: -- Sorry for the delay, Varun, as I was busy with end-of-year items and vacation. I'll try to get to this by the end of the week. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14246919#comment-14246919 ] Varun Saxena commented on YARN-2902: [~jlowe], kindly review the patch. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237910#comment-14237910 ] Hadoop QA commented on YARN-2902: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch against trunk revision 8963515. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6031//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6031//console This message is automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237149#comment-14237149 ] Hadoop QA commented on YARN-2902: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685607/YARN-2902.002.patch against trunk revision 1b3bb9e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6026//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6026//console This message is automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.002.patch, YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14236917#comment-14236917 ] Varun Saxena commented on YARN-2902: [~jlowe], kindly review. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14236934#comment-14236934 ] Hadoop QA commented on YARN-2902: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12685562/YARN-2902.patch against trunk revision e227fb8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6024//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6024//console This message is automatically generated. Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2902) Killing a container that is localizing can orphan resources in the DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14236943#comment-14236943 ] Varun Saxena commented on YARN-2902: Looking further at the code, I think we can call *LocalResourcesTrackerImpl#remove* irrespective of whether cache target size decided by *yarn.nodemanager.localizer.cache.target-size-mb* config has been reached or not. This is because in below piece of code which is called from ResourceLocalizationService#handleCacheCleanup, we check the cumulative size of all the resources(which have reference count of 0) against the cache target size. And in case of resource having its state as *DOWNLOADING*, call to LocalizedResource#getSize will always return -1. Because it seems size is only updated once the state changes to *LOCALIZED*. {code:title=ResourceRetentionSet.java|borderStyle=solid} public void addResources(LocalResourcesTracker newTracker) { for (LocalizedResource resource : newTracker) { currentSize += resource.getSize(); if (resource.getRefCount() 0) { // always retain resources in use continue; } retain.put(resource, newTracker); } for (IteratorMap.EntryLocalizedResource,LocalResourcesTracker i = retain.entrySet().iterator(); currentSize - delSize targetSize i.hasNext();) { Map.EntryLocalizedResource,LocalResourcesTracker rsrc = i.next(); LocalizedResource resource = rsrc.getKey(); LocalResourcesTracker tracker = rsrc.getValue(); if (tracker.remove(resource, delService)) { delSize += resource.getSize(); i.remove(); } } } {code} Killing a container that is localizing can orphan resources in the DOWNLOADING state Key: YARN-2902 URL: https://issues.apache.org/jira/browse/YARN-2902 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: Jason Lowe Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2902.patch If a container is in the process of localizing when it is stopped/killed then resources are left in the DOWNLOADING state. If no other container comes along and requests these resources they linger around with no reference counts but aren't cleaned up during normal cache cleanup scans since it will never delete resources in the DOWNLOADING state even if their reference count is zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)