[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904112#comment-16904112 ]

Hudson commented on YARN-9527:
------------------------------

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17078 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17078/])
YARN-9527. Prevent rogue Localizer Runner from downloading same file (eyang: rev 6ff0453edeeb0ed7bc9a7d3fb6dfa7048104238b)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -------------------------------------------------------------------------
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.8.5, 3.1.2
> Reporter: Jim Brennan
> Assignee: Jim Brennan
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch,
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED"
> exception on each iteration. Sometimes this continues long enough that it
> fills up a disk or depletes available inodes for the filesystem.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904109#comment-16904109 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

Thanks [~eyang] and [~ebadger]!
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904104#comment-16904104 ]

Eric Badger commented on YARN-9527:
-----------------------------------

Thanks, [~eyang]!
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903460#comment-16903460 ]

Eric Yang commented on YARN-9527:
---------------------------------

[~Jim_Brennan] Thank you for the patch.

[~ebadger] Patch 004 looks good to me.
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901665#comment-16901665 ]

Eric Badger commented on YARN-9527:
-----------------------------------

[~billie.rina...@gmail.com], [~djp], [~eyang], [~bibinchundatt], you've all committed changes to the ResourceLocalizationService recently. Could one of you give an additional review on this change?
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901473#comment-16901473 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

We have been running with this patch on one of our large research clusters for about a month. I scanned for this issue again today and there were no instances of it. That is not definitive, but it is a good sign. We also have not had any new problems reported as a result of this change. I will continue to monitor our clusters for this.

[~ebadger], did you want to see if we can get some other reviewers for this patch?
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836638#comment-16836638 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

I was able to reproduce the problem in branch-2.8 on a one-node cluster by changing ApplicationImpl.AppInitDoneTransition() to send a ContainerKillEvent immediately after the first ContainerInitEvent is sent. It's a one-time shot for the NM.

I restarted the nodemanager with this change and then ran a sleep job with a list of files to localize.
{noformat}
hadoop jar $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar sleep -files file1,file2,file3,file4,file5,file6,file7,file8,file9,file10,file11,file12,file13,file14,file15,file16,file17 -m 10 -r 10 -mt 1 -rt 1
{noformat}
Without my fix, this causes a rogue ContainerLocalizer to get stuck in the LOCALIZED at LOCALIZED loop every time. I have verified that my fix prevents this.

I have also verified that the fix without the LRUCache portion (just the findNextResource change) does not fix the problem (at least for this test case).
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836478#comment-16836478 ]

Hadoop QA commented on YARN-9527:
---------------------------------

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 18s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 57s | trunk passed |
| +1 | compile | 1m 3s | trunk passed |
| +1 | checkstyle | 0m 28s | trunk passed |
| +1 | mvnsite | 0m 36s | trunk passed |
| +1 | shadedclient | 11m 12s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 56s | trunk passed |
| +1 | javadoc | 0m 23s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 36s | the patch passed |
| +1 | compile | 0m 58s | the patch passed |
| +1 | javac | 0m 58s | the patch passed |
| +1 | checkstyle | 0m 20s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 190 unchanged - 25 fixed = 190 total (was 215) |
| +1 | mvnsite | 0m 33s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 44s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 6s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 20m 51s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 67m 59s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12968307/YARN-9527.004.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ba03302ebd88 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 90add05 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24073/testReport/ |
| Max. process+thread count | 412 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24073/console |
| Powered by | Apache Yetus
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836456#comment-16836456 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

Thanks for the review, [~ebadger]! I've put up another patch that adds the interrupt() call back in for the running containers case. I'm not sure it's needed, but I think it's safer to keep that code path unchanged.

{quote}
Moving the getPathForLocalization() logic into findNextResource() makes a lot of sense so we don't have to go through the bad resources one heartbeat at a time and so we'll actually remove them from the pending list.
{quote}
Agreed. It is possible that this change alone will narrow the window enough to prevent the problem by itself. Instead of taking n seconds to process (and remove) n resources from the rogue container's pending list, it will do so in one heartbeat, with far less opportunity for another container to start with the same resources.

{quote}
I'm not super wild about adding an LRU cache of 128 recent entries since it only makes the race less likely to occur instead of fixing it outright. However, this code is very complex and I can understand why you would want to make a minimally invasive change. I would like to hear other people's thoughts on this.
{quote}
The more bulletproof fix would be to change the LocalizerTracker.handle() function to look up the container state and only accept the request if the container is in the correct state. Currently the LocalizerTracker doesn't access the container directly, so it would either have to look up the container from the container id (which I'm not certain is set for all requests) or I would have to change the LocalizerContext to include the container directly. I was concerned that this might be a performance hit (due to the synchronized containers list), since we would have to do this for every request from every container.

I admit that the LRU approach is not 100% bulletproof, but combined with the findNextResource change, I think it is sufficient to cover the very short window in which this problem can occur, and it limits the change to a small part of the code. I am open to suggestions on how big it needs to be.

{quote}
It would also be good to prove that this fix actually works, and more importantly doesn't break anything else. So I think we should definitely wait for that before we put this in (if others agree with the approach)
{quote}
I think the unit test does show that the problem as I understand it is fixed (it fails with the old code and succeeds with the new), but I am also attempting to reproduce the failure manually, and will look into getting this fix deployed locally so we can test it on a larger cluster.

Thanks again for your feedback, [~ebadger]. It would be good to get some other eyes on this as well, given the complexity of the localization code.
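The bounded "recent entries" idea discussed in this comment can be sketched as follows. This is only an illustration under stated assumptions: the class name RecentlyClosedLocalizers, its methods, and the use of a String localizer id as the key are hypothetical and are not taken from the YARN-9527 patch; only the capacity of 128 comes from the review discussion.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not the class added by the YARN-9527 patch. */
class RecentlyClosedLocalizers {
  private final Map<String, Boolean> recent;

  RecentlyClosedLocalizers(final int capacity) {
    // accessOrder=true keeps entries in least-recently-used order;
    // removeEldestEntry evicts the oldest entry once capacity is exceeded.
    this.recent = Collections.synchronizedMap(
        new LinkedHashMap<String, Boolean>(capacity, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
            return size() > capacity;
          }
        });
  }

  /** Remember that the localizer with this id has been shut down. */
  void markClosed(String localizerId) {
    recent.put(localizerId, Boolean.TRUE);
  }

  /** True if a heartbeat from this id should be refused (assumed policy). */
  boolean shouldReject(String localizerId) {
    // get() refreshes the entry, so an id that keeps heartbeating stays cached.
    return recent.get(localizerId) != null;
  }

  public static void main(String[] args) {
    RecentlyClosedLocalizers closed = new RecentlyClosedLocalizers(128);
    closed.markClosed("container_01");
    System.out.println(closed.shouldReject("container_01"));
    System.out.println(closed.shouldReject("container_02"));
  }
}
```

As the review points out, a fixed-size cache only narrows the race: once more than `capacity` other localizers have been closed, an old id is forgotten and a late heartbeat from it would be accepted again.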
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835712#comment-16835712 ]

Eric Badger commented on YARN-9527:
-----------------------------------

Thanks for the analysis and patches, [~Jim_Brennan]! I believe I understand the problem and the patch you put up to fix it.

Moving the {{getPathForLocalization()}} logic into {{findNextResource()}} makes a lot of sense so we don't have to go through the bad resources one heartbeat at a time and so we'll actually remove them from the pending list.

I'm not super wild about adding an LRU cache of 128 recent entries since it only makes the race less likely to occur instead of fixing it outright. However, this code is very complex and I can understand why you would want to make a minimally invasive change. I would like to hear other people's thoughts on this.

It would also be good to prove that this fix actually works, and more importantly doesn't break anything else. So I think we should definitely wait for that before we put this in (if others agree with the approach)
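The effect described for the {{findNextResource()}} change, dropping every unusable pending entry within a single heartbeat rather than one per heartbeat, can be sketched generically. This is an illustration only: PendingScanSketch, findNext, and the canLocalize predicate are hypothetical names, and the real method in ResourceLocalizationService works on Hadoop's own resource and event types, which are not reproduced here. The sketch also assumes the returned entry is removed from the pending list when it is handed out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical illustration; names and types are not taken from the Hadoop source. */
class PendingScanSketch {

  /**
   * Walks the pending list once. Entries that fail canLocalize are removed
   * and skipped in the same call; the first entry that passes is removed
   * and returned. Returns empty if nothing usable remains.
   */
  static <T> Optional<T> findNext(List<T> pending, Predicate<T> canLocalize) {
    Iterator<T> it = pending.iterator();
    while (it.hasNext()) {
      T candidate = it.next();
      it.remove(); // assumption: both stale and handed-out entries leave the list
      if (canLocalize.test(candidate)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    List<String> pending =
        new ArrayList<>(Arrays.asList("stale-1", "stale-2", "needed", "stale-3"));
    Optional<String> next = findNext(pending, name -> name.startsWith("needed"));
    System.out.println(next + " remaining=" + pending);
  }
}
```

With this shape, n stale entries cost one call instead of n heartbeats, which is the narrowing of the window that the comment refers to.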
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835670#comment-16835670 ]

Hadoop QA commented on YARN-9527:
---------------------------------

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 16m 51s | trunk passed |
| +1 | compile | 1m 4s | trunk passed |
| +1 | checkstyle | 0m 31s | trunk passed |
| +1 | mvnsite | 0m 42s | trunk passed |
| +1 | shadedclient | 12m 8s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 0s | trunk passed |
| +1 | javadoc | 0m 29s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 36s | the patch passed |
| +1 | compile | 0m 56s | the patch passed |
| +1 | javac | 0m 56s | the patch passed |
| +1 | checkstyle | 0m 24s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 190 unchanged - 25 fixed = 190 total (was 215) |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 56s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 0s | the patch passed |
| +1 | javadoc | 0m 22s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 20m 34s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 21s | The patch does not generate ASF License warnings. |
| | | 69m 41s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12968195/YARN-9527.003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 99a10ddf072d 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 9b0aace |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24069/testReport/ |
| Max. process+thread count | 445 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24069/console |
| Powered by | Apache Yetus
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835623#comment-16835623 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

I put up patch 003 to address the checkstyle issues.
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835176#comment-16835176 ]

Hadoop QA commented on YARN-9527:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 18m 20s | trunk passed |
| +1 | compile | 1m 4s | trunk passed |
| +1 | checkstyle | 0m 25s | trunk passed |
| +1 | mvnsite | 0m 40s | trunk passed |
| +1 | shadedclient | 11m 10s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 59s | trunk passed |
| +1 | javadoc | 0m 28s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 32s | the patch passed |
| +1 | compile | 1m 0s | the patch passed |
| +1 | javac | 1m 0s | the patch passed |
| -0 | checkstyle | 0m 22s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 28 new + 213 unchanged - 3 fixed = 241 total (was 216) |
| +1 | mvnsite | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 28s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 2s | the patch passed |
| +1 | javadoc | 0m 25s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 21m 26s | hadoop-yarn-server-nodemanager in the patch passed. |
| -1 | asflicense | 0m 29s | The patch generated 1 ASF License warnings. |
| | | 70m 48s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12968105/YARN-9527.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 0349eb47bfbb 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / eb9c890 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24067/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24067/testReport/ |
| asflicense | https://builds.apache.org/job/PreCommit-YARN-Build/24067/artifact/out/patch-asflicense-problems.txt |
| Max. process+thread count | 447 (vs. ulimit of 1) |
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834879#comment-16834879 ]

Hadoop QA commented on YARN-9527:
---------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 24s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 25s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 28 new + 212 unchanged - 3 fixed = 240 total (was 215) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 54s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 9s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 69m 36s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12968066/YARN-9527.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 04dfa81aeb9d 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 49e1292 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24065/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24065/testReport/ |
| Max. process+thread count | 446 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831903#comment-16831903 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

I was able to find a node where the problem was actively happening, so I grabbed a heap dump of the nodemanager process and saved off the NM logs. From this, I was able to figure out what was happening. This sequence of events matches several other logs that we have examined. Note that this analysis was done on our internal version of branch-2.8, but based on code inspection, I believe the problem still exists in trunk.

*Sequence of events, with relevant logs:*

Container transitions from NEW to LOCALIZING
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e29_1550394211378_12160590_01_08 transitioned from NEW to LOCALIZING
{noformat}
* ContainerImpl.RequestResourcesTransition
Sends a ContainerLocalizationRequestEvent to ResourceLocalizationService (INIT_CONTAINER_RESOURCES)
* ResourceLocalizationService.handleInitContainerResources()
Sends a ResourceRequestEvent for each LocalResourceRequest to LocalResourcesTrackerImpl (REQUEST)
In this case, there are 11 resources.

*Container transitions from LOCALIZING to KILLING (before we process any of these resources in LocalizerTracker)*
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e29_1550394211378_12160590_01_08 transitioned from LOCALIZING to KILLING
{noformat}
* ContainerImpl.KillDuringLocalizationTransition
container.cleanup() collects the list of privateRsrcs for this container and sends a ContainerLocalizationCleanup event
* ResourceLocalizationService.handleCleanupContainerResources()
** For each resource, sends a ResourceReleaseEvent to LocalResourcesTrackerImpl (RELEASE)
** LocalizerTracker.cleanupPrivLocalizers() (called directly)
*** Gets the LocalizerRunner for this container from privLocalizers
*Because we have not yet handled any LocalizerResourceRequestEvents for this container, we don’t find a LocalizerRunner, so we just return*
** Deletes the container directories. Sends CONTAINER_RESOURCES_CLEANEDUP event to ContainerImpl

LocalResourcesTrackerImpl thread processes event queue
* LocalResourcesTrackerImpl.handle
Creates new LocalizedResources and adds them to localrsrc map (state is INIT)
* LocalizedResource.FetchResourceTransition
** Adds this container to refs
** Sends LocalizerResourceRequestEvent to LocalizerTracker
** State changes to DOWNLOADING
{noformat}
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_ws-1.2.27.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_grid.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_reporting_cdw_common.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/yjava_http_client-0.3.23.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/jcontrib_degrading_stats_util-0.1.17.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_batch_service_client-1.2.16.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/json-smart-1.0.6.3.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/async-http-client-0.3.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_cdw_cow_loader.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO localizer.LocalizedResource: Resource hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/nct.jar transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event
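
The ordering described in the comment above (cleanup runs before the queued request events are handled) can be illustrated with a toy model. This is only a sketch under stated assumptions, not NodeManager code: the class, the map, and the queue below are hypothetical stand-ins, and it assumes, as the comment implies, that a runner is first created when a queued request event is handled.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of the event ordering; none of these types are Hadoop classes. */
class LocalizerRaceSketch {
  // Hypothetical stand-in for the privLocalizers map (container id -> runner).
  private final Map<String, String> privLocalizers = new HashMap<>();
  // Hypothetical stand-in for request events still waiting in a dispatcher queue.
  private final Queue<String> pendingRequests = new ArrayDeque<>();

  /** Container enters LOCALIZING: one request event is queued per resource. */
  void requestResources(String containerId, int resourceCount) {
    for (int i = 0; i < resourceCount; i++) {
      pendingRequests.add(containerId);
    }
  }

  /** Container is killed: cleanup can only remove a runner that already exists. */
  boolean cleanupPrivLocalizers(String containerId) {
    return privLocalizers.remove(containerId) != null;
  }

  /** Queued events are handled later; the first one per container creates a runner. */
  void drainPendingRequests() {
    while (!pendingRequests.isEmpty()) {
      String containerId = pendingRequests.remove();
      privLocalizers.putIfAbsent(containerId, "runner-for-" + containerId);
    }
  }

  boolean hasRunner(String containerId) {
    return privLocalizers.containsKey(containerId);
  }

  public static void main(String[] args) {
    LocalizerRaceSketch nm = new LocalizerRaceSketch();
    String container = "container_A"; // hypothetical id
    nm.requestResources(container, 11);                     // NEW -> LOCALIZING
    boolean cleaned = nm.cleanupPrivLocalizers(container);  // LOCALIZING -> KILLING
    nm.drainPendingRequests();                              // queue handled afterwards
    System.out.println("cleanup found a runner: " + cleaned);
    System.out.println("runner exists after kill: " + nm.hasRunner(container));
  }
}
```

In this model the cleanup step finds nothing to remove, yet a runner exists once the queue is drained, so nothing is left that would clean it up for the already-killed container.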
[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831883#comment-16831883 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

For example, we recently had a case where all of the disks used by yarn were full:
{noformat}
Filesystem      1K-blocks       Used Available Use% Mounted on
/dev/sdb4      5776759588 5714378904   4561576 100% /grid/1
/dev/sdd2      5840971776 5775661160   6849008 100% /grid/3
/dev/sdc2      5840971776 5777982304   4527864 100% /grid/2
/dev/sda4      5776759588 5712614448   6326032 100% /grid/0
{noformat}
Upon investigation, we found the NM log full of the “Invalid event: LOCALIZED at LOCALIZED” exceptions for a file called creative.data, and we found 2229 copies of that file in the usercache for the user:
{noformat}
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/19/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100014/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100024/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100189/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100199/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100214/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100229/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100244/creative.data
…
{noformat}
We had a record of a similar problem reported back in September of 2017. I scanned our clusters to see how often this was happening. On some clusters, there were a significant number of nodes where this “LOCALIZED at LOCALIZED” exception had occurred. For example, on one cluster there were 122 nodes where I found that log message, some nodes with a large number:
{noformat}
12566 node585n18:
15053 node585n30:
15819 node262n14:
36182 node582n24:
42623 node585n28:
7 node586n24:
47380 node588n03:
234528 node582n01:
494196 node221n32:
688038 node221n01:
1210223 node1442n30:
1306207 node194n06:
1331739 node1442n21:
1366933 node588n37:
1718461 node583n22:
2050377 node588n33:
2252679 node287n05:
{noformat}

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -------------------------------------------------------------------------
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.8.5, 3.1.2
> Reporter: Jim Brennan
> Priority: Major
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED"
> exception on each iteration. Sometimes this continues long enough that it
> fills up a disk or depletes available inodes for the filesystem.
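
The comment does not say how the per-node scan was done. As one possible approach, the sketch below counts the lines of a single NodeManager log file that contain the exception text quoted above. The class name, method name, and the log path passed on the command line are all hypothetical; only the marker string comes from the report.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Counts log lines containing the exception text quoted in this issue. */
class InvalidEventCounter {
  // Marker text taken from the issue description.
  static final String MARKER = "Invalid event: LOCALIZED at LOCALIZED";

  /** Returns how many of the given lines contain the marker text. */
  static long countMatches(Stream<String> lines) {
    return lines.filter(line -> line.contains(MARKER)).count();
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: InvalidEventCounter <path-to-nm-log>");
      return;
    }
    // Files.lines reads lazily, so a large log is not loaded into memory at once.
    try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
      System.out.println(countMatches(lines) + " " + args[0]);
    }
  }
}
```

Running this once per node and sorting the output numerically would give a listing in the same count-then-name shape as the one above. Note that `Files.lines` decodes as UTF-8 by default, so a log in another encoding would need the overload that takes a charset.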