[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-09 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904112#comment-16904112
 ] 

Hudson commented on YARN-9527:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #17078 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17078/])
YARN-9527.  Prevent rogue Localizer Runner from downloading same file (eyang: 
rev 6ff0453edeeb0ed7bc9a7d3fb6dfa7048104238b)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java


> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904109#comment-16904109
 ] 

Jim Brennan commented on YARN-9527:
---

Thanks [~eyang] and [~ebadger]!

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-09 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904104#comment-16904104
 ] 

Eric Badger commented on YARN-9527:
---

Thanks, [~eyang]!

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-08 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903460#comment-16903460
 ] 

Eric Yang commented on YARN-9527:
-

[~Jim_Brennan] Thank you for the patch.  [~ebadger] Patch 004 looks good to me.

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-06 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901665#comment-16901665
 ] 

Eric Badger commented on YARN-9527:
---

[~billie.rina...@gmail.com], [~djp], [~eyang], [~bibinchundatt], you've all 
committed changes to the ResourceLocalizationService recently. Could one of you 
give an additional review on this change? 

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-08-06 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901473#comment-16901473
 ] 

Jim Brennan commented on YARN-9527:
---

We have been running with this patch on one of our large research clusters for 
about a month.  I scanned for this issue again today and there were no 
instances of it.  That is not definitive, but it is a good sign.  We also have 
not had any new problems reported as a result of this change.

I will continue to monitor our clusters for this.

[~ebadger], did you want to see if we can get some other reviewers for this 
patch?

 

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836638#comment-16836638
 ] 

Jim Brennan commented on YARN-9527:
---

I was able to repro the problem in branch-2.8 on a one-node-cluster by changing 
ApplicationImpl.AppInitDoneTransition() to immediately send a 
ContainerKillEvent event after first ContainerInitEvent is sent. So it's a 
one-time shot for the NM.

I restart the nodemanager with this change, and then run a sleep job with a 
list of files to localize.
{noformat}
hadoop jar 
$HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar
 sleep -files 
file1,file2,file3,file4,file5,file6,file7,file8,file9,file10,file11,file12,file13,file14,file15,file16,file17
 -m 10 -r 10 -mt 1 -rt 1
{noformat}
Without my fix, this causes a rogue ContainerLocalizer to get stuck in the 
LOCALIZED at LOCALIZED loop every time. I have verified that my fix prevents 
this.  I have also verified that the fix without the LRUCache portion (just the 
findNextResource change) does not fix the problem (at least for this test case).

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-09 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836478#comment-16836478
 ] 

Hadoop QA commented on YARN-9527:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
18s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 12s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 190 unchanged - 25 fixed = 190 total (was 215) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 44s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 
51s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 67m 59s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12968307/YARN-9527.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ba03302ebd88 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 90add05 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24073/testReport/ |
| Max. process+thread count | 412 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/24073/console |
| Powered by | Apache Yetus 

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-09 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836456#comment-16836456
 ] 

Jim Brennan commented on YARN-9527:
---

Thanks for the review [~ebadger]!  I've put up another patch that adds the 
interrupt() call back in for the running containers case.  I'm not sure it's 
needed, but I think it's safer to keep that code path unchanged.

{quote}
Moving the getPathForLocalization() logic into findNextResource() makes a lot 
of sense so we don't have to go through the bad resources one heartbeat at a 
time and so we'll actually remove them from the pending list.
{quote}
Agreed.  It is possible that this change alone will minimize the window enough 
to prevent the problem by itself.   Instead of taking n seconds to process (and 
remove) n resources from the rogue container pending list, it will do it in one 
heartbeat, with far less opportunity for another container to start with the 
same resources.

{quote}
I'm not super wild about adding an LRU cache of 128 recent entries since it 
only makes the race less likely to occur instead of fixing it outright. 
However, this code is very complex and I can understand why you would want to 
make a minimally invasive change. I would like to hear other peoples' thoughts 
on this.
{quote}
The more bullet proof fix would be to change the LocalizerTracker.handle() 
function to look up the container state and only accept the request if the 
container was in the correct state.   Currently the LocalizerTracker doesn't 
access the container directly, so it would either have to lookup the container 
from the container id (which I'm not certain is set for all requests) or I 
would have to change the LocalizerContext to include the container directly.
I was concerned that this might be a performance hit (due to the synchronized 
containers list), since we would have to do this for every request from every 
container.

I admit that the LRU approach is not 100% bullet proof, but combined with the 
findNextResources change, I think it is sufficient to cover the very short 
window in which this problem can occur, and it limits the change to a small 
part of the code.  I am open to suggestions on how big it needs to be.

{quote}
It would also be good to prove that this fix actually works, and more 
importantly doesn't break anything else. So I think we should definitely wait 
for that until we put this in (if others agree with the approach)
{quote}
I think the unit test does show that the problem as I understand it is fixed 
(it fails with the old code and succeeds with the new), but I am also 
attempting to repro the failure manually, and will look into getting this fix 
deployed locally so we can test it on a larger cluster.

Thanks again for your feedback [~ebadger], it would be good to get some other 
eyes on this as well, given the complexity of the localization code.


> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-08 Thread Eric Badger (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835712#comment-16835712
 ] 

Eric Badger commented on YARN-9527:
---

Thanks for the analysis and patches, [~Jim_Brennan]! I believe I understand the 
problem and the patch you put up to fix it.

Moving the {{getPathForLocalization()}} logic into {{findNextResource()}} makes 
a lot of sense so we don't have to go through the bad resources one heartbeat 
at a time and so we'll actually remove them from the pending list. 

I'm not super wild about adding an LRU cache of 128 recent entries since it 
only makes the race less likely to occur instead of fixing it outright. 
However, this code is very complex and I can understand why you would want to 
make a minimally invasive change. I would like to hear other peoples' thoughts 
on this. 

It would also be good to prove that this fix actually works, and more 
importantly doesn't break anything else. So I think we should definitely wait 
for that until we put this in (if others agree with the approach)

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-08 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835670#comment-16835670
 ] 

Hadoop QA commented on YARN-9527:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
51s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  8s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 190 unchanged - 25 fixed = 190 total (was 215) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 56s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 
34s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
21s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 69m 41s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12968195/YARN-9527.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 99a10ddf072d 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 9b0aace |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24069/testReport/ |
| Max. process+thread count | 445 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/24069/console |
| Powered by | Apache Yetus 

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-08 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835623#comment-16835623
 ] 

Jim Brennan commented on YARN-9527:
---

I put up patch 003 to address the checkstyle issues.

 

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835176#comment-16835176
 ] 

Hadoop QA commented on YARN-9527:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
20s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 22s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 28 new + 213 unchanged - 3 fixed = 241 total (was 216) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 28s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 
26s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
29s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 70m 48s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12968105/YARN-9527.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 0349eb47bfbb 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / eb9c890 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24067/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24067/testReport/ |
| asflicense | 
https://builds.apache.org/job/PreCommit-YARN-Build/24067/artifact/out/patch-asflicense-problems.txt
 |
| Max. process+thread count | 447 (vs. ulimit of 1) |
| 

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834879#comment-16834879
 ] 

Hadoop QA commented on YARN-9527:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
14s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 24s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 25s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 28 new + 212 unchanged - 3 fixed = 240 total (was 215) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 54s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m  
9s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 69m 36s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e |
| JIRA Issue | YARN-9527 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12968066/YARN-9527.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 04dfa81aeb9d 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 49e1292 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/24065/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/24065/testReport/ |
| Max. process+thread count | 446 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831903#comment-16831903
 ] 

Jim Brennan commented on YARN-9527:
---

I was able to find a node where the problem was actively happening, so I 
grabbed a heap dump of the nodemanager process and saved off the NM logs. From 
this, I was able to figure out what was happening. This sequence of events 
matches several other logs that we have examined.  Note that this analysis was 
done on our internal version of branch-2.8, but based on code inspection, I 
believe the problem still exists in trunk.

*Sequence of events, with relevant logs:*

Container transitions from NEW to LOCALIZING
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e29_1550394211378_12160590_01_08 transitioned from NEW to 
LOCALIZING
{noformat}
 * ContainerImpl.RequestResourcesTransition
 Sends a ContainerLocalizationRequestEvent to ResourceLocalizationService 
(INIT_CONTAINER_RESOURCES)
 * ResourceLocalizationService.handleInitContainerResources()
 Sends ResourceRequestEvent for each LocalResourceRequest to 
LocalResourcesTrackerImpl (REQUEST)
 in this case, there are 11 resources

*Container transitions from LOCALIZING to KILLING (before we process any of 
these resources in LocalizerTracker)*
{noformat}
2019-04-26 05:24:43,356 [AsyncDispatcher event handler] INFO 
container.ContainerImpl: Container 
container_e29_1550394211378_12160590_01_08 transitioned from LOCALIZING to 
KILLING
{noformat}
 * ContainerImpl.KillDuringLocalizationTransition
 container.cleanup()
 collects list of privateRsrcs for this container and send 
ContainerLocalizationCleanup event
 * ResourceLocalizationService.handleCleanupContainerResources()
 ** For each resource, send a ResourceReleaseEvent to LocalResourcesTrackerImpl 
(RELEASE)
 ** LocalizerTracker.cleanupPrivLocalizers() (called directly)

 *** Gets the LocalizerRunner for this container from privLocalizers
 *Because we have not yet handled any LocalizerResourceRequestEvents for this 
container, we don’t find a LocalizerRunner, so we just return*
 ** Deletes the container directories.
 Sends CONTAINER_RESOURCES_CLEANEDUP event to ContainerImpl

LocalResourcesTrackerImpl thread processes event queue
 * LocalResourcesTrackerImpl.handle
 Creates new LocalizedResources and adds them to localrsrc map (state is INIT)
 * LocalizedResource.FetchResourceTransition
 ** Adds this container to refs
 ** Sends LocalizerResourceRequestEvent to LocalizerTracker
 ** State changes to DOWNLOADING

{noformat}
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_ws-1.2.27.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_common_grid.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_reporting_cdw_common.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/yjava_http_client-0.3.23.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/jcontrib_degrading_stats_util-0.1.17.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_batch_service_client-1.2.16.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/json-smart-1.0.6.3.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/async-http-client-0.3.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/na_cdw_cow_loader.jar
 transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event handler] INFO 
localizer.LocalizedResource: Resource 
hdfs://nn1:8020/projects/proj1/workflows/nct/tesla_dim_1h/lib/nct.jar 
transitioned from INIT to DOWNLOADING
2019-04-26 05:24:43,357 [AsyncDispatcher event 

[jira] [Commented] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831883#comment-16831883
 ] 

Jim Brennan commented on YARN-9527:
---

For example, we recently had a case where all of the disks used by yarn were 
full:
{noformat}
Filesystem  1K-blocks   Used Available Use% Mounted on
/dev/sdb4  5776759588 5714378904   4561576 100% /grid/1
/dev/sdd2  5840971776 5775661160   6849008 100% /grid/3
/dev/sdc2  5840971776 5777982304   4527864 100% /grid/2
/dev/sda4  5776759588 5712614448   6326032 100% /grid/0
{noformat}
Upon investigation, we found the NM log full of the “Invalid event: LOCALIZED 
at LOCALIZED” exceptions for a file called creative.data, and we found 2229 
copies of that file in the usercache for the user:
{noformat}
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/19/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100014/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:07 ./1/100024/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100189/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100199/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100214/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100229/creative.data
-r-x-- 1 user1 users 441478442 Nov 26 15:08 ./1/100244/creative.data
…
{noformat}
We had a record of a similar problem reported back in September of 2017.
 I scanned our clusters to see how often this was happening. On some clusters, 
there were a significant number of nodes where this “LOCALIZED at LOCALIZED” 
exception had occurred. For example, on one cluster there were 122 nodes where 
I found that log message, some nodes with a large number:
{noformat}
  12566 node585n18:
  15053 node585n30:
  15819 node262n14:
  36182 node582n24:
  42623 node585n28:
  7 node586n24:
  47380 node588n03:
 234528 node582n01:
 494196 node221n32:
 688038 node221n01:
1210223 node1442n30:
1306207 node194n06:
1331739 node1442n21:
1366933 node588n37:
1718461 node583n22:
2050377 node588n33:
2252679 node287n05:
{noformat}

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -
>
> Key: YARN-9527
> URL: https://issues.apache.org/jira/browse/YARN-9527
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.8.5, 3.1.2
>Reporter: Jim Brennan
>Priority: Major
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org