[ https://issues.apache.org/jira/browse/YARN-9527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836638#comment-16836638 ]

Jim Brennan commented on YARN-9527:
-----------------------------------

I was able to reproduce the problem in branch-2.8 on a one-node cluster by 
changing ApplicationImpl.AppInitDoneTransition() to send a ContainerKillEvent 
immediately after the first ContainerInitEvent is sent. Because this happens 
only for the first ContainerInitEvent, it's a one-time shot for the NM.

I restarted the NodeManager with this change and then ran a sleep job with a 
list of files to localize:
{noformat}
hadoop jar \
  $HADOOP_PREFIX/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  sleep -files \
  file1,file2,file3,file4,file5,file6,file7,file8,file9,file10,file11,file12,file13,file14,file15,file16,file17 \
  -m 10 -r 10 -mt 10000 -rt 10000
{noformat}
Without my fix, this causes a rogue ContainerLocalizer to get stuck in the 
"Invalid event: LOCALIZED at LOCALIZED" loop every time. I have verified that 
my fix prevents this. I have also verified that the fix without the LRUCache 
portion (just the findNextResource change) does not fix the problem (at least 
for this test case).
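
For anyone repeating the test, the files passed to -files only need to exist 
locally. The following is a minimal sketch of one way to create them, assuming 
small placeholder files are sufficient; the temporary directory, the file 
contents, and the `files` variable are illustrative assumptions, not part of 
the original test setup.

```shell
# Work in a scratch directory so the placeholder files are easy to clean up.
# (Assumption: any small local files will do for this repro.)
workdir=$(mktemp -d)
cd "$workdir"

# Create file1..file17 and build the comma-separated list expected by -files.
files=""
for i in $(seq 1 17); do
  echo "placeholder $i" > "file$i"
  files="${files:+$files,}file$i"
done

# Print the list so it can be pasted after -files (or pass "$files" directly).
echo "$files"
```

Running the sleep job from that directory with -files "$files" should be 
equivalent to spelling out the list by hand as in the command above.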

> Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
> -------------------------------------------------------------------------
>
>                 Key: YARN-9527
>                 URL: https://issues.apache.org/jira/browse/YARN-9527
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.8.5, 3.1.2
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>         Attachments: YARN-9527.001.patch, YARN-9527.002.patch, 
> YARN-9527.003.patch, YARN-9527.004.patch
>
>
> A rogue ContainerLocalizer can get stuck in a loop continuously downloading 
> the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" 
> exception on each iteration.  Sometimes this continues long enough that it 
> fills up a disk or depletes available inodes for the filesystem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
